logo

Alxa_Compiler(SysY to PCODE)

Published on

Lexical analysis

Before Code

Requriement: read testfile.txt, parse every char to word and print them. At the same time, memorize type, content and line number of each word.

File reading

Read by line, scan every char of every string and analyse.

while ((s = bf.readLine()) != null) {
	...
}

Analyse

When i get the key word, enter the next analyst.

while ((c = getChar()) != null) {
  if (c == ' ' || c == '\r' || c == '\t') {
    continue;
  } else if (c == '+' || c == '-' || c == '*' || c == '%') {
    words.add(new Word(c));
  } else if (c == '/') {
    analyseSlash();
  } else if (c == '(' || c == ')' || c == '[' || c == ']' || c == '{' || c == '}') {
    words.add(new Word(c));
  } else if (c == '>' || c == '<' || c == '=' || c == '!') {
    analyseRelation(c);
  } else if (c == ',' || c == ';') {
    words.add(new Word(c));
  } else if (c == '"') {
    analyseCitation();
  } else if (c == '&' || c == '|') {
    analyseLogic(c);
  } else if (Character.isDigit(c)) {
    analyseDigit(c);
  } else if (Character.isLetter(c) || c == '_') {
    analyseLetter(c);
  }
}
Common

For example, when I get '+', I directly new a Word typify the "PLUS".

Function

For example

When I get < , enter functionanalyseRelation read one char more. If it is=, analyze LEQ...

if (c == '<') {
  c = getChar();
  if (c == '=') {
    words.add(new Word("<="));
  } else {
    unGetChar();
    words.add(new Word("<"));
  }

analyseLogic is as the same.

Digit and Letter

Digit: When I get a digit, it means I will scan a serial of some digits and turn them into a Word typify "INTCON".

Letter: When I get a letter, it means I will scan a string about letter or digit. It maybe a "IDENFR" or "STRCON", which depends on whether it is in key map or not.

Word

class Word:

public class Word {
    private String identification;
    private String content;
    private String type;
}

Capsulate the initial function, I only need to new Word(...) in the main processor, which will create the corresponding word.

For example

    public Word(char identification) {
        this.identification = String.valueOf(identification);
        this.type = new KeyWordMap().getType(this.identification);
        this.content = this.identification;
    }

As for KeyWordMap, it is a HashMap, mapping the string of word and its type.

    public KeyWordMap() {
        keyWords = new HashMap<>();
        keyWords.put("main", "MAINTK");
        keyWords.put("const", "CONSTTK");
        keyWords.put("int", "INTTK");
        ...
        }

After Code

File reading

Read file by line is not convenient for preread and undo, so I read the file into a single String at first.

The method is read by line, add \n after every line and scan every char. When I get \n, lineNum++

    private String transferFileToCode() {
        BufferedReader bf = new BufferedReader(reader);
        StringBuffer buffer = new StringBuffer();
        String s = null;
        while ((s = bf.readLine()) != null) {
            buffer.append(s).append("\n");
        }
        return buffer.toString();
    }

Analyse

About analyst, it is different from what before coding.

First, I need analyze word one by one, so I add global variety index to memorize where is the pointer.

Besides, I met the situation that I need read one more or undo, so I encapsulate the function ungetChar and getChar, which will be convenient for me to analyze.

    private Character getChar() {
        if (index < code.length()) {
            char c = code.charAt(index);
            if (c == '\n') {
                lineNum++;
            }
            index++;
            return c;
        } else {
            return null;
        }
    }

    private void unGetChar() {
        index--;
        char c = code.charAt(index);
        if (c == '\n') {
            lineNum--;
        }
    }
Slash
  1. // : When it comes to \n , stop.
do {
  c = getChar();
  if (c == null || c == '\n') {
    return;
    // 判断为//注释,结束分析
  }
} while (true);
  1. /* */: Get char until */ appears
do {
  c = getChar();
  if (c == null) {
    return;
  }
  if (c == '*') {
    c = getChar();
    if (c == '/') {
      return;
      // 判断为/* */注释,直接结束分析
    } else {
      unGetChar();
    }
  }
} while (true);

Grammatical analysis

Requirement: Based on the words identified by the lexical analysis program, identify various grammatical elements according to the grammatical rules. Recursive descent method is used to analyze the grammatical components defined in the grammar.

Before Code

Data Reading

Like the lexical analyst, I prepared function getWord getNextWord and so on. At the same time, there is a global variety (Word) curWord to display which word it is when I read ArrayList<Word> words from lexical analyst one by one.

My analyst tragedy is as follows:

To normal rule: I keep getting words and analyze them

To expression rule: I scan the whole expression first, which is implemented by function getExp. Then I divide the expression and use method recursive descent to analyze them.

getExp like

    private ArrayList<Word> getExp() {
        ArrayList<Word> exp = new ArrayList<>();
        while (true) {
            if (word is symbol of end) {
                break;
            }
            ...
            getWordWithoutAddToGrammar();
            exp.add(curWord);
            word = getNextWord();
        }
        return exp;
    }

recursive descent

According to Grammatical Rules, code function for every term of rule.

Main idea: read a word, check what it symbolize and enter the next analyzing function.

For example:

to

CompUnit → {Decl} {FuncDef} MainFuncDef // 1.是否存在Decl 2.是否存在 FuncDef

I analyze like this:

private void analyseCompUnit() {
  Word word = getNextWord();
  while (word.typeEquals("CONSTTK") || (
    word.typeEquals("INTTK") && getNext2Word().typeEquals("IDENFR") && !getNext3Word().typeEquals("LPARENT"))) {
    analyseDecl();
    word = getNextWord();
  }
  while (word.typeEquals("VOIDTK") || (
    (word.typeEquals("INTTK") && !getNext2Word().typeEquals("MAINTK")))) {
    analyseFuncDef();
    word = getNextWord();
  }
  if (word.typeEquals("INTTK") && getNext2Word().typeEquals("MAINTK")) {
    analyseMainFuncDef();
  } else {
    error();
  }
  grammar.add("<CompUnit>");
}

grammar is used for memorize output of lexical analyst and grammar analyst list.

left recursion

加减表达式 AddExpMulExp | AddExp ('+' | '−') MulExp // 1.MulExp 2.+ 需覆盖 3.- 需覆盖

Check if the exp has '+' or '-'. If it has, separate the exp to AddExp and MulExp. Then analyze them separately.

After Code

left recursion

The method used before is not perfect for recursive descent. So I changed my rewrite way.

to

加减表达式 AddExp → MulExp | AddExp ('+' | '−') MulExp // 1.MulExp 2.+ 需覆盖 3.- 需覆盖

Rewrite it like

AddExp → MulExp ('+' | '−') MulExp  ('+' | '−') MulExp ...

Code like

private void analyseMulExp(ArrayList<Word> exp) {
  Exps exps = divideExp(exp, new ArrayList<>(Arrays.asList("MULT", "DIV", "MOD")));
  int j = 0;
  for (ArrayList<Word> exp1 : exps.getWords()) {
    analyseUnaryExp(exp1);
    grammar.add("<MulExp>");
    if (j < exps.getSymbols().size()) {
      grammar.add(exps.getSymbols().get(j++).toString());
    }
  }
}

Function divideExp is used for divide the whole exp passed by getExp or the pre function.

divideExp:

In: orignal: exp stop symbol: symbol

Out: List of divided exp and symbol.

private Exps divideExp(ArrayList<Word> exp, ArrayList<String> symbol) {
  ArrayList<ArrayList<Word>> exps = new ArrayList<>();
  ArrayList<Word> exp1 = new ArrayList<>();
  ArrayList<Word> symbols = new ArrayList<>();
  boolean unaryFlag = false;
  int flag1 = 0;
  int flag2 = 0;
  for (int i = 0; i < exp.size(); i++) {
    Word word = exp.get(i);
    if (word.typeEquals("LPARENT")) {
      flag1++;
    }
    if (word.typeEquals("RPARENT")) {
      flag1--;
    }
    if (word.typeEquals("LBRACK")) {
      flag2++;
    }
    if (word.typeEquals("RBRACK")) {
      flag2--;
    }
    if (symbol.contains(word.getType()) && flag1 == 0 && flag2 == 0) {
      //UnaryOp
      if (word.typeOfUnary()) {
        if (!unaryFlag) {
          exp1.add(word);
          continue;
        }
      }
      exps.add(exp1);
      symbols.add(word);
      exp1 = new ArrayList<>();
    } else {
      exp1.add(word);
    }
    unaryFlag = word.typeEquals("IDENFR") || word.typeEquals("RPARENT") || word.typeEquals("INTCON") || word.typeEquals("RBRACK");
  }
  exps.add(exp1);
  return new Exps(exps, symbols);
}

Exps

public class Exps {
    private ArrayList<ArrayList<Word>> words;
    private ArrayList<Word> symbols;
}

other bugs

Most bugs are produced by function getExp and divideExp because of some situations are ignored. So I always get something like index out of range...

So I changed some symbol of stop getting expression and modify the rules to divide or not the expression and so on.

Error handling

Official errors defination

Before Code

Create the symbol table

Symbol class

public class Symbol {
    private String type;
    private int intType;
    private String content;
    private int area = 0;
}

Type means the type of the symbol.

IntType is an integer. If it's 0, the symbol is int. if it's 1, the symbol is int[], if it's 2, the symbol is int[] []...

Content is its content.

Area is where is it.

I create a HashMap of Symbols, memorizing symbols created in each area.

When I enter a new area, area++. When I leave an area, area--, with the corresponding Symbols are destroyed.

    private HashMap<Integer, Symbols> symbols = new HashMap<>();
    private HashMap<String, Function> functions = new HashMap<>();
    private ArrayList<Error> errors = new ArrayList<>();
    private int area = -1;
    private boolean needReturn = false;
    private int whileFlag = 0;

needReturn means if the current function need to return.

whileFlag means if the current code block is in while circle.

Errors

a

Just check format

public boolean isFormatIllegal() {
  for (int i = 1; i < content.length() - 1; i++) {
    char c = content.charAt(i);
    if (!isLegal(c)) {
      if (c == '%' && content.charAt(i + 1) == 'd') {
        continue;
      }
      return true;
    } else {
      if (c == '\\' && content.charAt(i + 1) != 'n') {
        return true;
      }
    }
  }
  return false;
}

private boolean isLegal(char c) {
    return c == 32 || c == 33 || (c >= 40 && c <= 126); //offical defination
}
b c

B: Every time I get an identity, check if there is the same symbol has been defined in this area.

    private boolean hasSymbolInThisArea(Word word) {
        return symbols.get(area).hasSymbol(word);
    }

C: Check all area. If the symbol has been defined. Functions are as the same.

    private boolean hasSymbol(Word word) {
        for (Symbols s : symbols.values()) {
            if (s.hasSymbol(word)) {
                return true;
            }
        }
        return false;
    }
d e

To check if the function parameters are matched, I memorize parameters of every function and when I met a function call, I will scan the function call parameters and match them. I prepare a function to do this. Finally I found I need to use recursive descent again, so I add the check procedure to the recursive descent of the grammatical analyst. Please check the After Code/Error d and e

f g

There is a global variety needReturn used to display if the current function need return. if it does but there is no return in the end of the code block, or if it doesn't but there is return, the error will be memorized.

h

Just check if it is a const.

if (isConst(word)) {
  error("h", word.getLineNum());
}
i j k

Capsulate function about checking missing of the symbol

For example:

    private void checkParent() {
        if (getNextWord().typeEquals("RPARENT")) {
            getWord();// )
        } else {
            error("j");
        }
    }
l

Count the number of the parameters of string and printf separately and check if they equal.

m

There is a global variety whileFlag symbolize if the code block is in while circle. If it isn't, any continue and break will produce error.

After Code

Area

I mark the area++ when I get a block or a function, but it will lead to the situation that when enter a code block of a function, the parameters of the function can't be memorize in the different are with the block of the function. So I changed the rules to mark area++.

    private boolean analyseBlock(boolean fromFunc) {
        ...
        if (!fromFunc) {
            addArea();
        }
        ...
    }

Only when the block is not from the function, the area++.

Error d and e

To check if the function parameters are matched, I set an array for every function.

public class Function {
    private String type;
    private String content;
    private String returnType;
    private ArrayList<Integer> paras;
}

When I get a function, I memorize its return type and paras.

As for the ArrayList<Integer> paras, it reflects as follows:

TypeExampleInteger
Void-1
Inta0
Int[]a[]1
Int[] []a[] [3]2

So when I get a function call, I will check the parameter of it with what I have memorized before.

private void checkParasMatchRParas(Word ident, ArrayList<Integer> paras, ArrayList<Integer> rparas) {
    if (paras.size() != rparas.size()) {
        error("d", ident.getLineNum());
    } else {
        for (int i = 0; i < paras.size(); i++) {
            if (!paras.get(i).equals(rparas.get(i))) {
                error("e", ident.getLineNum());
            }
        }
    }
}

As for getting the parameters real type, I add the analyst procedure to the recursive descent of the grammatical analyst. Just like:

    private int analyseExp(ArrayList<Word> exp) {
        int intType = analyseAddExp(exp);
        grammar.add("<Exp>");
        return intType;
    }

Every recursion will return an intType, which symbolize the final type of the expression.

Because the terms of one expression must be the same type, so I return only one of them.

This is the exit of the recursion. It will return a correct type of the expression to the top of the function.

    private int analyseLVal(ArrayList<Word> exp) {
        int intType = 0;
        ...
                if (word.typeEquals("LBRACK")) {
                    intType++;
                    ...
                }
         ...
        if (hasSymbol(ident)) {
            return getSymbol(ident).getIntType() - intType;
        } else {
            return 0;
        }
    }

Code generation

In this part, I chose to generate Pcode.

I designed a type of Pcode which is an Inverse Bolan expression stack and symbol table based virtual code.

At the same time, I designed virtual machine to execute them.

The Pcode virtual machine is an imaginary machine used to run Pcode commands. It consists of: A code area (code), an instruction pointer (EIP), a stack, a var_table, a func_table and a label_table.

In the following passage, I will introduce how Pcode executes first and how to produce Pcode next.

Before Code

How does the virtual machine run

First, we need a codes list and a stack(int).

An eip: presents the address of current running code.

A varTable: memorizes the address of the variety in stack.

A funcTable: memorizes the address of the function in codes list.

A labelTable: Memorizes the address of the label in codes list.

Then, run the code one after another and manage the stack.

How to distinguish different variety

Before generate codes, differentiate varieties from different scopes by its only scope number, like: areaID + "_" + curWord.getContent(). In this situation, the variety will not appear more than once in codes, except for recursive function call, which will be solved by push varTable to stack(show later).

Specific Code Definition

First, define a class for PCode:

public class PCode {
    private CodeType type;
    private Object value1 = null;
    private Object value2 = null;
}

It presents one code object, which has a CodeType and two operating values. CodeType is an enum. Value1 and value2 maybe Integer or String or null, which depends on specific code type.

Calculation Type

Two operators:

int b = pop();
int a = pop();
push(cal(a,b));

Single operator:

push(cal(pop()));
VAR

VAR command to declare a variable, save the variable name and the address assigned to it in the variable table.

case VAR: {
    Var var = new Var(stack.size());
    varTable.put((String) code.getValue1(), var);
}

Var.class:

public class Var {
    private int index;
    private int dimension = 0;
    private int dim1;
    private int dim2;
}
DIMVAR

DIMVAR command to declare an array. Set the dimension information of the var.

case DIMVAR: {
    Var var = getVar((String) code.getValue1());
    int n = (int) code.getValue2();
    var.setDimension(n);
    if (n == 1) {
        int i = pop();
        var.setDim1(i);
    }
    if (n == 2) {
        int j = pop(), i = pop();
        var.setDim1(i);
        var.setDim2(j);
    }
}
PLACEHOLDER

PLACEHOLDER command to grow the stack down, allocate the new space to the variety and array.

case PLACEHOLDER: {
    Var var = getVar((String) code.getValue1());
    int n = (int) code.getValue2();
    if (n == 0) {
        push(0);
    }
    if (n == 1) {
        for (int i = 0; i < var.getDim1(); i++) {
            push(0);
        }
    }
    if (n == 2) {
        for (int i = 0; i < var.getDim1() * var.getDim2(); i++) {
            push(0);
        }
    }
}
Other

Calculation type: pop the stack top once or twice, calculate them and push again.

Jump Type: When it's command about jump, just check if the condition is satisfied and change the eip.

Function call: as follows

Function call procedure

First, before function call, there will be some parameters to be pushed into the stack. Each will be followed by a RPARA command, which memorize the address of the previous variety.

case RPARA: {
    int n = (int) code.getValue1();
    if (n == 0) {
        rparas.add(stack.size() - 1);
    } else {
        rparas.add(stack.get(stack.size() - 1));
    }
}

Second, function CALL.

Memorize the eip, stack top address, and information about the function(In fact, they will be pushed into stack too). Then update the varTable and eip. Ready for execute function.

case CALL: {
    Func func = funcTable.get((String) code.getValue1());
    retInfos.add(new RetInfo(eip, varTable, stack.size() - 1, func.getArgs(), func.getArgs(), nowArgsNum));
    eip = func.getIndex();
    varTable = new HashMap<>();
    callArgsNum = func.getArgs();
    nowArgsNum = 0;
}

Finally, return when it's RET

Restore eip, varTable from RetInfo, clear the new information pushed when function in the stack.

case RET: {
    int n = (int) code.getValue1();
    RetInfo info = retInfos.remove(retInfos.size() - 1);
    eip = info.getEip();
    varTable = info.getVarTable();
    callArgsNum = info.getCallArgsNum();
    nowArgsNum = info.getNowArgsNum();
    if (n == 1) {
        stack.subList(info.getStackPtr() + 1 - info.getParaNum(), stack.size() - 1).clear();
    } else {
        stack.subList(info.getStackPtr() + 1 - info.getParaNum(), stack.size()).clear();
    }
}

Value or Address

Push value or address of the variety is an important thing, it depends on what I need, which will be presented when I describe how to generate codes.

The command action is as follows(getAddress is used for get the address of the previous variety ).

case VALUE: {
    Var var = getVar((String) code.getValue1());
    int n = (int) code.getValue2();
    int address = getAddress(var, n);
    push(stack.get(address));
}
...
case ADDRESS: {
    Var var = getVar((String) code.getValue1());
    int n = (int) code.getValue2();
    int address = getAddress(var, n);
    push(address);
}

Code Generate

Code generated from the grammatical analyst procedure.

Declaration

There is no need to distinguish const and var. When declare a variety, just new a variety and let it point to the stack top. Then if it has an initialization, just push the values one after another. If not, add a PLACEHOLDER command to push something(I push 0) to the stack to hold the place.

Assign sentence

In this situation, first calculate and push the address of the variety to the stack top. Then analyze expressions. After that, there are only two number in the stack, which are address and value. Assign the value to the address.

Condition control sentence

First, generate labels. Then, place jump sentences in the proper places.

labels about if and while will be generated and then stored in a stack type structure. like:

whileLabels.add(new HashMap<>());
whileLabels.get(whileLabels.size() - 1).put("while", labelGenerator.getLabel("while"));
whileLabels.get(whileLabels.size() - 1).put("while_end", labelGenerator.getLabel("while_end"));
whileLabels.get(whileLabels.size() - 1).put("while_block", labelGenerator.getLabel("while_block"));

Take if as example:

if (word.typeEquals("IFTK")) {
    codes.add(new PCode(CodeType.LABEL, ifLabels.get(ifLabels.size() - 1).get("if")));
    ...
    analyseCond("IFTK");
    ...
    codes.add(new PCode(CodeType.JZ, ifLabels.get(ifLabels.size() - 1).get("else")));
    codes.add(new PCode(CodeType.LABEL, ifLabels.get(ifLabels.size() - 1).get("if_block")));
    analyseStmt();
    codes.add(new PCode(CodeType.JMP, ifLabels.get(ifLabels.size() - 1).get("if_end")));
    codes.add(new PCode(CodeType.LABEL, ifLabels.get(ifLabels.size() - 1).get("else")));
    if (word.typeEquals("ELSETK")) {
        getWord(); //else
        analyseStmt();
    }
    codes.add(new PCode(CodeType.LABEL, ifLabels.get(ifLabels.size() - 1).get("if_end")));
}

while:

if (word.typeEquals("WHILETK")) {
    ...
    codes.add(new PCode(CodeType.LABEL, whileLabels.get(whileLabels.size() - 1).get("while")));
    ...
    analyseCond("WHILETK");
    ...
    codes.add(new PCode(CodeType.JZ, whileLabels.get(whileLabels.size() - 1).get("while_end")));
    codes.add(new PCode(CodeType.LABEL, whileLabels.get(whileLabels.size() - 1).get("while_block")));
    analyseStmt();
    ...
    codes.add(new PCode(CodeType.JMP, whileLabels.get(whileLabels.size() - 1).get("while")));
    codes.add(new PCode(CodeType.LABEL, whileLabels.get(whileLabels.size() - 1).get("while_end")));
    whileLabels.remove(whileLabels.size() - 1);
}

// break
if (word.typeEquals("BREAKTK")) {
    getWord();//break
    codes.add(new PCode(CodeType.JMP, whileLabels.get(whileLabels.size() - 1).get("while_end")));
 		...
}

// continue
if (word.typeEquals("CONTINUETK")) {
    getWord();//continue
    codes.add(new PCode(CodeType.JMP, whileLabels.get(whileLabels.size() - 1).get("while")));
    ...
}

After Code

Because of some runtime errors and information shortages, I added and removed some Pcode. At the same, there are some new troubles about address pass and short circuit calculation.

Specific Code Definition

In Operation, push() means put value into the top of the stack. pop() means pop the value from the top of the stack.

Common Type
CodeTypeValue1Value2Operation
LABELLabel_nameSet a label
VARIdent_nameDeclare a variety
PUSHIdent_name/Digitpush(value1)
POPAddressIdent_name*value1 = value2
JZLabel_nameJump if stack top is zero
JNZLabel_nameJump if stack top is not zero
JMPLabel_nameJump unconditionally
MAINMain function label
FUNCFunction label
ENDFUNCEnd of function label
PARAIdent_nameTypeParameters
RETReturn value or notFunction return
CALLFunction nameFunction call
RPARATypeGet parameters ready for function call
GETINTGet a integer and put it into stack top
PRINTStringPara numPop values and print.
DIMVARIdent_nameTypeSet dimension info for array variety
VALUEIdent_nameTypeGet the variety value
ADDRESSIdent_nameTypeGet the variety address
PLACEHOLDERPush something to hold places
EXITExit

| CodeType | Value1 | Value2 | Operation | | -------- | ------ | ------ | --------- | --- | | ADD | | | + | | SUB | | | - | | MUL | | | * | | DIV | | | / | | MOD | | | % | | CMPEQ | | | == | | CMPNE | | | != | | CMPGT | | | > | | CMPLT | | | < | | CMPGE | | | >= | | CMPLE | | | <= | | AND | | | && | | OR | | | \ | | | | NOT | | | ! | | NEG | | | - | | POS | | | + |

short circuit calculation

There are two situations I need to use short circuit calculation :

1. if(a&&b) // a is false
2. if(a||b) // b is true

This seems not an easy thing and I acutally spent lots of time to solve it.

My method is as follows:

First, when I analyze analyseLOrExp, every analyseLAndExp will be followed by a JNZ, which is used for detect if the cond is false. If it is, jump to the if body label. At the same time, I generated cond label, which is ready for the analyseLAndExp.

private void analyseLOrExp(ArrayList<Word> exp, String from) {
    ...
    for (...) {
        ...
        String label = labelGenerator.getLabel("cond_" + i);
        analyseLAndExp(exp1, from, label);
        codes.add(new PCode(CodeType.LABEL, label));
        if (...) {
            codes.add(new PCode(CodeType.OR));
        }
        if (...) {
            if (...) {
                codes.add(new PCode(CodeType.JNZ, ifLabels.get(ifLabels.size() - 1).get("if_block")));
            }
          ...
        }
        ...
    }
}

In the analyseLAndExp, every analyseEqExp will be followed by a JZ, which is used for detect if the cond is true. If it is, jump to the cond label I set just now.

private void analyseLAndExp(ArrayList<Word> exp, String from, String label) {
    ...
    for (...) {
        ...
        analyseEqExp(exp1);
        if (...) {
            codes.add(new PCode(CodeType.AND));
        }
        if (...) {
            if (...) {
                codes.add(new PCode(CodeType.JZ, label));
            }
          ...
        }
    }
}

By these means, short circuit calculation is solved.

Reference

https://www.bookstack.cn/read/pandolia-tinyc/about.md