Implement a lexical analyzer (a.k.a. scanner, lexer or tokenizer) for this mini-
ID: 3732922 • Letter: I
Question
Implement a lexical analyzer (a.k.a. scanner, lexer or tokenizer) for this mini-language. The scanner reads a file containing a sequence of characters and outputs the corresponding sequence of tokens (i.e., representations of terminal symbols), while omitting whitespace and comments. If it reads any character that is not allowed by the grammar, it should generate an appropriate error message (with the position of the character) and stop the computation.
For instance, the scanner should reject “$um”, as “$” is not included in the alphabet of identifiers. For simplicity, you can assume that all integers are within range, i.e., you do not have to check for overflow. Each token should include the following information:
1. The kind of the current lexeme. To keep things simple, use strings for representing the kind of lexemes. For identifiers and integers, the kinds are “ID” and “NUM”, respectively. For keywords, the keyword itself is the kind. For other symbols, the kind is a string corresponding to the symbol. Upon encountering the end of the input file, the scanner must generate a token whose kind is “end-oftext”. (For example, for identifier `speed`, the kind is “ID”. For the integer 3400, the kind is “NUM”. For the keyword `false`, the kind is “false” and for the symbol `:=` the kind is “:=“.)
2. The value of the lexeme, if applicable. Only identifiers and integers have values. For example the value of lexeme `19` is 19 and the value of lexeme `speed` is “speed”. However, the lexeme `(` does not have any value.
3. The position of the lexeme. The position is a pair consisting of the line number of the lexeme and the position of the first character of the lexeme in that line.
Your program should provide four procedures:
next() : reads the next lexeme in the input file.
kind() : returns the kind of the lexeme that was just read.
value() : returns the value of the lexeme (if it is an “ID” or a “NUM”).
position() : returns the position of the lexeme that was just read.
The main program should have a loop for reading the input and printing the tokens:
next();
while ( kind() != “end-of-text” ) {
next();
print( position(), kind(), value() ); }
Explanation / Answer
Implementation of parser lexeme by using java program:
class 1:
import java.io.*;
import java.util.*;
enum TokenTypes {NUMBER, RESERVEDWORD, ID, LT, GT, EQ, LE, GE, NE, OP, CP, ASSIGN, PLUS, MINUS, MULT, DIV, DELIM, COMMA, COLON, PERIOD, SCOLON, NUM, FINISH}
public class SymbolTable
{
Hashtable <String, Symbol> hash;
public Symbol get (String str)
{
return hash.get(str);
}
public boolean containsKey(String lexeme)
{
return hash.containsKey(lexeme);
}
public void put (String lexeme,Symbol sym)
{
hash.put(lexeme, sym);
}
public SymbolTable()
{
hash = new Hashtable <String,Symbol> ();
Symbol semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "program");
hash.put("program", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "var");
hash.put("var", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "begin");
hash.put("begin", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "end");
hash.put("end", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "if");
hash.put("if", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "then");
hash.put("then", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "else");
hash.put("else", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "while");
hash.put("while", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "do");
hash.put("do", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "integer");
hash.put("integer", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "read");
hash.put("read", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "write");
hash.put("write", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "or");
hash.put("or", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "and");
hash.put("and", semel);
semel = new Symbol(TokenTypes.RESERVEDWORD, 0, 0, "not");
hash.put("not", semel);
semel = new Symbol(TokenTypes.ASSIGN, 0, 0, ":=");
hash.put(":=", semel);
semel = new Symbol(TokenTypes.SCOLON, 0, 0, ";");
hash.put(";", semel);
semel = new Symbol(TokenTypes.PERIOD, 0, 0, ".");
hash.put(".", semel);
semel = new Symbol(TokenTypes.COMMA, 0, 0, ",");
hash.put(",", semel);
semel = new Symbol(TokenTypes.OP, 0, 0, "(");
hash.put("(", semel);
semel = new Symbol(TokenTypes.CP, 0, 0, ")");
hash.put(")", semel);
semel = new Symbol(TokenTypes.COLON, 0, 0, ":");
hash.put(":", semel);
semel = new Symbol(TokenTypes.LT, 0, 0, "<");
hash.put("<", semel);
semel = new Symbol(TokenTypes.GT, 0, 0, ">");
hash.put(">", semel);
semel = new Symbol(TokenTypes.LE, 0, 0, "<=");
hash.put("<=", semel);
semel = new Symbol(TokenTypes.GE, 0, 0, ">=");
hash.put(">=", semel);
semel = new Symbol(TokenTypes.EQ, 0, 0, "=");
hash.put("=", semel);
semel = new Symbol(TokenTypes.NE, 0, 0, "<>");
hash.put("<>", semel);
semel = new Symbol(TokenTypes.PLUS, 0, 0, "+");
hash.put("+", semel);
semel = new Symbol(TokenTypes.MINUS, 0, 0, "-");
hash.put("-", semel);
semel = new Symbol(TokenTypes.MULT, 0, 0, "*");
hash.put("*", semel);
semel = new Symbol(TokenTypes.DIV, 0, 0, "/");
hash.put("/", semel);
}
}
class 2:
import java.io.*;
import java.util.*;
class Symbol
{
TokenTypes sym; //token type
int lineNumber;
int position;
Object value; //semantic value of token
Symbol (TokenTypes symb, int ln, int pos, Object val)
{
sym = symb;
lineNumber = ln;
position = pos;
value = val;
}
}
class 3:
import java.io.*;
import java.util.*;
public class LexicalAnalyzer
{
int lexemebegin;
int state;
Symbol tokenValue;
TokenTypes currentTokenTypes;
String str;
int lineNumber;
int position=-1;
String lexeme;
SymbolTable Symb;
Scanner scan;
public LexicalAnalyzer(SymbolTable hash)
{
Symb=hash;
}
public TokenTypes nextToken()
{
char c;
state = 0;
c = nextChar();
while (true)//there is a token
{
switch (state)
{
case 0: if (c=='<')
state=1;
else
if(c=='=')
{
tokenValue = Symb.get("=");
return currentTokenTypes = TokenTypes.EQ;
}
else
if(c=='>')
state=6;
else
fail();
break;
case 1:
{
c = nextChar();
if (c=='=')
{
tokenValue = Symb.get("<=");
return currentTokenTypes = TokenTypes.GE;
}
else
if (c=='>')
{
tokenValue = Symb.get("<>");
return currentTokenTypes = TokenTypes.NE;
}
else
{
retract();
tokenValue= Symb.get("<");
return currentTokenTypes = TokenTypes.LT;
}
}
case 6:
{
c = nextChar ( );
if (c=='=')
{
tokenValue= Symb.get(">=");
return currentTokenTypes = TokenTypes.GE;
}
else
{
retract();
tokenValue = Symb.get(">");
return currentTokenTypes = TokenTypes.GT;
}
}
case 9: if ((c>='A' && c<='Z')||(c>='a' && c<='z'))
{
lexeme = "" +c;
state = 10;
}
else
fail();
break;
case 10:
{
c = nextChar();
while ((c>='A' && c<='Z')||(c>='a' && c<='z')||(c>='0' && c<='9'))
{
lexeme = lexeme + c;
c = nextChar();
}
//retract();
if (Symb.containsKey(lexeme))
tokenValue = Symb.get(lexeme);
else
{
tokenValue=new Symbol(TokenTypes.ID,lineNumber,position,lexeme);
Symb.put(lexeme, tokenValue);
}
return currentTokenTypes= TokenTypes.ID;
}
case 12:
if (c==':')
state=13;
else
if (c==';')
{
tokenValue = Symb.get(";");
return currentTokenTypes = TokenTypes.SCOLON;
}
else
if (c=='.')
{
tokenValue = Symb.get(".");
return currentTokenTypes = TokenTypes.PERIOD;
}
else
if (c==',')
{
tokenValue = Symb.get(",");
return currentTokenTypes = TokenTypes.COMMA;
}
else
if (c=='(')
{
tokenValue = Symb.get("(");
return currentTokenTypes = TokenTypes.OP;
}
else
if (c==')')
{
tokenValue = Symb.get(")");
return currentTokenTypes = TokenTypes.CP;
}
else
fail();
break;
case 13:
{
c = nextChar();
if (c=='=')
{
tokenValue = Symb.get(":"+c);
return currentTokenTypes = TokenTypes.ASSIGN;
}
else
{
retract();
tokenValue= Symb.get(":");
return currentTokenTypes = TokenTypes.COLON;
}
}
case 21:
if (c=='+')
{
tokenValue = Symb.get("+");
return currentTokenTypes = TokenTypes.PLUS;
}
else
if (c=='-')
{
tokenValue = Symb.get("-");
return currentTokenTypes = TokenTypes.MINUS;
}
else
if (c=='*')
{
tokenValue = Symb.get("*");
return currentTokenTypes = TokenTypes.MULT;
}
else
if (c=='/')
{
tokenValue = Symb.get("/");
return currentTokenTypes = TokenTypes.DIV;
}
else
fail();
break;
case 26:
if (c>='0' && c<='9')
{
lexeme = "" + c;
state = 27;
}
else
fail();
break;
case 27:
{
c = nextChar();
while (c>='0' && c<='9')
{
lexeme = lexeme + c;
c = nextChar();
}
retract();
if (Symb.containsKey(lexeme))
tokenValue = Symb.get(lexeme);
else
{
tokenValue=new Symbol(TokenTypes.NUM,lineNumber,position,lexeme);
Symb.put(lexeme, tokenValue);
}
return currentTokenTypes = TokenTypes.NUM;
}
case 29:
if(c==' ')
state = 30;
else
return currentTokenTypes = TokenTypes.DELIM;
break;
case 30: {
c = nextChar();
while (c == ' ') {
//
if (currentTokenTypes == TokenTypes.FINISH) {
return currentTokenTypes;
}
//
c = nextChar();
}
state = 0;
break;
}
}
}
}
public Symbol GetTokenValue()
{
return this.tokenValue;
}
public char nextChar()
{
position++;
if (position >= str.length())//end on line
if(scan.hasNext())
{
str=scan.nextLine();
lineNumber++;
position=0;
return str.charAt(position);
}
else // end of file
{
currentTokenTypes = TokenTypes.FINISH;
return (' ');
}
else //middle of line
return str.charAt(position);
}
public void retract()
{
position--;
}
public void fail()
{
switch (state)
{
case 0: state=9; break;
case 9: state=12; break;
case 12: state=21; break;
case 21: state=26; break;
case 26: state=29; break;
}
}
public void PrintToken()
{
System.out.println("position is :"+tokenValue.position);
System.out.println("kind of lexeme is :"+currentTokenTypes.toString());
System.out.println("value of lexeme :"+tokenValue.value.toString());
}
public void inputFileName(String fname)
{
try
{
scan=new Scanner(new File(fname));
str=scan.nextLine();
}
catch (Exception e)
{
System.out.println (e.getMessage());
}
}
}
class 4:
import java.io.*;
import java.util.*;
public class LexemeRun
{
public static void main(String ar[])
{
try
{
Scanner input = new Scanner(System.in);
String fname;
SymbolTable St = new SymbolTable();
LexicalAnalyzer la = new LexicalAnalyzer(St);
System.out.println("Type the file name: ");
fname = input.nextLine();
la.inputFileName(fname);
TokenTypes currentToken=la.nextToken();
while(currentToken != TokenTypes.FINISH)
{
la.PrintToken();
System.out.println();
currentToken=la.nextToken();
}
}
catch (Exception e)
{
System.out.println(e.getMessage());
}
}
}
input fie:
token.txt
content in the file:
abcdefg:=;3456
Execution and output:
javac LexemeRun.java
java LexemeRun
Type the file name
token.txt
position is :7
kind of lexeme is :ID
value of lexeme :abcdefg
position is :0
kind of lexeme is :EQ
value of lexeme :=
position is :0
kind of lexeme is :SCOLON
value of lexeme :;
position is :13
kind of lexeme is :NUM
value of lexeme :3456
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.