The first half of this course focuses on understanding the relationships between the lexical analyzer, the parser, and the symbol table (ST). In general, we have something that looks like the following:
    source code --> Lexical Analyzer <--> Parser
                           \                /
                            \              /
                             Symbol Table
Q#1: What is the purpose of the lexical analyzer?

Q#2: What is the purpose of the parser?

Q#3: What is the purpose of the symbol table?
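To make the diagram concrete, the sketch below shows one conventional arrangement, assuming the parser drives the process by repeatedly asking the lexical analyzer for tokens. The names (get_next_token, st_insert, st_lookup) are illustrative, not a fixed interface:

    /* Hypothetical module interfaces for the pipeline above. */
    typedef struct Token Token;      /* produced by the lexical analyzer */
    typedef struct STEntry STEntry;  /* one symbol-table record */

    /* Parser <--> Lexical Analyzer: each call consumes one lexeme
       from the source code and returns the corresponding token. */
    Token get_next_token(void);

    /* Both modules consult the symbol table. */
    STEntry *st_insert(const char *lexeme);
    STEntry *st_lookup(const char *lexeme);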
In general, the lexical analyzer, the parser, and the ST are three distinct modules within the compiler. It is possible to combine the parser with the lexical analyzer, but this tends to produce less efficient compilers.
Reasons for separating lexical analysis and parsing:
1) simpler design (e.g. it is easier to strip out white space in the lexical analyzer than in the parser)
2) improved efficiency (e.g. specialized buffering and I/O techniques can be applied when the lexical analyzer is not combined with the parser)
3) better portability (e.g. machine-dependent details can be confined to the lexical analyzer)
A few definitions are necessary:
token - identifies a category understood by the parser (e.g. relop, id)
pattern - rules that define strings of a particular token type (e.g. id is a letter followed by letters and digits)
lexeme - a sequence of characters matched by a pattern of a particular token type (e.g. cost is a lexeme for token id)
Our lexical analyzer will return (token, value) pairs.
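One possible C representation of these pairs (a sketch; the particular token names and the choice of union members are assumptions, not part of the course's fixed design):

    #include <stddef.h>

    /* Token categories understood by the parser. */
    enum token_type { ID, NUM, ASSIGN, PLUS, MINUS, SEMI };

    /* A (token, value) pair. The value is meaningful only for some
       tokens: an id carries a symbol-table reference, a num carries
       its numeric value; tokens such as + and ; carry no value. */
    struct token {
        enum token_type type;
        union {
            int    num_value;   /* value of a NUM token              */
            size_t st_index;    /* symbol-table entry of an ID token */
        } value;
    };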
Q#4: Consider the following C statement: sum = 2 + sum - num--; Identify the (token, value) pair for each lexeme in the statement. Note: not every token has a value.
    lexeme | token | value
    -------+-------+------
           |       |
A limited amount of error handling happens during the lexical analysis stage. Consider, for example:

    fi (vals == nums[i]);

Is fi a misspelling of the keyword if, or an undeclared function identifier? The lexical analyzer alone cannot tell, so few errors are even detectable at this stage. If an error does occur, how do we proceed?

Possibilities:
1) panic mode: delete successive characters until a well-formed token is found
2) delete one extraneous character
3) insert a missing character
4) replace an incorrect character with a correct one
5) transpose two adjacent characters
So how might we proceed with implementing a lexical analyzer?
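One starting point is a hand-written scanner. The sketch below, assuming a token set like the struct shown earlier, reads characters from stdin, skips white space, and returns one (token, lexeme) pair per call; computing values and inserting ids into the symbol table are omitted, and a multi-character operator such as -- would need an extra character of lookahead:

    #include <ctype.h>
    #include <stdio.h>

    enum token_type { ID, NUM, PLUS, MINUS, ASSIGN, SEMI, END, ERROR };

    struct token {
        enum token_type type;
        char lexeme[64];          /* the matched characters */
    };

    /* Return the next (token, lexeme) pair from stdin. */
    struct token get_next_token(void)
    {
        struct token t = { ERROR, "" };
        size_t n = 0;
        int c = getchar();

        while (isspace(c))        /* white space separates lexemes */
            c = getchar();

        if (c == EOF) {
            t.type = END;
            return t;
        }

        if (isalpha(c)) {         /* id: a letter, then letters and digits */
            while (isalnum(c) && n < sizeof t.lexeme - 1) {
                t.lexeme[n++] = (char)c;
                c = getchar();
            }
            ungetc(c, stdin);     /* push back the lookahead character */
            t.type = ID;
        } else if (isdigit(c)) {  /* num: one or more digits */
            while (isdigit(c) && n < sizeof t.lexeme - 1) {
                t.lexeme[n++] = (char)c;
                c = getchar();
            }
            ungetc(c, stdin);
            t.type = NUM;
        } else {                  /* single-character tokens */
            t.lexeme[n++] = (char)c;
            switch (c) {
            case '+': t.type = PLUS;   break;
            case '-': t.type = MINUS;  break;
            case '=': t.type = ASSIGN; break;
            case ';': t.type = SEMI;   break;
            default:  t.type = ERROR;  break;  /* error-recovery hook */
            }
        }
        t.lexeme[n] = '\0';
        return t;
    }

A loop such as while ((t = get_next_token()).type != END) would let a parser pull tokens on demand, matching the interface sketched after the diagram above.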