
Lex

Lex stands for Lexical Analyzer. Lex is a tool for generating scanners: programs that recognize lexical patterns in text. These lexical patterns (regular expressions) are defined in a particular syntax, and a matched regular expression may have an associated action, which may include returning a token. When Lex receives input in the form of a file or text, it attempts to match the text against its regular expressions. It takes input one character at a time and continues until a pattern is matched. If a pattern matches, Lex performs the associated action. If, on the other hand, no regular expression can be matched, further processing stops and Lex displays an error message.

Regular expressions in Lex

A regular expression is a pattern description written in a meta-language. An expression is made up of symbols. Normal symbols are characters and numbers, but there are other symbols that have special meaning in Lex. The following two tables define some of the symbols used in Lex and give a few typical examples.

Defining regular expressions in Lex

Character          Meaning
A-Z, 0-9, a-z      Characters and numbers that form part of the pattern.
.                  Matches any character except \n.
-                  Used to denote a range. Example: A-Z implies all characters from A to Z.
[]                 A character class. Matches any character in the brackets. If the first
                   character is ^ then it indicates a negation pattern.
                   Example: [abC] matches a, b, or C.
*                  Matches zero or more occurrences of the preceding pattern.
+                  Matches one or more occurrences of the preceding pattern.
?                  Matches zero or one occurrence of the preceding pattern.
$                  Matches end of line as the last character of the pattern.
{}                 Indicates how many times a pattern can be present.
                   Example: A{1,3} implies one to three occurrences of A may be present.
\                  Used to escape metacharacters, i.e., to remove the special meaning of
                   characters as defined in this table.
|                  Logical OR between expressions.
"<some symbols>"   Characters inside the quotes are taken literally; metacharacters lose
                   their special meaning.
()                 Groups a series of regular expressions.

Examples of regular expressions

Regular expression   Meaning
joke[rs]             Matches either jokes or joker.
A{1,2}shis+          Matches Ashis, AAshis, Ashiss, AAshiss, and so on: one or two A's,
                     followed by shi, followed by one or more s's.
(A[b-e])+            Matches one or more occurrences of A followed by any character
                     from b to e.
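To see patterns like these in action, they can be dropped into a tiny throwaway scanner. The sketch below is illustrative only: the printf messages and the catch-all rule are our own additions, not part of any standard example.

%{
#include <stdio.h>
%}
%%
joke[rs]    { printf("[%s] matched joke[rs]\n", yytext); }
(A[b-e])+   { printf("[%s] matched (A[b-e])+\n", yytext); }
.|\n        { /* silently discard everything else */ }
%%
int main() { yylex(); return 0; }
int yywrap() { return 1; }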

Tokens in Lex are declared like variable names in C. Every token has an associated expression.

Examples of token declarations

Token      Associated expression                  Meaning
number     ([0-9])+                               One or more occurrences of a digit
chars      [A-Za-z]                               Any letter
blank      " "                                    A blank space
word       (chars)+                               One or more occurrences of chars
variable   (chars)+(number)*(chars)*(number)*     Letters, optionally followed by digits,
                                                  further letters, and further digits

Programming in Lex

A Lex program is divided into three sections: the first section has global C and Lex declarations, the second section has the patterns (with actions coded in C), and the third section has supplemental C functions. main(), for example, would typically be found in the third section. These sections are delimited by %%.

Global C and Lex declarations

In this section we can add C variable declarations. For our word-counting program we will declare an integer variable here that holds the number of words counted by the program. We'll also perform the Lex token declarations.

Declarations for the word-counting program
%{
int wordCount = 0;
%}
chars      [A-Za-z\_\'\.\"]
numbers    ([0-9])+
delim      [" "\n\t]
whitespace {delim}+
words      {chars}+
%%

The double percent sign marks the end of this section and the beginning of the second of the three sections in Lex programming.

Lex rules for matching patterns

Let's look at the Lex rules for describing the tokens that we want to match. (We'll use C to define what to do when a token is matched.) Continuing with our word-counting program, here are the rules for matching tokens.

Lex rules for the word-counting program
{words}      { wordCount++; /* increase the word count by one */ }
{whitespace} { /* do nothing */ }
{numbers}    { /* one may want to add some processing here */ }
%%

C code

The third and final section of a Lex program covers C function declarations (and occasionally the main function). Note that this section has to include the yywrap() function. Lex has a set of functions and variables that are available to the user.

C code section for the word-counting program
void main() {
    yylex(); /* start the analysis */
    printf(" No of words: %d\n", wordCount);
}

int yywrap() {
    return 1;
}
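Putting the three sections together, the program can be built and run as shown below. The file name wordcount.l is our own choice; any name ending in .l works:

$ lex wordcount.l
$ cc lex.yy.c -o wordcount
$ echo "hello world foo" | ./wordcount
 No of words: 3

No -ll library is needed here because the program supplies its own main() and yywrap().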

Example 2:
/* A standalone LEX program that counts identifiers and commas */
%{
int nident = 0;  /* # of identifiers in the file being scanned */
int ncomma = 0;  /* # of commas in the file */
%}
/* definitions of basic character classes occur in the first part of the file */
digit [0-9]
alph  [a-zA-Z]
%%

 /* the second part of the file contains the definitions of patterns to recognize */
{alph}({alph}|{digit})*   {++nident;}
","                       {++ncomma;}
.                         ;
%%
/* the last part of the file contains user-defined code, as shown here. */

main() {
    yylex();
    printf("%s%d\n", "The no. of identifiers = ", nident);
    printf("%s%d\n", "The no. of commas = ", ncomma);
}

int yywrap() { return 1; } /* LEX calls this function when the end of the input file
                              is reached; returning 1 says there is no more input */
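A quick run, assuming the program is saved as count.l (the file name is our choice):

$ lex count.l
$ cc lex.yy.c -o count
$ echo "alpha, beta, gamma" | ./count

The no. of identifiers = 3
The no. of commas = 2

The blank line in the output appears because no rule matches \n (the . pattern does not match newlines), so Lex's default action echoes it.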

Example 3:

%{
#include <stdio.h>
%}
/* REGULAR DEFINITIONS */
delim  [\t\n]
ws     {delim}+
letter [A-Za-z]
digit  [0-9]

id     {letter}({letter}|{digit})*
number {digit}+
%%
{ws}      ;
if        {printf("IF");}
then      {printf("THEN");}
else      {printf("ELSE");}
{id}      {printf("Identifier");}
{number}  {printf("NUM");}
"<"       {printf("LT");}
"<="      {printf("LE");}
">"       {printf("GT");}
">="      {printf("GE");}
"="       {printf("EQ");}
%%

OUTPUT:

$ lex manifest.l
$ cc lex.yy.c -o manifest -ll
$ ./manifest
if(a>b) then a=20
IF(IdentifierGTIdentifier) THEN IdentifierEQNUM

Working with Yacc


Yacc (Yet Another Compiler-Compiler) implements one of the parsing algorithms: specifically, yacc implements bottom-up shift-reduce parsing of context-free grammars (CFGs). Much like lex, one develops text files that are processed by yacc to generate C functions, which can then be used in a larger C project to perform parsing. The input to yacc consists of three sections:

... definitions ...
%%
... rules ...
%%
... functions ...

The definitions section consists of declarations of tokens in the form:

%token INTEGER

The definitions section can also contain C code enclosed by

%{
... C code here ...
%}

The rules section consists of all the production rules of the context-free grammar. The head of the first production rule is considered the start symbol by yacc. Finally, the functions section contains any implementations of C functions that are necessary.

The command yacc is used to process the input file, and the output is two new files: one is the header file that contains the symbol definitions for the token types, and the other is the C file that contains the parser. By convention, the input file should have .y as its extension. By default, the two files are named y.tab.h and y.tab.c. However, yacc has an option, -o name.c, that allows other names (instead of y.tab) to be used.

Note: yacc, by default, generates only the .c file. You must explicitly use the -d option to force it to generate the definitions (.h) file.

Note: many grammars (even simple ones) will contain shift/reduce conflicts [see lecture], which means that your grammar is probably ambiguous. Yacc will complain if it detects shift/reduce conflicts, but they are just warnings.

Things to keep in mind:

1. The grammar is in the rules section of the input file to yacc.
2. Tokens are declared in the definitions section of the input file.
3. yacc will process the input file and produce two outputs: a C header (.h) file and a .c file, typically named y.tab.h and y.tab.c respectively.
4. The header file contains the token definitions; a concrete sketch of its contents follows this list. The header file should be used by lex when performing the lexical analysis.
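To make point 4 concrete, here is roughly what a generated y.tab.h looks like; the exact contents vary between yacc versions, so treat this as a sketch:

/* excerpt from a typical y.tab.h (details vary by yacc/bison version) */
#define INTEGER 257  /* named tokens get codes above 256, so that
                        single-character tokens such as '+' can simply
                        use their own character codes */

This is also why, as we will see, the lex rules can simply return yytext[0] for operators and parentheses.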

An example
Let's try to learn yacc by a concrete example. Consider the following grammar (for arithmetic expressions with only plus and minus operations allowed). The start symbol is PROG, which derives a (possibly empty) sequence of expressions, each terminated by a newline character.

PROG -> PROG E \n
      |

E -> INTEGER
   | E + E
   | E - E
   | ( E )

When expressing the grammar in yacc, we need to observe a few rules:

1. Non-terminal symbols are written in lowercase. In this case, we rename E to expr and PROG to prog from now on.
2. Terminal symbols can either be token types or string literals. In this case, we have INTEGER as a token type, and the strings +, -, (, ). The rule is that string literals must be enclosed in single quotes, so we write '+', '-', and so on from now on.
3. Each rule must end with ;
4. The head of a production rule and its body are separated by ":".
5. Repeated whitespace does not matter.

So, let's rewrite the grammar in the syntax of yacc:

prog : prog expr '\n'
     |
     ;

expr : INTEGER
     | expr '+' expr
     | expr '-' expr
     | '(' expr ')'
     ;

There are only two rules, but nonetheless it's a complete grammar. Note the rule for prog has an empty alternative, corresponding to the epsilon in the grammar. Since repeated whitespace doesn't matter, you could actually write the rule for expr as:

expr : INTEGER|expr '+' expr|expr '-' expr|'(' expr ')';

Ouch. That's really hard to read. So write beautiful code, please. Let's write the entire expr.y file.
%{
#include <stdio.h>
void yyerror(char *);
%}

%token INTEGER

%%

prog : prog expr '\n'
     |
     ;

expr : INTEGER
     | expr '+' expr
     | expr '-' expr
     | '(' expr ')'
     ;

%%

void yyerror(char *err) {
    printf("Error: %s\n", err);
}

int main(void) {
    yyparse();
    return 0;
}
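As mentioned earlier, a grammar like this one is ambiguous (expr '+' expr can group either way), so yacc will report shift/reduce conflicts. They are only warnings, and the parser still works, but one common way to silence them, sketched here as an optional addition to the definitions section, is to declare the operators' associativity:

%token INTEGER
%left '+' '-'   /* + and - are left-associative with equal precedence,
                   which resolves the shift/reduce conflicts */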

Note: yyparse() starts the parser. It will call yylex() to get the stream of tokens.

We needed to declare yyerror() in the definitions section. We needed to implement yyerror() in the functions section. We embedded the main program directly inside the yacc input file. Normally, one would have a separate C project (with many other .c files) to host the main program(s).

Recall that parsers do not perform lexical analysis on the raw text input. This is good in that we can think in terms of token types like INTEGER instead of character sets containing 0, 1, 2, and so on. But the downside is that we have to write a lexical analyzer for yacc to use. This is to be done by lex. The following is the lex code, expr.l:
%{ #include "y.tab.h" #include <stdlib.h> void yyerror(char *); %}

%%

[0-9]+

{ return INTEGER; }

[-+()\n]

{ return yytext[0]; }

[ \t]

/* skip whitespace */

yyerror("Unknown character");

%%

int yywrap(void) { return 1; }

Note: Lex needs to know the tokens (INTEGER, for example), so it includes the definition header (.h) file generated by yacc. Lex returns INTEGER when it encounters a sequence of digits. Lex returns the first character of yytext when it encounters the single-character strings needed by yacc: '+', '-', '(', ')', '\n'. Lex also helps clean up the input by throwing away whitespace.

Let's compile the input files using yacc and lex:

yacc -d expr.y

This generates y.tab.c and y.tab.h. Note that y.tab.h is needed by lex, as it is included by expr.l.

lex expr.l

This generates lex.yy.c.

gcc -o expr.exe y.tab.c lex.yy.c

The two .c files are really important: y.tab.c contains the parser code, and lex.yy.c contains the lexical analysis code. In this example, we (lazily) placed the main() function inside y.tab.c as well, so we don't have to write another .c file separately.

If you run ./expr.exe, it will read from standard input line by line and verify that the syntax is correct. For example,

echo "89+(48-(34+90))" | ./expr.exe

will succeed (without error), but

echo "abc+(48-))" | ./expr.exe

will fail.
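With the code above, the failing run prints something along these lines (the exact wording of the parser's message can vary between yacc versions):

$ echo "abc+(48-))" | ./expr.exe
Error: Unknown character
Error: Unknown character
Error: Unknown character
Error: syntax error

The first three lines come from the lexer's catch-all rule (one per letter of abc); the last comes from yyparse() calling yyerror() when it hits a token it cannot parse.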
