All Databases MacTech Vol 06-1990

Lexical Analysis

Volume Number: 6

Issue Number: 5

Column Tag: Language Translation

Ambiguities and Lexical Analysis

By Clifford Story, Mount Prospect, IL

Part III. Ambiguities, Etc.

A. Introduction

This is the third part in my series on Language Translation. Language translation

has three phases: lexical analysis, parsing, and code generation. The first two parts

have dealt with building parsers using YACC; this is the final installment on that

subject.

That will take up only about half the article. The remainder will begin the new

topic of lexical analysis by presenting a skeleton filter tool.

B. Parsing

I now conclude parsing with a couple of miscellaneous topics. The first is a

simple way to handle ambiguities in a grammar. Then comes the horrible question of

error detection and reporting.

B(1). Ambiguities

Parser ambiguities, you may recall, are spots in the parse table where YACC’s

generation algorithm would place a shift and a reduction, or two reductions. A naive

grammar would generate an ambiguity when considering the input string

2 + 3 * 4

since it isn’t clear whether this means (2 + 3) * 4 or 2 + (3 * 4).

We faced this problem in the first part and solved it by re-writing the grammar

to eliminate the ambiguity. YACC offers another way of handling ambiguities (also

known as shift/reduce and reduce/reduce conflicts).

B(1)(a). YACC’s Default Rule

First, something you should know: YACC will work through any conflicts that it

finds and produce a non-ambiguous grammar for you. It resolves conflicts

automatically, and will only issues warnings, not errors.

Unfortunately, it does this by making assumptions about what you intended, and

these assumptions need not be correct. You can get a grammar that simply does not do

what you meant it to do, despite the lack of error messages. In my opinion, this is a

design flaw in YACC; conflicts should be fatal errors.

Even worse, MACYACC (see the first part for a review) uses an inordinately

complicated rule for resolving conflicts. Unix YACC uses a very simple rule, so you

can guess what it’s going to do. MACYACC, no such luck.

Therefore, YOU SHOULD ALWAYS REACT TO CONFLICT WARNINGS AS IF THEY

WERE FATAL ERRORS!

B(1)(b). Operator Precedence

Remember, conflicts arise when the parser doesn’t know which operation to

perform first, as in the 2 + 3 * 4 example. We intelligent humans know that

multiplication comes before addition: multiplication has a higher precedence than

addition. The conflict is easily and correctly resolved if YACC knows the order of

precedence among all operators in the grammar.

All that need be done is tell YACC:

/* 1 */
%left ‘+’ ‘-’
%left ‘*’ ‘/’

Multiplication and division have the same precedence (because they are on the

same line), and higher precedence than addition and subtraction (because their line is

after addition and subtraction’s line). That solves the problem (read on for an

explanation of the “left” business).

But what about strings like

2 - 3 + 4

Is that 2 - (3 + 4) or (2 - 3) + 4? The above rules don’t seem to say, since

subtraction and addition have the same precedence.

B(1)(c). What is Associativity?

Good question. In a string of the form

... • ID • ...

where ‘•’ is an operator and ‘ID’ an identifier, the • to the left of the ID will have a

higher precedence than the • to the right if • is left-associative; and conversely.

Thus,

2 - 3 - 4

is conventionally -5 because subtraction is left-associative. If it were

right-associative, 2 - 3 - 4 would equal 2 - (3 - 4), or 3. Just about everything is

left-associative.

So associativity defines order of evaluation among operators on the same level of

operator precedence. And the line

%left ‘+’ ‘-’

says that addition and subtraction are left-associative among one another.

YACC also includes keywords %right, which means just what you might think,

and %nonassoc, which means constructs of the form

... • ID • ...

are illegal (think of logical operators).

B(1)(d). The Lonely Minus...

There’s one more thing to worry about: unary operators. The classic example is

the minus sign. Do you remember my asking why a hex calculator was easier to write

than a decimal calculator? The answer is that a hex calculator doesn’t have to deal

with unary minus...

The problem arises because the minus sign is used for both unary and binary

operators (negation and subtraction). When we assign it a precedence in the %left

statement, we’re thinking about subtraction, so we give it a lower precedence than

multiplication and division. But negation should have the same precedence as

multiplication and division.

The solution is to give a grammar rule a precedence. This can be done with the

%prec keyword:

expr : ‘-’ expr %prec ‘*’

gives this rule the same precedence as multiplication. Thus, unary minus will have

the right precedence, and everything is at last conflict-free.

B(1)(e).  An Example
%token NUM
%left ‘+’ ‘-’
%left ‘*’ ‘/’
%left MINUS
%%
prob : expr ‘\n’
{
printf(“\t= %d\n”, $1);
return(0);
}
;
expr : expr ‘+’ expr
{
$$ = $1 + $3;
}
| expr ‘-’ expr
{
$$ = $1 - $3;
}
| expr ‘*’ expr
{
$$ = $1 * $3;
}
| expr ‘/’ expr
{
$$ = $1 / $3;
}
| ‘-’ expr %prec MINUS
{
$$ = - $2;
}
| ‘(‘ expr ‘)’
{
$$ = $2;
}
| NUM
{
$$ = $1;
}
;
%%
/***************************************/
#include “stdio.h”
#include “ctype.h”
#include “string.h”
/***************************************/
char *input;
char *token;
/***************************************/
#define yyerror(x)
{
printf(“\t%s [%s]\n”, x, token);
  return(0);
}
yyparse();
int yylex();
/***************************************/
void main(int argc, char *argv[])
{
char thestring[256];
if (argc < 1)
printf(“\tImpossible error!\n”);
else if (argc > 2)
printf(“\tHey!  One at a time!\n”);
else if (argc == 2)
{
input = argv[1];
yyparse();
}
else
{
printf(“? “);
while (strlen(gets(thestring)) > 2)
{
input = &thestring[2];
yyparse();
printf(“? “);
}
}
}
/***************************************/
int yylex()
{
token = strtok(input, “ “);
input = 0;
if (token == 0)
return(‘\n’);
else if (sscanf(token, “%d”, &yylval) == 1)
return(NUM);
else
return(token[0]);
}
/***************************************/

B(2). Catching Errors

You may have noticed that Pascal compilers write real error messages, while C

compilers write error messages more cryptic than C itself. There’s a reason for this:

Pascal compilers use recursive-descent parsers, while C compilers use table-driven

parsers. Recursive-descent parsers are the sort of parsers a person might naturally

write: get the next token; if it’s a ‘+’, do this, else get the next token and do that, and

so on. A location in the code of the parser corresponds to a particular grammar

Referenced by (5):