Using Flex and Bison
Volume Number: 16
Issue Number: 7
Column Tag: Programming
by Aaron Montgomery
Scanners, Parsers and Abstract Syntax Trees, oh
my!
What Are Flex and Bison (And, Why Do I Need Them)
Flex and bison are code-generating tools designed to aid in compiler development. In
particular, flex (short for fast lexical analyzer) will take a sequence of characters
and break them apart into tokens (words). It could be used to write code breaking this
article into a sequence of words and punctuation. Bison (whose name comes from
YACC-Yet Another Compiler Compiler) will group the words into sentences, the
sentences into paragraphs and the paragraphs into sections. But wait, you protest, I
don't write compilers. That's okay, there are still a number of reasons to learn a little
about flex and bison. First, you may find that writing (or even understanding the code
behind) a simple compiler may make the syntax errors coming from other compilers
more understandable. There are also situations where you might want to be able to
analyze a sequence of tokens even if the final output is not compiled code. For example,
if you are transferring data over the Internet in the form of strings, you can use flex
and bison to create a parser to decode the information.
My own personal experience with flex and bison began in a course in compiler design
at the University of Wisconsin-Madison and I am currently working on
reimplementing an interpreter for the language isetl in ANSI C++. This recent work
has required that I brush up on my skills with flex and bison (in the Metrowerks'
CodeWarrior environment) and doing a small project with a toy language similar to
isetl seemed like a good first step.
This article will present a small simple language called Miniset and then present the
steps I followed to develop a parser for the language. It is highly unlikely that you
have ever used (or even heard of) Miniset, largely because I made it up last month in
order to practice with flex and bison before working on isetl. The final result will be
an application which simply echoes out the source you type in (although it will make it
conform to my stylistic conventions). This echo application is probably a good first
step in any parser's development since it allows you to confirm that your parser is
correctly interpreting the source. The next step would be to write the code to
implement the Miniset source, but that would be another article.
This article will assume some understanding of C++ although bison and flex can be
used even if you intend to do all your programming in C. You are also expected to have a
basic understanding of the CodeWarrior IDE (in particular, how to adjust project
settings in preference panels). While I will spend some time on the peculiarities of
using flex and bison in the CodeWarrior environment, much of the article remains
valid in other environments (or even other operating systems). A basic understanding
of Miniset may come in handy in places and there is a short overview of the language
included with the project file. This article does not assume that you have any previous
experience with either flex or bison. After reading the article, you will hopefully have
enough understanding of flex and bison to determine if they suit your needs. In
addition, you will have seen some of the issues that can arise as you try to build a
parser or a scanner with these tools. This article does not intend to provide a complete
introduction to these tools. If you decide that you are going to use flex or bison, then I
would strongly encourage you to obtain a copy of the book lex & yacc as well as
download the bison and flex documentation (references available at end of article).
How Flex and Bison Work
Before discussing this particular project, an overview of how flex and bison operate is
probably appropriate. I will call the input file for flex a "scanner specification" and it
is conventional for it to have an l suffix. I will call the input file for bison a "parser
specification" and it is conventional for it to have a y suffix. When you tell flex or
bison to process its input file, it will produce a file containing C/C++ source code. I
will call these files source files in order to distinguish them from the specification
files. Your own code will probably call the function yyparse() created by bison from
your parser specification. This function will in turn call yylex() created by flex from
your scanner specification. You can use yylex() independent of yyparse() and you can
supply your own yylex() to be used with a bison generated yyparse().
When you call yyparse() from your code, the parser will repeatedly call yylex() to
obtain tokens (words and punctuation) from the input stream. With each new token,
the parser will decide whether to push the token onto its internal stack or to reduce a
sequence of tokens on the stack to create a new language part to push onto the stack.
These new parts of the language are referred to as "non-terminals". One of the
non-terminals is the "start-state". When this non-terminal is matched, yyparse()
will return. This is probably best explained with the following example.
Assume that the language knows that a noun-phrase consists of an article ("the" or
"a") followed by a noun. That a sentence consists of a noun-phrase, a verb, a
noun-phrase and a period. Furthermore, assume that the parser is attempting to
match a sentence. The non-terminals here are the noun-phrase and the sentence, the
tokens are articles, nouns, verbs and periods. The start-state is a complete sentence.
If the input stream consists of "The man ate a worm." then a call to yyparse() will
result in the sequence of events depicted in Figure 1.
Figure 1. A sample parsing.
• The parser will obtain the first token from the scanner - obtaining "The",
an article. The parser will push the "The" on the stack (Figure 1-a).
• The parser will obtain another token from the scanner - obtaining "man",
a noun. Since there is an article on the stack, the parser will pop the article
off the stack and replace it with a noun-phrase "The man" (Figure 1-b).
• The parser will obtain another token from the scanner - obtaining "ate",
a verb. There is no rule reducing "noun-phrase verb" so it pushes "ate" on the
stack (Figure 1-c).
• The parser will obtain another token from the scanner - obtaining "a", an
article. There is no rule reducing "noun-phrase verb article" and so it pushes
the "a" on the stack (Figure 1-d).
• The parser will obtain another token from the scanner - obtaining
"worm", a noun. The parser can now reduce the "article noun" on the top of the
stack to a noun-phrase and places the noun-phrase on the stack (Figure 1-e).
• The parser will obtain another token from the scanner - obtaining ".", a
period. The parser can now reduce the "noun-phrase verb noun-phrase
period" to a sentence and places this on the stack (Figure 1-f).
The parser has now matched a sentence and returns. When the parser returns, its
internal stack vanishes and so you are responsible for finding some method of
returning the information you need from that stack. The most common method (the one
used in this article) is to use a global pointer to the needed information. You can save
this information because will execute a block of C/C++ code that you provide each time
bison matches a pattern. Frequently, these blocks of C code will be used to produce
something called an "abstract syntax tree" or "AST" for short. Abstract syntax trees
are data structures designed to mirror the syntactic structure of the pattern being
parsed. For example, the abstract syntax tree from the sentence above might look like
the one in Figure 2 (note, we don't need to have a period included in the tree since it
occurs for all sentences).
Figure 2. An abstract syntax tree.
Bison facilitates this construction by assigning to each token or non-terminal a value
and the block of C/C++ code for a rule reduction can use the values of the tokens and
non-terminals in the pattern to set the value of the non-terminal being reduced. You
will be introduced to the details of this in the description of Parser 3 below.
Installing Flex & Bison for CodeWarrior
The flex and bison plug-ins do not come packaged with CodeWarrior, so the first thing
you will need to do is get them. (an URL is provided at the end of this article). After
decompressing the folders, you should place the flex and bison folders into the
CodeWarrior Plugins folder of CodeWarrior. You now need to make sure that the
file-mappings are correct. In the project settings, you should make sure flex is
associated with the file suffix l and the precompile flag is set and that bison is
associated with the file suffix y and the precompile flag is set. If you don't set the
precompile flag, CodeWarrior will complain of an error (from the old source file) and
then show you the source where the error no longer exists.
If you intend to only use them to generate PPC code, then you are set. If you plan to
generate code for a 68K processor, you will need to change the parser skeleton file
called bison.simple which is in the Bison plug-in folder. This change is needed to
support a limitation in CodeWarrior's implementation of the alloca() function in the
68K environment. I reported this to Metrowerks and received a work-around. You
should replace the file in the Bison plug-in folder with the file included with the
project. The alloca() limitation only affects 68K development and there is a
work-around so I suspect (even hope to be honest) that it sits low on Metrowerks'
priority list.
Another problem is that it is easy to edit the source file (for example after a compile
error or while debugging) and forget to change the specification file. If you do this,
flex and bison will overwrite your changes the next time you run them. This can be
particularly frustrating when you infrequently change the specification files. You can
fix the error in the source file, leave the files untouched for a month, then rebuild the
project and (mysteriously) a bug which you thought you had fixed a month ago will
reappear.
The Project
The project builds a parser for a language called Miniset. Miniset is a small language