![]() |
Writing a Compiler |
Dr. Rick Smith
|
|
Dr. Smith Home | Research | Classes | Blackboard | Cryptosmith | QMCS Home | UST A-Z | UST Home
last update: |
||
This is essentially the 20,000 foot view of a task that can easily take months to finish. For background: I've written several lexical analyzers, parsers, and code generators over the years, but it's been a long time. What I describe here is the classical outline of a compiler's design. While this design predates the application of object oriented technology to compilers, I doubt that object oriented design has really affected compiler design at this level.
Traditional compiler writing breaks the task into 2 or 3 phases, depending on how you look at it. I'll talk about it in 3 phases.
Here you use a regular grammar (implemented with a finite state machine) to parse the input text and isolate lexical units, like identifiers, punctuation, operators represented as special characters, brackets, and so on.
There's a lot of theoretical and practical work on regular grammars, also called 'type 0' grammars. When you see "regular expression" in something like GREP, they're referring to something that implements a regular grammar. Essentially it's something where you use simple loops with finite nesting to read and intepret the text.
Lexical analysis can be implemented "by hand" without too much pain or you can look for existing tools to automatically generate a lexical analyzer. Existing Java methods for token scanning should make it pretty easy to implement even without a tool.
This phase can go on more or less in parallel with 1A in that this phase reads the output of the previous phase. Phase 1A may need to read ahead a little to analyze complete lexical units, but it never needs to be more than one text line ahead.
In 'real parsing' you use a context free grammar (implemented with a push down automaton, or directly through a set of recursive procedures) to build a tree representation of the components of the parsed source code. For example, the parse tree of a C function would have a subtree whenever there's a nested set of curly braces.
The design is often called "recursive descent parser" since it applies the same parsing functions recursively as you move to more deeply nested language constructs.
Again, there's a whole bunch of theoretical work that's been done on parsing and context free (Type 1) grammars and languages. Spoken English is generally said to be a "context sensitive" (Type 2) language, for which there aren't any clean and effective general purpose parsing starategies.
Generally when I've had to write parsers, I find it easiest to write a grammar description first. The traditional way was to use "Bacus Naur Form" (BNF) which describes context free languages and can easily be typed with a text editor. You can also do diagrams, like they occasionally put in introductory programming books.
Traditionally, a compiler would use 1 pass to generate a parsed intermediate form of the program (which was written to a temporary file) and then would generate the code during a second pass that read the temporary file. Today, a typical source file can fit entirely in RAM along with its parse tree, and have plenty of room left over, so the "two pass" design seems irrelevant.
Once you have the parse tree, you generate code by analyzing the subtrees and building your instruction stream from there. If you build the tree correctly, you ought to be able to construct the code directly from the information available in the tree and associated symbol tables. It should be pretty straightforward to look at expressions in the parse tree and tell if they resolve to constants.
Generally when I've done languages I've used them to generate text in a different language. Years ago, Honeywell had this weird research project where I read in database descriptions and operations in some "graph oriented" structure (a fetish at Honeywell back then) and converted it to relational SQL, since that was the only type of database most people had. The only times I've generated binary is when I've had to write assembly language translators.
I have never tried to map this into an object oriented design. I've never thought about the two at the same time before this, though no doubt some people have. What I described above is what you'd see in a traditional compiler course. There may be a lot of ways to streamline things in the object oriented world. For example, there's a flavor of recursive descent to the way object inheritance works, though I suspect you'd still need a hierarchy of actual object instances to make the parsing yield something you can work with.
This work by Dr. Rick Smith is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. |