
2011 | Book

Introduction to Compiler Design


About this book

This textbook is intended for an introductory course on Compiler Design, suitable for use in an undergraduate programme in computer science or related fields.

Introduction to Compiler Design presents techniques for building realistic, though non-optimizing, compilers for simple programming languages, using methods that are close to those used in "real" compilers, albeit slightly simplified in places for presentation purposes. All phases required for translating a high-level language to machine language are covered, including lexing, parsing, intermediate-code generation, machine-code generation and register allocation. Interpretation is covered briefly.

To remain neutral with respect to implementation languages, the book presents algorithms in pseudo-code rather than in any specific programming language, and in many cases gives suggestions for implementation in several different language flavors. The techniques are illustrated with examples and exercises.

The author has taught Compiler Design at the University of Copenhagen for over a decade, and the book is based on material used in the undergraduate Compiler Design course there.

Additional material for use with this book, including solutions to selected exercises, is available at http://www.diku.dk/~torbenm/ICD

Table of Contents

Frontmatter
Chapter 1. Lexical Analysis
Abstract
A lexical analyser, also called a lexer or scanner, takes as its input a string of individual characters and divides this string into word-like entities called tokens. Additionally, it filters out whatever separates the tokens (the so-called white-space), i.e., lay-out characters (spaces, newlines etc.) and comments. For lexical analysis, specifications are traditionally written using regular expressions, an algebraic notation for describing sets of strings. The generated lexers belong to a class of extremely simple programs called finite automata. This chapter describes regular expressions and finite automata, their properties, and how regular expressions can be converted to finite automata. Finally, we discuss some practical aspects of lexer generators.
Torben Ægidius Mogensen
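
As an illustration of the kind of program this chapter is about, here is a minimal lexer sketch in Python (the book itself presents everything in pseudo-code; the token classes and the name tokenize are invented for this example). Each token class is given by a regular expression, and the lexer tries each pattern in turn at the current position, filtering out white-space. Real generated lexers use the longest-match rule over a finite automaton; this sketch only mimics the interface.

```python
# Minimal lexer sketch: regular expressions describe token classes,
# white-space is recognised and then discarded rather than returned.
import re

TOKEN_SPEC = [
    ("NUM",  r"\d+"),           # integer literals
    ("ID",   r"[A-Za-z_]\w*"),  # identifiers
    ("OP",   r"[+\-*/()=]"),    # single-character operators
    ("SKIP", r"[ \t\n]+"),      # white-space: matched, then filtered out
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":       # drop white-space, keep real tokens
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(tokenize("x = 2 + 41")))
# [('ID', 'x'), ('OP', '='), ('NUM', '2'), ('OP', '+'), ('NUM', '41')]
```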
Chapter 2. Syntax Analysis
Abstract
Where lexical analysis splits the input into tokens, the purpose of syntax analysis (also known as parsing) is to recombine these tokens: not back into a list of characters, but into something that reflects the structure of the text. This "something" is typically a data structure called the syntax tree of the text. As the name indicates, this is a tree structure. The leaves of this tree are the tokens found by the lexical analysis, and if the leaves are read from left to right, the sequence is the same as in the input text. Hence, what is important in the syntax tree is how these leaves are combined to form the structure of the tree and how the interior nodes of the tree are labelled. In addition to finding the structure of the input text, the syntax analysis must also reject invalid texts by reporting syntax errors. As syntax analysis is less local in nature than lexical analysis, more advanced methods are required. We do, however, use the same basic strategy: a notation suitable for human understanding is transformed into a machine-like low-level notation suitable for efficient execution. This process is called parser generation.
Torben Ægidius Mogensen
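
The chapter's subject is generating parsers from grammars, but the syntax tree it describes can be illustrated with a tiny hand-written recursive-descent parser. A minimal sketch in Python, assuming an invented toy grammar Exp -> Term { '+' Term }, Term -> NUM | '(' Exp ')'; note how the leaves of the resulting tree, read left to right, reproduce the token sequence of the input:

```python
# Minimal recursive-descent sketch building a syntax tree as nested tuples.
def parse_expr(tokens, pos=0):
    """Exp -> Term { '+' Term }; returns (tree, next position)."""
    tree, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        tree = ("+", tree, right)
    return tree, pos

def parse_term(tokens, pos):
    """Term -> NUM | '(' Exp ')'."""
    if tokens[pos] == "(":
        tree, pos = parse_expr(tokens, pos + 1)
        if tokens[pos] != ")":
            raise SyntaxError("expected ')'")   # reject invalid texts
        return tree, pos + 1
    return ("num", tokens[pos]), pos + 1

tree, _ = parse_expr(["1", "+", "(", "2", "+", "3", ")"])
print(tree)   # ('+', ('num', '1'), ('+', ('num', '2'), ('num', '3')))
```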
Chapter 3. Scopes and Symbol Tables
Abstract
An important concept in programming languages is the ability to name objects such as variables, functions and types. Each such named object will have a declaration, where the name is defined as a synonym for the object. This is called binding. Each name will also have a number of uses, where the name is used as a reference to the object to which it is bound. Often, the declaration of a name has a limited scope: a portion of the program where the name will be visible. Such declarations are called local declarations, whereas a declaration that makes the declared name visible in the entire program is called global. It may happen that the same name is declared in several nested scopes. In this case, it is normal that the declaration closest to a use of the name will be the one that defines that particular use. A compiler will need to keep track of names and the objects these are bound to, so that any use of a name will be attributed correctly to its declaration. This is typically done using a symbol table.
Torben Ægidius Mogensen
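
A symbol table with nested scopes can be as simple as a stack of dictionaries: entering a scope pushes a fresh dictionary, and lookup walks from the innermost scope outwards, so the declaration closest to a use is the one that wins. A minimal sketch in Python (the class and method names are invented for this example; the book presents symbol tables in pseudo-code):

```python
# Symbol table as a stack of scopes; inner declarations shadow outer ones.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                  # the innermost scope is last

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def bind(self, name, obj):              # a declaration binds name -> object
        self.scopes[-1][name] = obj

    def lookup(self, name):                 # a use refers to the nearest binding
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise KeyError(f"{name} is not declared")

syms = SymbolTable()
syms.bind("x", "global int")
syms.enter_scope()
syms.bind("x", "local string")
print(syms.lookup("x"))    # 'local string': the inner declaration shadows
syms.exit_scope()
print(syms.lookup("x"))    # 'global int'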
Chapter 4. Interpretation
Abstract
The simplest way to execute a program is interpretation. Interpretation is done by a program called an interpreter, which takes the abstract syntax tree of a program and executes it by inspecting the syntax tree to see what needs to be done. This is similar to how a human evaluates a mathematical expression: We insert the values of variables in the expression and evaluate it bit by bit, starting with the innermost parentheses and moving out until we have the result of the expression. We can then repeat the process with other values for the variables. There are some differences, however. Where a human being will copy the text of the formula with variables replaced by values and then write a sequence of more and more reduced copies of the formula until it is reduced to a single value, an interpreter will keep the formula (or, rather, the abstract syntax tree of an expression) unchanged and use a symbol table to keep track of the values of variables. Instead of reducing a formula, the interpreter is a function that takes an abstract syntax tree and a symbol table as arguments and returns the value of the expression represented by the abstract syntax tree. The function can call itself recursively on parts of the abstract syntax tree to find the values of subexpressions, and when it evaluates a variable, it can look its value up in the symbol table. This process can be extended to also handle statements and declarations, but the basic idea is the same: A function takes the abstract syntax tree of the program and, possibly, some extra information about the context (such as a symbol table or the input to the program) and returns the output of the program.
Torben Ægidius Mogensen
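
The recursive evaluation function described above is short enough to sketch directly. A minimal Python version for arithmetic expressions, with the abstract syntax tree encoded as nested tuples and the symbol table as a plain dictionary (both encodings are choices made for this example):

```python
# Interpreter sketch: the tree stays unchanged; variable values live in
# the symbol table, and the function recurses on subexpressions.
def eval_exp(tree, table):
    kind = tree[0]
    if kind == "num":
        return tree[1]
    if kind == "var":
        return table[tree[1]]              # look the variable up in the table
    if kind == "+":
        return eval_exp(tree[1], table) + eval_exp(tree[2], table)
    if kind == "*":
        return eval_exp(tree[1], table) * eval_exp(tree[2], table)
    raise ValueError(f"unknown node {kind!r}")

ast = ("+", ("var", "x"), ("*", ("num", 2), ("var", "y")))
print(eval_exp(ast, {"x": 1, "y": 20}))    # 41; rerun with a new table to
                                           # evaluate the same tree again
```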
Chapter 5. Type Checking
Abstract
Lexing and parsing will reject many texts as not being correct programs. However, many languages have well-formedness requirements that cannot be handled exclusively by the techniques seen so far. These requirements can, for example, be static type correctness or a requirement that pattern-matching or case-statements are exhaustive. These properties are most often not context-free, i.e., they cannot be checked by membership of a context-free language. Consequently, they are checked by a phase that (conceptually) comes after syntax analysis (though it may be interleaved with it). These checks may happen in a phase that does nothing else, or they may be combined with the actual execution or translation to another language. Often, the translator may exploit or depend on type information, which makes it natural to combine the calculation of types with the actual translation. In Chap. 4, we covered type checking during execution, which is normally called dynamic typing. In this chapter, we assume that type checking and related checks are done in a phase prior to execution or translation (i.e., static typing), and similarly assume that any information gathered by this phase is available in subsequent phases.
Torben Ægidius Mogensen
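
A static type checker has the same recursive shape as the interpreter sketched under Chapter 4, but it computes types rather than values, before the program ever runs. A minimal Python sketch, again using invented tuple-encoded syntax trees and a dictionary mapping variable names to their declared types:

```python
# Type-checker sketch: same traversal as the interpreter, but the result
# of each subtree is a type, and mismatches are reported as errors.
def check_exp(tree, env):
    kind = tree[0]
    if kind == "num":
        return "int"
    if kind == "bool":
        return "bool"
    if kind == "var":
        return env[tree[1]]                # declared type from the symbol table
    if kind == "+":
        t1, t2 = check_exp(tree[1], env), check_exp(tree[2], env)
        if t1 == t2 == "int":
            return "int"
        raise TypeError(f"'+' applied to {t1} and {t2}")
    raise ValueError(f"unknown node {kind!r}")

print(check_exp(("+", ("var", "x"), ("num", 1)), {"x": "int"}))   # int
# check_exp(("+", ("bool", True), ("num", 1)), {}) would raise TypeError
```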
Chapter 6. Intermediate-Code Generation
Abstract
The final goal of a compiler is to get programs written in a high-level language to run on a computer. This means that, eventually, the program will have to be expressed as machine code which can run on the computer. This does not mean that we need to translate directly from the high-level abstract syntax to machine code. Many compilers use a medium-level language as a stepping-stone between the high-level language and the very low-level machine code. Such stepping-stone languages are called intermediate code. We will generate intermediate code using translation functions for each syntactic category, similar to the functions we used for interpretation and type checking. We generate code for a syntactic construct independently of the constructs around it, except that the parameters of a translation function may hold information about the context (such as symbol tables) and the result of a translation function may (in addition to the generated code) hold information about how the generated code interfaces with its context (such as which variables it uses).
Torben Ægidius Mogensen
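
A translation function of the kind described here can be sketched as follows: it recurses over the syntax tree, returns the generated instruction list together with the variable that holds the result, and draws fresh temporary names from a generator. The three-address instruction format and the helper new_temp are invented for this example:

```python
# Intermediate-code generation sketch: each call returns (code, place),
# where place is the variable holding the value of the subexpression.
import itertools

counter = itertools.count()
def new_temp():                            # fresh intermediate-code variable
    return f"t{next(counter)}"

def trans_exp(tree):
    kind = tree[0]
    if kind == "num":
        t = new_temp()
        return [f"{t} := {tree[1]}"], t
    if kind == "var":
        return [], tree[1]                 # no code needed; value is in the variable
    if kind in ("+", "*"):
        code1, p1 = trans_exp(tree[1])
        code2, p2 = trans_exp(tree[2])
        t = new_temp()
        return code1 + code2 + [f"{t} := {p1} {kind} {p2}"], t

code, place = trans_exp(("+", ("var", "x"), ("*", ("num", 2), ("var", "y"))))
print("\n".join(code), "-- result in", place)
# t0 := 2
# t1 := t0 * y
# t2 := x + t1 -- result in t2
```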
Chapter 7. Machine-Code Generation
Abstract
The intermediate language we have used in Chap. 6 is quite low-level and similar to the type of machine code you can find on modern RISC processors, with a few exceptions: we have used an unbounded number of variables, where a processor has a bounded number of registers; we have used high-level instructions for function definitions, calls and returns; we have assumed that any constant can be an operand to an arithmetic instruction, where RISC processors typically allow only small constants as operands; and the intermediate-language IF-THEN-ELSE instruction has two target labels, where, on most processors, the conditional jump instruction has only one target label and simply falls through to the next instruction when the condition is false. The problem of mapping a large set of variables to a small number of registers is handled by register allocation, as explained in Chap. 8. Functions are treated in Chap. 9. We look at the remaining two problems in this chapter.
Torben Ægidius Mogensen
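
The two-target IF-THEN-ELSE problem mentioned above has a standard solution: choose the conditional jump so that one of the two branches becomes a fall-through. A minimal sketch in Python, emitting invented pseudo-assembly (the instruction names are not any real processor's):

```python
# Translating a two-target conditional to a one-target conditional jump.
# next_label is the label of the code placed immediately after this jump.
def trans_if(cond, label_true, label_false, next_label):
    if label_false == next_label:          # false branch is the fall-through
        return [f"branch_if {cond}, {label_true}"]
    if label_true == next_label:           # negate: true branch falls through
        return [f"branch_if_not {cond}, {label_false}"]
    return [f"branch_if {cond}, {label_true}",   # neither follows: jump both ways
            f"jump {label_false}"]

print(trans_if("x<0", "L1", "L2", next_label="L2"))
# ['branch_if x<0, L1'] -- the ELSE code at L2 comes right after, so no jump
```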
Chapter 8. Register Allocation
Abstract
When generating intermediate code in Chap. 6, we have freely used as many variables as we found convenient. In Chap. 7, we have simply translated variables in the intermediate language one-to-one into registers in the machine language. Processors, however, do not have an unlimited number of registers, so we need register allocation to handle this conflict. The purpose of register allocation is to map a large number of variables into a small(ish) number of registers. This can often be done by letting several variables share a single register, but sometimes there are simply not enough registers in the processor. In this case, some of the variables must be temporarily stored in memory. This is called spilling.
Torben Ægidius Mogensen
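
Register allocation is commonly formulated as colouring an interference graph: variables are nodes, an edge joins two variables that are live at the same time, and each colour is a register. A greedy sketch in Python, with an uncolourable variable spilled to memory (the graph encoding is invented for illustration, and real allocators choose spill candidates far more carefully):

```python
# Greedy graph-colouring sketch: neighbours in the interference graph
# must not share a register; a variable with no free register is spilled.
def allocate(variables, interference, num_regs):
    allocation = {}
    for v in variables:
        taken = {allocation[u] for u in interference.get(v, ())
                 if u in allocation}
        free = [r for r in range(num_regs) if r not in taken]
        allocation[v] = free[0] if free else "spill"   # spill: keep v in memory
    return allocation

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(allocate(["a", "b", "c", "d"], graph, num_regs=2))
# {'a': 0, 'b': 1, 'c': 'spill', 'd': 1} -- three mutually live variables
# cannot share two registers, so one of them is spilled
```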
Chapter 9. Functions
Abstract
In Chap. 6 we have shown how to translate the body of a single function. Function calls and returns were left (mostly) untranslated by using the CALL and RETURN instructions in the intermediate code. Nor did we in Chap. 7 show how these instructions should be translated. We will, in this chapter, remedy these omissions. We will initially assume that all variables are local to the function that accesses them and that parameters are call-by-value, meaning that the value of an argument expression is passed to the called function. This is the default parameter-passing mechanism in most languages, and in many languages (e.g., C or SML) it is the only one. A single procedure body uses (in most languages) a finite number of variables. We have seen in Chap. 8 that we can map these variables into a (possibly smaller) set of registers. A program that uses recursive procedures or functions may, however, use an unbounded number of variables, as each recursive invocation of the function has its own set of variables, and there is no bound on the recursion depth. We cannot hope to keep all these variables in registers, so we will use memory for some of them. The basic idea is that only variables that are local to the active (most recently called) function will be kept in registers. All other variables will be kept in memory. When a function is called, all the live variables of the calling function (which we will refer to as the caller) need to be stored in memory so the registers will be free for use by the called function (the callee). When the callee returns, the stored variables are loaded back into registers. We will use a stack for this storing and loading, pushing register contents on the stack when they must be saved and popping them back into registers when they must be restored.
Torben Ægidius Mogensen
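
The saving discipline described above is easy to sketch: before a call, the caller pushes its live registers on the stack; after the call returns, it pops them back in reverse order, so the callee was free to use every register in between. A minimal Python code generator emitting invented pseudo-assembly:

```python
# Caller-saves sketch: save live registers across a call using a stack.
def compile_call(live_registers, callee):
    code = [f"push {r}" for r in live_registers]            # save caller's state
    code.append(f"call {callee}")
    code += [f"pop {r}" for r in reversed(live_registers)]  # restore, LIFO order
    return code

print("\n".join(compile_call(["R1", "R2"], "f")))
# push R1
# push R2
# call f
# pop R2
# pop R1
```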
Backmatter
Metadata
Title
Introduction to Compiler Design
Author
Torben Ægidius Mogensen
Copyright Year
2011
Publisher
Springer London
Electronic ISBN
978-0-85729-829-4
Print ISBN
978-0-85729-828-7
DOI
https://doi.org/10.1007/978-0-85729-829-4
