Skip to main content

2008 | Buch

Parsing Techniques

A Practical Guide

verfasst von: Dick Grune, Ceriel J. H. Jacobs

Verlag: Springer New York

insite
SUCHEN

Über dieses Buch

Parsing, also referred to as syntax analysis, has been and continues to be an essential part of computer science and linguistics. Today, parsing techniques are also implemented in a number of other disciplines, including but not limited to, document preparation and conversion, typesetting chemical formulae, and chromosome recognition.

This second edition presents new developments and discoveries that have been made in the field. Parsing techniques have grown considerably in importance, both in computational linguistics where such parsers are the only option, and computer science, where advanced compilers often use general CF parsers. Parsing techniques provide a solid basis for compiler construction and contribute to all existing software: enabling Web browsers to analyze HTML pages and PostScript printers to analyze PostScript. Some of the more advanced techniques are used in code generation in compilers and in data compression.

In linguistics, the importance of formal grammars was recognized early on, but only recently have the corresponding parsing techniques been applied. Also their importance as general pattern recognizers is slowly being acknowledged. This text Parsing Techniques explores new developments, such as generalized deterministic parsing, linear-time substring parsing, parallel parsing, parsing as intersection, non-canonical methods, and non-Chomsky systems.

To provide readers with low-threshold access to the full field of parsing techniques, this new edition uses a two-tiered structure. The basic ideas behind the dozen or so existing parsing techniques are explained in an intuitive and narrative style, and problems are presented at the conclusion of each chapter, allowing the reader to step outside the bounds of the covered material and explore parsing techniques at various levels. The reader is also provided with an extensive annotated bibliography as well as hints and partial solutions to a number of problems. In the bibliography, hundreds of realizations and improvements of parsing techniques are explained in a much terser, yet still informal, style, improving its readability and usability.

The reader should have an understanding of algorithmic thinking, especially recursion; however, knowledge of any particular programming language is not required.

Inhaltsverzeichnis

Frontmatter
Chapter 1. Introduction
Parsing is the process of structuring a linear representation in accordance with a given grammar. This definition has been kept abstract on purpose to allow as wide an interpretation as possible. The “linear representation” may be a sentence, a computer program, a knitting pattern, a sequence of geological strata, a piece of music, actions in ritual behavior, in short any linear sequence in which the preceding elements in some way restrict1 the next element. For some of the examples the grammar is well known, for some it is an object of research, and for some our notion of a grammar is only just beginning to take shape.
Dick Grune, Ceriel J. H. Jacobs
Chapter 2. Grammars as a Generating Device
In computer science as in everyday parlance, a “grammar” serves to “describe” a “language”. If taken at face value, this correspondence, however, is misleading, since the computer scientist and the naive speaker mean slightly different things by the three terms. To establish our terminology and to demarcate the universe of discourse, we shall examine the above terms, starting with the last one.
Dick Grune, Ceriel J. H. Jacobs
Chapter 3. Introduction to Parsing
To parse a string according to a grammar means to reconstruct the production tree (or trees) that indicate how the given string can be produced from the given grammar. It is significant in this respect that one of the first publications on parsing (Greibach’s 1963 doctoral thesis [6]), was titled “Inverses of Phrase Structure Generators”, where a phrase structure generator is to be understood as a system for producing phrases from a phrase structure (actually context-free) grammar.
Dick Grune, Ceriel J. H. Jacobs
Chapter 4. General Non-Directional Parsing
In this chapter we will present two general parsing methods, both non-directional: Unger’s method and the CYK method. These methods are called non-directional because they access the input in a seemingly arbitrary order. They require the entire input to be in memory before parsing can start.
Dick Grune, Ceriel J. H. Jacobs
Chapter 5. Regular Grammars and Finite-State Automata
Regular grammars, Type 3 grammars, are the simplest form of grammars that still have generative power. They can describe concatenation (joining two strings together) and repetition, and can specify alternatives, but they cannot express nesting. Regular grammars are probably the best-understood part of formal linguistics and almost all questions about them can be answered.
Dick Grune, Ceriel J. H. Jacobs
Chapter 6. General Directional Top-Down Parsing
In this chapter, we will discuss top-down parsing methods that try to rederive the input sentence by prediction. As explained in Section 3.2.1, we start with the start symbol and try to produce the input from it; at any point in time, we have a sentential form that represents our prediction of the rest of the input sentence.
Dick Grune, Ceriel J. H. Jacobs
Chapter 7. General Directional Bottom-Up Parsing
As explained in Section 3.2.2, directional bottom-up parsing is conceptually very simple. At all times we are in the possession of a sentential form that derives from the input text through a series of leftmost reductions. These leftmost reductions during parsing correspond to rightmost productions that produced the input text: the first leftmost reduction corresponds to the last rightmost production, the second corresponds to the one but last, etc.
Dick Grune, Ceriel J. H. Jacobs
Chapter 8. Deterministic Top-Down Parsing
In Chapter 6 we discussed two general top-down methods: one using breadth-first search and one using depth-first search. These methods have in common the need to search to find derivations, and thus are not efficient. In this chapter and the next we will concentrate on parsers that do not have to search: there will always be only one possibility to choose from. Parsers with this property are called deterministic. Deterministic parsers have several advantages over non-deterministic ones: they are much faster; they produce only one parse tree, so ambiguity is no longer a problem; and this parse tree can be constructed on the fly rather than having to be retrieved afterwards. But there is a penalty: the class of grammars that the deterministic parsing methods are suitable for, while depending on the method chosen, is more restricted than that of the grammars suitable for non-deterministic parsing methods. In particular, only non-ambiguous grammars can be used.
In this chapter we will focus on deterministic top-down methods. As has been explained in Section 3.5.5, there is only one such method, this in contrast with the deterministic bottom-up methods, which will be discussed in the next chapter. From Chapters 3 and 6 we know that in a top-down parser we have a prediction for the rest of the input, and that this prediction has either a terminal symbol in front, in which case we “match”, or a non-terminal, in which case we “predict”.
Dick Grune, Ceriel J. H. Jacobs
Chapter 9. Deterministic Bottom-Up Parsing
There is a great variety of deterministic bottom-up parsing methods. The first deterministic parsers (Wolpe [110], Adams and Schlesinger [109]) were bottom-up parsers and interest has only increased since. The full bibliography of this book on its web site contains about 280 entries on deterministic bottom-up parsing against some 85 on deterministic top-down parsing. These figures may not directly reflect the relative importance of the methods, but they are certainly indicative of the fascination and complexity of the subject of this chapter.
Dick Grune, Ceriel J. H. Jacobs
Chapter 10. Non-Canonical Parsers
Top-down parsers make their predictions in pre-order, in which the parent nodes are identified before any of their children, and which imitates leftmost derivations (see Section 6.1); bottom-up parsers perform their reductions in post-order, in which the parent nodes are identified after all of their children have been identified, and which imitates rightmost derivation (see the introduction in Chapter 7). These two orders of producing and visiting trees are called “canonical”, and so are the parsing techniques that follow them.
Dick Grune, Ceriel J. H. Jacobs
Chapter 11. Generalized Deterministic Parsers
Generalized deterministic parsers are general breadth-first context-free parsers that gain efficiency by exploiting the methods and tables used by deterministic parsers, even if these tables have conflicts (inadequate states) in them. Viewed alternatively, generalized deterministic parsers are deterministic parsers extended with a breadth- first search mechanism so they will be able to operate with tables with some multiple, conflicting entries. The latter view is usually more to the point.
Dick Grune, Ceriel J. H. Jacobs
Chapter 12. Substring Parsing
In Chapter 2 grammars were explained as finite devices for generating infinite sets of strings, and parsing was explained as the mechanism for determining whether a given string—a sentence —belongs to the set of strings—the language —generated by the grammar. As a bonus parsing also reveals the syntactic structure of that sentence.
Dick Grune, Ceriel J. H. Jacobs
Chapter 13. Parsing as Intersection
In 1961 Bar-Hillel, Perles and Shamir [219] proved that “the intersection of a context-free language with a regular language is again a context-free language”. On the face of it, this means that when we take the set of strings that constitute a given CF language and remove from it all strings that do not occur in a given FS language, we get a set of strings for which a CF grammar exists. Actually, it means quite a bit more.
Dick Grune, Ceriel J. H. Jacobs
Chapter 14. Parallel Parsing
There are two main reasons for doing parallel programming: the problem has inherent parallelism, in which case the parallel programming comes naturally; and the problem is inherently sequential but we need the speed-up promised by parallel processing, in which case the parallel programming is often a struggle.
Dick Grune, Ceriel J. H. Jacobs
Chapter 15. Non-Chomsky Grammars and Their Parsers
Just as the existence of non-stick pans points to user dissatisfaction with “sticky” pans, the existence of non-Chomsky grammars points to user dissatisfaction with the traditional Chomsky hierarchy. In both cases ease of use is the issue.
Dick Grune, Ceriel J. H. Jacobs
Chapter 16. Error Handling
Until now, we have discussed parsing techniques while largely ignoring what happens when the input contains errors. In practice, however, the input often contains errors, the most common being typing errors and misconceptions, but we could also be dealing with a grammar that only roughly, not precisely, describes the input, for example in pattern matching. So the question arises how to deal with errors. A considerable amount of research has been done on this subject, far too much to discuss in one chapter. We will therefore limit our discussion to some of the more well-known error handling methods, and not pretend to cover the field; see (Web)Section 18.2.7 for references to more in-depth information.
Dick Grune, Ceriel J. H. Jacobs
Chapter 17. Practical Parser Writing and Usage
Practical parsing is concerned almost exclusively with context-free (Type 2) and regular (Type 3) grammars. Unrestricted (Type 0) and context-sensitive (Type 1) grammars are hardly used since, first, they are user-unfriendly in that it is next to impossible to construct a clear and readable Type 0 or Type 1 grammar and, second, all known parsers for them have exponential time requirements. Chapter 15 describes a number of polynomial-time and even linear-time parsers for non-Chomsky systems, but few have seen practical application. For more experimental results see (Web)Section 18.2.6.
Dick Grune, Ceriel J. H. Jacobs
Chapter 18. Annotated Bibliography
The purpose of this annotated bibliography is to supply the reader with more material and with more detail than was possible in the preceding chapters, rather than to just list the works referenced in the text. The annotations cover a considerable number of subjects that have not been treated in the rest of the book.
Dick Grune, Ceriel J. H. Jacobs
Backmatter
Metadaten
Titel
Parsing Techniques
verfasst von
Dick Grune
Ceriel J. H. Jacobs
Copyright-Jahr
2008
Verlag
Springer New York
Electronic ISBN
978-0-387-68954-8
Print ISBN
978-0-387-20248-8
DOI
https://doi.org/10.1007/978-0-387-68954-8

Premium Partner