
2014 | Book

Language Processing with Perl and Prolog

Theories, Implementation, and Application


About this Book

The areas of natural language processing and computational linguistics have continued to grow in recent years, driven by the demand to automatically process text and spoken data. With the processing power and techniques now available, research is scaling up from lab prototypes to real-world, proven applications.

This book teaches the principles of natural language processing, first covering practical linguistic issues such as encoding and annotation schemes, the definition of words, tokens, parts of speech, and morphology, as well as key concepts in machine learning, such as entropy, regression, and classification, which are used throughout the book. It then details the language-processing functions involved, including part-of-speech tagging using rules and stochastic techniques, using Prolog to write phrase-structure grammars, syntactic formalisms and parsing techniques, semantics, predicate logic and lexical semantics, and the analysis of discourse and applications in dialogue systems. A key feature of the book is the author's hands-on approach throughout, with sample code in Prolog and Perl, extensive exercises, and a detailed introduction to Prolog. The reader is supported with a companion website that contains teaching slides, programs, and additional material.

The second edition is a complete revision of the techniques presented in the book to reflect advances in the field: the author has redesigned or updated all the chapters, added two new ones, and considerably expanded the sections on machine-learning techniques.

Table of Contents

Frontmatter
Chapter 1. An Overview of Language Processing
Abstract
Linguistics is the study and description of human languages. Linguistic theories of grammar and meaning have been developed since antiquity and throughout the Middle Ages. However, modern linguistics originated at the end of the nineteenth century and the beginning of the twentieth century. Its founder and most prominent figure was probably Ferdinand de Saussure (1916). Over time, modern linguistics has produced an impressive set of descriptions and theories.
Pierre M. Nugues
Chapter 2. Corpus Processing Tools
Abstract
A corpus, plural corpora, is a collection of texts or speech stored in an electronic, machine-readable format. A few years ago, large electronic corpora of more than a million words were rare, expensive, or simply not available. At present, huge quantities of text are accessible in many languages of the world. They can easily be collected from a variety of sources, most notably the Internet, where corpora of hundreds of millions of words are within the reach of most computational linguists.
Pierre M. Nugues
Chapter 3. Encoding and Annotation Schemes
Abstract
At the most basic level, computers only understand binary digits and numbers. Corpora, as well as any computerized texts, have to be converted into a digital format to be read by machines. From their early American history, computers inherited encoding formats designed for the English language. The most famous one is the American Standard Code for Information Interchange (ASCII). Although well established for English, the adaptation of ASCII to other languages led to clunky evolutions and many variants. It ended (temporarily?) with Unicode, a universal scheme compatible with ASCII and intended to cover all the scripts of the world.
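As a minimal illustration in Prolog of characters being numbers under the hood (an assumption of this summary, not an excerpt from the chapter), the code point of a character is a plain integer, and Unicode keeps the ASCII values unchanged:

% Assumes the source is read as Unicode (e.g., UTF-8 in SWI-Prolog).
?- char_code('A', Code).   % 'A' is 65 in ASCII and U+0041 in Unicode
Code = 65.
?- char_code('é', Code).   % outside ASCII, Unicode assigns U+00E9
Code = 233.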
Pierre M. Nugues
Chapter 4. Topics in Information Theory and Machine Learning
Abstract
Information theory underlies the design of codes. Claude Shannon probably started the field with a seminal article (1948), in which he defined a measure of information: entropy. In this chapter, we introduce essential concepts in information theory: entropy, optimal coding, cross entropy, and perplexity. Entropy is a very versatile measure of the average information content of symbol sequences, and we will explore how it can help us design efficient encodings.
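For reference, the entropy of a discrete random variable \(X\) with probability distribution \(p\) is defined, following Shannon, as \(H(X) = -\sum _{x\in X}p(x)\log _{2}p(x)\), measured in bits; this is the standard definition, stated here as a pointer rather than as the chapter's exact notation.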
Pierre M. Nugues
Chapter 5. Counting Words
Abstract
We saw in Chap. 2 that words have specific contexts of use. Pairs of words like strong and tea or powerful and computer are not random associations but the result of a preference. A native speaker will use them naturally, while a learner will have to learn them from books – dictionaries – where they are explicitly listed. Similarly, the words rider and writer sound much alike in American English, but they are likely to occur with different surrounding words. Hence, hearing an ambiguous phonetic sequence, a listener will discard the improbable rider of books or writer of horses and prefer writer of books or rider of horses (Church and Mercer 1993).
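One common way to quantify such word preferences – a sketch from the collocation literature, not necessarily the chapter's own measure – is pointwise mutual information, \(I(w_{1},w_{2}) =\log _{2}\frac{P(w_{1},w_{2})}{P(w_{1})P(w_{2})}\), which is positive when two words co-occur more often than chance would predict.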
Pierre M. Nugues
Chapter 6. Words, Parts of Speech, and Morphology
Abstract
We can divide the lexicon into parts of speech (POS), that is, classes whose words share common grammatical properties. The concept of part of speech dates back to the philosophy and teaching of classical antiquity. Plato made a distinction between the verb and the noun. After him, the classification of words further evolved, and parts of speech grew in number until Dionysius Thrax fixed and formulated them in a form that we still use today. Aelius Donatus popularized the list of the eight parts of speech: noun, pronoun, verb, adverb, participle, conjunction, preposition, and interjection, in his work Ars grammatica, a standard reference in the Middle Ages.
Pierre M. Nugues
Chapter 7. Part-of-Speech Tagging Using Rules
Abstract
We saw that looking up a word in a lexicon or carrying out a morphological analysis on a word can leave it with an ambiguous part of speech. The word chair, which can be assigned two tags, noun or verb, is an example of ambiguity. It is a noun in the phrase a chair, and a verb in to chair a session.
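A minimal Prolog sketch of such a contextual rule, with a toy lexicon assumed for illustration (it shows the idea only, not the rule formalism used in the chapter):

% tag_in_context(+PreviousWord, +Word, -Tag)
% chair is a noun after a determiner and a verb after 'to'.
tag_in_context(Prev, chair, noun) :-
    determiner(Prev).
tag_in_context(to, chair, verb).

determiner(a).
determiner(the).

% ?- tag_in_context(a, chair, Tag).  gives Tag = noun
% ?- tag_in_context(to, chair, Tag). gives Tag = verb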
Pierre M. Nugues
Chapter 8. Part-of-Speech Tagging Using Statistical Techniques
Abstract
Like transformation-based tagging, statistical part-of-speech (POS) tagging assumes that each word is known and has a finite set of possible tags. These tags can be drawn from a dictionary or a morphological analysis. Statistical methods enable us to determine a sequence of part-of-speech tags \(T = t_{1},t_{2},t_{3},\ldots,t_{n}\), given a sequence of words \(W = w_{1},w_{2},w_{3},\ldots,w_{n}\).
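In the standard noisy-channel formulation – stated here as background, consistent with most statistical taggers – the tagger searches for \(\hat{T} =\arg \max _{T}P(T\mid W) =\arg \max _{T}P(W\mid T)\,P(T)\), where \(P(W\mid T)\) is estimated from lexical counts and \(P(T)\) from tag n-gram counts.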
Pierre M. Nugues
Chapter 9. Phrase-Structure Grammars in Prolog
Abstract
This chapter introduces parsing using phrase-structure rules and grammars. It uses the Definite Clause Grammar (DCG) notation (Pereira and Warren 1980), which is a feature of virtually all Prologs. The DCG notation enables us to transcribe a set of phrase-structure rules directly into a Prolog program.
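As a flavor of the notation, here is a minimal DCG, with a toy lexicon assumed for illustration (not a grammar taken from the book):

s --> np, vp.
np --> det, noun.
vp --> verb, np.
det --> [the].
noun --> [waiter].
noun --> [meal].
verb --> [brought].

% ?- s([the, waiter, brought, the, meal], []).
% succeeds: the sentence belongs to the grammar's language.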
Pierre M. Nugues
Chapter 10. Partial Parsing
Abstract
The description of language in terms of layers – words, parts of speech, and syntax – could suggest that a parse tree is a necessary step to obtain the semantic representation of a sentence. Yet, many industrial applications do not rely on syntax as we presented it before. The reason is that a syntactic parser can be expensive in terms of resources and sometimes it is not worth the cost.
Pierre M. Nugues
Chapter 11. Syntactic Formalisms
Abstract
Studies on syntax have been the core of linguistics for most of the twentieth century. While the goals of traditional grammars had been mostly to prescribe what the correct usage of a language is, the then-emerging syntactic theories aimed at an impartial description of language structures. These ideas revolutionized the field. Research activity was particularly intense in the years 1940–1970, and the focus on syntax was so great that, for a time, it nearly eclipsed phonetics, morphology, semantics, and other disciplines of linguistics.
Pierre M. Nugues
Chapter 12. Constituent Parsing
Abstract
In the previous chapters, we used Prolog's built-in search mechanism and the DCG notation to parse sentences and constituents. This search mechanism has drawbacks, however. To name some of them: its depth-first strategy does not handle left-recursive rules well, and backtracking is sometimes inefficient. In addition, while DCGs are appropriate for describing constituents, we have not yet seen a means of parsing dependencies.
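To make the left-recursion problem concrete, consider this hypothetical fragment: with the clause ordering below, a query such as ?- np([the, man, with, the, telescope], []). makes Prolog call np from within np without consuming any input, so the depth-first search loops forever.

np --> np, pp.    % left-recursive: tried first, recurses endlessly
np --> det, noun.
pp --> prep, np.
det --> [the].
noun --> [man].
noun --> [telescope].
prep --> [with].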
Pierre M. Nugues
Chapter 13. Dependency Parsing
Abstract
Parsing dependencies consists of finding links between heads (also called governors) and modifiers (or dependents) – one word being the root of the sentence (Fig. 13.1). In addition, each link can be annotated with a grammatical function.
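As a minimal illustration (an assumed notation, not the book's data format), the dependency graph of the sentence 'the waiter brought the meal' could be encoded as Prolog facts, pairing each word with its position to keep duplicate word forms distinct:

% dep(Head-Position, Modifier-Position, Function)
root(brought-3).
dep(brought-3, waiter-2, subject).
dep(brought-3, meal-5, object).
dep(waiter-2, the-1, determiner).
dep(meal-5, the-4, determiner).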
Pierre M. Nugues
Chapter 14. Semantics and Predicate Logic
Abstract
Semantics deals with the meaning of words, phrases, and sentences. It is a wide and open subject, intricately interwoven with the structure of the mind. The potential domain of semantics is immense and covers many human cognitive activities. It has naturally spurred a great number of theories. From the philosophers of ancient and medieval times, to the logicians of the nineteenth century, the psychologists and linguists of the twentieth century, and now computer scientists, a huge effort has been devoted to this subject.
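As a pointer to what follows, Prolog itself offers a direct notation for predicate-argument structures; a minimal, assumed example (not taken from the chapter) represents 'Bill loves Mary' as a fact, and a logical question as a query over it:

loves(bill, mary).

% ?- loves(bill, X).   % Whom does Bill love?
% X = mary.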
Pierre M. Nugues
Chapter 15. Lexical Semantics
Abstract
Formal semantics provides clean grounds and well-mastered devices for bridging language and logic. Although debated, the assumption of such a link is common sense. There is obviously a connection – at least a partial one – between sentences and logical representations. However, there are more controversial issues. For instance, can the whole of language be handled in terms of logical forms? Language practice, psychology, and pragmatics are not taken into account. These areas pertain to cognition: processes of symbolization, conceptualization, or understanding.
Pierre M. Nugues
Chapter 16. Discourse
Abstract
The grammatical concepts we have seen so far apply mostly to isolated words, phrases, or sentences. Texts and conversations, either full or partial, are out of their scope. Yet to us, human readers, writers, and speakers, language goes beyond the simple sentence. It is now time to describe models and processing techniques to deal with a succession of sentences. Although analyzing texts or conversations often requires syntactic and semantic treatments, it goes further. In this chapter, we shall make an excursion to the discourse side, that is, paragraphs, texts, and documents. In the next chapter, we shall consider dialogue, that is, a spoken or written interaction between a user and a machine.
Pierre M. Nugues
Chapter 17. Dialogue
Abstract
While discourse, materialized in texts, delivers static information, dialogue is dynamic and consists of two interacting discourses. Once written, the content of a discourse is unalterable and will remain as it is for its future readers. On the contrary, a dialogue enables information flows to be exchanged, complemented, and merged into a composition that is not known in advance. Both parties in a dialogue provide feedback, and influence or modify the final content over the course of the conversation.
Pierre M. Nugues
Backmatter
Metadata
Title
Language Processing with Perl and Prolog
Author
Pierre M. Nugues
Copyright Year
2014
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-41464-0
Print ISBN
978-3-642-41463-3
DOI
https://doi.org/10.1007/978-3-642-41464-0