2012 | Book

Structure Discovery in Natural Language

About this Book

Current language technology is dominated by approaches that either enumerate a large set of rules or rely on large amounts of manually labelled data. Creating either is time-consuming and expensive, which is commonly thought to be the reason why automated natural language understanding has still not made its way into “real-life” applications.

This book sets an ambitious goal: to shift the development of language processing systems to a much more automated setting than in previous work. A new approach is defined: what if computers analysed large samples of language data on their own, identifying structural regularities that perform the necessary abstractions and generalisations, and thereby came to understand language better?
After defining the framework of Structure Discovery and shedding light on the nature and the graphic structure of natural language data, several procedures are described that do exactly this: they let the computer discover structures without supervision in order to boost the performance of language technology applications. Multilingual documents are sorted by language, word classes are identified, and semantic ambiguities are discovered and resolved without a dictionary or any other explicit human input. The book concludes with an outlook on the possibilities implied by this paradigm and places the methods in the perspective of human-computer interaction.

The target audience is academics at all levels (undergraduate and graduate students, lecturers and professors) working in natural language processing and computational linguistics, as well as natural language engineers seeking to improve their systems.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
The Structure Discovery paradigm for Natural Language Processing is introduced. This is a framework for learning structural regularities from large samples of text data, and for making these regularities explicit by introducing them into the data via self-annotation. In contrast to the predominant paradigms, Structure Discovery involves neither language-specific knowledge nor supervision and is therefore independent of language, domain and data representation. Working in this paradigm means setting up discovery procedures that operate on raw language material and iteratively enrich the data by using the annotations of previously applied Structure Discovery processes. Structure Discovery is motivated and justified by discussing the paradigm along Chomsky’s levels of adequacy for linguistic theories. Further, the vision of the complete Structure Discovery Machine is sketched: a series of processes that analyse language data by proceeding from the generic to the specific, where the abstractions of previous processes are used to discover and annotate even higher abstractions. Since these processes aim solely at identifying structure, their effectiveness is judged by their utility for other processes that access their annotations and by their measurable contribution in application-based settings. A data-driven approach is also advocated when defining these applications, proposing crowdsourcing and user logs as a means to widen the data acquisition bottleneck.
Chris Biemann
Chapter 2. Graph Models
Abstract
This chapter provides basic definitions of graph theory, a well-established field in mathematics dealing with the properties of graphs in their abstract form. Graph models are a way of representing information by encoding it in vertices and edges. In the context of language processing, vertices denote language units, whereas edges represent relations between them, e.g. a neighbourhood or similarity relation. This way, units and their similarities are naturally and intuitively translated into a graph representation. Note that the graph models discussed here are not to be confused with graphical models [133], which are notations that represent random variables as nodes in Bayesian learning approaches. After revisiting notions of graph theory in Section 2.1, the focus is set on large-scale properties of graphs occurring in many complex systems, such as the Small World property and scale-free degree distributions. A variety of random graph generation models exhibiting these properties will be discussed in Section 2.2. The study of the large-scale characteristics of graphs that arise in Natural Language Processing is an essential step towards approaching the data in which structural regularities shall be found: Structure Discovery processes have to be designed with awareness of these properties. Examining and contrasting the effects of processes that generate graph structures similar to those observed in language data sheds light on the structure of language and its evolution.
Chris Biemann
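
To make the graph representation concrete, the following minimal Python sketch (illustrative only, not taken from the book) builds a word co-occurrence graph from a few toy sentences and counts vertex degrees; on large corpora, such degree distributions are typically scale-free.

    from collections import Counter
    from itertools import combinations

    # Toy corpus; vertices are words, edges link words that co-occur
    # in the same sentence (one common choice of relation).
    sentences = [
        "the cat sat on the mat".split(),
        "the dog sat on the log".split(),
        "a cat and a dog".split(),
    ]

    edges = Counter()
    for sent in sentences:
        for u, v in combinations(sorted(set(sent)), 2):
            edges[(u, v)] += 1            # edge weight = co-occurrence count

    # Vertex degrees; in large language graphs their distribution
    # typically follows a power law (scale-free property).
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    print(degree.most_common(5))
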
Chapter 3. Small Worlds of Natural Language
Abstract
In this chapter, power-law distributions and Small World graphs originating from natural language data are examined in the fashion of Quantitative Linguistics. After presenting several data sources that exhibit power-law rank-frequency distributions in Section 3.1, graphs with Small World properties in language data are discussed in Section 3.2. We shall see that these characteristics are omnipresent in language data, and that we should be aware of them when designing Structure Discovery processes. Knowing, for example, that a few hundred words make up the bulk of a text, it is safe to use only these as contextual features without losing much text coverage. Knowing that word co-occurrence networks possess the scale-free Small World property has implications for clustering these networks. An interesting question is whether these characteristics are inherent only to real natural language data or whether they can be produced by generators of linear sequences in a much simpler way than our intuition about language complexity would suggest; in other words, we shall see how distinctive these characteristics are with respect to tests that decide whether a given sequence is natural language or not. Finally, an emergent random text generation model that captures many of the characteristics of natural language is defined and quantitatively verified in Section 3.3.
Chris Biemann
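
As a quick illustration of the rank-frequency behaviour discussed in Section 3.1, the following sketch computes word frequencies and the text coverage of the most frequent words; the file name corpus.txt is an assumption, and any plain-text corpus will do.

    from collections import Counter

    # Assumed input: any plain-text corpus in a file named corpus.txt.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = f.read().lower().split()

    ranked = Counter(tokens).most_common()

    # Under Zipf's law, frequency is roughly proportional to 1/rank,
    # so log(frequency) against log(rank) is close to a straight line.
    for rank, (word, count) in enumerate(ranked[:10], start=1):
        print(rank, word, count)

    # Coverage of the few hundred most frequent words (cf. the abstract).
    top = sum(count for _, count in ranked[:200])
    print("top-200 coverage: %.1f%%" % (100.0 * top / len(tokens)))
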
Chapter 4. Graph Clustering
Abstract
This chapter is devoted to discovering structure in graphs by grouping their vertices into meaningful clusters. In Section 4.1, clustering is briefly reviewed in general. The discipline of graph clustering is embedded into this broad field and discussed in detail, and a variety of graph clustering algorithms are examined in terms of mechanism, algorithmic complexity and adequacy for scale-free Small World graphs. Taking their virtues and drawbacks into consideration, an efficient graph partitioning algorithm called Chinese Whispers is developed in Section 4.2. It is time-linear in the number of edges, finds the number of clusters automatically and does not impose relative size restrictions on clusters, which makes it adequate for graphs from language data. Several extensions and parametrisations of the method are discussed. This algorithm will be used throughout later chapters to solve several NLP tasks.
Chris Biemann
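
The basic Chinese Whispers procedure is simple enough to sketch in a few lines of Python. The following is a simplified reimplementation of the core idea (every vertex starts in its own class and repeatedly adopts the class that is strongest among its neighbours); the extensions and parametrisations of Section 4.2 are omitted.

    import random
    from collections import defaultdict

    def chinese_whispers(adj, iterations=20, seed=0):
        """Partition a weighted graph given as {node: {neighbour: weight}}."""
        rng = random.Random(seed)
        labels = {v: v for v in adj}      # start: every vertex in its own class
        nodes = list(adj)
        for _ in range(iterations):
            rng.shuffle(nodes)            # randomised update order
            changed = False
            for v in nodes:
                votes = defaultdict(float)
                for u, w in adj[v].items():
                    votes[labels[u]] += w # neighbours vote with their edge weight
                if votes:
                    best = max(votes, key=votes.get)
                    if labels[v] != best:
                        labels[v] = best
                        changed = True
            if not changed:               # no label moved: converged
                break
        return labels

    # Toy example: two triangles joined by one weak edge fall into two clusters.
    adj = {
        "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1},
        "c": {"a": 1, "b": 1, "d": 0.1},
        "d": {"c": 0.1, "e": 1, "f": 1},
        "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
    }
    print(chinese_whispers(adj))

Each iteration touches every edge a constant number of times, which is the time-linearity in the number of edges mentioned in the abstract; because updates depend on a random visiting order, results are in general not fully deterministic.
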
Chapter 5. Unsupervised Language Separation
Abstract
This chapter presents an unsupervised solution to language identification. The method sorts multilingual text corpora sentence-wise into different languages. The main difference to previous methods is that no training data for the individual languages is provided and the number of languages does not have to be known beforehand. This application illustrates the benefits of a parameter-free graph clustering algorithm like Chinese Whispers: the data (words and their statistical dependencies) are naturally represented in a graph, and the number of clusters (here: languages) as well as their size distribution is unknown. The feasibility and robustness of the approach for non-standard language data is demonstrated in a case study on Twitter data.
Chris Biemann
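
A minimal sketch of this idea, reusing the chinese_whispers function from the Chapter 4 sketch above (the toy corpus and the majority-vote sentence assignment are illustrative assumptions, not the book's exact procedure): words that co-occur in a sentence are linked, word clusters then correspond to languages, and each sentence receives the majority cluster of its words.

    from collections import Counter, defaultdict
    from itertools import combinations

    sentences = [
        "the cat sat on the mat",
        "der hund schläft auf dem sofa",
        "the dog chased the cat",
        "die katze jagt den hund",
    ]

    # Word co-occurrence graph: words from the same sentence are linked,
    # so words of the same language end up densely interconnected.
    adj = defaultdict(dict)
    for sent in sentences:
        for u, v in combinations(sorted(set(sent.split())), 2):
            adj[u][v] = adj[u].get(v, 0) + 1
            adj[v][u] = adj[v].get(u, 0) + 1

    labels = chinese_whispers(adj)        # word -> language cluster

    # Each sentence gets the majority cluster of its words.
    for sent in sentences:
        votes = Counter(labels[w] for w in sent.split())
        print(votes.most_common(1)[0][0], "<-", sent)
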
Chapter 6. Unsupervised Part-of-Speech Tagging
Abstract
This chapter aims at homogeneity with respect to syntactic word classes (parts-of-speech, POS). The method presented here is called unsupervised POS-tagging, as its application results in corpus annotation comparable to what POS-taggers provide. Nevertheless, it produces slightly different categories than those assumed by a linguistically motivated POS-tagger, which hampers evaluation methods that compare unsupervised POS tags to linguistic annotations. To measure the extent to which unsupervised POS tagging can contribute in application-based settings, the system is evaluated in supervised POS tagging, word sense disambiguation, named entity recognition and chunking, improving on the state of the art for supervised POS tagging and word sense disambiguation. Unsupervised POS-tagging has been explored since the beginning of the 1990s. Unlike in previous approaches, the kind and number of different tags are here generated by the method itself. Another difference to other methods is that not all words above a certain frequency rank are assigned a tag: the method is allowed to exclude words from the clustering if their distribution does not match closely enough with that of other words. The lexicon size is considerably larger than in previous approaches, which results in a more robust tagging.
Chris Biemann
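
For illustration, here is a small distributional sketch in the spirit of this chapter (a toy simplification, not the book's pipeline): words are represented by their left and right neighbours among the most frequent words, and pairs with similar context vectors would then be linked into a graph and clustered, e.g. with the chinese_whispers function sketched after Chapter 4.

    from collections import Counter, defaultdict
    from math import sqrt

    corpus = "the cat sat on the mat . the dog sat on the log .".split()

    # Features: the most frequent words, serving as context anchors.
    features = [w for w, _ in Counter(corpus).most_common(3)]

    # Context vector of a word: its left/right neighbours among the features.
    context = defaultdict(Counter)
    for i, w in enumerate(corpus):
        for j in (i - 1, i + 1):
            if 0 <= j < len(corpus) and corpus[j] in features:
                context[w][(j - i, corpus[j])] += 1

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a if k in b)
        na = sqrt(sum(x * x for x in a.values()))
        nb = sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Similar context vectors suggest the same word class; thresholded
    # similarities would form the edges of a graph to be clustered.
    words = list(context)
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            s = cosine(context[u], context[v])
            if s > 0.5:
                print(u, v, round(s, 2))
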
Chapter 7. Word Sense Induction and Disambiguation
Abstract
Major difficulties in language processing are caused by the fact that many words are ambiguous, i.e. they have different meanings in different contexts but are written (or pronounced) in the same way. While syntactic ambiguities were addressed in the previous chapter, the focus is now set on the semantic dimension of this problem. In this chapter, the problem of word sense ambiguity is discussed in detail. A Structure Discovery process is set up whose output is used as a feature to successfully improve a supervised word sense disambiguation (WSD) system. On this basis, a high-precision system for automatically providing lexical substitutions is constructed.
Chris Biemann
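
A minimal sketch of graph-based word sense induction in this spirit (the target word, contexts and the majority-vote disambiguation are illustrative assumptions; chinese_whispers is the function sketched after Chapter 4): the co-occurrence neighbourhood of an ambiguous word is clustered, and the clusters act as induced senses.

    from collections import Counter, defaultdict
    from itertools import combinations

    target = "jaguar"
    contexts = [
        "jaguar speed engine car drive",
        "jaguar car engine road",
        "jaguar jungle cat prey",
        "jaguar cat jungle hunt",
    ]

    # Neighbourhood graph: co-occurrents of the target, linked when they
    # also co-occur with each other in some context of the target.
    adj = defaultdict(dict)
    for ctx in contexts:
        for u, v in combinations(sorted(set(ctx.split()) - {target}), 2):
            adj[u][v] = adj[u].get(v, 0) + 1
            adj[v][u] = adj[v].get(u, 0) + 1

    senses = chinese_whispers(adj)        # neighbour word -> induced sense

    # Disambiguation: a new context votes for the sense cluster with the
    # largest lexical overlap.
    new_context = "the jaguar raced down the road"
    votes = Counter(senses[w] for w in new_context.split() if w in senses)
    print(votes.most_common(1))
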
Chapter 8. Conclusion
Abstract
Unsupervised and knowledge-free natural language processing in the Structure Discovery paradigm has been shown to be successful and capable of producing a pre-processing quality equal to that of traditional systems, provided that sufficient in-domain raw text is available. It is therefore not only a viable alternative for languages with scarce annotated resources, but might also overcome the acquisition bottleneck of language processing for new tasks and applications. In this chapter, the contributions of this book are summarised and put into a larger perspective. An outlook is given on how Structure Discovery might change the way we design NLP systems in the future.
Chris Biemann
Backmatter
Metadata
Title
Structure Discovery in Natural Language
Author
Chris Biemann
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-25923-4
Print ISBN
978-3-642-25922-7
DOI
https://doi.org/10.1007/978-3-642-25923-4