2018 | Book

Turkish Natural Language Processing


About this book

This book brings together work on Turkish natural language and speech processing over the last 25 years, covering numerous fundamental tasks ranging from morphological processing and language modeling, to full-fledged deep parsing and machine translation, as well as computational resources developed along the way to enable most of this work. Owing to its complex morphology and free constituent order, Turkish has proved to be a fascinating language for natural language and speech processing research and applications.
After an overview of the aspects of Turkish that make it challenging for natural language and speech processing tasks, this book discusses in detail the main tasks and applications of Turkish natural language and speech processing. A compendium of the work on Turkish natural language and speech processing, it is a valuable reference for new researchers considering computational work on Turkish, as well as a one-stop resource for commercial and research institutions planning to develop applications for Turkish. It also serves as a blueprint for similar work on other Turkic languages such as Azeri, Turkmen and Uzbek.

Table of Contents

Frontmatter
Chapter 1. Turkish and Its Challenges for Language and Speech Processing
Abstract
We present a short survey and exposition of some of the important aspects of Turkish that have proved to be interesting and challenging for natural language and speech processing. Most of the challenges stem from the complex morphology of Turkish and how morphology interacts with syntax. Finally, we provide a short overview of the major tools and resources developed for Turkish over the last two decades. (Parts of this chapter were previously published as Oflazer (Lang Resour Eval 48(4):639–653, 2014).)
Kemal Oflazer, Murat Saraçlar
Chapter 2. Morphological Processing for Turkish
Abstract
This chapter presents an overview of Turkish morphology followed by the architecture of a state-of-the-art wide coverage morphological analyzer for Turkish implemented using the Xerox Finite State Tools. It covers the morphophonological and morphographemic phenomena in Turkish such as vowel harmony, the morphotactics of words, and issues that one encounters when processing real text with myriads of phenomena: numbers, foreign words with Turkish inflections, unknown words, and multi-word constructs. The chapter presents ample illustrations of phenomena and provides many examples for sometimes ambiguous morphological interpretations.
Kemal Oflazer
Chapter 3. Morphological Disambiguation for Turkish
Abstract
Morphological disambiguation is the task of determining the contextually correct morphological parses of tokens in a sentence. A morphological disambiguator takes in a set of morphological parses for each token, generated by a morphological analyzer, and then selects a morphological parse for each, considering statistical and/or linguistic contextual information. This task can be seen as a generalization of the part-of-speech (POS) tagging problem, for morphologically rich languages. The disambiguated morphological analysis is usually crucial for further processing steps such as dependency parsing. In this chapter, we review the morphological disambiguation problem for Turkish and discuss approaches for solving this problem as they have evolved from manually crafted constraint-based rule systems to systems employing machine learning.
Dilek Zeynep Hakkani-Tür, Murat Saraçlar, Gökhan Tür, Kemal Oflazer, Deniz Yuret
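The input and output of a disambiguator, as described in the abstract above, can be sketched in a few lines of Python. This is a toy illustration and not the method of the chapter: the candidate parses are hand-written in an Oflazer-style tag notation, and the single contextual rule (prefer a genitive reading when the next token can be a third-person possessed noun) is an assumption made up for this example.

```python
# Toy illustration only: the parses and the rule below are hand-written
# assumptions, not the output of a real morphological analyzer.
CANDIDATES = {
    # "evin" is ambiguous: "your house" (P2sg) or "of the house" (Gen)
    "evin": ["ev+Noun+A3sg+P2sg+Nom", "ev+Noun+A3sg+Pnon+Gen"],
    "kapısı": ["kapı+Noun+A3sg+P3sg+Nom"],
    "güzel": ["güzel+Adj"],
}


def disambiguate(tokens, candidates):
    """Select one parse per token using a single hand-written contextual rule."""
    selected = []
    for i, token in enumerate(tokens):
        parses = candidates[token]
        choice = parses[0]  # fallback: take the first listed parse
        next_parses = candidates[tokens[i + 1]] if i + 1 < len(tokens) else []
        # Rule: prefer a genitive reading when the following token can be a
        # third-person possessed noun (genitive-possessive construction).
        if any("+P3sg" in p for p in next_parses):
            for p in parses:
                if p.endswith("+Gen"):
                    choice = p
                    break
        selected.append(choice)
    return selected


print(disambiguate(["evin", "kapısı"], CANDIDATES))  # "the door of the house"
print(disambiguate(["evin", "güzel"], CANDIDATES))   # "your house is beautiful"
```

A statistical disambiguator would replace the hand-written rule with a learned scoring function over the same candidate sets.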
Chapter 4. Language Modeling for Turkish Text and Speech Processing
Abstract
This chapter presents an overview of language modeling followed by a discussion of the challenges in Turkish language modeling. Sub-lexical units are commonly used to reduce the high out-of-vocabulary (OOV) rates of morphologically rich languages. These units are obtained either by morphological analysis or by unsupervised statistical techniques. For Turkish, morphological analysis yields word segmentations in both lexical and surface form, which can be used as sub-lexical language modeling units. Discriminative language models, which outperform generative models for various tasks, allow for easy integration of morphological and syntactic features into language modeling. The chapter provides a review of both generative and discriminative approaches for Turkish language modeling.
Ebru Arısoy, Murat Saraçlar
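The effect of sub-lexical units on the OOV rate mentioned in the abstract above can be seen in a minimal sketch. The four words and their surface segmentations are hand-picked assumptions for illustration; a real system would obtain the units from a morphological analyzer or an unsupervised segmenter, and the figures say nothing about real corpora.

```python
def oov_rate(train_units, test_units):
    """Fraction of test units that never occur in the training units."""
    vocab = set(train_units)
    unseen = [u for u in test_units if u not in vocab]
    return len(unseen) / len(test_units)


# Hypothetical hand-made surface segmentations (stem plus suffix).
SEGMENTS = {
    "evler": ["ev", "+ler"],       # houses
    "kitapta": ["kitap", "+ta"],   # in the book
    "evde": ["ev", "+de"],         # at home
    "kitaplar": ["kitap", "+lar"], # books
}


def to_units(words):
    """Flatten a list of words into their sub-lexical units."""
    return [unit for word in words for unit in SEGMENTS[word]]


train = ["evler", "kitapta"]
test = ["evde", "kitaplar"]

word_rate = oov_rate(train, test)                        # whole words as units
sub_rate = oov_rate(to_units(train), to_units(test))     # stems and suffixes as units
print(f"word-level OOV rate: {word_rate:.2f}")
print(f"sub-lexical OOV rate: {sub_rate:.2f}")
```

In this toy setting every test word is unseen as a whole word, while the stems recur once the words are split, so the sub-lexical OOV rate is lower.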
Chapter 5. Turkish Speech Recognition
Abstract
Automatic speech recognition (ASR) is one of the most important applications of speech and language processing, as it forms the bridge between spoken and written language processing. This chapter presents an overview of the foundations of ASR, followed by a summary of Turkish language resources for ASR and a review of various Turkish ASR systems. Language resources include acoustic and text corpora as well as linguistic tools such as morphological parsers, morphological disambiguators, and dependency parsers, discussed in more detail in other chapters. Turkish ASR systems vary in the type and amount of data used for building the models. The focus of most of the research for Turkish ASR is the language modeling component covered in Chap. 4.
Ebru Arısoy, Murat Saraçlar
Chapter 6. Turkish Named-Entity Recognition
Abstract
Named-entity recognition is an important task for many other natural language processing tasks and applications such as information extraction, question answering, sentiment analysis, and machine translation. Over the last decades, named-entity recognition for Turkish has attracted significant attention in terms of both systems development and resource development. After a brief description of the general named-entity recognition task, this chapter presents a comprehensive overview of the work on Turkish named-entity recognition along with the data resources that various research efforts have built.
Reyyan Yeniterzi, Gökhan Tür, Kemal Oflazer
Chapter 7. Dependency Parsing of Turkish
Abstract
Syntactic parsing is the process of taking an input sentence and producing an appropriate syntactic structure for it. It is a crucial stage in that it provides a way to pass from core NLP tasks to the semantic layer, and it has been shown to increase the performance of many higher-level NLP applications such as machine translation, sentiment analysis, and question answering. Statistical dependency parsing, with its high coverage and easy-to-use outputs, has become very popular in recent years for many languages including Turkish. In this chapter, we describe the issues in developing and evaluating a dependency parser for Turkish, which poses many challenges owing to its agglutinative morphology and free constituent order. Our approach is an adaptation of a language-independent data-driven statistical parsing system to Turkish.
Gülşen Eryiğit, Joakim Nivre, Kemal Oflazer
Chapter 8. Wide-Coverage Parsing, Semantics, and Morphology
Abstract
Wide-coverage parsing poses three demands: broad coverage over preferably free text, depth in semantic representation for purposes such as inference in question answering, and computational efficiency. We show for Turkish that these goals are not inherently contradictory when we assign categories to sub-lexical elements in the lexicon. The presumed computational burden of processing such lexicons does not arise when we work with automata-constrained formalisms that are trainable on word-meaning correspondences at the level of predicate-argument structures for any string, which is characteristic of radically lexicalizable grammars. This is helpful in morphologically simpler languages too, where word-based parsing has been shown to benefit from sub-lexical training.
Ruket Çakıcı, Mark Steedman, Cem Bozşahin
Chapter 9. Deep Parsing of Turkish with Lexical-Functional Grammar
Abstract
In this chapter we present a large-scale, deep grammar for Turkish based on the Lexical-Functional Grammar formalism. In dealing with the rich derivational morphology of Turkish, we encode the rules of the grammar over morphological units that are larger than a morpheme but smaller than a word, in order to capture the linguistic phenomena in a more formal and accurate way. Our work covers phrases that are building blocks of a large-scale grammar, and also focuses on cases that are more interesting both linguistically and from an implementation standpoint, such as long-distance dependencies and complex predicates.
Özlem Çetinoğlu, Kemal Oflazer
Chapter 10. Statistical Machine Translation and Turkish
Abstract
Machine translation is one of the most important applications of natural language processing. The last 25 years have seen tremendous progress in machine translation, enabled by the development of statistical techniques and the availability of large-scale parallel sentence corpora from which statistical models of translation can be learned. Turkish poses many challenges for statistical machine translation, as alluded to in Chap. 1, owing mainly to its complex morphology. This chapter discusses in more detail the challenges of Turkish in the context of statistical machine translation and describes two widely different approaches that have been employed in the last several years for English-to-Turkish machine translation.
Kemal Oflazer, Reyyan Yeniterzi, İlknur Durgar-El Kahlout
Chapter 11. Machine Translation Between Turkic Languages
Abstract
Turkish belongs to the Turkic family of languages. These languages exhibit tremendous similarity in morphological and grammatical structure but have somewhat different lexicons owing to various historical, geographical, and cultural interactions with neighboring languages. In this chapter we briefly cover the similarities and differences of these languages and introduce a machine translation methodology that exploits the similarities among them. This methodology relies on rule-based and statistical components and is applicable not only to Turkic languages but also to other cognate language pairs.
A. Cüneyd Tantuğ, Eşref Adalı
Chapter 12. Sentiment Analysis in Turkish
Abstract
In this chapter, we give an overview of the sentiment analysis problem and present a system to estimate the sentiment of movie reviews in Turkish. Our approach combines supervised learning and lexicon-based approaches, making use of a recently constructed Turkish polarity lexicon called SentiTurkNet. For performance evaluation, we investigate the contribution of different feature sets, as well as the effect of lexicon size on the overall classification performance.
Gizem Gezici, Berrin Yanıkoğlu
Chapter 13. The Turkish Treebank
Abstract
In the last three decades, treebanks have become a crucial resource for building and evaluating natural language processing tools and applications. In this chapter, we review the essential aspects of the first treebank for Turkish, which was built in the early 2000s, and its evolution and extensions since then.
Gülşen Eryiğit, Kemal Oflazer, Umut Sulubacak
Chapter 14. Linguistic Corpora: A View from Turkish
Abstract
Usage-based linguistic studies have gained new insights as corpus-based and corpus-driven analyses have advanced in recent years. Linguists working in different domains have turned to corpora as a major source in their study of language at all levels of representation. Currently, corpus linguistics is evolving into a sophisticated methodology in extracting and analyzing data. Building and using corpora in Turkish linguistics is a recent undertaking, initially motivated by work on natural language processing (NLP) research. The number of available corpora is increasing and linguistic research has come to test hypotheses on attested data, or uncover more lexical and grammatical patterns of use that have gone unnoticed in the absence of corpus data. Advances in NLP research and tools provided for corpus building and annotation further contribute to corpus studies in Turkish linguistics.
Mustafa Aksan, Yeşim Aksan
Chapter 15. Turkish Wordnet
Abstract
Turkish Wordnet is a lexical database for Turkish, built at Sabancı University in Istanbul, Turkey, between 2001 and 2004 as part of the Balkanet project. It currently contains 20,345 lexical items organized into 14,795 synonym sets (synsets hereafter), which are linked to each other via semantic relations such as hypernymy, antonymy, and meronymy. Turkish Wordnet uses the same concept pool as Princeton Wordnet, the eight wordnets of the Euro Wordnet project, and the five other wordnets of the Balkanet project. Synsets were added in several phases, starting with the most basic concepts at the top of the concept hierarchy. Monolingual resources were used to automatically extract semantic relations. Some semantic relations were extracted using the regular morphology of Turkish. Turkish Wordnet is available to researchers in the form of an XML file.
Özlem Çetinoğlu, Orhan Bilgin, Kemal Oflazer
Chapter 16. Turkish Discourse Bank: Connectives and Their Configurations
Abstract
The Turkish Discourse Bank (TDB) is a resource of approximately 400,000 words in its current release in which explicit discourse connectives and phrasal expressions are annotated along with the textual spans they relate. The corpus was annotated by human annotators using a semiautomatic annotation tool. We expect that it will enable researchers to study aspects of language beyond the sentence level. The TDB follows the Penn Discourse Treebank (PDTB) in adopting a connective-based annotation for discourse. The connectives are considered heads of annotated discourse relations. We have so far found only applicative structures in Turkish discourse, which, unlike syntactic heads, seem to have no need for composition. Interleaved text spans of arguments only appear to cross, and this is related to information structure.
Deniz Zeyrek, Işın Demirşahin, Cem Bozşahin
Backmatter
Metadata
Title
Turkish Natural Language Processing
Edited by
Prof. Kemal Oflazer
Prof. Murat Saraçlar
Copyright year
2018
Electronic ISBN
978-3-319-90165-7
Print ISBN
978-3-319-90163-3
DOI
https://doi.org/10.1007/978-3-319-90165-7
