2012 | OriginalPaper | Book chapter
Unsupervised Language Separation
Author: Chris Biemann
Published in: Structure Discovery in Natural Language
Publisher: Springer Berlin Heidelberg
This chapter presents an unsupervised solution to language identification. The method sorts multilingual text corpora sentence by sentence into their different languages. The main difference from previous methods is that no training data for the individual languages is provided, and the number of languages does not have to be known beforehand. This application illustrates the benefits of a parameter-free graph clustering algorithm such as Chinese Whispers: the data (words and their statistical dependencies) are represented naturally in a graph, and both the number of clusters (here: languages) and their size distribution are unknown. The feasibility and robustness of the approach for non-standard language data are demonstrated in a case study on Twitter data.
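To make the idea concrete, the following is a minimal sketch of how sentence-wise language separation with Chinese Whispers might look. It is not the chapter's implementation. The abstract does not state how word dependencies are measured, so the sketch assumes raw sentence-level co-occurrence counts as edge weights. The function names (`cooccurrence_graph`, `chinese_whispers`, `label_sentences`), the iteration count, and the toy sentences are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(sentences):
    """Words are nodes; edge weight = number of sentences containing both words.

    Assumption: raw co-occurrence counts stand in for whatever statistical
    dependency measure the chapter actually uses.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for word in words:
            graph[word]  # make sure isolated words still become nodes
        for a, b in combinations(words, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def chinese_whispers(graph, iterations=20, seed=0):
    """Label propagation: each node adopts the strongest label among its neighbours."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    labels = {node: i for i, node in enumerate(nodes)}  # one class per node at start
    for _ in range(iterations):
        rng.shuffle(nodes)  # nodes are visited in random order
        for node in nodes:
            scores = defaultdict(int)
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            if scores:
                best = max(scores.values())
                # ties between equally strong labels are broken at random
                candidates = sorted(l for l, s in scores.items() if s == best)
                labels[node] = rng.choice(candidates)
    return labels


def label_sentences(sentences, labels):
    """Assign each sentence the cluster label most frequent among its words."""
    result = []
    for sentence in sentences:
        votes = defaultdict(int)
        for word in set(sentence.lower().split()):
            votes[labels[word]] += 1
        # None marks an empty sentence; smallest label wins a tied vote
        result.append(max(sorted(votes), key=votes.get) if votes else None)
    return result


# Toy usage: a mixed English/German corpus with no language tags.
corpus = [
    "the cat sat on the mat",
    "die katze sitzt auf der matte",
    "the dog sat on the cat",
    "der hund sitzt auf der katze",
]
word_labels = chinese_whispers(cooccurrence_graph(corpus))
print(label_sentences(corpus, word_labels))
```

No number of clusters is passed in anywhere: every word starts in its own class, and the classes that survive propagation determine how many languages are found. Because words of different languages rarely share a sentence, labels tend not to spread across the language boundary. On real data, shared tokens (names, numbers, loan words) create cross-language edges, so some filtering of weak edges would likely be needed.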