2015 | OriginalPaper | Chapter
Language Set Identification in Noisy Synthetic Multilingual Documents
Authors: Tommi Jauhiainen, Krister Lindén, Heidi Jauhiainen
Published in: Computational Linguistics and Intelligent Text Processing
Publisher: Springer International Publishing
In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have improved steadily from the 1970s until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters of input, which gives us reason to re-evaluate whether existing language identifiers for monolingual text can be used to detect the language set of a multilingual document. We apply a previously developed language identifier for monolingual documents to the multilingual documents of the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages.
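
As a rough illustration of how an F1-score can be computed for language set identification, the following Python sketch compares predicted language sets with gold-standard sets, document by document. It is only a minimal sketch under stated assumptions: the abstract does not say which averaging scheme was used, so micro-averaging is assumed here, and the function name `micro_f1`, the language codes, and the toy data are hypothetical and not taken from the paper or from WikipediaMulti.

```python
from typing import Iterable, Set, Tuple


def micro_f1(gold: Iterable[Set[str]],
             predicted: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), micro-averaged over all documents.

    Assumption: micro-averaging is used for illustration only; the
    averaging scheme of the original evaluation is not given in the abstract.
    """
    tp = fp = fn = 0
    for gold_set, pred_set in zip(gold, predicted):
        tp += len(gold_set & pred_set)   # languages correctly detected
        fp += len(pred_set - gold_set)   # languages detected but not present
        fn += len(gold_set - pred_set)   # languages present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: three documents with their gold and predicted language sets.
gold_sets = [{"en", "fi"}, {"de"}, {"sv", "fi", "en"}]
predicted_sets = [{"en", "fi"}, {"de", "nl"}, {"sv", "fi"}]

precision, recall, f1 = micro_f1(gold_sets, predicted_sets)
print(f"P={100 * precision:.1f} R={100 * recall:.1f} F1={100 * f1:.1f}")
```

In this toy example one spurious language (`nl`) and one missed language (`en`) each cost one point, so precision, recall and F1 all equal 5/6. The scores are multiplied by 100 when printed, on the assumption that the reported 97.6 is on a percentage scale.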