2014 | OriginalPaper | Buchkapitel
Automatic Collection of the Parallel Corpus with Little Prior Knowledge
verfasst von : Shutian Ma, Chengzhi Zhang
Erschienen in: Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data
Verlag: Springer International Publishing
Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.
Wählen Sie Textabschnitte aus um mit Künstlicher Intelligenz passenden Patente zu finden. powered by
Markieren Sie Textabschnitte, um KI-gestützt weitere passende Inhalte zu finden. powered by
As an important resource for machine translation and cross-language information retrieval, collecting large-scale parallel corpus has been paid wide attention. With the development of the Internet, researchers begin to mine the parallel corpora from the multilingual websites. They use some prior knowledge like ad hoc heuristics or calculate the similarity of the webpages structure and content to find the bilingual webpages. This paper presents a method that uses the search engine and little prior knowledge about the URL patterns to get the bilingual websites from the Internet. The method is fast for its low time cost and there is no need for large-scale computation on URL pattern matching. We have collected 88 915 candidate parallel Chinese-English webpages, which average accuracy is around 90.8%. During the evaluation, the true bilingual websites that we found have high similar html structure and good quality translations.