2015 | OriginalPaper | Book chapter

Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

Abstract

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical because non-parallel multilingual data exist in far greater quantities than parallel corpora, while parallel sentences are the much more useful resource. We propose a web crawling method for building subject-aligned comparable corpora from sources such as Wikipedia dumps and the Euronews website. Improvements in machine translation are shown for the Polish-English language pair across various text domains. We also tested another method of building parallel corpora from comparable corpora data. It makes it possible to automatically broaden an existing corpus of sentences on the subject of the corpora, based on analogies between them.

Metadata
Title
Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-25252-0_46
