Published in: Knowledge and Information Systems 3/2019

28 March 2018 | Short Paper

Cross-lingual document similarity estimation and dictionary generation with comparable corpora

Authors: Tadej Štajner, Dunja Mladenić

Abstract

This paper proposes an approach to bilingual dictionary generation that works even when trained on widely available comparable bilingual corpora. We also show that it provides cross-lingual similarity estimates that correlate well with human judgments. The approach uses a nonlinear bilingual translation model trained on comparable corpora, and we propose a method that combines word embeddings with kernel approximation to train scalable nonlinear transformations. We demonstrate that this novel method works better on a majority of evaluated language pairs.
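
As an illustration of the kind of pipeline the abstract describes, the following Python sketch maps source-language word vectors into a target-language embedding space through an approximate kernel feature map followed by a regularized linear fit. The abstract does not specify the kernel, the approximation scheme, or the training objective, so the choices here (random Fourier features for an RBF kernel, closed-form ridge regression, synthetic vectors standing in for real embeddings, and all function and parameter names) are assumptions made for illustration and should not be read as the authors' implementation.

```python
import numpy as np


def random_fourier_features(X, W, b):
    """Explicit feature map z(x) = sqrt(2/D) * cos(x W + b).

    For suitably sampled W and b, z(x) . z(y) approximates an RBF kernel.
    """
    n_features = W.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def fit_nonlinear_map(X_src, Y_tgt, n_features=64, gamma=1.0, ridge=1e-2, seed=0):
    """Fit a nonlinear source-to-target map (hypothetical helper).

    Assumption: RBF kernel k(x, y) = exp(-gamma * ||x - y||^2), approximated
    with random Fourier features, then ridge regression on those features.
    """
    rng = np.random.default_rng(seed)
    dim = X_src.shape[1]
    # Frequencies for the RBF kernel are drawn from N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Z = random_fourier_features(X_src, W, b)
    # Closed-form ridge solution: A = (Z^T Z + ridge * I)^{-1} Z^T Y.
    A = np.linalg.solve(Z.T @ Z + ridge * np.eye(n_features), Z.T @ Y_tgt)
    return W, b, A


def translate(X_src, W, b, A):
    """Project source vectors into the target embedding space."""
    return random_fourier_features(X_src, W, b) @ A


def nearest_target_words(pred, tgt_vectors):
    """Index of the most cosine-similar target vector for each prediction."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = tgt_vectors / np.linalg.norm(tgt_vectors, axis=1, keepdims=True)
    return np.argmax(p @ t.T, axis=1)


# Toy data: synthetic vectors stand in for real word embeddings.
rng = np.random.default_rng(42)
X_src = rng.normal(size=(50, 5))           # 50 "source words", 5 dimensions
M = rng.normal(size=(5, 4))
Y_tgt = np.tanh(X_src @ M)                 # nonlinearly related "target words"

W, b, A = fit_nonlinear_map(X_src, Y_tgt)
pred = translate(X_src, W, b, A)
idx = nearest_target_words(pred, Y_tgt)    # candidate dictionary entries
print("fraction of training pairs recovered:", np.mean(idx == np.arange(50)))
```

With real data, `X_src` and `Y_tgt` would hold embeddings of word pairs assumed to correspond, and candidate dictionary entries would be read off by nearest-neighbour search in the target space. How such pairs are derived from comparable corpora is not described in the abstract, so that step is left out of the sketch.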

Metadata
Title
Cross-lingual document similarity estimation and dictionary generation with comparable corpora
Authors
Tadej Štajner
Dunja Mladenić
Publication date
28 March 2018
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 3/2019
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-018-1179-9