
2018 | OriginalPaper | Chapter

Multi-Lingual LSA with Serbian and Croatian: An Investigative Case Study

Authors: Colin Layfield, Dragan Ivanović, Joel Azzopardi

Published in: Semantic Keyword-Based Search on Structured Data Sources

Publisher: Springer International Publishing

Abstract

One of the challenges in information retrieval is searching a corpus of documents that may contain multiple languages. This exploratory study expands upon earlier research employing Latent Semantic Analysis in a multi-lingual setting (so-called Multi-Lingual Latent Semantic Indexing, or ML-LSI/LSA). We experiment with this approach, and with a new one, in a multi-lingual context using two similar languages, namely Serbian and Croatian. Traditionally, an LSA approach needs a parallel corpus to train the system: each pair of identical documents in the two languages is combined into one document. We repeat that approach and also experiment with creating a semantic space from the parallel corpus on its own, without merging the documents, to test the hypothesis that, with very similar languages, merging may not be required for good results.
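
The two ways of preparing the training documents can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sentence pairs, the whitespace tokenizer, and the function names are all assumptions made for illustration, and the decomposition step that would turn either matrix into a semantic space is omitted.

```python
from collections import Counter

# Hypothetical toy parallel corpus: each pair holds the same text in
# Serbian (Latin script) and Croatian. This is NOT data from the study.
parallel_corpus = [
    ("hleb i mleko", "kruh i mlijeko"),
    ("voz stiže na stanicu", "vlak stiže na kolodvor"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def merged_documents(pairs):
    """Traditional ML-LSA setup: concatenate each pair into one training document."""
    return [tokenize(sr) + tokenize(hr) for sr, hr in pairs]


def separate_documents(pairs):
    """Alternative setup: keep every language version as its own training document."""
    docs = []
    for sr, hr in pairs:
        docs.append(tokenize(sr))
        docs.append(tokenize(hr))
    return docs


def term_document_matrix(docs):
    """Return the sorted vocabulary and a matrix of raw term counts.

    Each row corresponds to a term and each column to a document.
    """
    vocab = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


merged_vocab, merged_matrix = term_document_matrix(merged_documents(parallel_corpus))
separate_vocab, separate_matrix = term_document_matrix(separate_documents(parallel_corpus))
```

Both setups share the same vocabulary; they differ only in how many document columns the matrix has and in whether terms from the two languages co-occur within a single column. Shared words such as "stiže" still link the two language versions in the unmerged setup, which is the intuition behind the hypothesis for very similar languages.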

Footnotes
1
As a side effect, the XML turned out to be badly formed in places and needed to be fixed by hand.
 
3
Diacritics are added to the top or bottom of a letter to indicate appropriate stress, special pronunciation, or unusual sounds not common in the Roman alphabet. In Serbian and Croatian, these markings indicate special pronunciation, like the difference between the pronunciation of C compared to Ć.
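
To make the C versus Ć distinction concrete, the sketch below shows one way diacritics could be folded to base letters with Python's standard `unicodedata` module. Whether and how the study normalised diacritics is not stated here, so this is an illustration only; the function name and the mapping of đ to d are assumptions.

```python
import unicodedata

# Assumption: đ/Đ are letters with a stroke, not a base letter plus a
# combining mark, so Unicode decomposition leaves them unchanged. They are
# mapped by hand here; "d" is a simplification chosen for this sketch.
STROKE_MAP = str.maketrans({"đ": "d", "Đ": "D"})


def strip_diacritics(text):
    """Return text with combining diacritical marks removed."""
    # NFD splits letters such as "ć" into "c" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text.translate(STROKE_MAP))
    # Drop every combining mark and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

After folding, words that differ only in diacritics (for example "kuća" and "kuca") become identical, which trades some precision for robustness against inconsistently typed text.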
 
4
The stop word list is available at http://www.lextek.com/manuals/onix/stopwords1.html. Note that single-character stop words were not included, as it was found that many Serbian/Croatian documents were flagged as English when they were present in the list.
 
5
We discovered, serendipitously, that the results of using tf-idf and l-e were actually superior when the folded-in search queries were only weighted using raw term-frequency. This was unexpected and will be a topic of future research. The results reported here use the commonly accepted approach of weighting the query appropriately with the weighting method used for the creation of the semantic space.
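
The difference between the two query weightings can be sketched as follows. This is a hedged illustration, not the study's code: the toy documents, the function name `query_vector`, and the idf variant log(N/df) are assumptions, and the fold-in projection into the semantic space is not shown.

```python
import math
from collections import Counter

# Hypothetical tokenized training documents (not data from the study).
docs = [
    ["kruh", "mlijeko"],
    ["kruh", "vlak"],
    ["vlak", "kolodvor", "vlak"],
]

vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: the number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Assumed idf variant: log(N / df). Other variants exist.
idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}


def query_vector(tokens, weighting="tf"):
    """Build a query vector over vocab using raw tf or tf-idf weights."""
    counts = Counter(tokens)
    if weighting == "tf":
        return [float(counts[term]) for term in vocab]
    if weighting == "tfidf":
        return [counts[term] * idf[term] for term in vocab]
    raise ValueError(f"unknown weighting: {weighting}")


raw_query = query_vector(["kruh", "kruh", "kolodvor"], weighting="tf")
weighted_query = query_vector(["kruh", "kruh", "kolodvor"], weighting="tfidf")
```

With raw term frequency the repeated term "kruh" dominates the query vector, while tf-idf shifts weight towards the rarer term "kolodvor".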
 
6
The same similarity score is the cosine similarity between the two ‘mate’ documents.
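
For reference, cosine similarity between two document vectors can be computed as in this minimal sketch. The function name and the choice to return 0.0 for a zero vector are assumptions made here, not details taken from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: a zero vector is treated as having no similarity to anything.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Applied to the semantic-space vectors of a Serbian document and its Croatian 'mate', a value close to 1 would indicate that the two versions point in nearly the same direction.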
 
Metadata
Title
Multi-Lingual LSA with Serbian and Croatian: An Investigative Case Study
Authors
Colin Layfield
Dragan Ivanović
Joel Azzopardi
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-74497-1_15