2017 | OriginalPaper | Book chapter

Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources

Authors: Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović

Published in: Transactions on Computational Collective Intelligence XXVI

Publisher: Springer International Publishing

Abstract

Large collections of textual documents are an example of big data that requires solving three basic problems: the representation of documents, the representation of information needs, and the matching of the two representations. This paper outlines document indexing as a possible solution to document representation. Documents in a large textual database, developed over many years for geological projects in the Republic of Serbia, were indexed using methods developed within digital humanities: bag-of-words and named entity recognition. Each document in this geological database is described by a summary report and other data, such as title, domain, keywords, abstract, and geographical location. These metadata were used to generate a bag of words for each document with the aid of morphological dictionaries and transducers. Named entities within the metadata were also recognized with the help of a rule-based system. Both the bag of words and the metadata were then used for pre-indexing each document. A combination of several \(tf\_idf\) based measures was applied to select and rank the retrieval results of indexed documents for a specific query, and the results were compared with those of the initial retrieval system that was already in place. Overall, a significant improvement was achieved according to standard information retrieval performance measures, with the InQuery method performing best.
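
To make the ranking step in the abstract concrete, the following is a minimal sketch of plain tf-idf scoring over bags of words. It is not the authors' system: the function names (`tokenize`, `build_index`, `rank`) and the sample documents are hypothetical, whitespace tokenization stands in for the morphological dictionaries and transducers the paper relies on, and the basic tf × log(N/df) weight is used rather than the InQuery or Okapi variants that the paper compares.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace splitting. The paper instead normalizes
    # word forms with morphological dictionaries, which this sketch does not do.
    return text.lower().split()


def build_index(docs):
    """Build per-document term counts (tf) and document frequencies (df).

    docs maps a document id to its metadata text (title, keywords, abstract...).
    """
    tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
    return tf, df


def rank(query, tf, df):
    """Return ids of matching documents, best tf-idf score first."""
    n_docs = len(tf)
    scores = {}
    for doc_id, counts in tf.items():
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                # Basic weight: term frequency times log inverse document frequency.
                score += counts[term] * math.log(n_docs / df[term])
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "copper ore deposit copper",
        "b": "groundwater survey",
        "c": "copper survey report",
    }
    tf, df = build_index(docs)
    print(rank("copper", tf, df))
```

With these sample documents, the query "copper" ranks document "a" ahead of "c" because the term occurs twice in "a". For a highly inflected language such as Serbian, matching surface forms this way would miss inflected variants, which is the gap that lemma-based bags of words are meant to close.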

Footnotes
2
In fact, almost 9,000 geological projects were financed in this period, but some of them were lost, some are not open to the general public, and for some only basic data exist.
 
3
Tokens are all occurrences (in this case, of NEs) in a given text; types are the distinct occurrences.
 
4
InQuery, an indexing and retrieval “engine”, was developed at the Center for Intelligent Information Retrieval (CIIR), College of Information and Computer Sciences, University of Massachusetts Amherst [2]. The Okapi system was originally developed at the Polytechnic of Central London in the early 1980s and later developed further at City University London and Microsoft Research [21].
 
References
1. Bizer, C., Boncz, P., Brodie, M.L., Erling, O.: The meaningful use of big data: four perspectives-four challenges. ACM SIGMOD Rec. 40(4), 56–60 (2012)
2. Callan, J., Croft, W.B., Harding, S.: The INQUERY retrieval system, pp. 78–83 (1992)
3. Courtois, B., Silberztein, M.: Dictionnaires électroniques du français. Larousse, Paris (1990)
4. Croft, W.B., Smith, L.A., Turtle, H.R.: A loosely-coupled integration of a text retrieval system and an object-oriented database system. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 223–232. ACM (1992)
5. Furlan, B., Batanović, V., Nikolić, B.: Semantic similarity of short texts in languages with a deficient natural language processing support. Decis. Support Syst. 55(3), 710–719 (2013)
6. Graovac, J.: Wordnet-based Serbian text categorization. INFOtheca 14(2), 2a–17a (2013)
7. Gross, M.: The use of finite automata in the lexical representation of natural language. In: Gross, M., Perrin, D. (eds.) LITP 1987. LNCS, vol. 377, pp. 34–50. Springer, Heidelberg (1989). doi:10.1007/3-540-51465-1_3
8. Hiemstra, D.: Using language models for information retrieval. Taaluitgeverij Neslia Paniculata (2001)
9. Ivanović, D., Milosavljević, G., Milosavljević, B., Surla, D.: A CERIF-compatible research management system based on the MARC 21 format. Inf. Knowl. Manag. 44(3), 229–251 (2010)
10. Jackson, P., Moulinier, I.: Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization, vol. 5. John Benjamins Publishing, Amsterdam (2007)
11. Kešelj, V., Šipka, D.: A suffix subsumption-based approach to building stemmers and lemmatizers for highly inflectional languages with sparse resources. INFOtheca 9(1–2), 23a–33a (2008)
12. Krstev, C.: Processing of Serbian - Automata, Texts and Electronic Dictionaries. Faculty of Philology, University of Belgrade, Belgrade (2008)
13. Krstev, C., Obradović, I., Utvić, M., Vitas, D.: A system for named entity recognition based on local grammars. J. Logic Comput. 24(2), 473–489 (2014)
14. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, vol. 1. Cambridge University Press, Cambridge (2008)
15. Martinović, M.: Transfer of natural language processing technology: experiments, possibilities and limitations case study: English to Serbian. INFOtheca 9(1–2), 11a–21a (2008)
16. Maurel, D., Friburger, N., Antoine, J.Y., Eshkol, I., Nouvel, D., et al.: Cascades de transducteurs autour de la reconnaissance des entités nommées. Traitement Automatique des Langues 52(1), 69–96 (2011)
18. Mladenović, M., Mitrović, J., Krstev, C., Vitas, D.: Hybrid sentiment analysis framework for a morphologically rich language. J. Intell. Inf. Syst. 1–22, to appear
19. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. In: Sekine, S., Ranchhod, E. (eds.) Named Entities: Recognition, Classification and Use, pp. 3–28. John Benjamins Publishing Company, Amsterdam (2009)
21. Robertson, S.E., Walker, S.: Okapi/Keenbow at TREC-8. In: TREC, vol. 8, pp. 151–162 (1999)
22. Salton, G., McGill, M.J.: Introduction to modern information retrieval (1983)
23. Stanković, R., Prodanović, J., Kitanović, O., Nikolić, V.E.: Development of the Serbian geological resources portal. In: Proceedings of 17th Meeting of the Association of European Geological Societies, pp. 61–65 (2011)
24. Stanković, R., Trivić, B., Kitanović, O., Blagojević, B., Nikolić, V.: The development of the geolissterm terminological dictionary. INFOtheca 12(1), 49a–63a (2011)
25. Utvić, M.: Annotating the corpus of contemporary Serbian. INFOtheca - J. Inform. Librariansh. 12(2), 36a–47a (2011)
26. Vitas, D., Popović, L., Krstev, C., Obradović, I., Pavlović-Lažetić, G., Stanojević, M.: Srpski jezik u digitalnom dobu - The Serbian Language in the Digital Age. In: Rehm and Uszkoreit [20] (2012). http://www.meta-net.eu/whitepapers
27.
Metadata
Title
Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources
Authors
Ranka Stanković
Cvetana Krstev
Ivan Obradović
Olivera Kitanović
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-59268-8_8