2017 | OriginalPaper | Book chapter

Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources

Written by: Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović

Published in: Transactions on Computational Collective Intelligence XXVI

Publisher: Springer International Publishing

Abstract

Large collections of textual documents are an example of big data that requires solving three basic problems: representing the documents, representing the information needs, and matching the two representations. This paper proposes document indexing as a possible solution to the document representation problem. Documents in a large textual database, built up over many years for geological projects in the Republic of Serbia, were indexed using methods developed within digital humanities: bag-of-words and named entity recognition. Each document in this geological database is described by a summary report and other data, such as title, domain, keywords, abstract, and geographical location. These metadata were used to generate a bag of words for each document with the aid of morphological dictionaries and transducers. Named entities within the metadata were also recognized with the help of a rule-based system. Both the bag of words and the metadata were then used to pre-index each document. A combination of several \(tf\_idf\)-based measures was applied to select and rank the indexed documents retrieved for a specific query, and the results were compared with those of the initial retrieval system that was already in place. Overall, a significant improvement was achieved according to standard information retrieval performance measures, with the InQuery method performing best.
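The ranking idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a bag of lemmas, and it uses one plain tf-idf variant (length-normalized term frequency times log inverse document frequency). This is not necessarily any of the measures combined in the chapter, and it is not the InQuery or Okapi weighting. The function name `tf_idf_rank` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_rank(query_terms, docs):
    """Rank documents for a query with a plain tf-idf score.

    docs maps a document id to its bag of words (a list of lemmas).
    Returns (doc_id, score) pairs, best first; documents that match
    no query term (or only terms with zero idf) are left out.
    """
    n_docs = len(docs)
    bags = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for bag in bags.values():
        df.update(bag.keys())

    scores = {}
    for doc_id, bag in bags.items():
        length = sum(bag.values())
        score = 0.0
        for term in query_terms:
            if term in bag:
                tf = bag[term] / length            # normalized term frequency
                idf = math.log(n_docs / df[term])  # inverse document frequency
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical bags of words standing in for pre-indexed metadata.
docs = {
    "d1": ["copper", "ore", "deposit", "bor"],
    "d2": ["groundwater", "survey", "belgrade"],
    "d3": ["copper", "copper", "exploration", "bor", "deposit"],
}
ranked = tf_idf_rank(["copper", "bor"], docs)
print(ranked)
```

In this toy collection the document that mentions the query terms more densely ranks first, and the document with no matching term is not returned at all.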

http://geoliss.mre.gov.rs; search of fund documentation http://geoliss.mre.gov.rs/index.php?page=fodib.

Actually, almost 9,000 geological projects were financed in this period, but some of them were lost, some are not open to the general public, and for some only basic data exist.

Tokens are all occurrences (in this case, of NEs) in a given text; types are the distinct items among those occurrences.
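To make the token/type distinction concrete, here is a small sketch that counts both for a list of recognized named entities. The function name and the sample mentions are hypothetical and are not taken from the chapter's data.

```python
from collections import Counter


def token_type_counts(entities):
    """Return (number of tokens, number of types) for a list of NE mentions."""
    counts = Counter(entities)
    tokens = sum(counts.values())  # every occurrence counts
    types = len(counts)            # each distinct entity counts once
    return tokens, types


# Hypothetical named-entity mentions extracted from one text.
mentions = ["Bor", "Beograd", "Bor", "Kolubara", "Bor"]
print(token_type_counts(mentions))  # (5, 3)
```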

InQuery, an indexing and retrieval “engine”, was developed at the Center for Intelligent Information Retrieval (CIIR), College of Information and Computer Sciences, University of Massachusetts Amherst [2]. The Okapi system was originally developed at the Polytechnic of Central London in the early 1980s and later developed further at City University London and Microsoft Research [21].

Available at http://geoliss.mre.gov.rs/fodibevaluacija/.

For the initial system at http://geoliss.mre.gov.rs/fodibevaluacija/statistika.php, the improved system at http://geoliss.mre.gov.rs/fodibevaluacija/statistika-index.php for individual queries, and for the entire sets of queries at http://geoliss.mre.gov.rs/fodibevaluacija/statistika-all-methods.php.

1. Bizer, C., Boncz, P., Brodie, M.L., Erling, O.: The meaningful use of big data: four perspectives-four challenges. ACM SIGMOD Rec. 40(4), 56–60 (2012)

2. Callan, J., Croft, W.B., Harding, S.: The inquery retrieval system, pp. 78–83 (1992)

3. Courtois, B., Silberztein, M.: Dictionnaires électroniques du français. Larousse, Paris (1990)

4. Croft, W.B., Smith, L.A., Turtle, H.R.: A loosely-coupled integration of a text retrieval system and an object-oriented database system. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 223–232. ACM (1992)

5. Furlan, B., Batanović, V., Nikolić, B.: Semantic similarity of short texts in languages with a deficient natural language processing support. Decis. Support Syst. 55(3), 710–719 (2013)

6. Graovac, J.: Wordnet-based Serbian text categorization. INFOtheca 14(2), 2a–17a (2013)

7. Gross, M.: The use of finite automata in the lexical representation of natural language. In: Gross, M., Perrin, D. (eds.) LITP 1987. LNCS, vol. 377, pp. 34–50. Springer, Heidelberg (1989). doi:10.1007/3-540-51465-1_3

8. Hiemstra, D.: Using language models for information retrieval. Taaluitgeverij Neslia Paniculata (2001)

9. Ivanović, D., Milosavljević, G., Milosavljević, B., Surla, D.: A CERIF-compatible research management system based on the MARC 21 format. Inf. Knowl. Manag. 44(3), 229–251 (2010)

10. Jackson, P., Moulinier, I.: Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization, vol. 5. John Benjamins Publishing, Amsterdam (2007)

11. Kešelj, V., Šipka, D.: A suffix subsumption-based approach to building stemmers and lemmatizers for highly inflectional languages with sparse resources. INFOtheca 9(1–2), 23a–33a (2008)

12. Krstev, C.: Processing of Serbian - Automata, Texts and Electronic Dictionaries. Faculty of Philology, University of Belgrade, Belgrade (2008)

13. Krstev, C., Obradović, I., Utvić, M., Vitas, D.: A system for named entity recognition based on local grammars. J. Logic Comput. 24(2), 473–489 (2014)

14. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, vol. 1. Cambridge University Press, Cambridge (2008)

15. Martinović, M.: Transfer of natural language processing technology: experiments, possibilities and limitations case study: English to Serbian. INFOtheca 9(1–2), 11a–21a (2008)

16. Maurel, D., Friburger, N., Antoine, J.Y., Eshkol, I., Nouvel, D., et al.: Cascades de transducteurs autour de la reconnaissance des entités nommées. Traitement Automatique des Langues 52(1), 69–96 (2011)

17. Milosevic, N.: Stemmer for Serbian language. CoRR abs/1209.4471 (2012). http://arxiv.org/abs/1209.4471

18. Mladenović, M., Mitrović, J., Krstev, C., Vitas, D.: Hybrid sentiment analysis framework for a morphologically rich language. J. Intell. Inf. Syst. 1–22, to appear

19. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. In: Sekine, S., Ranchhod, E. (eds.) Named Entities: Recognition, Classification and Use, pp. 3–28. John Benjamins Publishing Company, Amsterdam (2009)

20. Rehm, G., Uszkoreit, H. (eds.): META-NET White Paper Series. Springer, Heidelberg (2012). http://www.meta-net.eu/whitepapers

21. Robertson, S.E., Walker, S.: Okapi/Keenbow at TREC-8. In: TREC, vol. 8, pp. 151–162 (1999)

22. Salton, G., McGill, M.J.: Introduction to modern information retrieval (1983)

23. Stanković, R., Prodanović, J., Kitanović, O., Nikolić, V.E.: Development of the Serbian geological resources portal. In: Proceedings of 17th Meeting of the Association of European Geological Societies, pp. 61–65 (2011)

24. Stanković, R., Trivić, B., Kitanović, O., Blagojević, B., Nikolić, V.: The development of the geolissterm terminological dictionary. INFOtheca 12(1), 49a–63a (2011)

25. Utvić, M.: Annotating the corpus of contemporary Serbian. INFOtheca - J. Inform. Librariansh. 12(2), 36a–47a (2011)

26. Vitas, D., Popović, L., Krstev, C., Obradović, I., Pavlović-Lažetić, G., Stanojević, M.: Srpski jezik u digitalnom dobu - The Serbian Language in the Digital Age. In: Rehm and Uszkoreit [20] (2012). http://www.meta-net.eu/whitepapers

27. Zečević, A., Stanković-Vujičić, S.: Language identification–the case of Serbian. In: Pavlović-Lažetić, G., Krstev, C., Vitas, D., Obradović, I. (eds.) Natural Language Processing for Serbian – Resources and Applications, pp. 101–112. Faculty of Mathematics, University of Belgrade. http://jerteh.rs/wp-content/uploads/2015/05/Zecevic.pdf

Title: Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources
Written by: Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović
Publisher: Springer International Publishing
Book: Transactions on Computational Collective Intelligence XXVI
Print ISBN: 978-3-319-59267-1
Electronic ISBN: 978-3-319-59268-8
Copyright year: 2017
DOI: https://doi.org/10.1007/978-3-319-59268-8_8