
2018 | OriginalPaper | Chapter

Exploring Influence of Topic Segmentation on Information Retrieval Quality

Authors: Gennady Shtekh, Polina Kazakova, Nikita Nikitinsky, Nikolay Skachkov

Published in: Internet Science

Publisher: Springer International Publishing


Abstract

In this paper we examine whether, and to what extent, text segmentation can improve an information retrieval system. We assume that topic-based text segmentation models the structure of a text, and therefore language itself, more faithfully, which in turn affects the quality of the text representation. We propose a search pipeline based on text segmentation using the BigARTM tool and the TopicTiling algorithm. We test this hypothesis by running experiments with several baseline models on two text collections. The results are mixed: on one collection segmentation improves retrieval quality, while on the other it has no significant effect on quality.
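To make the segmentation step more concrete, the following is a minimal sketch of TopicTiling-style boundary detection, not the authors' implementation. It assumes each sentence already has a topic distribution (in the paper's setting these would come from a topic model such as BigARTM; here they are hypothetical hand-written vectors). The function names and the default threshold rule (mean minus half the standard deviation of the depth scores, a convention borrowed from TextTiling) are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def depth_scores(sims: List[float]) -> List[float]:
    """Depth of each gap: how far its similarity drops below the nearest peaks."""
    depths = []
    for i, s in enumerate(sims):
        # Climb left while similarity does not decrease.
        left, j = s, i
        while j > 0 and sims[j - 1] >= left:
            j -= 1
            left = sims[j]
        # Climb right while similarity does not decrease.
        right, j = s, i
        while j < len(sims) - 1 and sims[j + 1] >= right:
            j += 1
            right = sims[j]
        depths.append(0.5 * ((left - s) + (right - s)))
    return depths


def segment(topic_vectors: List[Sequence[float]],
            threshold: Optional[float] = None) -> List[List[int]]:
    """Group sentence indices into segments, splitting at deep similarity minima."""
    n = len(topic_vectors)
    if n < 2:
        return [list(range(n))]
    # Similarity across each gap between adjacent sentences.
    sims = [cosine(topic_vectors[i], topic_vectors[i + 1]) for i in range(n - 1)]
    depths = depth_scores(sims)
    if threshold is None:
        # Assumed default: mean - std / 2 of the depth scores.
        mean = sum(depths) / len(depths)
        std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
        threshold = mean - std / 2
    segments, current = [], [0]
    for i, d in enumerate(depths):
        is_local_min = ((i == 0 or sims[i - 1] >= sims[i])
                        and (i == len(sims) - 1 or sims[i + 1] >= sims[i]))
        if is_local_min and d > 0 and d > threshold:
            segments.append(current)
            current = []
        current.append(i + 1)
    segments.append(current)
    return segments


# Hypothetical topic distributions: three sentences on one topic, two on another.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
print(segment(vectors))
```

In a retrieval pipeline of the kind the abstract describes, each resulting segment, rather than the whole document, could then be represented and indexed as its own retrieval unit; how the authors actually do this is not stated on this page.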


Metadata
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-01437-7_11
