International Journal on Digital Libraries

Published: 15-05-2018

Fusion architectures for automatic subject indexing under concept drift

Analysis and empirical results on short texts

Authors: Martin Toepfer, Christin Seifert

Published in: International Journal on Digital Libraries | Issue 2/2020

Abstract

Indexing documents with controlled vocabularies enables a wealth of semantic applications for digital libraries. Due to the rapid growth of scientific publications, machine learning-based methods are required that assign subject descriptors automatically. While stability of the generative processes behind the underlying data is often tacitly assumed, it is violated in practice. To address this problem, this article studies explicit and implicit concept drift, that is, settings with new descriptor terms and new types of documents, respectively. First, the existence of concept drift in automatic subject indexing is discussed in detail and demonstrated by example. Subsequently, architectures for automatic indexing are analyzed in this regard, highlighting individual strengths and weaknesses. The results of the theoretical analysis justify research on the fusion of different indexing approaches, with special consideration of information sharing among descriptors. Experimental results on titles and author keywords in the domain of economics underline the relevance of the fusion methodology, especially under concept drift. Fusion approaches outperformed non-fusion strategies on the tested data sets, which comprised shifts in priors of descriptors as well as covariates. These findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic subject indexing, as is finally shown by a recent case study.

www.eurovoc.europa.eu, accessed 28. 11. 2017.

www.nlm.nih.gov/mesh, accessed 28. 11. 2017.

www.fao.org/agrovoc, accessed 28. 11. 2017.

www.zbw.eu/en/stw-info, accessed 28. 11. 2017.

© 2017 IEEE. All rights reserved. Reprinted, with permission, from Martin Toepfer and Christin Seifert: Descriptor-invariant Fusion Architectures for Automatic Subject Indexing, 2017 ACM IEEE Joint Conference on Digital Libraries (JCDL). Personal use of this material is permitted. However, permission to reuse this material for any other purpose must be obtained from the IEEE.

The number of indexing terms depends on the particular content of a document and several other factors, such as individual institutional guidelines. As a consequence, averages reported in related work vary considerably. Some data sets are actually very similar to single-label document classification, as mentioned in Sect. 2.

www.w3.org/2004/02/skos, accessed 10. 11. 2017.

In related work, especially in the domain of machine learning, the term “label” is often used for classes, which in turn represent concepts.

This meaning of descriptors has been used in related work, but please note that descriptors denote special labels in SKOS.

At the time of the experiments (Sect. 7), release 9.02 was the latest version. Version 9.04 of the STW was released on June 21, 2017.

Different meanings of \( \mathbf {x} \) will be used in other sections, for instance, in Sect. 5.

https://github.com/JasonKessler/scattertext, accessed 24. 08. 2017.

Journal of Economic Literature (JEL) codes: https://www.aeaweb.org/econlit/jelCodes.php, accessed 10. 11. 2017.

Links to approaches that relax this constraint are given in the related work, see Sect. 2.

https://github.com/HaraldKi/monqjfa, accessed 10. 11. 2017.

https://github.com/zelandiya/maui, accessed 10. 11. 2017.

several hours on several thousand documents.

www.scikit-learn.org, accessed 10. 11. 2017.

In some cases, the data were not shown to be normally distributed (Shapiro-Wilk test, \(p<0.05\)), and thus the assumptions for t tests were not met.

http://zbw.eu/stw/thsys/70002, accessed 10. 11. 2017.

http://zbw.eu/stw/thsys/70041, accessed 10. 11. 2017.

In STWFSA, we added special processing routines. For instance, it distinguishes upper- and lower-case words in certain cases, which in particular enables disambiguation of acronyms such as SALT (Strategic Arms Limitation Talks) versus salt (mineral) or AIDS (disease) versus aids (plural of aid).
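
The case distinction described in this note can be illustrated with a minimal sketch. It is not the STWFSA implementation: the lookup tables, the labels in them, and the function name `match_token` are hypothetical and chosen only to show the idea of trying an exact, case-sensitive match before falling back to a case-insensitive one.

```python
from typing import Optional

# Hypothetical lookup tables for illustration only; the labels are not
# taken from the STW thesaurus or from the STWFSA implementation.
CASE_SENSITIVE_TERMS = {
    "SALT": "Strategic Arms Limitation Talks",
    "AIDS": "AIDS",
}
CASE_INSENSITIVE_TERMS = {
    "salt": "Salt",
    "aids": "Aid",
    "aid": "Aid",
}


def match_token(token: str) -> Optional[str]:
    """Return a concept label for a token, or None if nothing matches."""
    # Exact, case-sensitive match first, so acronyms win over common words.
    if token in CASE_SENSITIVE_TERMS:
        return CASE_SENSITIVE_TERMS[token]
    # Otherwise fall back to a lowercased lookup.
    return CASE_INSENSITIVE_TERMS.get(token.lower())
```

Under these assumptions, `match_token("SALT")` and `match_token("salt")` resolve to different entries, while a token that appears in neither table yields `None`.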

49 documents have been rated by two indexers. Corresponding concept-level ratings have been averaged, using the floor function in order to resolve odd values.
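
A small sketch of the averaging step may help. It assumes integer-valued ratings and reads "odd values" as averages that fall between two rating levels; both points are assumptions, and the function name `combine_ratings` is hypothetical.

```python
def combine_ratings(first: int, second: int) -> int:
    """Average two indexers' ratings for one concept, rounding down.

    Assumes integer ratings; floor division applies the floor function
    to the mean, so a mean of 2.5 becomes 2.
    """
    return (first + second) // 2
```

With this reading, two indexers who rate a concept 2 and 3 yield a combined rating of 2, and identical ratings are left unchanged.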

Title: Fusion architectures for automatic subject indexing under concept drift
Analysis and empirical results on short texts
Authors: Martin Toepfer
Christin Seifert
Publication date: 15-05-2018
Publisher: Springer Berlin Heidelberg
Published in: International Journal on Digital Libraries / Issue 2/2020
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI: https://doi.org/10.1007/s00799-018-0240-3
