2018 | OriginalPaper | Chapter

Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts Under Precision and Recall Constraints

Authors : Martin Toepfer, Christin Seifert

Published in: Digital Libraries for Open Knowledge

Publisher: Springer International Publishing

Abstract

Digital libraries strive to integrate automatic subject indexing methods into operative information retrieval systems, yet integration is prevented by misleading and incomplete semantic annotations. For this reason, we investigate approaches to detect documents where quality criteria are met. In contrast to mainstream methods, our approach, named Qualle, estimates quality at the document level rather than the concept level. Qualle is implemented as a combination of different machine learning models in a deep, multi-layered regression architecture that comprises a variety of content-based indicators, in particular label set size calibration. We evaluated the approach on very short texts from law and economics, investigating the impact of different feature groups on recall estimation. Our results show that Qualle effectively determined subsets of previously unseen data where considerable gains in document-level recall can be achieved while precision is upheld. Such filtering can therefore be used to control compliance with data quality standards in practice. Qualle makes it possible to trade off indexing quality against collection coverage, and it can complement semi-automatic indexing to process large datasets more efficiently.
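The filtering idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy documents, the threshold, and the ratio-based recall estimate are all assumptions made for illustration. The ratio stands in for Qualle's learned regression model and only mimics the role of label set size calibration, that is, comparing how many concepts were assigned with how many the document is expected to have.

```python
from typing import List, Set, Tuple


def document_recall(predicted: Set[str], gold: Set[str]) -> float:
    """Document-level recall: share of gold concepts that were predicted."""
    if not gold:
        return 1.0
    return len(predicted & gold) / len(gold)


def naive_recall_estimate(n_predicted: int, expected_n_labels: float) -> float:
    """Toy content-based indicator (an assumption, not the paper's model):
    ratio of assigned concepts to a calibrated expected label count."""
    if expected_n_labels <= 0:
        return 1.0
    return min(1.0, n_predicted / expected_n_labels)


def filter_by_estimate(estimates: List[float], threshold: float) -> List[int]:
    """Indices of documents whose estimated recall meets the threshold."""
    return [i for i, est in enumerate(estimates) if est >= threshold]


def coverage_and_mean_recall(
    selected: List[int], true_recalls: List[float]
) -> Tuple[float, float]:
    """Collection coverage and mean true recall of the selected subset."""
    coverage = len(selected) / len(true_recalls)
    if not selected:
        return coverage, 0.0
    mean_recall = sum(true_recalls[i] for i in selected) / len(selected)
    return coverage, mean_recall


# Hypothetical documents: (predicted concepts, gold concepts, expected label count).
# In practice the expected count would come from a trained calibration model.
docs = [
    ({"a", "b", "c"}, {"a", "b", "c", "d"}, 4.0),
    ({"a"}, {"a", "b", "c", "d"}, 4.0),
    ({"x", "y"}, {"x", "y"}, 2.0),
    ({"p"}, {"p", "q"}, 2.0),
]

estimates = [naive_recall_estimate(len(pred), exp) for pred, _, exp in docs]
true_recalls = [document_recall(pred, gold) for pred, gold, _ in docs]

# Keep only documents whose estimated recall reaches the (assumed) threshold.
selected = filter_by_estimate(estimates, threshold=0.7)
coverage, mean_recall = coverage_and_mean_recall(selected, true_recalls)
overall_recall = sum(true_recalls) / len(true_recalls)

print(f"selected documents: {selected}")
print(f"coverage: {coverage:.2f}")
print(f"mean recall, selected subset: {mean_recall:.3f}")
print(f"mean recall, all documents: {overall_recall:.3f}")
```

Raising the threshold shrinks the subset released for fully automatic indexing but raises its expected recall. Documents below the threshold could be routed to semi-automatic indexing instead.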

Footnotes
1. For brevity, the term subject may be omitted in subject indexing, subject indexer, ..., respectively.
2. The concept identifier 29638-6 refers to the concept “Low-interest-rate policy”.
3. If only ranking is relevant, rank-based correlation coefficients should be considered.

Metadata
Title
Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts Under Precision and Recall Constraints
Authors
Martin Toepfer
Christin Seifert
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-00066-0_1
