Published in: International Journal on Digital Libraries, Issue 2/2015

1 June 2015

A generalized topic modeling approach for automatic document annotation

Authors: Suppawong Tuarob, Line C. Pouchard, Prasenjit Mitra, C. Lee Giles

Abstract

Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify hypotheses. Over time, such data have increased not only in amount but also in the diversity and heterogeneity of their sources, which are spread throughout the world. This heterogeneity poses a huge challenge for scientists, who have to search manually for the data they need. ONEMercury was recently implemented as part of the DataONE project to alleviate such problems and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories and makes them searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can impede effective retrieval. We propose a methodology that learns annotations from well-annotated collections of metadata records and uses them to automatically annotate poorly annotated ones. The problem is first transformed into a tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. Experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the nature of the different datasets. We also discuss related topics, such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.

Footnotes
16: http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=930
Table 3
Comparison of the keywords recommended by the TF-IDF, TM, and KEA (baseline) algorithms on a sample document, “ISLSCP II IGBP DISCOVER AND SIB LAND COVER, 1992–1993”

| Rank | Actual tags | TF-IDF | TM | KEA (baseline) |
|------|-------------|--------|----|----------------|
| 1 | albedo | field investig | land cover | model |
| 2 | land cover | analysi | modi moder resolut imag spectroradiomet | geograph distribut |
| 3 | veget cover | land cover | terra morn equatori cross time satellit | classif |
| 4 | veget index | comput model | field investig | lba |
| 5 | leaf area meter | reflect | veget cover | amazonia |
| 6 | terra morn equatori cross time satellit | veget cover | reflect | area |
| 7 | noaa nation ocean amp amp atmospher administr | biomass | veget index | south america |
| 8 | plant characterist | primari product | leaf characterist | ecolog |
| 9 | steel measur tape | steel measur tape | canopi characterist | reflect |
| 10 | canopi characterist | weigh balanc | plant characterist | calibr |
| 11 | modi moder resolut imag spectroradiomet | precipit amount | albedo | field investig |
| 12 | leaf characterist | canopi characterist | steel measur tape | speci |
| 13 | avhrr advanc high resolut radiomet | leaf characterist | avhrr advanc high resolut radiomet | factor |
| 14 | field investig | water vapor | noaa nation ocean amp amp atmospher administr | sequenc |
| 15 | reflect | quadrat sampl frame | leaf area meter | hawaiian island |
| 16 | | rain gaug | analysi | genera |
| 17 | | surfac air temperatur | comput model | fern |
| 18 | | air temperatur | noaa | systemat |
| 19 | | meteorolog station | avhrr | steel measur tape |
| 20 | | human observ | popul distribut | correl |

The first column lists the actual tags. The bold, underlined terms are correctly recommended items.
Metadata
Title: A generalized topic modeling approach for automatic document annotation
Authors: Suppawong Tuarob, Line C. Pouchard, Prasenjit Mitra, C. Lee Giles
Publication date: 1 June 2015
Publisher: Springer Berlin Heidelberg
Published in: International Journal on Digital Libraries, Issue 2/2015
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI: https://doi.org/10.1007/s00799-015-0146-2
