DOI: 10.1145/1255175.1255248
Article

Subject metadata enrichment using statistical topic models

Published: 18 June 2007

ABSTRACT

Creating a collection of metadata records from disparate and diverse sources often results in subject metadata of uneven, unreliable, and variable quality. Having uniform, consistent, and enriched subject metadata allows users to more easily discover material, browse the collection, and limit keyword search results by subject. We demonstrate how statistical topic models are useful for subject metadata enrichment. We describe some of the challenges of metadata enrichment on a huge scale (10 million metadata records from 700 repositories in the OAIster Digital Library) when the metadata is highly heterogeneous (metadata about images and text, covering both cultural heritage material and scientific literature). We show how to improve the quality of the enriched metadata using both manual and statistical modeling techniques. Finally, we discuss some of the challenges of the production environment and demonstrate the value of the enriched metadata in a prototype portal.
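
To make the idea of topic-model-based enrichment concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each record is a dictionary with a free-text `title` field, fits a small latent Dirichlet allocation model by collapsed Gibbs sampling, and attaches the top words of each dominant topic as extra subject terms. The function names (`fit_lda`, `enrich`), the `topic_subjects` field, and all parameter values are hypothetical, and labeling topics by their top words is a simplification of any manual labeling step.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z = []
        for w in doc:
            k = rng.randrange(num_topics)
            z.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def enrich(records, num_topics=2, top_words=3, min_share=0.5):
    """Return copies of records with a hypothetical 'topic_subjects' field."""
    docs = [r["title"].lower().split() for r in records]
    doc_topic, topic_word = fit_lda(docs, num_topics)
    # Label each topic by its most frequent words (an assumption of this sketch).
    labels = [
        [w for w, c in topic_word[k].most_common(top_words) if c > 0]
        for k in range(num_topics)
    ]
    enriched = []
    for record, counts in zip(records, doc_topic):
        total = sum(counts) or 1
        subjects = [labels[k] for k, c in enumerate(counts) if c / total >= min_share]
        enriched.append({**record, "topic_subjects": subjects})
    return enriched


# Toy records standing in for harvested metadata (made up for illustration).
sample = [
    {"id": "r1", "title": "gene protein expression cell"},
    {"id": "r2", "title": "protein cell gene sequence"},
    {"id": "r3", "title": "painting museum portrait oil"},
    {"id": "r4", "title": "museum portrait painting canvas"},
]
for rec in enrich(sample):
    print(rec["id"], rec["topic_subjects"])
```

A pure-Python sampler like this is only practical for toy inputs. A collection of the size described in the abstract would need a much more efficient implementation, and the learned topics would still need review before being exposed as subject headings.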

Published in

JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries
June 2007, 534 pages
ISBN: 9781595936448
DOI: 10.1145/1255175

Copyright © 2007 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 415 of 1,482 submissions (28%)
