Article

Subject metadata enrichment using statistical topic models

Authors:
David Newman, UC Irvine, Irvine, CA
Kat Hagedorn, University of Michigan, Ann Arbor, MI
Chaitanya Chemudugunta, UC Irvine, Irvine, CA
Padhraic Smyth, UC Irvine, Irvine, CA

JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, June 2007, Pages 366–375. https://doi.org/10.1145/1255175.1255248

Published: 18 June 2007

ABSTRACT

Creating a collection of metadata records from disparate and diverse sources often results in uneven, unreliable and variable quality subject metadata. Having uniform, consistent and enriched subject metadata allows users to more easily discover material, browse the collection, and limit keyword search results by subject. We demonstrate how statistical topic models are useful for subject metadata enrichment. We describe some of the challenges of metadata enrichment on a huge scale (10 million metadata records from 700 repositories in the OAIster Digital Library) when the metadata is highly heterogeneous (metadata about images and text, and both cultural heritage material and scientific literature). We show how to improve the quality of the enriched metadata, using both manual and statistical modeling techniques. Finally, we discuss some of the challenges of the production environment, and demonstrate the value of the enriched metadata in a prototype portal.


Published in
JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
June 2007
534 pages
ISBN: 9781595936448
DOI: 10.1145/1255175
General Chair: Edie Rasmussen, University of British Columbia, Canada
Program Chairs: Ray R. Larson, University of California, Berkeley; Elaine Toms, Dalhousie University, Canada; Shigeo Sugimoto, University of Tsukuba, Japan
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
OAI
browsing
clustering
digital libraries
metadata enhancement
metadata enrichment
topic model
Acceptance Rates
Overall acceptance rate: 415 of 1,482 submissions, 28%