ABSTRACT
Creating a collection of metadata records from disparate and diverse sources often results in subject metadata of uneven, unreliable, and variable quality. Uniform, consistent, and enriched subject metadata allows users to more easily discover material, browse the collection, and limit keyword search results by subject. We demonstrate how statistical topic models are useful for subject metadata enrichment. We describe some of the challenges of metadata enrichment at very large scale (10 million metadata records from 700 repositories in the OAIster Digital Library) when the metadata is highly heterogeneous (metadata about images and text, covering both cultural heritage material and scientific literature). We show how to improve the quality of the enriched metadata using both manual and statistical modeling techniques. Finally, we discuss some of the challenges of the production environment and demonstrate the value of the enriched metadata in a prototype portal.
Index Terms
- Subject metadata enrichment using statistical topic models