ABSTRACT
One goal of text mining is to provide readers with automatic methods for quickly finding the key ideas in individual documents and whole corpora. To this end, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and it can be used to identify the most original passages in a document. Unlike heuristic approaches, this statistical model is extensible and open to analysis. We evaluate the approach on both synthetic and real data, showing that the passage impact model outperforms a heuristic baseline method.