poster

Identifying the original contribution of a document via language modeling

Authors:
Benyah Shaparenko

Cornell University, Ithaca, NY, USA

Cornell University, Ithaca, NY, USA
View Profile

,
Thorsten Joachims

Cornell University, Ithaca, NY, USA

Cornell University, Ithaca, NY, USA
View Profile

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrievalJuly 2009Pages 696–697https://doi.org/10.1145/1571941.1572083

Published:19 July 2009Publication History

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Pages 696–697

ABSTRACT

One goal of text mining is to provide readers with automatic methods for quickly finding the key ideas in individual documents and whole corpora. To this effect, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and it can be used to identify the most original passages in a document. Unlike heuristic approaches, this statistical model is extensible and open to analysis. We evaluate the approach on both synthetic and real data, showing that the passage impact model outperforms a heuristic baseline method.

References

Document understanding conferences. http://duc.nist.gov/.Google Scholar
J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop-1998, 1998.Google Scholar
D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(5):993--1022, 2003. Google ScholarDigital Library
G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513--523, 1988. Google ScholarDigital Library
B. Shaparenko and T. Joachims. Information genealogy: Uncovering the flow of ideas in non-hyperlinked document databases. In Proceedings of KDD-07, pages 619--628, New York, 2007. ACM Press. Google ScholarDigital Library
I. Soboroff and D. Harman. Overview of the TREC 2003 novelty track. In Proceedings of TREC-2003, 2003.Google Scholar

Index Terms

Identifying the original contribution of a document via language modeling
1. Information systems
  1. Information systems applications

Recommendations

Identifying the original contribution of a document via Language Modeling
ECMLPKDD'09: Proceedings of the 2009th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II

One major goal of text mining is to provide automatic methods to help humans grasp the key ideas in ever-increasing text corpora. To this effect, we propose a statistically well-founded method for identifying the original ideas that a document ...
Read More
Large vocabulary Russian speech recognition using syntactico-statistical language modeling

Speech is the most natural way of human communication and in order to achieve convenient and efficient human-computer interaction implementation of state-of-the-art spoken language technology is necessary. Research in this area has been traditionally ...
Read More
Evaluation of Advanced Language Modeling Techniques for Russian LVCSR
SPECOM 2013: Proceedings of the 15th International Conference on Speech and Computer - Volume 8113

The Russian language is characterized by very flexible word order, which limits the ability of the standard <em>n</em>-grams to capture important regularities in the data. Moreover, it is highly inflectional language with rich morphology, which leads to ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
896 pages
ISBN:9781605584836
DOI:10.1145/1571941
General Chairs:
James Allan
University of Massachusetts Amherst, USA
,
Javed Aslam
Northeastern University, USA
,
Program Chairs:
Mark Sanderson
University of Sheffield, UK
,
ChengXiang Zhai
University of Illinois at Urbana-Champaign, USA
,
Justin Zobel
University of Melbourne, Australia
Copyright © 2009 Copyright is held by the author/owner(s)
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 July 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
language modeling
original contributions
topic modeling
Qualifiers
- poster
Conference

Acceptance Rates
Overall Acceptance Rate792of3,983submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 5
  Total Citations
  View Citations
- 246
  Total Downloads
- Downloads (Last 12 months)5
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Identifying the original contribution of a document via language modeling

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

ABSTRACT

References

Cited By

Index Terms

Recommendations

Identifying the original contribution of a document via Language Modeling

Large vocabulary Russian speech recognition using syntactico-statistical language modeling

Evaluation of Advanced Language Modeling Techniques for Russian LVCSR