skip to main content
10.1145/1571941.1572083acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
poster

Identifying the original contribution of a document via language modeling

Published:19 July 2009Publication History

ABSTRACT

One goal of text mining is to provide readers with automatic methods for quickly finding the key ideas in individual documents and whole corpora. To this effect, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and it can be used to identify the most original passages in a document. Unlike heuristic approaches, this statistical model is extensible and open to analysis. We evaluate the approach on both synthetic and real data, showing that the passage impact model outperforms a heuristic baseline method.

References

  1. Document understanding conferences. http://duc.nist.gov/.Google ScholarGoogle Scholar
  2. J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop-1998, 1998.Google ScholarGoogle Scholar
  3. D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(5):993--1022, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513--523, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. B. Shaparenko and T. Joachims. Information genealogy: Uncovering the flow of ideas in non-hyperlinked document databases. In Proceedings of KDD-07, pages 619--628, New York, 2007. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. I. Soboroff and D. Harman. Overview of the TREC 2003 novelty track. In Proceedings of TREC-2003, 2003.Google ScholarGoogle Scholar

Index Terms

  1. Identifying the original contribution of a document via language modeling

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
      July 2009
      896 pages
      ISBN:9781605584836
      DOI:10.1145/1571941

      Copyright © 2009 Copyright is held by the author/owner(s)

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 19 July 2009

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • poster

      Acceptance Rates

      Overall Acceptance Rate792of3,983submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader