DOI: 10.1145/2396761.2396857
research-article

Temporal corpus summarization using submodular word coverage

Published: 29 October 2012

ABSTRACT

In many areas of life, we now have almost complete electronic archives reaching back for well over two decades. This includes, for example, the body of research papers in computer science, all news articles written in the US, and most people's personal email. However, we have only rather limited methods for analyzing and understanding these collections. While keyword-based retrieval systems allow efficient access to individual documents in archives, we still lack methods for understanding a corpus as a whole. In this paper, we explore methods that provide a temporal summary of such corpora in terms of landmark documents, authors, and topics. In particular, we explicitly model the temporal nature of influence between documents and re-interpret summarization as a coverage problem over words anchored in time. The resulting models provide monotone submodular objectives for computing informative and non-redundant summaries over time, which can be efficiently optimized with greedy algorithms. Our empirical study shows the effectiveness of our approach over several baselines.
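
To make the idea of greedily optimizing a coverage objective concrete, the following is a minimal sketch of generic weighted word coverage, not the paper's actual objective. It assumes each document is reduced to a set of words, that every word has a non-negative weight (defaulting to 1.0), and that the summary is limited to a fixed number of documents. The names `coverage`, `greedy_summary`, `docs`, `weights`, and `budget` are hypothetical, and the temporal anchoring of words described in the abstract is deliberately left out.

```python
from typing import Dict, List, Set


def coverage(selected: List[str], docs: Dict[str, Set[str]],
             weights: Dict[str, float]) -> float:
    """Total weight of the distinct words covered by the selected documents.

    A word counts once no matter how many selected documents contain it,
    which is what makes the objective monotone and submodular.
    """
    covered: Set[str] = set()
    for doc_id in selected:
        covered |= docs[doc_id]
    return sum(weights.get(word, 1.0) for word in covered)


def greedy_summary(docs: Dict[str, Set[str]], weights: Dict[str, float],
                   budget: int) -> List[str]:
    """Pick up to `budget` documents, each time adding the one whose
    not-yet-covered words carry the largest total weight."""
    selected: List[str] = []
    covered: Set[str] = set()
    for _ in range(min(budget, len(docs))):
        best_id, best_gain = None, 0.0
        for doc_id in sorted(docs):  # sorted for deterministic tie-breaking
            if doc_id in selected:
                continue
            gain = sum(weights.get(w, 1.0) for w in docs[doc_id] - covered)
            if gain > best_gain:
                best_id, best_gain = doc_id, gain
        if best_id is None:  # no remaining document adds new words
            break
        selected.append(best_id)
        covered |= docs[best_id]
    return selected


# Toy corpus (illustrative data only, not from the paper).
docs = {
    "a": {"submodular", "coverage", "greedy"},
    "b": {"coverage", "greedy"},
    "c": {"temporal", "summary"},
}
summary = greedy_summary(docs, weights={}, budget=2)
print(summary, coverage(summary, docs, weights={}))
```

In this toy corpus, document "b" is skipped in the second round because all of its words are already covered by "a", which illustrates how a coverage objective discourages redundancy. For monotone submodular objectives under a cardinality constraint, this kind of greedy selection is known to achieve at least a (1 - 1/e) fraction of the optimal value.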

Published in

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
October 2012
2840 pages
ISBN: 9781450311564
DOI: 10.1145/2396761

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
