ABSTRACT
In many areas of life, we now have almost complete electronic archives reaching back well over two decades. These include, for example, the body of research papers in computer science, all news articles written in the US, and most people's personal email. However, we have only limited methods for analyzing and understanding these collections. While keyword-based retrieval systems allow efficient access to individual documents in archives, we still lack methods for understanding a corpus as a whole. In this paper, we explore methods that provide a temporal summary of such corpora in terms of landmark documents, authors, and topics. In particular, we explicitly model the temporal nature of influence between documents and re-interpret summarization as a coverage problem over words anchored in time. The resulting models provide monotone submodular objectives for computing informative and non-redundant summaries over time, which can be efficiently optimized with greedy algorithms. Our empirical study shows the effectiveness of our approach over several baselines.
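To make the coverage idea concrete, the following is a minimal sketch of greedy selection under a weighted word-coverage objective. It is an illustration only, not the paper's actual model: the abstract does not give the objective, so the representation of each document as a set of (word, year) pairs, the default unit weights, and the names `coverage`, `greedy_summary`, and `toy_docs` are all assumptions made here. With non-negative weights, this kind of coverage function is monotone and submodular, which is the property that lets a simple greedy loop achieve the classical (1 - 1/e) approximation guarantee under a cardinality budget.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a "word anchored in time" is a (word, year) pair.
Element = Tuple[str, int]


def coverage(selected: List[str],
             docs: Dict[str, Set[Element]],
             weight: Dict[Element, float]) -> float:
    """Total weight of the distinct elements covered by the selected documents.

    Elements missing from `weight` default to 1.0 (an assumption of this sketch).
    """
    covered: Set[Element] = set()
    for doc_id in selected:
        covered |= docs[doc_id]
    return sum(weight.get(e, 1.0) for e in covered)


def greedy_summary(docs: Dict[str, Set[Element]],
                   weight: Dict[Element, float],
                   k: int) -> List[str]:
    """Pick up to k documents, each time adding the one with the largest
    marginal gain in covered weight."""
    selected: List[str] = []
    covered: Set[Element] = set()
    for _ in range(min(k, len(docs))):
        best_id, best_gain = None, 0.0
        for doc_id in sorted(docs):  # sorted for deterministic tie-breaking
            if doc_id in selected:
                continue
            # Marginal gain: weight of elements not yet covered.
            gain = sum(weight.get(e, 1.0) for e in docs[doc_id] - covered)
            if gain > best_gain:
                best_id, best_gain = doc_id, gain
        if best_id is None:  # no remaining document adds anything new
            break
        selected.append(best_id)
        covered |= docs[best_id]
    return selected


# Toy corpus (fabricated purely for illustration).
toy_docs: Dict[str, Set[Element]] = {
    "a": {("svm", 1998), ("kernel", 1998), ("margin", 1998)},
    "b": {("svm", 1998), ("kernel", 1998)},
    "c": {("lda", 2003), ("topic", 2003)},
}
summary = greedy_summary(toy_docs, {}, 2)
print(summary, coverage(summary, toy_docs, {}))
```

In this toy run, document "b" is skipped in the second round because everything it contains is already covered by "a". This is how a coverage objective discourages redundant selections while still rewarding documents that cover new words or new time periods.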
Index Terms
- Temporal corpus summarization using submodular word coverage