ABSTRACT
As the number and size of large timestamped collections (e.g. sequences of digitized newspapers, periodicals, blogs) increase, the problem of efficiently indexing and searching such data becomes more important. Term burstiness has been extensively researched as a mechanism to address event detection in the context of such collections. In this paper, we explore how burstiness information can be further utilized to enhance the search process. We present a novel approach to model the burstiness of a term, using discrepancy theory concepts. This allows us to build a parameter-free, linear-time approach to identify the time intervals of maximum burstiness for a given term. Finally, we describe the first burstiness-driven search framework and thoroughly evaluate our approach in the context of different scenarios.
Supplemental Material
- D. Agarwal, J. M. Phillips, and S. Venkatasubramanian. The hunting of the bump: on maximizing statistical discrepancy. In SODA '06, pages 1137--1146, New York. Google ScholarDigital Library
- N. Bansal and N. Koudas. Blogscope: a system for online analysis of high volume text streams. In VLDB '07. Google ScholarDigital Library
- N. Bansal and N. Koudas. BlogScope: spatio-temporal analysis of the blogosphere. In WWW '07. Google ScholarDigital Library
- B. Chazelle. The discrepancy method: randomness and complexity. Cambridge University Press, NY, 2000. Google ScholarDigital Library
- C.-H. Cheng, K.-Y. Chen, W.-C. Tien, and K.-M. Chao. Improved algorithmms for the k maximum-sums problems. Theor. Comput. Sci., 362(1):162--170, 2006. Google ScholarDigital Library
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, September 2001. Google ScholarDigital Library
- D. P. Dobkin, D. Gunopulos, and W. Maass. Computing the maximum bichromatic discrepancy, with applications to computer graphics and machine learning. J. Comput. Syst. Sci., 52(3):453--470, 1996. Google ScholarDigital Library
- D. Dobkin and D. Eppstein. Computing the discrepancy. In SCG '93, pages 47--52, New York, NY, USA, 1993. ACM. Google ScholarDigital Library
- R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS '01. Google ScholarDigital Library
- G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu. Parameter free bursty events detection in text streams. In VLDB '05. Google ScholarDigital Library
- Q. He, K. Chang, and E.-P. Lim. Analyzing feature trajectories for event detection. In SIGIR '07. Google ScholarDigital Library
- Q. He, K. Chang, E.-P. Lim, and J. Zhang. Bursty feature representation for clustering text streams. In SIAM '07.Google Scholar
- J. Kleinberg. Bursty and hierarchical structure in streams. In KDD '02, pages 91--101, New York, USA. Google ScholarDigital Library
- R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. On the bursty evolution of blogspace. In WWW '03. Google ScholarDigital Library
- National Digital Newspaper Program (NDNP), http://www.loc.gov/ndnp.Google Scholar
- Qi He and Kuiyu Chang and Ee-Peng Lim. Using burstiness to improve clustering of topics in news streams. In ICDM '07, Washington, DC, USA, 2007. Google ScholarDigital Library
- W. L. Ruzzo and M. Tompa. A linear time algorithm for finding all maximal scoring subsequences. In ISMB 1999. Google ScholarDigital Library
- M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. In SIGMOD '04, pages 131--142, New York. Google ScholarDigital Library
- Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In KDD '03, pages 336--345, New York. Google ScholarDigital Library
Index Terms
- On burstiness-aware search for document sequences
Recommendations
A burstiness-aware approach for document dating
SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrievalA large number of mainstream applications, like temporal search, event detection, and trend identification, assume knowledge of the timestamp of every document in a given textual collection. In many cases, however, the required timestamps are either ...
Service Burstiness and Dynamic Burstiness Measures: A Framework
We introduce a framework for characterizing service disciplines in terms of service burstiness. A wide range of service disciplines recently proposed for high-speed integrated services networks fall within this unifying framework. We introduce dynamic ...
Document search interface design for large-scale collections and intelligent access
JCDL '02: Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital librariesAs the universe of documents has enlarged from those available via the online catalog to a larger cluster of databases and web-accessible resources, interfaces are being created that can search multiple document collections simultaneously. Also, ...
Comments