ABSTRACT
A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise --- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges.The goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; in some ways, it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
- R. Agrawal, R. Srikant, "Mining sequential patterns," Proc. Intl. Conf. on Data Engineering, 1995. Google ScholarDigital Library
- J. Allan, J.G. Carbonell, G. Doddington, J. Yamron, Y. Yang, 'Topic Detection and Tracking Pilot Study: Final Report," Proc. DARPA Broadcast News Transcription and Understanding Workshop, Feb. 1998.Google Scholar
- J. Allan, R. Papka, V. Lavrenko, "On-line new event detection and tracking," Proc. SIGIR Intl. Conf. Information Retrieval, 1998. Google ScholarDigital Library
- K. Becker, M. Cardoso, "Mail-by-Example: A visual query interface for managing large volumes of electronic messages," Proc. 15th Brazilian Symposium on Databases, 2000. Google ScholarDigital Library
- D. Beeferman, A. Berger, J. Lafferty, "Statistical Models for Text Segmentation," Machine Learning 34(1999), pp. 177--210. Google ScholarDigital Library
- H. Berghel, "E-mail: The good, the bad, and the ugly," Communications of the ACM, 40:4(April 1997), pp. 11--15. Google ScholarDigital Library
- A. Birrell, S. Perl, M. Schroeder, T. Wobber, The Pachyderm E-mail System, 1997, at http://www.research.compaq.com/SRC/pachyderm/.Google Scholar
- T. Blanton, Ed., White House E-mail, New Press, 1995.Google Scholar
- G. Boone, "Concept features in Re:Agent, an intelligent e-mail agent," Proc. 2nd Intl. Conf. Autonomous Agents, 1998. Google ScholarDigital Library
- C. Chatfield, The Analysis of time Series: An Introduction, Chapman and Hall, 1996.Google ScholarCross Ref
- S. Chatman, Story and Discourse: Narrative Structure in Fiction and Film, Cornell Univ. Press, 1978.Google Scholar
- D. Chudova, P. Smyth, "Unsupervised identification of sequential patterns under a Markov assumption," KDD Workshop on Temporal Data Mining, 2001.Google Scholar
- W. Cohen. "Learning rules that classify e-mail." Proc. AAAI Spring Symp. Machine Learning and Information Access, 1996.Google Scholar
- T. Cover, P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Information Theory IT-13(1967), pp. 21--27.Google Scholar
- W. Davison, L. Wall, S. Barber, trn, 1993 http://web.mit.edu/afs/sipb/project/trn/src/trn-3.6/.Google Scholar
- R. Ehrich, J. Foith, "Representation of Random Waveforms by Relational 'Trees," IEEE Trans. Computers, C25:7(1976).Google Scholar
- S. Fine, Y. Singer, N. Tishby, "The hierarchical hidden Markov model: Analysis and applications," Machine Learning 32(1998). Google ScholarDigital Library
- E.M. Forster, Aspects of the Novel, Harcourt, Brace, and World, Inc. 1927.Google Scholar
- G. Gay, M. Stefanone, M. Grace-Martin, H. Hembrooke, "The effect of wireless computing in collaborative learning environments," Intl. J. Human-Computer Interaction, to appear.Google Scholar
- G. Genette, Narrative Discourse: An Essay in Method, English translation (J.E. Lewin), Cornell Univ. Press, 1980.Google Scholar
- G. Genette, Narrative Discourse Revisited, English translation (J.E. Lewin), Cornell Univ. Press, 1988.Google Scholar
- B. Grosz, C. Sidner, "Attention, intentions, and the structure of discourse," Computational Linguistics 12(1986). Google ScholarDigital Library
- T. Gruber, Hypermail, Enterprise Integration Technologies.Google Scholar
- V. Guralnik, J. Srivastava, "Event detection from time series data," Intl. Conf. Knowledge Discovery and Data Mining, 1999. Google ScholarDigital Library
- J. Han, W. Gong, Y. Yin, "Mining Segment-Wise Periodic Patterns in Time-Related Databases", Proc. Intl. Conf. Knowledge Discovery and Data Mining, 1998.Google Scholar
- D. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, 2001. Google ScholarDigital Library
- S. Havre, B. Hetzler, L. Nowell, "ThemeRiver: Visualizing Theme Changes over Time," Proc. IEEE Symposium on Information Visualization, 2000. Google ScholarDigital Library
- D. Hawkins, "Point estimation of the parameters of piecewise regression models," Applied Statistics 25(1976)Google Scholar
- B. Heckel, B. Hamann, "EmVis --- A Visual e-Mail Analysis Tool," Proc. Workshop on New Paradigms in Information Visualization and Manipulation, in conjunction with Conf. on Information and Knowledge Management, 1997. Google ScholarDigital Library
- J. Helfman, C. Isbell, "Ishmail: Immediate identification of important information," AT&T Labs Technical Report, 1995.Google Scholar
- E. Horvitz, "Principles of Mixed-Initiative User Interfaces," Proc. ACM Conf. Human Factors in Computing Systems, 1999. Google ScholarDigital Library
- D. Hudson, "Fitting segmented curves whose join points have to be estimated," Journal of the American Statistical Association 61(1966) pp. 1097--1129.Google ScholarCross Ref
- F. P. Kelly, "Notes on effective bandwidths," in Stochastic Networks: Theory and Applications, (F.P. Kelly, S. Zachary, I. Ziedins, eds.) Oxford Univ. Press, 1996.Google Scholar
- E. Keogh, P. Smyth, "A probabilistic approach to fast pattern matching in time series databases," Proc. Intl. Conf. Knowledge Discovery and Data Mining, 1997.Google Scholar
- J.l. Klein et al., Plaintiffs' Memorandum in Support of Proposed Final Judgment, United States of America v. Microsoft Corporation and State of New York, ex rel. Attorney General Eliot Spitzer, et al., v. Microsoft Corporation, Civil Actions No. 98-1232 (TPJ) and 98-1233 (TPJ), April 2000.Google Scholar
- M. Last, Y. Klein, A. Kandel, "Knowledge Discovery in Time Series Databases," IEEE Transactions on Systems, Man, and Cybernetics 31B(2001). Google ScholarDigital Library
- V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, J. Allan, "Mining of Concurrent Text and Time-Series," KDD-2000 Workshop on Text Mining, 2000.Google Scholar
- D.D. Lewis, K.A. Knowles, "Threading electronic mail: A preliminary study," Inf. Proc. Management 33(1997). Google ScholarDigital Library
- S.S. Lukesh, "E-mail and potential loss to future archives and scholarship, or, The dog that didn't bark," First Monday 4(9) (September 1999), at http://firstmonday.orgGoogle Scholar
- P. Maes, "Agents that reduce work and information overload," Communications of the ACM 37:7(1994), pp. 30--40. Google ScholarDigital Library
- H. Mannila, M. Salmenkivi, "Finding simple intensity descriptions from event sequence data," Proc. Intl. Conf. on Knowledge Discovery and Data Mining, 2001. Google ScholarDigital Library
- H. Mannila, H. Toivonen, A.I. Verkamo, "Discovering frequent episodes in sequences," Proc. Intl. Conf. on Knowledge Discovery and Data Mining, 1995.Google Scholar
- R. Martin, V. Yohai, "Data mining for unusual movements in temporal data," KDD Wkshp. Temporal Data Mining, 2001.Google Scholar
- M.L. Markus, "Finding a Happy Medium: Explaining the Negative Effects of Electronic Communication on Social Life at Work," ACM Trans. Info. Sys. 12(1994), pp. 119--149. Google ScholarDigital Library
- N. Miller, P. Wong, M. Brewster, H. Foote, "Topic Islands: A Wavelet-Based Text Visualization System," Proc. IEEE Visualization, 1998. Google ScholarDigital Library
- R. Moore, C. Baru, A. Rajasekar, B. Ludaescher, R. Marciano, M. Wan, W. Schroeder, A. Gupta, "Collection-Based Persistent Digital Archives -- Part 2," D-Lib Magazine, 6(2000). Google ScholarDigital Library
- K. Murphy, M. Paskin, "Linear time inference in hierarchical HMMs," Advances in Neural Information Processing Systems (NIPS) 14, 2001.Google Scholar
- F. Olsen, "Facing Flood of E-Mail, Archives Seeks Help From Supercomputer Researchers," Chronicle of Higher Education, August 24, 1999.Google Scholar
- T. Payne, P. Edwards, "Interface agents that learn: An investigation of learning issues in a mail agent interface," Applied Artificial Intelligence 11(1997), pp. 1--32.Google ScholarCross Ref
- S. Pollock, "A rule-based message filtering system," ACM Trans. Office Automation Systems 6(3):232--254, 1988. Google ScholarDigital Library
- L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE 77(1989).Google Scholar
- M. Redmond, B. Adelson, "AlterEgo e-mail filtering agent," Proc. AAAI Workshop on Case-Based Reasoning, 1998.Google Scholar
- J. Rennie, "ifile: An application of machine learning to e-mail filtering," Proc. KDD Workshop on Text Mining, 2000.Google Scholar
- M. Sahami, S. Dumais, D. Heckerman, E. Horvitz. "A Bayesian approach to filtering junk email," Proc. AAAI Workshop on Learning for Text Categorization, 1998.Google Scholar
- B. Schneier, Applied Cryptography Wiley, 1996.Google Scholar
- R. Segal, J. Kephart. "MailCat: An intelligent assistant for organizing e-mail," Proc. Intl. Conf. Autonomous Agents, 1999. Google ScholarDigital Library
- R. Segal, J. Kephart. "Incremental Learning in SwiftFile," Proc. Intl. Conf. on Machine Learning, 2000. Google ScholarDigital Library
- S. Shaw, R. DeFigueiredo, "Structural Processing of Waveforms as Trees," IEEE Transactions on Acoustics, Speech, and Signal Processing 38:2(1990)Google ScholarCross Ref
- R. Swan, J. Allan, "Extracting significant time-varying features from text," Proc. 8th Intl. Conf. on Information Knowledge Management, 1999. Google ScholarDigital Library
- R. Swan, J. Allan, "Automatic generation of overview timelines," Proc. SIGIR Intl. Conf. Information Retrieval, 2000. Google ScholarDigital Library
- R. Swan, D. Jensen, "TimeMines: Constructing Timelines with Statistical Models of Word Usage," KDD-2000 Workshop on Text Mining, 2000.Google Scholar
- S. Whittaker, C. Sidner, "E-mail overload: Exploring personal information management of e-mail," Proc. ACM SIGCHI Conf. on Human Factors in Computing Systems, 1996. Google ScholarDigital Library
- P. Wong, W. Cowley, H. Foote, E. Jurrus, J. Thomas, "Visualizing sequential patterns for text mining," Proc. IEEE Information Visualization, 2000 Google ScholarDigital Library
- Y. Yang, T. Ault, T. Pierce, C.W. Lattimer, "Improving text categorization methods for event tracking," Proc. SIGIR Intl. Conf. Information Retrieval, 2000. Google ScholarDigital Library
- Y. Yang, T. Pierce, J.G. Carbonell, "A Study on Retrospective and On-line Event Detection," Proc. SIGIR Intl. Conf. Information Retrieval, 1998. Google ScholarDigital Library
Index Terms
- Bursty and hierarchical structure in streams
Recommendations
Bursty and Hierarchical Structure in Streams
A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in ...
Mining correlated bursty topic patterns from coordinated text streams
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data miningPrevious work on text mining has almost exclusively focused on a single stream. However, we often have available multiple text streams indexed by the same set of time points (called coordinated text streams), which offer new opportunities for text ...
Backpressure in shared-memory-based ATM switches under multiplexed bursty sources
INFOCOM'96: Proceedings of the Fifteenth annual joint conference of the IEEE computer and communications societies conference on The conference on computer communications - Volume 2In this paper, we study a shared-memory center-stage switch with input multiplexers and output demultiplexers, where backpressure is applied from the demultiplexers to the center stage, and from the center stage to the multiplexers. We consider three ...
Comments