skip to main content
10.1145/775047.775061acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

Bursty and hierarchical structure in streams

Published:23 July 2002Publication History

ABSTRACT

A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise --- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges.The goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; in some ways, it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.

References

  1. R. Agrawal, R. Srikant, "Mining sequential patterns," Proc. Intl. Conf. on Data Engineering, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. J. Allan, J.G. Carbonell, G. Doddington, J. Yamron, Y. Yang, 'Topic Detection and Tracking Pilot Study: Final Report," Proc. DARPA Broadcast News Transcription and Understanding Workshop, Feb. 1998.Google ScholarGoogle Scholar
  3. J. Allan, R. Papka, V. Lavrenko, "On-line new event detection and tracking," Proc. SIGIR Intl. Conf. Information Retrieval, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. K. Becker, M. Cardoso, "Mail-by-Example: A visual query interface for managing large volumes of electronic messages," Proc. 15th Brazilian Symposium on Databases, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Beeferman, A. Berger, J. Lafferty, "Statistical Models for Text Segmentation," Machine Learning 34(1999), pp. 177--210. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. H. Berghel, "E-mail: The good, the bad, and the ugly," Communications of the ACM, 40:4(April 1997), pp. 11--15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Birrell, S. Perl, M. Schroeder, T. Wobber, The Pachyderm E-mail System, 1997, at http://www.research.compaq.com/SRC/pachyderm/.Google ScholarGoogle Scholar
  8. T. Blanton, Ed., White House E-mail, New Press, 1995.Google ScholarGoogle Scholar
  9. G. Boone, "Concept features in Re:Agent, an intelligent e-mail agent," Proc. 2nd Intl. Conf. Autonomous Agents, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. C. Chatfield, The Analysis of time Series: An Introduction, Chapman and Hall, 1996.Google ScholarGoogle ScholarCross RefCross Ref
  11. S. Chatman, Story and Discourse: Narrative Structure in Fiction and Film, Cornell Univ. Press, 1978.Google ScholarGoogle Scholar
  12. D. Chudova, P. Smyth, "Unsupervised identification of sequential patterns under a Markov assumption," KDD Workshop on Temporal Data Mining, 2001.Google ScholarGoogle Scholar
  13. W. Cohen. "Learning rules that classify e-mail." Proc. AAAI Spring Symp. Machine Learning and Information Access, 1996.Google ScholarGoogle Scholar
  14. T. Cover, P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Information Theory IT-13(1967), pp. 21--27.Google ScholarGoogle Scholar
  15. W. Davison, L. Wall, S. Barber, trn, 1993 http://web.mit.edu/afs/sipb/project/trn/src/trn-3.6/.Google ScholarGoogle Scholar
  16. R. Ehrich, J. Foith, "Representation of Random Waveforms by Relational 'Trees," IEEE Trans. Computers, C25:7(1976).Google ScholarGoogle Scholar
  17. S. Fine, Y. Singer, N. Tishby, "The hierarchical hidden Markov model: Analysis and applications," Machine Learning 32(1998). Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. E.M. Forster, Aspects of the Novel, Harcourt, Brace, and World, Inc. 1927.Google ScholarGoogle Scholar
  19. G. Gay, M. Stefanone, M. Grace-Martin, H. Hembrooke, "The effect of wireless computing in collaborative learning environments," Intl. J. Human-Computer Interaction, to appear.Google ScholarGoogle Scholar
  20. G. Genette, Narrative Discourse: An Essay in Method, English translation (J.E. Lewin), Cornell Univ. Press, 1980.Google ScholarGoogle Scholar
  21. G. Genette, Narrative Discourse Revisited, English translation (J.E. Lewin), Cornell Univ. Press, 1988.Google ScholarGoogle Scholar
  22. B. Grosz, C. Sidner, "Attention, intentions, and the structure of discourse," Computational Linguistics 12(1986). Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. T. Gruber, Hypermail, Enterprise Integration Technologies.Google ScholarGoogle Scholar
  24. V. Guralnik, J. Srivastava, "Event detection from time series data," Intl. Conf. Knowledge Discovery and Data Mining, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. Han, W. Gong, Y. Yin, "Mining Segment-Wise Periodic Patterns in Time-Related Databases", Proc. Intl. Conf. Knowledge Discovery and Data Mining, 1998.Google ScholarGoogle Scholar
  26. D. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. S. Havre, B. Hetzler, L. Nowell, "ThemeRiver: Visualizing Theme Changes over Time," Proc. IEEE Symposium on Information Visualization, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. D. Hawkins, "Point estimation of the parameters of piecewise regression models," Applied Statistics 25(1976)Google ScholarGoogle Scholar
  29. B. Heckel, B. Hamann, "EmVis --- A Visual e-Mail Analysis Tool," Proc. Workshop on New Paradigms in Information Visualization and Manipulation, in conjunction with Conf. on Information and Knowledge Management, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. Helfman, C. Isbell, "Ishmail: Immediate identification of important information," AT&T Labs Technical Report, 1995.Google ScholarGoogle Scholar
  31. E. Horvitz, "Principles of Mixed-Initiative User Interfaces," Proc. ACM Conf. Human Factors in Computing Systems, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. D. Hudson, "Fitting segmented curves whose join points have to be estimated," Journal of the American Statistical Association 61(1966) pp. 1097--1129.Google ScholarGoogle ScholarCross RefCross Ref
  33. F. P. Kelly, "Notes on effective bandwidths," in Stochastic Networks: Theory and Applications, (F.P. Kelly, S. Zachary, I. Ziedins, eds.) Oxford Univ. Press, 1996.Google ScholarGoogle Scholar
  34. E. Keogh, P. Smyth, "A probabilistic approach to fast pattern matching in time series databases," Proc. Intl. Conf. Knowledge Discovery and Data Mining, 1997.Google ScholarGoogle Scholar
  35. J.l. Klein et al., Plaintiffs' Memorandum in Support of Proposed Final Judgment, United States of America v. Microsoft Corporation and State of New York, ex rel. Attorney General Eliot Spitzer, et al., v. Microsoft Corporation, Civil Actions No. 98-1232 (TPJ) and 98-1233 (TPJ), April 2000.Google ScholarGoogle Scholar
  36. M. Last, Y. Klein, A. Kandel, "Knowledge Discovery in Time Series Databases," IEEE Transactions on Systems, Man, and Cybernetics 31B(2001). Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, J. Allan, "Mining of Concurrent Text and Time-Series," KDD-2000 Workshop on Text Mining, 2000.Google ScholarGoogle Scholar
  38. D.D. Lewis, K.A. Knowles, "Threading electronic mail: A preliminary study," Inf. Proc. Management 33(1997). Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. S.S. Lukesh, "E-mail and potential loss to future archives and scholarship, or, The dog that didn't bark," First Monday 4(9) (September 1999), at http://firstmonday.orgGoogle ScholarGoogle Scholar
  40. P. Maes, "Agents that reduce work and information overload," Communications of the ACM 37:7(1994), pp. 30--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. H. Mannila, M. Salmenkivi, "Finding simple intensity descriptions from event sequence data," Proc. Intl. Conf. on Knowledge Discovery and Data Mining, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. H. Mannila, H. Toivonen, A.I. Verkamo, "Discovering frequent episodes in sequences," Proc. Intl. Conf. on Knowledge Discovery and Data Mining, 1995.Google ScholarGoogle Scholar
  43. R. Martin, V. Yohai, "Data mining for unusual movements in temporal data," KDD Wkshp. Temporal Data Mining, 2001.Google ScholarGoogle Scholar
  44. M.L. Markus, "Finding a Happy Medium: Explaining the Negative Effects of Electronic Communication on Social Life at Work," ACM Trans. Info. Sys. 12(1994), pp. 119--149. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. N. Miller, P. Wong, M. Brewster, H. Foote, "Topic Islands: A Wavelet-Based Text Visualization System," Proc. IEEE Visualization, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. R. Moore, C. Baru, A. Rajasekar, B. Ludaescher, R. Marciano, M. Wan, W. Schroeder, A. Gupta, "Collection-Based Persistent Digital Archives -- Part 2," D-Lib Magazine, 6(2000). Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. K. Murphy, M. Paskin, "Linear time inference in hierarchical HMMs," Advances in Neural Information Processing Systems (NIPS) 14, 2001.Google ScholarGoogle Scholar
  48. F. Olsen, "Facing Flood of E-Mail, Archives Seeks Help From Supercomputer Researchers," Chronicle of Higher Education, August 24, 1999.Google ScholarGoogle Scholar
  49. T. Payne, P. Edwards, "Interface agents that learn: An investigation of learning issues in a mail agent interface," Applied Artificial Intelligence 11(1997), pp. 1--32.Google ScholarGoogle ScholarCross RefCross Ref
  50. S. Pollock, "A rule-based message filtering system," ACM Trans. Office Automation Systems 6(3):232--254, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE 77(1989).Google ScholarGoogle Scholar
  52. M. Redmond, B. Adelson, "AlterEgo e-mail filtering agent," Proc. AAAI Workshop on Case-Based Reasoning, 1998.Google ScholarGoogle Scholar
  53. J. Rennie, "ifile: An application of machine learning to e-mail filtering," Proc. KDD Workshop on Text Mining, 2000.Google ScholarGoogle Scholar
  54. M. Sahami, S. Dumais, D. Heckerman, E. Horvitz. "A Bayesian approach to filtering junk email," Proc. AAAI Workshop on Learning for Text Categorization, 1998.Google ScholarGoogle Scholar
  55. B. Schneier, Applied Cryptography Wiley, 1996.Google ScholarGoogle Scholar
  56. R. Segal, J. Kephart. "MailCat: An intelligent assistant for organizing e-mail," Proc. Intl. Conf. Autonomous Agents, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. R. Segal, J. Kephart. "Incremental Learning in SwiftFile," Proc. Intl. Conf. on Machine Learning, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. S. Shaw, R. DeFigueiredo, "Structural Processing of Waveforms as Trees," IEEE Transactions on Acoustics, Speech, and Signal Processing 38:2(1990)Google ScholarGoogle ScholarCross RefCross Ref
  59. R. Swan, J. Allan, "Extracting significant time-varying features from text," Proc. 8th Intl. Conf. on Information Knowledge Management, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. R. Swan, J. Allan, "Automatic generation of overview timelines," Proc. SIGIR Intl. Conf. Information Retrieval, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. R. Swan, D. Jensen, "TimeMines: Constructing Timelines with Statistical Models of Word Usage," KDD-2000 Workshop on Text Mining, 2000.Google ScholarGoogle Scholar
  62. S. Whittaker, C. Sidner, "E-mail overload: Exploring personal information management of e-mail," Proc. ACM SIGCHI Conf. on Human Factors in Computing Systems, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. P. Wong, W. Cowley, H. Foote, E. Jurrus, J. Thomas, "Visualizing sequential patterns for text mining," Proc. IEEE Information Visualization, 2000 Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Y. Yang, T. Ault, T. Pierce, C.W. Lattimer, "Improving text categorization methods for event tracking," Proc. SIGIR Intl. Conf. Information Retrieval, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Y. Yang, T. Pierce, J.G. Carbonell, "A Study on Retrospective and On-line Event Detection," Proc. SIGIR Intl. Conf. Information Retrieval, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Bursty and hierarchical structure in streams

                  Recommendations

                  Comments

                  Login options

                  Check if you have access through your login credentials or your institution to get full access on this article.

                  Sign in
                  • Published in

                    cover image ACM Conferences
                    KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
                    July 2002
                    719 pages
                    ISBN:158113567X
                    DOI:10.1145/775047

                    Copyright © 2002 ACM

                    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                    Publisher

                    Association for Computing Machinery

                    New York, NY, United States

                    Publication History

                    • Published: 23 July 2002

                    Permissions

                    Request permissions about this article.

                    Request Permissions

                    Check for updates

                    Qualifiers

                    • Article

                    Acceptance Rates

                    KDD '02 Paper Acceptance Rate44of307submissions,14%Overall Acceptance Rate1,133of8,635submissions,13%

                    Upcoming Conference

                    KDD '24

                  PDF Format

                  View or Download as a PDF file.

                  PDF

                  eReader

                  View online with eReader.

                  eReader