article

Mining data streams under block evolution

Authors:
Venkatesh Ganti

Microsoft Research

Microsoft Research
View Profile

,
Johannes Gehrke

Cornell University

Cornell University
View Profile

,
Raghu Ramakrishnan

UW-Madison

UW-Madison
View Profile

Authors Info & Claims

ACM SIGKDD Explorations Newsletter Volume 3 Issue 2January 2002pp 1–10https://doi.org/10.1145/507515.507517

Published:01 January 2002Publication History

ACM SIGKDD Explorations Newsletter

Abstract

In this paper we survey recent work on incremental data mining model maintenance and change detection under block evolution. In block evolution, a dataset is updated periodically through insertions and deletions of blocks of records at a time. We describe two techniques: (1) We describe a generic algorithm for model maintenance that takes any traditional incremental data mining model maintenance algorithm and transforms it into an algorithm that allows restrictions on a temporal subset of the database. (2) We also describe a generic framework for change detection, that quantifies the difference between two datasets in terms of the data mining models they induce.

References

A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors. VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt. Morgan Kaufmann, 2000.]] Google ScholarDigital Library
S. Acharya, P. B. Gibbons, and V. Poosala. Aqua: A fast decision support systems using approximate query answers. In Atkinson et al. {14}, pages 754-757.]] Google Scholar
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The aqua approximate query answering system. In Delis et al. {29}, pages 574-576.]] Google Scholar
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In Delis et al. {29}, pages 275-286.]] Google Scholar
R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast Discovery of Association Rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307-328. AAAI/MIT Press, 1996.]] Google ScholarDigital Library
R. Agrawal and G. Psaila. Acive data mining. Proceedings of the first international conference on knowledge discovery and data mining, 1995.]]Google ScholarDigital Library
R. Agrawal and A. Swami. A one-pass space-efficient algorithm for finding quantiles. In S. Chaudhuri, A. Deshpande, and R. Krishnamurthy, editors, Proceedings of the 7th International Conference on Management of Data (COMAD), December 1995.]]Google Scholar
M. Akinde, D. Chatziantoniou, T. Johnson, and S. Kim. The MD-join: An operator for complex OLAP. In Proceedings of the IEEE International Conference on Data Engineering, 2001.]] Google ScholarDigital Library
Alon, Matias, and Szegedy. The space complexity of approximating the frequency moments. JCSS: Journal of Computer and System Sciences, 58, 1999.]] Google ScholarDigital Library
N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 31 - June 2, 1999, Philadelphia, Pennsylvania, pages 10-20. ACM Press, 1999.]] Google ScholarDigital Library
K. Alsabti, S. Ranka, and V. Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, editors, VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 346-355. Morgan Kaufmann, 1997.]] Google ScholarDigital Library
M. Altinel and M. J. Franklin. Efficient filtering of xml documents for selective dissemination of information. In Abbadi et al. {1}, pages 53-64.]] Google Scholar
T. W. Anderson. The statistical analysis of time series. John Wiley & Sons, Inc., 1971.]]Google Scholar
M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie, editors. VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann, 1999.]] Google ScholarDigital Library
R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 261-272. ACM, 2000.]] Google ScholarDigital Library
D. Barbará, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. E. Ioannidis, H. V. Jagadish, T. Johnson, R. T. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The new jersey data reduction report. Data Engineering Bulletin, 20(4):3-45, 1997.]]Google Scholar
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.]]Google Scholar
S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. In Proceedings of the 24th International Conference on Very Large Databases, pages 606-617, August 1998.]] Google ScholarDigital Library
D. Chatziantoniou. Ad hoc OLAP: Expression and evaluation. In Proceedings of the IEEE International Conference on Data Engineering, 1999.]] Google ScholarDigital Library
D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295-306, 1996.]] Google ScholarDigital Library
S. Chaudhuri, U. Dayal, and V. Ganti. Database technology for decision support systems. In IEEE Computer, volume 34, pages 48-55, December 2001.]] Google ScholarDigital Library
J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. Niagracq: A scalable continuous query system for internet databases. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, volume 29, pages 379-390. ACM, 2000.]] Google ScholarDigital Library
Z. Chen, J. Gehrke, and F. Korn. Query optimization in compressed database systems. In SIGMOD 2001, Proceedings ACM SIGMOD International Conference on Management of Data, 2001, Santa Barbara, California, USA. ACM Press, 2001.]] Google ScholarDigital Library
D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In Proceedings of the twelfth international conference on data engineering (ICDE), February 1996.]] Google ScholarDigital Library
D. Cheung, S. Lee, and B. Kao. A general incremental technique for maintaining discovered association rules. In Proceedings of the fifth DASFAA Conference, April 1997.]] Google ScholarDigital Library
D. Cheung, T. Vincent, and W. Benjamin. Maintenance of discovered knowledge: A case in multi-level association rules. In Proceedings of the second international conference on knowledge discovery in databases, August 1996.]]Google Scholar
C. Cortes, K. Fisher, D. Pregibon, and A. Rogers. Hancock: a language for extracting signatures from data streams. pages 9-17. ACM Press, 2000.]] Google ScholarDigital Library
R. B. D'Agostino and M. A. Stephens. Goodness-of-fit techniques. New York: M.Dekker, 1986.]] Google ScholarDigital Library
A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors. SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, Pennsylvania, USA. ACM Press, 1999.]]Google Scholar
P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, pages 71-80, Boston, MA, August 2000. ACM.]] Google ScholarDigital Library
M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. In Proceedings of the 24th International Conference on Very Large Databases, pages 323-333, August 1998.]] Google ScholarDigital Library
M. Ester, H.-P. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In Proc. of the 1st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August 1995.]]Google Scholar
Feigenbaum, Kannan, Strauss, and Viswanathan. An approximate L1-difference algorithm for massive data streams (extended abstract). In FOCS: IEEE Symposium on Foundations of Computer Science (FOCS), 1999.]] Google ScholarDigital Library
J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. Testing and spot-checking of data streams. In Proceedings of the 11th ACM-SIAM Symposium on Discrete Algorithms, 2000.]] Google ScholarDigital Library
Fong and Strauss. An approximate Lp difference algorithm for massive data streams. In STACS: Annual Symposium on Theoretical Aspects of Computer Science, 2000.]] Google ScholarDigital Library
V. Ganti, J. Gehrke, and R. Ramakrishnan. Demon: Mining and monitoring evolving data. IEEE Transactions on knowledge and data engineering, 13(1):50-63, January-Febraury 2001.]] Google ScholarDigital Library
V. Ganti, J. Gehrke, R. Ramakrishnan, and W.-Y. Loh. A framework for measuring changes in data characteristics. In Proceedings of the 18th Symposium on Principles of Database Systems, 1999.]] Google ScholarDigital Library
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. Boat-optimistic decision tree construction. In Delis et al. {29}, pages 169-180.]] Google Scholar
P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, Athens Greece, pages 466-475, Aug. 1997.]] Google ScholarDigital Library
P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Haas and Tiwary {44}, pages 331-342.]] Google Scholar
P. B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909-910, N.Y., Jan. 17-19 1999. ACM-SIAM.]] Google ScholarDigital Library
M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD 2001, Proceedings ACM SIGMOD International Conference on Management of Data, 2001, Santa Barbara, California, USA. ACM Press, 2001.]] Google ScholarDigital Library
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.]] Google ScholarDigital Library
L. M. Haas and A. Tiwary, editors. SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. ACM Press, 1998.]] Google ScholarCross Ref
P. J. Haas and J. M. Hellerstein. Ripple joins for online aggregation. In Delis et al. {29}, pages 287-298.]] Google Scholar
J. M. Hellerstein, P. J. Haas, and H. Wang. Online aggregation. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 171-182. ACM Press, 1997.]] Google ScholarDigital Library
M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Eqipment Corporation, Systems Research Center, May, 1998.]]Google Scholar
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pages 97-106, San Francisco, CA, 2001.]] Google ScholarDigital Library
F. Kilander and C. G. Jansson. COBBIT - A control procedure for COBWEB in the presence of concept drift. In P. B. Brazdil, editor, Proceedings of the European Conference on Machine Learning (ECML-93), volume 667 of LNAI, pages 244-261, Vienna, Austria, Apr. 1993. Springer Verlag.]] Google ScholarDigital Library
M. Klenner and U. Hahn. Concept versioning: A methodology for tracking evolutionary concept drift in dynamic concept systems. In A. G. Cohn, editor, Proceedings of the Eleventh European Conference on Artificial Intelligence, pages 473-477, Chichester, Aug. 8-12 1994. John Wiley and Sons.]]Google Scholar
L. Liu, C. Pu, W. Tang, D. Buttler, J. Biggs, T. Zhou, P. Benninghoff, W. Han, and F. Yu. Cq: A personalized update monitoring toolkit. In Haas and Tiwary {44}, pages 547-549.]] Google Scholar
G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In Haas and Tiwary {44}, pages 426-435.]] Google Scholar
G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Delis et al. {29}, pages 251-262.]] Google Scholar
L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. High-performance clustering of streams and large data sets. In Proceedings of the 18th International Conference on Data Engineering, 2002.]]Google ScholarCross Ref
C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In Abbadi et al. {1}, pages 144-155.]] Google Scholar
V. Raman, B. Raman, and J. M. Hellerstein. Online dynamic reordering for interactive data processing. In Atkinson et al. {14}, pages 709-720.]] Google Scholar
A. Silbershatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6), 1996.]] Google ScholarDigital Library
D. B. Terry, D. Goldberg, D. Nichols, and B. M. Oki. Continuous queries over append-only databases. In M. Stonebraker, editor, Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, San Diego, California, June 2-5, 1992, pages 321-330. ACM Press, 1992.]] Google ScholarDigital Library
P. Utgoff. ID5: An incremental ID3. In Proceedings of the Fifth International Conference on Machine Learning, pages 107-120. Morgan Kaufmann, 1988.]]Google Scholar
G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69-101, 1996.]] Google ScholarDigital Library
P. Willett. Recent trends in hierarchical document clustering: A critical review. Information Processing and management, 24(5):577-597, 1988.]] Google ScholarDigital Library
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2), 1997.]] Google ScholarDigital Library

Index Terms

Mining data streams under block evolution
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Recommendations

Mining of Frequent Itemsets from Streams of Uncertain Data
ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering

Frequent itemset mining plays an essential role in the mining of various patterns and is in demand in many real-life applications. Hence, mining of frequent itemsets has been the subject of numerous studies since its introduction. Generally, most of ...
Read More
Mining Recent Frequent Itemsets in Data Streams
FSKD '08: Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 04

Mining frequent itemsets in data streams is a hot research topic in recent years. Due to the continuous, high-speed and unbounded properties of data streams, traditional algorithms on static dataset are not suitable for mining in data streams. In this ...
Read More
Mining maximal frequent itemsets from data streams

Frequent pattern mining from data streams is an active research topic in data mining. Existing research efforts often rely on a two-phase framework to discover frequent patterns: (1) using internal data structures to store meta-patterns obtained by ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM SIGKDD Explorations Newsletter Volume 3, Issue 2
January 2002
81 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/507515
Issue’s Table of Contents

Copyright © 2002 Authors
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 January 2002
Check for updates
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 87
  Total Citations
  View Citations
- 1,216
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Mining data streams under block evolution

ACM SIGKDD Explorations Newsletter

Abstract

References

Cited By

Index Terms

Recommendations

Mining of Frequent Itemsets from Streams of Uncertain Data

Mining Recent Frequent Itemsets in Data Streams

Mining maximal frequent itemsets from data streams

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Check for updates

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Mining data streams under block evolution

ACM SIGKDD Explorations Newsletter

Abstract

References

Cited By

Index Terms

Recommendations

Mining of Frequent Itemsets from Streams of Uncertain Data

Mining Recent Frequent Itemsets in Data Streams

Mining maximal frequent itemsets from data streams

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Check for updates

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media