skip to main content
article

Mining data streams under block evolution

Published:01 January 2002Publication History
Skip Abstract Section

Abstract

In this paper we survey recent work on incremental data mining model maintenance and change detection under block evolution. In block evolution, a dataset is updated periodically through insertions and deletions of blocks of records at a time. We describe two techniques: (1) We describe a generic algorithm for model maintenance that takes any traditional incremental data mining model maintenance algorithm and transforms it into an algorithm that allows restrictions on a temporal subset of the database. (2) We also describe a generic framework for change detection, that quantifies the difference between two datasets in terms of the data mining models they induce.

References

  1. A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors. VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt. Morgan Kaufmann, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Acharya, P. B. Gibbons, and V. Poosala. Aqua: A fast decision support systems using approximate query answers. In Atkinson et al. {14}, pages 754-757.]] Google ScholarGoogle Scholar
  3. S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The aqua approximate query answering system. In Delis et al. {29}, pages 574-576.]] Google ScholarGoogle Scholar
  4. S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In Delis et al. {29}, pages 275-286.]] Google ScholarGoogle Scholar
  5. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast Discovery of Association Rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307-328. AAAI/MIT Press, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. R. Agrawal and G. Psaila. Acive data mining. Proceedings of the first international conference on knowledge discovery and data mining, 1995.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. R. Agrawal and A. Swami. A one-pass space-efficient algorithm for finding quantiles. In S. Chaudhuri, A. Deshpande, and R. Krishnamurthy, editors, Proceedings of the 7th International Conference on Management of Data (COMAD), December 1995.]]Google ScholarGoogle Scholar
  8. M. Akinde, D. Chatziantoniou, T. Johnson, and S. Kim. The MD-join: An operator for complex OLAP. In Proceedings of the IEEE International Conference on Data Engineering, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Alon, Matias, and Szegedy. The space complexity of approximating the frequency moments. JCSS: Journal of Computer and System Sciences, 58, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 31 - June 2, 1999, Philadelphia, Pennsylvania, pages 10-20. ACM Press, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. K. Alsabti, S. Ranka, and V. Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, editors, VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 346-355. Morgan Kaufmann, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Altinel and M. J. Franklin. Efficient filtering of xml documents for selective dissemination of information. In Abbadi et al. {1}, pages 53-64.]] Google ScholarGoogle Scholar
  13. T. W. Anderson. The statistical analysis of time series. John Wiley & Sons, Inc., 1971.]]Google ScholarGoogle Scholar
  14. M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie, editors. VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 261-272. ACM, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Barbará, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. E. Ioannidis, H. V. Jagadish, T. Johnson, R. T. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The new jersey data reduction report. Data Engineering Bulletin, 20(4):3-45, 1997.]]Google ScholarGoogle Scholar
  17. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.]]Google ScholarGoogle Scholar
  18. S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. In Proceedings of the 24th International Conference on Very Large Databases, pages 606-617, August 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. D. Chatziantoniou. Ad hoc OLAP: Expression and evaluation. In Proceedings of the IEEE International Conference on Data Engineering, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295-306, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Chaudhuri, U. Dayal, and V. Ganti. Database technology for decision support systems. In IEEE Computer, volume 34, pages 48-55, December 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. Niagracq: A scalable continuous query system for internet databases. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, volume 29, pages 379-390. ACM, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Z. Chen, J. Gehrke, and F. Korn. Query optimization in compressed database systems. In SIGMOD 2001, Proceedings ACM SIGMOD International Conference on Management of Data, 2001, Santa Barbara, California, USA. ACM Press, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In Proceedings of the twelfth international conference on data engineering (ICDE), February 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. D. Cheung, S. Lee, and B. Kao. A general incremental technique for maintaining discovered association rules. In Proceedings of the fifth DASFAA Conference, April 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. D. Cheung, T. Vincent, and W. Benjamin. Maintenance of discovered knowledge: A case in multi-level association rules. In Proceedings of the second international conference on knowledge discovery in databases, August 1996.]]Google ScholarGoogle Scholar
  27. C. Cortes, K. Fisher, D. Pregibon, and A. Rogers. Hancock: a language for extracting signatures from data streams. pages 9-17. ACM Press, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. R. B. D'Agostino and M. A. Stephens. Goodness-of-fit techniques. New York: M.Dekker, 1986.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors. SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, Pennsylvania, USA. ACM Press, 1999.]]Google ScholarGoogle Scholar
  30. P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, pages 71-80, Boston, MA, August 2000. ACM.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. In Proceedings of the 24th International Conference on Very Large Databases, pages 323-333, August 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. M. Ester, H.-P. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In Proc. of the 1st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August 1995.]]Google ScholarGoogle Scholar
  33. Feigenbaum, Kannan, Strauss, and Viswanathan. An approximate L1-difference algorithm for massive data streams (extended abstract). In FOCS: IEEE Symposium on Foundations of Computer Science (FOCS), 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. Testing and spot-checking of data streams. In Proceedings of the 11th ACM-SIAM Symposium on Discrete Algorithms, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Fong and Strauss. An approximate Lp difference algorithm for massive data streams. In STACS: Annual Symposium on Theoretical Aspects of Computer Science, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. V. Ganti, J. Gehrke, and R. Ramakrishnan. Demon: Mining and monitoring evolving data. IEEE Transactions on knowledge and data engineering, 13(1):50-63, January-Febraury 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. V. Ganti, J. Gehrke, R. Ramakrishnan, and W.-Y. Loh. A framework for measuring changes in data characteristics. In Proceedings of the 18th Symposium on Principles of Database Systems, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. Boat-optimistic decision tree construction. In Delis et al. {29}, pages 169-180.]] Google ScholarGoogle Scholar
  39. P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, Athens Greece, pages 466-475, Aug. 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Haas and Tiwary {44}, pages 331-342.]] Google ScholarGoogle Scholar
  41. P. B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909-910, N.Y., Jan. 17-19 1999. ACM-SIAM.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD 2001, Proceedings ACM SIGMOD International Conference on Management of Data, 2001, Santa Barbara, California, USA. ACM Press, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. L. M. Haas and A. Tiwary, editors. SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. ACM Press, 1998.]] Google ScholarGoogle ScholarCross RefCross Ref
  45. P. J. Haas and J. M. Hellerstein. Ripple joins for online aggregation. In Delis et al. {29}, pages 287-298.]] Google ScholarGoogle Scholar
  46. J. M. Hellerstein, P. J. Haas, and H. Wang. Online aggregation. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 171-182. ACM Press, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Eqipment Corporation, Systems Research Center, May, 1998.]]Google ScholarGoogle Scholar
  48. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pages 97-106, San Francisco, CA, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. F. Kilander and C. G. Jansson. COBBIT - A control procedure for COBWEB in the presence of concept drift. In P. B. Brazdil, editor, Proceedings of the European Conference on Machine Learning (ECML-93), volume 667 of LNAI, pages 244-261, Vienna, Austria, Apr. 1993. Springer Verlag.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. M. Klenner and U. Hahn. Concept versioning: A methodology for tracking evolutionary concept drift in dynamic concept systems. In A. G. Cohn, editor, Proceedings of the Eleventh European Conference on Artificial Intelligence, pages 473-477, Chichester, Aug. 8-12 1994. John Wiley and Sons.]]Google ScholarGoogle Scholar
  51. L. Liu, C. Pu, W. Tang, D. Buttler, J. Biggs, T. Zhou, P. Benninghoff, W. Han, and F. Yu. Cq: A personalized update monitoring toolkit. In Haas and Tiwary {44}, pages 547-549.]] Google ScholarGoogle Scholar
  52. G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In Haas and Tiwary {44}, pages 426-435.]] Google ScholarGoogle Scholar
  53. G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Delis et al. {29}, pages 251-262.]] Google ScholarGoogle Scholar
  54. L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. High-performance clustering of streams and large data sets. In Proceedings of the 18th International Conference on Data Engineering, 2002.]]Google ScholarGoogle ScholarCross RefCross Ref
  55. C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In Abbadi et al. {1}, pages 144-155.]] Google ScholarGoogle Scholar
  56. V. Raman, B. Raman, and J. M. Hellerstein. Online dynamic reordering for interactive data processing. In Atkinson et al. {14}, pages 709-720.]] Google ScholarGoogle Scholar
  57. A. Silbershatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6), 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. D. B. Terry, D. Goldberg, D. Nichols, and B. M. Oki. Continuous queries over append-only databases. In M. Stonebraker, editor, Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, San Diego, California, June 2-5, 1992, pages 321-330. ACM Press, 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. P. Utgoff. ID5: An incremental ID3. In Proceedings of the Fifth International Conference on Machine Learning, pages 107-120. Morgan Kaufmann, 1988.]]Google ScholarGoogle Scholar
  60. G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69-101, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. P. Willett. Recent trends in hierarchical document clustering: A critical review. Information Processing and management, 24(5):577-597, 1988.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2), 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Mining data streams under block evolution
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGKDD Explorations Newsletter
        ACM SIGKDD Explorations Newsletter  Volume 3, Issue 2
        January 2002
        81 pages
        ISSN:1931-0145
        EISSN:1931-0153
        DOI:10.1145/507515
        Issue’s Table of Contents

        Copyright © 2002 Authors

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 January 2002

        Check for updates

        Qualifiers

        • article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader