skip to main content
10.1145/775047.775114acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

A new two-phase sampling based algorithm for discovering association rules

Published:23 July 2002Publication History

ABSTRACT

This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90--95% accuracy using a final sample having a size of only 15--33% of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database --- even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.

References

  1. S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In Proc. Proc. 2000 ACM SIGMOD, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns. In Proc. Sixth ACM SIGKDD, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM SIGMOD, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 20th VLDB, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. J. Bayardo. Efficiently mining large patterns from databases. In Proc. 1998 ACM SIGMOD, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, New York, 1994.Google ScholarGoogle Scholar
  7. S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. 1997 AGM SIGMOD, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Choudhuri, M. Datar, R. Motwani, V. Narasayya. Overcoming limitations of sampling for aggregation queries. In Proc. 17th ICDE, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Ernvall and O. Nevalainen. An algorithm for unbiased random sampling. Comput. J., 25:45--47, 1982.Google ScholarGoogle Scholar
  10. P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. VLDB Conference., 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. P. J. Haas. Techniques for Online Exploration of Large Object-Relational Datasets. In Proc. 11th Intl. Conf. Scientific and Statistical Database Management, pages 4--12. IEEE Press, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Lisa Hellerstein. Proof of the NP-completeness of FAST. Private Correspondence, 2000.Google ScholarGoogle Scholar
  13. C. Hidber. Online association rule mining. In Proc. ACM SIGMOD, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Jiawei Han, Jian Pei and Yiwen Yin. Mining frequent patterns without candidate generation. In Proc. ACM SIGMOD, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. J. Huber. Robust Statistics. Wiley, New York, 1981.Google ScholarGoogle Scholar
  16. G. H. John and P. Langley. Static versus dynamic sampling for data mining. In Proc. Second Intl. Conf. Knowledge Discovery and Data Mining, pages 367--370. AAAI Press, 1996.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. Kivinen and H. Mannila. The power of sampling in knowledge discovery. In Proc. Thirteenth ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Sys., pages 77--85. ACM Press, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. R. G. Miller. Beyond ANOVA: Basics of Applied Statistics. Wiley, New York, 1986.Google ScholarGoogle Scholar
  19. F. Olken. Random Sampling from Databases. Ph.D. Dissertation, University of California, Berkeley, CA, 1993. Available as Tech. Report LBL-32883, Lawrence Berkeley Laboratories, Berkeley, CA.Google ScholarGoogle Scholar
  20. H. Toivonen. Sampling large databases for association rules. In Proc. 22nd VLDB, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. R. Winter and K. Auerbach. The big time: 1998 Winter VLDB survey. Database Programming Design, August, 1998.Google ScholarGoogle Scholar
  22. M. J. Zaki, S. Parthasarathy, W. Lin, and M. Ogihara. Evaluation of sampling for data mining of association rules. Technical Report 617, University of Rochester, Rochester, NY, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A new two-phase sampling based algorithm for discovering association rules

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in
              • Published in

                cover image ACM Conferences
                KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
                July 2002
                719 pages
                ISBN:158113567X
                DOI:10.1145/775047

                Copyright © 2002 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 23 July 2002

                Permissions

                Request permissions about this article.

                Request Permissions

                Check for updates

                Qualifiers

                • Article

                Acceptance Rates

                KDD '02 Paper Acceptance Rate44of307submissions,14%Overall Acceptance Rate1,133of8,635submissions,13%

                Upcoming Conference

                KDD '24

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader