Article

A new two-phase sampling based algorithm for discovering association rules

Authors:
Bin Chen

Exelixis, Inc., S. San Francisco, CA

Exelixis, Inc., S. San Francisco, CA
View Profile

,
Peter Haas

IBM Almaden Research Ctr, San Jose, CA

IBM Almaden Research Ctr, San Jose, CA
View Profile

,
Peter Scheuermann

Northwestern University, Evanston, IL

Northwestern University, Evanston, IL
View Profile

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data miningJuly 2002Pages 462–468https://doi.org/10.1145/775047.775114

Published:23 July 2002Publication History

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 462–468

ABSTRACT

This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90--95% accuracy using a final sample having a size of only 15--33% of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database --- even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.

References

S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In Proc. Proc. 2000 ACM SIGMOD, 2000. Google ScholarDigital Library
R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns. In Proc. Sixth ACM SIGKDD, 2000. Google ScholarDigital Library
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM SIGMOD, 1993. Google ScholarDigital Library
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 20th VLDB, 1994. Google ScholarDigital Library
R. J. Bayardo. Efficiently mining large patterns from databases. In Proc. 1998 ACM SIGMOD, 1998. Google ScholarDigital Library
V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, New York, 1994.Google Scholar
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. 1997 AGM SIGMOD, 1997. Google ScholarDigital Library
S. Choudhuri, M. Datar, R. Motwani, V. Narasayya. Overcoming limitations of sampling for aggregation queries. In Proc. 17th ICDE, 2001. Google ScholarDigital Library
J. Ernvall and O. Nevalainen. An algorithm for unbiased random sampling. Comput. J., 25:45--47, 1982.Google Scholar
P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. VLDB Conference., 2001. Google ScholarDigital Library
P. J. Haas. Techniques for Online Exploration of Large Object-Relational Datasets. In Proc. 11th Intl. Conf. Scientific and Statistical Database Management, pages 4--12. IEEE Press, 1999. Google ScholarDigital Library
Lisa Hellerstein. Proof of the NP-completeness of FAST. Private Correspondence, 2000.Google Scholar
C. Hidber. Online association rule mining. In Proc. ACM SIGMOD, 1999. Google ScholarDigital Library
Jiawei Han, Jian Pei and Yiwen Yin. Mining frequent patterns without candidate generation. In Proc. ACM SIGMOD, 2000. Google ScholarDigital Library
P. J. Huber. Robust Statistics. Wiley, New York, 1981.Google Scholar
G. H. John and P. Langley. Static versus dynamic sampling for data mining. In Proc. Second Intl. Conf. Knowledge Discovery and Data Mining, pages 367--370. AAAI Press, 1996.Google ScholarDigital Library
J. Kivinen and H. Mannila. The power of sampling in knowledge discovery. In Proc. Thirteenth ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Sys., pages 77--85. ACM Press, 1994. Google ScholarDigital Library
R. G. Miller. Beyond ANOVA: Basics of Applied Statistics. Wiley, New York, 1986.Google Scholar
F. Olken. Random Sampling from Databases. Ph.D. Dissertation, University of California, Berkeley, CA, 1993. Available as Tech. Report LBL-32883, Lawrence Berkeley Laboratories, Berkeley, CA.Google Scholar
H. Toivonen. Sampling large databases for association rules. In Proc. 22nd VLDB, 1996. Google ScholarDigital Library
R. Winter and K. Auerbach. The big time: 1998 Winter VLDB survey. Database Programming Design, August, 1998.Google Scholar
M. J. Zaki, S. Parthasarathy, W. Lin, and M. Ogihara. Evaluation of sampling for data mining of association rules. Technical Report 617, University of Rochester, Rochester, NY, 1996. Google ScholarDigital Library

Index Terms

Recommendations

Discovering association rules change from large databases
AICI'11: Proceedings of the Third international conference on Artificial intelligence and computational intelligence - Volume Part I

Discovering association rules and association rules change (ARC) from existing large databases is an important problem. This paper presents an approach based on multi-hash chain structures to mine association rules change from large database with ...
Read More
Progressive sampling for association rules based on sampling error estimation
PAKDD'05: Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining

We explore in this paper a progressive sampling algorithm, called Sampling Error Estimation (SEE), which aims to identify an appropriate sample size for mining association rules. SEE has two advantages over previous works in the literature. First, SEE ...
Read More
A sampling based algorithm for finding association rules from uncertain data
AICI'10: Proceedings of the 2010 international conference on Artificial intelligence and computational intelligence: Part I

Since there are many real-life situations in which people are uncertain about the content of transactions, association rule mining with uncertain data is in demand. Most of these studies focus on the improvement of classical algorithms for frequent ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN:158113567X
DOI:10.1145/775047
Conference Chair:
Osmar R. Zaïane
University of Alberta, Canada
,
General Chair:
Randy Goebel
University of Alberta, Canada
,
Program Chairs:
David Hand
Imperial College, UK
,
Daniel Keim
AT&T
,
Raymond Ng
University of British Columbia, Canada
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 July 2002
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
KDD '02 Paper Acceptance Rate44of307submissions,14%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 86
  Total Citations
  View Citations
- 1,009
  Total Downloads
- Downloads (Last 12 months)9
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A new two-phase sampling based algorithm for discovering association rules

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

Discovering association rules change from large databases

Progressive sampling for association rules based on sampling error estimation

A sampling based algorithm for finding association rules from uncertain data