research-article

PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

Authors:
Matteo Riondato

Brown University, Providence, RI, USA

Brown University, Providence, RI, USA
View Profile

,
Justin A. DeBrabant

Brown University, Providence, RI, USA

Brown University, Providence, RI, USA
View Profile

,
Rodrigo Fonseca

Brown University, Providence, RI, USA

Brown University, Providence, RI, USA
View Profile

,
Eli Upfal

Brown University, Providence, RI, USA

Brown University, Providence, RI, USA
View Profile

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge managementOctober 2012Pages 85–94https://doi.org/10.1145/2396761.2396776

Published:29 October 2012Publication History

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

Pages 85–94

ABSTRACT

Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (as number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent from the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FI's) or Association Rules (AR's) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent from the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.

References

R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207--216, June 1993. Google ScholarDigital Library
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 487--499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
Apache Foundation. Apache mahout. http://mahout.apache.org/.Google Scholar
G. Buehrer, S. Parthasarathy, S. Tatikonda, T. Kurc, and J. Saltz. Toward terabyte pattern mining: an architecture-conscious solution. In Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '07, pages 2--12, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in Map-Reduce. In Proceedings of the 19th international conference on World wide web, WWW '10, pages 231--240, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4--7, 2006, pages 281--288. MIT Press, 2007.Google Scholar
F. Coenen. The LUCS-KDD FP-growth association rule mining algorithm.Google Scholar
S. Cong, J. Han, J. Hoeflinger, and D. Padua. A sampling-based framework for parallel data mining. In Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '05, pages 255--265, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
L. Cristofor. ARTool. http://www.cs.umb.edu/ laur/ARtool/, 2006.Google Scholar
J.-D. Cryans, S. Ratté, and R. Champagne. Adaptation of APriori to MapReduce to build a warehouse of relations between named entities across the web. In Advances in Databases Knowledge and Data Applications (DBKDA), 2010 Second International Conference on, pages 185--189, april 2010. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008. Google ScholarDigital Library
M. El-Hajj and O. Zaiane. Parallel leap: large-scale maximal pattern mining in a distributed environment. In Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on, volume 1, page 8 pp., 0-0 2006. Google ScholarDigital Library
W. Fang, K. K. Lau, M. Lu, X. Xiao, C. K. Lam, Y. Yang, B. He, Q. Luo, P. V. Sander, and K. Yang. Parallel data mining on graphics processors. Technical Report 07, The Hong Kong University of Science & Technology, 2008.Google Scholar
A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan. NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 334--342, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the MapReduce framework. CoRR, abs/1101.1902, 2011.Google Scholar
S. Hammoud. MapReduce Network Enabled Algorithms for Classification Based on Association Rules. PhD thesis, Brunel University, 2011.Google Scholar
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29:1--12, May 2000. Google ScholarDigital Library
H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang. PFP: Parallel FP-Growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems, RecSys '08, pages 107--114, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
L. Li and M. Zhang. The strategy of mining association rule based on cloud computing. In Business Computing and Global Informatization (BCGIN), 2011 International Conference on, pages 475--478, july 2011. Google ScholarDigital Library
Y. Li and R. Gopalan. Effective sampling for mining association rules. In G. Webb and X. Yu, editors, AI 2004: Advances in Artificial Intelligence, volume 3339 of Lecture Notes in Computer Science, pages 73--75. Springer, Berlin / Heidelberg, 2005. Google ScholarDigital Library
J. Lin and M. Schatz. Design patterns for efficient graph algorithms in mapreduce. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, MLG '10, pages 78--85, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
L. Liu, E. Li, Y. Zhang, and Z. Tang. Optimization of frequent itemset mining on multiple-core processor. In Proceedings of the 33rd international conference on Very large data bases, VLDB '07, pages 1275--1285. VLDB Endowment, 2007. Google ScholarDigital Library
H. Mannila, H. Toivonen, and I. Verkamo. Efficient algorithms for discovering association rules. In KDD Workshop, pages 181--192, Menlo Park, CA, USA, 1994. The AAAI Press.Google Scholar
M. Mitzenmacher and E. Upfal. Probability and computing - randomized algorithms and probabilistic analysis. Cambridge University Press, 2005. Google ScholarDigital Library
E. Ozkural, B. Ucar, and C. Aykanat. Parallel frequent item set mining with selective item replication. Parallel and Distributed Systems, IEEE Transactions on, 22(10):1632--1640, oct. 2011. Google ScholarDigital Library
S. Parthasarathy. Efficient progressive sampling for association rules. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM '02, pages 354--361. IEEE Computer Society, 2002. Google ScholarDigital Library
A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-round tradeoffs for MapReduce computations. CoRR, abs/1111.2228, 2011. Google ScholarDigital Library
A. Pietracaprina, M. Riondato, E. Upfal, and F. Vandin. Mining top-K frequent itemsets through progressive sampling. Data Mining and Knowledge Discovery, 21:310--326, 2010. Google ScholarDigital Library
M. Riondato and E. Upfal. Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. CoRR, abs/1111.6937, November 2011.Google Scholar
J. Ruoming, Y. Ge, and G. Agrawal. Shared memory parallelization of data mining algorithms: techniques, programming interface, and performance. Knowledge and Data Engineering, IEEE Transactions on, 17(1):71 -- 89, jan. 2005. Google ScholarDigital Library
N. V. Sahinidis and M. Tawarmalani. BARON 9.0.4: Global Optimization of Mixed-Integer Nonlinear Programs, User's Manual, 2010.Google Scholar
H. Toivonen. Sampling large databases for association rules. In Proceedings of the 22th International Conference on Very Large Data Bases, VLDB '96, pages 134--145, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
X. Y. Yang, Z. Liu, and Y. Fu. MapReduce as a programming model for association rules algorithm on Hadoop. In Information Sciences and Interaction Sciences (ICIS), 2010 3rd International Conference on, pages 99--102, june 2010.Google ScholarCross Ref
M. Zaki. Parallel and distributed association mining: a survey. Concurrency, IEEE, 7(4):14 --25, oct-dec 1999. Google ScholarDigital Library
M. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of sampling for data mining of association rules. In Proceedings of the Seventh International Workshop on Research Issues in Data Engineering, RIDE '97, pages 42--50. IEEE Computer Society, apr 1997. Google ScholarDigital Library
L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng. Balanced parallel FP-Growth with MapReduce. In Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference on, pages 243--246, nov. 2010.Google ScholarCross Ref

Index Terms

PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

An Efficient Hash-Based Method for Discovering the Maximal Frequent Set
COMPSAC '01: Proceedings of the 25th International Computer Software and Applications Conference on Invigorating Software Development

The association rule mining can be divided into two steps. The first step is to find out all frequent itemsets, whose occurrences are greater than or equal to the user-specified threshold. The second step is to generate reliable association rules based ...
Read More
A Novel Parallel Algorithm for Frequent Itemsets Mining in Large Transactional Databases
Advances in Data Mining. Applications and Theoretical Aspects
Abstract
Since the era of data explosion, data mining in large transactional databases has become more and more important. There are many data mining techniques like association rule mining, the most important and well-researched one. Furthermore, frequent ...
Read More
Interestingness measures for association rules: Combination between lattice and hash tables

There are many methods which have been developed for improving the time of mining frequent itemsets. However, the time for generating association rules were not put in deep research. In reality, if a database contains many frequent itemsets (from ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
October 2012
2840 pages
ISBN:9781450311564
DOI:10.1145/2396761
General Chair:
Xuewen Chen
Wayne State University, USA
,
Program Chairs:
Guy Lebanon
Georgia Institute of Technology
,
Haixun Wang
Microsoft Research Asia
,
Mohammed J. Zaki
Rensselaer Polytechnic Institute
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 29 October 2012
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
MapReduce
association rules
frequent itemsets
sampling
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,861of8,427submissions,22%
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 81
  Total Citations
  View Citations
- 903
  Total Downloads
- Downloads (Last 12 months)9
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

An Efficient Hash-Based Method for Discovering the Maximal Frequent Set

A Novel Parallel Algorithm for Frequent Itemsets Mining in Large Transactional Databases

Interestingness measures for association rules: Combination between lattice and hash tables

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

An Efficient Hash-Based Method for Discovering the Maximal Frequent Set

A Novel Parallel Algorithm for Frequent Itemsets Mining in Large Transactional Databases

Interestingness measures for association rules: Combination between lattice and hash tables

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media