ABSTRACT
Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (as number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent from the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FI's) or Association Rules (AR's) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent from the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.
- R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207--216, June 1993. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 487--499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- Apache Foundation. Apache mahout. http://mahout.apache.org/.Google Scholar
- G. Buehrer, S. Parthasarathy, S. Tatikonda, T. Kurc, and J. Saltz. Toward terabyte pattern mining: an architecture-conscious solution. In Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '07, pages 2--12, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in Map-Reduce. In Proceedings of the 19th international conference on World wide web, WWW '10, pages 231--240, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4--7, 2006, pages 281--288. MIT Press, 2007.Google Scholar
- F. Coenen. The LUCS-KDD FP-growth association rule mining algorithm.Google Scholar
- S. Cong, J. Han, J. Hoeflinger, and D. Padua. A sampling-based framework for parallel data mining. In Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '05, pages 255--265, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
- L. Cristofor. ARTool. http://www.cs.umb.edu/ laur/ARtool/, 2006.Google Scholar
- J.-D. Cryans, S. Ratté, and R. Champagne. Adaptation of APriori to MapReduce to build a warehouse of relations between named entities across the web. In Advances in Databases Knowledge and Data Applications (DBKDA), 2010 Second International Conference on, pages 185--189, april 2010. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008. Google ScholarDigital Library
- M. El-Hajj and O. Zaiane. Parallel leap: large-scale maximal pattern mining in a distributed environment. In Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on, volume 1, page 8 pp., 0-0 2006. Google ScholarDigital Library
- W. Fang, K. K. Lau, M. Lu, X. Xiao, C. K. Lam, Y. Yang, B. He, Q. Luo, P. V. Sander, and K. Yang. Parallel data mining on graphics processors. Technical Report 07, The Hong Kong University of Science & Technology, 2008.Google Scholar
- A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan. NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 334--342, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the MapReduce framework. CoRR, abs/1101.1902, 2011.Google Scholar
- S. Hammoud. MapReduce Network Enabled Algorithms for Classification Based on Association Rules. PhD thesis, Brunel University, 2011.Google Scholar
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29:1--12, May 2000. Google ScholarDigital Library
- H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang. PFP: Parallel FP-Growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems, RecSys '08, pages 107--114, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- L. Li and M. Zhang. The strategy of mining association rule based on cloud computing. In Business Computing and Global Informatization (BCGIN), 2011 International Conference on, pages 475--478, july 2011. Google ScholarDigital Library
- Y. Li and R. Gopalan. Effective sampling for mining association rules. In G. Webb and X. Yu, editors, AI 2004: Advances in Artificial Intelligence, volume 3339 of Lecture Notes in Computer Science, pages 73--75. Springer, Berlin / Heidelberg, 2005. Google ScholarDigital Library
- J. Lin and M. Schatz. Design patterns for efficient graph algorithms in mapreduce. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, MLG '10, pages 78--85, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- L. Liu, E. Li, Y. Zhang, and Z. Tang. Optimization of frequent itemset mining on multiple-core processor. In Proceedings of the 33rd international conference on Very large data bases, VLDB '07, pages 1275--1285. VLDB Endowment, 2007. Google ScholarDigital Library
- H. Mannila, H. Toivonen, and I. Verkamo. Efficient algorithms for discovering association rules. In KDD Workshop, pages 181--192, Menlo Park, CA, USA, 1994. The AAAI Press.Google Scholar
- M. Mitzenmacher and E. Upfal. Probability and computing - randomized algorithms and probabilistic analysis. Cambridge University Press, 2005. Google ScholarDigital Library
- E. Ozkural, B. Ucar, and C. Aykanat. Parallel frequent item set mining with selective item replication. Parallel and Distributed Systems, IEEE Transactions on, 22(10):1632--1640, oct. 2011. Google ScholarDigital Library
- S. Parthasarathy. Efficient progressive sampling for association rules. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM '02, pages 354--361. IEEE Computer Society, 2002. Google ScholarDigital Library
- A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-round tradeoffs for MapReduce computations. CoRR, abs/1111.2228, 2011. Google ScholarDigital Library
- A. Pietracaprina, M. Riondato, E. Upfal, and F. Vandin. Mining top-K frequent itemsets through progressive sampling. Data Mining and Knowledge Discovery, 21:310--326, 2010. Google ScholarDigital Library
- M. Riondato and E. Upfal. Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. CoRR, abs/1111.6937, November 2011.Google Scholar
- J. Ruoming, Y. Ge, and G. Agrawal. Shared memory parallelization of data mining algorithms: techniques, programming interface, and performance. Knowledge and Data Engineering, IEEE Transactions on, 17(1):71 -- 89, jan. 2005. Google ScholarDigital Library
- N. V. Sahinidis and M. Tawarmalani. BARON 9.0.4: Global Optimization of Mixed-Integer Nonlinear Programs, User's Manual, 2010.Google Scholar
- H. Toivonen. Sampling large databases for association rules. In Proceedings of the 22th International Conference on Very Large Data Bases, VLDB '96, pages 134--145, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- X. Y. Yang, Z. Liu, and Y. Fu. MapReduce as a programming model for association rules algorithm on Hadoop. In Information Sciences and Interaction Sciences (ICIS), 2010 3rd International Conference on, pages 99--102, june 2010.Google ScholarCross Ref
- M. Zaki. Parallel and distributed association mining: a survey. Concurrency, IEEE, 7(4):14 --25, oct-dec 1999. Google ScholarDigital Library
- M. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of sampling for data mining of association rules. In Proceedings of the Seventh International Workshop on Research Issues in Data Engineering, RIDE '97, pages 42--50. IEEE Computer Society, apr 1997. Google ScholarDigital Library
- L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng. Balanced parallel FP-Growth with MapReduce. In Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference on, pages 243--246, nov. 2010.Google ScholarCross Ref
Index Terms
- PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce
Recommendations
An Efficient Hash-Based Method for Discovering the Maximal Frequent Set
COMPSAC '01: Proceedings of the 25th International Computer Software and Applications Conference on Invigorating Software DevelopmentThe association rule mining can be divided into two steps. The first step is to find out all frequent itemsets, whose occurrences are greater than or equal to the user-specified threshold. The second step is to generate reliable association rules based ...
A Novel Parallel Algorithm for Frequent Itemsets Mining in Large Transactional Databases
Advances in Data Mining. Applications and Theoretical AspectsAbstractSince the era of data explosion, data mining in large transactional databases has become more and more important. There are many data mining techniques like association rule mining, the most important and well-researched one. Furthermore, frequent ...
Interestingness measures for association rules: Combination between lattice and hash tables
There are many methods which have been developed for improving the time of mining frequent itemsets. However, the time for generating association rules were not put in deep research. In reality, if a database contains many frequent itemsets (from ...
Comments