skip to main content
10.1145/2396761.2396776acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

Published:29 October 2012Publication History

ABSTRACT

Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (as number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent from the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FI's) or Association Rules (AR's) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent from the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.

References

  1. R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207--216, June 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 487--499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Apache Foundation. Apache mahout. http://mahout.apache.org/.Google ScholarGoogle Scholar
  4. G. Buehrer, S. Parthasarathy, S. Tatikonda, T. Kurc, and J. Saltz. Toward terabyte pattern mining: an architecture-conscious solution. In Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '07, pages 2--12, New York, NY, USA, 2007. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in Map-Reduce. In Proceedings of the 19th international conference on World wide web, WWW '10, pages 231--240, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4--7, 2006, pages 281--288. MIT Press, 2007.Google ScholarGoogle Scholar
  7. F. Coenen. The LUCS-KDD FP-growth association rule mining algorithm.Google ScholarGoogle Scholar
  8. S. Cong, J. Han, J. Hoeflinger, and D. Padua. A sampling-based framework for parallel data mining. In Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '05, pages 255--265, New York, NY, USA, 2005. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. L. Cristofor. ARTool. http://www.cs.umb.edu/ laur/ARtool/, 2006.Google ScholarGoogle Scholar
  10. J.-D. Cryans, S. Ratté, and R. Champagne. Adaptation of APriori to MapReduce to build a warehouse of relations between named entities across the web. In Advances in Databases Knowledge and Data Applications (DBKDA), 2010 Second International Conference on, pages 185--189, april 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. El-Hajj and O. Zaiane. Parallel leap: large-scale maximal pattern mining in a distributed environment. In Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on, volume 1, page 8 pp., 0-0 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. W. Fang, K. K. Lau, M. Lu, X. Xiao, C. K. Lam, Y. Yang, B. He, Q. Luo, P. V. Sander, and K. Yang. Parallel data mining on graphics processors. Technical Report 07, The Hong Kong University of Science & Technology, 2008.Google ScholarGoogle Scholar
  14. A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan. NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 334--342, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the MapReduce framework. CoRR, abs/1101.1902, 2011.Google ScholarGoogle Scholar
  16. S. Hammoud. MapReduce Network Enabled Algorithms for Classification Based on Association Rules. PhD thesis, Brunel University, 2011.Google ScholarGoogle Scholar
  17. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29:1--12, May 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang. PFP: Parallel FP-Growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems, RecSys '08, pages 107--114, New York, NY, USA, 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. L. Li and M. Zhang. The strategy of mining association rule based on cloud computing. In Business Computing and Global Informatization (BCGIN), 2011 International Conference on, pages 475--478, july 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Y. Li and R. Gopalan. Effective sampling for mining association rules. In G. Webb and X. Yu, editors, AI 2004: Advances in Artificial Intelligence, volume 3339 of Lecture Notes in Computer Science, pages 73--75. Springer, Berlin / Heidelberg, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Lin and M. Schatz. Design patterns for efficient graph algorithms in mapreduce. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, MLG '10, pages 78--85, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. L. Liu, E. Li, Y. Zhang, and Z. Tang. Optimization of frequent itemset mining on multiple-core processor. In Proceedings of the 33rd international conference on Very large data bases, VLDB '07, pages 1275--1285. VLDB Endowment, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. H. Mannila, H. Toivonen, and I. Verkamo. Efficient algorithms for discovering association rules. In KDD Workshop, pages 181--192, Menlo Park, CA, USA, 1994. The AAAI Press.Google ScholarGoogle Scholar
  24. M. Mitzenmacher and E. Upfal. Probability and computing - randomized algorithms and probabilistic analysis. Cambridge University Press, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. E. Ozkural, B. Ucar, and C. Aykanat. Parallel frequent item set mining with selective item replication. Parallel and Distributed Systems, IEEE Transactions on, 22(10):1632--1640, oct. 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. Parthasarathy. Efficient progressive sampling for association rules. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM '02, pages 354--361. IEEE Computer Society, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-round tradeoffs for MapReduce computations. CoRR, abs/1111.2228, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. Pietracaprina, M. Riondato, E. Upfal, and F. Vandin. Mining top-K frequent itemsets through progressive sampling. Data Mining and Knowledge Discovery, 21:310--326, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. M. Riondato and E. Upfal. Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. CoRR, abs/1111.6937, November 2011.Google ScholarGoogle Scholar
  30. J. Ruoming, Y. Ge, and G. Agrawal. Shared memory parallelization of data mining algorithms: techniques, programming interface, and performance. Knowledge and Data Engineering, IEEE Transactions on, 17(1):71 -- 89, jan. 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. N. V. Sahinidis and M. Tawarmalani. BARON 9.0.4: Global Optimization of Mixed-Integer Nonlinear Programs, User's Manual, 2010.Google ScholarGoogle Scholar
  32. H. Toivonen. Sampling large databases for association rules. In Proceedings of the 22th International Conference on Very Large Data Bases, VLDB '96, pages 134--145, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. X. Y. Yang, Z. Liu, and Y. Fu. MapReduce as a programming model for association rules algorithm on Hadoop. In Information Sciences and Interaction Sciences (ICIS), 2010 3rd International Conference on, pages 99--102, june 2010.Google ScholarGoogle ScholarCross RefCross Ref
  34. M. Zaki. Parallel and distributed association mining: a survey. Concurrency, IEEE, 7(4):14 --25, oct-dec 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. M. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of sampling for data mining of association rules. In Proceedings of the Seventh International Workshop on Research Issues in Data Engineering, RIDE '97, pages 42--50. IEEE Computer Society, apr 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng. Balanced parallel FP-Growth with MapReduce. In Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference on, pages 243--246, nov. 2010.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
      October 2012
      2840 pages
      ISBN:9781450311564
      DOI:10.1145/2396761

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 29 October 2012

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,861of8,427submissions,22%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader