skip to main content
10.1145/3148055.3148064acmconferencesArticle/Chapter ViewAbstractPublication PagesbdcatConference Proceedingsconference-collections
research-article

DIMSpan: Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems

Authors Info & Claims
Published:05 December 2017Publication History

ABSTRACT

Transactional frequent subgraph mining identifies frequent structural patterns in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that utilizes the features provided by distributed in-memory dataflow systems such as Apache Flink or Apache Spark. It determines the complete set of frequent subgraphs from arbitrary string-labeled directed multigraphs as they occur in social, business and knowledge networks. DIMSpan is optimized to runtime and minimal network traffic but memory-aware. An extensive performance evaluation on large graph collections shows the scalability of DIMSpan and the effectiveness of its optimization techniques.

References

  1. C. C. Aggarwal and J. Han. Frequent pattern mining. Springer, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Aridhi, L. D'Orazio, M. Maddouri, and E. Mephu. A novel mapreduce-based approach for distributed frequent subgraph mining. In Reconnaissance de Formes et Intelligence Artificielle (RFIA), 2014.Google ScholarGoogle Scholar
  3. M. A. Bhuiyan and M. Al Hasan. An iterative mapreduce based frequent subgraph mining algorithm. Knowledge and Data Engineering, IEEE Transactions on, 27(3):608--620, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In IEEE International Conference on Data Mining (ICDM), pages 51--58, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. B. Bringmann and S. Nijssen. What is frequent in a single graph? In PAKDD, pages 858--863. Springer, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. Data Engineering, page 28, 2015.Google ScholarGoogle Scholar
  7. R. Cyganiak, A. Harth, and A. Hogan. N-quads: Extending n-triples with context. W3C Recommendation, 2008.Google ScholarGoogle Scholar
  8. J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. volume 51, pages 107--113. ACM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment, 7(7):517--528, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Hill, B. Srichandan, and R. Sunderraman. An iterative mapreduce approach to frequent subgraph mining in biological datasets. In Proc. ACM Conf. on Bioinformatics, Computational Biology and Biomedicine, pages 661--666, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In IEEE Int. Conf. on Data Mining (ICDM), pages 549--552, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 13--23. Springer, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. C. Jiang, F. Coenen, and M. Zito. A survey of frequent subgraph mining algorithms. The Knowledge Eng. Review, 28(01):75--105, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  14. M. Junghanns, A. Petermann, N. Teichmann, K. Gómez, and E. Rahm. Analyzing extended property graphs with apache flink. In Proc. ACM SIGMOD Workshop on Network Data Analytics, pages 3:1--3:8, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Kessl, N. Talukder, P. Anchuri, and M. Zaki. Parallel graph mining with gpus. In BigMine, pages 1--16, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Kuramochi and G. Karypis. Frequent subgraph discovery. In IEEE Int. Conf. on Data Mining (ICDM), pages 313--320, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1):1--29, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. W. Lin, X. Xiao, and G. Ghinita. Large-scale frequent subgraph mining in mapreduce. In International Conference on Data Engineering (ICDE), pages 844--855. IEEE, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  19. W. Lu, G. Chen, A. Tung, and F. Zhao. Efficiently extracting frequent subgraphs using mapreduce. In IEEE Int. Conf. on Big Data, pages 639--647, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  20. R. R. McCune, T. Weninger, and G. Madey. Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys (CSUR), 48(2):25, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Nijssen and J. N. Kok. The gaston tool for frequent subgraph mining. Electronic Notes in Theoretical Computer Science, 127(1):77--87, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. S. Nijssen and J. N. Kok. Frequent subgraph miners: runtimes don't say everything. MLG 2006, page 173, 2006.Google ScholarGoogle Scholar
  23. A. Petermann et al. Mining and Ranking of Generalized Multi-Dimensional Frequent Subgraphs. In IEEE Int. Conf. on Digital Inf. Management (ICDIM), 2017.Google ScholarGoogle Scholar
  24. A. Petermann, M. Junghanns, S. Kemper, K. Gómez, N. Teichmann, and E. Rahm. Graph mining for complex data analytics. In IEEE Int. Conf. on Data Mining Workshops (ICDMW), pages 1316--1319, 2016.Google ScholarGoogle ScholarCross RefCross Ref
  25. A. Petermann, M. Junghanns, R. Müller, and E. Rahm. Graph-based Data Integration and Business Intelligence with BIIIG. PVLDB, 7(13), 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. Ranu and A. K. Singh. Graphsig: A scalable approach to mining significant subgraphs in large graph databases. In IEEE Int. Conf. on Data Engineering (ICDE), pages 844--855. IEEE, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. Stratikopoulos et al. Hpc-gspan: An fpga-based parallel system for frequent subgraph mining. In IEEE Int. Conf. on Field Programmable Logic and Applications (FPL), pages 1--4, 2014.Google ScholarGoogle Scholar
  28. C. H. Teixeira et al. Arabesque: a system for distributed graph mining. In Proc. of the 25th Symposium on Operating Systems Principles, pages 425--440. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. L. T. Thomas, S. R. Valluri, and K. Karlapalem. Margin: Maximal frequent subgraph mining. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(3):10, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. B. Vo, D. Nguyen, and T.-L. Nguyen. A parallel algorithm for frequent subgraph mining. In Advanced Computational Methods for Knowledge Engineering, pages 163--173. Springer, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  31. M. Wörlein et al. A quantitative comparison of the subgraph miners mofa, gspan, 'sm, and gaston. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 392--403. Springer, 2005.Google ScholarGoogle Scholar
  32. X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In IEEE International Conference on Data Mining (ICDM), pages 721--724, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In Technical Report UIUCDCS-R-2002.2296, 2002.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. X. Yan and J. Han. Closegraph: mining closed frequent graph patterns. In KDD, pages 286--295. ACM, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2--2, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th international conference on World Wide Web, pages 387--396. ACM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. DIMSpan: Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
        December 2017
        288 pages
        ISBN:9781450355490
        DOI:10.1145/3148055

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 December 2017

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        BDCAT '17 Paper Acceptance Rate27of93submissions,29%Overall Acceptance Rate27of93submissions,29%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader