ABSTRACT
Transactional frequent subgraph mining identifies frequent structural patterns in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that utilizes the features provided by distributed in-memory dataflow systems such as Apache Flink or Apache Spark. It determines the complete set of frequent subgraphs from arbitrary string-labeled directed multigraphs as they occur in social, business and knowledge networks. DIMSpan is optimized to runtime and minimal network traffic but memory-aware. An extensive performance evaluation on large graph collections shows the scalability of DIMSpan and the effectiveness of its optimization techniques.
- C. C. Aggarwal and J. Han. Frequent pattern mining. Springer, 2014. Google ScholarDigital Library
- S. Aridhi, L. D'Orazio, M. Maddouri, and E. Mephu. A novel mapreduce-based approach for distributed frequent subgraph mining. In Reconnaissance de Formes et Intelligence Artificielle (RFIA), 2014.Google Scholar
- M. A. Bhuiyan and M. Al Hasan. An iterative mapreduce based frequent subgraph mining algorithm. Knowledge and Data Engineering, IEEE Transactions on, 27(3):608--620, 2015.Google ScholarDigital Library
- C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In IEEE International Conference on Data Mining (ICDM), pages 51--58, 2002. Google ScholarDigital Library
- B. Bringmann and S. Nijssen. What is frequent in a single graph? In PAKDD, pages 858--863. Springer, 2008. Google ScholarDigital Library
- P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. Data Engineering, page 28, 2015.Google Scholar
- R. Cyganiak, A. Harth, and A. Hogan. N-quads: Extending n-triples with context. W3C Recommendation, 2008.Google Scholar
- J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. volume 51, pages 107--113. ACM, 2008. Google ScholarDigital Library
- M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment, 7(7):517--528, 2014. Google ScholarDigital Library
- S. Hill, B. Srichandan, and R. Sunderraman. An iterative mapreduce approach to frequent subgraph mining in biological datasets. In Proc. ACM Conf. on Bioinformatics, Computational Biology and Biomedicine, pages 661--666, 2012. Google ScholarDigital Library
- J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In IEEE Int. Conf. on Data Mining (ICDM), pages 549--552, 2003. Google ScholarDigital Library
- A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 13--23. Springer, 2000. Google ScholarDigital Library
- C. Jiang, F. Coenen, and M. Zito. A survey of frequent subgraph mining algorithms. The Knowledge Eng. Review, 28(01):75--105, 2013.Google ScholarCross Ref
- M. Junghanns, A. Petermann, N. Teichmann, K. Gómez, and E. Rahm. Analyzing extended property graphs with apache flink. In Proc. ACM SIGMOD Workshop on Network Data Analytics, pages 3:1--3:8, 2016. Google ScholarDigital Library
- R. Kessl, N. Talukder, P. Anchuri, and M. Zaki. Parallel graph mining with gpus. In BigMine, pages 1--16, 2014. Google ScholarDigital Library
- M. Kuramochi and G. Karypis. Frequent subgraph discovery. In IEEE Int. Conf. on Data Mining (ICDM), pages 313--320, 2001. Google ScholarDigital Library
- D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1):1--29, 2015. Google ScholarDigital Library
- W. Lin, X. Xiao, and G. Ghinita. Large-scale frequent subgraph mining in mapreduce. In International Conference on Data Engineering (ICDE), pages 844--855. IEEE, 2014.Google ScholarCross Ref
- W. Lu, G. Chen, A. Tung, and F. Zhao. Efficiently extracting frequent subgraphs using mapreduce. In IEEE Int. Conf. on Big Data, pages 639--647, 2013.Google ScholarCross Ref
- R. R. McCune, T. Weninger, and G. Madey. Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys (CSUR), 48(2):25, 2015. Google ScholarDigital Library
- S. Nijssen and J. N. Kok. The gaston tool for frequent subgraph mining. Electronic Notes in Theoretical Computer Science, 127(1):77--87, 2005. Google ScholarDigital Library
- S. Nijssen and J. N. Kok. Frequent subgraph miners: runtimes don't say everything. MLG 2006, page 173, 2006.Google Scholar
- A. Petermann et al. Mining and Ranking of Generalized Multi-Dimensional Frequent Subgraphs. In IEEE Int. Conf. on Digital Inf. Management (ICDIM), 2017.Google Scholar
- A. Petermann, M. Junghanns, S. Kemper, K. Gómez, N. Teichmann, and E. Rahm. Graph mining for complex data analytics. In IEEE Int. Conf. on Data Mining Workshops (ICDMW), pages 1316--1319, 2016.Google ScholarCross Ref
- A. Petermann, M. Junghanns, R. Müller, and E. Rahm. Graph-based Data Integration and Business Intelligence with BIIIG. PVLDB, 7(13), 2014. Google ScholarDigital Library
- S. Ranu and A. K. Singh. Graphsig: A scalable approach to mining significant subgraphs in large graph databases. In IEEE Int. Conf. on Data Engineering (ICDE), pages 844--855. IEEE, 2009. Google ScholarDigital Library
- A. Stratikopoulos et al. Hpc-gspan: An fpga-based parallel system for frequent subgraph mining. In IEEE Int. Conf. on Field Programmable Logic and Applications (FPL), pages 1--4, 2014.Google Scholar
- C. H. Teixeira et al. Arabesque: a system for distributed graph mining. In Proc. of the 25th Symposium on Operating Systems Principles, pages 425--440. ACM, 2015. Google ScholarDigital Library
- L. T. Thomas, S. R. Valluri, and K. Karlapalem. Margin: Maximal frequent subgraph mining. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(3):10, 2010. Google ScholarDigital Library
- B. Vo, D. Nguyen, and T.-L. Nguyen. A parallel algorithm for frequent subgraph mining. In Advanced Computational Methods for Knowledge Engineering, pages 163--173. Springer, 2015.Google ScholarCross Ref
- M. Wörlein et al. A quantitative comparison of the subgraph miners mofa, gspan, 'sm, and gaston. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 392--403. Springer, 2005.Google Scholar
- X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In IEEE International Conference on Data Mining (ICDM), pages 721--724, 2002. Google ScholarDigital Library
- X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In Technical Report UIUCDCS-R-2002.2296, 2002.Google ScholarDigital Library
- X. Yan and J. Han. Closegraph: mining closed frequent graph patterns. In KDD, pages 286--295. ACM, 2003. Google ScholarDigital Library
- M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2--2, 2012. Google ScholarDigital Library
- J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th international conference on World Wide Web, pages 387--396. ACM, 2008. Google ScholarDigital Library
Index Terms
- DIMSpan: Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems
Recommendations
On the Multichromatic Number of s-Stable Kneser Graphs
For positive integers n and s, a subset Sï [n] is s-stable if sï |i-j|ï n-s for distinct i,j∈S . The s-stable r-uniform Kneser hypergraph KGrn,ks-stable is the r-uniform hypergraph that has the collection of all s-stable k-element subsets of [n] as ...
Efficient algorithms for mining high-utility itemsets in uncertain databases
High-utility itemset mining (HUIM) is a useful set of techniques for discovering patterns in transaction databases, which considers both quantity and profit of items. However, most algorithms for mining high-utility itemsets (HUIs) assume that the ...
A method for mining top-rank-k frequent closed itemsets
Collective intelligent information and database systemsMining frequent closed itemsets (FCIs) is important in mining non-redundant (minimal) association rules. Therefore, many algorithms have been developed for mining FCIs with reduced mining time and memory usage. For mining FCIs, algorithms use the minimum ...
Comments