research-article

DIMSpan: Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems

Authors:
André Petermann

University of Leipzig & ScaDS Dresden/Leipzig, Leipzig, Germany

University of Leipzig & ScaDS Dresden/Leipzig, Leipzig, Germany
View Profile

,
Martin Junghanns

University of Leipzig & ScaDS Dresden/Leipzig, Leipzig, Germany

University of Leipzig & ScaDS Dresden/Leipzig, Leipzig, Germany
View Profile

,
Erhard Rahm

University of Leipzig & ScaDS Dresden/Leipzig, Leipzig, Germany

University of Leipzig & ScaDS Dresden/Leipzig, Leipzig, Germany
View Profile

BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and TechnologiesDecember 2017Pages 237–246https://doi.org/10.1145/3148055.3148064

Published:05 December 2017Publication History

BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies

Pages 237–246

ABSTRACT

Transactional frequent subgraph mining identifies frequent structural patterns in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that utilizes the features provided by distributed in-memory dataflow systems such as Apache Flink or Apache Spark. It determines the complete set of frequent subgraphs from arbitrary string-labeled directed multigraphs as they occur in social, business and knowledge networks. DIMSpan is optimized to runtime and minimal network traffic but memory-aware. An extensive performance evaluation on large graph collections shows the scalability of DIMSpan and the effectiveness of its optimization techniques.

References

C. C. Aggarwal and J. Han. Frequent pattern mining. Springer, 2014. Google ScholarDigital Library
S. Aridhi, L. D'Orazio, M. Maddouri, and E. Mephu. A novel mapreduce-based approach for distributed frequent subgraph mining. In Reconnaissance de Formes et Intelligence Artificielle (RFIA), 2014.Google Scholar
M. A. Bhuiyan and M. Al Hasan. An iterative mapreduce based frequent subgraph mining algorithm. Knowledge and Data Engineering, IEEE Transactions on, 27(3):608--620, 2015.Google ScholarDigital Library
C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In IEEE International Conference on Data Mining (ICDM), pages 51--58, 2002. Google ScholarDigital Library
B. Bringmann and S. Nijssen. What is frequent in a single graph? In PAKDD, pages 858--863. Springer, 2008. Google ScholarDigital Library
P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. Data Engineering, page 28, 2015.Google Scholar
R. Cyganiak, A. Harth, and A. Hogan. N-quads: Extending n-triples with context. W3C Recommendation, 2008.Google Scholar
J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. volume 51, pages 107--113. ACM, 2008. Google ScholarDigital Library
M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment, 7(7):517--528, 2014. Google ScholarDigital Library
S. Hill, B. Srichandan, and R. Sunderraman. An iterative mapreduce approach to frequent subgraph mining in biological datasets. In Proc. ACM Conf. on Bioinformatics, Computational Biology and Biomedicine, pages 661--666, 2012. Google ScholarDigital Library
J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In IEEE Int. Conf. on Data Mining (ICDM), pages 549--552, 2003. Google ScholarDigital Library
A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 13--23. Springer, 2000. Google ScholarDigital Library
C. Jiang, F. Coenen, and M. Zito. A survey of frequent subgraph mining algorithms. The Knowledge Eng. Review, 28(01):75--105, 2013.Google ScholarCross Ref
M. Junghanns, A. Petermann, N. Teichmann, K. Gómez, and E. Rahm. Analyzing extended property graphs with apache flink. In Proc. ACM SIGMOD Workshop on Network Data Analytics, pages 3:1--3:8, 2016. Google ScholarDigital Library
R. Kessl, N. Talukder, P. Anchuri, and M. Zaki. Parallel graph mining with gpus. In BigMine, pages 1--16, 2014. Google ScholarDigital Library
M. Kuramochi and G. Karypis. Frequent subgraph discovery. In IEEE Int. Conf. on Data Mining (ICDM), pages 313--320, 2001. Google ScholarDigital Library
D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1):1--29, 2015. Google ScholarDigital Library
W. Lin, X. Xiao, and G. Ghinita. Large-scale frequent subgraph mining in mapreduce. In International Conference on Data Engineering (ICDE), pages 844--855. IEEE, 2014.Google ScholarCross Ref
W. Lu, G. Chen, A. Tung, and F. Zhao. Efficiently extracting frequent subgraphs using mapreduce. In IEEE Int. Conf. on Big Data, pages 639--647, 2013.Google ScholarCross Ref
R. R. McCune, T. Weninger, and G. Madey. Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys (CSUR), 48(2):25, 2015. Google ScholarDigital Library
S. Nijssen and J. N. Kok. The gaston tool for frequent subgraph mining. Electronic Notes in Theoretical Computer Science, 127(1):77--87, 2005. Google ScholarDigital Library
S. Nijssen and J. N. Kok. Frequent subgraph miners: runtimes don't say everything. MLG 2006, page 173, 2006.Google Scholar
A. Petermann et al. Mining and Ranking of Generalized Multi-Dimensional Frequent Subgraphs. In IEEE Int. Conf. on Digital Inf. Management (ICDIM), 2017.Google Scholar
A. Petermann, M. Junghanns, S. Kemper, K. Gómez, N. Teichmann, and E. Rahm. Graph mining for complex data analytics. In IEEE Int. Conf. on Data Mining Workshops (ICDMW), pages 1316--1319, 2016.Google ScholarCross Ref
A. Petermann, M. Junghanns, R. Müller, and E. Rahm. Graph-based Data Integration and Business Intelligence with BIIIG. PVLDB, 7(13), 2014. Google ScholarDigital Library
S. Ranu and A. K. Singh. Graphsig: A scalable approach to mining significant subgraphs in large graph databases. In IEEE Int. Conf. on Data Engineering (ICDE), pages 844--855. IEEE, 2009. Google ScholarDigital Library
A. Stratikopoulos et al. Hpc-gspan: An fpga-based parallel system for frequent subgraph mining. In IEEE Int. Conf. on Field Programmable Logic and Applications (FPL), pages 1--4, 2014.Google Scholar
C. H. Teixeira et al. Arabesque: a system for distributed graph mining. In Proc. of the 25th Symposium on Operating Systems Principles, pages 425--440. ACM, 2015. Google ScholarDigital Library
L. T. Thomas, S. R. Valluri, and K. Karlapalem. Margin: Maximal frequent subgraph mining. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(3):10, 2010. Google ScholarDigital Library
B. Vo, D. Nguyen, and T.-L. Nguyen. A parallel algorithm for frequent subgraph mining. In Advanced Computational Methods for Knowledge Engineering, pages 163--173. Springer, 2015.Google ScholarCross Ref
M. Wörlein et al. A quantitative comparison of the subgraph miners mofa, gspan, 'sm, and gaston. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 392--403. Springer, 2005.Google Scholar
X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In IEEE International Conference on Data Mining (ICDM), pages 721--724, 2002. Google ScholarDigital Library
X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In Technical Report UIUCDCS-R-2002.2296, 2002.Google ScholarDigital Library
X. Yan and J. Han. Closegraph: mining closed frequent graph patterns. In KDD, pages 286--295. ACM, 2003. Google ScholarDigital Library
M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2--2, 2012. Google ScholarDigital Library
J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th international conference on World Wide Web, pages 387--396. ACM, 2008. Google ScholarDigital Library

Index Terms

DIMSpan: Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed algorithms
2. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

On the Multichromatic Number of s-Stable Kneser Graphs

For positive integers n and s, a subset Sï [n] is s-stable if sï |i-j|ï n-s for distinct i,j∈S . The s-stable r-uniform Kneser hypergraph KGrn,ks-stable is the r-uniform hypergraph that has the collection of all s-stable k-element subsets of [n] as ...
Read More
Efficient algorithms for mining high-utility itemsets in uncertain databases

High-utility itemset mining (HUIM) is a useful set of techniques for discovering patterns in transaction databases, which considers both quantity and profit of items. However, most algorithms for mining high-utility itemsets (HUIs) assume that the ...
Read More
A method for mining top-rank-k frequent closed itemsets
Collective intelligent information and database systems

Mining frequent closed itemsets (FCIs) is important in mining non-redundant (minimal) association rules. Therefore, many algorithms have been developed for mining FCIs with reduced mining time and memory usage. For mining FCIs, algorithms use the minimum ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
December 2017
288 pages
ISBN:9781450355490
DOI:10.1145/3148055
General Chairs:
Ashiq Anjum
University of Derby, UK
,
Alan Sill
Texas Tech University, USA
,
Program Chairs:
Xinghui Zhao
Washington State University Vancouver, USA
,
Mohsen Farid
University of Derby, UK
,
Shrideep Pallickara
Colorado State University, USA
,
Jiannong Cao
The HongKong Polytechnic University, Hong Kong
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 5 December 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
distributed frequent subgraph mining
shared nothing cluster
Qualifiers
- research-article
Conference

Acceptance Rates
BDCAT '17 Paper Acceptance Rate27of93submissions,29%Overall Acceptance Rate27of93submissions,29%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 8
  Total Citations
  View Citations
- 129
  Total Downloads
- Downloads (Last 12 months)8
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

DIMSpan: Transactional Frequent Subgraph Mining with Distributed In-Memory Dataflow Systems

BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies

ABSTRACT

References

Cited By

Index Terms

Recommendations

On the Multichromatic Number of s-Stable Kneser Graphs

Efficient algorithms for mining high-utility itemsets in uncertain databases

A method for mining top-rank-k frequent closed itemsets