Skip to main content
Erschienen in: Journal of Intelligent Information Systems 2/2014

01.04.2014

Finding the most descriptive substructures in graphs with discrete and numeric labels

verfasst von: Michael Davis, Weiru Liu, Paul Miller

Erschienen in: Journal of Intelligent Information Systems | Ausgabe 2/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Many graph datasets are labelled with discrete and numeric attributes. Most frequent substructure discovery algorithms ignore numeric attributes; in this paper we show how they can be used to improve search performance and discrimination. Our thesis is that the most descriptive substructures are those which are normative both in terms of their structure and in terms of their numeric values. We explore the relationship between graph structure and the distribution of attribute values and propose an outlier-detection step, which is used as a constraint during substructure discovery. By pruning anomalous vertices and edges, more weight is given to the most descriptive substructures. Our method is applicable to multi-dimensional numeric attributes; we outline how it can be extended for high-dimensional data. We support our findings with experiments on transaction graphs and single large graphs from the domains of physical building security and digital forensics, measuring the effect on runtime, memory requirements and coverage of discovered patterns, relative to the unconstrained approach.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
Sometimes overlapping substructures can contribute independently: Palacio (2005) discusses alternative strategies for handling overlapping substructures in Subdue.
 
2
Figure 3 shows the results for R-MAT random graphs, with 0, 1, . . . , 9 binary labels, i.e. 1–512 vertex partitions. The label values were assigned independently from a uniform distribution. Our experiments on real datasets (Section 5) verify that the complexity of substructure discovery increases with the homogeneity of vertices and edges, and that this holds when the independence assumption is removed.
 
4
Graphs are created from the August 21, 2009 version of the Enron corpus, https://​www.​cs.​cmu.​edu/​~enron/​. Identification of individuals and job roles are from “Ex-employee Status Report”, http://​www.​isi.​edu/​~adibi/​Enron/​Enron.​htm. This spreadsheet contains 161 names, but two of them appear to be duplicates.
 
7
Environment for Developing KDD Applications Supported by Index Structures (Achert et al. 2013), http://​elki.​dbs.​ifi.​lmu.​de/​
 
8
Graph Exchange XML Format, http://​gexf.​net/​format/​
 
9
We were unable to calculate RP + PINN + LOF for larger datasets due to memory constraints. The random projection (Achlioptas 2003) used by PINN is designed to be “database-friendly”; by changing the indexing method it would be possible to create a PINN implementation which creates its index on disk in order to process larger or higher-dimensional datasets.
 
Literatur
Zurück zum Zitat Achert, E., Kriegel, H.P., Schubert, E., Zimek, A. (2013). Interactive data mining with 3D-parallel-coordinate-trees. In Proceedings of the ACM international conference on management of data (SIGMOD). Achert, E., Kriegel, H.P., Schubert, E., Zimek, A. (2013). Interactive data mining with 3D-parallel-coordinate-trees. In Proceedings of the ACM international conference on management of data (SIGMOD).
Zurück zum Zitat Barnett, V., & Lewis, T. (1994). Outliers in statistical data, 3rd edn. Chichester: Wiley Probability & Statistics.MATH Barnett, V., & Lewis, T. (1994). Outliers in statistical data, 3rd edn. Chichester: Wiley Probability & Statistics.MATH
Zurück zum Zitat Borgelt, C. (2006). Canonical forms for frequent graph mining. In R. Decker & H.J. Lenz (Eds.), GfKl, springer, studies in classification, data analysis, and knowledge organization (pp. 337–349). Borgelt, C. (2006). Canonical forms for frequent graph mining. In R. Decker & H.J. Lenz (Eds.), GfKl, springer, studies in classification, data analysis, and knowledge organization (pp. 337–349).
Zurück zum Zitat Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J. (2000). LOF: identifying density-based local outliers. In W. Chen, J.F. Naughton, P.A. Bernstein (Eds.), SIGMOD conference (pp. 93–104). ACM. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J. (2000). LOF: identifying density-based local outliers. In W. Chen, J.F. Naughton, P.A. Bernstein (Eds.), SIGMOD conference (pp. 93–104). ACM.
Zurück zum Zitat Cook, D.J., & Holder, L.B. (1994). Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1(1), 231–255. Cook, D.J., & Holder, L.B. (1994). Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1(1), 231–255.
Zurück zum Zitat Cook, D.J., & Holder, L.B. (2000). Graph-based data mining. IEEE Intelligent Systems, 15, 32–41.CrossRef Cook, D.J., & Holder, L.B. (2000). Graph-based data mining. IEEE Intelligent Systems, 15, 32–41.CrossRef
Zurück zum Zitat Davis, M., Liu, W., Miller, P., Redpath, G. (2011). Detecting anomalies in graphs with numeric labels. In C. Macdonald, I. Ounis, I. Ruthven (Eds.), Proceedings of the 20th ACM conference on information and knowledge management (CIKM 2011) (pp. 1197–1202). ACM. Davis, M., Liu, W., Miller, P., Redpath, G. (2011). Detecting anomalies in graphs with numeric labels. In C. Macdonald, I. Ounis, I. Ruthven (Eds.), Proceedings of the 20th ACM conference on information and knowledge management (CIKM 2011) (pp. 1197–1202). ACM.
Zurück zum Zitat de Vries, T., Chawla, S., Houle, M. (2010). Finding local anomalies in very high dimensional space. In ICDM 2010 (pp. 128–137). IEEE. de Vries, T., Chawla, S., Houle, M. (2010). Finding local anomalies in very high dimensional space. In ICDM 2010 (pp. 128–137). IEEE.
Zurück zum Zitat Eberle, W., & Holder, L. (2011). Compression versus frequency for mining patterns and anomalies in graphs. In Ninth workshop on mining and learning with graphs (MLG 2011), SIGKDD, at the 17th ACM SIGKDD conference on knowledge discovery and data mining (KDD 2011). San Diego. Eberle, W., & Holder, L. (2011). Compression versus frequency for mining patterns and anomalies in graphs. In Ninth workshop on mining and learning with graphs (MLG 2011), SIGKDD, at the 17th ACM SIGKDD conference on knowledge discovery and data mining (KDD 2011). San Diego.
Zurück zum Zitat Eichinger, F., Böhm, K., Huber, M. (2008). Mining edge-weighted call graphs to localise software bugs. In W. Daelemans, B. Goethals, K. Morik (Eds.), ECML/PKDD (1), lecture notes in computer science (Vol. 5211, pp. 333–348). Springer. Eichinger, F., Böhm, K., Huber, M. (2008). Mining edge-weighted call graphs to localise software bugs. In W. Daelemans, B. Goethals, K. Morik (Eds.), ECML/PKDD (1), lecture notes in computer science (Vol. 5211, pp. 333–348). Springer.
Zurück zum Zitat Fortin, S. (1996). The graph isomorphism problem. Tech. rep., Univ. of Alberta. Fortin, S. (1996). The graph isomorphism problem. Tech. rep., Univ. of Alberta.
Zurück zum Zitat Fowler, J. H., & Christakis, N.A. (2008). Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study. BMJ, 337, doi:10.1136/bmj.a2338. Fowler, J. H., & Christakis, N.A. (2008). Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study. BMJ, 337, doi:10.​1136/​bmj.​a2338.
Zurück zum Zitat Günnemann, S., Boden, B., Seidl, T. (2011). DB-CSC: a density-based approach for subspace clustering in graphs with feature vectors. In Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases, ECML PKDD11 (Vol. Part I, pp. 565–580). Springer. Günnemann, S., Boden, B., Seidl, T. (2011). DB-CSC: a density-based approach for subspace clustering in graphs with feature vectors. In Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases, ECML PKDD11 (Vol. Part I, pp. 565–580). Springer.
Zurück zum Zitat Huan, J., Wang, W., Prins, J., Yang, J. (2004). SPIN: mining maximal frequent subgraphs from graph databases. In KDD 2004 (pp. 581–586). ACM. Huan, J., Wang, W., Prins, J., Yang, J. (2004). SPIN: mining maximal frequent subgraphs from graph databases. In KDD 2004 (pp. 581–586). ACM.
Zurück zum Zitat Inokuchi, A., Washio, T., Motoda, H. (2000). An Apriori-based algorithm for mining frequent substructures from graph data. In PKDD 2000 (pp. 13–23). Springer. Inokuchi, A., Washio, T., Motoda, H. (2000). An Apriori-based algorithm for mining frequent substructures from graph data. In PKDD 2000 (pp. 13–23). Springer.
Zurück zum Zitat Janssens, J., Flesch, I., Postma, E. (2009). Outlier detection with one-class classifiers from ML and KDD. In International conference on machine learning and applications (ICMLA ’09) (pp. 147–153). doi:10.1109/ICMLA.2009.16. Janssens, J., Flesch, I., Postma, E. (2009). Outlier detection with one-class classifiers from ML and KDD. In International conference on machine learning and applications (ICMLA ’09) (pp. 147–153). doi:10.​1109/​ICMLA.​2009.​16.
Zurück zum Zitat Jiang, C., Coenen, F., Zito, M. (2010). Frequent sub-graph mining on edge weighted graphs. In Proceedings of the 12th international conference on data warehousing and knowledge discovery, DaWaK’10 (pp. 77–88). Berlin: Springer.CrossRef Jiang, C., Coenen, F., Zito, M. (2010). Frequent sub-graph mining on edge weighted graphs. In Proceedings of the 12th international conference on data warehousing and knowledge discovery, DaWaK’10 (pp. 77–88). Berlin: Springer.CrossRef
Zurück zum Zitat Jin, W., Tung, A.K.H., Han, J. (2001). Mining top-n local outliers in large databases. In D. Lee, M. Schkolnick, F.J. Provost, R. Srikant (Eds.), KDD (pp. 293–298). ACM. Jin, W., Tung, A.K.H., Han, J. (2001). Mining top-n local outliers in large databases. In D. Lee, M. Schkolnick, F.J. Provost, R. Srikant (Eds.), KDD (pp. 293–298). ACM.
Zurück zum Zitat Jin, W., Tung, A.K.H., Han, J., Wang, W. (2006). Ranking outliers using symmetric neighborhood relationship. In W.K. Ng, M. Kitsuregawa, J. Li, K. Chang (Eds.), PAKDD, lecture notes in computer science (Vol. 3918, pp. 577–593). Springer. Jin, W., Tung, A.K.H., Han, J., Wang, W. (2006). Ranking outliers using symmetric neighborhood relationship. In W.K. Ng, M. Kitsuregawa, J. Li, K. Chang (Eds.), PAKDD, lecture notes in computer science (Vol. 3918, pp. 577–593). Springer.
Zurück zum Zitat Kim, M., & Leskovec, J. (2011). Modeling social networks with node attributes using the multiplicative attribute graph model. In F.G Cozman & A. Pfeffer (Eds.), UAI (pp. 400–409). AUAI Press. Kim, M., & Leskovec, J. (2011). Modeling social networks with node attributes using the multiplicative attribute graph model. In F.G Cozman & A. Pfeffer (Eds.), UAI (pp. 400–409). AUAI Press.
Zurück zum Zitat Kim, M., & Leskovec, J. (2012). Multiplicative attribute graph model of real-world networks. Internet Mathematics, 8(1–2), 113–160.CrossRefMATHMathSciNet Kim, M., & Leskovec, J. (2012). Multiplicative attribute graph model of real-world networks. Internet Mathematics, 8(1–2), 113–160.CrossRefMATHMathSciNet
Zurück zum Zitat Klimt, B., & Yang, Y. (2004). The Enron corpus: a new dataset for email classification research. In J.F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.), ECML, lecture notes in computer science (Vol. 3201, pp. 217–226). Springer. Klimt, B., & Yang, Y. (2004). The Enron corpus: a new dataset for email classification research. In J.F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.), ECML, lecture notes in computer science (Vol. 3201, pp. 217–226). Springer.
Zurück zum Zitat Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A. (2009). Outlier detection in axis-parallel subspaces of high dimensional data. In T. Theeramunkong, B. Kijsirikul, N. Cercone, T.B. Ho (Eds.), PAKDD, lecture notes in computer science (Vol. 5476, pp. 831–838). Springer. Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A. (2009). Outlier detection in axis-parallel subspaces of high dimensional data. In T. Theeramunkong, B. Kijsirikul, N. Cercone, T.B. Ho (Eds.), PAKDD, lecture notes in computer science (Vol. 5476, pp. 831–838). Springer.
Zurück zum Zitat Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A. (2011). Interpreting and unifying outlier scores. In SDM (pp. 13–24). SIAM/Omnipress. Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A. (2011). Interpreting and unifying outlier scores. In SDM (pp. 13–24). SIAM/Omnipress.
Zurück zum Zitat Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. In ICDM 2001 (pp. 313–320). IEEE. Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. In ICDM 2001 (pp. 313–320). IEEE.
Zurück zum Zitat McLean, B., & Elkind, P. (2003). The smartest guys in the room: the amazing rise and scandalous fall of Enron. USA: Penguin Group. McLean, B., & Elkind, P. (2003). The smartest guys in the room: the amazing rise and scandalous fall of Enron. USA: Penguin Group.
Zurück zum Zitat Moser, F., Colak, R., Rafiey, A., Ester, M. (2009). Mining cohesive patterns from graphs with feature vectors. In SDM (pp. 593–604). SIAM. Moser, F., Colak, R., Rafiey, A., Ester, M. (2009). Mining cohesive patterns from graphs with feature vectors. In SDM (pp. 593–604). SIAM.
Zurück zum Zitat Newman, M. (2010). Networks: an introduction. New York: Oxford University Press, Inc.CrossRef Newman, M. (2010). Networks: an introduction. New York: Oxford University Press, Inc.CrossRef
Zurück zum Zitat Nijssen, S., & Kok, J.N. (2004). A quickstart in frequent structure mining can make a difference. In W. Kim, R. Kohavi, J. Gehrke, W. DuMouchel (Eds.), KDD (pp. 647–652). ACM. Nijssen, S., & Kok, J.N. (2004). A quickstart in frequent structure mining can make a difference. In W. Kim, R. Kohavi, J. Gehrke, W. DuMouchel (Eds.), KDD (pp. 647–652). ACM.
Zurück zum Zitat Palacio, M.A.P. (2005). Spatial data modeling and mining using a graph-based representation. PhD thesis, Department of Computer Systems Engineering. Puebla: University of the Americas. Palacio, M.A.P. (2005). Spatial data modeling and mining using a graph-based representation. PhD thesis, Department of Computer Systems Engineering. Puebla: University of the Americas.
Zurück zum Zitat Papadimitriou, S., Kitagawa, H., Gibbons, P.B., Faloutsos, C. (2003). LOCI: fast outlier detection using the local correlation integral. In U. Dayal, K. Ramamritham, T.M. Vijayaraman (Eds.), ICDE (pp. 315–326). IEEE Computer Society. Papadimitriou, S., Kitagawa, H., Gibbons, P.B., Faloutsos, C. (2003). LOCI: fast outlier detection using the local correlation integral. In U. Dayal, K. Ramamritham, T.M. Vijayaraman (Eds.), ICDE (pp. 315–326). IEEE Computer Society.
Zurück zum Zitat Schubert, E., Zimek, A., Kriegel, H.P. (2012). Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery. doi:10.1007/s10618-012-0300-z. Schubert, E., Zimek, A., Kriegel, H.P. (2012). Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery. doi:10.​1007/​s10618-012-0300-z.
Zurück zum Zitat Tang, J., Chen, Z., Fu, A.W.C., Cheung, D.W.L. (2002). Enhancing effectiveness of outlier detections for low density patterns. In M.S. Cheng, P.S. Yu, B. Liu (Eds.), PAKDD, lecture notes in computer science (Vol. 2336, pp. 535–548). Springer. Tang, J., Chen, Z., Fu, A.W.C., Cheung, D.W.L. (2002). Enhancing effectiveness of outlier detections for low density patterns. In M.S. Cheng, P.S. Yu, B. Liu (Eds.), PAKDD, lecture notes in computer science (Vol. 2336, pp. 535–548). Springer.
Zurück zum Zitat Wang, C., Zhu, Y., Wu, T., Wang, W., Shi, B. (2005). Constraint-based graph mining in large database. In Y. Zhang, K. Tanaka, J.X. Yu, S. Wang, M. Li (Eds.), APWeb, lecture notes in computer science (Vol. 3399, pp. 133–144). Springer. Wang, C., Zhu, Y., Wu, T., Wang, W., Shi, B. (2005). Constraint-based graph mining in large database. In Y. Zhang, K. Tanaka, J.X. Yu, S. Wang, M. Li (Eds.), APWeb, lecture notes in computer science (Vol. 3399, pp. 133–144). Springer.
Zurück zum Zitat Wörlein, M., Meinl, T., Fischer, I., Philippsen, M. (2005). A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, J. Gama (Eds.), PKDD, lecture notes in computer science (Vol. 3721, pp. 392–403). Springer. Wörlein, M., Meinl, T., Fischer, I., Philippsen, M. (2005). A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, J. Gama (Eds.), PKDD, lecture notes in computer science (Vol. 3721, pp. 392–403). Springer.
Zurück zum Zitat Yan, X., & Han, J. (2002). gSpan: graph-based substructure pattern mining. In ICDM 2002 (pp. 721–724). IEEE. Yan, X., & Han, J. (2002). gSpan: graph-based substructure pattern mining. In ICDM 2002 (pp. 721–724). IEEE.
Zurück zum Zitat Yan, X., & Han, J. (2003). CloseGraph: mining closed frequent graph patterns. In KDD 2003 (pp. 286–295). ACM. Yan, X., & Han, J. (2003). CloseGraph: mining closed frequent graph patterns. In KDD 2003 (pp. 286–295). ACM.
Zurück zum Zitat Zimek, A., Schubert, E., Kriegel, H.P. (2012). A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5), 363–387.CrossRefMathSciNet Zimek, A., Schubert, E., Kriegel, H.P. (2012). A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5), 363–387.CrossRefMathSciNet
Metadaten
Titel
Finding the most descriptive substructures in graphs with discrete and numeric labels
verfasst von
Michael Davis
Weiru Liu
Paul Miller
Publikationsdatum
01.04.2014
Verlag
Springer US
Erschienen in
Journal of Intelligent Information Systems / Ausgabe 2/2014
Print ISSN: 0925-9902
Elektronische ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-013-0299-7

Weitere Artikel der Ausgabe 2/2014

Journal of Intelligent Information Systems 2/2014 Zur Ausgabe