research-article

Mining frequent cross-graph quasi-cliques

Authors:
Daxin Jiang

Microsoft Research Asia, Beijing, China

Microsoft Research Asia, Beijing, China
View Profile

,
Jian Pei

Simon Fraser University, Burnaby, BC Canada

Simon Fraser University, Burnaby, BC Canada
View Profile

ACM Transactions on Knowledge Discovery from Data Volume 2 Issue 4Article No.: 16pp 1–42https://doi.org/10.1145/1460797.1460799

Published:16 January 2009Publication History

ACM Transactions on Knowledge Discovery from Data

Abstract

Joint mining of multiple datasets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in bioinformatics, jointly mining multiple gene expression datasets obtained by different labs or during various biological processes may overcome the heavy noise in the data. Moreover, by joint mining of gene expression data and protein-protein interaction data, we may discover clusters of genes which show coherent expression patterns and also produce interacting proteins. Such clusters may be potential pathways.

In this article, we investigate a novel data mining problem, mining frequent cross-graph quasi-cliques, which is generalized from several interesting applications in bioinformatics, cross-market customer segmentation, social network analysis, and Web mining. In a graph, a set of vertices S is a γ-quasi-clique (0 < γ ≤ 1) if each vertex v in S directly connects to at least γ ⋅ (|S| − 1) other vertices in S. Given a set of graphs G₁, …, G_n and parameter min_sup (0 < min_sup ≤ 1), a set of vertices S is a frequent cross-graph quasi-clique if S is a γ-quasi-clique in at least min_sup ⋅ n graphs, and there does not exist a proper superset of S having the property.

We build a general model, show why the complete set of frequent cross-graph quasi-cliques cannot be found by previous data mining methods, and study the complexity of the problem. While the problem is difficult, we develop practical algorithms which exploit several interesting and effective techniques and heuristics to efficaciously mine frequent cross-graph quasi-cliques. A systematic performance study is reported on both synthetic and real data sets. We demonstrate some interesting and meaningful frequent cross-graph quasi-cliques in bioinformatics. The experimental results also show that our algorithms are efficient and scalable.

References

Abello, J., Resende, M., and Sudarsky, S. 2002. Massive quasi-clique detection. Proceedings of the Latin-American Symposium on Theoretical Intormatics, 598--612. Google ScholarDigital Library
Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., and Levine, A. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array. Proc. Natl. Acad. Sci. 96, 12, 6745--6750.Google ScholarCross Ref
Bader, G. and Hogue, C. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 1, 2.Google ScholarCross Ref
Barab´si, A. and Albert, R. 1999. Emergence of scaling in random networks. Science 286.Google Scholar
Bayada, D., Simpson, R., and Johnson, A. 1992. An algorithm for the multiple common subgraph problem. J. Chemi. Inform. Comput. Sci., 680--685.Google Scholar
Beissbarth, T., Fellenberg, K., Brors, B., et al. 2000. Processing and quality control of DNA array hybridization data. Bioinformatics 16, 1014--1022.Google ScholarCross Ref
Ben-Dor, A., Shamir, R., and Yakhini, Z. 1999. Clustering gene expression patterns. J. Computati. Biol. 6, 3/4, 281--297.Google Scholar
Brazma, A. and Vilo, J. 2000. Minireview: Gene expression data analysis. Federation European Biochemical Societies 480, 17--24.Google ScholarCross Ref
Bu, D., Zhao, Y., Cai, L., et al. 2003. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucl. Acids Research. 31, 9, 2443--2450.Google ScholarCross Ref
Cho, R., Campbell, M., Winzeler, E., et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mole. Cell 2, 1, 65--73.Google Scholar
Choi, K., Satterberg, B., Lyons, D., and Elion, E. 1994. Ste5 tethers multiple protein kinases in the map kinase cascade required for mating in S. cerevisiae. Cell 78, 499--C512.Google ScholarCross Ref
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P., and Herskowitz, I. 1998. The transcriptional program of sporulation in budding yeast. Science 282, 5389, 699--705.Google Scholar
DeRisi, J., Iyer, V., and Brown, P. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 680--686.Google Scholar
Ding, C., He, X., Zha, H., Gu, M., and Simon, H. 2001. A min-max cut algorithm for graph partitioning and data clustering. In Proceedings of the 1st IEEE International Conference on Data Mining. 107--114. Google ScholarDigital Library
Dongen, S. 2000. Graph clustering by flow simulation. PhD Thesis, University of Utrecht, The Netherlands.Google Scholar
Dzeroski, S. and Raedt, L. D. 2003. On multi relational data mining (mrdm). SIGKDD Explorations 5, 1. Google ScholarDigital Library
Eisen, M., Spellman, P., Brown, P., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95, 14863--14868.Google ScholarCross Ref
Enright, A., Dongen, S., and Ouzounis, C. 2002. An efficient algorithm for large-scale detection of protein families. Nucl. Acids Resear. 30, 7, 1575--1584.Google ScholarCross Ref
Faloutsos, C., McCurley, K., and Tomkins, A. 2004. Fast discovery of connection subgraphs. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'04). Google ScholarDigital Library
Garey, M. and Johnson, D. 1979. Computers and Intractability: a Guide to The Theory of NP-Completeness. Freeman and Company, New York. Google ScholarDigital Library
Gasch, A., Huang, M., Metzner, S., Botstein, D., Elledge, S., and Brown, P. 2001. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell 12, 10, 2987--3003.Google ScholarCross Ref
Gasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M., Storz, G., Botstein, D., and Brown, P. 2000. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11, 12, 4241--4257.Google ScholarCross Ref
Gavin, A., Bosche, M., Krause, R., and et. al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 6868, 123--124.Google Scholar
Hartuv, E. and Shamir, R. 2000. A clustering algorithm based on graph connectivity. Inform. Process. Lett. 76, 4-6, 175--181. Google ScholarDigital Library
Ho, Y., Gruhler, A., Heilbut, A., et al. 2002. Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature 415, 180--183.Google ScholarCross Ref
Holder, L., Cook, D., and Djoko, S. 1994. Substructure discovery in the subdue system. In Proceedings of the AAAI'94 Workshop on Knowledge Discvoery in Databases (KDD'94). 359--370.Google Scholar
Inokuchi, A., Washio, T., and Motoda, H. 2000. An apriori-based algorithm for mining frequent substructures from graph data. In Proceedings of the European Symposium on Principle of Data Mining and Knolwedge Discovery (PKDD'00). 12--23. Google ScholarDigital Library
Ito, T., Tashiro, K., Muta, S., Ozawa, R., Chiba, T., Nishizawa, M., Yamamoto, K., Kuhara, S., and Sakaki, Y. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine twohybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci.Google Scholar
Jeh, G. and Widom, J. 2004. Mining the space of graph properties. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04). Google ScholarDigital Library
Kuramochi, M. and Karypis, G. 2001. Frequent subgraph discovery. In Proceedings of the International Conference on Data Mining (ICDM'01). 313--320. Google ScholarDigital Library
Lee, H., Hsu, A., Sajdak, J., Qin, J., and Pavlidis, P. 2004. Coexpression analysis of human genes across many microarray data sets. Genome Resear. 14, 1085--1094.Google ScholarCross Ref
Lyons, D., Mahanty, S., Choi, K., Manandhar, M., and Elion, E. 1996. The Sh3-domain protein Bem1 coordinates mitogen-activated protein kinase cascade activation with cell cycle control in Saccharomyces cerevisiae. Mol. Cell Biol. 16, 4095--C4106.Google ScholarCross Ref
Matsuda, M., Ishihara, T., and Hashimoto, A. 1999. Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci. 210, 2, 305--325. Google ScholarDigital Library
Mewes, H., Frishman, D., Mayer, K., et al. 2006. MIPS: Analysis and annotation of proteins from whole genomes in 2005. Nucl. Acids Resear. 34 (Supplement 1), D169--D172.Google ScholarCross Ref
Moreau, Y., Aerts, S., Moor, B., Strooper, B., and Dabrowski, M. 2003. Comparison and meta-analysis of microarray data: From the bench to the computer desk. TRENDS Genetics 19, 570--577.Google ScholarCross Ref
Ng, A., Jordan, M., and Weiss, Y. 2001. On spectral clustering: analysis and an algorithm. Adv. Neural Inform. Process. Syst. 14.Google Scholar
Page, D. and Craven, M. 2003. Biological applications of multi-relational data mining. SIGKDD Explor. 5, 1, 69--79. Google ScholarDigital Library
Palmer, C., Gibbons, P., and Faloutsos, C. 2002. ANF: A fast and scalable tool for data mining in massive graphs. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'02). Google ScholarDigital Library
Pei, J., Jiang, D., and Zhang, A. 2005. On mining cross-graph quasi-cliques. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05). Google ScholarDigital Library
Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of International Conference on Principles of Knowledge Representation Reasoning (RR'92).Google Scholar
Segal, E., Wang, H., and Koller, D. 2003. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19, i264--i272.Google ScholarCross Ref
Shamir, R. and Sharan, R. 2000. Click: A clustering algorithm for gene expression analysis. In Proceedings of Intelligent Systems for Molecular Biology Conference (ISMB'00). Google ScholarDigital Library
Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D., and Futcher, B. 1998. Exploring the metabolic and genetic control of gene expression on a genomic scale. Mol. Biol. Cell 3273.Google Scholar
Stuart, J., Segal, E., Koller, D., and Kim, S. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249--255.Google ScholarCross Ref
Takahashi, Y., Satoh, Y., and Sasaki, S. 1987. Recognition of largest common fragment among a variety of chemical structures. Analyt. Sci., 23--28.Google Scholar
Tamayo, P., Solni, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., and Golub, T. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. 96, 6, 2907--2912.Google ScholarCross Ref
Tavazoie, S., Hughes, D., Campbell, M., Cho, R., and Church, G. 1999. Systematic determination of genetic network architecture. Nature Genet. 281--285.Google Scholar
Tong, A. H., Evangelista, M., Parsons, A. B., and et al. 2001. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294, 2364--2368.Google ScholarCross Ref
Tong, A. H. Y., Lesage, G., Bader, G. D., and et al. 2004. Global mapping of the yeast genetic interaction network. Science 303, 808--813.Google ScholarCross Ref
Tseng, G., Oh, M., Rohlin, L., Liao, J., and Wong, W. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Resear. 29, 2549--2557.Google ScholarCross Ref
Uetz, P., Giot, L., Cagney, G., and et. al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces Cerevisiae. Nature 403, 6770, 601--603.Google Scholar
Wang, C., Wang, W., Pei, J., Zhu, Y., and Shi, B. 2004. Scalable mining of large disk-based graph databases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04). Google ScholarDigital Library
Xu, Y., Olman, V., and Xu, D. 2002. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18, 536--545.Google ScholarCross Ref
Yan, X. and Han, J. 2000. gspan: Graph-based substructure pattern mining. In Proceedings of the International Conference on Data Mining (ICDM'02). Google ScholarDigital Library
Yan, X. and Han, J. 2003. Closegraph: Mining closed frequent graph patterns. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03). Google ScholarDigital Library
Yan, X., Yu, P., and Han, J. 2004. Graph indexing: A frequent structure-based approach. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD'04). Google ScholarDigital Library

Index Terms

Mining frequent cross-graph quasi-cliques
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

On mining cross-graph quasi-cliques
KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

Joint mining of multiple data sets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in cross-market customer segmentation, a group of customers who behave similarly in ...
Read More
Mining Frequent Subgraph Patterns from Uncertain Graph Data

In many real applications, graph data is subject to uncertainties due to incompleteness and imprecision of data. Mining such uncertain graph data is semantically different from and computationally more challenging than mining conventional exact graph ...
Read More
Mining Frequent Correlated-Quasi-Cliques from PPI Networks
ICIE '10: Proceedings of the 2010 WASE International Conference on Information Engineering - Volume 02

Many of the previous studies show convincing arguments that mining frequent subgraphs is especially useful. Many hidden frequent patterns which are very interesting can not be found by mining single graph. Previous studies as Quasi-Clique have little ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Knowledge Discovery from Data Volume 2, Issue 4
January 2009
154 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/1460797
Issue’s Table of Contents

Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 January 2009
- Accepted: 1 July 2008
- Revised: 1 March 2008
- Received: 1 August 2006
Published in tkdd Volume 2, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Graph mining
bioinformatics
clique
joint mining
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 60
  Total Citations
  View Citations
- 1,045
  Total Downloads
- Downloads (Last 12 months)17
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Mining frequent cross-graph quasi-cliques

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

On mining cross-graph quasi-cliques

Mining Frequent Subgraph Patterns from Uncertain Graph Data

Mining Frequent Correlated-Quasi-Cliques from PPI Networks

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Mining frequent cross-graph quasi-cliques

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

On mining cross-graph quasi-cliques

Mining Frequent Subgraph Patterns from Uncertain Graph Data

Mining Frequent Correlated-Quasi-Cliques from PPI Networks

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media