Abstract
Joint mining of multiple datasets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in bioinformatics, jointly mining multiple gene expression datasets obtained by different labs or during various biological processes may overcome the heavy noise in the data. Moreover, by joint mining of gene expression data and protein-protein interaction data, we may discover clusters of genes which show coherent expression patterns and also produce interacting proteins. Such clusters may be potential pathways.
In this article, we investigate a novel data mining problem, mining frequent cross-graph quasi-cliques, which is generalized from several interesting applications in bioinformatics, cross-market customer segmentation, social network analysis, and Web mining. In a graph, a set of vertices S is a γ-quasi-clique (0 < γ ≤ 1) if each vertex v in S directly connects to at least γ ⋅ (|S| − 1) other vertices in S. Given a set of graphs G1, …, Gn and parameter min_sup (0 < min_sup ≤ 1), a set of vertices S is a frequent cross-graph quasi-clique if S is a γ-quasi-clique in at least min_sup ⋅ n graphs, and there does not exist a proper superset of S having the property.
We build a general model, show why the complete set of frequent cross-graph quasi-cliques cannot be found by previous data mining methods, and study the complexity of the problem. While the problem is difficult, we develop practical algorithms which exploit several interesting and effective techniques and heuristics to efficaciously mine frequent cross-graph quasi-cliques. A systematic performance study is reported on both synthetic and real data sets. We demonstrate some interesting and meaningful frequent cross-graph quasi-cliques in bioinformatics. The experimental results also show that our algorithms are efficient and scalable.
- Abello, J., Resende, M., and Sudarsky, S. 2002. Massive quasi-clique detection. Proceedings of the Latin-American Symposium on Theoretical Intormatics, 598--612. Google ScholarDigital Library
- Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., and Levine, A. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array. Proc. Natl. Acad. Sci. 96, 12, 6745--6750.Google ScholarCross Ref
- Bader, G. and Hogue, C. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 1, 2.Google ScholarCross Ref
- Barab´si, A. and Albert, R. 1999. Emergence of scaling in random networks. Science 286.Google Scholar
- Bayada, D., Simpson, R., and Johnson, A. 1992. An algorithm for the multiple common subgraph problem. J. Chemi. Inform. Comput. Sci., 680--685.Google Scholar
- Beissbarth, T., Fellenberg, K., Brors, B., et al. 2000. Processing and quality control of DNA array hybridization data. Bioinformatics 16, 1014--1022.Google ScholarCross Ref
- Ben-Dor, A., Shamir, R., and Yakhini, Z. 1999. Clustering gene expression patterns. J. Computati. Biol. 6, 3/4, 281--297.Google Scholar
- Brazma, A. and Vilo, J. 2000. Minireview: Gene expression data analysis. Federation European Biochemical Societies 480, 17--24.Google ScholarCross Ref
- Bu, D., Zhao, Y., Cai, L., et al. 2003. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucl. Acids Research. 31, 9, 2443--2450.Google ScholarCross Ref
- Cho, R., Campbell, M., Winzeler, E., et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mole. Cell 2, 1, 65--73.Google Scholar
- Choi, K., Satterberg, B., Lyons, D., and Elion, E. 1994. Ste5 tethers multiple protein kinases in the map kinase cascade required for mating in S. cerevisiae. Cell 78, 499--C512.Google ScholarCross Ref
- Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P., and Herskowitz, I. 1998. The transcriptional program of sporulation in budding yeast. Science 282, 5389, 699--705.Google Scholar
- DeRisi, J., Iyer, V., and Brown, P. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 680--686.Google Scholar
- Ding, C., He, X., Zha, H., Gu, M., and Simon, H. 2001. A min-max cut algorithm for graph partitioning and data clustering. In Proceedings of the 1st IEEE International Conference on Data Mining. 107--114. Google ScholarDigital Library
- Dongen, S. 2000. Graph clustering by flow simulation. PhD Thesis, University of Utrecht, The Netherlands.Google Scholar
- Dzeroski, S. and Raedt, L. D. 2003. On multi relational data mining (mrdm). SIGKDD Explorations 5, 1. Google ScholarDigital Library
- Eisen, M., Spellman, P., Brown, P., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95, 14863--14868.Google ScholarCross Ref
- Enright, A., Dongen, S., and Ouzounis, C. 2002. An efficient algorithm for large-scale detection of protein families. Nucl. Acids Resear. 30, 7, 1575--1584.Google ScholarCross Ref
- Faloutsos, C., McCurley, K., and Tomkins, A. 2004. Fast discovery of connection subgraphs. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'04). Google ScholarDigital Library
- Garey, M. and Johnson, D. 1979. Computers and Intractability: a Guide to The Theory of NP-Completeness. Freeman and Company, New York. Google ScholarDigital Library
- Gasch, A., Huang, M., Metzner, S., Botstein, D., Elledge, S., and Brown, P. 2001. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell 12, 10, 2987--3003.Google ScholarCross Ref
- Gasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M., Storz, G., Botstein, D., and Brown, P. 2000. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11, 12, 4241--4257.Google ScholarCross Ref
- Gavin, A., Bosche, M., Krause, R., and et. al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 6868, 123--124.Google Scholar
- Hartuv, E. and Shamir, R. 2000. A clustering algorithm based on graph connectivity. Inform. Process. Lett. 76, 4-6, 175--181. Google ScholarDigital Library
- Ho, Y., Gruhler, A., Heilbut, A., et al. 2002. Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature 415, 180--183.Google ScholarCross Ref
- Holder, L., Cook, D., and Djoko, S. 1994. Substructure discovery in the subdue system. In Proceedings of the AAAI'94 Workshop on Knowledge Discvoery in Databases (KDD'94). 359--370.Google Scholar
- Inokuchi, A., Washio, T., and Motoda, H. 2000. An apriori-based algorithm for mining frequent substructures from graph data. In Proceedings of the European Symposium on Principle of Data Mining and Knolwedge Discovery (PKDD'00). 12--23. Google ScholarDigital Library
- Ito, T., Tashiro, K., Muta, S., Ozawa, R., Chiba, T., Nishizawa, M., Yamamoto, K., Kuhara, S., and Sakaki, Y. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine twohybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci.Google Scholar
- Jeh, G. and Widom, J. 2004. Mining the space of graph properties. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04). Google ScholarDigital Library
- Kuramochi, M. and Karypis, G. 2001. Frequent subgraph discovery. In Proceedings of the International Conference on Data Mining (ICDM'01). 313--320. Google ScholarDigital Library
- Lee, H., Hsu, A., Sajdak, J., Qin, J., and Pavlidis, P. 2004. Coexpression analysis of human genes across many microarray data sets. Genome Resear. 14, 1085--1094.Google ScholarCross Ref
- Lyons, D., Mahanty, S., Choi, K., Manandhar, M., and Elion, E. 1996. The Sh3-domain protein Bem1 coordinates mitogen-activated protein kinase cascade activation with cell cycle control in Saccharomyces cerevisiae. Mol. Cell Biol. 16, 4095--C4106.Google ScholarCross Ref
- Matsuda, M., Ishihara, T., and Hashimoto, A. 1999. Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci. 210, 2, 305--325. Google ScholarDigital Library
- Mewes, H., Frishman, D., Mayer, K., et al. 2006. MIPS: Analysis and annotation of proteins from whole genomes in 2005. Nucl. Acids Resear. 34 (Supplement 1), D169--D172.Google ScholarCross Ref
- Moreau, Y., Aerts, S., Moor, B., Strooper, B., and Dabrowski, M. 2003. Comparison and meta-analysis of microarray data: From the bench to the computer desk. TRENDS Genetics 19, 570--577.Google ScholarCross Ref
- Ng, A., Jordan, M., and Weiss, Y. 2001. On spectral clustering: analysis and an algorithm. Adv. Neural Inform. Process. Syst. 14.Google Scholar
- Page, D. and Craven, M. 2003. Biological applications of multi-relational data mining. SIGKDD Explor. 5, 1, 69--79. Google ScholarDigital Library
- Palmer, C., Gibbons, P., and Faloutsos, C. 2002. ANF: A fast and scalable tool for data mining in massive graphs. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'02). Google ScholarDigital Library
- Pei, J., Jiang, D., and Zhang, A. 2005. On mining cross-graph quasi-cliques. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05). Google ScholarDigital Library
- Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of International Conference on Principles of Knowledge Representation Reasoning (RR'92).Google Scholar
- Segal, E., Wang, H., and Koller, D. 2003. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19, i264--i272.Google ScholarCross Ref
- Shamir, R. and Sharan, R. 2000. Click: A clustering algorithm for gene expression analysis. In Proceedings of Intelligent Systems for Molecular Biology Conference (ISMB'00). Google ScholarDigital Library
- Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D., and Futcher, B. 1998. Exploring the metabolic and genetic control of gene expression on a genomic scale. Mol. Biol. Cell 3273.Google Scholar
- Stuart, J., Segal, E., Koller, D., and Kim, S. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249--255.Google ScholarCross Ref
- Takahashi, Y., Satoh, Y., and Sasaki, S. 1987. Recognition of largest common fragment among a variety of chemical structures. Analyt. Sci., 23--28.Google Scholar
- Tamayo, P., Solni, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., and Golub, T. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. 96, 6, 2907--2912.Google ScholarCross Ref
- Tavazoie, S., Hughes, D., Campbell, M., Cho, R., and Church, G. 1999. Systematic determination of genetic network architecture. Nature Genet. 281--285.Google Scholar
- Tong, A. H., Evangelista, M., Parsons, A. B., and et al. 2001. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294, 2364--2368.Google ScholarCross Ref
- Tong, A. H. Y., Lesage, G., Bader, G. D., and et al. 2004. Global mapping of the yeast genetic interaction network. Science 303, 808--813.Google ScholarCross Ref
- Tseng, G., Oh, M., Rohlin, L., Liao, J., and Wong, W. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Resear. 29, 2549--2557.Google ScholarCross Ref
- Uetz, P., Giot, L., Cagney, G., and et. al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces Cerevisiae. Nature 403, 6770, 601--603.Google Scholar
- Wang, C., Wang, W., Pei, J., Zhu, Y., and Shi, B. 2004. Scalable mining of large disk-based graph databases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04). Google ScholarDigital Library
- Xu, Y., Olman, V., and Xu, D. 2002. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18, 536--545.Google ScholarCross Ref
- Yan, X. and Han, J. 2000. gspan: Graph-based substructure pattern mining. In Proceedings of the International Conference on Data Mining (ICDM'02). Google ScholarDigital Library
- Yan, X. and Han, J. 2003. Closegraph: Mining closed frequent graph patterns. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03). Google ScholarDigital Library
- Yan, X., Yu, P., and Han, J. 2004. Graph indexing: A frequent structure-based approach. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD'04). Google ScholarDigital Library
Index Terms
- Mining frequent cross-graph quasi-cliques
Recommendations
On mining cross-graph quasi-cliques
KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data miningJoint mining of multiple data sets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in cross-market customer segmentation, a group of customers who behave similarly in ...
Mining Frequent Subgraph Patterns from Uncertain Graph Data
In many real applications, graph data is subject to uncertainties due to incompleteness and imprecision of data. Mining such uncertain graph data is semantically different from and computationally more challenging than mining conventional exact graph ...
Mining Frequent Correlated-Quasi-Cliques from PPI Networks
ICIE '10: Proceedings of the 2010 WASE International Conference on Information Engineering - Volume 02Many of the previous studies show convincing arguments that mining frequent subgraphs is especially useful. Many hidden frequent patterns which are very interesting can not be found by mining single graph. Previous studies as Quasi-Clique have little ...
Comments