skip to main content
research-article

Mining frequent cross-graph quasi-cliques

Published:16 January 2009Publication History
Skip Abstract Section

Abstract

Joint mining of multiple datasets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in bioinformatics, jointly mining multiple gene expression datasets obtained by different labs or during various biological processes may overcome the heavy noise in the data. Moreover, by joint mining of gene expression data and protein-protein interaction data, we may discover clusters of genes which show coherent expression patterns and also produce interacting proteins. Such clusters may be potential pathways.

In this article, we investigate a novel data mining problem, mining frequent cross-graph quasi-cliques, which is generalized from several interesting applications in bioinformatics, cross-market customer segmentation, social network analysis, and Web mining. In a graph, a set of vertices S is a γ-quasi-clique (0 < γ ≤ 1) if each vertex v in S directly connects to at least γ ⋅ (|S| − 1) other vertices in S. Given a set of graphs G1, …, Gn and parameter min_sup (0 < min_sup ≤ 1), a set of vertices S is a frequent cross-graph quasi-clique if S is a γ-quasi-clique in at least min_supn graphs, and there does not exist a proper superset of S having the property.

We build a general model, show why the complete set of frequent cross-graph quasi-cliques cannot be found by previous data mining methods, and study the complexity of the problem. While the problem is difficult, we develop practical algorithms which exploit several interesting and effective techniques and heuristics to efficaciously mine frequent cross-graph quasi-cliques. A systematic performance study is reported on both synthetic and real data sets. We demonstrate some interesting and meaningful frequent cross-graph quasi-cliques in bioinformatics. The experimental results also show that our algorithms are efficient and scalable.

References

  1. Abello, J., Resende, M., and Sudarsky, S. 2002. Massive quasi-clique detection. Proceedings of the Latin-American Symposium on Theoretical Intormatics, 598--612. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., and Levine, A. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array. Proc. Natl. Acad. Sci. 96, 12, 6745--6750.Google ScholarGoogle ScholarCross RefCross Ref
  3. Bader, G. and Hogue, C. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 1, 2.Google ScholarGoogle ScholarCross RefCross Ref
  4. Barab´si, A. and Albert, R. 1999. Emergence of scaling in random networks. Science 286.Google ScholarGoogle Scholar
  5. Bayada, D., Simpson, R., and Johnson, A. 1992. An algorithm for the multiple common subgraph problem. J. Chemi. Inform. Comput. Sci., 680--685.Google ScholarGoogle Scholar
  6. Beissbarth, T., Fellenberg, K., Brors, B., et al. 2000. Processing and quality control of DNA array hybridization data. Bioinformatics 16, 1014--1022.Google ScholarGoogle ScholarCross RefCross Ref
  7. Ben-Dor, A., Shamir, R., and Yakhini, Z. 1999. Clustering gene expression patterns. J. Computati. Biol. 6, 3/4, 281--297.Google ScholarGoogle Scholar
  8. Brazma, A. and Vilo, J. 2000. Minireview: Gene expression data analysis. Federation European Biochemical Societies 480, 17--24.Google ScholarGoogle ScholarCross RefCross Ref
  9. Bu, D., Zhao, Y., Cai, L., et al. 2003. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucl. Acids Research. 31, 9, 2443--2450.Google ScholarGoogle ScholarCross RefCross Ref
  10. Cho, R., Campbell, M., Winzeler, E., et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mole. Cell 2, 1, 65--73.Google ScholarGoogle Scholar
  11. Choi, K., Satterberg, B., Lyons, D., and Elion, E. 1994. Ste5 tethers multiple protein kinases in the map kinase cascade required for mating in S. cerevisiae. Cell 78, 499--C512.Google ScholarGoogle ScholarCross RefCross Ref
  12. Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P., and Herskowitz, I. 1998. The transcriptional program of sporulation in budding yeast. Science 282, 5389, 699--705.Google ScholarGoogle Scholar
  13. DeRisi, J., Iyer, V., and Brown, P. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 680--686.Google ScholarGoogle Scholar
  14. Ding, C., He, X., Zha, H., Gu, M., and Simon, H. 2001. A min-max cut algorithm for graph partitioning and data clustering. In Proceedings of the 1st IEEE International Conference on Data Mining. 107--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Dongen, S. 2000. Graph clustering by flow simulation. PhD Thesis, University of Utrecht, The Netherlands.Google ScholarGoogle Scholar
  16. Dzeroski, S. and Raedt, L. D. 2003. On multi relational data mining (mrdm). SIGKDD Explorations 5, 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Eisen, M., Spellman, P., Brown, P., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95, 14863--14868.Google ScholarGoogle ScholarCross RefCross Ref
  18. Enright, A., Dongen, S., and Ouzounis, C. 2002. An efficient algorithm for large-scale detection of protein families. Nucl. Acids Resear. 30, 7, 1575--1584.Google ScholarGoogle ScholarCross RefCross Ref
  19. Faloutsos, C., McCurley, K., and Tomkins, A. 2004. Fast discovery of connection subgraphs. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Garey, M. and Johnson, D. 1979. Computers and Intractability: a Guide to The Theory of NP-Completeness. Freeman and Company, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Gasch, A., Huang, M., Metzner, S., Botstein, D., Elledge, S., and Brown, P. 2001. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell 12, 10, 2987--3003.Google ScholarGoogle ScholarCross RefCross Ref
  22. Gasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M., Storz, G., Botstein, D., and Brown, P. 2000. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11, 12, 4241--4257.Google ScholarGoogle ScholarCross RefCross Ref
  23. Gavin, A., Bosche, M., Krause, R., and et. al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 6868, 123--124.Google ScholarGoogle Scholar
  24. Hartuv, E. and Shamir, R. 2000. A clustering algorithm based on graph connectivity. Inform. Process. Lett. 76, 4-6, 175--181. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Ho, Y., Gruhler, A., Heilbut, A., et al. 2002. Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature 415, 180--183.Google ScholarGoogle ScholarCross RefCross Ref
  26. Holder, L., Cook, D., and Djoko, S. 1994. Substructure discovery in the subdue system. In Proceedings of the AAAI'94 Workshop on Knowledge Discvoery in Databases (KDD'94). 359--370.Google ScholarGoogle Scholar
  27. Inokuchi, A., Washio, T., and Motoda, H. 2000. An apriori-based algorithm for mining frequent substructures from graph data. In Proceedings of the European Symposium on Principle of Data Mining and Knolwedge Discovery (PKDD'00). 12--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Ito, T., Tashiro, K., Muta, S., Ozawa, R., Chiba, T., Nishizawa, M., Yamamoto, K., Kuhara, S., and Sakaki, Y. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine twohybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci.Google ScholarGoogle Scholar
  29. Jeh, G. and Widom, J. 2004. Mining the space of graph properties. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Kuramochi, M. and Karypis, G. 2001. Frequent subgraph discovery. In Proceedings of the International Conference on Data Mining (ICDM'01). 313--320. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Lee, H., Hsu, A., Sajdak, J., Qin, J., and Pavlidis, P. 2004. Coexpression analysis of human genes across many microarray data sets. Genome Resear. 14, 1085--1094.Google ScholarGoogle ScholarCross RefCross Ref
  32. Lyons, D., Mahanty, S., Choi, K., Manandhar, M., and Elion, E. 1996. The Sh3-domain protein Bem1 coordinates mitogen-activated protein kinase cascade activation with cell cycle control in Saccharomyces cerevisiae. Mol. Cell Biol. 16, 4095--C4106.Google ScholarGoogle ScholarCross RefCross Ref
  33. Matsuda, M., Ishihara, T., and Hashimoto, A. 1999. Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci. 210, 2, 305--325. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Mewes, H., Frishman, D., Mayer, K., et al. 2006. MIPS: Analysis and annotation of proteins from whole genomes in 2005. Nucl. Acids Resear. 34 (Supplement 1), D169--D172.Google ScholarGoogle ScholarCross RefCross Ref
  35. Moreau, Y., Aerts, S., Moor, B., Strooper, B., and Dabrowski, M. 2003. Comparison and meta-analysis of microarray data: From the bench to the computer desk. TRENDS Genetics 19, 570--577.Google ScholarGoogle ScholarCross RefCross Ref
  36. Ng, A., Jordan, M., and Weiss, Y. 2001. On spectral clustering: analysis and an algorithm. Adv. Neural Inform. Process. Syst. 14.Google ScholarGoogle Scholar
  37. Page, D. and Craven, M. 2003. Biological applications of multi-relational data mining. SIGKDD Explor. 5, 1, 69--79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Palmer, C., Gibbons, P., and Faloutsos, C. 2002. ANF: A fast and scalable tool for data mining in massive graphs. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'02). Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Pei, J., Jiang, D., and Zhang, A. 2005. On mining cross-graph quasi-cliques. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Rymon, R. 1992. Search through systematic set enumeration. In Proceedings of International Conference on Principles of Knowledge Representation Reasoning (RR'92).Google ScholarGoogle Scholar
  41. Segal, E., Wang, H., and Koller, D. 2003. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 19, i264--i272.Google ScholarGoogle ScholarCross RefCross Ref
  42. Shamir, R. and Sharan, R. 2000. Click: A clustering algorithm for gene expression analysis. In Proceedings of Intelligent Systems for Molecular Biology Conference (ISMB'00). Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Spellman, P., Sherlock, G., Zhang, M., Iyer, V., Anders, K., Eisen, M., Brown, P., Botstein, D., and Futcher, B. 1998. Exploring the metabolic and genetic control of gene expression on a genomic scale. Mol. Biol. Cell 3273.Google ScholarGoogle Scholar
  44. Stuart, J., Segal, E., Koller, D., and Kim, S. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249--255.Google ScholarGoogle ScholarCross RefCross Ref
  45. Takahashi, Y., Satoh, Y., and Sasaki, S. 1987. Recognition of largest common fragment among a variety of chemical structures. Analyt. Sci., 23--28.Google ScholarGoogle Scholar
  46. Tamayo, P., Solni, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., and Golub, T. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. 96, 6, 2907--2912.Google ScholarGoogle ScholarCross RefCross Ref
  47. Tavazoie, S., Hughes, D., Campbell, M., Cho, R., and Church, G. 1999. Systematic determination of genetic network architecture. Nature Genet. 281--285.Google ScholarGoogle Scholar
  48. Tong, A. H., Evangelista, M., Parsons, A. B., and et al. 2001. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294, 2364--2368.Google ScholarGoogle ScholarCross RefCross Ref
  49. Tong, A. H. Y., Lesage, G., Bader, G. D., and et al. 2004. Global mapping of the yeast genetic interaction network. Science 303, 808--813.Google ScholarGoogle ScholarCross RefCross Ref
  50. Tseng, G., Oh, M., Rohlin, L., Liao, J., and Wong, W. 2001. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Resear. 29, 2549--2557.Google ScholarGoogle ScholarCross RefCross Ref
  51. Uetz, P., Giot, L., Cagney, G., and et. al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces Cerevisiae. Nature 403, 6770, 601--603.Google ScholarGoogle Scholar
  52. Wang, C., Wang, W., Pei, J., Zhu, Y., and Shi, B. 2004. Scalable mining of large disk-based graph databases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Xu, Y., Olman, V., and Xu, D. 2002. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18, 536--545.Google ScholarGoogle ScholarCross RefCross Ref
  54. Yan, X. and Han, J. 2000. gspan: Graph-based substructure pattern mining. In Proceedings of the International Conference on Data Mining (ICDM'02). Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Yan, X. and Han, J. 2003. Closegraph: Mining closed frequent graph patterns. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03). Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Yan, X., Yu, P., and Han, J. 2004. Graph indexing: A frequent structure-based approach. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD'04). Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Mining frequent cross-graph quasi-cliques

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Knowledge Discovery from Data
      ACM Transactions on Knowledge Discovery from Data  Volume 2, Issue 4
      January 2009
      154 pages
      ISSN:1556-4681
      EISSN:1556-472X
      DOI:10.1145/1460797
      Issue’s Table of Contents

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 January 2009
      • Accepted: 1 July 2008
      • Revised: 1 March 2008
      • Received: 1 August 2006
      Published in tkdd Volume 2, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader