skip to main content
10.1145/1066157.1066228acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Mining periodic patterns with gap requirement from sequences

Published:14 June 2005Publication History

ABSTRACT

We study a problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem, and study their characteristics. We also present a case study in which we apply our algorithms on some DNA sequences. We discuss some interesting patterns obtained from the case study.

References

  1. Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403--410, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  2. A. Bairoch and B. Boeckmann. The swiss-prot protein sequence data bank. Nucleic Acids Research, 20(Suppl):2019--2022, 1992.Google ScholarGoogle Scholar
  3. Giorgio Bernardi, Birgitta Olofsson, Jan Filipski, Marino Zerial, Julio Salinas, Gerard Cuny, Michele Meunier-Rotival, and Francis Rodier. The mosaic genome of warm-blooded vertebrates. Science, 228(4702):953--958, 1985.Google ScholarGoogle ScholarCross RefCross Ref
  4. Eivind Coward and Finn Drablos. Detecting periodic patterns in biological sequences. Bioinformatics, 14(6):498--507, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  5. J. W. Fickett and C. S. Tung. Assessment of protein coding measures. Nucleuic Acids Research, 20:6441--6450, 1992.Google ScholarGoogle ScholarCross RefCross Ref
  6. Jiawei Han, Guozhu Dong, and YiWen Yin. Efficient mining of partial periodic patterns in time series database. In Proc. of 15th International Conference on Data Engineering, ICDE99, pages 106--115, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. H. Herzel, O. Weiss, and E. N. Trifonov. 10--11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics, 15(3):187--193, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  8. Inge Jonassen. Efficient discovery of conserved patterns using a pattern graph. Technical Report Report No. 118, University of Bergen, 1996.Google ScholarGoogle Scholar
  9. Stefan Kurtz, Enno Ohlebusch, Chris Schleiermacher, Jens Stoye, and Robert Giegerich. Computation and visualization of degenerate repeats in complete genomes. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular (ISMB-00), 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259--289, Nov 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. http://www.ncbi.nlm.nih.gov.Google ScholarGoogle Scholar
  12. Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Mei-Chun Hsu. Prefixspan: Mining sequential patterns by prefix-projected growth. In Proc. 17th IEEE International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. T. Imielinski R. Agrawal and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD International Conference on Management of Data, page 207, Washington, D.C., May 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. P. S. Reddy and D. E. Housman. The complex pathology of trinucleotide repeats. Current Opinion in Cell Biology, 9(3):364--372, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  15. Isidore Rigoutsos and Aris Floratos. Combinatorial pattern discovery in biological sequences: the teiresias algorithm. Bioinformatics, 14(1), 1998.Google ScholarGoogle Scholar
  16. Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the 5th Conference on Extending Database Technology (EDBT), Avignion, France, March 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Masaru Tomita, Masahiko Wada, and Yukihiro Kawashima. ApA dinucleotide periodicity in prokaryote, eukaryote, and organelle genomes. Journal of Molecular Evolution, 49:182--192, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  18. E. N. Trifonov. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A, 249:511--516, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  19. A. van Belkum, S. Scherer amd W. van Leeuwen, D. Willemse, L. van Alphen, and H. Verbrugh. Variable number of tandem repeats in clinical strains of haemophilus influenzae. Infection and Immunity, 65(12):5017--5027, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  20. J. Widom. Short-range order in two eukaryotic genomes: Relation to chromosome structure. Journal of Moleular Biology, 259:579--588, 1996.Google ScholarGoogle ScholarCross RefCross Ref
  21. Jiong Yang, Wei Wang, and Philip S. Yu. Mining asynchronous periodic patterns in time series data. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 275--279, Boston, MA USA, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Mohammed J. Zaki. Efficient enumeration of frequent sequences. In Proceedings of the 1998 ACM 7th International Conference on Information and Knowledge Management(CIKM'98), Washington, United States, November 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Minghua Zhang, Ben Kao, David W. Cheung, and Kevin Y. Yip. Mining periodic patterns with gap requirement from sequences. Technical Report TR-2005-04, Department of Computer Science, The University of Hong Kong, 2005.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Minghua Zhang, Ben Kao, C. L. Yip, and David Cheung. A GSP-based efficient algorithm for mining frequent sequences. In Proc. of IC-AI'2001, Las Vegas, Nevada, USA, June 2001.Google ScholarGoogle Scholar
  1. Mining periodic patterns with gap requirement from sequences

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data
          June 2005
          990 pages
          ISBN:1595930604
          DOI:10.1145/1066157
          • Conference Chair:
          • Fatma Ozcan

          Copyright © 2005 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 14 June 2005

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          Overall Acceptance Rate785of4,003submissions,20%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader