skip to main content
article

Mining periodic patterns with gap requirement from sequences

Published:01 August 2007Publication History
Skip Abstract Section

Abstract

We study a problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem and study their characteristics. We also present a case study in which we apply our algorithms on some DNA sequences. We discuss some interesting patterns obtained from the case study.

References

  1. Agarwal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. ACM, New York, 207--216. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. 1990. Basic local alignment search tool. J. Molec. Biol. 215, 403--410.Google ScholarGoogle ScholarCross RefCross Ref
  3. Bairoch, A. and Boeckmann, B. 1992. The swiss-prot protein sequence data bank. Nucleic Acids Res. 20, Suppl, 2019--2022.Google ScholarGoogle ScholarCross RefCross Ref
  4. Bernardi, G., Olofsson, B., Filipski, J., Zerial, M., Salinas, J., Cuny, G., Meunier-Rotival, M., and Rodier, F. 1985. The mosaic genome of warm-blooded vertebrates. Science 228, 4702, 953--958.Google ScholarGoogle Scholar
  5. Coward, E. and Drablos, F. 1998. Detecting periodic patterns in biological sequences. Bioinformatics 14, 6, 498--507.Google ScholarGoogle ScholarCross RefCross Ref
  6. Fickett, J. W. and Tung, C. S. 1992. Assessment of protein coding measures. Nucleuic Acids Res. 20, 6441--6450.Google ScholarGoogle ScholarCross RefCross Ref
  7. Han, J., Dong, G., and Yin, Y. 1999. Efficient mining of partial periodic patterns in time series database. In Proceedings of the 15th International Conference on Data Engineering, (ICDE99). 106--115. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Herzel, H., Weiss, O., and Trifonov, E. N. 1999. 10-11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics 15, 3, 187--193.Google ScholarGoogle ScholarCross RefCross Ref
  9. Jonassen, I. 1996. Efficient discovery of conserved patterns using a pattern graph. Tech. Rep. Report No. 118, University of Bergen.Google ScholarGoogle Scholar
  10. Kurtz, S., Ohlebusch, E., Schleiermacher, C., Stoye, J., and Giegerich, R. 2000. Computation and visualization of degenerate repeats in complete genomes. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular (ISMB-00). Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Mannila, H., Toivonen, H., and Verkamo, A. I. 1997. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery 1, 3 (Nov.), 259--289. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. NCBI. The National Center for Biotechnology Information web site. http://www.ncbi.nlm.nih.gov.Google ScholarGoogle Scholar
  13. Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., and Hsu, M.-C. 2001. Prefixspan: Mining sequential patterns by prefix-projected growth. In Proceedings of the 17th IEEE International Conference on Data Engineering (ICDE) (Heidelberg, Germany). IEEE Computer Society Press, Los Alamitos, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Reddy, P. and Housman, D. 1997. The complex pathology of trinucleotide repeats. Current Opinion in Cell Biology 9, 3, 364--372.Google ScholarGoogle ScholarCross RefCross Ref
  15. Rigoutsos, I. and Floratos, A. 1998. Combinatorial pattern discovery in biological sequences: the teiresias algorithm. Bioinformatics 14, 1.Google ScholarGoogle Scholar
  16. Srikant, R. and Agrawal, R. 1996. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th Conference on Extending Database Technology (EDBT) (Avignion, France). Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Tomita, M., Wada, M., and Kawashima, Y. 1999. ApA dinucleotide periodicity in prokaryote, eukaryote, and organelle genomes. J. Molec. Evolut. 49, 182--192.Google ScholarGoogle ScholarCross RefCross Ref
  18. Trifonov, E. N. 1998. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A 249, 511--516.Google ScholarGoogle ScholarCross RefCross Ref
  19. van Belkum, A., W. van Leeuwen, S. S., Willemse, D., van Alphen, L., and Verbrugh, H. 1997. Variable number of tandem repeats in clinical strains of haemophilus influenzae. Infect. Immun. 65, 12, 5017--5027.Google ScholarGoogle ScholarCross RefCross Ref
  20. Widom, J. 1996. Short-range order in two eukaryotic genomes: Relation to chromosome structure. J. Molec. Biol. 259, 579--588.Google ScholarGoogle ScholarCross RefCross Ref
  21. Yang, J., Wang, W., and Yu, P. S. 2000. Mining asynchronous periodic patterns in time series data. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Boston, MA). ACM, New York, 275--279. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Zaki, M. J. 1998. Efficient enumeration of frequent sequences. In Proceedings of the ACM 7th International Conference on Information and Knowledge Management (CIKM'98) (Washington). ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Zhang, M., Kao, B., Cheung, D. W., and Yip, K. Y. 2005. Mining periodic patterns with gap requirement from sequences. In Proceedings of the ACM SIGMOD Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Zhang, M., Kao, B., Yip, C., and Cheung, D. 2001. A GSP-based efficient algorithm for mining frequent sequences. In Proceedings of IC-AI'2001 (Las Vegas, NV).Google ScholarGoogle Scholar

Index Terms

  1. Mining periodic patterns with gap requirement from sequences

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Knowledge Discovery from Data
      ACM Transactions on Knowledge Discovery from Data  Volume 1, Issue 2
      August 2007
      89 pages
      ISSN:1556-4681
      EISSN:1556-472X
      DOI:10.1145/1267066
      Issue’s Table of Contents

      Copyright © 2007 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 August 2007
      Published in tkdd Volume 1, Issue 2

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader