Skip to main content
Erschienen in: The VLDB Journal 6/2014

01.12.2014 | Special Issue Paper

ACME: A scalable parallel system for extracting frequent patterns from a very long sequence

verfasst von: Majed Sahli, Essam Mansour, Panos Kalnis

Erschienen in: The VLDB Journal | Ausgabe 6/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Modern applications, including bioinformatics, time series, and web log analysis, require the extraction of frequent patterns, called motifs, from one very long (i.e., several gigabytes) sequence. Existing approaches are either heuristics that are error-prone, or exact (also called combinatorial) methods that are extremely slow, therefore, applicable only to very small sequences (i.e., in the order of megabytes). This paper presents ACME, a combinatorial approach that scales to gigabyte-long sequences and is the first to support supermaximal motifs. ACME is a versatile parallel system that can be deployed on desktop multi-core systems, or on thousands of CPUs in the cloud. However, merely using more compute nodes does not guarantee efficiency, because of the related overheads. To this end, ACME introduces an automatic tuning mechanism that suggests the appropriate number of CPUs to utilize, in order to meet the user constraints in terms of run time, while minimizing the financial cost of cloud resources. Our experiments show that, compared to the state of the art, ACME supports three orders of magnitude longer sequences (e.g., DNA for the entire human genome); handles large alphabets (e.g., English alphabet for Wikipedia); scales out to 16,384 CPUs on a supercomputer; and supports elastic deployment in the cloud.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
There exist several parallel approaches [4, 6, 7, 16] for the much simpler common motifs problem.
 
2
FAST stands for fine-grained adaptive sub-tasks.
 
3
For simplicity, the discussion assumes that each CPU executes a single process. In practice, our implementation executes one process per core.
 
4
The family of Gamma distributions contains many different shapes. To verify that our sample generates the desired leptokurtic, positively skewed nonsymmetric distribution, we check: \({4> \alpha > 1}\) and \({\beta < \alpha }\).
 
5
CAST stands for cache aware search space traversal.
 
6
ACME code and the used datasets are available online at: http://​cloud.​kaust.​edu.​sa/​Pages/​acme_​software.​aspx.
 
Literatur
1.
Zurück zum Zitat Apostolico, A., Comin, M., Parida, L.: VARUN: discovering extensible motifs under saturation constraints. IEEE/ACM Trans. Comput. Biol. Bioinform. 7(4), 752–762 (2010)CrossRef Apostolico, A., Comin, M., Parida, L.: VARUN: discovering extensible motifs under saturation constraints. IEEE/ACM Trans. Comput. Biol. Bioinform. 7(4), 752–762 (2010)CrossRef
2.
Zurück zum Zitat Becher, V., Deymonnaz, A., Heiber, P.: Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome. Bioinformatics 25(14), 1746–53 (2009)CrossRef Becher, V., Deymonnaz, A., Heiber, P.: Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome. Bioinformatics 25(14), 1746–53 (2009)CrossRef
3.
Zurück zum Zitat Carvalho, A.M., Oliveira, A.L., Freitas, A.T., Sagot, M.F.: A parallel algorithm for the extraction of structured motifs. In: Proceedings of the ACM Symposium on Applied Computing (SAC), pp. 147–153 (2004) Carvalho, A.M., Oliveira, A.L., Freitas, A.T., Sagot, M.F.: A parallel algorithm for the extraction of structured motifs. In: Proceedings of the ACM Symposium on Applied Computing (SAC), pp. 147–153 (2004)
4.
Zurück zum Zitat Challa, S., Thulasiraman, P.: Protein sequence motif discovery on distributed supercomputer. In: Proceedings of the International Conference on Advances in Grid and Pervasive Computing (GPC), pp. 232–243 (2008) Challa, S., Thulasiraman, P.: Protein sequence motif discovery on distributed supercomputer. In: Proceedings of the International Conference on Advances in Grid and Pervasive Computing (GPC), pp. 232–243 (2008)
5.
Zurück zum Zitat Das, M.K., Dai, H.K.: A survey of DNA motif finding algorithms. BMC Bioinform. 8(S-7), S21 (2007) Das, M.K., Dai, H.K.: A survey of DNA motif finding algorithms. BMC Bioinform. 8(S-7), S21 (2007)
6.
Zurück zum Zitat Dasari, N.S., Desh, R., Zubair, M.: An efficient multicore implementation of planted motif problem. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS), pp. 9–15 (2010) Dasari, N.S., Desh, R., Zubair, M.: An efficient multicore implementation of planted motif problem. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS), pp. 9–15 (2010)
7.
Zurück zum Zitat Dasari, N.S., Ranjan, D., Zubair, M.: High performance implementation of planted motif problem using suffix trees. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS), pp. 200–206 (2011) Dasari, N.S., Ranjan, D., Zubair, M.: High performance implementation of planted motif problem using suffix trees. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS), pp. 200–206 (2011)
8.
Zurück zum Zitat Federico, M., Pisanti, N.: Suffix tree characterization of maximal motifs in biological sequences. Theor. Comput. Sci. 410(43), 4391–4401 (2009)CrossRefMATHMathSciNet Federico, M., Pisanti, N.: Suffix tree characterization of maximal motifs in biological sequences. Theor. Comput. Sci. 410(43), 4391–4401 (2009)CrossRefMATHMathSciNet
9.
Zurück zum Zitat Floratou, A., Tata, S., Patel, J.M.: Efficient and accurate discovery of patterns in sequence data sets. IEEE Trans. Knowl. Data Eng. 23(8), 1154–1168 (2011)CrossRef Floratou, A., Tata, S., Patel, J.M.: Efficient and accurate discovery of patterns in sequence data sets. IEEE Trans. Knowl. Data Eng. 23(8), 1154–1168 (2011)CrossRef
10.
Zurück zum Zitat Grossi, R., Pietracaprina, A., Pisanti, N., Pucci, G., Upfal, E., Vandin, F., Salzberg, S., Warnow, T.: MADMX: a novel strategy for maximal dense motif extraction. In: Proceedings of Workshop on Algorithms in Bioinformatics, pp. 362–374 (2009) Grossi, R., Pietracaprina, A., Pisanti, N., Pucci, G., Upfal, E., Vandin, F., Salzberg, S., Warnow, T.: MADMX: a novel strategy for maximal dense motif extraction. In: Proceedings of Workshop on Algorithms in Bioinformatics, pp. 362–374 (2009)
11.
Zurück zum Zitat Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge (1997) Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge (1997)
12.
Zurück zum Zitat Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1–12 (2000) Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1–12 (2000)
13.
Zurück zum Zitat Huang, E., Yang, L., Chowdhary, R., Kassim, A., Bajic, V.B.: An algorithm for ab initio dna motif detection. Inf. Process. Living Syst. 2, 611–614 (2005) Huang, E., Yang, L., Chowdhary, R., Kassim, A., Bajic, V.B.: An algorithm for ab initio dna motif detection. Inf. Process. Living Syst. 2, 611–614 (2005)
14.
Zurück zum Zitat Huang, C.W., Lee, W.S., Hsieh, S.Y.: An improved heuristic algorithm for finding motif signals in DNA sequences. IEEE/ACM Trans. Comput. Biol. Bioinform. 8(4), 959–975 (2011)CrossRef Huang, C.W., Lee, W.S., Hsieh, S.Y.: An improved heuristic algorithm for finding motif signals in DNA sequences. IEEE/ACM Trans. Comput. Biol. Bioinform. 8(4), 959–975 (2011)CrossRef
15.
Zurück zum Zitat Kleinrock, L.: Queueing Systems, vol. I: Theory. Wiley, New York (1975)MATH Kleinrock, L.: Queueing Systems, vol. I: Theory. Wiley, New York (1975)MATH
16.
Zurück zum Zitat Liu, Y., Schmidt, B., Maskell, D.L.: An ultrafast scalable many-core motif discovery algorithm for multiple gpus. In: Proceedings of the International Symposium on Parallel and Distributed Processing, pp. 428–434 (2011) Liu, Y., Schmidt, B., Maskell, D.L.: An ultrafast scalable many-core motif discovery algorithm for multiple gpus. In: Proceedings of the International Symposium on Parallel and Distributed Processing, pp. 428–434 (2011)
17.
Zurück zum Zitat Mabroukeh, N.R., Ezeife, C.I.: A taxonomy of sequential pattern mining algorithms. ACM Comput. Surv. 43(1), 1–41 (2010)CrossRef Mabroukeh, N.R., Ezeife, C.I.: A taxonomy of sequential pattern mining algorithms. ACM Comput. Surv. 43(1), 1–41 (2010)CrossRef
18.
Zurück zum Zitat Mansour, E., Allam, A., Skiadopoulos, S., Kalnis, P.: Era: efficient serial and parallel suffix tree construction for very long strings. Proc. VLDB Endow. 5(1), 49–60 (2011)CrossRef Mansour, E., Allam, A., Skiadopoulos, S., Kalnis, P.: Era: efficient serial and parallel suffix tree construction for very long strings. Proc. VLDB Endow. 5(1), 49–60 (2011)CrossRef
19.
Zurück zum Zitat Marchand, B., Bajic, V.B., Kaushik, D.K.: Highly scalable ab initio genomic motif identification. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 56:1–56:10 (2011) Marchand, B., Bajic, V.B., Kaushik, D.K.: Highly scalable ab initio genomic motif identification. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 56:1–56:10 (2011)
20.
Zurück zum Zitat Marsan, L., Sagot, M.F.: Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comput. Biol. 7(3–4), 345–362 (2000)CrossRef Marsan, L., Sagot, M.F.: Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comput. Biol. 7(3–4), 345–362 (2000)CrossRef
21.
Zurück zum Zitat Meisner, D., Wenisch, T.F.: Stochastic queuing simulation for data center workloads. In: Exascale Evaluation and Research Techniques Workshop (2010) Meisner, D., Wenisch, T.F.: Stochastic queuing simulation for data center workloads. In: Exascale Evaluation and Research Techniques Workshop (2010)
22.
Zurück zum Zitat Mueen, A., Keogh, E.: Online discovery and maintenance of time series motifs. In: Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 1089–1098 (2010) Mueen, A., Keogh, E.: Online discovery and maintenance of time series motifs. In: Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 1089–1098 (2010)
23.
Zurück zum Zitat Papoulis, A., Pillai, S.U.: Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York (2002) Papoulis, A., Pillai, S.U.: Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York (2002)
24.
Zurück zum Zitat Sagot, M.F.: Spelling approximate repeated or common motifs using a suffix tree. In: Proceedings of 3rd Latin American Symposium on Theoretical Informatics, pp. 374–390 (1998) Sagot, M.F.: Spelling approximate repeated or common motifs using a suffix tree. In: Proceedings of 3rd Latin American Symposium on Theoretical Informatics, pp. 374–390 (1998)
25.
Zurück zum Zitat Sahli, M., Mansour, E., Kalnis, P.: Parallel motif extraction from very long sequences. In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM) (2013) Sahli, M., Mansour, E., Kalnis, P.: Parallel motif extraction from very long sequences. In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM) (2013)
26.
Zurück zum Zitat Saxena, K., Shukla, R.: Significant interval and frequent pattern discovery in web log data. Int. J. Comput. Sci. Issues 7(1(3)), 29–36 (2010) Saxena, K., Shukla, R.: Significant interval and frequent pattern discovery in web log data. Int. J. Comput. Sci. Issues 7(1(3)), 29–36 (2010)
27.
Zurück zum Zitat Schad, J., Dittrich, J., Quiané-Ruiz, J.A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proc. VLDB Endow. 3(1–2), 460–471 (2010)CrossRef Schad, J., Dittrich, J., Quiané-Ruiz, J.A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proc. VLDB Endow. 3(1–2), 460–471 (2010)CrossRef
28.
Zurück zum Zitat Tsirogiannis, D., Koudas, N.: Suffix tree construction algorithms on modern hardware. In: Proceedings of the International Conference on Extending Database Technology (EDBT), pp. 263–274 (2010) Tsirogiannis, D., Koudas, N.: Suffix tree construction algorithms on modern hardware. In: Proceedings of the International Conference on Extending Database Technology (EDBT), pp. 263–274 (2010)
30.
Zurück zum Zitat Xie, X., Mikkelsen, T.S., Gnirke, A., Lindblad-Toh, K., Kellis, M., Lander, E.S.: Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of ctcf insulator sites. Proc. Natl. Acad. Sci. 104(17), 7145–7150 (2007)CrossRef Xie, X., Mikkelsen, T.S., Gnirke, A., Lindblad-Toh, K., Kellis, M., Lander, E.S.: Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of ctcf insulator sites. Proc. Natl. Acad. Sci. 104(17), 7145–7150 (2007)CrossRef
31.
Zurück zum Zitat Yun, U., Ryu, K.H.: Approximate weighted frequent pattern mining with/without noisy environments. Knowl. Based Syst. 24(1), 73–82 (2011)CrossRef Yun, U., Ryu, K.H.: Approximate weighted frequent pattern mining with/without noisy environments. Knowl. Based Syst. 24(1), 73–82 (2011)CrossRef
Metadaten
Titel
ACME: A scalable parallel system for extracting frequent patterns from a very long sequence
verfasst von
Majed Sahli
Essam Mansour
Panos Kalnis
Publikationsdatum
01.12.2014
Verlag
Springer Berlin Heidelberg
Erschienen in
The VLDB Journal / Ausgabe 6/2014
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-014-0370-1

Weitere Artikel der Ausgabe 6/2014

The VLDB Journal 6/2014 Zur Ausgabe