nach oben

The VLDB Journal

Erschienen in:

01.12.2014 | Special Issue Paper

ACME: A scalable parallel system for extracting frequent patterns from a very long sequence

verfasst von: Majed Sahli, Essam Mansour, Panos Kalnis

Erschienen in: The VLDB Journal | Ausgabe 6/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Modern applications, including bioinformatics, time series, and web log analysis, require the extraction of frequent patterns, called motifs, from one very long (i.e., several gigabytes) sequence. Existing approaches are either heuristics that are error-prone, or exact (also called combinatorial) methods that are extremely slow, therefore, applicable only to very small sequences (i.e., in the order of megabytes). This paper presents ACME, a combinatorial approach that scales to gigabyte-long sequences and is the first to support supermaximal motifs. ACME is a versatile parallel system that can be deployed on desktop multi-core systems, or on thousands of CPUs in the cloud. However, merely using more compute nodes does not guarantee efficiency, because of the related overheads. To this end, ACME introduces an automatic tuning mechanism that suggests the appropriate number of CPUs to utilize, in order to meet the user constraints in terms of run time, while minimizing the financial cost of cloud resources. Our experiments show that, compared to the state of the art, ACME supports three orders of magnitude longer sequences (e.g., DNA for the entire human genome); handles large alphabets (e.g., English alphabet for Wikipedia); scales out to 16,384 CPUs on a supercomputer; and supports elastic deployment in the cloud.

Vorheriger Artikel SWORD: workload-aware data placement and replica selection for cloud data management systems

Nächster Artikel MyBenchmark: generating databases for query workloads

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Nur mit Berechtigung zugänglich

There exist several parallel approaches [4, 6, 7, 16] for the much simpler common motifs problem.

FAST stands for fine-grained adaptive sub-tasks.

For simplicity, the discussion assumes that each CPU executes a single process. In practice, our implementation executes one process per core.

The family of Gamma distributions contains many different shapes. To verify that our sample generates the desired leptokurtic, positively skewed nonsymmetric distribution, we check: \({4> \alpha > 1}\) and \({\beta < \alpha }\).

CAST stands for cache aware search space traversal.

ACME code and the used datasets are available online at: http://cloud.kaust.edu.sa/Pages/acme_software.aspx.

http://webhome.cs.uvic.ca/~thomo/HG18.fasta.tar.gz.

http://www.uniprot.org/uniprot/?query=&format=*.

http://en.wikipedia.org/wiki/Wikipedia:Database_download.

http://aws.amazon.com/ec2/pricing/.

http://www.dei.unipd.it/wdyn/?IDsezione=6376.

http://researcher.ibm.com/files/us-parida/varun.zip.

Apostolico, A., Comin, M., Parida, L.: VARUN: discovering extensible motifs under saturation constraints. IEEE/ACM Trans. Comput. Biol. Bioinform. 7(4), 752–762 (2010)CrossRef

Becher, V., Deymonnaz, A., Heiber, P.: Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome. Bioinformatics 25(14), 1746–53 (2009)CrossRef

Carvalho, A.M., Oliveira, A.L., Freitas, A.T., Sagot, M.F.: A parallel algorithm for the extraction of structured motifs. In: Proceedings of the ACM Symposium on Applied Computing (SAC), pp. 147–153 (2004)

Challa, S., Thulasiraman, P.: Protein sequence motif discovery on distributed supercomputer. In: Proceedings of the International Conference on Advances in Grid and Pervasive Computing (GPC), pp. 232–243 (2008)

Das, M.K., Dai, H.K.: A survey of DNA motif finding algorithms. BMC Bioinform. 8(S-7), S21 (2007)

Dasari, N.S., Desh, R., Zubair, M.: An efficient multicore implementation of planted motif problem. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS), pp. 9–15 (2010)

Dasari, N.S., Ranjan, D., Zubair, M.: High performance implementation of planted motif problem using suffix trees. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS), pp. 200–206 (2011)

Federico, M., Pisanti, N.: Suffix tree characterization of maximal motifs in biological sequences. Theor. Comput. Sci. 410(43), 4391–4401 (2009)CrossRefMATHMathSciNet

Floratou, A., Tata, S., Patel, J.M.: Efficient and accurate discovery of patterns in sequence data sets. IEEE Trans. Knowl. Data Eng. 23(8), 1154–1168 (2011)CrossRef

10.

Grossi, R., Pietracaprina, A., Pisanti, N., Pucci, G., Upfal, E., Vandin, F., Salzberg, S., Warnow, T.: MADMX: a novel strategy for maximal dense motif extraction. In: Proceedings of Workshop on Algorithms in Bioinformatics, pp. 362–374 (2009)

11.

Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge (1997)

12.

Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1–12 (2000)

13.

Huang, E., Yang, L., Chowdhary, R., Kassim, A., Bajic, V.B.: An algorithm for ab initio dna motif detection. Inf. Process. Living Syst. 2, 611–614 (2005)

14.

Huang, C.W., Lee, W.S., Hsieh, S.Y.: An improved heuristic algorithm for finding motif signals in DNA sequences. IEEE/ACM Trans. Comput. Biol. Bioinform. 8(4), 959–975 (2011)CrossRef

15.

Kleinrock, L.: Queueing Systems, vol. I: Theory. Wiley, New York (1975)MATH

16.

Liu, Y., Schmidt, B., Maskell, D.L.: An ultrafast scalable many-core motif discovery algorithm for multiple gpus. In: Proceedings of the International Symposium on Parallel and Distributed Processing, pp. 428–434 (2011)

17.

Mabroukeh, N.R., Ezeife, C.I.: A taxonomy of sequential pattern mining algorithms. ACM Comput. Surv. 43(1), 1–41 (2010)CrossRef

18.

Mansour, E., Allam, A., Skiadopoulos, S., Kalnis, P.: Era: efficient serial and parallel suffix tree construction for very long strings. Proc. VLDB Endow. 5(1), 49–60 (2011)CrossRef

19.

Marchand, B., Bajic, V.B., Kaushik, D.K.: Highly scalable ab initio genomic motif identification. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 56:1–56:10 (2011)

20.

Marsan, L., Sagot, M.F.: Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comput. Biol. 7(3–4), 345–362 (2000)CrossRef

21.

Meisner, D., Wenisch, T.F.: Stochastic queuing simulation for data center workloads. In: Exascale Evaluation and Research Techniques Workshop (2010)

22.

Mueen, A., Keogh, E.: Online discovery and maintenance of time series motifs. In: Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 1089–1098 (2010)

23.

Papoulis, A., Pillai, S.U.: Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York (2002)

24.

Sagot, M.F.: Spelling approximate repeated or common motifs using a suffix tree. In: Proceedings of 3rd Latin American Symposium on Theoretical Informatics, pp. 374–390 (1998)

25.

Sahli, M., Mansour, E., Kalnis, P.: Parallel motif extraction from very long sequences. In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM) (2013)

26.

Saxena, K., Shukla, R.: Significant interval and frequent pattern discovery in web log data. Int. J. Comput. Sci. Issues 7(1(3)), 29–36 (2010)

27.

Schad, J., Dittrich, J., Quiané-Ruiz, J.A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proc. VLDB Endow. 3(1–2), 460–471 (2010)CrossRef

28.

Tsirogiannis, D., Koudas, N.: Suffix tree construction algorithms on modern hardware. In: Proceedings of the International Conference on Extending Database Technology (EDBT), pp. 263–274 (2010)

29.

Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)CrossRefMATHMathSciNet

30.

Xie, X., Mikkelsen, T.S., Gnirke, A., Lindblad-Toh, K., Kellis, M., Lander, E.S.: Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of ctcf insulator sites. Proc. Natl. Acad. Sci. 104(17), 7145–7150 (2007)CrossRef

31.

Yun, U., Ryu, K.H.: Approximate weighted frequent pattern mining with/without noisy environments. Knowl. Based Syst. 24(1), 73–82 (2011)CrossRef

Titel: ACME: A scalable parallel system for extracting frequent patterns from a very long sequence
verfasst von: Majed Sahli
Essam Mansour
Panos Kalnis
Publikationsdatum: 01.12.2014
Verlag: Springer Berlin Heidelberg
Erschienen in: The VLDB Journal / Ausgabe 6/2014
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-014-0370-1

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 6/2014

Real-time continuous intersection joins over large sets of moving objects using graphic processing units

Special section on data-intensive cloud infrastructure

Maximum error-bounded Piecewise Linear Representation for online stream approximation

The Stratosphere platform for big data analytics

SWORD: workload-aware data placement and replica selection for cloud data management systems

Distributed snapshot isolation: global transactions pay globally, local transactions pay locally