2018 | Original Paper | Book Chapter

Efficient 3D Protein Structure Alignment on Large Hadoop Clusters in Microsoft Azure Cloud

Written by: Bożena Małysiak-Mrozek, Paweł Daniłowicz, Dariusz Mrozek

Published in: Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety

Publisher: Springer International Publishing

Abstract

Exploring 3D protein structures has broad potential applications in medical diagnostics, drug design, and the treatment of patients. 3D protein structure similarity searching is one of the important exploration processes performed in structural bioinformatics. However, the process is time-consuming and requires substantial computational resources when performed against large repositories. In this paper, we show that 3D protein structure similarity searching can be significantly accelerated by using modern processing techniques and computer architectures. Our experiments show that by distributing computations on large Hadoop/HBase (HDInsight) clusters and scaling them out and up in the Microsoft Azure public cloud, we can reduce the execution times of similarity search processes from hundreds of hours to minutes. We also show that using public clouds for scientific computations is very beneficial and can be successfully applied when scaling time-consuming computations over large volumes of biological data.
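To illustrate why a one-versus-many similarity search can benefit from scaling out, the following is a minimal sketch and not the chapter's implementation. It assumes that each database structure can be scored against the query independently, so the repository can be split into chunks, each chunk scored by a separate task, and the partial results merged. All names (`rmsd`, `split`, `map_task`, `search`) are hypothetical, the score is a toy RMSD without superposition standing in for a real structure alignment algorithm, and the tasks run sequentially here rather than on a Hadoop cluster.

```python
import math
from heapq import nsmallest

def rmsd(a, b):
    """Toy stand-in score (assumption): RMSD over paired, non-empty
    coordinate lists, without any superposition or alignment step."""
    n = min(len(a), len(b))
    sq = sum((u - v) ** 2 for p, q in zip(a[:n], b[:n]) for u, v in zip(p, q))
    return math.sqrt(sq / n)

def split(items, parts):
    """Deal items round-robin into `parts` chunks, one per hypothetical worker."""
    return [items[i::parts] for i in range(parts)]

def map_task(query, chunk):
    """Score the query against every structure in one chunk."""
    return [(rmsd(query, coords), name) for name, coords in chunk]

def search(query, database, workers=4, top=3):
    """Run each task (sequentially in this sketch) and merge the best hits."""
    partial = [map_task(query, chunk) for chunk in split(database, workers)]
    return nsmallest(top, (hit for part in partial for hit in part))

# Made-up coordinates for illustration only.
query = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
database = [
    ("C", [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0), (2.0, 3.0, 0.0)]),
    ("A", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]),
    ("B", [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]),
]
hits = search(query, database, workers=2, top=2)
print(hits)  # [(0.0, 'A'), (1.0, 'B')]
```

Because the tasks share no state, the merged ranking does not depend on how many chunks the database is split into; only the wall-clock time would change when the chunks are processed on separate nodes.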

Metadata
Title
Efficient 3D Protein Structure Alignment on Large Hadoop Clusters in Microsoft Azure Cloud
Written by
Bożena Małysiak-Mrozek
Paweł Daniłowicz
Dariusz Mrozek
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-99987-6_3