2018 | Original Paper | Book Chapter

The Use of Distributed Data Storage and Processing Systems in Bioinformatic Data Analysis

Written by: Michał Bochenek, Kamil Folkert, Roman Jaksik, Michał Krzesiak, Marcin Michalak, Marek Sikora, Tomasz Stȩclik, Łukasz Wróbel

Published in: Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety

Publisher: Springer International Publishing

Abstract

Cancer and cancer mortality may seem a sign of the present times. This has led hundreds of scientists to search for significant preconditions of cancer occurrence. This paper defines a set of data mining tasks that link observed gene mutations with the observation of specific cancer types. Because of the high computational complexity of processing this kind of data, a Hadoop ecosystem cluster was developed to perform the required calculations. The results may be satisfactory in two domains: distributed data storage and processing, and the interpretation of gene mutation occurrence.

Metadata
Title
The Use of Distributed Data Storage and Processing Systems in Bioinformatic Data Analysis
Written by
Michał Bochenek
Kamil Folkert
Roman Jaksik
Michał Krzesiak
Marcin Michalak
Marek Sikora
Tomasz Stȩclik
Łukasz Wróbel
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-99987-6_2