A Comparison of Apache Spark Supervised Machine Learning Algorithms for DNA Splicing Site Prediction

Morfino, Valerio; Rampone, Salvatore; Weitschek, Emanuel

doi:10.1007/978-981-13-8950-4_13

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 151)

Abstract

Next-generation sequencing techniques have made a very large amount of genomic data available, and biomedical databases have grown rapidly in recent years as a result. Analyzing these data with bioinformatics and big data techniques could lead to new knowledge for the treatment of serious diseases. In this work, we address the problem of predicting splicing sites in DNA sequences by using the supervised machine learning algorithms included in MLlib, the machine learning library of Apache Spark, a fast and general engine for big data processing. We describe the implementation details and report the performance of these algorithms on two publicly available datasets in both local and cloud environments, emphasizing the scalability and elasticity of the cloud environment. We compare the algorithms with U-BRAIN, a general-purpose learning algorithm originally designed for the prediction of DNA splicing sites. The results show that all of the Spark algorithms achieve good prediction accuracy (>0.9), comparable to that of U-BRAIN, with much lower execution time. We therefore conclude that Apache Spark machine learning algorithms are promising candidates for the DNA splicing site prediction problem.
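
As a rough illustration of the kind of preprocessing and evaluation such a comparison involves, the sketch below turns a DNA window into a numeric feature vector and computes prediction accuracy. It is a minimal sketch under stated assumptions: the abstract does not say how sequences are encoded, so the one-hot scheme, the function names `one_hot_encode` and `accuracy`, and the sample window are all hypothetical and are not taken from the chapter.

```python
# Hypothetical sketch: the encoding and all names here are assumptions,
# not the chapter's actual implementation.
from typing import List, Sequence

NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence: str) -> List[float]:
    """Encode each base as a 4-element indicator (A, C, G, T).

    Bases outside ACGT (for example N) become four zeros.
    """
    features: List[float] = []
    for base in sequence.upper():
        features.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return features


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


if __name__ == "__main__":
    window = "AGGTAAGT"  # made-up 8-base window, for illustration only
    vector = one_hot_encode(window)
    print(len(vector))  # 4 features per base
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Vectors like these could then be wrapped with `pyspark.ml.linalg.Vectors.dense` and passed to an MLlib classifier, although whether the authors proceed this way is not stated in the abstract.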

Author information

Authors and Affiliations

Department of Engineering, Uninettuno University, Rome, Italy
Valerio Morfino & Emanuel Weitschek
Department of Law, Economics, Management and Quantitative Methods (DEMM), Università degli Studi del Sannio, Benevento, Italy
Salvatore Rampone

Corresponding author

Correspondence to Valerio Morfino.

Editor information

Editors and Affiliations

Department of Psychology, University of Campania Luigi Vanvitelli, Caserta, Italy
Anna Esposito
Tecnocampus, Mataró, Spain
Marcos Faundez-Zanuy
Department of Civil, Environment, Energy and Materials Engineering, Mediterranea University of Reggio Calabria, Reggio Calabria, Italy
Francesco Carlo Morabito
Dipartimento di Elettronica e Telecomunicazioni, Politecnico di Torino, Turin, Italy
Eros Pasero

About this chapter

Cite this chapter

Morfino, V., Rampone, S., Weitschek, E. (2020). A Comparison of Apache Spark Supervised Machine Learning Algorithms for DNA Splicing Site Prediction. In: Esposito, A., Faundez-Zanuy, M., Morabito, F., Pasero, E. (eds) Neural Approaches to Dynamics of Signal Exchanges. Smart Innovation, Systems and Technologies, vol 151. Springer, Singapore. https://doi.org/10.1007/978-981-13-8950-4_13

DOI: https://doi.org/10.1007/978-981-13-8950-4_13
Published: 19 September 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-8949-8
Online ISBN: 978-981-13-8950-4
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
