
2023 | OriginalPaper | Book chapter

Unveiling the Robustness of Machine Learning Models in Classifying COVID-19 Spike Sequences

Authors: Sarwan Ali, Pin-Yu Chen, Murray Patterson

Published in: Bioinformatics Research and Applications

Publisher: Springer Nature Singapore

Abstract

The global COVID-19 pandemic has made a wealth of data available to researchers, presenting a unique opportunity to investigate the behavior of the virus. This research aims to facilitate the design of efficient vaccines and proactive measures to prevent future pandemics by using machine learning (ML) models in decision-making processes. Ensuring the reliability of ML predictions in these critical and rapidly evolving scenarios is therefore of utmost importance. Studies of the genomic sequences of individuals infected with the coronavirus have shown that the majority of variations occur within a specific region known as the spike (or S) protein. Previous research has analyzed spike proteins using various ML techniques, including classification and clustering of variants. However, spike protein sequences may contain errors, which could lead to misleading outcomes and misguide decision-making authorities. Hence, a comprehensive examination of the robustness of ML and deep learning models in classifying spike sequences is essential. In this paper, we propose a framework for evaluating and benchmarking the robustness of diverse ML methods in spike sequence classification. Through extensive evaluation of a wide range of ML algorithms, from classical methods such as naive Bayes and logistic regression to deep neural networks, we show that using k-mers to create the feature vector representation of spike proteins is more effective than traditional one-hot-encoding-based embedding methods. Our findings also indicate that deep neural networks exhibit superior accuracy and robustness compared to non-deep-learning baselines.
To the best of our knowledge, this study is the first to benchmark the accuracy and robustness of machine-learning classification models against various types of random corruptions in COVID-19 spike protein sequences. The benchmarking framework established in this research holds the potential to assist future researchers in gaining a deeper understanding of the behavior of the coronavirus, enabling the implementation of proactive measures and the prevention of similar pandemics in the future.
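
The abstract contrasts k-mer feature vectors with one-hot encoding and evaluates models under random corruptions of the input sequences. The following is a minimal sketch of those three ideas, not the chapter's actual implementation: the function names, the choice of k = 3, the substitution-only corruption model, the 10% corruption rate, and the short input fragment are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product

# The 20 standard amino acids; real spike data may also contain
# ambiguity codes such as "X", which this sketch simply ignores.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(seq, k=3):
    """Count every length-k substring of seq over a fixed k-mer vocabulary."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(kmer, 0) for kmer in vocab]


def one_hot_vector(seq):
    """Concatenate one 20-dimensional indicator block per sequence position."""
    vec = []
    for aa in seq:
        vec.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return vec


def corrupt(seq, fraction, rng):
    """Replace a random fraction of positions with a different amino acid.

    Assumed corruption model (random substitutions only); the chapter
    evaluates "various types" of corruption that are not specified here.
    """
    chars = list(seq)
    n_errors = int(len(chars) * fraction)
    for pos in rng.sample(range(len(chars)), n_errors):
        chars[pos] = rng.choice([a for a in AMINO_ACIDS if a != chars[pos]])
    return "".join(chars)


rng = random.Random(0)
toy_seq = "MFVFLVLLPLVSSQCVNLTT"  # short illustrative fragment, not a full spike sequence
noisy_seq = corrupt(toy_seq, fraction=0.1, rng=rng)

clean_kmers = kmer_vector(toy_seq, k=3)
noisy_kmers = kmer_vector(noisy_seq, k=3)
```

In a robustness benchmark of this kind, a classifier would typically be trained on vectors from clean sequences and then scored on vectors from corrupted ones. Note that the k-mer vector has a fixed length (20^k) regardless of sequence length, whereas the one-hot vector grows with the sequence and therefore requires equal-length or aligned inputs.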

Metadata
Title
Unveiling the Robustness of Machine Learning Models in Classifying COVID-19 Spike Sequences
Authors
Sarwan Ali
Pin-Yu Chen
Murray Patterson
Copyright year
2023
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-7074-2_1