2023 | OriginalPaper | Book chapter

Exploring Machine Learning Algorithms and Protein Language Models Strategies to Develop Enzyme Classification Systems

Written by: Diego Fernández, Álvaro Olivera-Nappa, Roberto Uribe-Paredes, David Medina-Ortiz

Published in: Bioinformatics and Biomedical Engineering

Publisher: Springer Nature Switzerland

Abstract

Assigning functions to uncharacterized enzymes is one of the most common tasks in bioinformatics. Functional annotation methods based on phylogenetic properties have been the gold standard in genome annotation. However, these methods succeed only when the minimum requirements for establishing similarity or homology are met. As an alternative, machine learning and deep learning methods have proven helpful for this problem and have produced functional classification systems for various bioinformatics tasks. Nevertheless, a clear strategy for building predictive models and for representing amino acid sequences numerically is still lacking. In this work, we address the functional classification of enzyme sequences (EC number) via machine learning methods, exploring several alternatives for training predictive models and for numerical representation. The results show that the best performance is achieved with representations based on pre-trained models, although a clear strategy for training the models is still needed. Among the alternatives explored, the CNN-based architectures proposed in this work learn and extract patterns in complex systems more readily, achieving performance above 97% with binary cross-entropy losses below 0.05. Finally, we discuss the strategies explored and outline future work on integrated methods for functional classification and the discovery of new enzymes to support current bioinformatics tools.
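
The abstract reports model quality as binary cross-entropy, so a small worked example may help in reading that figure. The sketch below is not the authors' code: the target labels, the predicted probabilities, and the choice of the seven top-level EC classes as output units are all hypothetical and chosen only for illustration.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over every (sample, class) entry."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) can never occur.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_entry = -(y_true * np.log(y_prob)
                  + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(per_entry.mean())


# Hypothetical data: three sequences, one-hot targets over the seven
# top-level EC classes (EC 1 to EC 7).
true_classes = np.array([0, 2, 5])
y_true = np.eye(7)[true_classes]

# Hypothetical confident predictions: 0.97 on the true class and
# 0.005 on each of the six other classes.
y_prob = np.full((3, 7), 0.005)
y_prob[np.arange(3), true_classes] = 0.97

loss = binary_cross_entropy(y_true, y_prob)
accuracy = float((y_prob.argmax(axis=1) == true_classes).mean())
print(f"binary cross-entropy: {loss:.4f}, accuracy: {accuracy:.2f}")
```

Because the loss is averaged over every class output, a model that is confidently right on nearly all entries yields a value well below 0.05, whereas an uninformative prediction of 0.5 everywhere gives ln 2, roughly 0.693.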

Metadata
Title
Exploring Machine Learning Algorithms and Protein Language Models Strategies to Develop Enzyme Classification Systems
Written by
Diego Fernández
Álvaro Olivera-Nappa
Roberto Uribe-Paredes
David Medina-Ortiz
Copyright year
2023
DOI
https://doi.org/10.1007/978-3-031-34953-9_24