2023 | Original Paper | Book Chapter

PDB2Vec: Using 3D Structural Information for Improved Protein Analysis

Authors: Sarwan Ali, Prakash Chourasia, Murray Patterson

Published in: Bioinformatics Research and Applications

Publisher: Springer Nature Singapore

Abstract

In recent years, machine learning methods have shown remarkable results in various protein analysis tasks, including protein classification, folding prediction, and protein-protein interaction prediction. However, most studies use either 3D structures or sequences alone for the downstream classification task, so the combination of both remains comparatively unexplored. This study investigates how incorporating both protein sequence and 3D structure information influences protein classification performance, using two well-known datasets, STCRDAB and PDB Bind. To this end, we propose an embedding method called PDB2Vec that encodes the 3D structure, and we combine it with protein sequence data to improve the predictive performance of the downstream classification task. We performed protein classification in three experimental settings: (1) the 3D structural embedding only (PDB2Vec); (2) sequence embeddings from alignment-free methods in the biology domain, based on k-mers, position weight matrices, minimizers, and spaced k-mers; and (3) the combination of structural and sequence-based embeddings. Our experiments show that the combination of three-dimensional structural information and amino acid sequence information leads to the best classification performance, indicating that the two types of information are complementary and that both are essential for classification tasks.
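To make the third experimental setting more concrete, the following is a minimal sketch of one alignment-free sequence embedding (a k-mer count vector) concatenated with a structural embedding. It is not the authors' implementation: the abstract does not specify how PDB2Vec is computed or how the embeddings are combined, so the function names `kmer_spectrum` and `combine`, the 20-letter alphabet, and concatenation as the combination step are all assumptions made for illustration.

```python
from itertools import product

# Assumption: the 20 standard amino acids form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_spectrum(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return a fixed-length k-mer count vector for a protein sequence.

    The vector has len(alphabet) ** k entries, one per possible k-mer,
    ordered lexicographically with respect to `alphabet`.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    # Slide a window of length k over the sequence and count each k-mer.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in index:  # skip windows containing non-standard residues
            vector[index[kmer]] += 1
    return vector


def combine(structural_vector, sequence_vector):
    """Concatenate a structural embedding with a sequence embedding.

    Hypothetical combination step: plain concatenation is assumed here.
    """
    return list(structural_vector) + list(sequence_vector)
```

For example, `kmer_spectrum("MKVLAAGMKV", k=2)` yields a 400-dimensional vector, and `combine(structural_vector, sequence_vector)` would append it to whatever structural vector is available; the resulting feature vector could then be passed to any standard classifier.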

Metadata
Title
PDB2Vec: Using 3D Structural Information for Improved Protein Analysis
Authors
Sarwan Ali
Prakash Chourasia
Murray Patterson
Copyright Year
2023
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-7074-2_29