2024 | OriginalPaper | Chapter

Weighted Chaos Game Representation for Molecular Sequence Classification

Authors: Taslim Murad, Sarwan Ali, Murray Patterson

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore

Abstract

Molecular sequence analysis is a crucial task in bioinformatics, with applications in drug discovery and disease diagnosis. Traditional methods for molecular sequence classification rely on sequence alignment, which can be computationally expensive and can lack accuracy. Alignment-free methods exist, but they usually do not take full advantage of deep learning (DL) models, because DL models have traditionally underperformed on tabular data relative to image data. To address this, we propose a Chaos Game Representation (CGR)-based approach to molecular sequence classification. We use the k-mer-based frequency chaos game representation (FCGR) to generate 2D images of molecular sequences. In addition, we assign weights to the k-mers using scaling features computed over the sliding windows: the Kyte and Doolittle (KD) hydropathy scale, the Eisenberg hydrophobicity scale, the hydrophilicity scale, the flexibility of the characters, and the hydropathy scale. Weighting the k-mers is motivated by two observations: different k-mers may differ in their importance or relevance to the classification task at hand, and additional information such as hydropathy scales could improve the accuracy of classification models. By combining multiple features, we aim to improve that accuracy. The proposed method shows promising results, outperforming the baseline methods, and provides a new direction for analyzing sequences using image classification techniques.
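
To make the idea of a weighted FCGR image concrete, the following Python sketch may help. It is not the authors' implementation: the abstract does not say how vertices are laid out, how the k-mer weight is aggregated, or what image size is used. The sketch assumes a protein alphabet placed on a regular 20-gon, a contraction ratio of 0.5, and a k-mer weight equal to the mean Kyte and Doolittle value of its residues. The names `kmer_point` and `weighted_fcgr` are hypothetical.

```python
import math

# Kyte and Doolittle (1982) hydropathy values per amino acid.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: amino acids sit on the vertices of a regular 20-gon
# inscribed in the unit circle, in alphabetical order.
AMINO = sorted(KD)
VERTS = {
    a: (math.cos(2 * math.pi * i / len(AMINO)),
        math.sin(2 * math.pi * i / len(AMINO)))
    for i, a in enumerate(AMINO)
}


def kmer_point(kmer, ratio=0.5):
    """Play the chaos game for one k-mer, starting at the centre.

    Each character moves the point a fraction `ratio` of the way
    towards that character's vertex (ratio is an assumed parameter).
    """
    x, y = 0.0, 0.0
    for ch in kmer:
        vx, vy = VERTS[ch]
        x += ratio * (vx - x)
        y += ratio * (vy - y)
    return x, y


def weighted_fcgr(seq, k=3, res=32, ratio=0.5):
    """Return a res x res grid where each k-mer adds its weight to one cell.

    Assumption: the weight of a k-mer is the mean KD value of its residues.
    Replacing `w` by 1.0 would give a plain (unweighted) frequency image.
    """
    img = [[0.0] * res for _ in range(res)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(ch not in KD for ch in kmer):
            continue  # skip windows containing unknown characters
        x, y = kmer_point(kmer, ratio)
        # Map coordinates from [-1, 1] to pixel indices in [0, res - 1].
        col = min(int((x + 1) / 2 * res), res - 1)
        row = min(int((y + 1) / 2 * res), res - 1)
        w = sum(KD[c] for c in kmer) / k
        img[row][col] += w
    return img
```

Calling `weighted_fcgr("MFVFLVLLPLVSSQ", k=3)` would yield a 32 x 32 grid that could then be passed to an image classifier. Other scales named in the abstract could be swapped in for `KD` under the same assumptions.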

Metadata
Title
Weighted Chaos Game Representation for Molecular Sequence Classification
Authors
Taslim Murad
Sarwan Ali
Murray Patterson
Copyright Year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2238-9_18
