2023 | Original Paper | Book Chapter

Efficient Sequence Embedding for SARS-CoV-2 Variants Classification

Authors: Sarwan Ali, Usama Sardar, Imdad Ullah Khan, Murray Patterson

Published in: Bioinformatics Research and Applications

Publisher: Springer Nature Singapore

Abstract

Kernel-based methods, such as Support Vector Machines (SVM), have demonstrated their utility in various machine learning (ML) tasks, including sequence classification. However, these methods face two primary challenges: (i) the computational complexity of kernel computation, which requires exponential time for the dot product calculation, and (ii) the scalability issue of storing the large \(n \times n\) kernel matrix in memory when the number of data points \(n\) becomes too large. Although approximate methods can address the computational complexity problem, scalability remains a concern for conventional kernel methods. This paper presents a novel and efficient embedding method that overcomes both the computational and the scalability challenges inherent in kernel methods. To address the computational challenge, our approach extracts the k-mers/n-grams (consecutive character substrings) from a given biological sequence, computes a sketch of the sequence, and performs dot product calculations on the sketch. Because the method avoids computing the entire spectrum (frequency count) and operates on low-dimensional vectors (sketches) instead of the memory-intensive \(n \times n\) matrix or the full-length spectrum, it scales readily to a large number of sequences, which resolves the scalability problem. Furthermore, conventional kernel methods often rely on a limited set of algorithms (e.g., kernel SVM) for the underlying ML tasks. In contrast, the embeddings produced by our fast, alignment-free spectrum method can serve as input to various distance-based (e.g., k-nearest neighbors) and non-distance-based (e.g., decision tree) ML methods for classification and clustering. Applied to real coronavirus spike (peplomer) sequences rather than full genomes, our method achieves superior prediction performance. Moreover, it outperforms several state-of-the-art embedding and kernel methods in terms of both predictive performance and computational runtime.
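The pipeline described in the abstract (extract k-mers, compute a fixed-size sketch, take dot products on the sketches) can be illustrated with a minimal Python sketch. The abstract does not specify the sketch construction, so the hashed k-mer count vector below is an assumption chosen for illustration and is not the authors' method. The function names `kmers`, `sketch`, and `dot`, the parameters `k` and `dim`, and the use of SHA-256 as the hash are all hypothetical.

```python
import hashlib
from typing import List


def kmers(seq: str, k: int = 3) -> List[str]:
    """Return all overlapping substrings of length k (empty if len(seq) < k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sketch(seq: str, k: int = 3, dim: int = 64) -> List[float]:
    """Hash each k-mer into one of `dim` buckets and count occurrences.

    Assumption: a simple hashed count vector stands in for the paper's
    sketch. The full spectrum over an alphabet of size s would need s**k
    entries; this vector always has `dim` entries.
    """
    vec = [0.0] * dim
    for kmer in kmers(seq, k):
        digest = hashlib.sha256(kmer.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] += 1.0
    return vec


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equal-length sketches (an approximate similarity)."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    # Made-up toy sequences, not real spike data.
    a = sketch("MFVFLVLLPLVS", k=3, dim=16)
    b = sketch("MFVFLVLLPLVA", k=3, dim=16)
    print(len(a), dot(a, b))
```

Because every sequence maps to a vector of the same length `dim`, the resulting vectors can be fed directly to classifiers such as k-nearest neighbors or decision trees, and no \(n \times n\) kernel matrix has to be stored. Hash collisions make the dot product an approximation of the spectrum-kernel value, with accuracy depending on `dim`.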



Title: Efficient Sequence Embedding for SARS-CoV-2 Variants Classification
Authors: Sarwan Ali, Usama Sardar, Imdad Ullah Khan, Murray Patterson
Publisher: Springer Nature Singapore
Book: Bioinformatics Research and Applications
Print ISBN: 978-981-9970-73-5
Electronic ISBN: 978-981-9970-74-2
Copyright year: 2023
DOI: https://doi.org/10.1007/978-981-99-7074-2_2
