
2023 | Original Paper | Book chapter

Efficient Sequence Embedding for SARS-CoV-2 Variants Classification

Written by: Sarwan Ali, Usama Sardar, Imdad Ullah Khan, Murray Patterson

Published in: Bioinformatics Research and Applications

Publisher: Springer Nature Singapore


Abstract

Kernel-based methods, such as Support Vector Machines (SVM), have demonstrated their utility in various machine learning (ML) tasks, including sequence classification. However, these methods face two primary challenges: (i) the computational complexity of kernel computation, which requires exponential time for the dot product calculation, and (ii) the scalability issue of storing the large \(n \times n\) kernel matrix in memory when the number of data points \(n\) becomes too large. Although approximate methods can address the computational complexity problem, scalability remains a concern for conventional kernel methods. This paper presents a novel and efficient embedding method that overcomes both the computational and scalability challenges inherent in kernel methods. To address the computational challenge, our approach extracts the k-mers/nGrams (consecutive character substrings) from a given biological sequence, computes a sketch of the sequence, and performs dot product calculations on the sketch. Because it avoids computing the entire spectrum (frequency count) and operates on low-dimensional vectors (sketches) instead of the memory-intensive \(n \times n\) matrix or the full-length spectrum, our method scales readily to a large number of sequences, which resolves the scalability problem. Furthermore, conventional kernel methods often rely on a limited set of algorithms (e.g., kernel SVM) for the underlying ML task. In contrast, our proposed fast and alignment-free spectrum method can serve as input to various distance-based (e.g., k-nearest neighbors) and non-distance-based (e.g., decision tree) ML methods used in classification and clustering tasks. Using our method on real biological sequences, namely coronavirus spike (peplomer) sequences rather than full genomes, we achieve superior prediction. Moreover, our proposed method outperforms several state-of-the-art embedding and kernel methods in terms of both predictive performance and computational runtime.
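
The pipeline described in the abstract (extract k-mers, compute a low-dimensional sketch, take dot products on the sketch) can be illustrated with a minimal Python sketch. The abstract does not specify how the sketch is constructed, so the code below assumes a simple hashed count vector as a stand-in; the function names `kmers`, `bucket`, `sketch`, and `dot`, the MD5-based hash, and the defaults `k=3` and `m=64` are all hypothetical choices for illustration, not the authors' method.

```python
import hashlib
from typing import List


def kmers(seq: str, k: int) -> List[str]:
    """Return all consecutive length-k substrings (k-mers) of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bucket(kmer: str, m: int) -> int:
    """Map a k-mer to one of m buckets.

    Assumption: a stable MD5-based hash stands in for whatever
    hash family the actual method uses.
    """
    digest = hashlib.md5(kmer.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def sketch(seq: str, k: int = 3, m: int = 64) -> List[int]:
    """Build an m-dimensional hashed count vector of the k-mers in seq.

    The full spectrum over an alphabet of size s has s**k entries;
    this vector has only m, at the cost of hash collisions.
    """
    vec = [0] * m
    for kmer in kmers(seq, k):
        vec[bucket(kmer, m)] += 1
    return vec


def dot(u: List[int], v: List[int]) -> int:
    """Dot product of two sketches, used as an approximate similarity."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    # Toy amino-acid strings, invented for this example.
    a = sketch("MFVFLVLLPLVSSQ")
    b = sketch("MFVFLVLLPLVSIQ")
    print(dot(a, a), dot(a, b))
```

Each sequence is stored as one fixed-length vector, so the vectors can be passed to any classifier or clustering method that accepts tabular input, and no \(n \times n\) matrix is materialized. As a rough illustration of the memory argument, with 8-byte entries and \(n = 100{,}000\) sequences, an \(n \times n\) matrix needs \(8 \times 10^{10}\) bytes (80 GB), whereas \(n\) sketches of length 1,024 need about 0.82 GB.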


Metadata
Title
Efficient Sequence Embedding for SARS-CoV-2 Variants Classification
Written by
Sarwan Ali
Usama Sardar
Imdad Ullah Khan
Murray Patterson
Copyright year
2023
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-7074-2_2
