2023 | OriginalPaper | Book chapter

Enhancing t-SNE Performance for Biological Sequencing Data Through Kernel Selection

Written by: Prakash Chourasia, Taslim Murad, Sarwan Ali, Murray Patterson

Published in: Bioinformatics Research and Applications

Publisher: Springer Nature Singapore

Abstract

Biological sequencing data encode the genetic information for many different proteins and offer vital insight into the genetic evolution of viruses. Although machine learning approaches are increasingly popular in many “Big Data” settings, they have made little progress in capturing the nature of such data. One relevant method is t-distributed Stochastic Neighbour Embedding (t-SNE), a general-purpose approach for representing high-dimensional data in a low-dimensional (LD) space while preserving the similarity between data points. Traditionally, t-SNE is used with the Gaussian kernel. However, because the Gaussian kernel is not data-dependent, each local bandwidth is determined from a single local point. This makes it computationally expensive and therefore limits its scalability. Moreover, it can misrepresent some structures in the data. An alternative is the isolation kernel, a data-dependent method with a single parameter to tune when computing the kernel. Although the isolation kernel performs better in terms of scalability and similarity preservation in the LD space, it may still not perform optimally in some cases. This paper presents a perspective on improving the performance of t-SNE and argues that kernel selection can affect this performance. We use nine different kernels to evaluate their impact on the performance of t-SNE, using SARS-CoV-2 “spike” protein sequences. With three different embedding methods, we show that the cosine similarity kernel gives the best results and enhances the performance of t-SNE.
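To make the idea of kernel selection concrete, the following is a minimal sketch, not the authors' pipeline, of how a cosine similarity kernel could replace the Gaussian kernel when building the pairwise affinities that t-SNE tries to preserve. Several things here are assumptions for illustration: the k-mer count embedding stands in for the (unnamed) embedding methods, the amino-acid fragments are invented toy data, and the function names `kmer_counts`, `cosine_kernel`, and `joint_affinities` are hypothetical. The perplexity-based bandwidth search and the gradient descent that produces the LD coordinates are omitted.

```python
import math
from collections import Counter


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence (an assumed, simple embedding)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_kernel(a, b):
    """Cosine similarity between two sparse count vectors.

    Count vectors are non-negative, so the result lies in [0, 1].
    """
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def joint_affinities(vectors):
    """Turn pairwise kernel values into a t-SNE-style joint distribution P.

    Each row is normalised to conditional probabilities p(j|i) with
    p(i|i) = 0, then symmetrised as (p(j|i) + p(i|j)) / (2n).
    """
    n = len(vectors)
    cond = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row = [
            cosine_kernel(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)
        ]
        total = sum(row)
        if total > 0.0:
            cond[i] = [v / total for v in row]
    return [
        [(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
        for i in range(n)
    ]


# Toy amino-acid fragments, invented for illustration only.
sequences = [
    "MFVFLVLLPLVSSQ",
    "MFVFLVLLPLVSSQ",
    "MFVFFVLLPLVSSQ",
    "MFVNLTTRTQLPPA",
]
vectors = [kmer_counts(s) for s in sequences]
P = joint_affinities(vectors)
```

In this sketch the two identical fragments receive the largest joint affinity, while the fragment that shares only one 3-mer with the others receives much smaller values. Swapping `cosine_kernel` for another kernel function changes `P` and therefore the layout that a t-SNE optimiser would produce from it.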

Footnotes
1
The SARS-CoV-2 virus is the cause of the global COVID-19 pandemic.
 
Metadata
Title
Enhancing t-SNE Performance for Biological Sequencing Data Through Kernel Selection
Written by
Prakash Chourasia
Taslim Murad
Sarwan Ali
Murray Patterson
Copyright year
2023
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-7074-2_35
