
International Journal of Data Science and Analytics


03-01-2023 | Regular Paper

Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis

Authors: Preeti Jha, Aruna Tiwari, Neha Bharill, Milind Ratnaparkhe, Om Prakash Patel, Nilagiri Harshith, Mukkamalla Mounika, Neha Nagendra

Published in: International Journal of Data Science and Analytics | Issue 4/2023


Abstract

Genome sequencing projects are rapidly contributing to the rise of high-dimensional protein sequence datasets. Extracting features from a high-dimensional protein sequence dataset poses many challenges. Although many feature extraction methods exist, applying them to millions of protein sequences becomes impractical because these approaches are not scalable. Therefore, to extract significant features efficiently and at scale, we propose two Apache Spark-based scalable feature extraction approaches that extract significantly important features, based on statistical properties, from huge protein sequence datasets. They are termed 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). The proposed 60d-SPF and 6d-SCPSF approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence in terms of 60-dimensional and 6-dimensional features, respectively. The preprocessed protein sequences are used as input to four clustering algorithms: scalable random sampling with iterative optimization fuzzy c-means (SRSIO-FCM), scalable literal fuzzy c-means (SLFCM), kernelized SRSIO-FCM (KSRSIO-FCM), and kernelized SLFCM (KSLFCM). We have conducted extensive experiments on various soybean protein datasets to demonstrate the effectiveness of the proposed feature extraction methods, 60d-SPF and 6d-SCPSF, and of existing feature extraction methods with the SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms. The reported results in terms of the Silhouette index and the Davies–Bouldin index show that the proposed 60d-SPF extraction method, used with the SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms, achieves significantly better results than the proposed 6d-SCPSF and existing feature extraction approaches.
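To make the idea of a fixed-length numeric feature vector concrete, the following is a minimal sketch in plain Python. It is not the 60d-SPF or 6d-SCPSF method, whose definitions are not given in the abstract. It assumes a much simpler statistic, the relative frequency of each of the 20 standard amino acids, and the names `AMINO_ACIDS` and `composition_vector` are hypothetical.

```python
from collections import Counter

# Assumption: the 20 standard amino acids in one-letter code.
# This ordering fixes the meaning of each vector position.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence: str) -> list[float]:
    """Map one protein sequence to a 20-dimensional relative-frequency vector.

    Illustrative only: a plain amino-acid composition, not the paper's features.
    Characters outside the 20 standard residues (e.g. 'X') are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully non-standard sequence: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Each sequence is processed independently of the others.
sequences = ["MKVLA", "GGGAAX"]
features = [composition_vector(s) for s in sequences]
```

Because each sequence is converted independently, a per-record function of this kind could in principle be applied with a distributed map step in a framework such as Apache Spark. Sequences of any length yield vectors of the same dimension, which is what allows them to be passed to a clustering algorithm.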



Title: Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis
Authors: Preeti Jha, Aruna Tiwari, Neha Bharill, Milind Ratnaparkhe, Om Prakash Patel, Nilagiri Harshith, Mukkamalla Mounika, Neha Nagendra
Publication date: 03-01-2023
Publisher: Springer International Publishing
Published in: International Journal of Data Science and Analytics / Issue 4/2023
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI: https://doi.org/10.1007/s41060-022-00381-6
