2020 | Original Paper | Book Chapter

FastFeatGen: Faster Parallel Feature Extraction from Genome Sequences and Efficient Prediction of DNA \(N^6\)-Methyladenine Sites

Author: Md. Khaledur Rahman

Published in: Computational Advances in Bio and Medical Sciences

Publisher: Springer International Publishing

Abstract

\(N^6\)-methyladenine is widely found in both prokaryotes and eukaryotes. It is involved in many biological processes, including the prokaryotic defense system and human diseases. It is therefore important to locate it correctly in the genome, where it may play a significant role in different biological functions. A few computational tools exist for this purpose, but they are computationally expensive and leave room to improve accuracy. An informative feature extraction pipeline from genome sequences is at the heart of these tools, as it is for many other bioinformatics tools. However, sequential feature extraction becomes expensive when the data is large, so a scalable parallel approach is highly desirable. In this paper, we develop a new tool, FastFeatGen, that emphasizes both a parallel feature extraction technique and improved accuracy using machine learning methods. We implement our feature extraction approach using shared memory parallelism, which achieves around a 10\(\times\) speedup over the sequential version. We then employ an exploratory feature selection technique, which helps to find more relevant features to feed to machine learning methods. We employ an Extra-Tree Classifier (ETC) in FastFeatGen and perform experiments on rice and mouse genomes. Our experiments achieve accuracies of 85.57% and 96.64%, respectively, which are better than or competitive with current state-of-the-art methods. Our shared-memory tool can also serve queries much faster than the sequential technique. All source code and datasets are available at https://github.com/khaled-rahman/FastFeatGen.
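To make the idea of data-parallel feature extraction concrete, the following is a minimal sketch and not the FastFeatGen implementation. It assumes that overlapping k-mer counts are one of the features, which the abstract does not state, and every name in it (`kmer_index`, `kmer_counts`, `extract_features`) is hypothetical. Each sequence produces its feature row independently, so the rows can be computed by threads that share the same read-only k-mer index.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map every k-mer over ACGT to a fixed column position."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_counts(seq, k, index):
    """Count overlapping k-mers in one sequence.

    Windows containing symbols outside ACGT (for example N) are skipped.
    """
    row = [0] * len(index)
    for i in range(len(seq) - k + 1):
        col = index.get(seq[i:i + k].upper())
        if col is not None:
            row[col] += 1
    return row


def extract_features(seqs, k=2, workers=4):
    """Build one feature row per sequence, spreading sequences over threads.

    The index is shared by all threads and only read, never written,
    so no locking is needed. pool.map preserves the input order.
    """
    index = kmer_index(k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: kmer_counts(s, k, index), seqs))
```

The threads here only illustrate how the work decomposes. In CPython the global interpreter lock limits the speedup of pure-Python loops, so a speedup of the size reported in the abstract would normally require a compiled shared-memory implementation.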

Metadata
Title
FastFeatGen: Faster Parallel Feature Extraction from Genome Sequences and Efficient Prediction of DNA \(N^6\)-Methyladenine Sites
Author
Md. Khaledur Rahman
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-46165-2_5