Introduction
- Data recording: the challenges and tools involved in capturing and storing data.
- Data pre-processing: the operations that clean the captured data and transform it into a ready-to-analyze form, in order to optimize the analysis step.
- Data analysis: the task of evaluating data with different algorithms, following a logical reasoning that examines each component of the provided data, with the aim of delivering insightful outcomes.
- Data visualization and interpretation: the step of representing the extracted knowledge effectively, using different methods, in order to determine the significance and importance of the findings.
- Filters: filter methods are a preprocessing step that is independent of any subsequent learning algorithm. They select features with an evaluation criterion, or score, that assesses the degree of relevance of each feature to a target variable [5].
- Wrappers: wrapper methods evaluate a subset of features through the accuracy of a predictive model trained on them; a classifier estimates the relevance of a given feature subset. These methods have proven effective, yet computationally expensive, which makes them less popular [6].
- Embedded: embedded methods combine the qualities of filter and wrapper methods. Since filter methods are faster but less effective, while wrapper methods are more effective but computationally expensive, especially on big datasets, a solution that combines the advantages of both was needed.
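The three families can be sketched side by side. The snippet below is a minimal illustration on synthetic data, assuming scikit-learn as the tool (the surveyed studies use a variety of environments such as Weka, MapReduce, or Matlab); it is not a reproduction of any reviewed method.

```python
# Minimal sketch of the three feature-selection families on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Filter: score each feature against the target, independently of any learner.
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: repeatedly train a classifier and recursively drop weak features.
wrapper_sel = RFE(LogisticRegression(max_iter=1000),
                  n_features_to_select=5).fit(X, y)

# Embedded: selection happens inside training, here via an L1 penalty.
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit(X, y)

print("filter keeps:  ", filter_sel.get_support(indices=True))
print("wrapper keeps: ", wrapper_sel.get_support(indices=True))
print("embedded keeps:", embedded_sel.get_support(indices=True))
```

Note how the wrapper retrains the classifier at every elimination step, which is where its computational cost comes from, while the filter only computes per-feature scores once.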
The mapping process
Research questions
- What type of research is being conducted in the paper?
- What type of contribution was proposed in the paper?
- What is its use in bioinformatics?
- Is the data mining predictive or descriptive?
- What type of predictive/descriptive modelling is used?
Query search
Terms | Synonyms list |
---|---|
Feature selection | Variable selection, dimensionality reduction |
Big data | Multi-dimensional data, high-dimensional data, Hadoop, MapReduce, Spark |
Genomics | Genetics, bioinformatics, micro-array data |
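The exact query strings submitted to each repository are not reproduced here; the sketch below shows one common way such terms and synonyms are assembled into a boolean query (OR within a synonym group, AND across groups), which is an assumption about the construction rather than the paper's literal query.

```python
# Assemble a boolean search query from the terms/synonyms table above.
terms = {
    "Feature selection": ["Variable selection", "dimensionality reduction"],
    "Big data": ["Multi-dimensional data", "high-dimensional data",
                 "Hadoop", "MapReduce", "Spark"],
    "Genomics": ["Genetics", "bioinformatics", "micro-array data"],
}

def build_query(terms):
    groups = []
    for term, synonyms in terms.items():
        # Any synonym may match (OR), but every concept must appear (AND).
        alternatives = " OR ".join(f'"{t}"' for t in [term] + synonyms)
        groups.append(f"({alternatives})")
    return " AND ".join(groups)

print(build_query(terms))
```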
Repositories | Number of publications |
---|---|
ACM | 216,222 |
IEEE Xplore | 38,268 |
Science Direct | 46,180 |
Scopus | 17,600 |
Identification of relevant work
- The reputation of the academic source, such as the journal or conference.
- Articles referenced in one of the considered articles and related to the subject.
- Delete publications that do not contain the term ‘Feature Selection’, or any of its synonyms, in the title, abstract, or metadata section of the document.
- Delete publications that do not contain the term ‘Big Data’, or any of its synonyms, in the title, abstract, or metadata section of the document.
- Delete documents that merely mention the terms without making them a subject of the study.
- Set aside publications that do not present a strong study involving the three terms, judged by examining the introduction, conclusion, and results sections of each publication.
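The first two exclusion steps amount to a mechanical screening pass over publication records. The sketch below illustrates that pass; the record structure and field names are hypothetical, and the final manual steps (checking whether terms are actually a subject of study) cannot be automated this way.

```python
# Hypothetical screening pass: keep a publication only if every concept
# (via any of its synonyms) appears in the title, abstract, or metadata.
FEATURE_SELECTION = {"feature selection", "variable selection",
                     "dimensionality reduction"}
BIG_DATA = {"big data", "multi-dimensional data", "high-dimensional data",
            "hadoop", "mapreduce", "spark"}

def mentions(pub, synonyms):
    text = " ".join([pub["title"], pub["abstract"], pub["metadata"]]).lower()
    return any(term in text for term in synonyms)

def screen(publications):
    return [p for p in publications
            if mentions(p, FEATURE_SELECTION) and mentions(p, BIG_DATA)]

pubs = [
    {"title": "Feature selection with Spark", "abstract": "...", "metadata": ""},
    {"title": "A survey of deep learning",    "abstract": "...", "metadata": ""},
]
print([p["title"] for p in screen(pubs)])
```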
Classification characteristics
| Types of research | Types of contribution | Types of analytics |
|---|---|---|
| Validation | Architecture | Predictive |
| Experience | Framework | Descriptive |
| Opinion | Methodology | |
| Philosophical | Model | |
| Solution | Platform | |
| | Process | |
| | Theory | |
| | Tool | |
- Validation research: research that thoroughly investigates a previously proposed solution.
- Experience research: a study in which the researcher lays out the steps of an experimental study and presents experimental results.
- Opinion research: the researcher's subjective opinion of a certain method, compared with other related work.
- Philosophical research: research that analyses a certain problem on a theoretical level.
- Solution research: a proposed solution to a certain problem, supported by experiments and proof of validity.
- Architecture: a solution constructed of multiple components working together for better results.
- Framework: a potentially extensible combination of various libraries that solve a certain problem.
- Methodology: a contribution to the methods for solving a certain computational issue.
- Model: a predictive/descriptive model trained to solve a particular problem.
- Platform: a combination of hardware and software solutions enabling applications to run.
- Process: a data-processing workflow proposed for solving a particular problem.
- Theory: philosophical guidance towards solving a certain problem.
- Tool: a well-defined software utility addressing a subset of a bigger problem.
- Predictive analysis: an analytical study of current data with the aim of predicting future outcomes.
- Descriptive analysis: an analytical description of the basic features of a dataset that provides simple summaries about a sample.
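The predictive/descriptive distinction can be made concrete on a toy sample (the numbers below are purely illustrative, not drawn from any reviewed study): descriptive analysis summarizes the sample itself, while predictive analysis fits something from the data and applies it to unseen cases.

```python
# Descriptive analysis: summarize the sample. Predictive analysis: learn a
# rule from current data and apply it to a new measurement.
import statistics

expression = [2.1, 3.4, 2.8, 5.0, 4.2, 3.9]   # toy gene-expression values
labels     = [0,   0,   0,   1,   1,   1]     # toy disease status

# Descriptive: simple summaries of the sample.
print("mean:", statistics.mean(expression))
print("stdev:", round(statistics.stdev(expression), 3))

# Predictive: a threshold "model" placed midway between the class means,
# then applied to a new, unseen measurement.
threshold = (statistics.mean([x for x, y in zip(expression, labels) if y == 0]) +
             statistics.mean([x for x, y in zip(expression, labels) if y == 1])) / 2

def predict(x):
    return int(x > threshold)

print("prediction for 4.5:", predict(4.5))
```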
Mapping results and discussion
Repositories | Number of publications |
---|---|
ACM | 1 |
IEEE Xplore | 9 |
Science Direct | 19 |
Scopus | 2 |
Review of results
Abbreviations: Vo, Va, Ve = volume, variety, velocity (the big-data dimension addressed); F, W, Em, H, En, I = filter, wrapper, embedded, hybrid, ensemble, integrative (type of feature selection).

| Refs | App in genomics | Algorithm | Datasets | Evaluation methods | Technologies | Advantages | Disadvantages | Vo | Va | Ve | F | W | Em | H | En | I |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[17] | Sorting genomes | – | No datasets | – | – | – | – | – | – | Y | – | – | – | – | – | – |
[18] | Classification | mRMR | Colorectal, liverM, pancreatic, central nervous system (CNS), leukemia data | Cross-validation | – | Good classification accuracy | – | Y | Y | – | Y | – | – | – | – | – |
[19] | Prediction | PFBP | Nucleotide polymorphism (SNP) data | Bootstrapping | MapReduce | Reduces time complexity with better accuracy parallelized | – | Y | Y | – | Y | – | – | – | – | – |
[20] | Genetic trait prediction | MINT | Real data: maize data rice data pine data | Cross-validation | – | Reduces time complexity | – | Y | Y | – | – | Y | – | – | – | – |
[21] | Prediction | Boruta Random Forest | Next-generation sequencing laboratory of Novogene Bioinformatics Institute, Beijing, China | Bootstrapping | NetBeans | Good prediction accuracy | Small sample size | Y | Y | – | – | Y | – | – | – | –
[22] | Prediction | MIMIC FS | ASU datasets | Cross-validation | Weka | Good performance | – | Y | Y | – | – | Y | – | – | – | –
[23] | Marker selection | FIFS | Single nucleotide polymorphism (SNP) | Train and test | – | High success rate | Not parallelized | Y | Y | – | – | Y | – | – | – | –
[24] | Binning for prediction | Random forest Naïve Bayes | Generated datasets | Train and test | – | Dataset presents better prediction | – | Y | Y | – | – | – | Y | – | – | – |
[25] | Classification predicting disease | SVEGA | Breast cancer dataset Kent Ridge biomedical repository | TPR/FPR | – | Classification accuracy rate | Not parallelized | Y | Y | – | – | – | Y | – | – | –
[26] | Classification prediction | SVM | Kent Ridge Biomedical Dataset Repository and National Center for Biotechnology Information | ANOVA | Hadoop MapReduce | Good accuracy rate | – | Y | Y | – | – | – | Y | – | – | –
[27] | Classification prediction | K-nearest neighbor | National Center of Biotechnology Information NCBI GEO | Cross-validation | Hadoop MapReduce | Reduces time complexity Parallelized | – | Y | Y | – | – | – | – | – | – | Y |
[28] | Identification of gene expression signatures | SVM | 20,475 features in 1920 samples, a high-dimensional dataset (source not mentioned) | Cross-validation | Weka | Better understanding of the classification | – | Y | Y | – | – | – | Y | – | – | –
[29] | Prediction | Cox-regression | The Cancer Genome Atlas datasets, glioblastoma and lung adenocarcinoma | Cross-validation | – | Higher true-variable rate Better predictive performance Easy to implement | – | Y | Y | – | – | – | Y | – | – | –
[30] | Prediction | mRMR IFS | Genome-wide association studies | Cross-validation | Weka | Good classification performance | Not parallelized | Y | Y | – | – | – | Y | – | – | – |
[31] | Prediction | mRMR IFS | UniProt database http://www.uniprot.org | Cross-validation | Weka | High prediction accuracy | Not parallelized | Y | Y | – | – | – | Y | – | – | –
[32] | Classification | ROSEFW-RF | Generated with the ROS technique | Train and test | MapReduce | Parallelized Suitable for large scale data | – | Y | Y | – | – | – | Y | – | – | – |
[33] | Genetic association | Screening | GEO database with ID GSE13355 and GSE14905 | Cross-validation | – | Good classification accuracy | – | Y | Y | – | – | – | – | Y | – | – |
[34] | Classification | Decision Tree Support Vector Machine | UCI machine learning repository http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html | Cross-validation | Weka | Simplicity of implementation Reduces time complexity High accuracy | Expensive computational cost | Y | Y | – | – | – | – | Y | – | – |
[35] | Prediction | Pearson Correlation Coefficient (PCC) Information Gain (IG) and ReliefF | Prokaryotic model organism E. coli, as a real biological network | Fitness function | – | High speed and prediction accuracy Easily parallelizable | – | Y | Y | – | – | – | – | Y | – | –
[36] | Classification | SVM | Sentiment classification of on-line reviews using data collected from Amazon, IMDb, and Yelp. Cancer classification based on gene expressions for leukemia, prostate cancer, and lung cancer | Hold-out validation | – | Simplicity and low error rates | Lack of scalability Not parallelized | Y | Y | – | – | – | – | Y | – | –
[37] | Classification | SVM | Breast cancer, colorectal adenocarcinoma, head and neck squamous cell carcinoma, kidney renal clear cell carcinoma, ovarian cancer http://www.cbio.mskcc.org/cancergenomics/pancan | Cross-validation | Weka | Optimal classification | – | Y | Y | – | – | – | Y | – | – | – |
[38] | Clustering | Different clustering algorithms | Exome dataset of Brugada syndrome (BrS) | – | – | Suitable for high-dimensional genomic big data | No parallel implementation | Y | Y | – | – | – | – | – | Y | – |
[39] | Classification next generation sequencing | SVM Random Forest | NCBI Reference Sequence database, http://www.ncbi.nlm.nih.gov/refseq/ | Cross-validation | Hadoop MapReduce | Scalable High classification accuracy | – | Y | Y | – | – | – | – | – | Y | –
[40] | Identification of genetic markers prediction | Sparse Regression | SNP: a database of single nucleotide polymorphisms http://www.alzgene.org | Cross-validation | – | Good accuracy for selection of features | Not always trivial | Y | Y | – | – | – | – | – | – | Y
[41] | Prediction detecting SNP interactions | LogicFS-GPU | Simulated and real schizophrenia data set | Cross-validation | MapReduce | Parallel design of the algorithm | Expensive computational cost | Y | Y | – | – | – | – | – | – | Y
[42] | Sequencing | PrefDiv and MGM PC-Stable | Pathway information database Cancer Genome Atlas (TCGA) | Cross-validation | – | Combining two algorithms to enhance accuracy | – | Y | Y | – | – | – | – | – | – | Y
[43] | Prediction | Fireflies and ant colony | PDB Bank dataset Varibench Protein data Lung Cancer data bank Marketing | TPR/FPR | Matlab | High efficiency for feature selection | – | Y | Y | – | – | – | – | – | – | Y |
[44] | Classification for prediction | ANOVA and K-Nearest Neighbor | NCBI GEO Leukemia Ovarian Cancer Breast Cancer | ANOVA | MapReduce | Distributed and scalable | – | Y | Y | – | – | – | – | – | – | Y |
[45] | Classification for prediction | Decision tree k-nearest-neighbor | Brugada syndrome at Centre for Medical Genetics http://www.uzbrussel.be | Cross-validation | Weka | Good prediction accuracy Good with heterogeneous data | – | Y | Y | – | – | – | – | – | – | Y |
[46] | Classification | – | Real-life biomedical data, SNP repository data; mixture models simulation studies | Cross-validation | MapReduce | High classification performance Parallelized | – | Y | Y | – | – | – | – | – | – | Y |
[47] | Classification | – | Graph datasets of protein 3D-structures | Cross-validation | MapReduce | Improves prediction accuracy | – | Y | Y | – | – | – | – | – | – | Y |
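Cross-validation is by far the most common evaluation method in the table above. A minimal sketch of how it evaluates a feature-selection-plus-classifier pipeline follows; scikit-learn, the synthetic data, and the filter-plus-SVM pipeline are illustrative assumptions, not a reproduction of any reviewed study.

```python
# 5-fold cross-validation of a filter + SVM pipeline on synthetic
# high-dimensional data. Fitting the selector inside each fold prevents
# test-fold information from leaking into the feature-selection step.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=500, n_informative=10,
                           random_state=0)

pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC())
scores = cross_val_score(pipe, X, y, cv=5)   # one accuracy score per fold
print("accuracy per fold:", scores.round(2))
print("mean accuracy:", scores.mean().round(2))
```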