Introduction
Methodology
Selected keywords |
---|
Data analytics |
Hadoop |
Machine learning |
MapReduce |
Large dataset |
Big data |
Data warehouse |
Predictive analytics |
NoSQL |
Unstructured data |
Data science |
Sentiment analysis |
Data center |
Keyword in title | Search term | Example | Results |
---|---|---|---|
Data analytics | "Data* Analytic*" | Data Analytic/s, Data-analytical, DATABASE ANALYTICS | 264 |
Hadoop | "Hadoop*" | Hadoop, Hadoop-based, HadoopToSQL, HadoopRDF, HadoopM | 312 |
Machine learning | "Machine* Learn*" | Machine/s learning, MACHINERY LEARNING, machine learners, Machinery Learners, Machine Learned, machine-learned | 4466 |
MapReduce | "MapReduce*" OR "Map$Reduce*" | MapReduce, Map-Reduce | 752 |
Large dataset | "Large$ Dataset*" | Large dataset/s, larger datasets | 309 |
Big data | "Big Data*" | Big data, Big Datasets, Big Databases | 1310 |
Data warehouse | "Data Warehouse*" | Data Warehouse/s | 1200 |
Predictive analytics | "Predictive Analytic*" | Predictive analytic/s | 60 |
NoSQL | "No SQL" OR "NoSQL" OR "NoSQL Database" | No-SQL, NoSQL, No SQL | 72 |
Unstructured data | "Unstructured Data" | Unstructured Data | 82 |
Data science | "Data Science*" | Data Science/s | 46 |
Sentiment analysis | "Sentiment Analy*" | Sentiment Analysis, Sentiment Analyzing, Sentiment Analyzer | 303 |
Data center | "Data Cent*" | Data Center/s, Data Centre/s, data centric | 2384 |
All above | | | 11,307 |
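The wildcard behaviour of the search terms above can be illustrated with a small sketch. It assumes the usual Web of Science conventions (`*` stands for any run of characters, including none; `$` stands for zero or one character) and additionally assumes that a space inside a quoted phrase may also match a hyphen, as the examples "Data-analytical" and "machine-learned" suggest. The function name `wildcard_to_regex` is hypothetical, and the real search engine applies further normalization that this approximation ignores.

```python
import re


def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a search phrase with * and $ wildcards into a regex.

    Assumed semantics (an approximation, not the search engine itself):
      *      zero or more word characters
      $      zero or one character (word character or hyphen)
      space  one or more spaces or hyphens
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "$":
            parts.append(r"[\w-]?")
        elif ch == " ":
            parts.append(r"[\s-]+")
        else:
            parts.append(re.escape(ch))
    # Word boundaries keep the phrase from matching inside longer words.
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def title_matches(term: str, title: str) -> bool:
    """Return True if the title contains a match for the wildcard phrase."""
    return wildcard_to_regex(term).search(title) is not None
```

For example, `title_matches("Large$ Dataset*", "Clustering larger datasets")` and `title_matches("Map$Reduce*", "A Map-Reduce framework")` both hold under these assumptions, while `title_matches("Predictive Analytic*", "Predictive modelling")` does not.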
Item | Highly cited | All papers |
---|---|---|
Results found | 28 | 6572 |
Sum of the times cited | 3549 | 32,683 |
Sum of times cited without self-citations | 3540 | 28,617 |
Average citations per item | 126.75 | 4.97 |
h-index | 19 | 64 |
- Trend of publications during 1980–2015
- Analysis of distribution of author keywords
- Analysis of distribution of KeyWords Plus
- Comparison of papers citation based on author keywords with the KeyWords Plus
- Citation analysis of the research output
Results and discussion
Title | Authors | Year | NR | TC (Rank) | Refs. |
---|---|---|---|---|---|
Trends in big data analytics | Kambatla et al. | 2014 | 75 | 6 (27) | [50] |
Big data: a survey | Chen et al. | 2014 | 155 | 7 (26) | [6] |
A comparison of parallel large-scale knowledge acquisition using rough set theory on different MapReduce runtime systems | Zhang et al. | 2014 | 46 | 9 (24) | [51] |
A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud | Zhang et al. | 2014 | 31 | 6 (27) | [52] |
Data mining with big data | Wu et al. | 2014 | 56 | 12 (23) | [1] |
Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis | Balahur and Turchi | 2014 | 39 | 9 (24) | [53] |
Techniques and applications for sentiment analysis | Feldman | 2013 | 39 | 19 (20) | [54] |
New avenues in opinion mining and sentiment analysis | Cambria et al. | 2013 | 33 | 41 (18) | [55] |
Review of performance metrics for green data centers: a taxonomy study | Wang and Khan | 2013 | 43 | 18 (21) | [56] |
G-Hadoop: MapReduce across distributed data centers for data-intensive computing | Wang et al. | 2013 | 39 | 27 (19) | [57] |
Data center network virtualization: a survey | Bari et al. | 2013 | 67 | 17 (22) | [58] |
Business intelligence and analytics: from big data to big impact | Chen et al. | 2012 | 68 | 53 (15) | [59] |
Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing | Beloglazov et al. | 2012 | 39 | 88 (12) | [60] |
A survey on optical interconnects for data centers | Kachris and Tomkos | 2012 | 64 | 49 (16) | [61] |
Scikit-learn: machine learning in Python | Pedregosa et al. | 2011 | 16 | 299 (2) | [62] |
Lexicon-based methods for sentiment analysis | Taboada et al. | 2011 | 120 | 64 (14) | [63] |
MapReduce: a flexible data processing tool | Dean and Ghemawat | 2010 | 14 | 110 (11) | [64] |
Faster and better: a machine learning approach to corner detection | Rosten et al. | 2010 | 102 | 156 (7) | [65] |
VL2: a scalable and flexible data center network | Greenberg et al. | 2009 | 23 | 121 (10) | [66] |
A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability | Garcia et al. | 2009 | 46 | 160 (5) | [67] |
Improving the performance of predictive process modeling for large datasets | Finley et al. | 2009 | 17 | 47 (17) | [68] |
CloudBurst: highly sensitive read mapping with MapReduce | Schatz | 2009 | 20 | 146 (9) | [69] |
A scalable, commodity data center network architecture | Al-Fares et al. | 2008 | 33 | 148 (8) | [70] |
MapReduce: simplified data processing on large clusters | Dean and Ghemawat | 2008 | 15 | 1249 (1) | [71] |
Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning | Ishibuchi and Nojima | 2007 | 33 | 158 (6) | [72] |
A machine learning information retrieval approach to protein fold recognition | Cheng and Baldi | 2006 | 83 | 86 (13) | [73] |
Machine learning for high-speed corner detection | Rosten and Drummond | 2006 | 35 | 251 (3) | [74] |
Predicting subcellular localization of proteins using machine-learned classifiers | Lu et al. | 2004 | 21 | 193 (4) | [75] |
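The summary figures for the highly cited set (28 results, 3549 citations in total, 126.75 citations per item, h-index 19) can be recomputed from the TC column of the table above. The sketch below copies those 28 values by hand. The helper name `h_index` is illustrative, and the h-index is taken in its standard sense: the largest h such that h papers each have at least h citations.

```python
# Times-cited (TC) values copied from the table of highly cited papers,
# in the order the rows appear.
TC = [6, 7, 9, 6, 12, 9, 19, 41, 18, 27, 17, 53, 88, 49, 299, 64, 110, 156,
      121, 160, 47, 146, 148, 1249, 158, 86, 251, 193]


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


total = sum(TC)
average = total / len(TC)
print(len(TC), total, average, h_index(TC))  # 28 3549 126.75 19
```

The same two functions would apply to the full set of 6572 papers, but their individual citation counts are not listed here.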
Document type and language
Document types in general | Percentage (out of 6572) | Document types for highly cited papers | Percentage (out of 28) |
---|---|---|---|
Proceedings paper | 62.73 | Article | 89.28 |
Article | 38.61 | Review | 10.71 |
Editorial material | 3.97 | Proceedings paper | 10.71 |
Review | 1.01 | – | |
Meeting abstract | 0.27 | – | |
News item | 0.27 | – | |
Book review | 0.24 | – | |
Letter | 0.16 | – | |
Correction | 0.12 | – | |
Software review | 0.07 | – | |
Book chapter | 0.06 | – | |
Item about an individual | 0.06 | – | |
Note | 0.03 | – | |
Reprint | 0.01 | – | |
Publication trends: annually, regions/countries, contribution of countries
Publication output
Contribution of regions/countries
Analysis of countries between all and highly cited papers
Analysis of web of science categories and journals between all and highly cited papers
Top 10 journals in highly cited papers | Impact factor 2012 | Impact factor 2013 | Top 10 journals in all papers | Impact factor 2012 | Impact factor 2013 |
---|---|---|---|---|---|
Bioinformatics | 5.323 | 4.621 | Lecture notes in computer scienceᵃ | N/A | N/A |
Communications of the ACM | 2.511 | 2.863 | Lecture notes in artificial intelligenceᵃ | N/A | N/A |
ACM SIGCOMM computer communication review | N/A | 1.102 | Expert systems with applications | 1.854 | 1.965 |
Future generation computer systems the international journal of grid computing and escience | 1.864 | 2.639 | Bioinformatics | 5.323 | 4.621 |
IEEE communications surveys and tutorials | 4.818 | 6.490 | Journal of the American medical informatics association | 3.571 | 3.932 |
International journal of approximate reasoning | 1.729 | 1.977 | Decision support systems | 2.201 | 2.036 |
Computational linguistics | 0.940 | 1.468 | Communications of the ACM | 2.511 | 2.863 |
Computational statistics & data analysis | 1.304 | 1.151 | ACM SIGCOMM computer communication review | N/A | 1.102 |
Computer speech and language | 1.463 | 1.812 | Neurocomputing | 1.634 | 2.005 |
IEEE intelligent systems | 1.930 | 1.920 | Machine learning | 1.454 | 1.689 |
Source of variation | Sum of squares | Degrees of freedom | Mean square | F value | Pr(> F) |
---|---|---|---|---|---|
Between groups | 13,988.33 | 1 | 13,988.33 | 106.98 | 7.39E−25 |
Within groups | 751,787.8 | 5750 | 130.74 | | |
Total | 765,776.2 | 5751 | | | |
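The entries of a one-way ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and the F value is the ratio of the between-groups mean square to the within-groups mean square. The sketch below checks the table above against those relations. The function name `anova_f` is illustrative, and small differences in the last digit are to be expected because the published figures are rounded.

```python
def anova_f(ss_between, df_between, ss_within, df_within):
    """Return (mean square between, mean square within, F statistic)."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


# Sums of squares and degrees of freedom taken from the table above.
ms_b, ms_w, f_value = anova_f(13988.33, 1, 751787.8, 5750)
print(round(ms_b, 2), round(ms_w, 2), round(f_value, 2))
```

The recomputed mean squares and F value agree with the reported 13,988.33, 130.74 and 106.98 to within rounding, and the two sums of squares add up to the reported total.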
Analysis of authors between all and highly cited papers
Analysis of research areas between all and highly cited papers
Analysis of author keywords and KeyWords plus
Author keyword | 1980–2015 TP | 1980–1999 TP (share) | 2000–2009 TP (share) | 2010–2015 TP (share) |
---|---|---|---|---|
Machine learning | 757 | 48 (0.06) | 304 (0.40) | 405 (0.53) |
MapReduce | 514 | N/A | 24 (0.04) | 490 (0.95) |
Data warehouse(s)/warehousing | 353 | 11 (0.03) | 215 (0.60) | 127 (0.35) |
Big data | 292 | N/A | N/A | 292 (1) |
Hadoop | 236 | N/A | 5 (0.02) | 231 (0.97) |
Cloud computing | 232 | N/A | 4 (0.01) | 228 (0.98) |
Data center(s) | 232 | N/A | 40 (0.17) | 192 (0.82) |
Data mining | 181 | 4 (0.02) | 80 (0.44) | 97 (0.53) |
Support vector machine(s) | 180 | N/A | 64 (0.35) | 116 (0.64) |
Sentiment analysis | 147 | N/A | 6 (0.04) | 141 (0.95) |
Classification(s)/classifier(s) | 112 | 4 (0.03) | 53 (0.47) | 55 (0.49) |
Neural network(s) | 85 | 9 (0.10) | 41 (0.48) | 35 (0.41) |
Performance | 84 | N/A | 14 (0.16) | 70 (0.83) |
Energy efficiency | 84 | N/A | 4 (0.04) | 80 (0.95) |
Online analytic(al) processing (OLAP) | 77 | N/A | 47 (0.61) | 30 (0.38) |
Virtualization | 64 | N/A | 14 (0.21) | 50 (0.78) |
Feature selection | 57 | N/A | 28 (0.49) | 29 (0.50) |
Cluster/clustering | 54 | 2 (0.03) | 16 (0.29) | 36 (0.66) |
Opinion mining | 49 | N/A | 5 (0.10) | 44 (0.89) |
Scheduling | 47 | N/A | 5 (0.10) | 42 (0.89) |
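The values in parentheses behave like each period's share of the keyword's total publications (TP), truncated to two decimals. The sketch below reproduces two rows of the table on that assumption. The function name `period_shares` is illustrative, and the truncation rule is inferred from the figures rather than stated in the source.

```python
def period_shares(counts):
    """Share of the total for each period, truncated to two decimals.

    Integer arithmetic is used so that the truncation is exact.
    """
    total = sum(counts)
    return [(count * 100 // total) / 100 for count in counts]


# Machine learning: 48, 304 and 405 papers in the three periods (TP = 757).
print(period_shares([48, 304, 405]))  # [0.06, 0.4, 0.53]

# Data warehouse(s)/warehousing: 11, 215 and 127 papers (TP = 353).
print(period_shares([11, 215, 127]))  # [0.03, 0.6, 0.35]
```

Because the shares are truncated rather than rounded, the three values in a row can sum to slightly less than 1.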
Source of variation | Sum of squares | Degrees of freedom | Mean square | F value | Pr(> F) |
---|---|---|---|---|---|
Between groups | 3956.402 | 1 | 3956.40 | 121.97 | 2.69E−28 |
Within groups | 852,197.5 | 26,274 | 32.43 | | |
Total | 856,153.9 | 26,275 | | | |
Multi-regression analysis
Variable | Coefficient | Standard error | t stat | P-value |
---|---|---|---|---|
Intercept | −2.47 | 0.98 | −2.49 | 0.012 |
Number of authors (NA) | −0.36 | 0.21 | −1.74 | 0.081 |
Number of pages (NP) | 0.38 | 0.07 | 5.12 | 3.09E−07 |
Number of references (NR) | 0.22 | 0.02 | 8.40 | 5.03E−17 |
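Read as a fitted linear model, the coefficients above combine into a single prediction: intercept plus each coefficient times its predictor. The sketch below evaluates that sum. It assumes the response variable is a paper's citation count, which the table itself does not state, and the function name and example paper are hypothetical.

```python
# Coefficients copied from the regression table above.
COEFFICIENTS = {"intercept": -2.47, "NA": -0.36, "NP": 0.38, "NR": 0.22}


def predicted_response(n_authors, n_pages, n_references):
    """Evaluate the fitted linear model for one paper.

    Assumption: the response is the paper's citation count.
    """
    return (COEFFICIENTS["intercept"]
            + COEFFICIENTS["NA"] * n_authors
            + COEFFICIENTS["NP"] * n_pages
            + COEFFICIENTS["NR"] * n_references)


# Hypothetical paper: 3 authors, 10 pages, 40 references.
print(round(predicted_response(3, 10, 40), 2))  # 9.05
```

With a P-value of 0.081, the number of authors is not significant at the 5% level, so its contribution to such a prediction should be read with caution; the number of pages and the number of references are the significant predictors.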