Introduction
Data analytics
Data input
Data analysis
Output the result
| | Classified positive | Classified negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |
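The four cells of the table above can be counted directly from a list of actual and predicted labels. The following is a minimal sketch, not code from the survey: the function names, the `positive` parameter, and the sample labels are all assumptions made for illustration, and precision, recall, and accuracy are the standard measures derived from these counts.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, and TN for a binary classification result."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FN", "FP", "TN")}


def precision_recall_accuracy(c):
    """Derive summary measures; return 0.0 whenever a denominator is zero."""
    tp, fn, fp, tn = c["TP"], c["FN"], c["FP"], c["TN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical labels, used only to exercise the functions.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
precision, recall, accuracy = precision_recall_accuracy(counts)
```

With these sample labels the counts are TP = 3, FN = 1, FP = 1, and TN = 3, so all three measures equal 0.75.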
Summary
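Problem | Method | References |
---|---|---|
Clustering | BIRCH | [44] |
Clustering | DBSCAN | [45] |
Clustering | Incremental DBSCAN | [46] |
Clustering | RKM | [47] |
Clustering | TKM | [42] |
Classification | SLIQ | [50] |
Classification | TLAESA | [51] |
Classification | FastNN | [52] |
Classification | SFFS | [53] |
Classification | GPU-based SVM | [43] |
Association rules | CLOSET | [54] |
Association rules | FP-tree | [32] |
Association rules | CHARM | [55] |
Association rules | MAFIA | [56] |
Association rules | FAST | [57] |
Sequential patterns | SPADE | [58] |
Sequential patterns | CloSpan | [59] |
Sequential patterns | PrefixSpan | [60] |
Sequential patterns | SPAM | [61] |
Sequential patterns | ISE | [62] |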
- Unscalability and centralization: Most traditional data analysis methods were not designed for large-scale, complex datasets, so they cannot simply be scaled up. They typically assume that the analysis runs on a single machine with all the data held in memory. For this reason, traditional data analytics is limited in its ability to handle the volume problem of big data.
- Non-dynamic: Most traditional data analysis methods cannot be adjusted dynamically for different situations; that is, they do not analyze the input data on the fly. For example, a classifier is usually fixed once it has been built and cannot change automatically. Incremental learning [66] is a promising research trend because it can adjust the classifier dynamically during training with limited resources. As a result, traditional data analytics may not be suited to the velocity problem of big data.
- Uniform data structure: Most data mining problems assume that all input data share the same format. Traditional data mining algorithms may therefore be unable to handle inputs whose formats differ or whose records are incomplete. Converting input data from different sources into a common format is one possible solution to the variety problem of big data.
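To make the contrast between a fixed classifier and an incrementally updated one concrete, the following is a minimal sketch of a nearest-centroid classifier that folds in one labelled example at a time and never stores the raw data. It is an illustrative stand-in, not the incremental learning method of [66]; the class name, the `partial_fit` and `predict` method names, and the sample stream are assumptions.

```python
class IncrementalCentroidClassifier:
    """Nearest-centroid classifier updated one example at a time."""

    def __init__(self):
        self.centroids = {}  # label -> running mean vector
        self.counts = {}     # label -> number of examples seen so far

    def partial_fit(self, x, label):
        """Fold one labelled example into the running mean of its class."""
        if label not in self.centroids:
            self.centroids[label] = list(x)
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        centroid = self.centroids[label]
        for i, xi in enumerate(x):
            # Running-mean update: no earlier example has to be kept in memory.
            centroid[i] += (xi - centroid[i]) / n

    def predict(self, x):
        """Return the label whose centroid is closest (squared Euclidean distance)."""
        if not self.centroids:
            raise ValueError("no training examples seen yet")

        def distance(label):
            return sum((xi - ci) ** 2 for xi, ci in zip(x, self.centroids[label]))

        return min(self.centroids, key=distance)


# Hypothetical stream of (features, label) pairs.
clf = IncrementalCentroidClassifier()
stream = [((0.0, 0.0), "a"), ((1.0, 1.0), "a"), ((9.0, 9.0), "b"), ((11.0, 11.0), "b")]
for features, label in stream:
    clf.partial_fit(features, label)
```

Because each update touches only one centroid, memory use depends on the number of classes and features rather than on the number of examples, which is the property the volume and velocity discussion above is concerned with.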
Big data analytics
- From the volume perspective, the deluge of input data is the first thing we have to face, because it may paralyze the data analytics. In contrast to traditional data analytics, Baraniuk [71] pointed out that for wireless sensor network data analysis the bottleneck of big data analytics will shift from the sensors to the processing, communication, and storage of sensing data, as shown in Fig. 6. Sensors can now gather much more data, but uploading such large volumes of data to the upper-layer system may create bottlenecks everywhere.
- From the velocity perspective, real-time or streaming data raise the problem of a large quantity of data arriving at the data analytics within a short duration, which the devices and systems may not be able to handle. This situation is similar to network flow analysis, where we typically cannot mirror and analyze everything we can gather.
- From the variety perspective, because the incoming data may be of different types or incomplete, how to handle them raises another issue for the input operators of data analytics.
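When a stream arrives faster than it can be analyzed in full, one common workaround is to keep only a fixed-size uniform sample of it. The following is a minimal sketch of reservoir sampling, a technique introduced here for illustration and not prescribed by the text above; the function name, the `rng` parameter, and the integer stream are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream: 100000 integers read once, with only 10 kept in memory.
sample = reservoir_sample(range(100000), k=10, rng=random.Random(0))
```

The stream is read exactly once and memory use is bounded by `k`, so the sampler can sit in front of an analysis step that cannot keep up with the full input rate.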
Big data input
Big data analysis frameworks and platforms
Research in frameworks and platforms
Comparison between the frameworks/platforms of big data
Big data analysis algorithms
Mining algorithms for specific problems
Machine learning for big data mining
Output the result of big data analysis
\(\mathcal{P}\) | Name | References | Year | Description | \(\mathcal{T}\) |
---|---|---|---|---|---|
Analysis framework | DOT | [88] | 2011 | Add more computation resources via scale-out solution | Framework |
Analysis framework | GLADE | [89] | 2011 | Multi-level tree-based system architecture | Framework |
Analysis framework | Starfish | [92] | 2012 | Self-tuning analytics system | Framework |
Analysis framework | ODT-MDC | [96] | 2012 | Privacy issues | Framework |
Analysis framework | MRAM | [91] | 2013 | Mobile agent technologies | Framework |
Analysis framework | CBDMASP | [94] | 2013 | Statistical computation and data mining approaches | Framework |
Analysis framework | SODSS | [97] | 2013 | Decision support system issues | Framework |
Analysis framework | BDAF | [93] | 2014 | Data centric architecture | Framework |
Analysis framework | HACE | [95] | 2014 | Data mining approaches | Framework |
Analysis framework | Hadoop | [83] | 2011 | Parallel computing platform | Platform |
Analysis framework | CUDA | [84] | 2007 | Parallel computing platform | Platform |
Analysis framework | Storm | [85] | 2014 | Parallel computing platform | Platform |
Analysis framework | Pregel | [125] | 2010 | Large-scale graph data analysis | Platform |
Analysis framework | MLPACK | [86] | 2013 | Scalable machine learning library | ML |
Analysis framework | Mahout | [87] | 2011 | Machine-learning algorithms | ML |
Analysis framework | MLAS | [124] | 2012 | Machine-learning algorithms | ML |
Analysis framework | PIMRU | [124] | 2012 | Machine-learning algorithms | ML |
Analysis framework | Radoop | [129] | 2011 | Data analytics, machine learning algorithms, and R statistical tool | ML |
Mining algorithm | DBDC | [144] | 2004 | Parallel clustering | CLU |
Mining algorithm | PKM | [145] | 2009 | Map-reduce-based k-means clustering | CLU |
Mining algorithm | CloudVista | [111] | 2012 | Cloud computing for clustering | CLU |
Mining algorithm | MSFCUDA | [113] | 2013 | GPU for clustering | CLU |
Mining algorithm | BDCAC | [127] | 2013 | Ant on grid computing environment for clustering | CLU |
Mining algorithm | Corest | [114] | 2013 | Use a tree construction for generating the coresets in parallel for clustering | CLU |
Mining algorithm | SOM-MBP | [126] | 2013 | Neural network with CGP for classification | CLA |
Mining algorithm | CoS | [115] | 2013 | Parallel computing for classification | CLA |
Mining algorithm | SVMGA | [72] | 2014 | Using GA to reduce the number of dimensions | CLA |
Mining algorithm | Quantum SVM | [116] | 2014 | Quantum computing for classification | CLA |
Mining algorithm | DPSP | [121] | 2010 | Applied frequent pattern algorithm to cloud platform | FP |
Mining algorithm | DHTRIE | [120] | 2011 | Applied frequent pattern algorithm to cloud platform | FP |
Mining algorithm | SPC, FPC, and DPC | [117] | 2012 | Map-reduce model for frequent pattern mining | FP |
Mining algorithm | MFPSAM | [119] | 2014 | Considered specific interest constraints and applied map-reduce model | FP |
Summary of process of big data analytics
The open issues
Platform and framework perspective
Input and output ratio of platform
Communication between systems
Bottlenecks on data analytics system
Security issues
Data mining perspective
Data mining algorithm for map-reduce solution
Noise, outliers, incomplete and inconsistent data
Bottlenecks on data mining algorithm
Privacy issues
Conclusions
- In terms of computation time, parallel computing is clearly one of the important future trends for making data analytics work for big data; consequently, cloud computing, Hadoop, and map-reduce will play important roles in big data analytics. Scheduling methods are another future trend, because they manage the computation resources of a cloud-based platform so that the data analysis task finishes as quickly as possible.
- Efficient methods for reducing the computation time of input, comparison, sampling, and a variety of reduction methods will play an important role in big data analytics. Because these methods typically do not consider the parallel computing environment, making them work in such an environment will be a future research trend. The data mining algorithms face the same situation as the input methods: because there are abundant research results on traditional data mining algorithms, making them work in a parallel computing environment will be a very important research trend.
- How to model the mining problem so as to find something in big data, and how to display the knowledge obtained from big data analytics, will be two more vital future trends, because the results of these two lines of research will decide whether data analytics can work in real-world applications rather than remain purely theoretical.
- Methods for extracting information from external and related knowledge resources to reinforce big data analytics are, so far, not very popular. However, combining information from different resources to add value to the output knowledge is a common solution in information retrieval, for example in clustering search engines and document summarization. For this reason, information fusion will also be a future trend for improving the end results of big data analytics.
- Because metaheuristic algorithms can find an approximate solution within a reasonable time, they have been widely used to solve data mining problems in recent years. Yet many state-of-the-art metaheuristic algorithms have still not been applied to big data analytics. In addition, compared with some early data mining algorithms, metaheuristics are undoubtedly superior in terms of computation time and the quality of the end result. From these observations, the application of metaheuristic algorithms to big data analytics will also be an important research topic.
- Because social networks are part of most people's daily lives and their data are also a kind of big data, how to analyze social network data has become a promising research issue. Such analysis can be used to predict the behavior of a user, after which applicable strategies can be made for that user. For instance, a business intelligence system can use the analysis results to encourage particular customers to buy the goods they are interested in.
- The security and privacy issues that accompany data analysis are intuitive research topics, which include how to store the data safely, how to make sure data communication is protected, and how to prevent others from finding out information about us. Many data security and privacy problems are essentially the same as those of traditional data analysis, even as we enter the big data age. Thus, how to protect the data will also appear in research on big data analytics.
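As an illustration of what making a traditional mining algorithm work in a parallel computing environment can look like, the following is a minimal sketch of one k-means iteration written in map-reduce style. It is an assumption-laden sketch rather than any of the surveyed systems: the function names, the partition layout, and the sample points are hypothetical, and the map calls run sequentially here although each could run on a separate worker.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def map_partition(points, centroids):
    """Map step: partial coordinate sums and counts per centroid for one partition."""
    dims = len(centroids[0])
    partial = defaultdict(lambda: [[0.0] * dims, 0])
    for point in points:
        entry = partial[nearest(point, centroids)]
        for d, value in enumerate(point):
            entry[0][d] += value
        entry[1] += 1
    return dict(partial)


def reduce_partials(partials, centroids):
    """Reduce step: merge the partial sums and emit the updated centroids."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for partial in partials:
        for j, (vector, n) in partial.items():
            counts[j] += n
            for d, value in enumerate(vector):
                sums[j][d] += value
    # A centroid with no assigned points keeps its previous position.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(len(centroids))
    ]


def kmeans_iteration(partitions, centroids):
    """Run the map step on every partition, then a single reduce step."""
    partials = [map_partition(part, centroids) for part in partitions]
    return reduce_partials(partials, centroids)


# Hypothetical data split across three partitions.
partitions = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (0.0, 1.0)], [(11.0, 10.0)]]
centroids = kmeans_iteration(partitions, [[0.0, 0.0], [10.0, 10.0]])
```

Only the small per-partition summaries travel to the reduce step, not the points themselves, which is what lets the data stay distributed across machines.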