Introduction
Data lifecycle management (DLM)
Data lifecycle models
Data modelling
- How should metadata be described (or characterised)? This calls for a metadata schema that can be exploited to efficiently place a certain Big Data application across multiple Clouds while respecting both user constraints and requirements. Such a metadata schema has been proposed partially in [25] and completely in [26].
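As a purely illustrative aid, the following Python sketch shows one possible shape for such a placement-oriented metadata schema covering both data characteristics and user constraints. The field names and units are assumptions made for exposition; they are not the schemas of [25] or [26].

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical placement-oriented metadata schema; all field names and
# units are illustrative assumptions, not the published schema of [25, 26].
@dataclass
class DatasetMetadata:
    name: str
    volume_gb: float            # volume: current size of the data set
    velocity_mb_per_s: float    # velocity: ingestion/update rate
    variety: List[str]          # variety: data forms (e.g. "document", "graph")
    veracity_score: float       # veracity: confidence in data quality, 0..1
    allowed_regions: List[str] = field(default_factory=list)  # user constraint
    max_access_latency_ms: float = 100.0                      # user requirement

meta = DatasetMetadata(
    name="clickstream",
    volume_gb=1200.0,
    velocity_mb_per_s=35.0,
    variety=["document", "time-series"],
    veracity_score=0.9,
    allowed_regions=["eu-west-1", "eu-central-1"],
)
```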
Data lifecycle management systems
- Metadata management takes care of maintaining information which concerns both the static and dynamic characteristics of data. It is the cornerstone for enabling efficient data management.
- Data placement encapsulates the main methods for efficient data placement and data replication while satisfying user requirements.
- Data storage is responsible for proper (transactional) storage and efficient data retrieval support.
- Data ingestion enables importing and exporting the data over the respective system.
- Big Data processing supports the efficient and clustered processing of Big Data by executing the main logic of the user application(s).
- Resource management is responsible for the proper and efficient management of computational resources.
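To make this decomposition concrete, here is a minimal, purely illustrative sketch of the six components as Python interfaces. All class and method names are assumptions for exposition, not the API of any existing DLM system.

```python
from abc import ABC, abstractmethod

# Illustrative decomposition of a DLM system into the components above.
class MetadataManager(ABC):
    @abstractmethod
    def describe(self, dataset_id: str) -> dict: ...   # static + dynamic characteristics

class DataPlacer(ABC):
    @abstractmethod
    def place(self, dataset_id: str, constraints: dict) -> list: ...  # chosen sites/replicas

class DataStore(ABC):
    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class Ingestor(ABC):
    @abstractmethod
    def import_data(self, source_uri: str, dataset_id: str) -> None: ...
    @abstractmethod
    def export_data(self, dataset_id: str, target_uri: str) -> None: ...

class Processor(ABC):
    @abstractmethod
    def submit(self, job: dict) -> str: ...   # run the application's main logic

class ResourceManager(ABC):
    @abstractmethod
    def allocate(self, cpu: int, memory_gb: int) -> str: ...  # returns a resource handle
```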
Methodology
SLR planning
SLR need identification
Research questions identification
SLR protocol formation
(Big Data) AND (METHODOLOGY OR METHOD OR ALGORITHM OR APPROACH OR SURVEY OR STUDY) AND (MANAGEMENT OR PLACEMENT OR POSITION OR ALLOCATION OR STORAGE), with time span: 2010–2018
SLR conduction
Study selection
Inclusion criteria:

- Peer-reviewed articles.
- Recent articles only (published within the last 8 years).
- In case of equivalent studies, only the one published in the highest-rated journal or conference is selected, so as to retain only a high-quality set of articles on which the review is conducted.
- Articles which supply methodologies, methods or approaches for Big Data management.
- Articles which study or propose Big Data storage management systems or databases.
- Articles which propose Big Data placement methodologies or algorithms.

Exclusion criteria:

- Inaccessible articles.
- Articles in a language other than English.
- Short papers, posters or other kinds of articles with a small contribution.
- Articles which deal with the management of data in general and do not focus on Big Data.
- Articles that focus on studying or proposing conventional database management systems.
- Articles that focus on studying or proposing conventional file management systems.
- Articles that focus on supplying Big Data processing techniques or algorithms, as the focus of this article is mainly on how to manage the data and not on how to process them to achieve a certain result.
Quality assessment criteria
- The presentation of the article is clear, and no great effort is needed to comprehend it.
- Some kind of validation is offered, especially in the context of the proposal of certain algorithms, methods, systems or databases.
- The advancement over the state of the art is clarified, as are the main limitations of the proposed work.
- The objectives of the study are well covered by the approach that is employed.
Study selection procedure
Non-functional data management features
Performance
Scalability
Elasticity
Availability
Consistency
Big Data processing
Data storage systems
Database management systems
Relational data models
NewSQL
Key-value
Document
Wide-column
Graph
Time-series
Multi-model
Comparison of selected DBMSs
Qualitative criteria
Qualitative analysis
DBMS | Version | Data model | Architecture | Sharding | Elasticity | CAP | Replication | Big Data adapter | Community | Enterprise | DBaaS
---|---|---|---|---|---|---|---|---|---|---|---
MySQL | 8.0.11 | RDBMS | Single/master–slave | Manual | No | CA | Cluster | 3rd party (SQL-based) | Yes | Yes | –
PostgreSQL | 10.4 | RDBMS | Single/master–slave | Manual | No | CA | Cluster | 3rd party (SQL-based) | Yes | Yes | –
VoltDB | 8.1.2 | NewSQL | Multi-master | Hash | Yes (commercial) | CP | Cross-cluster (commercial) | 3rd party (SQL-based) | Yes | Yes | No
CockroachDB | 2.0.3 | NewSQL | Multi-master | Hash | Yes | CP | Cross-cluster (commercial) | 3rd party (SQL-based) | Yes | Yes | No
Riak | 2.2.3 | Key-value | Multi-master | Hash | Yes | AP | Cross-cluster | Native | Yes | Yes | No
Redis | 4.0 | Key-value | Multi-master | Hash | Yes | CA | Cluster | Native | Yes | Yes | –
MongoDB | 4.0.0 | Document | Multi-master | Hash/range | Yes | CP | Cross-cluster | Native | Yes | Yes | https://www.mongodb.com/cloud/atlas
Couchbase | 5.0.1 | Document | Multi-master | Hash | Yes | CP | Cross-cluster | Native | Yes | Yes | –
Cassandra | 3.11.2 | Wide-column | Multi-master | Hash/range | Yes | AP | Cross-cluster | Native | Yes | Yes (by DataStax) | –
HBase | 2.0.1 | Wide-column | Multi-master | Hash | Yes | CP | Cross-cluster | 3rd party | Yes | Yes (by Cloudera) | No
Neo4J | 3.4.1 | Graph | Master–slave | No | Yes | CA | Cross-cluster | Native | Yes | Yes | –
JanusGraph | 0.2.0 | Graph | Multi-master | Manual | Yes | AP/CP | Cluster | 3rd party | Yes | No | No
ArangoDB | 3.3.11 | Multi-model (key-value, document, graph) | Multi-master | Hash | Yes | CP | Cross-cluster | Native | Yes | Yes | No
OrientDB | 3.0.2 | Multi-model (key-value, document, graph) | Multi-master | Hash | Yes | – | Cross-cluster (commercial) | Native | Yes | Yes | No
InfluxDB | 1.5.4 | Time-series | Multi-master (commercial) | Range | Yes (commercial) | AP/CP | Cross-cluster (commercial) | 3rd party | Yes | Yes | –
Prometheus | 2.3 | Time-series | Master–slave | Manual | No | – | Cluster | 3rd party | Yes | Yes | No
Cloudification of DMS
Distributed file systems
Client–server model
Clustered-distributed model
Symmetric model
DFS | Version | Model | Architecture | Sharding | Elasticity | CAP | Replication | Big Data adapter
---|---|---|---|---|---|---|---|---
NFS | 4.2 | Client–server | Fully-centralized | Index/range | No | CA | Block-level | 3rd party
GlusterFS | 4.0 | Client–server | Fully-centralized | Automatic | Yes | CA | Node-level | Native
HDFS | 3.0.1 | Clustered-distributed | Less-centralized | Fixed size | Yes | AP | Block-level | Native
CephFS | 12.2.5 | Clustered-distributed | Less-centralized | Index/range | Yes | CP | Cluster-level | Native/3rd party
Ivy | 0.3 | Symmetric | Fully-distributed | DHash | Yes | AP | Block-level | –
PVFS | 2.0 | Symmetric | Fully-distributed | Hash | Yes | AP | – | 3rd party
DFS evaluation
Data placement techniques
Formal definition
Data placement methodologies
Data dependency methods
Task and data scheduling methods
Graph-based data placement
Comparative evaluation
Approach | Fixed DS | Constraint satisfaction | Granul. | Interm. DS | Mult. appl. | Data size | Repl. | Opt. criteria | Add. info.
---|---|---|---|---|---|---|---|---|---
BDAP [85] | Yes | Meta-heuristic | Fine | Yes | No | No | No | Comm. cost | No
Xu [92] | No | Meta-heuristic | Coarse | No | No | No | No | Data transf. number | No
Yuan [78] | Yes | Recursive binary part. | Coarse | Yes | Yes | Yes | No | Data transf. number | No
Kaya [100] | No | Hypergraph part. | Coarse | No | No | No | No | Data transf. number | No
Zhao [87] | Yes | Hierarchical part. clust. + PSO | Fine | Yes | No | No | No | Data transf. number | No
Wang [83] | No | Recursive clust. + ODPA | Fine | No | No | No | No | Data transf. number | Yes
Yu [72] | No | Hypergraph part. | Fine | No | No | No | No | Cut weight | Yes
Zhang [90] | No | Lagrangian MIP relaxation | Coarse | No | No | No | No | Data access cost | No
Hsu [91] | No | – | Fine | No | No | No | No | Profiling-related metric | Yes
LeBeane [97] | No | Hypergraph part. | Fine | No | No | No | No | Skew factor | Yes
Lan [89] | No | Clustering-based PSO search | Fine | No | No | No | No | Volatility (AMA), Hurst distance | Yes
BitDew [94] | No | – | Fine | Yes | Yes | No | Yes | Data dep. repl., fault tol. | Yes
Kayoor [99] | No | Hypergraph part. | Coarse | No | No | No | Yes | Avg. query span | Yes
Kosar [81] | Yes | – | Fine | Yes | Yes | No | Yes | – | Yes
Scalia [86] | No | Multi-dimensional Knapsack problem | Fine | No | Yes | Yes | No | Storage cost | Yes
SWORD [98] | Yes | Graph partition | Fine | No | – | – | Yes | Conflicting transactions | Yes
Lessons learned and future research directions
Data lifecycle management
Challenges and issues
Future research directions
- Use advanced modelling techniques that consider metadata schemas to set the scope for truly exploitable data modelling artefacts. This means managing the modelling task so that it covers the description of all V's (e.g., velocity, volume, value, variety, and veracity) in the characteristics of the Big Data to be processed. Proper, multi-dimensional data modelling will allow for an adequate description of the data placement problem.
- Perform optimal data placement across multiple Cloud resources based on the data modelling and on user-defined goals, requirements and constraints.
- Use efficiently distributed monitoring functionalities to observe the status of the Big Data stored or processed and to detect any migration or reconfiguration opportunities.
- Employ appropriate replication, fail-over and backup techniques, while considering and exploiting the functionalities already offered by public Cloud providers.
- According to such opportunities, continuously make reconfiguration and migration decisions by consistently weighing the real penalty of the overall application reconfiguration against the user constraints, goals and requirements that should drive the configuration of computational resources and the scheduling of application tasks (see the decision-rule sketch after this list).
- Design and implement security policies to guarantee that certain regulations (e.g., the General Data Protection Regulation) are constantly and firmly respected (e.g., data artefacts should not be stored or processed outside the European Union), while at the same time the available Cloud providers' offerings are exploited according to the data owners' privacy needs (e.g., exploit a data sanitization service when migrating or simply removing data from a certain Cloud provider); see the location-check sketch after this list.
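As a purely illustrative reading of the reconfiguration direction above, the following Python sketch encodes a minimal decision rule: migrate only when the expected saving over a planning horizon outweighs the one-off reconfiguration penalty. The cost model and all numbers are assumptions.

```python
# Minimal sketch of a migration decision rule; the linear cost model
# and the example numbers are assumptions for illustration only.
def should_migrate(current_cost: float, candidate_cost: float,
                   migration_penalty: float, horizon: float) -> bool:
    """Costs are per unit of time; the penalty is a one-off reconfiguration cost."""
    expected_saving = (current_cost - candidate_cost) * horizon
    return expected_saving > migration_penalty

# Saving 0.4 cost units/hour over a 100-hour horizon justifies a penalty of 30.
print(should_migrate(1.0, 0.6, migration_penalty=30.0, horizon=100.0))  # True
```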
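Similarly, the location constraint mentioned in the security direction can be illustrated with a simple check. The region names and the policy shape are assumptions; this is a sketch of the idea, not a GDPR-compliance implementation.

```python
# Hypothetical region whitelist; real policies would be far richer.
EU_REGIONS = {"eu-west-1", "eu-central-1", "eu-north-1"}

def placement_respects_policy(candidate_regions, must_stay_in_eu=True):
    """Reject any placement that would move EU-restricted data outside the EU."""
    if must_stay_in_eu:
        return all(r in EU_REGIONS for r in candidate_regions)
    return True

assert placement_respects_policy(["eu-west-1"])                    # allowed
assert not placement_respects_policy(["us-east-1", "eu-west-1"])   # violates policy
```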
Data storage
Challenges and issues
Future research directions
- The growing domain of hybrid transactional/analytical processing workloads needs to be considered for the existing data models. Moreover, comparable benchmarks for different data models need to be established [107], and qualitative evaluations need to be performed across all data model domains as well.
- To select an optimal combination of a distributed DBMS and Cloud resources, evaluation frameworks across different DBMS, Cloud resource and workload domains are required [108]. Such frameworks ease DBMS selection and operation for Big Data lifecycle management.
- Holistic DBMS evaluation frameworks are required to enable qualitative analysis across all non-functional features in a comparable manner. To achieve this, frameworks need to support complex DBMS adaptation scenarios, including scaling and failure injection.
- DBMS adaptation strategies need to be derived and integrated into orchestration frameworks to enable the automated operation of a distributed DBMS (i.e., to cope with workload fluctuations).
- Qualitative DBMS selection guidelines need to be extended with respect to the operational and adaptation features of current DBMSs (i.e., support for orchestration frameworks to enable automated operation and adaptation, and integration support for Big Data frameworks).
- For efficient resource sharing among multiple Cloud service providers/components, a single, unified interface must handle complex issues such as seamless workload distribution, improved data access experience and faster read-write synchronization, together with an increased level of data serialization for DFSs.
- We also advocate the use of smarter replica-assignment policies to achieve better workload balance and efficient storage space management (see the sketch after this list).
- To counter the synchronization issue in DFSs, a generic solution could be to cache the data on the client or the local server side, but such an approach can itself become a bottleneck in Big Data management scenarios. Thus, exploratory research must be done in this direction.
- As data diversity and network heterogeneity increase, an abstract communication layer must be in place to address the issue of network transparency. Such an abstraction can handle different types of communication easily and efficiently.
- Standard security mechanisms (such as ACLs) are in place for data security. However, after the Cloudification of the file system, the data become more vulnerable due to the interconnection of diverse, distributed, heterogeneous computing components. Thus, proper security measures must be built-in features of tomorrow's DFSs.
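One minimal sketch of a "smarter" replica-assignment policy in the spirit of the direction above: prefer the least-loaded nodes that still have enough free space, so that workload balance and space management are considered together. The node data structure, load scale and the small load increment are assumptions.

```python
# Illustrative replica-assignment policy; node fields and load model assumed.
def assign_replicas(nodes, block_size_gb, degree=3):
    """nodes: list of dicts with 'id', 'load' (0..1) and 'free_gb' keys."""
    eligible = [n for n in nodes if n["free_gb"] >= block_size_gb]
    eligible.sort(key=lambda n: n["load"])        # balance workload first
    chosen = eligible[:degree]
    for n in chosen:                              # account for the new block
        n["free_gb"] -= block_size_gb
        n["load"] = min(1.0, n["load"] + 0.01)
    return [n["id"] for n in chosen]

nodes = [{"id": "n1", "load": 0.7, "free_gb": 50},
         {"id": "n2", "load": 0.2, "free_gb": 80},
         {"id": "n3", "load": 0.4, "free_gb": 10},
         {"id": "n4", "load": 0.3, "free_gb": 60}]
print(assign_replicas(nodes, block_size_gb=20))  # ['n2', 'n4', 'n1']; n3 lacks space
```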
Data placement
Challenges and issues
Fixed data set size
Constraint solving
Granularity
Multiple applications
Data growth
Data replication
Optimisation criteria
Additional information
Future research directions
- Fixed data set size: To guarantee truly optimal satisfaction of the user requirements and optimisation objectives, we suggest using semi-fixed constraints in a more suitable and flexible manner, as the non-static part of the location-aware optimisation problem to be solved.
- Constraint solving: We propose the use of hybrid approaches (i.e., combining exhaustive and meta-heuristic search techniques) so as to rapidly obtain, within an acceptable and practically employable execution time, optimal or near-optimal results in a scalable fashion (see the sketch after this list). For instance, constraint programming could be combined with local search: the first could be used to find a good initial solution, while the latter could perform neighbourhood search to find a better result. In addition, a different and more scalable modelling of the optimisation problem might make it possible to run standard exhaustive solution techniques even on medium-sized problem instances. Finally, solution learning from history could be adopted to fix parts of the optimisation problem and thus substantially reduce the solution space to be examined.
- Granularity: There is a need for dynamic data placement approaches which take into account workload fluctuation and data growth, both to partition the data and to optimally place them on a set of resources whose size is dynamically identified.
- Multiple applications: To handle applications' conflicting requirements and the dynamicity of the context (e.g., changes in infrastructure or application requirements), different techniques to solve the (combined) optimisation problem are required. First, soft constraints could be used to solve this problem even if it is over-constrained (e.g., producing a solution that violates the fewest of these preferences). Second, we could prioritise the applications and/or their tasks. Third, distributed solving techniques could be used to produce application-specific optimisation problems of reduced complexity; this would require a transformation of the overall problem into sub-problems which retains, as much as possible, the main constraints and requirements of each relevant application. Finally, complementary to these distributed solving techniques, replication could also be employed: by giving each application its own copy of the originally shared data, applications become completely independent, which would then allow us to solve data placement individually for each of them.
- Data growth: There is a need for a more sophisticated approach which exploits the data (execution) history as well as data size prediction and data (type) similarity techniques to address the data growth issue (see the prediction sketch after this list). Similarity can be learned by knowing the context of the data (e.g., by assuming the same context has been employed for similar data over time by multiple users), while statistical methods can predict the data growth. Such an approach can also be used for new data sets for which no prior knowledge exists (the cold-start problem).
- Data replication: For data replication, we suggest dynamically computing the replication degree by considering the application size, data size, data access pattern, data growth rate, user requirements, and the capabilities of Cloud services (see the weighted-sum sketch after this list). Such a solution could also rely on a weight calculation method to determine the relative importance of each of these factors.
- Optimisation criteria: An interesting research direction consists in exploring ways in which data placement and task scheduling could be solved either in conjunction or in a clever but independent manner, such that they take into account the same set of (high-level) user requirements. This could lead to solutions which are in concert and also optimal with respect to both data and computation.
- Additional information: We advocate that the additional information to be collected or derived should support: (i) co-locating frequently interacting tasks and data; (ii) exploiting data dependencies for effective data partitioning; a similar approach is employed by Wang et al. [83], where data are grouped together at a finer granularity, with precautions taken not to store different data blocks from the same data set on the same node; and (iii) handling data variability, since data can come in different forms and each form might require a different machine configuration for optimal storage and processing. In this case, profiling should be extended to also capture this kind of machine performance variation, which could be quite beneficial for more data-form-focused placement. In fact, whole approaches are dedicated to dealing with particular data forms; for instance, graph-analytics-oriented data placement algorithms exploit the fact that data are stored in the form of graphs to more effectively select the right techniques and algorithms for solving the data placement problem. While special-purpose approaches might be suitable for a single data form, they are not the right choice for handling different kinds of data. As such, we believe that an important future direction is the ability to optimally handle data of multiple forms, so as to enhance the applicability of a data placement algorithm and make it suitable for different kinds of applications instead of a single one.
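To illustrate the hybrid constraint-solving direction, the sketch below pairs a greedy construction step (standing in for an exhaustive technique such as constraint programming) with a simple improving local search. The cost model is an assumption: place n data blocks on m sites so as to minimise the sum of a per-(block, site) cost matrix.

```python
import random

# Hybrid solving sketch: greedy initial solution, then local search.
def greedy_initial(cost):                      # cost[i][j]: block i on site j
    return [min(range(len(row)), key=row.__getitem__) for row in cost]

def local_search(cost, placement, iterations=1000, seed=0):
    rng = random.Random(seed)
    best = sum(cost[i][s] for i, s in enumerate(placement))
    for _ in range(iterations):
        i = rng.randrange(len(placement))      # pick a random block
        j = rng.randrange(len(cost[i]))        # and a random candidate site
        delta = cost[i][j] - cost[i][placement[i]]
        if delta < 0:                          # accept only improving moves
            placement[i], best = j, best + delta
    return placement, best

cost = [[4, 1, 3], [2, 5, 2], [3, 3, 1]]       # assumed toy instance
placement, total = local_search(cost, greedy_initial(cost))
print(placement, total)                        # [1, 0, 2] with total cost 4
```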
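For the data-growth direction, a statistical prediction can be as simple as fitting a linear trend to a data set's size history and extrapolating; more elaborate models would follow the same pattern. The history values below are assumptions.

```python
# Least-squares linear trend over equally spaced size observations.
def predict_size(history, steps_ahead=1):
    """history: data set sizes (e.g. in GB) at equally spaced points in time."""
    n = len(history)
    xs = range(n)
    mean_x, mean_y = (n - 1) / 2, sum(history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var                  # growth rate per time step
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)

print(predict_size([100, 112, 125, 138], steps_ahead=2))  # ~163.2 GB
```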
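Finally, the suggested weight-based computation of the replication degree can be sketched as a weighted sum over normalised factor scores mapped onto a replica range. The factor names, weights and the 0..1 scoring scale are assumptions for illustration.

```python
# Weighted-sum replication degree; factors and weights are hypothetical.
def replication_degree(factors, weights, min_replicas=1, max_replicas=5):
    """factors/weights: dicts keyed by factor name; higher score => more replicas."""
    score = sum(weights[k] * factors[k] for k in weights)   # weighted sum in 0..1
    span = max_replicas - min_replicas
    return min_replicas + round(score * span)

factors = {"access_rate": 0.8, "growth_rate": 0.3,
           "data_size": 0.5, "user_priority": 0.9}
weights = {"access_rate": 0.4, "growth_rate": 0.1,
           "data_size": 0.2, "user_priority": 0.3}
print(replication_degree(factors, weights))  # 1 + round(0.72 * 4) = 4 replicas
```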