7.1 Introduction
7.2 Key Insights for Big Data Storage
- Potential to Transform Society and Businesses across Sectors: Big data storage technologies are a key enabler for advanced analytics that have the potential to transform society and the way key business decisions are made. This is of particular importance in traditionally non-IT-based sectors such as energy. While these sectors face non-technical issues such as the lack of skilled big data experts and regulatory barriers, novel data storage technologies have the potential to enable new value-generating analytics in and across various industrial sectors.
- Lack of Standards Is a Major Barrier: The history of NoSQL is rooted in solving specific technological challenges, which led to a range of different storage technologies. This large range of choices, coupled with the lack of standards for querying the data, makes it harder to exchange one data store for another, as application-specific code may become tied to a particular storage solution.
- Open Scalability Challenges in Graph-Based Data Stores: Processing data based on graph data structures is beneficial in an increasing number of applications. It allows better capture of semantics and complex relationships with other pieces of information coming from a large variety of different data sources, and has the potential to improve the overall value that can be generated by analysing the data. While graph databases are increasingly used for this purpose, it remains hard to efficiently distribute graph-based data structures across computing nodes.
- Privacy and Security Are Lagging Behind: Although there are several projects and solutions that address privacy and security, the protection of individuals and the securing of their data lag behind the technological advances of data storage systems. Considerable research is required to better understand how data can be misused, how it needs to be protected, and how such protection can be integrated into big data storage solutions.
7.3 Social and Economic Impact of Big Data Storage
7.4 Big Data Storage State-of-the-Art
7.4.1 Data Storage Technologies
- Distributed File Systems: File systems such as the Hadoop Distributed File System (HDFS) (Shvachko et al. 2010) offer the capability to store large amounts of unstructured data reliably on commodity hardware. Although there are file systems with better performance, HDFS is an integral part of the Hadoop framework (White 2012) and has already reached the level of a de facto standard. It has been designed for large data files and is well suited for quickly ingesting data and bulk processing.
- NoSQL Databases: Probably the most important family of big data storage technologies is that of NoSQL database management systems. NoSQL databases use data models from outside the relational world that do not necessarily adhere to the transactional properties of atomicity, consistency, isolation, and durability (ACID).
- NewSQL Databases: A modern form of relational databases that aim for scalability comparable to that of NoSQL databases while maintaining the transactional guarantees made by traditional database systems.
- Big Data Querying Platforms: Technologies that provide query facades in front of big data stores such as distributed file systems or NoSQL databases. The main concerns are providing a high-level interface, e.g. via SQL-like query languages, and achieving low query latencies.
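The block-and-replica design behind distributed file systems such as HDFS can be illustrated with a short sketch. The code below is not the real HDFS API; the function names, the toy block size, and the round-robin placement policy are illustrative assumptions. Conceptually, HDFS cuts a large file into fixed-size blocks (128 MB by default) and stores each block on several data nodes (replication factor 3 by default).

```python
# Illustrative sketch (not the real HDFS API): splitting a file's bytes into
# fixed-size blocks and assigning each block to several data nodes, mirroring
# how HDFS distributes replicated blocks across commodity hardware.
from itertools import cycle

BLOCK_SIZE = 8    # toy value; HDFS defaults to 128 MB
REPLICATION = 3   # HDFS default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the byte stream into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    node_ring = cycle(range(len(nodes)))
    for idx, _ in enumerate(blocks):
        start = next(node_ring)
        placement[idx] = [nodes[(start + r) % len(nodes)]
                          for r in range(replication)]
    return placement

data = b"a large unstructured data file"
blocks = split_into_blocks(data)
placement = place_blocks(blocks, ["node1", "node2", "node3", "node4"])
```

Losing any single node leaves at least two replicas of every block intact, which is the property that makes reliable storage on unreliable commodity hardware possible.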
7.4.1.1 NoSQL Databases
- Key-Value Stores: Key-value stores allow storage of data in a schema-less way. Data objects can be completely unstructured or structured, and are accessed by a single key. As no schema is used, data objects need not even share the same structure.
- Columnar Stores: According to Wikipedia, "A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns of data rather than as rows of data, like most relational DBMSs" (Wikipedia 2013). Such databases are typically sparse, distributed, and persistent multi-dimensional sorted maps in which data is indexed by a triple of row key, column key, and timestamp. The value is represented as an uninterpreted string data type. Data is accessed by column families, i.e. a set of related column keys that effectively compress the sparse data in the columns. Column families must be created before data can be stored, and their number is expected to be small; the number of columns, in contrast, is unlimited. In principle, columnar stores are less suitable when all columns need to be accessed. In practice, however, this is rarely the case, leading to superior performance of columnar stores.
- Document Databases: In contrast to the values in a key-value store, documents are structured. However, there is no requirement for a common schema that all documents must adhere to, as is the case for records in relational databases. Document databases are therefore said to store semi-structured data. Similar to key-value stores, documents can be retrieved using a unique key. In addition, it is possible to access documents by querying their internal structure, such as requesting all documents that contain a field with a specified value. The capability of the query interface typically depends on the encoding format used by the database. Common encodings include XML and JSON.
- Graph Databases: Graph databases, such as Neo4J (2015), store data in graph structures, making them suitable for storing highly associative data such as social network graphs. A particular flavour of graph databases are triple stores, such as AllegroGraph (Franz 2015) and Virtuoso (Erling 2009), which are specifically designed to store RDF triples. However, existing triple store technologies are not yet suitable for storing truly large datasets efficiently.
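The key-value model described in the first bullet can be sketched in a few lines. This is a minimal in-memory illustration, not any particular product's API; the point is that the store treats values as opaque, so objects with entirely different structures coexist under single keys.

```python
# Minimal sketch of the key-value model: values are opaque to the store,
# so structured and unstructured objects live side by side, each reachable
# only via its single key.
store = {}

def put(key, value):
    store[key] = value      # no schema is enforced on `value`

def get(key):
    return store.get(key)   # lookup happens by the single key only

put("user:1", {"name": "Ada", "roles": ["admin"]})  # structured value
put("blob:7", b"\x00\x01raw bytes")                 # unstructured value
```

Because the store never inspects the value, queries over value contents are impossible at this layer; any such logic must live in the application.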
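The columnar (Bigtable-style) data model in the second bullet can likewise be sketched as a sparse map keyed by (row key, column key, timestamp). The function names and the toy column family `cf` are assumptions for illustration; real systems add sorting, compression, and distribution on top of this model.

```python
# Sketch of the columnar data model described above: a sparse map indexed by
# (row key, column key, timestamp) whose values are uninterpreted strings,
# with column keys grouped into column families declared up front.
table = {}          # (row, "family:qualifier", timestamp) -> value
FAMILIES = {"cf"}   # column families must exist before data is stored

def put(row, column, timestamp, value):
    family = column.split(":", 1)[0]
    if family not in FAMILIES:
        raise KeyError(f"unknown column family {family!r}")
    table[(row, column, timestamp)] = value

def read_family(row, family):
    """Return the latest value per column key in one family for a row."""
    cells = {}
    for (r, col, ts), val in table.items():
        if r == row and col.startswith(family + ":"):
            if col not in cells or ts > cells[col][0]:
                cells[col] = (ts, val)
    return {col: val for col, (ts, val) in cells.items()}

put("com.example", "cf:html", 1, "<html>v1</html>")
put("com.example", "cf:html", 2, "<html>v2</html>")
```

Reading by column family touches only the columns in that family, which is why access patterns that need every column fit this model poorly.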
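The document model in the third bullet, including a query over the documents' internal structure, can be sketched as follows. The `insert` and `find` helpers are hypothetical names chosen for illustration; JSON is used as the encoding, as mentioned above.

```python
# Sketch of a document store: JSON-encoded documents stored under a unique
# id, plus a query over their internal structure (all documents whose given
# field equals a given value). Documents need not share a schema.
import json

docs = {}  # id -> JSON-encoded document

def insert(doc_id, document):
    docs[doc_id] = json.dumps(document)

def find(field, value):
    """Return all documents whose `field` equals `value`."""
    return [d for d in map(json.loads, docs.values()) if d.get(field) == value]

insert("d1", {"type": "report", "year": 2014})
insert("d2", {"type": "invoice", "year": 2014, "total": 99.5})
```

Unlike the key-value sketch, the store understands the encoding, which is exactly what enables field-level queries.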
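Finally, the triple store variant in the last bullet reduces to storing (subject, predicate, object) facts and matching patterns against them. This is a toy in-memory sketch, not the API of AllegroGraph or Virtuoso; `None` is used as a wildcard by assumption.

```python
# Sketch of an RDF-style triple store: facts as (subject, predicate, object)
# tuples, queried by pattern matching where None matches anything.
triples = set()

def add(s, p, o):
    triples.add((s, p, o))

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

add("alice", "knows", "bob")
add("bob", "knows", "carol")
add("alice", "worksAt", "acme")
```

Pattern matching like `match(p="knows")` is the building block of graph traversals; distributing such traversals efficiently across nodes is the open scalability challenge noted in Sect. 7.2.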
7.4.1.2 NewSQL Databases
- SQL is the primary mechanism for application interaction
- ACID support for transactions
- A non-locking concurrency control mechanism
- An architecture providing much higher per-node performance
- A scale-out, shared-nothing architecture, capable of running on a large number of nodes without suffering bottlenecks
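The first two characteristics above, a SQL interface and ACID transactions, can be illustrated with Python's built-in sqlite3 module. SQLite is a single-node engine, not a NewSQL system, but the guarantee shown here, that an aborted transaction leaves no partial writes behind, is precisely the one NewSQL databases aim to preserve at scale.

```python
# Illustration of ACID atomicity via SQL: a transfer that fails mid-way is
# rolled back automatically, leaving no partial debit. Uses Python's
# built-in sqlite3 as a stand-in single-node relational engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # atomic transaction: all statements commit or none do
        conn.execute("UPDATE accounts SET balance = balance - 100 "
                     "WHERE name = 'a'")
        raise RuntimeError("failure mid-transfer")  # simulate a crash
except RuntimeError:
    pass  # the debit above was rolled back, not partially applied

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, both balances are unchanged; a system without transactional guarantees could have left the debited amount missing from both accounts.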
7.4.1.3 Big Data Query Platforms
7.4.1.4 Cloud Storage
- As cloud storage is a service, applications using this storage have less control and may experience decreased performance due to network overhead. These performance differences need to be taken into account during the design and implementation stages.
- Security is one of the main concerns related to public clouds. Accordingly, the Amazon CTO predicts that within five years all data in the cloud will be encrypted by default (Vogels 2013).
- Feature-rich clouds like AWS support calibration of latency, redundancy, and throughput levels for data access, thus allowing users to find the right trade-off between cost and quality.
7.4.2 Privacy and Security
7.4.2.1 Security Best Practices for Non-relational Data Stores
7.4.2.2 Secure Data Storage and Transaction Logs
7.4.2.3 Cryptographically Enforced Access Control and Secure Communication
7.4.2.4 Security and Privacy Challenges for Granular Access Control
7.4.2.5 Data Provenance
7.4.2.6 Privacy Challenges in Big Data Storage
7.5 Future Requirements and Emerging Paradigms for Big Data Storage
7.5.1 Future Requirements for Big Data Storage
7.5.1.1 Standardized Query Interfaces
7.5.1.2 Security and Privacy
7.5.1.3 Semantic Data Models
7.5.2 Emerging Paradigms for Big Data Storage
7.5.2.1 Increased Use of NoSQL Databases
7.5.2.2 In-Memory and Column-Oriented Designs
7.5.2.3 Convergence with Analytics Frameworks
7.5.2.4 The Data Hub
7.6 Sector Case Studies for Big Data Storage
Case study | Sector | Volume | Storage technologies | Key requirements
---|---|---|---|---
Treato: Social media based medication intelligence | Health | >150 TB | HBase | Cost-efficiency, scalability limitations of relational DBs
Centralized data hub | Finance | Between several petabytes and over 150 PB | Hadoop/HDFS | Building more accurate models, scale of data, suitability for unstructured data
Smart grid | Energy | Tens of TB per day | Hadoop | Data volume, operational challenges
7.6.1 Health Sector: Social Media-Based Medication Intelligence
7.6.2 Finance Sector: Centralized Data Hub
7.6.3 Energy: Device Level Metering
Sampling rate | 1 Hz
Record size | 50 bytes
Raw data per day and household | 4.1 MB
Raw data per day for 10 million customers | ~39 TB
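The sizing figures in the table above follow directly from the sampling rate and record size; the short calculation below reproduces them (using binary MiB/TiB units, which match the rounded table values).

```python
# Reproducing the smart-meter sizing figures: one 50-byte record per second
# per household, scaled to 10 million customers.
SAMPLING_RATE_HZ = 1
RECORD_BYTES = 50
SECONDS_PER_DAY = 24 * 60 * 60
CUSTOMERS = 10_000_000

# 1 record/s * 50 B * 86,400 s ~= 4.1 MiB per household per day
per_household_mib = SAMPLING_RATE_HZ * RECORD_BYTES * SECONDS_PER_DAY / 2**20

# ~4.1 MiB * 10 million households ~= 39 TiB per day in total
total_tib = per_household_mib * CUSTOMERS / 2**20

print(f"{per_household_mib:.1f} MiB per household per day")
print(f"{total_tib:.0f} TiB per day for 10 million customers")
```

At roughly 39 TiB of raw data per day, even a month of retention reaches petabyte scale, which is why device-level metering pushes utilities toward the distributed storage technologies discussed in this chapter.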