Computer Science Review, Volume 17, August 2015, Pages 70-81

Survey
Understandable Big Data: A survey

https://doi.org/10.1016/j.cosrev.2015.05.002

Abstract

This survey presents the concept of Big Data. Firstly, a definition and the features of Big Data are given. Secondly, the different steps of Big Data processing and the main problems encountered in Big Data management are described. Next, a general overview of an architecture for handling it is depicted. Then, the problem of integrating a Big Data architecture into an existing information system is discussed. Finally, this survey tackles semantics (reasoning, coreference resolution, entity linking, information extraction, consolidation, paraphrase resolution, ontology alignment) in the Big Data context.

Introduction

Today, people and systems overload the web with an exponentially growing amount of data. The amount of data on the web is measured in exabytes (10^18 bytes) and zettabytes (10^21 bytes). By 2025, the forecast is that the Internet will exceed the brain capacity of everyone living in the whole world [1]. This fast growth of data is due to advances in digital sensors, communications, computation, and storage that have created huge collections of data. The term Big Data was coined by Roger Magoulas (according to [2]) to describe this phenomenon.

Seven recent papers (including [3] and [4]) have aimed to extract Big Data trends, challenges and opportunities. [5] provide a survey on scalable database management for update-heavy applications, analytics and decision support. Likewise, [6] study analytics in Big Data with a focus on the data warehouse. These two papers have goals different from those of [7]. In a more rigorous way, M. Pospiech and C. Felden [7] have selected relevant and recent papers that tackle different aspects of Big Data and have clustered them into four domains: technical data provisioning (acquisition, storage, processing), technical data utilization (computation and time complexity), functional data provisioning (information life cycle management, lean information management, value-oriented information management, etc.) and functional data utilization (realms where Big Data is used). At the end of their clustering, [7] note that most papers (87%) are technical and that no paper covers functional data provisioning. Closer to our target, semantics in the age of Big Data, [8] focus on knowledge discovery and management in the Big Data era (the flooding of data onto the web). Like our paper, they zoom in on gathering relational facts, information extraction, the emergence of structure, etc. However, a deep delineation of the concept of Big Data is outside the scope of their article, as are some other key themes of this paper, such as reasoning over large sets of uncertain OWL triples, coreference resolution and ontology alignment. The last paper, authored by [9], presents Big Data integration in an easily understandable way: schema alignment, record linkage and data fusion are presented with respect to the Big Data characteristics (volume, velocity and variety).

Knowing the high value carried by data in general, and thus by Big Data, it is not surprising that Chief Information Officers (CIOs) are interested in Big Data analytics as a technological opportunity. If initially web pages and traditional databases were the raw materials for search engine companies and other businesses respectively, they are now mixed with large sets of miscellaneous, heterogeneous and unstructured data. This implies that tools and techniques have to be designed to disambiguate these data before putting them together, so that organizations can master and manage their data. Our work is similar to [9] in its approach: we discuss the challenges and opportunities of semantics in the age of Big Data and present the supply chain to handle it. Therefore, this article defines Big Data (Section 2), briefly discusses its management (Section 3) and finally tackles Big Data and semantics challenges and opportunities (Section 4).
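To make the record linkage step mentioned above concrete, the following minimal sketch matches records from two heterogeneous sources by first blocking on a coarse key and then scoring candidate pairs with a string similarity; the sample records, field names and the 0.6 threshold are illustrative assumptions and not material from [9].

    # Minimal record-linkage sketch (Python): blocking + pairwise similarity.
    from difflib import SequenceMatcher

    # Two hypothetical heterogeneous sources describing organizations.
    source_a = [{"id": "a1", "name": "ACME Corp.", "city": "Dijon"},
                {"id": "a2", "name": "Globex", "city": "Paris"}]
    source_b = [{"id": "b7", "name": "Acme Corporation", "city": "Dijon"},
                {"id": "b9", "name": "Initech", "city": "Lyon"}]

    def block_key(record):
        """Blocking: only records sharing this coarse key are compared."""
        return record["city"].lower()

    def similarity(r1, r2):
        """Pairwise comparison: normalized string similarity of the names."""
        return SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()

    THRESHOLD = 0.6  # illustrative cut-off, to be tuned on real data

    matches = [(ra["id"], rb["id"], round(similarity(ra, rb), 2))
               for ra in source_a for rb in source_b
               if block_key(ra) == block_key(rb) and similarity(ra, rb) >= THRESHOLD]

    print(matches)  # [('a1', 'b7', 0.69)]

On these toy records only the two descriptions of the same company in the same city are linked; at Big Data scale the same logic would be distributed (for instance with the blocking key used as a shuffle key) and the naive string similarity replaced by more robust matchers.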

Section snippets

What is big data?

Manyika et al. [10, page 1] define Big Data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze”. Likewise, Davis and Patterson [1, page 4] say “Big data is data too big to be handled and analyzed by traditional database protocols such as SQL”; and the same opinion is shared by [11], [3], [4], etc. Both groups of authors previously mentioned go beyond only the size aspect of data when defining Big Data! Edd Dumbill in 

Big data management

Basically, data processing is seen as the gathering, processing and management of data to produce “new” information for end users [3]. Over time, the key challenges have been related to the storage, transportation and processing of high-throughput data. Big Data challenges differ in that ambiguity, uncertainty and variety must be added to this list [3]. Consequently, these requirements imply an additional step in which data are cleaned, tagged, classified and formatted [3], [14]. Karmasphere
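To illustrate this additional preparation step, the sketch below expresses a cleaning, tagging and aggregation pass in MapReduce style; it runs as plain in-memory Python rather than on an actual Hadoop cluster, and the record layout, tagging rule and aggregate are illustrative assumptions.

    from itertools import groupby
    from operator import itemgetter

    # Hypothetical raw sensor lines ("date;sensor;value") with noisy spacing
    # and one unusable value that the cleaning step must discard.
    raw_records = [
        "2015-03-01;  temperature ; 21.5",
        "2015-03-01;temperature;21.7",
        "2015-03-01;humidity; n/a",
        "2015-03-02;temperature;22.0",
    ]

    def map_clean_tag(line):
        """Map: normalize a raw line and emit ((date, sensor), value) pairs."""
        try:
            date, sensor, value = (field.strip() for field in line.split(";"))
            return [((date, sensor), float(value))]
        except ValueError:
            return []  # cleaning: discard malformed lines or unparsable values

    def reduce_average(key, values):
        """Reduce: aggregate all values sharing the same (date, sensor) tag."""
        values = list(values)
        return key, round(sum(values) / len(values), 2)

    # Local stand-in for the shuffle/sort a MapReduce runtime performs between phases.
    mapped = sorted(pair for line in raw_records for pair in map_clean_tag(line))
    results = [reduce_average(key, (v for _, v in group))
               for key, group in groupby(mapped, key=itemgetter(0))]

    print(results)
    # [(('2015-03-01', 'temperature'), 21.6), (('2015-03-02', 'temperature'), 22.0)]

In a real deployment the map and reduce functions would be handed to a framework such as Hadoop MapReduce, which distributes across the cluster the shuffle step emulated here by a local sort.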

Big data quality, the next semantic challenge

A question that knowledge management experts ask themselves is whether Big Data can leverage semantics. The answer to this question is obviously “yes”. Companies and governments are interested in two types of data in a Big Data context. First, they consider data generated by humans, mainly those disseminated through web tools (social networks, cookies, emails...). Second, they want to merge data generated by connected objects. The Internet of human beings and the internet of
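As a small illustration of the semantic techniques listed in the abstract, the sketch below links an entity mention found in human-generated text to a toy knowledge base so that it can later be consolidated with machine-generated records; the knowledge base, the mention and the disambiguation rule are hypothetical and far simpler than the approaches discussed in this survey.

    # Toy knowledge base: each entry has a canonical label, aliases and a coarse type.
    knowledge_base = {
        "E1": {"label": "Paris", "aliases": {"paris", "ville lumière"}, "type": "city"},
        "E2": {"label": "Paris Hilton", "aliases": {"paris hilton"}, "type": "person"},
    }

    def link_mention(mention, context_words):
        """Link a textual mention to a knowledge-base entity, using the words
        around the mention to disambiguate between candidate entities."""
        mention = mention.lower()
        candidates = [(eid, entity) for eid, entity in knowledge_base.items()
                      if any(mention in alias or alias in mention
                             for alias in entity["aliases"])]
        if not candidates:
            return None
        # Naive disambiguation: prefer the candidate whose type occurs in the context.
        for eid, entity in candidates:
            if entity["type"] in context_words:
                return eid
        return candidates[0][0]  # otherwise fall back to the first candidate

    post = "loved the city of paris last summer"
    print(link_mention("paris", set(post.split())))  # -> 'E1' (the city, not the person)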

Ethics and privacy

Ethics and privacy have always been a main concern in data management. They are now of even greater interest with Big Data. This is due to the multi-dimensionality of Big Data:

  • Due to the huge volume of data, more pieces of valuable information can be identified or inferred than was possible before.

  • The high velocity of data makes real-time analysis feasible and thus enables a continuous refining of users’ profiles.

  • The variety of data sources makes users traceable. In addition, the diversity of data types

Conclusion

We are living in the era of the data deluge. The term Big Data was coined to describe this age. This paper defines and characterizes the concept of Big Data, giving a definition of this new concept and its characteristics. In addition, a supply chain and technologies for Big Data management are presented. During that management, many problems can be encountered, especially during semantic gathering. Thus it tackles semantics (reasoning, coreference resolution, entity linking, information

References (91)

  • A. Reeve

    Managing Data in Motion: Data Integration Best Practice Techniques and Technologies

    (2013)
  • D. Agrawal et al.

    Big data and cloud computing: current state and future opportunities

  • A. Cuzzocrea et al.

    Analytics over large-scale multidimensional data: the big data revolution!

  • M. Pospiech et al.

    Big data—a state-of-the-art

  • F. Suchanek et al.

    Knowledge harvesting in the big-data era

  • X. Dong et al.

    Big data integration

  • J. Manyika et al.

    Big Data: The Next Frontier for Innovation, Competition, and Productivity

    (2011)
  • P. Zikopoulos et al.

    Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data

    (2011)
  • O’Reilly Media, Inc.

    Big Data Now

    (2014)
  • P. Hitzler et al.

    Linked data, big data, and the 4th paradigm

    Semant. Web

    (2013)
  • H.V. Jagadish et al.

    Challenges and Opportunities with Big Data

    (2015)
  • T. White

    Hadoop: The Definitive Guide

    (2009)
  • D. Borthakur

    The Hadoop distributed file system: architecture and design

    The Apache Software Foundation.

    (2007)
  • K. Shvachko et al.

    The Hadoop distributed file system

  • G. Turkington

    Hadoop Beginners Guide

    (2013)
  • J. Dean et al.

    MapReduce: simplified data processing on large clusters

    Commun. ACM

    (2008)
  • C. Ranger et al.

    Evaluating MapReduce for multi-core and multiprocessor systems

  • R. Mutharaju et al.

    A MapReduce algorithm for EL+

  • T. Kaldewey et al.

    Clydesdale: structured data processing on MapReduce

  • Y. He et al.

    RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems

  • A. Thusoo et al.

    Hive - a petabyte scale data warehouse using Hadoop

  • J. Boulon et al.

    Chukwa, a large-scale monitoring system

    Cloud Comput. Appl.

    (2008)
  • C. Wang et al.

    Faster, larger, easier: reining real-time big data processing in cloud

  • M. Schonlau

    The clustergram: A graph for visualizing hierarchical and nonhierarchical cluster analyses

    Stata J.

    (2002)
  • F.B. Viégas et al.

    Studying cooperation and conflict between authors with history flow visualizations

  • D. Keim et al.

    Big-data visualization

    IEEE Comput. Graph. Appl.

    (2013)
  • P. Russom et al. Big data analytics, TDWI Best Practices Report, Fourth...
  • D. Maltby

    Big data analytics

  • A. Hoppe et al.

    Automatic user profile mapping to marketing segments in a big data context

  • R. Peixoto et al.

    Semantic HMC for business intelligence using cross-referencing

  • S.H. Sengamedu

    Scalable analytics–algorithms and systems

  • C. Bizer et al.

    The meaningful use of big data: four perspectives–four challenges

    SIGMOD Rec.

    (2012)
  • S. Auer et al.

    Creating knowledge out of interlinked data

    Semant. Web

    (2010)
  • M. Laclavík et al.

    Towards large scale semantic annotation built on MapReduce architecture

  • N. Nakashole et al.

    Scalable knowledge harvesting with high precision and high recall
