Computer Science Review, Volume 17, August 2015, Pages 70-81

Survey
Understandable Big Data: A survey

https://doi.org/10.1016/j.cosrev.2015.05.002

Abstract

This survey presents the concept of Big Data. Firstly, a definition and the features of Big Data are given. Secondly, the different steps of Big Data processing and the main problems encountered in Big Data management are described. Next, a general overview of an architecture for handling it is depicted. Then, the problem of integrating a Big Data architecture into an existing information system is discussed. Finally, this survey tackles semantics (reasoning, coreference resolution, entity linking, information extraction, consolidation, paraphrase resolution, ontology alignment) in the Big Data context.

Introduction

Today, people and systems overload the web with an exponentially growing amount of data. The amount of data on the web is measured in exabytes (10^18 bytes) and zettabytes (10^21 bytes). By 2025, the forecast is that the Internet will exceed the brain capacity of everyone living in the whole world [1]. This fast growth of data is due to advances in digital sensors, communications, computation, and storage that have created huge collections of data. The term Big Data was coined by Roger Magoulas (according to [2]) to describe this phenomenon.

Seven recent papers (including [3] and [4]) have aimed to extract Big Data trends, challenges and opportunities. [5] provide a survey on scalable database management for update-heavy applications, analytics and decision support. Likewise, [6] study analytics in Big Data with a focus on the data warehouse. These two papers have goals different from those of [7]. In a more rigorous way, M. Pospiech and C. Felden [7] have selected relevant and recent papers that tackle different aspects of Big Data and have clustered them into four domains: technical data provisioning (acquisition, storage, processing), technical data utilization (computation and time complexity), functional data provisioning (information life cycle management, lean information management, value-oriented information management, etc.) and functional data utilization (realms where Big Data is used). At the end of their clustering, [7] note that most papers (87%) are technical and that no paper covers functional data provisioning. Closer to our target, semantics in the age of Big Data, [8] focus on knowledge discovery and management in the Big Data era (the flooding of data onto the web). Like our paper, they zoom in on gathering relational facts, information extraction, the emergence of structure, etc. However, a deep delineation of the concept of Big Data is outside the scope of their article, as are some other key themes of this paper, such as reasoning over large sets of uncertain OWL triples, coreference resolution and ontology alignment. The last paper, authored by [9], presents Big Data integration in an easily understandable way: schema alignment, record linkage and data fusion are presented with respect to the Big Data characteristics (volume, velocity and variety).

Knowing the high value carried by data in general, and thus by Big Data, it is not surprising that Chief Information Officers (CIOs) are interested in Big Data analytics as a technological opportunity. If initially web pages and traditional databases were the raw materials for search engine companies and other businesses respectively, they are now mixed with large sets of miscellaneous, heterogeneous and unstructured data. This implies that tools and techniques have to be designed to disambiguate these data before putting them together, so that organizations can master and manage their data. Our work is similar to [9] in its approach: we discuss the challenges and opportunities of semantics in the age of Big Data and present the supply chain to handle it. Therefore, this article defines Big Data (Section 2), briefly discusses its management (Section 3) and finally tackles Big Data and semantics challenges and opportunities (Section 4).
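To make the record linkage step mentioned above concrete, the following minimal sketch matches records from two heterogeneous sources by first blocking on a coarse key and then scoring candidate pairs with a string similarity; the sample records, field names and the 0.6 threshold are illustrative assumptions and not material from [9].

    # Minimal record-linkage sketch (Python): blocking + pairwise similarity.
    from difflib import SequenceMatcher

    # Two hypothetical heterogeneous sources describing organizations.
    source_a = [{"id": "a1", "name": "ACME Corp.", "city": "Dijon"},
                {"id": "a2", "name": "Globex", "city": "Paris"}]
    source_b = [{"id": "b7", "name": "Acme Corporation", "city": "Dijon"},
                {"id": "b9", "name": "Initech", "city": "Lyon"}]

    def block_key(record):
        """Blocking: only records sharing this coarse key are compared."""
        return record["city"].lower()

    def similarity(r1, r2):
        """Pairwise comparison: normalized string similarity of the names."""
        return SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()

    THRESHOLD = 0.6  # illustrative cut-off, to be tuned on real data

    matches = [(ra["id"], rb["id"], round(similarity(ra, rb), 2))
               for ra in source_a for rb in source_b
               if block_key(ra) == block_key(rb) and similarity(ra, rb) >= THRESHOLD]

    print(matches)  # [('a1', 'b7', 0.69)]

On these toy records only the two descriptions of the same company in the same city are linked; at Big Data scale the same logic would be distributed (for instance with the blocking key used as a shuffle key) and the naive string similarity replaced by more robust matchers.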

Section snippets

What is big data?

Manyika et al. [10, page 1] define Big Data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze”. Likewise, Davis and Patterson [1, page 4] say “Big data is data too big to be handled and analyzed by traditional database protocols such as SQL”; and the same opinion is shared by [11], [3], [4], etc. Both groups of authors previously mentioned go beyond only the size aspect of data when defining Big Data! Edd Dumbill in 

Big data management

Basically, data processing is seen as the gathering, processing and management of data to produce “new” information for end users [3]. Over time, the key challenges have been related to the storage, transportation and processing of high-throughput data. Big Data challenges differ in that ambiguity, uncertainty and variety must be added to this list [3]. Consequently, these requirements imply an additional step in which data are cleaned, tagged, classified and formatted [3], [14]. Karmasphere
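To illustrate this additional preparation step, the sketch below expresses a cleaning, tagging and aggregation pass in MapReduce style; it runs as plain in-memory Python rather than on an actual Hadoop cluster, and the record layout, tagging rule and aggregate are illustrative assumptions.

    from itertools import groupby
    from operator import itemgetter

    # Hypothetical raw sensor lines ("date;sensor;value") with noisy spacing
    # and one unusable value that the cleaning step must discard.
    raw_records = [
        "2015-03-01;  temperature ; 21.5",
        "2015-03-01;temperature;21.7",
        "2015-03-01;humidity; n/a",
        "2015-03-02;temperature;22.0",
    ]

    def map_clean_tag(line):
        """Map: normalize a raw line and emit ((date, sensor), value) pairs."""
        try:
            date, sensor, value = (field.strip() for field in line.split(";"))
            return [((date, sensor), float(value))]
        except ValueError:
            return []  # cleaning: discard malformed lines or unparsable values

    def reduce_average(key, values):
        """Reduce: aggregate all values sharing the same (date, sensor) tag."""
        values = list(values)
        return key, round(sum(values) / len(values), 2)

    # Local stand-in for the shuffle/sort a MapReduce runtime performs between phases.
    mapped = sorted(pair for line in raw_records for pair in map_clean_tag(line))
    results = [reduce_average(key, (v for _, v in group))
               for key, group in groupby(mapped, key=itemgetter(0))]

    print(results)
    # [(('2015-03-01', 'temperature'), 21.6), (('2015-03-02', 'temperature'), 22.0)]

In a real deployment the map and reduce functions would be handed to a framework such as Hadoop MapReduce, which distributes across the cluster the shuffle step emulated here by a local sort.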

Big data quality, the next semantic challenge

A question that knowledge management experts ask themselves is whether Big Data can leverage semantics. The answer to this question is obviously “yes”. Companies and governments are interested in two types of data in a Big Data context. First, they consider data generated by humans, mainly those disseminated through web tools (social networks, cookies, emails...). Second, they want to merge data generated by connected objects. The Internet of human beings and the internet of
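As a small illustration of the semantic techniques listed in the abstract, the sketch below links an entity mention found in human-generated text to a toy knowledge base so that it can later be consolidated with machine-generated records; the knowledge base, the mention and the disambiguation rule are hypothetical and far simpler than the approaches discussed in this survey.

    # Toy knowledge base: each entry has a canonical label, aliases and a coarse type.
    knowledge_base = {
        "E1": {"label": "Paris", "aliases": {"paris", "ville lumière"}, "type": "city"},
        "E2": {"label": "Paris Hilton", "aliases": {"paris hilton"}, "type": "person"},
    }

    def link_mention(mention, context_words):
        """Link a textual mention to a knowledge-base entity, using the words
        around the mention to disambiguate between candidate entities."""
        mention = mention.lower()
        candidates = [(eid, entity) for eid, entity in knowledge_base.items()
                      if any(mention in alias or alias in mention
                             for alias in entity["aliases"])]
        if not candidates:
            return None
        # Naive disambiguation: prefer the candidate whose type occurs in the context.
        for eid, entity in candidates:
            if entity["type"] in context_words:
                return eid
        return candidates[0][0]  # otherwise fall back to the first candidate

    post = "loved the city of paris last summer"
    print(link_mention("paris", set(post.split())))  # -> 'E1' (the city, not the person)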

Ethics and privacy

Ethics and privacy have always been a main concern in data management. They are now of even greater interest with Big Data. This is due to the multi-dimensionality of Big Data:

  • Due to the huge volume of data, more pieces of valuable information can be identified or inferred than was possible before.

  • The high velocity of data makes real-time analysis feasible and thus enables a continuous refining of users’ profiles.

  • The variety of data sources makes users traceable. In addition, the diversity of data types

Conclusion

We are living in the era of the data deluge. The term Big Data was coined to describe this age. This paper defines and characterizes the concept of Big Data, giving a definition of this new concept and its characteristics. In addition, a supply chain and technologies for Big Data management are presented. During that management, many problems can be encountered, especially during semantic gathering. Thus it tackles semantics (reasoning, coreference resolution, entity linking, information

References (91)

  • A. Reeve

    Managing Data in Motion: Data Integration Best Practice Techniques and Technologies

    (2013)
  • D. Agrawal et al.

    Big data and cloud computing: current state and future opportunities

  • A. Cuzzocrea et al.

    Analytics over large-scale multidimensional data: the big data revolution!

  • M. Pospiech et al.

    Big data—a state-of-the-art

  • F. Suchanek et al.

    Knowledge harvesting in the big-data era

  • X. Dong et al.

    Big data integration

  • J. Manyika et al.

    Big Data: The Next Frontier for Innovation, Competition, and Productivity

    (2011)
  • P. Zikopoulos et al.

    Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data

    (2011)
  • O’Reilly Media, Inc.

    Big Data Now

    (2014)
  • P. Hitzler et al.

    Linked data, big data, and the 4th paradigm

    Semant. Web

    (2013)
  • H.V. Jagadish et al.

    Challenges and Opportunities with Big Data

    (2015)
  • T. White

    Hadoop: The Definitive Guide

    (2009)
  • D. Borthakur

    The Hadoop distributed file system: architecture and design

    The Apache Software Foundation.

    (2007)
  • K. Shvachko et al.

    The Hadoop distributed file system

  • G. Turkington

    Hadoop Beginners Guide

    (2013)
  • J. Dean et al.

    MapReduce: simplified data processing on large clusters

    Commun. ACM

    (2008)
  • C. Ranger et al.

    Evaluating MapReduce for multi-core and multiprocessor systems

  • R. Mutharaju et al.

    A MapReduce algorithm for EL+

  • T. Kaldewey et al.

    Clydesdale: structured data processing on MapReduce

  • Y. He et al.

    RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems

  • A. Thusoo et al.

    Hive - a petabyte scale data warehouse using Hadoop

  • J. Boulon et al.

    Chukwa, a large-scale monitoring system

    Cloud Comput. Appl.

    (2008)
  • C. Wang et al.

    Faster, larger, easier: reining real-time big data processing in cloud

  • M. Schonlau

    The clustergram: A graph for visualizing hierarchical and nonhierarchical cluster analyses

    Stata J.

    (2002)
  • F.B. Viégas et al.

    Studying cooperation and conflict between authors with history flow visualizations

  • D. Keim et al.

    Big-data visualization

    IEEE Comput. Graph. Appl.

    (2013)
  • P. Russom et al. Big data analytics, TDWI Best Practices Report, Fourth...
  • D. Maltby

    Big data analytics

  • A. Hoppe et al.

    Automatic user profile mapping to marketing segments in a big data context

  • R. Peixoto et al.

    Semantic HMC for business intelligence using cross-referencing

  • S.H. Sengamedu

    Scalable analytics–algorithms and systems

  • C. Bizer et al.

    The meaningful use of big data: four perspectives–four challenges

    SIGMOD Rec.

    (2012)
  • S. Auer et al.

    Creating knowledge out of interlinked data

    Semant. Web

    (2010)
  • M. Laclavík et al.

    Towards large scale semantic annotation built on MapReduce architecture

  • N. Nakashole et al.

    Scalable knowledge harvesting with high precision and high recall
