Understandable Big Data: A survey
Introduction
Today, people and systems flood the web with exponentially growing amounts of data. The volume of data on the web is measured in exabytes (10^18 bytes) and zettabytes (10^21 bytes). By 2025, the Internet is forecast to exceed the brain capacity of everyone living in the world [1]. This rapid growth is due to advances in digital sensors, communications, computation, and storage, which have created huge collections of data. The term Big Data was coined by Roger Magoulas (according to [2]) to describe this phenomenon.
Seven recent papers (including [3] and [4]) have aimed to extract Big Data trends, challenges and opportunities. [5] survey scalable database management for update-heavy applications, analytics and decision support. Likewise, [6] study analytics in Big Data with a focus on data warehouses. The goals of these two papers differ from those of [7]. More rigorously, M. Pospiech and C. Felden [7] selected relevant and recent papers that tackle different aspects of Big Data and clustered them into four domains: technical data provisioning (acquisition, storage, processing), technical data utilization (computation and time complexity), functional data provisioning (information life cycle management, lean information management, value-oriented information management, etc.) and functional data utilization (realms where Big Data is used). From this clustering, [7] note that most of the papers (87%) are technical and that none addresses functional data provisioning. Closer than the three previous works to our target, semantics in the age of Big Data, [8] focus on knowledge discovery and management in the Big Data era (the flooding of data on the web). Like our paper, they focus on gathering relational facts, information extraction, the emergence of structure, etc. However, a thorough delimitation of the concept of Big Data is outside the scope of their article, as are other key themes of this paper such as reasoning on large and uncertain OWL triples, coreference resolution and ontology alignment. The last paper is by [9], who present Big Data integration in an easily understandable way: schema alignment, record linkage and data fusion are presented with respect to the Big Data characteristics (volume, velocity and variety). Given the high value carried by data in general, and thus by Big Data, it is not surprising that Chief Information Officers (CIOs) are interested in Big Data analytics as a technology.
Whereas web pages and traditional databases were initially the raw materials for search engine companies and other businesses respectively, they are now mixed with large sets of miscellaneous, heterogeneous and unstructured data. Tools and techniques must therefore be designed to disambiguate these data before they are combined to master and manage an organization's data. Our work is similar to [9] in its approach. We discuss the challenges and opportunities of semantics in the age of Big Data and present the supply chain for handling it. Accordingly, this article defines Big Data (Section 2), briefly discusses its management (Section 3) and finally tackles the challenges and opportunities of Big Data and semantics (Section 4).
Section snippets
What is big data?
Manyika et al. [10, page 1] define Big Data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze”. Likewise, Davis and Patterson [1, page 4] say “Big data is data too big to be handled and analyzed by traditional database protocols such as SQL”, an opinion shared by [11], [3] and [4], among others. Both groups of authors go beyond size alone when defining Big Data. Edd Dumbill in
Big data management
Basically, data processing is seen as the gathering, processing and management of data to produce “new” information for end users [3]. Over time, its key challenges have been related to the storage, transportation and processing of high-throughput data. Big Data challenges differ in that ambiguity, uncertainty and variety must be added to them [3]. Consequently, these requirements imply an additional step in which data are cleaned, tagged, classified and formatted [3], [14]. Karmasphere5
Big data quality, the next semantic challenge
A question that knowledge management experts ask is whether Big Data can leverage semantics. The answer is obviously “yes”. Companies and governments are interested in two types of data in a big data context. First, they consider data generated by humans, mainly those disseminated through web tools (social networks, cookies, emails, etc.). Second, they want to merge data generated by connected objects. The Internet of human beings and the internet of
Ethics and privacy
Ethics and privacy have always been a main concern in data management. They are of even greater interest with big data. This is due to the multi-dimensionality of big data:
- Due to the huge volume of data, more pieces of valuable information can be identified or inferred than was possible before.
- The high velocity of data makes real-time analysis feasible and thus allows users’ profiles to be refined continuously.
- The variety of data sources makes users traceable. In addition the diversity of data types
Conclusion
We are living in the era of the data deluge. The term Big Data was coined to describe this age. This paper defines the concept of Big Data and characterizes it. In addition, a supply chain and technologies for Big Data management are presented. Many problems can be encountered during that management, especially during semantic gathering. The paper therefore tackles semantics (reasoning, coreference resolution, entity linking, information
References (91)
- et al. WebPIE: A web-scale parallel inference engine using MapReduce. Web Semant. (2012)
- et al. Searching and browsing linked data with : The semantic web search engine. Web Semant. (2011)
- et al. Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora. Web Semant. (2012)
- et al. Conceptual-model-based data extraction from multiple-record web pages. Data Knowl. Eng. (1999)
- et al. An ontology-based retrieval system using semantic indexing. Inf. Syst. (2012)
- et al. Sig.ma: Live views on the web of data. Web Semantics: Sci. Serv. Agents on the World Wide Web (2010)
- et al. Repairing and reasoning with inconsistent and uncertain ontologies. Adv. Eng. Softw. (2012)
- et al. Ethics of Big Data: Balancing Risk and Innovation (2012)
- et al. The evolution of big data as a research and scientific topic: Overview of the literature. Res. Trends (2012)
- Data warehousing in the age of big data
- Managing Data in Motion: Data Integration Best Practice Techniques and Technologies
- Big data and cloud computing: current state and future opportunities
- Analytics over large-scale multidimensional data: the big data revolution!
- Big data - a state-of-the-art
- Knowledge harvesting in the big-data era
- Big data integration
- Big Data: The Next Frontier for Innovation, Competition, and Productivity
- Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data
- Big Data Now
- Linked data, big data, and the 4th paradigm. Semant. Web
- Challenges and Opportunities with Big Data
- Hadoop: The Definitive Guide
- The Hadoop distributed file system: Architecture and design. The Apache Software Foundation
- The Hadoop distributed file system
- Hadoop Beginners Guide
- MapReduce: simplified data processing on large clusters. Commun. ACM
- Evaluating MapReduce for multi-core and multiprocessor systems
- A MapReduce algorithm for
- Clydesdale: structured data processing on MapReduce
- RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems
- Hive - a petabyte scale data warehouse using Hadoop
- Chukwa, a large-scale monitoring system. Cloud Comput. Appl.
- Faster, larger, easier: reining real-time big data processing in cloud
- The clustergram: A graph for visualizing hierarchical and nonhierarchical cluster analyses. Stata J.
- Studying cooperation and conflict between authors with history flow visualizations
- Big-data visualization, computer graphics and applications. IEEE
- Big data analytics
- Automatic user profile mapping to marketing segments in a bigdata context
- Semantic HMC for business intelligence using cross-referencing
- Scalable analytics–algorithms and systems
- The meaningful use of big data: four perspectives–four challenges. SIGMOD Rec.
- Creating knowledge out of interlinked data. Semant. Web
- Towards large scale semantic annotation built on MapReduce architecture
- Scalable knowledge harvesting with high precision and high recall