About This Book

The LNCS journal Transactions on Large-Scale Data- and Knowledge-Centered Systems focuses on data management, knowledge discovery, and knowledge processing, which are core and hot topics in computer science. Since the 1990s, the Internet has become the main driving force behind application development in all domains. An increase in the demand for resource sharing across different sites connected through networks has led to an evolution of data- and knowledge-management systems from centralized systems to decentralized systems enabling large-scale distributed applications providing high scalability. Current decentralized systems still focus on data and knowledge as their main resource. Feasibility of these systems relies basically on P2P (peer-to-peer) techniques and the support of agent systems with scaling and decentralized control. Synergy between grids, P2P systems, and agent technologies is the key to data- and knowledge-centered systems in large-scale environments.


This volume, the 33rd issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains five revised selected regular papers. Topics covered include distributed massive data streams, storage systems, scientific workflow scheduling, cost optimization of data flows, and fusion strategies.



Lightweight Metric Computation for Distributed Massive Data Streams

The real-time analysis of massive data streams is of utmost importance in data-intensive applications that need to detect, as fast and as efficiently as possible (in terms of computation and memory space), any correlation between their inputs or any deviance from some expected nominal behavior. The IoT infrastructure can be used to monitor any events or changes in structural conditions that compromise safety and increase risk. It is thus a recurrent and crucial issue to determine whether huge data streams, received at monitored devices, are correlated, as this may reveal the presence of attacks. We propose a metric, called codeviation, that allows us to evaluate the correlation between distributed massive streams. This metric is inspired by a classical metric from statistics and probability theory, and as such makes it possible to understand how observed quantities change together, and in what proportion. We then propose to estimate the codeviation in the data stream model. In this model, functions are estimated over a huge sequence of data items, in an online fashion, and with a very small amount of memory with respect to both the size of the input stream and the domain from which data items are drawn. We then generalize our approach by presenting a new metric, the Sketch-⋆ metric, which allows us to define a distance between updatable summaries of large data streams. An important feature of the Sketch-⋆ metric is that, given a measure on the entire initial data streams, the Sketch-⋆ metric preserves the axioms of that measure on the sketch. We finally present results obtained from extensive experiments conducted on both synthetic traces and real data sets, allowing us to validate the robustness and accuracy of our metrics.
Emmanuelle Anceaume, Yann Busnel
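As background for the codeviation idea above: the classical metric it is inspired by is covariance, which can be maintained in a single pass over a stream with constant memory. The sketch below is an illustrative, centralized exact computation (a Welford-style update), not the authors' distributed, sublinear-memory algorithm.

```python
class OnlineCovariance:
    """Single-pass covariance over a stream of (x, y) pairs.

    Illustrative only: the paper's codeviation targets distributed
    streams with sublinear memory; this is the exact, centralized
    analogue of the underlying statistical quantity.
    """

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.c = 0.0  # running sum of co-moments

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x          # deviation from the old mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.c += dx * (y - self.mean_y)  # uses the updated mean of y

    def covariance(self):
        """Population covariance of everything seen so far."""
        return self.c / self.n if self.n else 0.0


cov = OnlineCovariance()
for x in range(100):
    cov.update(x, 2 * x + 1)  # two perfectly correlated streams
```

Because each update touches only four scalars, memory stays O(1) regardless of stream length, which is the property the streaming model demands.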

Performance Analysis of Object Store Systems in a Fog and Edge Computing Infrastructure

Fog and Edge computing infrastructures have been proposed as an alternative to current Cloud Computing facilities to address the latency issue faced by some applications. The main idea is to deploy smaller data centers at the edge of the backbone in order to bring Cloud Computing resources closer to end users. While a few works have illustrated the advantages of such infrastructures, in particular for Internet of Things (IoT) applications, the question of how to design elementary services that can take advantage of such massively distributed infrastructures has not yet been discussed. In this paper, we address this question from the storage point of view. First, we propose a list of properties a storage system should meet in this context. Second, we evaluate, through performance analysis, three “off-the-shelf” object store solutions, namely Rados, Cassandra, and InterPlanetary File System (IPFS). In particular, we focus (i) on access times to push and get objects under different scenarios and (ii) on the amount of network traffic exchanged between the different geographical sites during such operations. We also evaluate how network latencies influence access times. Experiments are conducted using the Yahoo! Cloud Serving Benchmark (YCSB) on top of the Grid’5000 testbed. Finally, we show that adding a Scale-Out NAS system on each site improves the access times of IPFS and reduces the amount of traffic between sites when objects are read locally, by avoiding the costly DHT access. The simultaneous observation of different Fog sites also constitutes the originality of this work.
Bastien Confais, Adrien Lebre, Benoît Parrein
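The access-time measurements described above boil down to timing put and get operations against each store. A minimal sketch of such a harness is shown below; the `store` interface and `DictStore` stand-in are hypothetical placeholders, not the APIs of Rados, Cassandra, or IPFS, and the paper itself drives the stores through YCSB rather than hand-rolled timing.

```python
import statistics
import time


def benchmark(store, payload: bytes, n: int = 50):
    """Measure push/get latencies against any object store exposing
    put(key, data) and get(key). The interface is an assumption for
    illustration, not a real client API."""
    push, get = [], []
    for i in range(n):
        key = f"obj-{i}"
        t0 = time.perf_counter()
        store.put(key, payload)           # push: write the object
        push.append(time.perf_counter() - t0)
        t0 = time.perf_counter()
        store.get(key)                    # get: read it back
        get.append(time.perf_counter() - t0)
    return {"push_median_s": statistics.median(push),
            "get_median_s": statistics.median(get)}


class DictStore:
    """In-memory stand-in so the sketch runs end to end."""

    def __init__(self):
        self._d = {}

    def put(self, key, data):
        self._d[key] = data

    def get(self, key):
        return self._d[key]


stats = benchmark(DictStore(), b"x" * 1024)
```

Medians (rather than means) are used so that a few slow outliers, e.g. from DHT lookups, do not dominate the reported access time.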

Scientific Workflow Scheduling with Provenance Data in a Multisite Cloud

Recently, some Scientific Workflow Management Systems (SWfMSs) with provenance support (e.g. Chiron) have been deployed in the cloud. However, they typically use a single cloud site. In this paper, we consider a multisite cloud, where the data and computing resources are distributed at different sites (possibly in different regions). Based on a multisite architecture of SWfMS, i.e. multisite Chiron, and its provenance model, we propose a multisite task scheduling algorithm that takes into account the time to generate provenance data. We performed an extensive experimental evaluation of our algorithm using the Microsoft Azure multisite cloud and two real-life scientific workflows (Buzz and Montage). The results show that our scheduling algorithm is up to 49.6% better than baseline algorithms in terms of total execution time.
Ji Liu, Esther Pacitti, Patrick Valduriez, Marta Mattoso
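To make the scheduling problem above concrete, the sketch below shows a simple greedy heuristic that assigns tasks to sites while charging a per-task provenance-write latency, so that a site's speed advantage can be offset by a slow provenance store. This is an illustrative baseline only; the site names, cost model, and algorithm are assumptions, not the multisite Chiron scheduler from the paper.

```python
def greedy_schedule(tasks, sites):
    """Assign each task to the site minimizing its estimated finish
    time, including a per-task provenance-write cost.

    tasks: list of (task_id, compute_cost) pairs
    sites: dict site -> {"speed": work units/s, "prov_latency": s}
    Returns (assignment dict, makespan). Illustrative heuristic only.
    """
    ready = {s: 0.0 for s in sites}  # time at which each site frees up
    plan = {}
    # Longest-task-first is a common greedy order for load balancing.
    for tid, cost in sorted(tasks, key=lambda t: -t[1]):
        best = min(
            sites,
            key=lambda s: ready[s]
            + cost / sites[s]["speed"]
            + sites[s]["prov_latency"],
        )
        ready[best] += cost / sites[best]["speed"] + sites[best]["prov_latency"]
        plan[tid] = best
    return plan, max(ready.values())


# Hypothetical two-site setup: a fast site and a slower one with a
# cheaper provenance write.
sites = {"west-eu": {"speed": 2.0, "prov_latency": 0.1},
         "east-us": {"speed": 1.0, "prov_latency": 0.3}}
plan, makespan = greedy_schedule([("t1", 4), ("t2", 2), ("t3", 2)], sites)
```

Including the provenance term in the cost function is the point of the illustration: a scheduler that ignores it can pile tasks onto a site whose provenance writes are slow.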

Cost Optimization of Data Flows Based on Task Re-ordering

Analyzing big data with the help of automated data flows attracts a lot of attention because of the growing need for end-to-end processing of this data. Modern data flows may consist of a large number of tasks, and it is difficult for flow designers to define an efficient execution order of the tasks manually, given that there is typically significant freedom in the valid positioning of some of the tasks. Several automated execution plan enumeration techniques have been proposed. These solutions can be broadly classified into three categories, each with significant limitations: (i) optimizations based on rewrite rules similar to those used in databases, such as filter and projection push-down, which cover only the flow tasks that correspond to extended relational algebra operators; to cover arbitrary tasks, the solutions (ii) either rely on simple heuristics, or (iii) exhaustively check all orderings, and thus cannot scale. We target the second category and propose an efficient, polynomial-time, cost-based task-ordering solution for flows with arbitrary tasks seen as black boxes. We evaluated our proposals using both real runs and simulations, and the results show that we can achieve speed-ups of orders of magnitude, especially for flows with a high number of tasks, even when flexibility in task positioning is relatively low.
Georgia Kougka, Anastasios Gounaris
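For intuition about why task order matters, the sketch below shows the classical rank-based ordering for fully commutable filter-like tasks: sort by cost per tuple divided by the fraction of tuples removed. This textbook heuristic is shown as context only; it is not the paper's algorithm, which handles arbitrary black-box tasks with precedence constraints.

```python
def rank_order(filters):
    """Order fully commutable filter tasks by ascending rank
    c / (1 - s), where c is per-tuple cost and s is selectivity
    (fraction of tuples that pass). Classical result for independent
    predicates; illustrative only."""
    return sorted(filters, key=lambda f: f["cost"] / (1.0 - f["sel"]))


def flow_cost(ordered, n_tuples=1_000_000):
    """Total work: each task processes only the tuples that survived
    the tasks placed before it."""
    total, remaining = 0.0, float(n_tuples)
    for f in ordered:
        total += remaining * f["cost"]
        remaining *= f["sel"]
    return total


# Hypothetical tasks: a cheap filter that drops little vs. an
# expensive filter that drops almost everything.
filters = [{"name": "cheap_loose", "cost": 1.0, "sel": 0.9},
           {"name": "pricey_tight", "cost": 5.0, "sel": 0.1}]
best = rank_order(filters)
```

Here the expensive but highly selective task is scheduled first, because the tuples it eliminates never reach the later task; comparing `flow_cost` for the two orders makes the saving explicit.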

Fusion Strategies for Large-Scale Multi-modal Image Retrieval

Large-scale data management and retrieval in complex domains such as images, videos, or biometric data remains one of the most important and challenging information processing tasks. Even after two decades of intensive research, many questions still remain to be answered before working tools become available for everyday use. In this work, we focus on the practical applicability of different multi-modal retrieval techniques. Multi-modal searching, which combines several complementary views on complex data objects, follows the human thinking process and represents a very promising retrieval paradigm. However, the rapid development of modality fusion techniques in several diverse directions, and a lack of comparisons between individual approaches, have resulted in a confusing situation in which the applicability of individual solutions is unclear. Aiming to improve the research community’s comprehension of this topic, we analyze and systematically categorize existing multi-modal search techniques, identify their strengths, and describe selected representatives. In the second part of the paper, we focus on the specific problem of large-scale multi-modal image retrieval on the web. We analyze the requirements of such a task, implement several applicable fusion methods, and experimentally evaluate their performance in terms of both efficiency and effectiveness. The extensive experiments provide a unique comparison of diverse approaches to modality fusion in equal settings on two large real-world datasets.
Petra Budikova, Michal Batko, Pavel Zezula
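One of the simplest fusion strategies in the family surveyed above is weighted-sum late fusion: score the query independently per modality, normalize so the scores are comparable, then combine. The sketch below illustrates that idea; the document names, scores, and weights are made up for the example, and the paper compares several strategies rather than prescribing this one.

```python
def late_fusion(score_lists, weights):
    """Weighted-sum late fusion of per-modality retrieval scores.

    score_lists: one dict {doc_id: score} per modality
    weights: one weight per modality
    Each modality is min-max normalized first so that, e.g., text and
    visual similarity scores live on the same [0, 1] scale.
    Returns doc ids ranked by the fused score, best first.
    """
    fused = {}
    for scores, w in zip(score_lists, weights):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on flat scores
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + w * (s - lo) / span
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical per-modality scores for three images.
text = {"img1": 0.9, "img2": 0.4, "img3": 0.1}
visual = {"img1": 0.2, "img2": 0.8, "img3": 0.3}
ranking = late_fusion([text, visual], weights=[0.6, 0.4])  # → img2 first
```

The normalization step is what makes such comparisons fair: without it, the modality with the larger raw score range silently dominates the fused ranking.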
