
About this book

The LNCS journal Transactions on Large-Scale Data- and Knowledge-Centered Systems focuses on data management, knowledge discovery, and knowledge processing, which are core and active topics in computer science. Since the 1990s, the Internet has become the main driving force behind application development in all domains. An increase in the demand for resource sharing across different sites connected through networks has led to an evolution of data- and knowledge-management systems from centralized systems to decentralized systems enabling large-scale distributed applications providing high scalability. Current decentralized systems still focus on data and knowledge as their main resource. The feasibility of these systems relies essentially on P2P (peer-to-peer) techniques and the support of agent systems with scaling and decentralized control. Synergy between grids, P2P systems, and agent technologies is the key to data- and knowledge-centered systems in large-scale environments.

This, the 29th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains four revised selected regular papers. Topics covered include optimization and cluster validation processes for entity matching, business intelligence systems, and data profiling in the Semantic Web.



Sensitivity - An Important Facet of Cluster Validation Process for Entity Matching Technique

A cluster validity measure is an important component of the cluster validation process: once a clustering arrangement is found, it is compared with the actual clustering arrangement, or gold standard, if one is available. Different external cluster validity measures (VMs) are available for this purpose. However, not all measures are equally good for a specific clustering problem. For example, in entity matching, the F-measure is preferred over the McNemar index because the former satisfies a given set of desirable properties for the entity matching problem. We have observed, however, that even when all the existing desirable properties are satisfied, some VMs fail to detect important differences between two clustering arrangements. We therefore propose another property, termed sensitivity, which can be added to the set of desirable properties and used alongside the existing ones in the cluster validation process. In this paper, the sensitivity property of a VM is formally introduced, and its value is computed using the proposed identity-matrix-based technique. A comprehensive analysis compares some of the existing VMs and determines their suitability for entity matching. The paper thus helps to improve the performance of the cluster validation process.
Sumit Mishra, Samrat Mondal, Sriparna Saha
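
As a rough illustration of what an external validity measure does, the sketch below computes a pairwise F-measure between a predicted clustering and a gold standard. It uses the common pair-counting formulation and is an assumption for illustration only; it is not the sensitivity property or the identity-matrix-based technique proposed in the paper. The function names and record identifiers are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes each pair canonical, so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f_measure(predicted, gold):
    """Harmonic mean of pairwise precision and recall against a gold standard."""
    pred_pairs = co_clustered_pairs(predicted)
    gold_pairs = co_clustered_pairs(gold)
    if not pred_pairs and not gold_pairs:
        return 1.0  # both clusterings consist of singletons only
    true_pos = len(pred_pairs & gold_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical records: the prediction splits r3 away from its true cluster.
gold = [{"r1", "r2", "r3"}, {"r4", "r5"}]
predicted = [{"r1", "r2"}, {"r3"}, {"r4", "r5"}]
score = pairwise_f_measure(predicted, gold)
print(round(score, 3))
```

Here the prediction recovers two of the four gold pairs without adding any wrong pair, so precision is 1 and recall is 0.5. A measure of this kind reduces a clustering comparison to a single number, which is why two quite different predicted arrangements can receive the same score.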

Pay-as-you-go Configuration of Entity Resolution

Entity resolution, which seeks to identify records that represent the same entity, is an important step in many data integration and data cleaning applications. However, entity resolution is challenging both in terms of scalability (all-against-all comparisons are computationally impractical) and result quality (syntactic evidence on record equivalence is often equivocal). As a result, end-to-end entity resolution proposals involve several stages, including blocking to efficiently identify candidate duplicates, detailed comparison to refine the conclusions from blocking, and clustering to identify the sets of records that may represent the same entity. However, the quality of the result is often crucially dependent on configuration parameters in all of these stages, for which it may be difficult for a human expert to provide suitable values. This paper describes an approach in which a complete entity resolution process is optimized, on the basis of feedback (such as might be obtained from crowds) on candidate duplicates. Given such feedback, an evolutionary search of the space of configuration parameters is carried out, with a view to maximizing the fitness of the resulting clusters. The approach is pay-as-you-go in that more feedback can be expected to give rise to better outcomes. An empirical evaluation shows that the co-optimization of the different stages in entity resolution can yield significant improvements over default parameters, even with small amounts of feedback.
Ruhaila Maskat, Norman W. Paton, Suzanne M. Embury
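
The staged process described in this abstract (blocking, detailed comparison, clustering, then tuning parameters against feedback) can be sketched as follows. This is a minimal toy under stated assumptions, not the authors' system: the prefix-based blocking key, the `difflib` string similarity, the connected-components clustering, the agreement-based fitness, and all names and records are hypothetical. In particular, a small exhaustive search over four candidate configurations stands in for the evolutionary search used in the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def resolve(records, prefix_len, threshold):
    """Blocking, pairwise comparison, then clustering by connected components."""
    # Blocking: only records sharing a lowercase prefix are ever compared.
    blocks = {}
    for rid, text in records.items():
        blocks.setdefault(text.lower()[:prefix_len], []).append(rid)

    # Union-find structure used to merge matched records into clusters.
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Detailed comparison inside each block.
    for block in blocks.values():
        for a, b in combinations(block, 2):
            sim = SequenceMatcher(None, records[a].lower(), records[b].lower()).ratio()
            if sim >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), set()).add(rid)
    return list(clusters.values())


def fitness(clusters, feedback):
    """Fraction of feedback judgments the clustering agrees with (assumed metric)."""
    cluster_of = {rid: i for i, c in enumerate(clusters) for rid in c}
    agree = sum(
        (cluster_of[a] == cluster_of[b]) == is_dup
        for (a, b), is_dup in feedback.items()
    )
    return agree / len(feedback)


# Hypothetical records and duplicate/non-duplicate feedback on candidate pairs.
records = {"r1": "John Smith", "r2": "Jon Smith", "r3": "Jane Doe", "r4": "Jane Do"}
feedback = {("r1", "r2"): True, ("r3", "r4"): True, ("r1", "r3"): False}

# Exhaustive search over a tiny parameter grid (stand-in for evolutionary search).
candidates = [(p, t) for p in (1, 3) for t in (0.8, 0.99)]
best = max(candidates, key=lambda cfg: fitness(resolve(records, *cfg), feedback))
print(best, fitness(resolve(records, *best), feedback))
```

The example shows why the stages need to be tuned together: with a three-character blocking key, "John Smith" and "Jon Smith" fall into different blocks and are never compared, so no comparison threshold can recover that match.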

A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey

Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet complex requirements of next generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that are still to be addressed and showing how current solutions can be applied to address them.
Petar Jovanovic, Oscar Romero, Alberto Abelló

A Self-Adaptive and Incremental Approach for Data Profiling in the Semantic Web

The increasing adoption of linked data principles has led to the availability of a huge number of datasets on the Web. However, the use of these datasets is hindered by the lack of descriptive information about their content. Indeed, interlinking, matching or querying them requires some knowledge about the types and properties they contain. In this paper, we tackle the problem of describing the content of an RDF dataset by profiling its entities, which consists of discovering the implicit types and providing their description. Each type is described by a profile composed of properties and their probabilities. Our approach relies on a clustering algorithm. It is self-adaptive, as it can automatically detect the most appropriate similarity threshold according to the dataset. Our algorithms generate overlapping clusters, enabling the detection of several types for an entity. As a dataset may evolve, our approach is incremental and can assign a type to a new entity and update the type profiles without browsing the whole dataset. We also present some experimental evaluations to demonstrate the effectiveness of our approach.
Kenza Kellou-Menouer, Zoubida Kedad
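
The idea of a type profile made of properties and their probabilities, updated incrementally as entities arrive, can be sketched roughly as below. Everything here is an assumption for illustration: the `TypeProfile` class, the soft Jaccard-style similarity, the fixed threshold (the paper detects its threshold automatically), and the example property names are hypothetical and are not taken from the authors' algorithm.

```python
from collections import Counter


class TypeProfile:
    """Running property statistics for one type (a cluster of entities)."""

    def __init__(self):
        self.size = 0            # number of entities assigned to this type
        self.counts = Counter()  # property -> number of entities having it

    def add(self, properties):
        """Incremental update: no pass over previously seen entities."""
        self.size += 1
        self.counts.update(set(properties))

    def probability(self, prop):
        return self.counts[prop] / self.size if self.size else 0.0

    def similarity(self, properties):
        """Assumed measure: a soft Jaccard between an entity and the profile."""
        props = set(properties)
        shared = sum(self.probability(p) for p in props)
        unseen = sum(1 for p in props if p not in self.counts)
        total = sum(self.probability(p) for p in self.counts) + unseen
        return shared / total if total else 0.0


def assign_types(profiles, properties, threshold):
    """Assign the entity to every sufficiently similar type (overlap allowed)."""
    matched = [
        name for name, profile in profiles.items()
        if profile.similarity(properties) >= threshold
    ]
    for name in matched:
        profiles[name].add(properties)
    return matched


# Hypothetical profiles built from a few already-typed entities.
profiles = {"person": TypeProfile(), "organization": TypeProfile()}
profiles["person"].add({"name", "birthDate"})
profiles["person"].add({"name", "email"})
profiles["organization"].add({"name", "address"})

# A new entity arrives; only the matching profiles are updated.
assigned = assign_types(profiles, {"name", "email"}, threshold=0.5)
print(assigned, profiles["person"].probability("email"))
```

Because each profile stores only counts, adding an entity touches just the profiles it matches. Lowering the threshold in this toy would let the same entity be assigned to both types, which mirrors the overlapping clusters mentioned in the abstract.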

