2016 | Book

Data and Information Quality

Dimensions, Principles and Techniques

Authors: Carlo Batini, Monica Scannapieco

Publisher: Springer International Publishing

Book series: Data-Centric Systems and Applications

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book provides a systematic and comparative description of the vast number of research issues related to the quality of data and information. It does so by delivering a sound, integrated and comprehensive overview of the state of the art and future development of data and information quality in databases and information systems.

To this end, it presents an extensive description of the techniques that constitute the core of data and information quality research, including record linkage (also called object identification), data integration, and error localization and correction, and examines the related techniques in a comprehensive and original methodological framework. Quality dimension definitions and adopted models are also analyzed in detail, and differences between the proposed solutions are highlighted and discussed. Furthermore, while systematically describing data and information quality as an autonomous research area, the book also covers paradigms and influences deriving from other areas, such as probability theory, statistical data analysis, data mining, knowledge representation, and machine learning. Last but not least, the book also highlights very practical solutions, such as methodologies, benchmarks for the most effective techniques, case studies, and examples.

The book has been written primarily for researchers in the fields of databases and information management or in the natural sciences who are interested in investigating properties of data and information that have an impact on the quality of experiments, processes, and real life. The material presented is also sufficiently self-contained for master's or PhD-level courses, and it covers all the fundamentals and topics without the need for other textbooks. Data and information system administrators and practitioners, who deal with systems exposed to data-quality issues and as a result need a systematization of the field and practical methods in the area, will also benefit from the combination of concrete practical approaches with sound theoretical formalisms.

Table of Contents

Frontmatter

Chapter 1. Introduction to Information Quality

Abstract

The search query “data quality” entered into Google returns about three million pages, and searching similarly for the term “information quality” (IQ) returns about one and a half million pages, both frequencies showing the increasing importance of data and information quality. The goal of this chapter is to show and discuss the perspectives that make data and information (D&I) quality an issue worth investigating and understanding. We first (Sect. 1.2) highlight the relevance of information quality in everyday life and some of the main related initiatives in the public and private domains. Then, in Sect. 1.3, we show the multidimensional nature of information quality by means of several examples.

Carlo Batini, Monica Scannapieco

Chapter 2. Data Quality Dimensions

Abstract

In Chap. 1, we provided an intuitive concept of information quality and we informally introduced several data quality dimensions, such as accuracy, completeness, currency, and consistency.

Carlo Batini, Monica Scannapieco

Chapter 3. Information Quality Dimensions for Maps and Texts

Abstract

In Chap. 2, we considered quality dimensions for structured data. In this chapter, we move from data quality dimensions to information quality dimensions. We consider two coordinates for the types of information: the perceptual coordinate and the linguistic coordinate. On the one hand, we explore how dimensions change according to the coordinate and the type of information, taking maps as the case for the perceptual coordinate and semistructured texts as the case for the linguistic coordinate. On the other hand, we examine in greater detail a topic introduced in Chap. 2: how dimensions change or evolve when specific domains are considered. In particular, we consider a special kind of semistructured text, namely, law texts.

Carlo Batini, Monica Scannapieco

Chapter 4. Data Quality Issues in Linked Open Data

Abstract

The increasing diffusion of linked data as a standard way to share knowledge on the Web allows users and public and private organizations to fully exploit structured data from very large datasets that were not available in the past. Over the last few years, linked data has grown into a large number of openly accessible datasets from several domains, leading to the linking open data (LOD) cloud. Similar to other types of information such as structured data, linked data suffers from quality problems such as inaccuracy, out-of-dateness, incompleteness, and inconsistency, which are frequent and seriously limit the full exploitation of such data. Therefore, it is important to assess the quality of the datasets used in linked data applications before using them. Quality assessment allows users or applications to understand whether data is appropriate for the task at hand.

Anisa Rula, Andrea Maurino, Carlo Batini

Chapter 5. Quality of Images

Abstract

An image is the result of the optical imaging process which maps physical scene properties onto a two-dimensional luminance distribution; it encodes important and useful information about the geometry of the scene and the properties of the objects located within this scene [339, 611, 687].

Gianluigi Ciocca, Silvia Corchs, Francesca Gasparini, Carlo Batini, Raimondo Schettini

Chapter 6. Models for Information Quality

Abstract

In the previous chapters, we introduced several dimensions that are useful to describe and measure information quality in its different aspects and meanings. Focusing on structured data, database management systems (DBMSs) represent data and the related operations on it in terms of a data model and a data definition and manipulation language, i.e., a set of structures and commands that can be represented, interpreted, and executed by a computer. We can follow the same process to represent, besides data, their quality dimensions. This means that in order to represent data quality, we have to extend data models.

Carlo Batini, Monica Scannapieco

Chapter 7. Activities for Information Quality

Abstract

In Chap. 1 we noted that information quality is a multifaceted concept, and the cleaning of poor-quality information can be performed by measuring different dimensions and setting out several different activities, with various goals. An information quality activity is any process we perform directly on information to improve its quality. An example of a “manual” information quality activity is the process we perform when we send an e-mail message and the e-mail bounces back because of an unknown user; we check the exact address in a reliable source, and we type the address on the keyboard more carefully to avoid further mistakes. An example of a “computerized” information quality activity is the matching of two files in which inaccurate records are included, in order to find similar records that correspond to the same real-world entity. Other activities for improving information quality act on processes; they will be discussed and compared with information quality activities in Chap. 12.

Carlo Batini, Monica Scannapieco
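
The “computerized” activity mentioned in the Chapter 7 abstract, matching two files to find similar records that refer to the same real-world entity, can be illustrated with a minimal sketch. This is not the authors' technique: the record layout (`id`, `name`), the function names, the sample data, and the 0.85 threshold are all assumptions made for illustration, and the string comparison simply uses Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_records(file_a, file_b, threshold=0.85):
    """Compare every record of file_a with every record of file_b.

    Hypothetical helper: a pair is reported as a candidate match when the
    similarity of the two 'name' fields reaches the (assumed) threshold.
    """
    matches = []
    for rec_a in file_a:
        for rec_b in file_b:
            score = similarity(rec_a["name"], rec_b["name"])
            if score >= threshold:
                matches.append((rec_a["id"], rec_b["id"], round(score, 2)))
    return matches


# Invented sample data: the second file contains a misspelled name.
file_a = [{"id": "a1", "name": "Monica Rossi"}, {"id": "a2", "name": "Carlo Bianchi"}]
file_b = [{"id": "b1", "name": "Monika Rossi"}, {"id": "b2", "name": "Paolo Verdi"}]

matches = match_records(file_a, file_b)
print(matches)
```

Comparing all pairs is quadratic in the number of records, so a sketch like this is only practical for small files.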

Chapter 8. Object Identification

Abstract

In this chapter we address object identification, the most important and most extensively investigated information quality activity. Given this importance, we dedicate two chapters of the book to object identification: this chapter focuses on consolidated techniques and the next one on recent advancements.

Carlo Batini, Monica Scannapieco

Chapter 9. Recent Advances in Object Identification

Abstract

Research on object identification has produced several significant results in recent years, in different areas of computer science. As observed in [140], it is well known that in data mining projects, a large proportion of effort (20–30 %, as reported in [566]) is spent on understanding data and 50–70 % on data preparation. Governmental organizations need to reconcile and integrate their huge and heterogeneous data assets; statistical agencies routinely link survey and administrative data; in the health sector, historical data on patients and analyses are to be linked to improve the effectiveness of operations and policies [80]; security agencies increasingly rely on the ability to correlate files referring to a single individual; and data linkage can help in bioinformatics to relate known genome sequences to a new unknown sequence. Given this increasing interest in object identification, in this chapter we pay attention to the main trends in the area, with a focus on the latest results.

Carlo Batini, Monica Scannapieco

Chapter 10. Data Quality Issues in Data Integration Systems

Abstract

In distributed environments, data sources are typically characterized by various kinds of heterogeneities that can be generally classified into (1) technological heterogeneities, (2) schema heterogeneities, and (3) instance-level heterogeneities. Technological heterogeneities are due to the use of products by different vendors, employed at various layers of an information and communication infrastructure. An example of technological heterogeneity is the usage of two different relational database management systems, such as IBM’s DB2 and Microsoft’s SQL Server. Schema heterogeneities are principally caused by the use of (1) different data models, such as one source that adopts the relational data model and a different source that adopts the XML data model, and (2) different data representations, such as one source that stores addresses as one single field and another source that stores addresses with separate fields for street, civic number, and city. Instance-level heterogeneities are caused by different, conflicting data values provided by distinct sources for the same objects. This type of heterogeneity can be caused by quality errors, such as accuracy, completeness, currency, and consistency errors; such errors may result, for instance, from independent processes that feed the different data sources.

Carlo Batini, Monica Scannapieco
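
The schema and instance-level heterogeneities described in the Chapter 10 abstract can be made concrete with a small sketch. It assumes two hypothetical sources: one stores the address as a single field, the other uses separate fields for street, civic number, and city. All field names, function names, and sample values are invented for illustration and are not taken from the book.

```python
def normalize_source_b(record):
    """Map source B's separate address fields onto source A's single-field layout.

    This resolves the (assumed) schema heterogeneity between the two sources.
    """
    address = f"{record['street']} {record['civic_number']}, {record['city']}"
    return {"id": record["id"], "address": address}


def find_conflicts(records_a, records_b):
    """Return (id, address_a, address_b) for objects whose values disagree.

    After schema alignment, any remaining difference is an instance-level
    heterogeneity: two sources give conflicting values for the same object.
    """
    by_id = {rec["id"]: rec["address"] for rec in records_a}
    conflicts = []
    for rec in map(normalize_source_b, records_b):
        if rec["id"] in by_id and by_id[rec["id"]] != rec["address"]:
            conflicts.append((rec["id"], by_id[rec["id"]], rec["address"]))
    return conflicts


# Invented sample data: object 2 has a different civic number in each source.
source_a = [
    {"id": 1, "address": "Via Roma 10, Milano"},
    {"id": 2, "address": "Via Dante 5, Roma"},
]
source_b = [
    {"id": 1, "street": "Via Roma", "civic_number": "10", "city": "Milano"},
    {"id": 2, "street": "Via Dante", "civic_number": "7", "city": "Roma"},
]

conflicts = find_conflicts(source_a, source_b)
print(conflicts)
```

The sketch only detects the conflict; deciding which of the two values is correct is a separate problem that it does not attempt.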

Chapter 11. Information Quality in Use

Abstract

We have seen in the Preface that the amount of information exchanged on the Web doubles every year and a half. Besides the Web, to get a complete picture of the multitude of information used every day, we have to consider the information managed in the information systems of organizations, the information exchanged by organizations, and the information used in everyday life by all of us. Organizations and individuals make use of information for different purposes; among these, we are interested in those related to (1) making decisions and (2) taking actions. Decisions and actions in organizations differ in nature according to the type of organization: in public administrations, they are the result of administrative processes, which are executed to provide services to citizens and communities; for private companies, they are the result of business processes, which produce goods or services to be sold in the market.

Carlo Batini, Monica Scannapieco

Chapter 12. Methodologies for Information Quality Assessment and Improvement

Abstract

Measuring and improving information quality in a single organization or in a set of cooperating organizations is a complex task. In previous chapters, we discussed relevant activities for improving information quality (Chap. 7) and corresponding techniques (Chaps. 7–10). Several methodologies have been developed in the last few years that provide a rationale for the optimal choice of such activities and techniques. In this chapter, we discuss methodologies proposed in the research and professional literature for information quality assessment and improvement from multiple perspectives.

Carlo Batini, Monica Scannapieco

Chapter 13. Information Quality in Healthcare

Abstract

In this chapter, we will briefly frame information quality in healthcare as a matter of study or concern. Being aware that such a vast topic cannot be covered in one single book chapter, here we will at least orient interested readers to resources that could be consulted to get further information on this broad field of study and practice. To this aim, we will proceed as follows: firstly, we will define the kind of data or information whose quality is under consideration and possibly at stake; then, we will try to convey the importance of focusing on this area of interest within the broader information quality field; lastly, we will try to consider how health practitioners see this area and how this can inform programs of quality assessment and improvement from a practice-oriented perspective. Short conclusions will summarize the main points outlined in this chapter. The chapter is organized as follows: in Sect. 13.2, we will recall some of the oft-mentioned definitions of the concepts related to the heading of this chapter. In Sect. 13.3, we outline the main challenges that the healthcare domain poses to those willing to address the task of improving the related information, while Sect. 13.4 provides the core notions to orient those practitioners by extracting from the relevant literature references to the main dimensions, methodologies, and initiatives where those methods and the related techniques have been applied with some success. Section 13.5 discusses the most recent trends in research on information quality in healthcare. Finally, Sect. 13.6 aims to motivate serious practitioners and scholars to devote more effort to the development of further tools and techniques, given the clear impact that IQ can have on health outcomes, costs, and the long-term sustainability of healthcare.

Federico Cabitza, Carlo Batini

Chapter 14. Quality of Web Data and Quality of Big Data: Open Problems

Abstract

In this chapter we discuss some open issues related to two typologies of information sources that nowadays are particularly significant, namely, Web data and Big Data.

Monica Scannapieco, Laure Berti

Erratum to: Data and Information Quality: Dimensions, Principles and Techniques

Carlo Batini, Monica Scannapieco

Backmatter

Title: Data and Information Quality
Authors: Carlo Batini, Monica Scannapieco
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-24106-7
Print ISBN: 978-3-319-24104-3
DOI: https://doi.org/10.1007/978-3-319-24106-7