
2013 | Book

Handbook of Data Quality

Research and Practice


About this book

The issue of data quality is as old as data itself. However, the proliferation of diverse, large-scale and often publicly available data on the Web has increased the risk of poor data quality and misleading data interpretations. At the same time, data is now exposed at a much more strategic level, e.g. through business intelligence systems, greatly increasing the stakes for individuals, corporations and government agencies alike. In this setting, a lack of knowledge about data accuracy, currency or completeness can lead to erroneous and even catastrophic results.

With these changes, traditional approaches to data management in general, and data quality control specifically, are challenged. There is an evident need to incorporate data quality considerations into the whole data cycle, encompassing managerial/governance as well as technical aspects.

Data quality experts from research and industry agree that a unified framework for data quality management should bring together organizational, architectural and computational approaches. Accordingly, Sadiq structured this handbook into four parts: Part I is on organizational solutions, i.e. the development of data quality objectives for the organization, and the development of strategies to establish roles, processes, policies, and standards required to manage and ensure data quality. Part II, on architectural solutions, covers the technology landscape required to deploy developed data quality management processes, standards and policies. Part III, on computational solutions, presents effective and efficient tools and techniques related to record linkage, lineage and provenance, data uncertainty, and advanced integrity constraints. Finally, Part IV is devoted to case studies of successful data quality initiatives that highlight the various aspects of data quality in action. The individual chapters present both an overview of the respective topic, covering its history in research and/or practice and the state of the art, and the specific techniques, methodologies and frameworks developed by the individual contributors.

Researchers and students of computer science, information systems, or business management, as well as data professionals and practitioners, will benefit most from this handbook not only by focusing on the sections relevant to their research area or practical work, but also by studying chapters that may at first seem less directly relevant, where they will find new perspectives and approaches.

Table of contents

Frontmatter
Prologue: Research and Practice in Data Quality Management
Abstract
This handbook is motivated by the presence of diverse communities within the area of data quality management, which have individually contributed a wealth of knowledge on data quality research and practice. The chapter presents a snapshot of these contributions from both research and practice, and highlights the background and rationale for the handbook.
Shazia Sadiq
Epilogue: The Data Quality Profession
Abstract
In this final chapter, we will discuss four significant topics concerning the data quality profession. First, we will examine how the data quality profession has evolved. Second, we will explore what it means to be a data quality professional. Third, we will review the training opportunities currently available to those interested in becoming a data quality professional, and finally, we will assess the outlook for the future of the data quality profession. Throughout this chapter we will use the terms “data” and “information” interchangeably.
Elizabeth Pierce, John Talburt, C. Lwanga Yonke

Organizational Aspects of Data Quality

Frontmatter
Data Quality Management Past, Present, and Future: Towards a Management System for Data
Abstract
This chapter provides a prospective look at the "big research issues" in data quality. It is based on 25 years' experience, most of it as a practitioner; early work with a terrific team of researchers and business people at Bell Labs and AT&T; constant reflection on the meanings and methods of quality, the strange and wondrous properties of data, the importance of data and data quality in markets and companies, and the underlying reasons that some enterprises make rapid progress and others fall flat; and interactions with most of the leading companies, practitioners, and researchers.
Thomas C. Redman
Data Quality Projects and Programs
Abstract
Projects and programs are two fundamental ways of putting data quality into practice. A data quality (DQ) project includes a plan of work with clear beginning and end points and specific deliverables, and uses data quality activities, methods, tools, and techniques to address a particular business issue. A data quality program, on the other hand, often spearheaded by an initial project, ensures that data quality continues to be put into practice over the long term. This chapter focuses on the components necessary for successful data quality projects and programs and introduces various frameworks to illustrate these components, including the Ten Steps to Quality Data and Trusted Information™ methodology (Ten Steps™). A discussion of two companies, one housing a mature data quality program and the other a more recent “DQ start-up” initiative, shows two examples of how data quality components and frameworks were applied to meet their organizations’ specific needs, environments, and cultures. Readers should come away from the chapter understanding the foundation behind the execution of data quality projects and the development of data quality programs, and with ideas for incorporating data quality work into their own organizations.
Danette McGilvray
Cost and Value Management for Data Quality
Abstract
The cost and value of data quality have been discussed in numerous articles; however, suitable and rigorous cost measures and approaches to estimating value are rare and indeed difficult to develop. At the same time, the cost and value of data quality have become a critical concern for the success of organizations: numerous business initiatives have been delayed or even cancelled, citing poor-quality data as the main concern. Previous research and practice have indicated that understanding the cost and value of data quality is a critical step towards the success of information systems. This chapter provides an overview of cost and value issues related to data quality, including the identification, classification and taxonomy of data quality costs and value, as well as an evaluation framework and analysis model. Furthermore, this chapter provides guidelines for cost and value analysis related to data quality.
Mouzhi Ge, Markus Helfert
On the Evolution of Data Governance in Firms: The Case of Johnson & Johnson Consumer Products North America
Abstract
Data Governance defines decision-making rights for company-wide use of data. The topic has received increased attention in both the scientific and the practitioners’ communities, as the quality of data is increasingly considered a key prerequisite for meeting a number of strategic business requirements, such as compliance with a growing number of legal provisions or the pursuit of a customer-centric business model. While first results addressing Data Governance arrangements can be found in the literature, no studies have so far been published investigating the evolution of Data Governance over time. Drawing on theory about organizational capabilities, the chapter assumes that Data Governance can be considered a dynamic capability and that the concept of capability lifecycles can be applied. A single-case study conducted at Johnson & Johnson Consumer Products, North America, is presented to explore how Data Governance evolves over time; its effectiveness is measured as the ratio of the number of preventive data quality management (DQM) measures to the total number of DQM measures. The findings suggest that Data Governance can in fact be seen as a dynamic capability and that its effectiveness evolves according to a lifecycle curve. Furthermore, the chapter discusses a maturity model which can be used as an instrument to manage and monitor this evolution.
Boris Otto

Architectural Aspects of Data Quality

Frontmatter
Data Warehouse Quality: Summary and Outlook
Abstract
Data warehouses correlate data from various sources to enable reporting, data mining, and decision support. Some of the unique features of data warehouses (as compared to transactional databases) include data integration from multiple sources and emphasis on temporal, historical, and multidimensional data. In this chapter, we survey data warehouse quality problems and solutions, including data freshness (ensuring that materialized views are up to date as new data arrive over time), data completeness (capturing all the required history), data correctness (as defined by various types of integrity constraints, including those which govern how data may evolve over time), consistency, error detection and profiling, and distributed data quality.
Lukasz Golab
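
As a small illustration of the freshness dimension surveyed above, the following sketch (not taken from the chapter; the view names, timestamps and bounds are hypothetical) checks whether materialized views have been refreshed within a configured staleness bound.

    from datetime import datetime, timedelta

    # Hypothetical refresh log: view name -> time of last successful refresh.
    last_refresh = {
        "sales_by_region": datetime(2013, 5, 1, 6, 0),
        "daily_shipments": datetime(2013, 5, 1, 2, 30),
    }

    # Freshness requirement per view (maximum tolerated staleness).
    staleness_bound = {
        "sales_by_region": timedelta(hours=2),
        "daily_shipments": timedelta(hours=12),
    }

    def stale_views(now):
        """Return the views whose last refresh is older than their bound."""
        return [v for v, t in last_refresh.items() if now - t > staleness_bound[v]]

    print(stale_views(datetime(2013, 5, 1, 12, 0)))   # ['sales_by_region']

In a real warehouse the refresh log would come from the ETL scheduler and the bounds from per-view service-level agreements; the chapter treats freshness far more generally, including views maintained incrementally as new data arrive.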
Using Semantic Web Technologies for Data Quality Management
Abstract
In the past decade, the World Wide Web has started to evolve from a Web that provides human-interpretable information to a Semantic Web that provides data that can also be processed by machines. With this evolution, several useful technologies for knowledge capturing, representation, and processing have been developed which can be used in large-scale environments. Semantic Web technologies may be able to shift current data quality management technology to the next level. In this chapter, we discuss how Semantic Web technologies can be employed to improve information quality. In particular, we outline their application for (1) data requirements and metadata management, (2) data quality monitoring, (3) data quality assessment, (4) validation of data entries, (5) as reference data, and (6) in the area of content integration.
Christian Fürber, Martin Hepp
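
To make the data quality monitoring use case above concrete, here is a minimal sketch using the rdflib Python library: a SPARQL query flags product resources that lack a required identifier. The vocabulary, data and property names are invented for the example and are not taken from the chapter.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")

    # Tiny example graph; in practice this would be loaded from an RDF store.
    data = """
    @prefix ex: <http://example.org/> .
    ex:p1 a ex:Product ; ex:gtin "04012345678901" .
    ex:p2 a ex:Product .
    """

    g = Graph()
    g.parse(data=data, format="turtle")

    # Data quality rule: every product must carry a GTIN identifier.
    query = """
    SELECT ?product WHERE {
      ?product a ex:Product .
      FILTER NOT EXISTS { ?product ex:gtin ?id }
    }
    """

    for row in g.query(query, initNs={"ex": EX}):
        print("Missing GTIN:", row.product)   # reports ex:p2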
Data Glitches: Monsters in Your Data
Abstract
Data types and data structures are becoming increasingly complex as they keep pace with evolving technologies and applications. The result is an increase in the number and complexity of data quality problems. Data glitches, a common name for data quality problems, can be simple and stand alone, or highly complex with spatial and temporal correlations. In this chapter, we provide an overview of a comprehensive and measurable data quality process. To begin, we define and classify complex glitch types, and describe detection and cleaning techniques. We present metrics for assessing data quality and for choosing cleaning strategies subject to a variety of considerations. The process culminates in a “clean” data set that is acceptable to the end user. We conclude with an overview of significant literature in this area, and a discussion of opportunities for practice, application, and further research.
Tamraparni Dasu
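
A minimal sketch of the kind of glitch detection step described above, using pandas (the table, columns and thresholds are illustrative and not from the chapter): it profiles a small data set for missing values, exact duplicate records and crude numeric outliers.

    import pandas as pd

    # Illustrative data with a few injected glitches.
    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 3, 4],
        "age":         [34, 29, 29, 51, 430],          # 430 is an outlier
        "state":       ["QLD", "NSW", "NSW", "VIC", None],
    })

    # 1. Missing values per column.
    print(df.isna().sum())

    # 2. Exact duplicate rows (a simple, stand-alone glitch).
    print(df[df.duplicated(keep=False)])

    # 3. Crude outlier check: values far outside the interquartile range.
    q1, q3 = df["age"].quantile([0.25, 0.75])
    iqr = q3 - q1
    print(df[(df["age"] < q1 - 3 * iqr) | (df["age"] > q3 + 3 * iqr)])

Real glitches are often correlated in space and time, which is exactly where the chapter's classification and metrics go beyond simple stand-alone checks like these.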

Computational Aspects of Data Quality

Frontmatter
Generic and Declarative Approaches to Data Quality Management
Abstract
Data quality assessment and data cleaning tasks have traditionally been addressed through procedural solutions. Most of the time, those solutions have been applicable to specific problems and domains. In the last few years we have seen the emergence of more generic solutions, and also of declarative and rule-based specifications of the intended solutions of data cleaning processes. In this chapter we review some of those recent developments.
Leopoldo Bertossi, Loreto Bravo
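
As a small, hedged illustration of the declarative idea (a generic sketch, not the formalism used in the chapter), the snippet below encodes the functional dependency zip -> city as data and checks a relation against it, rather than hard-coding the check procedurally; new rules can then be added without changing the checking code.

    from collections import defaultdict

    # Relation instance: (name, zip, city).
    customers = [
        ("Ann",  "4000", "Brisbane"),
        ("Bob",  "4000", "Brisbane"),
        ("Carl", "4000", "Ipswich"),   # violates zip -> city
    ]

    # Declarative rule: the left-hand-side attributes determine the right-hand side.
    fd = {"lhs": (1,), "rhs": (2,)}    # column indices: zip -> city

    def fd_violations(rows, rule):
        """Group tuples by the LHS and report groups with conflicting RHS values."""
        groups = defaultdict(set)
        for row in rows:
            lhs = tuple(row[i] for i in rule["lhs"])
            rhs = tuple(row[i] for i in rule["rhs"])
            groups[lhs].add(rhs)
        return {lhs: rhs for lhs, rhs in groups.items() if len(rhs) > 1}

    print(fd_violations(customers, fd))   # {('4000',): {('Brisbane',), ('Ipswich',)}}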
Linking Records in Complex Context
Abstract
There are different kinds of information present in a data set that can be utilized for record linkage activities: attributes, context, relationships, etc. In this chapter, we focus on techniques that enable record linkage in so-called complex context, which includes data sets with hierarchical relations, data sets that contain temporal information, and data sets that are extracted from the Web. For each method, we describe the problem to be solved and use a motivating example to demonstrate the challenges and intuitions of the work. We then present an overview of the approaches, followed by a more detailed explanation of some key ideas, together with examples.
Pei Li, Andrea Maurino
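
For contrast with the complex-context techniques surveyed above, here is a minimal attribute-only baseline (names, fields and the threshold are invented): candidate record pairs are scored by token overlap, the kind of simple matching that context-aware methods improve upon.

    from itertools import combinations

    records = [
        {"id": 1, "name": "Wei Wang", "affiliation": "Univ. of New South Wales"},
        {"id": 2, "name": "W. Wang",  "affiliation": "University of New South Wales"},
        {"id": 3, "name": "Wei Wang", "affiliation": "Fudan University"},
    ]

    def tokens(record):
        """Lower-cased word tokens from the descriptive attributes."""
        text = " ".join(str(record[k]) for k in ("name", "affiliation"))
        return set(text.lower().replace(".", "").split())

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # Score every candidate pair; pairs above the (illustrative) threshold match.
    for r1, r2 in combinations(records, 2):
        score = jaccard(tokens(r1), tokens(r2))
        if score >= 0.5:
            print(r1["id"], "matches", r2["id"], f"(score={score:.2f})")

Note that records 1 and 3 share a name but refer to different people, an ambiguity this baseline cannot resolve; relationships, hierarchy and temporal context, as discussed in the chapter, supply the missing evidence.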
A Practical Guide to Entity Resolution with OYSTER
Abstract
This chapter discusses the concepts and methods of entity resolution (ER) and how they can be applied in practice to eliminate redundant data records and support master data management programs. The chapter is organized into two main parts. The first part discusses the components of ER, with particular emphasis on approximate matching algorithms and the activities that comprise identity information management. The second part provides a step-by-step guide to building an ER process, including data profiling, data preparation, identity attribute selection, rule development, ER algorithm considerations, deciding on an identity management strategy, results analysis, and rule refinement. Each step in the process is illustrated with an actual example using OYSTER, an open-source entity resolution system.
John R. Talburt, Yinle Zhou
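
OYSTER itself is configured through its own run scripts and rule definitions, so the sketch below is only a generic illustration in Python (with invented attributes and rules) of the match-and-merge idea behind rule-based entity resolution: records that satisfy any match rule are assigned to the same identity via union-find.

    # Generic rule-based entity resolution sketch (not OYSTER's actual input format).
    records = [
        {"rid": "r1", "ssn": "123-45-6789", "email": "jsmith@example.com"},
        {"rid": "r2", "ssn": "123-45-6789", "email": "john.smith@work.example"},
        {"rid": "r3", "ssn": None,          "email": "john.smith@work.example"},
        {"rid": "r4", "ssn": "999-99-9999", "email": "other@example.com"},
    ]

    # Each rule lists attributes that must agree exactly (and be non-null).
    match_rules = [("ssn",), ("email",)]

    parent = {r["rid"]: r["rid"] for r in records}   # union-find structure

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]            # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    def matches(a, b, rule):
        return all(a[k] is not None and a[k] == b[k] for k in rule)

    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if any(matches(a, b, rule) for rule in match_rules):
                union(a["rid"], b["rid"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["rid"]), []).append(r["rid"])
    print(list(clusters.values()))   # [['r1', 'r2', 'r3'], ['r4']]

The transitive grouping of r1, r2 and r3 (linked through different rules) mirrors the identity management questions the chapter walks through, such as when to merge records and how to persist identities across runs.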
Managing Quality of Probabilistic Databases
Abstract
Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to “clean” the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries) and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are also examined.
Reynold Cheng
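
The PWS-quality metric is defined precisely in the chapter; as a loose illustration of the underlying intuition, that quality is higher when the query answer is less ambiguous across possible worlds, the sketch below computes the entropy of a hypothetical answer distribution before and after cleaning one object (all numbers invented).

    import math

    # Hypothetical distribution: probability that each tuple answers a MAX query.
    before = {"t1": 0.50, "t2": 0.30, "t3": 0.20}

    def entropy(dist):
        """Shannon entropy (in bits) of a discrete answer distribution."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    print(f"ambiguity before cleaning: {entropy(before):.3f} bits")

    # Suppose probing t3 resolves its value and redistributes probability mass
    # (illustrative numbers): the answer becomes much less ambiguous.
    after = {"t1": 0.85, "t2": 0.15, "t3": 0.0}
    print(f"ambiguity after cleaning:  {entropy(after):.3f} bits")

Choosing which objects to probe so that this kind of ambiguity drops the most, within a limited cleaning budget, is the optimization problem the chapter addresses.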
Data Fusion: Resolving Conflicts from Multiple Sources
Abstract
Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values, and different sources can often provide conflicting values. To present quality data to users, it is critical to resolve conflicts and discover values that reflect the real world; this task is called data fusion. Typically, we expect a true value to be provided by more sources than any particular false one, so we can take the value provided by the largest number of sources as the truth. Unfortunately, a false value can be spread through copying and that makes truth discovery extremely tricky. In this chapter, we consider how to find true values from conflicting information when there are a large number of sources, among which some may copy from others. We describe a novel approach that considers copying between data sources in truth discovery. Intuitively, if two data sources provide a large number of common values and many of these values are unlikely to be provided by other sources (e.g., particular false values), it is very likely that one copies from the other. We apply Bayesian analysis to decide copying between sources and design an algorithm that iteratively detects dependence and discovers truth from conflicting information. We also consider accuracy of data sources and similarity between values in fusion to further improve the results. We present a case study on real-world data showing that the described algorithm can significantly improve accuracy of truth discovery and is scalable when there are a large number of data sources.
Xin Luna Dong, Laure Berti-Equille, Divesh Srivastava
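
For a concrete starting point, here is a minimal voting-only baseline for the truth-discovery setting described above (sources and values are made up); it deliberately ignores source accuracy and copying, which is precisely what the chapter's Bayesian, copy-aware algorithm adds.

    from collections import Counter

    # Claims made by different sources about one person's affiliation.
    claims = {
        "source_A": "UW",
        "source_B": "UW",
        "source_C": "MSR",
        "source_D": "UW",    # D may simply copy A; naive voting cannot tell
        "source_E": "MSR",
    }

    def naive_vote(source_claims):
        """Pick the value asserted by the largest number of sources."""
        value, votes = Counter(source_claims.values()).most_common(1)[0]
        return value, votes

    value, votes = naive_vote(claims)
    print(f"fused value: {value} ({votes} of {len(claims)} sources)")

If source_D merely copies source_A, the effective support for "UW" is weaker than the raw count suggests; detecting such copying and weighting sources by accuracy is the chapter's central contribution.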

Data Quality in Action

Frontmatter
Ensuring the Quality of Health Information: The Canadian Experience
Abstract
High-quality health information is critical for quality health care and for effective and efficient management of the health care system. This chapter highlights the Canadian experience in the capture and use of health information including a brief introduction to the Canadian health care system and the Canadian Institute for Health Information (CIHI); an overview of CIHI strategies and programs that support quality health information for key stakeholders, including clinicians, health system managers and policymakers; and four case studies from across the health care continuum illustrating data quality strategies in action—prevention, monitoring, feedback and continuous improvement. This chapter concludes with a discussion of two key data quality opportunities for CIHI—the movement toward interoperable electronic health records and data integration. Further information on CIHI, its data quality program and the case studies in this chapter may be found at www.cihi.ca.
Heather Richards, Nancy White
Shell’s Global Data Quality Journey
Abstract
The importance of high-quality data has long been recognised in Shell. During the 1990s, a major data management programme developed sound data management practices that were successfully implemented in a number of operating units (OUs). However, each OU adapted the tools and techniques to its own environment, and a unified global approach to data quality remained elusive until the new millennium. This chapter describes Shell’s global data quality journey from the early part of the millennium to the present.
Ken Self
Creating an Information-Centric Organisation Culture at SBI General Insurance
Abstract
For an insurance business, data is its lifeblood. It drives most, if not all, significant decisions, including product design, pricing and marketing. Without ‘good’ data, an insurance business is almost blind. No matter how smart and efficient a business’s processes are, how advanced, savvy and solid the IT systems that support those processes are, and how capable and skilful the staff who use the processes and technology are, if the underlying data and information that these processes, technology and people rely on are not good enough in terms of quality and integrity, outcomes such as effective and efficient decision-making will be poor. The strategy of a business should recognise that information assets, supporting technology, business processes and people need to be coordinated and managed effectively. This chapter presents an award-winning case study of a general insurance business, but its lessons apply equally to most businesses, regardless of industry. It tells the story of how the organisation created an information-centric culture by bringing four objectives relating to technology, business process, people and information together in a collaborative manner.
Ram Kumar, Robert Logie
Backmatter
Metadata
Title
Handbook of Data Quality
edited by
Shazia Sadiq
Copyright Year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-36257-6
Print ISBN
978-3-642-36256-9
DOI
https://doi.org/10.1007/978-3-642-36257-6