research-article

BigDansing: A System for Big Data Cleansing

Authors:
Zuhair Khayyat, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Ihab F. Ilyas, University of Waterloo, Waterloo, Canada
Alekh Jindal, MIT, Cambridge, MA, USA
Samuel Madden, MIT, Cambridge, MA, USA
Mourad Ouzzani, Qatar Computing Research Institute, Doha, Qatar
Paolo Papotti, Qatar Computing Research Institute, Doha, Qatar
Jorge-Arnulfo Quiané-Ruiz, Qatar Computing Research Institute, Doha, Qatar
Nan Tang, Qatar Computing Research Institute, Doha, Qatar
Si Yin, Qatar Computing Research Institute, Doha, Qatar

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, Pages 1215–1230. https://doi.org/10.1145/2723372.2747646

Published: 27 May 2015

ABSTRACT

Data cleansing approaches have usually focused on detecting and fixing errors, with little attention to scaling to big datasets. This presents a serious impediment, since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system that tackles efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, without needing to be aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
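
To make the cost of "enumerating pairs of tuples" concrete, the following is a minimal, single-machine sketch of violation detection for a functional dependency such as zipcode → city. It is not BigDansing's interface or implementation: the function name `detect_fd_violations`, the sample rows, and the choice of rule are all assumptions made for illustration. It groups rows by the left-hand-side attribute so that only rows within the same group are compared, rather than every pair in the table.

```python
from collections import defaultdict
from itertools import combinations


def detect_fd_violations(rows, lhs, rhs):
    """Return pairs of row indices that agree on `lhs` but differ on `rhs`.

    Hypothetical helper for illustration only; not part of BigDansing.
    """
    # Group row indices by the left-hand-side value, so comparisons are
    # limited to rows that could actually violate the dependency.
    blocks = defaultdict(list)
    for row_id, row in enumerate(rows):
        blocks[row[lhs]].append(row_id)

    violations = []
    for ids in blocks.values():
        # Pairwise comparison happens only inside each block.
        for a, b in combinations(ids, 2):
            if rows[a][rhs] != rows[b][rhs]:
                violations.append((a, b))
    return violations


# Made-up sample data: rows 0 and 1 share a zipcode but disagree on city.
rows = [
    {"zipcode": "10001", "city": "New York"},
    {"zipcode": "10001", "city": "NYC"},
    {"zipcode": "60601", "city": "Chicago"},
    {"zipcode": "60601", "city": "Chicago"},
]
violations = detect_fd_violations(rows, "zipcode", "city")
print(violations)  # [(0, 1)]
```

Grouping by an equality attribute only helps for rules with an equality condition; rules built on inequality conditions have no such grouping key, which is one reason the abstract singles out inequality joins as costly.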

Index Terms

BigDansing: A System for Big Data Cleansing
1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
General Chair: Timos Sellis, RMIT University, Australia
Program Chairs: Susan B. Davidson, University of Pennsylvania, USA; Zack Ives, University of Pennsylvania, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cleansing abstraction
distributed data cleansing
distributed data repair
schema constraints
Qualifiers
- research-article
Acceptance Rates
SIGMOD '15 Paper Acceptance Rate: 106 of 415 submissions, 26%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 133
- Total Downloads: 1,489
- Downloads (Last 12 months): 78
- Downloads (Last 6 weeks): 16