ABSTRACT
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized joins operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
- Shark (Hive on Spark). https://github.com/amplab/shark.Google Scholar
- TPC-H benchmark version 2.14.4. http://www.tpc.org/tpch/.Google Scholar
- C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006. Google ScholarDigital Library
- G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3:197--207, 2010. Google ScholarDigital Library
- P. Bohannon, M. Flaster, W. Fan, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005. Google ScholarDigital Library
- X. Chu, I. F. Ilyas, and P. Papotti. Holistic Data Cleaning: Putting Violations into Context. In ICDE, 2013.Google ScholarDigital Library
- M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: A Commodity Data Cleaning System. In SIGMOD, 2013. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107--113, 2008. Google ScholarDigital Library
- D. J. DeWitt, J. F. Naughton, and D. A. Schneider. An evaluation of non-equijoin algorithms. In VLDB, 1991. Google ScholarDigital Library
- W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool, 2012. Google ScholarDigital Library
- W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional Functional Dependencies for Capturing Data Inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):6:1--6:48, 2008. Google ScholarDigital Library
- W. Fan, F. Geerts, N. Tang, and W. Yu. Conflict resolution with data currency and consistency. J. Data and Information Quality, 5(1--2):6, 2014. Google ScholarDigital Library
- W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. In SIGMOD, 2011. Google ScholarDigital Library
- W. Fan, J. Li, N. Tang, and W. Yu. Incremental Detection of Inconsistencies in Distributed Data. In ICDE, 2012. Google ScholarDigital Library
- I. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. J. American Statistical Association, 71(353), 1976.Google ScholarCross Ref
- T. Friedman. Magic quadrant for data quality tools. http://www.gartner.com/, 2013.Google Scholar
- F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC Data-Cleaning Framework. PVLDB, 6(9):625--636, 2013. Google ScholarDigital Library
- F. Geerts, G. Mecca, P. Papotti, and D. Santoro. Mapping and Cleaning. In ICDE, 2014.Google ScholarCross Ref
- M. Interlandi and N. Tang. Proof positive and negative in data cleaning. In ICDE, 2015.Google ScholarCross Ref
- E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for mapreduce programs. PVLDB, 4(6):385--396, 2011. Google ScholarDigital Library
- A. Jindal, J.-A. Quiané-Ruiz, and S. Madden. Cartilage: Adding Flexibility to the Hadoop Skeleton. In SIGMOD, 2013. Google ScholarDigital Library
- G. Karypis and V. Kumar. Multilevel K-way Hypergraph Partitioning. In Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, DAC. ACM, 1999. Google ScholarDigital Library
- S. Kolahi and L. V. S. Lakshmanan. On Approximating Optimum Repairs for Functional Dependency Violations. In ICDT, 2009. Google ScholarDigital Library
- L. Kolb, A. Thor, and E. Rahm. Dedoop: Efficient Deduplication with Hadoop. PVLDB, 2012. Google ScholarDigital Library
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-scale Graph Processing. In SIGMOD, 2010. Google ScholarDigital Library
- C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD, 2010. Google ScholarDigital Library
- A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, 2011. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In SIGMOD, 2008. Google ScholarDigital Library
- E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google Scholar
- G. Smith. PostgreSQL 9.0 High Performance: Accelerate your PostgreSQL System and Avoid the Common Pitfalls that Can Slow it Down. Packt Publishing, 2010.Google Scholar
- N. Swartz. Gartner warns firms of 'dirty data'. Information Management Journal, 41(3), 2007.Google Scholar
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A Warehousing Solution over a Map-reduce Framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarDigital Library
- Y. Tong, C. C. Cao, C. J. Zhang, Y. Li, and L. Chen. CrowdCleaner: Data cleaning for multi-version data on the web via crowdsourcing. In ICDE, 2014.Google ScholarCross Ref
- M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. Continuous data cleaning. In ICDE, 2014.Google ScholarCross Ref
- J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data. In SIGMOD, 2014. Google ScholarDigital Library
- J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, 2014. Google ScholarDigital Library
- R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica. GraphX: A Resilient Distributed Graph System on Spark. In First International Workshop on Graph Data Management Experiences and Systems, GRADES. ACM, 2013. Google ScholarDigital Library
- R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale. In SIGMOD, 2013. Google ScholarDigital Library
- M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD, 2013. Google ScholarDigital Library
- M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 2011. Google ScholarDigital Library
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, 2010. Google ScholarDigital Library
Index Terms
- BigDansing: A System for Big Data Cleansing
Recommendations
Using the uni-level description (ULD) to support data-model interoperability
Special issue: ER 2003We describe a framework called the Uni-Level Description (ULD) for accurately representing information from a broad range of data models. The ULD extends previous meta-data-model approaches by: (a) providing uniform representation and access to data ...
Documenting database usages and schema constraints in database-centric applications
ISSTA 2016: Proceedings of the 25th International Symposium on Software Testing and AnalysisDatabase-centric applications (DCAs) usually rely on database operations over a large number of tables and attributes. Understanding how database tables and attributes are used to implement features in DCAs along with the constraints related to these ...
Comments