research-article
DOI: 10.1145/2723372.2747646

BigDansing: A System for Big Data Cleansing

Published: 27 May 2015

ABSTRACT

Data cleansing approaches have usually focused on detecting and fixing errors, with little attention to scaling to big datasets. This presents a serious impediment, since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system that tackles efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, without needing to be aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
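To make the cost of "enumerating pairs of tuples" concrete, the following is a minimal single-machine sketch of detecting violations of one data quality rule, a functional dependency such as zipcode → city. It is not BigDansing's interface or algorithm: the function name `fd_violations`, the dictionary-based rows, and the sample data are all assumptions made for illustration. The sketch groups rows by the left-hand-side attribute first, so that pairs are only enumerated within a group rather than across the whole dataset.

```python
from collections import defaultdict
from itertools import combinations


def fd_violations(rows, lhs, rhs):
    """Return pairs of row indices that agree on `lhs` but differ on `rhs`.

    Hypothetical helper for illustration only; not part of BigDansing.
    """
    # Block rows by their lhs value so that only rows in the same block
    # are ever compared, instead of enumerating all pairs of rows.
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        blocks[row[lhs]].append(idx)

    violations = []
    for indices in blocks.values():
        # Pairwise comparison is still quadratic, but only within a block.
        for i, j in combinations(indices, 2):
            if rows[i][rhs] != rows[j][rhs]:
                violations.append((i, j))
    return violations


# Made-up sample data: rows 0 and 1 share a zipcode but disagree on city.
rows = [
    {"zipcode": "10001", "city": "New York"},
    {"zipcode": "10001", "city": "NYC"},
    {"zipcode": "60601", "city": "Chicago"},
    {"zipcode": "60601", "city": "Chicago"},
]

violations = fd_violations(rows, "zipcode", "city")
print(violations)
```

Grouping on an equality attribute works for rules like functional dependencies. Rules with inequality conditions cannot be blocked this way, which is one reason the abstract singles out inequality joins as a costly case.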

Published in

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372

Copyright © 2015 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SIGMOD '15 paper acceptance rate: 106 of 415 submissions, 26%
Overall acceptance rate: 785 of 4,003 submissions, 20%
