research-article

Wisteria: nurturing scalable data cleaning infrastructure

Published: 01 August 2015
Abstract

Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine the analysis with more sophisticated or expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to the full dataset. While an analyst often knows at a logical level what operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations and, driven by analyst feedback, suggests optimizations and/or replacements to the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We give an overview of the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
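
To make the separation of logical operations from physical implementations concrete, here is a minimal Python sketch. It is not Wisteria's API: the names `LogicalOperator`, `exact_dedup`, `normalized_dedup` and the sample records are all hypothetical, chosen only to illustrate how an analyst might name an operation once (deduplicate) and later swap the physical implementation behind it after inspecting results on a small sample.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A physical operator is any function from a list of records to a cleaned list.
Physical = Callable[[List[str]], List[str]]


def exact_dedup(records: List[str]) -> List[str]:
    """Cheap physical operator: drop exact duplicates, keep first occurrence."""
    seen, out = set(), []
    for r in records:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out


def normalized_dedup(records: List[str]) -> List[str]:
    """Costlier physical operator: compare after lowercasing and collapsing whitespace."""
    seen, out = set(), []
    for r in records:
        key = " ".join(r.lower().split())
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


@dataclass
class LogicalOperator:
    """Hypothetical logical operation with interchangeable physical implementations."""
    name: str
    candidates: Dict[str, Physical]
    chosen: str

    def run(self, records: List[str]) -> List[str]:
        return self.candidates[self.chosen](records)

    def replace(self, new_choice: str) -> None:
        # Swap the physical implementation; the logical workflow is unchanged.
        if new_choice not in self.candidates:
            raise KeyError(new_choice)
        self.chosen = new_choice


# Small sample of dirty data (made up for illustration).
sample = ["Acme Corp", "acme  corp", "Acme Corp", "Globex"]

dedup = LogicalOperator(
    name="deduplicate",
    candidates={"exact": exact_dedup, "normalized": normalized_dedup},
    chosen="exact",
)

first_pass = dedup.run(sample)    # analyst inspects the result on the sample
dedup.replace("normalized")       # feedback: near-duplicates remain, so swap operators
second_pass = dedup.run(sample)
```

In this sketch the workflow only ever refers to the logical `deduplicate` step, so replacing the implementation does not require rewriting the pipeline. A real system would also have to weigh cost, sampling error, and crowd latency, none of which are modeled here.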

Published in

Proceedings of the VLDB Endowment, Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages

Publisher: VLDB Endowment