Wisteria: nurturing scalable data cleaning infrastructure

Authors:
Daniel Haas, UC Berkeley
Sanjay Krishnan, UC Berkeley
Jiannan Wang, UC Berkeley
Michael J. Franklin, UC Berkeley
Eugene Wu, Columbia University

Proceedings of the VLDB Endowment, Volume 8, Issue 12, pp. 2004–2007. https://doi.org/10.14778/2824032.2824122

Published: 01 August 2015

Abstract

Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine the analysis with more sophisticated or expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to a full dataset. While an analyst often knows at a logical level what operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations and, driven by analyst feedback, suggests optimizations of, or replacements for, the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We give an overview of the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
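
The separation of logical operations from physical implementations described in the abstract can be illustrated with a small sketch. This is not Wisteria's API: the class `LogicalOp`, the functions `dedup_exact` and `dedup_fuzzy`, the 0.85 similarity threshold, and the sample records are all hypothetical names and values chosen for illustration. The sketch only shows the general idea of running one logical step (deduplication) on a small sample and then swapping its physical implementation without changing the rest of the workflow.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List

Records = List[str]
PhysicalOp = Callable[[Records], Records]


def _normalize(record: str) -> str:
    """Lowercase and trim a record before comparing it to others."""
    return record.strip().lower()


def dedup_exact(records: Records) -> Records:
    """Hypothetical physical operator: drop exact duplicates after normalization."""
    seen = set()
    kept: Records = []
    for record in records:
        key = _normalize(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


def dedup_fuzzy(records: Records, threshold: float = 0.85) -> Records:
    """Hypothetical physical operator: drop a record if it is at least
    `threshold` similar to a record that was already kept."""
    kept: Records = []
    for record in records:
        norm = _normalize(record)
        is_duplicate = any(
            SequenceMatcher(None, norm, _normalize(k)).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


@dataclass
class LogicalOp:
    """A logical cleaning step bound to one of several physical implementations."""
    name: str
    implementations: Dict[str, PhysicalOp]
    chosen: str

    def run(self, records: Records) -> Records:
        return self.implementations[self.chosen](records)

    def replace(self, impl_name: str) -> None:
        """Swap the physical implementation; the logical step stays the same."""
        if impl_name not in self.implementations:
            raise KeyError(f"unknown implementation: {impl_name}")
        self.chosen = impl_name


# Illustrative sample of dirty records (made up for this sketch).
sample = ["UC Berkeley", "uc berkeley ", "U.C. Berkeley", "Columbia University"]

dedup = LogicalOp(
    name="deduplicate",
    implementations={"exact": dedup_exact, "fuzzy": dedup_fuzzy},
    chosen="exact",
)
first_pass = dedup.run(sample)   # cheap implementation on the sample
dedup.replace("fuzzy")           # analyst is unhappy with the result and swaps
second_pass = dedup.run(sample)  # same logical step, different physical operator
```

In this sketch the exact-match operator leaves "U.C. Berkeley" in the output, while the fuzzy operator merges it with "UC Berkeley". A system of the kind the abstract describes would suggest such a replacement rather than leave the choice entirely to the analyst, and a crowd-based operator could be registered as a third implementation in the same way.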

Index Terms

Wisteria: nurturing scalable data cleaning infrastructure
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information systems applications

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN:2150-8097
Publisher
VLDB Endowment
Qualifiers
- research-article