Wisteria: nurturing scalable data cleaning infrastructure

Abstract
Analysts report spending upwards of 80% of their time on data cleaning problems. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine the analysis with more sophisticated and expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to the full dataset. While an analyst often knows at a logical level what operations need to be performed, they must still manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations and, driven by analyst feedback, suggests optimizations and/or replacements to the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We give an overview of the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
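The separation of logical operations from physical implementations, and in-flight replacement of the latter, can be sketched in a few lines. This is a minimal illustration under our own assumptions, not Wisteria's actual API: all class and method names below (`LogicalDedup`, `ExactMatchDedup`, `SimilarityDedup`, `replace`) are hypothetical, and the "expensive" implementation is a toy Jaccard-similarity matcher standing in for an operator such as crowdsourced entity resolution.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch (names are ours, not Wisteria's): a logical
# operator declares WHAT to do; physical operators implement HOW.
class PhysicalDedup(ABC):
    @abstractmethod
    def run(self, records):
        ...

class ExactMatchDedup(PhysicalDedup):
    """Cheap implementation: drop only exact (case-insensitive) duplicates."""
    def run(self, records):
        seen, out = set(), []
        for r in records:
            key = r.lower().strip()
            if key not in seen:
                seen.add(key)
                out.append(r)
        return out

class SimilarityDedup(PhysicalDedup):
    """More expensive implementation: merge near-duplicates whose
    token sets have Jaccard similarity above a threshold. A stand-in
    for a sophisticated operator like crowdsourced matching."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def _tokens(self, s):
        return set(s.lower().replace(".", "").split())

    def run(self, records):
        out = []
        for r in records:
            t = self._tokens(r)
            is_dup = any(
                len(t & self._tokens(k)) / len(t | self._tokens(k)) >= self.threshold
                for k in out
            )
            if not is_dup:
                out.append(r)
        return out

class LogicalDedup:
    """Logical operator: the physical implementation behind it can be
    swapped mid-workflow (in-flight operator replacement)."""
    def __init__(self, impl: PhysicalDedup):
        self.impl = impl

    def replace(self, impl: PhysicalDedup):
        self.impl = impl  # driven by analyst feedback

    def run(self, records):
        return self.impl.run(records)

dirty = ["Apple Inc.", "Apple Inc", "Google"]
op = LogicalDedup(ExactMatchDedup())
print(op.run(dirty))            # exact matching misses the near-duplicate
op.replace(SimilarityDedup())   # analyst upgrades the physical operator
print(op.run(dirty))            # near-duplicates are now merged
```

The point of the indirection is that downstream workflow code only ever references the logical operator, so the analyst can iterate on physical choices (or let the system suggest replacements) without rewriting the workflow.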