ABSTRACT
Though data analysis tools continue to improve, analysts still expend an inordinate amount of time and effort manipulating data and assessing data quality issues. Such "data wrangling" regularly involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. These transforms are often difficult to specify and difficult to reuse across analysis tasks, teams, and tools. In response, we introduce Wrangler, an interactive system for creating data transformations. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. Wrangler leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. Interactive histories support review, refinement, and annotation of transformation scripts. User study results show that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing.