ABSTRACT
Comma-Separated Values (CSV) files are commonly used to represent data. CSV is a very simple format, yet we show that it gives rise to a surprisingly large number of ambiguities in parsing and interpretation. We summarize the state of the art in CSV parsers, which typically make a linear series of parsing and interpretation decisions, so that a wrong decision at an early stage can negatively affect all downstream decisions. Since computation time is much less scarce than human time, we propose to turn CSV parsing into a ranking problem. Our quality-oriented multi-hypothesis CSV parsing approach generates several concurrent hypotheses about dialect, table structure, and similar properties, and ranks these hypotheses based on quality features of the resulting table. This approach makes it possible to create an advanced CSV parser that makes many different decisions, yet keeps the overall parser code a simple plug-in infrastructure. The complex interactions between these decisions are handled by searching the hypothesis space rather than by programming each interaction explicitly. We show that our approach leads to better parsing results than the state of the art and facilitates the parsing of large corpora of heterogeneous CSV files.
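To make the ranking idea concrete, here is a minimal sketch in Python of parsing a file under several dialect hypotheses and ranking the resulting tables by quality. It is not the authors' implementation. The candidate delimiters and quote characters, the `Hypothesis` type, the two quality features (row-width consistency and absence of stray quote characters), and the way they are combined are all assumptions chosen for illustration; the abstract does not specify which features the actual parser uses.

```python
import csv
import io
from collections import Counter
from itertools import product
from typing import List, NamedTuple

# Assumed candidate dialect components (illustrative, not from the paper).
DELIMITERS = [",", ";", "\t", "|"]
QUOTECHARS = ['"', "'"]


class Hypothesis(NamedTuple):
    score: float
    delimiter: str
    quotechar: str
    rows: List[List[str]]


def parse(text: str, delimiter: str, quotechar: str) -> List[List[str]]:
    """Parse text under one dialect hypothesis; return non-empty rows."""
    try:
        reader = csv.reader(io.StringIO(text, newline=""),
                            delimiter=delimiter, quotechar=quotechar)
        return [row for row in reader if row]
    except csv.Error:
        return []


def quality(rows: List[List[str]]) -> float:
    """Hypothetical quality score in [0, 1] for a parsed table."""
    if not rows:
        return 0.0
    widths = Counter(len(row) for row in rows)
    modal_width, modal_count = widths.most_common(1)[0]
    if modal_width < 2:
        return 0.0  # a single column suggests the delimiter was wrong
    # Feature 1: share of rows that have the most common width.
    consistency = modal_count / len(rows)
    # Feature 2: share of cells without a leftover quote at either end.
    cells = [cell for row in rows for cell in row]
    stray = sum(1 for c in cells
                if c[:1] in QUOTECHARS or c[-1:] in QUOTECHARS)
    cleanliness = 1 - stray / len(cells)
    return consistency * cleanliness


def rank_hypotheses(text: str) -> List[Hypothesis]:
    """Parse under every candidate dialect and sort by quality, best first."""
    hypotheses = []
    for delimiter, quotechar in product(DELIMITERS, QUOTECHARS):
        rows = parse(text, delimiter, quotechar)
        hypotheses.append(Hypothesis(quality(rows), delimiter, quotechar, rows))
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


SAMPLE = 'name;comment\nAnn;"hello, world"\nBob;"a, b"\n'

if __name__ == "__main__":
    best = rank_hypotheses(SAMPLE)[0]
    print(repr(best.delimiter), repr(best.quotechar), best.rows)
```

In the sample, a comma-first parser would split the quoted comments, whereas the ranking keeps every hypothesis alive until the resulting tables can be compared. New decisions (encoding, header detection, and so on) would enter as additional dimensions of the hypothesis space rather than as extra branches in the parsing code.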