Multi-Hypothesis CSV Parsing

Authors:
Till Döhmen, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
Hannes Mühleisen, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
Peter Boncz, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands

SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, June 2017, Article No.: 16, Pages 1–12, https://doi.org/10.1145/3085504.3085520

Published: 27 June 2017

ABSTRACT

Comma-Separated Values (CSV) files are commonly used to represent data. CSV is a very simple format, yet we show that it gives rise to a surprisingly large number of ambiguities in its parsing and interpretation. We summarize the state of the art in CSV parsers, which typically make a linear series of parsing and interpretation decisions, such that any wrong decision at an earlier stage can negatively affect all downstream decisions. Since computation time is much less scarce than human time, we propose to turn CSV parsing into a ranking problem. Our quality-oriented multi-hypothesis CSV parsing approach generates several concurrent hypotheses about dialect, table structure, etc. and ranks these hypotheses based on quality features of the resulting table. This approach makes it possible to create an advanced CSV parser that makes many different decisions, yet keeps the overall parser code a simple plug-in infrastructure. The complex interactions between these decisions are handled by searching the hypothesis space rather than by programming each interaction explicitly. We show that our approach leads to better parsing results than the state of the art and facilitates the parsing of large corpora of heterogeneous CSV files.
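
The generate-then-rank idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the candidate dialects, the function names (`generate_hypotheses`, `quality`, `rank_hypotheses`), and the two toy quality features (row-width consistency and share of numeric cells) are all assumptions made here for illustration, and the sketch covers only dialect hypotheses, not table structure.

```python
import csv
import io
from itertools import product


def generate_hypotheses(delimiters=(",", ";", "\t", "|"), quotechars=('"', "'")):
    """Enumerate candidate dialects: one hypothesis per delimiter/quote pair."""
    return [{"delimiter": d, "quotechar": q} for d, q in product(delimiters, quotechars)]


def parse(text, hypothesis):
    """Parse the text under a single hypothesis and return the non-empty rows."""
    reader = csv.reader(io.StringIO(text, newline=""), **hypothesis)
    return [row for row in reader if row]


def is_number(cell):
    """True if the cell can be read as a float (a deliberately crude type check)."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def quality(rows):
    """Toy quality score (hypothetical features, not those used in the paper)."""
    if not rows:
        return 0.0
    widths = [len(r) for r in rows]
    modal = max(set(widths), key=widths.count)
    if modal < 2:
        # A single-column result usually means the delimiter guess was wrong.
        return 0.0
    # Feature 1: fraction of rows that share the most common width.
    consistency = widths.count(modal) / len(widths)
    # Feature 2: fraction of cells below the first row that look numeric.
    cells = [c.strip() for r in rows[1:] for c in r]
    typed = sum(is_number(c) for c in cells) / len(cells) if cells else 0.0
    return consistency * (1.0 + typed)


def rank_hypotheses(text):
    """Parse under every hypothesis and return (score, hypothesis) pairs, best first."""
    scored = [(quality(parse(text, h)), h) for h in generate_hypotheses()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_hypotheses` on a small semicolon-separated sample whose cells also contain commas should rank a semicolon dialect first, because the comma hypothesis yields rows of inconsistent width. Adding a new kind of decision then amounts to extending `generate_hypotheses` or adding a feature to `quality`, rather than rewriting a chain of sequential heuristics.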

Index Terms

Multi-Hypothesis CSV Parsing
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Inconsistent data

Published in
SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management
June 2017
373 pages
ISBN: 9781450352826
DOI: 10.1145/3085504
General Chair: Alok Choudhary
Program Chair: Kesheng Wu
Publications Chair: Bin Dong
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 56 of 146 submissions, 38%