DOI: 10.1145/3085504.3085520
research-article

Multi-Hypothesis CSV Parsing

Published: 27 June 2017

ABSTRACT

Comma-Separated Values (CSV) files are commonly used to represent data. CSV is a very simple format, yet we show that it gives rise to a surprisingly large number of ambiguities in its parsing and interpretation. We summarize the state of the art in CSV parsers, which typically make a linear series of parsing and interpretation decisions, such that a wrong decision at an early stage can negatively affect all downstream decisions. Since computation time is much less scarce than human time, we propose to turn CSV parsing into a ranking problem. Our quality-oriented multi-hypothesis CSV parsing approach generates several concurrent hypotheses about dialect, table structure, and similar properties, and ranks these hypotheses based on quality features of the resulting table. This approach makes it possible to build an advanced CSV parser that makes many different decisions, yet keeps the overall parser code a simple plug-in infrastructure. The complex interactions between these decisions are handled by searching the hypothesis space rather than by programming each interaction explicitly. We show that our approach leads to better parsing results than the state of the art and facilitates the parsing of large corpora of heterogeneous CSV files.
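The idea of treating parsing as a ranking problem can be illustrated with a minimal sketch. It is not the authors' implementation: the hypothesis space here covers only delimiter and quote character, and the `quality` function (row-width consistency weighted by column count) is an assumed stand-in for the quality features the abstract mentions. The names `parse`, `quality`, `rank_hypotheses`, and `SAMPLE` are hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import product

# Hypothetical sample: semicolon-delimited, with semicolons inside quoted
# fields and decimal commas, so that several dialects look plausible.
SAMPLE = 'name;amount;note\n"Smith; J.";1,5;ok\n"Lee; K.";2,25;late\n'


def parse(text, delimiter, quotechar):
    """Parse the text under one dialect hypothesis and return its rows."""
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter=delimiter, quotechar=quotechar
    )
    return [row for row in reader if row]  # drop empty rows


def quality(rows):
    """Assumed quality feature, for illustration only.

    Fraction of rows that share the most common width, scaled by
    (1 - 1/width) so that single-column tables score zero.
    """
    if not rows:
        return 0.0
    width, count = Counter(len(row) for row in rows).most_common(1)[0]
    consistency = count / len(rows)
    return consistency * (1 - 1 / width)


def rank_hypotheses(text, delimiters=",;\t|", quotechars="\"'"):
    """Parse under every (delimiter, quotechar) pair and rank by quality."""
    scored = []
    for delimiter, quotechar in product(delimiters, quotechars):
        rows = parse(text, delimiter, quotechar)
        scored.append((quality(rows), delimiter, quotechar, rows))
    scored.sort(key=lambda hypothesis: hypothesis[0], reverse=True)
    return scored


if __name__ == "__main__":
    for score, delimiter, quotechar, rows in rank_hypotheses(SAMPLE)[:3]:
        print(f"{score:.3f} delimiter={delimiter!r} quotechar={quotechar!r}")
```

Each hypothesis is evaluated independently and only the resulting table is scored, so adding a further decision (for example, header detection) would mean adding one more dimension to the product rather than new branching logic in the parser.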

Published in

SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management
June 2017
373 pages
ISBN: 9781450352826
DOI: 10.1145/3085504

Copyright © 2017 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 56 of 146 submissions, 38%
