2021 | OriginalPaper | Chapter

Semi-automatic Column Type Inference for CSV Table Understanding

Authors: Sara Bonfitto, Luca Cappelletti, Fabrizio Trovato, Giorgio Valentini, Marco Mesiti

Published in: SOFSEM 2021: Theory and Practice of Computer Science

Publisher: Springer International Publishing

Abstract

Spreadsheets are often used as a simple way to represent tabular data. However, because they impose no restrictions on table structure or content, processing them automatically and integrating them with other information sources are particularly hard problems. Many table understanding approaches have been proposed for extracting data from tables and transforming them into meaningful information, but they require some regularity in the table contents.
Starting from CSV spreadsheets that contain values of different types as well as errors, this paper introduces an approach for inferring the types of columns in CSV tables by means of multi-label classification. With this approach, each column of a table can be associated with a simple datatype (such as integer, float, or text), a domain-specific one (such as the name of a municipality or an address), or a “union” of types (which takes into account the frequency of the corresponding values). Since the automatically inferred types might not be accurate, graphical interfaces have been developed to help the user fix mistakes. Finally, experimental results are reported on real spreadsheets obtained by a debt collection agency.
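
To make the idea of a frequency-aware “union” of types concrete, the following is a minimal Python sketch. It is not the multi-label classifier described in the paper: it replaces the learned model with a few hand-written regular expressions, covers only the simple datatypes named in the abstract, and every name in it (`cell_type`, `infer_column_type`, `min_share`) is a hypothetical choice made for illustration.

```python
import re
from collections import Counter

# Hypothetical patterns for the simple datatypes named in the abstract.
INTEGER = re.compile(r"[+-]?\d+")
FLOAT = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def cell_type(value: str) -> str:
    """Assign one raw CSV cell to a simple datatype using fixed rules."""
    text = value.strip()
    if text == "":
        return "empty"
    if INTEGER.fullmatch(text):
        return "integer"
    if FLOAT.fullmatch(text):
        return "float"
    return "text"


def infer_column_type(values, min_share=0.25):
    """Return the types seen in a column with their relative frequency.

    Empty cells are ignored. A type is kept only if it accounts for at
    least `min_share` of the non-empty cells (an assumed threshold), so
    the result is a single type or a "union" of several types.
    """
    counts = Counter(cell_type(v) for v in values)
    counts.pop("empty", None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        name: count / total
        for name, count in counts.most_common()
        if count / total >= min_share
    }


column = ["12", "7", "3.5", "n/a", "42", ""]
inferred = infer_column_type(column)
print(inferred)
```

In this sketch a column dominated by integers with a few stray values is reported as integer only, while lowering `min_share` lets the less frequent types appear in the union. A real system would still need the user review step the abstract describes, since fixed rules like these misclassify many domain-specific values.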

Footnotes
1
To distinguish terminal symbols from types, types are delimited by brackets.
 
Metadata
Title
Semi-automatic Column Type Inference for CSV Table Understanding
Authors
Sara Bonfitto
Luca Cappelletti
Fabrizio Trovato
Giorgio Valentini
Marco Mesiti
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-67731-2_39