Published in: Knowledge and Information Systems 1/2021

07-09-2020 | Regular Paper

Learning cell embeddings for understanding table layouts

Authors: Majid Ghasemi-Gol, Jay Pujara, Pedro Szekely


Abstract

A large amount of data on the web is in tabular form, such as Excel sheets, CSV files, and web tables. Tabular data is often meant for human consumption and uses data layouts that are difficult for machines to interpret automatically. Previous work uses the stylistic features of tabular cells (such as font size, border type, and background color) to classify cells by their role in the data layout of the document (top attribute, data, metadata, etc.). In this paper, we propose a deep neural network model that embeds semantic and contextual information about tabular cells in a low-dimensional cell embedding space. We pre-train this cell embedding model on a large corpus of tabular documents from various domains. We then propose a classification technique based on recurrent neural networks (RNNs) that combines our pre-trained cell embeddings with the stylistic features introduced in previous work to improve cell type classification in complex documents. We evaluate our system on three datasets containing documents with various data layouts, in two settings: in-domain and cross-domain training. Our evaluation results show that the proposed cell vector representations, combined with our RNN-based classification technique, significantly improve cell type classification performance.
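As a rough illustration of the idea in the abstract (not the authors' implementation), the sketch below concatenates a cell embedding with a stylistic feature vector and passes the cells of one table row through a small recurrent network that outputs a probability per cell type. Everything specific here is an assumption: the class `TinyRowRNN`, the helper names, the vector sizes, the two style features, and the toy values are hypothetical, and the weights are random rather than trained. The abstract does not state how the paper's network traverses a table, so reading a single row left to right is also an assumption.

```python
import math
import random

# Cell roles named in the abstract; the full label set is an assumption.
CELL_TYPES = ["top attribute", "data", "metadata"]


def concat_features(embedding, style):
    """Join a cell embedding with its stylistic feature vector."""
    return list(embedding) + list(style)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyRowRNN:
    """Hypothetical, untrained Elman-style RNN over the cells of one row."""

    def __init__(self, input_size, hidden_size, num_classes, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.w_in = random_matrix(hidden_size, input_size, rng)
        self.w_rec = random_matrix(hidden_size, hidden_size, rng)
        self.w_out = random_matrix(num_classes, hidden_size, rng)

    def classify_row(self, cell_vectors):
        """Return one probability distribution over cell types per cell."""
        hidden = [0.0] * self.hidden_size
        predictions = []
        for vec in cell_vectors:
            from_input = matvec(self.w_in, vec)
            from_state = matvec(self.w_rec, hidden)
            # The hidden state carries context from cells seen earlier in the row.
            hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
            predictions.append(softmax(matvec(self.w_out, hidden)))
        return predictions


# Toy row of three cells: a 3-dimensional stand-in for a pre-trained embedding
# plus two made-up style flags (is_bold, has_border).
row = [
    concat_features([0.2, -0.1, 0.4], [1.0, 1.0]),
    concat_features([0.0, 0.3, -0.2], [0.0, 0.0]),
    concat_features([0.1, 0.1, 0.1], [0.0, 1.0]),
]

model = TinyRowRNN(input_size=5, hidden_size=4, num_classes=len(CELL_TYPES))
probs = model.classify_row(row)
for cell_probs in probs:
    best = cell_probs.index(max(cell_probs))
    print(CELL_TYPES[best], [round(p, 3) for p in cell_probs])
```

Because the weights are random, the printed labels carry no meaning; the sketch only shows how the two feature sources could be joined and how a recurrent state lets each prediction depend on neighboring cells.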


Metadata
Title: Learning cell embeddings for understanding table layouts
Authors: Majid Ghasemi-Gol, Jay Pujara, Pedro Szekely
Publication date: 07-09-2020
Publisher: Springer London
Published in: Knowledge and Information Systems, Issue 1/2021
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-020-01508-6
