Schema extraction for tabular data on the web

Authors:
Marco D. Adelfio, Hanan Samet

Center for Automation Research, Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland, College Park, MD

Proceedings of the VLDB Endowment, Volume 6, Issue 6, pp. 421–432. https://doi.org/10.14778/2536336.2536343

Published: 01 April 2013

Abstract

Tabular data is an abundant source of information on the Web, but remains mostly isolated from the latter's interconnections since tables lack links and computer-accessible descriptions of their structure. In other words, the schemas of these tables -- attribute names, values, data types, etc. -- are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines and thus not accessible to user search queries. We address this lack of structure with a new method for leveraging the principles of table construction in order to extract table schemas. Discovering the schema by which a table is constructed is achieved by harnessing the similarities and differences of nearby table rows through the use of a novel set of features and a feature processing scheme. The schemas of these data tables are determined using a classification technique based on conditional random fields in combination with a novel feature encoding method called logarithmic binning, which is specifically designed for the data table extraction task. Our method provides considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, thereby being capable of handling general tables such as those found in spreadsheets, instead of being restricted to HTML tables as is the case with the WebTables method. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.
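
The abstract names logarithmic binning without defining it. As a minimal sketch of one plausible reading, assume that a row feature is the number of cells in the row that satisfy some property, and that such counts are bucketed by powers of two so that rows of very different widths map to a small set of discrete values. The function names, the pairing of matching and non-matching counts, and the `is_numeric` property below are illustrative assumptions, not the authors' definitions.

```python
from typing import Callable, Sequence, Tuple


def log_bin(count: int) -> int:
    """Bucket a non-negative count by powers of two (assumed scheme).

    0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, 8..15 -> 4, and so on.
    For count > 0 this equals floor(log2(count)) + 1.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return count.bit_length()


def encode_row_feature(
    cells: Sequence[str], has_property: Callable[[str], bool]
) -> Tuple[int, int]:
    """Encode one row feature as (bin of matching cells, bin of the rest).

    Hypothetical encoding: the pair keeps a coarse sense of both how many
    cells match and how many do not, independent of exact row width.
    """
    matching = sum(1 for cell in cells if has_property(cell))
    return log_bin(matching), log_bin(len(cells) - matching)


def is_numeric(cell: str) -> bool:
    """Illustrative cell property: the cell consists only of digits."""
    return cell.strip().isdigit()


if __name__ == "__main__":
    row = ["2010", "2011", "Total", "5"]
    print(encode_row_feature(row, is_numeric))
```

Discrete values of this kind could then serve as per-row observations for a sequence labeler such as a conditional random field, which is the classifier family the abstract mentions; how the features are actually wired into the model is not stated here.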

Index Terms

Schema extraction for tabular data on the web
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
2. Information systems
  1. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment, Volume 6, Issue 6 (April 2013), 144 pages
ISSN: 2150-8097
Publisher: VLDB Endowment