Schema extraction for tabular data on the web

Published: 01 April 2013

Abstract

Tabular data is an abundant source of information on the Web, but remains mostly isolated from the latter's interconnections since tables lack links and computer-accessible descriptions of their structure. In other words, the schemas of these tables -- attribute names, values, data types, etc. -- are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines and thus not accessible to user search queries. We address this lack of structure with a new method for leveraging the principles of table construction in order to extract table schemas. Discovering the schema by which a table is constructed is achieved by harnessing the similarities and differences of nearby table rows through the use of a novel set of features and a feature processing scheme. The schemas of these data tables are determined using a classification technique based on conditional random fields in combination with a novel feature encoding method called logarithmic binning, which is specifically designed for the data table extraction task. Our method provides considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, thereby being capable of handling general tables such as those found in spreadsheets, instead of being restricted to HTML tables as is the case with the WebTables method. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.
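The abstract names logarithmic binning as its feature encoding but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: for a given cell property, count how many cells in a row have it and how many do not, then map each count to a coarse logarithmic bucket so that rows of different widths yield comparable discrete features for a sequence classifier such as a conditional random field. The names `log_bin`, `encode_row`, and `is_numeric` are hypothetical and are not taken from the paper.

```python
def log_bin(count: int) -> int:
    """Map a non-negative count to a logarithmic bucket.

    0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, 8-15 -> 4, ...
    (equivalent to floor(log2(count)) + 1 for count >= 1).
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return count.bit_length()


def is_numeric(cell: str) -> bool:
    """Illustrative cell property: does the cell text parse as a number?"""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def encode_row(cells, predicate):
    """Encode one table row as a pair of logarithmic buckets.

    The first bucket summarizes how many cells satisfy `predicate`,
    the second how many do not. This pairing is an assumption made
    for illustration, not the paper's stated definition.
    """
    matching = sum(1 for cell in cells if predicate(cell))
    non_matching = len(cells) - matching
    return (log_bin(matching), log_bin(non_matching))


header_row = ["Year", "Population", "Area"]
data_row = ["2010", "308745538", "9.8"]
mixed_row = ["Total", "12", "7", "3", "1"]

header_code = encode_row(header_row, is_numeric)
data_code = encode_row(data_row, is_numeric)
mixed_code = encode_row(mixed_row, is_numeric)
```

Under this assumed encoding, a header row and a data row of the same width receive clearly different codes, while two data rows with slightly different numbers of numeric cells tend to fall into the same bucket. That coarseness is the point of binning: a classifier sees a small, discrete feature vocabulary instead of raw counts.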

Published in: Proceedings of the VLDB Endowment, Volume 6, Issue 6, April 2013, 144 pages

Publisher: VLDB Endowment
