
Harvesting relational tables from lists on the web

Published: 01 August 2009

Abstract

A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be challenging. The lists are manually generated and hence need not follow well-defined templates: they have inconsistent delimiters (if any) and often have missing information.

We propose a novel technique for extracting tables from lists. The technique is domain-independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields, and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the Web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table's quality.
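The two-phase idea described above (split each line independently, then compare splits across lines to repair misaligned rows) can be illustrated with a deliberately simplified sketch. This is not the authors' method: the delimiter set, the function names `split_line` and `lines_to_table`, the majority-width heuristic, and the score are all assumptions made for illustration, and the sketch does not use a corpus of HTML tables at all.

```python
import re
from collections import Counter

# Hypothetical delimiter set; the abstract does not say which delimiters are used.
DELIMITERS = re.compile(r"\s*[,;:|\t]\s*|\s+-\s+")


def split_line(line):
    """Phase 1: split one list line into candidate fields."""
    return [field for field in DELIMITERS.split(line.strip()) if field]


def lines_to_table(lines):
    """Phase 2: align rows to the most common field count.

    Returns (table, score). The score is an assumed, crude confidence
    measure: the fraction of rows whose independent split already had
    the chosen width.
    """
    rows = [row for row in (split_line(line) for line in lines) if row]
    if not rows:
        return [], 0.0

    # Assume the most frequent field count is the true number of columns.
    width = Counter(len(row) for row in rows).most_common(1)[0][0]

    table = []
    for row in rows:
        if len(row) > width:
            # Too many fields: merge the surplus into the last column.
            row = row[:width - 1] + [" ".join(row[width - 1:])]
        elif len(row) < width:
            # Too few fields: pad on the right with empty cells.
            # (Naive: a real aligner would decide which column is missing.)
            row = row + [""] * (width - len(row))
        table.append(row)

    score = sum(len(row) == width for row in rows) / len(rows)
    return table, score


if __name__ == "__main__":
    # Made-up sample lines with inconsistent delimiters and missing data.
    sample = [
        "Alice, Paris, 1990",
        "Bob | London | 1985",
        "Carol - Rome",
        "Dave; Oslo; 1972; extra",
    ]
    table, score = lines_to_table(sample)
    for row in table:
        print(row)
    print("score:", score)
```

The sketch always pads on the right and merges into the last column, which is exactly the kind of naive alignment that a cross-line comparison informed by field types or a table corpus would be expected to improve on.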

We conducted an extensive experimental study using both real web lists and lists derived from tables on the Web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique to a large sample of about 100,000 lists crawled from the Web. The analysis of the extracted tables has led us to believe that there are likely to be tens of millions of useful and queryable relational tables extractable from lists on the Web.
