skip to main content
research-article

Answering table augmentation queries from unstructured lists on the web

Published:01 August 2009Publication History
Skip Abstract Section

Abstract

We present the design of a system for assembling a table from a few example rows by harnessing the huge corpus of information-rich but unstructured lists on the web. We developed a totally unsupervised end to end approach which given the sample query rows --- (a) retrieves HTML lists relevant to the query from a pre-indexed crawl of web lists, (b) segments the list records and maps the segments to the query schema using a statistical model, (c) consolidates the results from multiple lists into a unified merged table, (d) and presents to the user the consolidated records ranked by their estimated membership in the target relation.

The key challenges in this task include construction of new rows from very few examples, and an abundance of noisy and irrelevant lists that swamp the consolidation and ranking of rows. We propose modifications to statistical record segmentation models, and present novel consolidation and ranking techniques that can process input tables of arbitrary schema without requiring any human supervision.

Experiments with Wikipedia target tables and 16 million unstructured lists show that even with just three sample rows, our system is very effective at recreating Wikipedia tables, with a mean runtime of around 20s.

References

  1. E. Agichtein and V. Ganti. Mining reference tables for automatic text segmentation. In SIGKDD, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Álvarez, A. Pan, J. Raposo, F. Bellas, and F. Cacheda. Extracting lists of data records from semi-structured web pages. Data Knowl. Eng., 64(2):491--509, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name-matching in information integration. IEEE Intelligent Systems, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. M. Cafarella, N. Khoussainova, D. Wang, E. Wu, Y. Zhang, and A. Halevy. Uncovering the relational web. In WebDB, 2008.Google ScholarGoogle Scholar
  5. A. Chandel, P. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. C. Chang. and S. Lui. Iepad: Information extraction based on pattern discovery. In WWW, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. W. W. Cohen, M. Hurst, and L. S. Jensen. A flexible learning system for wrapping tables and lists in html documents. In WWW, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.Google ScholarGoogle Scholar
  11. R. Gupta and S. Sarawagi. Curating probabilistic databases from information extraction models. In VLDB, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. K. Lerman, C. Knoblock, and S. Minton. Automatic data extraction from lists and tables in web sources. In Workshop on Advances in Text Extraction and Mining (ATEM), 2001.Google ScholarGoogle Scholar
  14. M. Michelson and C. A. Knoblock. Creating relational data from unstructured and ungrammatical data sources. Journal of Artificial Intelligence Research (JAIR), 31:543--590, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. Ravikumar and W. W. Cohen. A hierarchical graphical model for record linkage. In UAI, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In AAAI, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. S. Sarawagi. Information extraction. FnT Databases, 1(3), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. Sarawagi and W. W. Cohen. Semi-markov conditional random fields for information extraction. In NIPS, 2004.Google ScholarGoogle Scholar
  20. P. Singla and P. Domingos. Entity resolution with markov logic. In ICDM, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. P. P. Talukdar, J. Reisinger, M. Pasca, D. Ravichandran, R. Bhagat, and F. Pereira. Weakly-supervised acquisition of labeled class instances using graph random walks. In EMNLP, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. R. C. Wang and W. W. Cohen. Language-independent set expansion of named entities using the web. In ICDM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In WWW, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Answering table augmentation queries from unstructured lists on the web

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader