research-article

Answering table augmentation queries from unstructured lists on the web

Authors:
Rahul Gupta

IIT Bombay, India

IIT Bombay, India
View Profile

,
Sunita Sarawagi

IIT Bombay, India

IIT Bombay, India
View Profile

Proceedings of the VLDB Endowment Volume 2 Issue 1pp 289–300https://doi.org/10.14778/1687627.1687661

Published:01 August 2009Publication History

Proceedings of the VLDB Endowment

Abstract

We present the design of a system for assembling a table from a few example rows by harnessing the huge corpus of information-rich but unstructured lists on the web. We developed a totally unsupervised end to end approach which given the sample query rows --- (a) retrieves HTML lists relevant to the query from a pre-indexed crawl of web lists, (b) segments the list records and maps the segments to the query schema using a statistical model, (c) consolidates the results from multiple lists into a unified merged table, (d) and presents to the user the consolidated records ranked by their estimated membership in the target relation.

The key challenges in this task include construction of new rows from very few examples, and an abundance of noisy and irrelevant lists that swamp the consolidation and ranking of rows. We propose modifications to statistical record segmentation models, and present novel consolidation and ranking techniques that can process input tables of arbitrary schema without requiring any human supervision.

Experiments with Wikipedia target tables and 16 million unstructured lists show that even with just three sample rows, our system is very effective at recreating Wikipedia tables, with a mean runtime of around 20s.

References

E. Agichtein and V. Ganti. Mining reference tables for automatic text segmentation. In SIGKDD, 2004. Google ScholarDigital Library
M. Álvarez, A. Pan, J. Raposo, F. Bellas, and F. Cacheda. Extracting lists of data records from semi-structured web pages. Data Knowl. Eng., 64(2):491--509, 2008. Google ScholarDigital Library
M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name-matching in information integration. IEEE Intelligent Systems, 2003. Google ScholarDigital Library
M. Cafarella, N. Khoussainova, D. Wang, E. Wu, Y. Zhang, and A. Halevy. Uncovering the relational web. In WebDB, 2008.Google Scholar
A. Chandel, P. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, 2006. Google ScholarDigital Library
C. Chang. and S. Lui. Iepad: Information extraction based on pattern discovery. In WWW, 2001. Google ScholarDigital Library
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003. Google ScholarDigital Library
W. W. Cohen, M. Hurst, and L. S. Jensen. A flexible learning system for wrapping tables and lists in html documents. In WWW, 2002. Google ScholarDigital Library
H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. In VLDB, 2009. Google ScholarDigital Library
Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.Google Scholar
R. Gupta and S. Sarawagi. Curating probabilistic databases from information extraction models. In VLDB, 2006. Google ScholarDigital Library
D. Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. Google ScholarDigital Library
K. Lerman, C. Knoblock, and S. Minton. Automatic data extraction from lists and tables in web sources. In Workshop on Advances in Text Extraction and Mining (ATEM), 2001.Google Scholar
M. Michelson and C. A. Knoblock. Creating relational data from unstructured and ungrammatical data sources. Journal of Artificial Intelligence Research (JAIR), 31:543--590, 2008. Google ScholarDigital Library
P. Ravikumar and W. W. Cohen. A hierarchical graphical model for record linkage. In UAI, 2004. Google ScholarDigital Library
E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In AAAI, 1999. Google ScholarDigital Library
S. Sarawagi. Information extraction. FnT Databases, 1(3), 2008. Google ScholarDigital Library
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002. Google ScholarDigital Library
S. Sarawagi and W. W. Cohen. Semi-markov conditional random fields for information extraction. In NIPS, 2004.Google Scholar
P. Singla and P. Domingos. Entity resolution with markov logic. In ICDM, 2006. Google ScholarDigital Library
P. P. Talukdar, J. Reisinger, M. Pasca, D. Ravichandran, R. Bhagat, and F. Pereira. Weakly-supervised acquisition of labeled class instances using graph random walks. In EMNLP, 2008. Google ScholarDigital Library
R. C. Wang and W. W. Cohen. Language-independent set expansion of named entities using the web. In ICDM, 2007. Google ScholarDigital Library
Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In WWW, 2005. Google ScholarDigital Library

Index Terms

Answering table augmentation queries from unstructured lists on the web
1. Information systems
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Answering table queries on the web using column keywords

We present the design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web because these are ...
Read More
Answering relationship queries on the web
WWW '07: Proceedings of the 16th international conference on World Wide Web

Finding relationships between entities on the Web, e.g., the connections between different places or the commonalities of people, is a novel and challenging problem. Existing Web search engines excel in keyword matching and document ranking, but they ...
Read More
Question Answering using Web Lists
CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

There are many natural questions that are best answered with a list. We address the problem of answering such questions using lists that occur on the Web, i.e. List Question Answering (ListQA). The diverse formats of lists on the Web makes this task ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 2, Issue 1
August 2009
1293 pages
ISSN:2150-8097
Editors:
Serge Abiteboul,
Tova Milo,
Jignesh Patel,
Philippe Rigaux
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2009
Published in pvldb Volume 2, Issue 1
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 48
  Total Citations
  View Citations
- 344
  Total Downloads
- Downloads (Last 12 months)8
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Answering table augmentation queries from unstructured lists on the web

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Answering table queries on the web using column keywords

Answering relationship queries on the web

Question Answering using Web Lists

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Answering table augmentation queries from unstructured lists on the web

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Answering table queries on the web using column keywords

Answering relationship queries on the web

Question Answering using Web Lists

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media