Abstract
We present the design of a system for assembling a table from a few example rows by harnessing the huge corpus of information-rich but unstructured lists on the web. We developed a totally unsupervised end to end approach which given the sample query rows --- (a) retrieves HTML lists relevant to the query from a pre-indexed crawl of web lists, (b) segments the list records and maps the segments to the query schema using a statistical model, (c) consolidates the results from multiple lists into a unified merged table, (d) and presents to the user the consolidated records ranked by their estimated membership in the target relation.
The key challenges in this task include construction of new rows from very few examples, and an abundance of noisy and irrelevant lists that swamp the consolidation and ranking of rows. We propose modifications to statistical record segmentation models, and present novel consolidation and ranking techniques that can process input tables of arbitrary schema without requiring any human supervision.
Experiments with Wikipedia target tables and 16 million unstructured lists show that even with just three sample rows, our system is very effective at recreating Wikipedia tables, with a mean runtime of around 20s.
- E. Agichtein and V. Ganti. Mining reference tables for automatic text segmentation. In SIGKDD, 2004. Google ScholarDigital Library
- M. Álvarez, A. Pan, J. Raposo, F. Bellas, and F. Cacheda. Extracting lists of data records from semi-structured web pages. Data Knowl. Eng., 64(2):491--509, 2008. Google ScholarDigital Library
- M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name-matching in information integration. IEEE Intelligent Systems, 2003. Google ScholarDigital Library
- M. Cafarella, N. Khoussainova, D. Wang, E. Wu, Y. Zhang, and A. Halevy. Uncovering the relational web. In WebDB, 2008.Google Scholar
- A. Chandel, P. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, 2006. Google ScholarDigital Library
- C. Chang. and S. Lui. Iepad: Information extraction based on pattern discovery. In WWW, 2001. Google ScholarDigital Library
- S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003. Google ScholarDigital Library
- W. W. Cohen, M. Hurst, and L. S. Jensen. A flexible learning system for wrapping tables and lists in html documents. In WWW, 2002. Google ScholarDigital Library
- H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. In VLDB, 2009. Google ScholarDigital Library
- Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.Google Scholar
- R. Gupta and S. Sarawagi. Curating probabilistic databases from information extraction models. In VLDB, 2006. Google ScholarDigital Library
- D. Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. Google ScholarDigital Library
- K. Lerman, C. Knoblock, and S. Minton. Automatic data extraction from lists and tables in web sources. In Workshop on Advances in Text Extraction and Mining (ATEM), 2001.Google Scholar
- M. Michelson and C. A. Knoblock. Creating relational data from unstructured and ungrammatical data sources. Journal of Artificial Intelligence Research (JAIR), 31:543--590, 2008. Google ScholarDigital Library
- P. Ravikumar and W. W. Cohen. A hierarchical graphical model for record linkage. In UAI, 2004. Google ScholarDigital Library
- E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In AAAI, 1999. Google ScholarDigital Library
- S. Sarawagi. Information extraction. FnT Databases, 1(3), 2008. Google ScholarDigital Library
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002. Google ScholarDigital Library
- S. Sarawagi and W. W. Cohen. Semi-markov conditional random fields for information extraction. In NIPS, 2004.Google Scholar
- P. Singla and P. Domingos. Entity resolution with markov logic. In ICDM, 2006. Google ScholarDigital Library
- P. P. Talukdar, J. Reisinger, M. Pasca, D. Ravichandran, R. Bhagat, and F. Pereira. Weakly-supervised acquisition of labeled class instances using graph random walks. In EMNLP, 2008. Google ScholarDigital Library
- R. C. Wang and W. W. Cohen. Language-independent set expansion of named entities using the web. In ICDM, 2007. Google ScholarDigital Library
- Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In WWW, 2005. Google ScholarDigital Library
Index Terms
- Answering table augmentation queries from unstructured lists on the web
Recommendations
Answering table queries on the web using column keywords
We present the design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web because these are ...
Answering relationship queries on the web
WWW '07: Proceedings of the 16th international conference on World Wide WebFinding relationships between entities on the Web, e.g., the connections between different places or the commonalities of people, is a novel and challenging problem. Existing Web search engines excel in keyword matching and document ranking, but they ...
Question Answering using Web Lists
CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge ManagementThere are many natural questions that are best answered with a list. We address the problem of answering such questions using lists that occur on the Web, i.e. List Question Answering (ListQA). The diverse formats of lists on the Web makes this task ...
Comments