Using the structure of Web sites for automatic segmentation of tables

Authors:
Kristina Lerman (USC Information Sciences Institute, Marina del Rey, CA)
Lise Getoor (University of Maryland, College Park, MD)
Steven Minton (Fetch Technologies, Manhattan Beach, CA)
Craig Knoblock (USC Information Sciences Institute, Marina del Rey, CA)

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 2004, Pages 119–130
https://doi.org/10.1145/1007568.1007584

Published: 13 June 2004

ABSTRACT

Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the Web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites.
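
The constraint-based idea in the abstract can be illustrated with a small sketch. The code below is not the paper's algorithm or solver; it is a plain backtracking search written for illustration. The function name `segment`, the substring test for "appears on the detail page", and the assumption that records appear in the same order as their detail pages are all assumptions made here, not details taken from the paper.

```python
from typing import List, Optional


def segment(extracts: List[str], detail_pages: List[str]) -> Optional[List[int]]:
    """Assign each table extract to a record index, or return None if impossible.

    Illustrative constraints (assumptions of this sketch):
      1. an extract may only go to a record whose detail page contains it;
      2. record indices start at 0 and either stay the same or advance by one
         from one extract to the next (records are contiguous and ordered);
      3. the last extract belongs to the last record, so every record is used.
    """
    n_records = len(detail_pages)
    # Domain of each extract: records whose detail page mentions the extract.
    domains = [
        [r for r, page in enumerate(detail_pages) if text in page]
        for text in extracts
    ]
    assignment: List[int] = []

    def backtrack(i: int) -> bool:
        if i == len(extracts):
            return bool(assignment) and assignment[-1] == n_records - 1
        prev = assignment[-1] if assignment else 0
        for r in domains[i]:
            if not assignment and r != 0:
                continue  # the first extract must open record 0
            if not (prev <= r <= prev + 1):
                continue  # stay in the current record or move to the next one
            assignment.append(r)
            if backtrack(i + 1):
                return True
            assignment.pop()
        return False

    return assignment if backtrack(0) else None


# Hypothetical data: "New York" appears on both detail pages, so content
# alone is ambiguous; the ordering constraint resolves it.
extracts = ["John Smith", "New York", "(212) 555-1234",
            "Jane Doe", "New York", "(310) 555-9876"]
detail_pages = [
    "John Smith 12 Main St New York (212) 555-1234",
    "Jane Doe 5 Elm St New York (310) 555-9876",
]
print(segment(extracts, detail_pages))  # [0, 0, 0, 1, 1, 1]
```

The example shows why the redundancy between table and detail pages helps: an extract that occurs on several detail pages cannot be placed by content alone, but combining the content constraint with the positional constraint leaves a single consistent segmentation.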

Published in
SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
June 2004, 988 pages
ISBN: 1581138598
DOI: 10.1145/1007568
Conference Chairs: Arnd Christian König (Microsoft Research), Stefan Dessloch (University of Kaiserslautern, Germany)
General Chair: Patrick Valduriez (INRIA, France)
Program Chair: Gerhard Weikum (University of the Saarland)
Copyright © 2004 ACM
Publisher: Association for Computing Machinery, New York, NY, United States