ABSTRACT
Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain a great deal of explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the Web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites.
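The constraint-based idea can be illustrated with a minimal sketch. It is not the algorithm evaluated in the paper: the function name `segment_records`, the input encoding, and the three constraints listed in the docstring are assumptions chosen for illustration. Each text extract from the table page is annotated with the set of detail pages on which it also appears, and a small backtracking search assigns every extract to one record.

```python
from typing import List, Optional, Set


def segment_records(extract_pages: List[Set[int]], num_records: int) -> Optional[List[int]]:
    """Assign each table extract to one record index, or return None.

    extract_pages[i] holds the indices of the detail pages (one page per
    record) whose text contains the i-th extract of the table page, in
    document order.

    Assumed constraints (illustrative, not taken from the paper):
      * every extract belongs to exactly one record,
      * the detail page of that record must contain the extract,
      * record indices never decrease along the table, so each record
        occupies one contiguous block of extracts.
    """
    assignment: List[int] = []

    def extend(i: int, min_record: int) -> bool:
        # All extracts assigned: a consistent segmentation was found.
        if i == len(extract_pages):
            return True
        # Try every record that keeps the ordering constraint satisfied.
        for r in range(min_record, num_records):
            if r in extract_pages[i]:
                assignment.append(r)
                if extend(i + 1, r):
                    return True
                assignment.pop()  # undo the choice and backtrack
        return False

    return assignment if extend(0, 0) else None


# Hypothetical table with two records; the city string occurs on both
# detail pages, so on its own it is ambiguous.
observed = [{0}, {0, 1}, {0}, {1}, {0, 1}, {1}]
segmentation = segment_records(observed, num_records=2)
```

In this toy input the ambiguous extracts are resolved by their unambiguous neighbours, which is the kind of redundancy between table and detail pages that the abstract refers to. When no assignment satisfies the assumed constraints, the function returns `None`.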