research-article

Unexpected results in automatic list extraction on the web

Published: 31 March 2011

Abstract

The discovery and extraction of general lists on the Web continue to be an important problem facing the Web mining community. There have been numerous studies that claim to automatically extract structured data (e.g., lists, record sets, and tables) from the Web for various purposes. Our own recent experiences have shown that the list-finding methods used as part of these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches and tests them on various list pages. Based on our findings, we conclude that analyzing a Web page's DOM structure is not sufficient for the general list-finding task.



Published in

ACM SIGKDD Explorations Newsletter, Volume 12, Issue 2
December 2010
98 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1964897

Copyright © 2011 Authors

Publisher

Association for Computing Machinery
New York, NY, United States
