research-article

Unexpected results in automatic list extraction on the web

Authors:
Tim Weninger (University of Illinois at Urbana-Champaign)
Fabio Fumarola (Università degli Studi di Bari "Aldo Moro")
Rick Barber (University of Illinois at Urbana-Champaign)
Jiawei Han (University of Illinois at Urbana-Champaign)
Donato Malerba (Università degli Studi di Bari "Aldo Moro")

ACM SIGKDD Explorations Newsletter, Volume 12, Issue 2, December 2010, pp. 26–30
https://doi.org/10.1145/1964897.1964904

Published: 31 March 2011

Abstract

The discovery and extraction of general lists on the Web continues to be an important problem facing the Web mining community. There have been numerous studies that claim to automatically extract structured data (e.g., lists, record sets, and tables) from the Web for various purposes. Our own recent experiences have shown that the list-finding methods used as part of these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches and tests them on various list pages. Based on our findings, we conclude that analyzing a Web page's DOM structure is not sufficient for the general list-finding task.


Index Terms

1. Information systems


Published in

ACM SIGKDD Explorations Newsletter Volume 12, Issue 2
December 2010
98 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1964897

Copyright © 2011 Authors
Publisher
Association for Computing Machinery
New York, NY, United States

