poster

HyLiEn: a hybrid approach to general list extraction on the web

Authors:
Fabio Fumarola, Università degli Studi di Bari, Bari, Italy
Tim Weninger, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
Rick Barber, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
Donato Malerba, Università degli Studi di Bari, Bari, Italy
Jiawei Han, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA

WWW '11: Proceedings of the 20th international conference companion on World wide web, March 2011, Pages 35–36, https://doi.org/10.1145/1963192.1963211

Published: 28 March 2011

ABSTRACT

We consider the problem of automatically extracting general lists from the web. Existing approaches mostly depend on either the underlying HTML markup or the visual structure of the Web page. We present HyLiEn, an unsupervised, Hybrid approach for automatic List discovery and Extraction on the Web. It employs general assumptions about the visual rendering of lists and the structural representation of the items they contain. We show that our method significantly outperforms existing methods.

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in
WWW '11: Proceedings of the 20th international conference companion on World wide web
March 2011
552 pages
ISBN: 9781450306379
DOI: 10.1145/1963192
General Chairs:
S. Sadagopan, IIIT-Bangalore, India
Krithi Ramamritham, IIT-Bombay, India
Arun Kumar, IBM Research, India
M. P. Ravindra, Infosys E & R, India
Program Chairs:
Elisa Bertino, Purdue University, USA
Ravi Kumar, Yahoo! Research, USA
Copyright © 2011 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
web information integration
web lists
web mining
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%