ABSTRACT
We consider the problem of automatically extracting general lists from the Web. Existing approaches depend mostly on either the underlying HTML markup or the visual structure of the Web page. We present HyLiEn, an unsupervised Hybrid approach for automatic List discovery and Extraction on the Web. It employs general assumptions about the visual rendering of lists and the structural representation of the items they contain. We show that our method significantly outperforms existing methods.