Abstract
The discovery and extraction of general lists on the Web continues to be an important problem facing the Web mining community. Numerous studies claim to automatically extract structured data (e.g., lists, record sets, and tables) from the Web for various purposes. Our own recent experience shows that the list-finding methods used within these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches and tests them on various list pages. Based on our findings, we conclude that analyzing a Web page's DOM structure is not sufficient for the general list-finding task.
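To make the notion of DOM-based list finding concrete, the following is a minimal sketch of one naive heuristic: flag any element whose direct children repeat the same tag at least a few times. It is not any of the methods evaluated in the paper; the names `RepeatedChildFinder` and `find_list_candidates` and the `min_repeat` threshold are assumptions introduced only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class RepeatedChildFinder(HTMLParser):
    """Hypothetical helper: tracks the direct child tags of every open element."""

    def __init__(self, min_repeat=3):
        super().__init__()
        self.min_repeat = min_repeat
        self.stack = []       # open elements as (tag, [child tags])
        self.candidates = []  # (parent tag, repeated child tag, count)

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][1].append(tag)  # register as child of the open parent
        if tag not in VOID_TAGS:
            self.stack.append((tag, []))

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _ in self.stack]:
            return  # stray or void end tag: ignore it
        # Pop up to the matching element, tolerating unclosed inner tags.
        while self.stack:
            open_tag, children = self.stack.pop()
            self.score(open_tag, children)
            if open_tag == tag:
                break

    def score(self, parent, children):
        if not children:
            return
        child, count = Counter(children).most_common(1)[0]
        if count >= self.min_repeat:
            self.candidates.append((parent, child, count))


def find_list_candidates(html, min_repeat=3):
    """Return (parent, child, count) triples that look like lists in the DOM."""
    finder = RepeatedChildFinder(min_repeat)
    finder.feed(html)
    finder.close()
    # Score elements that were never closed in the input.
    for parent, children in reversed(finder.stack):
        finder.score(parent, children)
    return finder.candidates
```

Under these assumptions, `find_list_candidates("<ul><li>a</li><li>b</li><li>c</li></ul>")` reports the `ul` element, whereas a list laid out as plain lines inside a single `<pre>` element yields no candidates, because its items leave no repeated structure in the DOM for the heuristic to detect.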
Index Terms
- Unexpected results in automatic list extraction on the web