ABSTRACT
Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can hint at the category of the resource. This paper explores the use of URLs for webpage categorization via a two-phase pipeline of word segmentation/expansion and classification. We quantify its performance against document-based methods, which require the retrieval of the source document.
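The two-phase pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: the toy lexicon `VOCAB`, the helpers `split_url`, `segment`, and `url_features`, and the use of a multinomial naive Bayes classifier are all assumptions chosen for brevity, and the example URLs are made up. The abstract does not name the classifier or the segmentation algorithm, and the expansion step is omitted here.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy lexicon; a real system would need a much larger one.
VOCAB = {"computer", "science", "courses", "faculty", "home", "page",
         "news", "sports"}


def split_url(url):
    """Lowercase the URL and split it on any run of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", url.lower()) if t]


def segment(token, vocab=VOCAB):
    """Split a token into the fewest lexicon words (dynamic programming).

    Returns [token] unchanged when no complete split exists.
    """
    best = [None] * (len(token) + 1)  # best[i]: shortest split of token[:i]
    best[0] = []
    for i in range(1, len(token) + 1):
        for j in range(i):
            if best[j] is not None and token[j:i] in vocab:
                candidate = best[j] + [token[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[-1] if best[-1] is not None else [token]


def url_features(url):
    """Phase 1: turn a URL into a flat list of word features."""
    return [word for token in split_url(url) for word in segment(token)]


class NaiveBayes:
    """Phase 2 stand-in: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            feats = url_features(url)
            self.word_counts[label].update(feats)
            self.doc_counts[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, url):
        feats = url_features(url)
        total_docs = sum(self.doc_counts.values())

        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.doc_counts[label] / total_docs)
            return prior + sum(math.log((counts[w] + 1) / denom) for w in feats)

        return max(self.doc_counts, key=score)


# Made-up training URLs and labels, for illustration only.
train_urls = [
    "http://example.edu/computerscience/courses",
    "http://example.edu/faculty/homepage",
    "http://example.com/sportsnews",
    "http://example.com/news/sports/today",
]
train_labels = ["course", "faculty", "news", "news"]
model = NaiveBayes().fit(train_urls, train_labels)
prediction = model.predict("http://other.edu/courses/computerscience")
```

The classifier only ever sees the URL string, so no page is fetched at any point; this is the property the abstract contrasts with document-based methods.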
Index Terms
- Web page classification without the web page