ABSTRACT
In this paper, we identify and analyze structural properties that reflect the functionality of a Web site. These properties cover the size, the organization, the composition of URLs, and the link structure of Web sites. In contrast to previous work, we perform a comprehensive measurement study of the relation between the structure and the functionality of Web sites. Our study focuses on five of the most relevant functional classes, namely Academic, Blog, Corporate, Personal, and Shop. It is based on more than 1,400 Web sites comprising 7 million crawled and 47 million known Web pages. We present a detailed statistical analysis that provides insight into how structural properties can be used to distinguish between Web sites from different functional classes. Building on these results, we introduce a content-independent approach for the automated coarse-grained classification of Web sites. A naïve Bayesian classifier with advanced density estimation yields a precision of 82% and a recall of 80% for the classification of Web sites into the considered classes.
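To make the classification step concrete, the following is a minimal sketch of a naïve Bayesian classifier that models each structural feature with a per-class density estimate. The abstract does not specify which density estimator is used; the sketch assumes a Gaussian kernel density estimate with a fixed bandwidth as a stand-in. The two features (log10 of the page count and mean URL depth), the toy data, and all names such as `KdeNaiveBayes` and `precision_recall` are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def gaussian_kde_logpdf(x, samples, bandwidth):
    """Log of a Gaussian kernel density estimate at x, built from 1-D samples."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    # Floor the density so a far-away point does not produce log(0).
    return math.log(max(total / norm, 1e-300))


class KdeNaiveBayes:
    """Naive Bayes with one kernel density estimate per (class, feature)."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth  # assumed fixed bandwidth, not from the paper
        self.columns = {}           # class label -> per-feature sample lists
        self.log_prior = {}         # class label -> log prior probability

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        for label, members in grouped.items():
            # Transpose the rows of this class into one sample list per feature.
            self.columns[label] = [list(col) for col in zip(*members)]
            self.log_prior[label] = math.log(len(members) / len(rows))
        return self

    def predict(self, row):
        def score(label):
            # Features are treated as conditionally independent given the class.
            return self.log_prior[label] + sum(
                gaussian_kde_logpdf(x, col, self.bandwidth)
                for x, col in zip(row, self.columns[label])
            )
        return max(self.columns, key=score)


def precision_recall(true_labels, predicted, target):
    """Precision and recall for one target class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: (log10 of page count, mean URL depth) per Web site.
train_rows = [(4.0, 3.5), (4.2, 3.8), (3.9, 3.2), (2.0, 1.5), (2.2, 1.8), (1.9, 1.2)]
train_labels = ["Shop", "Shop", "Shop", "Blog", "Blog", "Blog"]

model = KdeNaiveBayes(bandwidth=0.5).fit(train_rows, train_labels)
test_rows = [(4.1, 3.4), (2.1, 1.4)]
predictions = [model.predict(row) for row in test_rows]
```

Because the classifier only sees numeric structural features, it never inspects page content, which matches the content-independent framing of the abstract. The per-class precision and recall computed by `precision_recall` would need to be averaged over the classes to obtain figures comparable to the reported ones.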
Index Terms
- Coarse-grained classification of web sites by their structural properties