DOI: 10.1145/1183550.1183559
Article

Coarse-grained classification of web sites by their structural properties

Published: 10 November 2006

ABSTRACT

In this paper, we identify and analyze structural properties that reflect the functionality of a Web site. These structural properties consider the size, the organization, the composition of URLs, and the link structure of Web sites. In contrast to previous work, we perform a comprehensive measurement study to delve into the relation between the structure and the functionality of Web sites. Our study focuses on five of the most relevant functional classes, namely Academic, Blog, Corporate, Personal, and Shop. It is based upon more than 1,400 Web sites composed of 7 million crawled and 47 million known Web pages. We present a detailed statistical analysis that provides insight into how structural properties can be used to distinguish between Web sites from different functional classes. Building on these results, we introduce a content-independent approach for the automated coarse-grained classification of Web sites. A naïve Bayesian classifier with advanced density estimation yields a precision of 82% and a recall of 80% for the classification of Web sites into the considered classes.
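As a rough illustration of the kind of classifier the abstract names, the following sketch combines a naïve Bayes decision rule with Gaussian kernel density estimates of each feature. It is not the authors' implementation: the abstract does not say which density estimator or which features were used, so the class `KdeNaiveBayes`, the fixed bandwidth, the two features (log10 of the page count and mean URL depth), and the toy training values are all assumptions made up for this example.

```python
import math
from collections import defaultdict


def gaussian_kernel_density(x, samples, bandwidth):
    """Estimate a density at x as the average of Gaussian kernels
    centred on the training samples (a plain kernel density estimate)."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return norm * total / len(samples)


class KdeNaiveBayes:
    """Hypothetical naive Bayes classifier with per-feature kernel densities."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth  # assumed fixed; not taken from the paper
        self.samples = {}           # label -> one list of values per feature
        self.priors = {}            # label -> relative class frequency

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        for label, members in grouped.items():
            self.priors[label] = len(members) / len(rows)
            # Transpose so that each feature keeps its own sample list.
            self.samples[label] = [list(col) for col in zip(*members)]
        return self

    def log_scores(self, row):
        scores = {}
        for label, columns in self.samples.items():
            score = math.log(self.priors[label])
            for value, column in zip(row, columns):
                density = gaussian_kernel_density(value, column, self.bandwidth)
                # Floor the density to avoid log(0) far from all samples.
                score += math.log(max(density, 1e-300))
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Invented toy data: [log10(number of pages), mean URL depth] per site.
train_rows = [
    [4.0, 3.0], [4.2, 3.5], [3.8, 2.8],   # labelled "Shop"
    [1.0, 1.0], [1.3, 1.2], [0.8, 1.5],   # labelled "Personal"
]
train_labels = ["Shop", "Shop", "Shop", "Personal", "Personal", "Personal"]

model = KdeNaiveBayes(bandwidth=0.5).fit(train_rows, train_labels)
print(model.predict([3.9, 3.1]))  # Shop
print(model.predict([1.1, 1.1]))  # Personal
```

The independence assumption lets each structural feature be modelled by its own one-dimensional density, which is why a kernel estimate per feature and per class is enough here. A real system would likely choose the bandwidth from the data rather than fixing it.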

Published in

WIDM '06: Proceedings of the 8th annual ACM international workshop on Web information and data management
November 2006
102 pages
ISBN: 1595935258
DOI: 10.1145/1183550

Copyright © 2006 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
