Article

Coarse-grained classification of web sites by their structural properties

Authors:
Christoph Lindemann, University of Leipzig, Leipzig, Germany
Lars Littig, University of Leipzig, Leipzig, Germany

WIDM '06: Proceedings of the 8th annual ACM international workshop on Web information and data management, November 2006, Pages 35–42. https://doi.org/10.1145/1183550.1183559

Published: 10 November 2006

ABSTRACT

In this paper, we identify and analyze structural properties that reflect the functionality of a Web site. These structural properties cover the size, the organization, the composition of URLs, and the link structure of Web sites. In contrast to previous work, we perform a comprehensive measurement study of the relation between the structure and the functionality of Web sites. Our study focuses on five of the most relevant functional classes, namely Academic, Blog, Corporate, Personal, and Shop. It is based on more than 1,400 Web sites comprising 7 million crawled and 47 million known Web pages. We present a detailed statistical analysis that provides insight into how structural properties can be used to distinguish between Web sites from different functional classes. Building on these results, we introduce a content-independent approach for the automated coarse-grained classification of Web sites. A naïve Bayesian classifier with advanced density estimation yields a precision of 82% and a recall of 80% for the classification of Web sites into the considered classes.
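As a rough illustration of the general idea, the sketch below shows a naïve Bayesian classifier whose per-feature class densities come from a Gaussian kernel density estimate. It is not the authors' method: the abstract does not specify which density estimation is used, so the Gaussian kernel, the fixed `bandwidth`, the two features (log10 of the page count and mean URL depth), and all training values are hypothetical stand-ins chosen only to make the example run.

```python
import math
from collections import defaultdict


def kde_log_density(x, samples, bandwidth):
    """Log of a Gaussian kernel density estimate at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    # Floor the density to avoid log(0) far away from all training samples.
    return math.log(max(total / norm, 1e-300))


class KdeNaiveBayes:
    """Naive Bayes with one kernel density estimate per class and feature."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth
        self.samples = {}    # label -> list of per-feature sample lists
        self.log_prior = {}  # label -> log of the class prior

    def fit(self, X, y):
        by_label = defaultdict(list)
        for row, label in zip(X, y):
            by_label[label].append(row)
        for label, rows in by_label.items():
            # Transpose rows so each feature has its own list of samples.
            self.samples[label] = [list(col) for col in zip(*rows)]
            self.log_prior[label] = math.log(len(rows) / len(X))
        return self

    def predict(self, row):
        def score(label):
            # Naive independence assumption: sum the per-feature log densities.
            return self.log_prior[label] + sum(
                kde_log_density(value, col, self.bandwidth)
                for value, col in zip(row, self.samples[label])
            )
        return max(self.samples, key=score)


# Hypothetical toy data: (log10 of page count, mean URL depth) per site.
X = [(2.0, 3.0), (2.2, 3.2), (1.9, 2.8), (4.0, 1.5), (4.2, 1.4), (3.8, 1.6)]
y = ["Blog", "Blog", "Blog", "Shop", "Shop", "Shop"]

model = KdeNaiveBayes(bandwidth=0.5).fit(X, y)
print(model.predict((2.1, 3.0)))
print(model.predict((4.1, 1.5)))
```

The classifier never looks at page text, which mirrors the content-independent framing of the abstract; a realistic setup would need many more features, a data-driven bandwidth, and held-out evaluation to report precision and recall.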

Index Terms

Coarse-grained classification of web sites by their structural properties
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction
  2. Information systems applications
    1. Data mining

Published in
WIDM '06: Proceedings of the 8th annual ACM international workshop on Web information and data management
November 2006
102 pages
ISBN: 1595935258
DOI: 10.1145/1183550
Program Chairs: Angela Bonifati (Icar CNR, Italy), Irini Fundulaki (University of Edinburgh, UK)
Copyright © 2006 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
naïve bayesian classification
search engines
web measurement
web mining
web site classification
web structure mining