Tracking Web spam with HTML style similarities

Authors:
Tanguy Urvoy, Orange Labs (France Telecom R&D), Lannion cedex, France
Emmanuel Chauveau, Orange Labs (France Telecom R&D), Lannion cedex, France
Pascal Filoche, Orange Labs (France Telecom R&D), Lannion cedex, France
Thomas Lavergne, Orange Labs and ENST Paris, Lannion cedex, France

ACM Transactions on the Web, Volume 2, Issue 1, Article No. 3, pp. 1–28
https://doi.org/10.1145/1326561.1326564

Published: 03 March 2008

Abstract

Automatically generated content is ubiquitous on the web. Dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs, and other sites edited using web authoring software), as are less legitimate spamdexing attempts (e.g., link farms and fake directories).

Pages built using the same generating method (template or script) share a common “look and feel” that common text classification methods do not easily detect, and that is more closely related to stylometry.

In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since the proposed algorithm is based on locality sensitive hashing (LSH), we first review this technique.
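To make the idea of LSH-based style clustering concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes, purely for illustration, that the extra-textual features are k-grams of HTML tag names, that similarity is the Jaccard index estimated with MinHash, and that candidates are grouped by banding the signature. The names `tag_shingles`, `minhash_signature`, `estimated_jaccard`, and `lsh_buckets`, and all parameter values, are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def tag_shingles(html, k=3):
    """Return the set of k-grams over the sequence of tag names (text is ignored)."""
    # Assumption: opening and closing tags are kept distinct ("div" vs "/div").
    tags = [t.lower() for t in re.findall(r"<(/?[a-zA-Z][a-zA-Z0-9]*)", html)]
    if len(tags) < k:
        return {tuple(tags)} if tags else set()
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash of one shingle; each seed acts as one hash function."""
    data = f"{seed}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingles, num_hashes=32):
    """MinHash signature: the minimum hash value per seeded hash function."""
    if not shingles:
        raise ValueError("cannot fingerprint a page without tags")
    return [min(_hash(seed, s) for s in shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions, an estimate of the Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signatures, bands=8):
    """Group document ids whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    # Only buckets holding more than one document are candidate clusters.
    return [ids for ids in buckets.values() if len(ids) > 1]


# Two pages from the same (made-up) template, and one unrelated page.
page_a = "<html><body><div><p>Buy cheap pills</p></div><a href='x'>link</a></body></html>"
page_b = "<html><body><div><p>Other words here</p></div><a href='y'>more</a></body></html>"
page_c = "<table><tr><td>1</td><td>2</td></tr></table>"

signatures = {
    name: minhash_signature(tag_shingles(page))
    for name, page in {"a": page_a, "b": page_b, "c": page_c}.items()
}
candidate_clusters = lsh_buckets(signatures)
```

Because the two templated pages differ only in their text, their tag shingles are identical and they land in the same buckets, while the unrelated page does not. Banding trades recall for precision: more bands with fewer rows each will group less similar pages.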

We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.
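One plausible way to use style clusters for pinpointing dubious pages is to let known labels vote within each cluster. The sketch below is an assumption made for illustration, not the procedure evaluated in the article: the function `flag_dubious_pages`, the label strings, and the thresholds `min_labeled` and `spam_ratio` are all hypothetical.

```python
def flag_dubious_pages(clusters, labels, min_labeled=2, spam_ratio=0.8):
    """Flag unlabeled pages that sit in clusters dominated by known spam.

    clusters: iterable of sets of page ids sharing a similar HTML style.
    labels:   dict mapping some page ids to "spam" or "normal".
    """
    flagged = set()
    for members in clusters:
        known = [labels[page] for page in members if page in labels]
        # Skip clusters with too little evidence to judge.
        if len(known) < min_labeled:
            continue
        ratio = sum(1 for label in known if label == "spam") / len(known)
        if ratio >= spam_ratio:
            flagged.update(page for page in members if page not in labels)
    return flagged


# Hypothetical clusters and a partial labeling.
clusters = [{"a", "b", "c"}, {"d", "e", "f"}]
labels = {"a": "spam", "b": "spam", "d": "normal", "e": "spam"}
dubious = flag_dubious_pages(clusters, labels)
```

Here only page `c` is flagged: its cluster contains two labeled pages, both spam, whereas the second cluster is split evenly and stays below the threshold. In a classifier setting, the per-cluster spam ratio could instead be fed in as an extra feature rather than used as a hard rule.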

Index Terms

Information systems
  Information retrieval
    Document representation

Published in

ACM Transactions on the Web Volume 2, Issue 1
February 2008
280 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/1326561
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 March 2008
- Accepted: 1 October 2007
- Revised: 1 September 2007
- Received: 1 April 2007
Author Tags
Clustering
document similarity
search engine spam
stylometry
templates identification
Qualifiers
- research-article
- Research
- Refereed