
Tracking Web spam with HTML style similarities

Published: 03 March 2008

Abstract

Automatically generated content is ubiquitous on the web. Dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs, and other sites edited using web authoring software), as are less legitimate spamdexing attempts (e.g., link farms and fake directories).

Pages built using the same generation method (template or script) share a common “look and feel” that is not easily detected by common text classification methods and is more closely related to stylometry.

In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since the proposed algorithm is based on locality sensitive hashing (LSH), we first review this technique.
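To make the idea concrete, here is a minimal sketch of style-based similarity with LSH. It is not the authors' algorithm, which the abstract does not detail. It assumes min-hashing over shingles of tag names as the LSH family, and every name in it (`style_tokens`, `shingles`, `minhash_signature`, `estimated_jaccard`, `lsh_buckets`) and every parameter value is hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def style_tokens(html):
    """Keep only opening/closing tag names and drop visible text.

    Assumption: tag names stand in for the "extra-textual" features.
    """
    return re.findall(r"</?[a-zA-Z][a-zA-Z0-9]*", html)


def shingles(tokens, k=3):
    """Return the set of k-token windows (always non-empty)."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def _hash(seed, item):
    """Seeded 64-bit hash; each seed acts as one hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=64):
    """Min-hash signature: the minimum hash value per seeded function."""
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signatures, bands=16):
    """Group documents that agree on at least one band of their signature."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


if __name__ == "__main__":
    pages = {
        "a": "<html><body><div><p>Buy now</p></div></body></html>",
        "b": "<html><body><div><p>Cheap pills</p></div></body></html>",
        "c": "<table><tr><td>x</td></tr></table>",
    }
    sigs = {name: minhash_signature(shingles(style_tokens(html)))
            for name, html in pages.items()}
    # "a" and "b" have identical markup, so their signatures are identical.
    print(estimated_jaccard(sigs["a"], sigs["b"]))
    print(estimated_jaccard(sigs["a"], sigs["c"]))
    print(lsh_buckets(sigs))
```

Pages "a" and "b" differ only in their visible text, so a purely textual classifier would see two different documents, while the tag-based signature treats them as the same template. Banding trades precision for recall: more bands with fewer rows each makes candidate pairs easier to form.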

We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.



    • Published in

      ACM Transactions on the Web, Volume 2, Issue 1
      February 2008
      280 pages
      ISSN: 1559-1131
      EISSN: 1559-114X
      DOI: 10.1145/1326561

      Copyright © 2008 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 March 2008
      • Accepted: 1 October 2007
      • Revised: 1 September 2007
      • Received: 1 April 2007

      Qualifiers

      • research-article
      • Research
      • Refereed
