ABSTRACT
As search plays a growing role in guiding today's web traffic, more and more effort is being spent on creating search engine spam. Because link analysis is one of the most important factors in the ranking systems of current commercial search engines, new kinds of spam that target links have appeared. Building link farms is one such technique, and it can degrade the results of link-based ranking algorithms. In this paper, we present algorithms for detecting these link farms automatically: we first generate a seed set based on the common link set between the incoming and outgoing links of Web pages, and then expand it. Links between the identified pages are re-weighted, yielding a modified web graph for use in ranking page importance. Experimental results show that we can identify most link farm spam pages and that the final ranking results improve for almost all tested queries.
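The three steps named in the abstract (seed selection from the overlap of incoming and outgoing links, expansion, and re-weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two thresholds, the expansion rule (flag a page that links to enough already-flagged pages), the page-level comparison of links, and the `penalty` weight are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import defaultdict


def find_link_farm(out_links, seed_threshold=3, expand_threshold=3):
    """Return the set of pages flagged as likely link farm members.

    out_links maps each page to the set of pages it links to.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Invert the graph so each page's incoming links are available.
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)

    pages = set(out_links) | set(in_links)

    # Seed set: pages whose incoming and outgoing link sets share
    # at least seed_threshold pages.
    flagged = {
        p for p in pages
        if len(in_links[p] & out_links.get(p, set())) >= seed_threshold
    }

    # Expansion (assumed rule): repeatedly flag any page that links to
    # at least expand_threshold already-flagged pages.
    changed = True
    while changed:
        changed = False
        for p in pages - flagged:
            if len(out_links.get(p, set()) & flagged) >= expand_threshold:
                flagged.add(p)
                changed = True
    return flagged


def reweight(out_links, flagged, penalty=0.0):
    """Assign a weight to every link; links between two flagged pages
    receive `penalty` (an assumed value), all other links keep weight 1.0."""
    weights = {}
    for src, targets in out_links.items():
        for dst in targets:
            both_flagged = src in flagged and dst in flagged
            weights[(src, dst)] = penalty if both_flagged else 1.0
    return weights


# Hypothetical toy graph: a, b, c, d link to one another; e links into
# the group without being linked back; f is an ordinary page.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "b", "c"},
    "f": {"a", "g"},
}
farm = find_link_farm(graph)
link_weights = reweight(graph, farm)
```

The resulting `link_weights` dictionary stands in for the modified web graph, which could then be passed to a weighted link-based ranking algorithm in place of the original uniform link weights.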