Published in: Discover Computing 6/2017

10-05-2017

A unified score propagation model for web spam demotion algorithm

Authors: Xu Zhuang, Yan Zhu, Chin-Chen Chang, Qiang Peng, Faisal Khurshid

Abstract

Web spam pages exploit the biases of search engine algorithms to obtain rankings higher than they deserve in search results by using several types of spamming techniques. Many web spam demotion algorithms have been developed to combat spam by using the web link structure, from which the goodness or badness score of each web page is evaluated. Those scores are then used to identify spam pages or to penalize their rankings in search engine results. However, most published spam demotion algorithms offer only very limited improvements over their base models and remain vulnerable to some common score manipulation methods. The lack of a general framework for this field makes designing high-performance spam demotion algorithms very inefficient. In this paper, we propose a unified score propagation model for web spam demotion algorithms by abstracting the score propagation process of relevant models into a forward score propagation function and a backward score propagation function, each of which can be further expressed as three sub-functions: a splitting function, an accepting function and a combination function. On the basis of the proposed model, we develop two new web spam demotion algorithms named Supervised Forward and Backward score Ranking (SFBR) and Unsupervised Forward and Backward score Ranking (UFBR). Our experiments, conducted on three large-scale public datasets, show that (1) SFBR is very robust and apparently outperforms other algorithms and (2) UFBR can obtain results comparable to some well-known supervised algorithms in the spam demotion task even though UFBR is unsupervised.
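
The decomposition described in the abstract can be illustrated with a small sketch. This page does not reproduce the paper's actual function definitions, so everything below is an assumption made for illustration: the uniform splitting rule, the accept-everything rule and the damped combination rule are borrowed from the familiar PageRank/TrustRank style of propagation, and the names `split_uniform`, `accept_all`, `combine_damped`, `propagate` and `reverse` are hypothetical. It is not SFBR or UFBR. Forward propagation here pushes a goodness score from a trusted seed along out-links; backward propagation pushes a badness score from a spam seed along reversed links.

```python
from collections import defaultdict


def split_uniform(score, out_degree):
    """Splitting (assumed rule): divide a page's score evenly over its out-links."""
    return score / out_degree if out_degree else 0.0


def accept_all(incoming):
    """Accepting (assumed rule): take every incoming share unchanged."""
    return sum(incoming)


def combine_damped(accepted, seed, damping=0.85):
    """Combination (assumed rule): mix the accepted score with the page's seed score."""
    return damping * accepted + (1.0 - damping) * seed


def propagate(edges, seeds, split, accept, combine, iterations=20):
    """Propagate scores from each src to each dst of `edges` for a fixed number of rounds."""
    nodes = set(seeds)
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
        nodes.update((src, dst))
    scores = {n: seeds.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        inbox = defaultdict(list)
        for src, targets in out_links.items():
            share = split(scores[src], len(targets))
            for dst in targets:
                inbox[dst].append(share)
        scores = {n: combine(accept(inbox[n]), seeds.get(n, 0.0)) for n in nodes}
    return scores


def reverse(edges):
    """Flip every link so that scores travel from a page to the pages linking to it."""
    return [(dst, src) for src, dst in edges]


# Toy chain: a trusted page "g" links to "a", "a" to "b", and "b" to a spam page "s".
edges = [("g", "a"), ("a", "b"), ("b", "s")]

# Forward pass: goodness spreads from the trusted seed along out-links.
good = propagate(edges, {"g": 1.0}, split_uniform, accept_all, combine_damped)

# Backward pass: badness spreads from the spam seed to the pages that link toward it.
bad = propagate(reverse(edges), {"s": 1.0}, split_uniform, accept_all, combine_damped)
```

Because each of the three sub-functions is passed in as a parameter, swapping in a different splitting, accepting or combination rule changes the algorithm without touching the propagation loop, which is the kind of reuse a unified model is meant to allow.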

Footnotes
1. A web spam demotion algorithm (WSDA) is termed inconsistent if in-system scores are lost or out-system scores are brought in during the score propagation process.

Metadata
Title
A unified score propagation model for web spam demotion algorithm
Authors
Xu Zhuang
Yan Zhu
Chin-Chen Chang
Qiang Peng
Faisal Khurshid
Publication date
10-05-2017
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 6/2017
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-017-9307-9