Published in: Discover Computing 4/2018

29-11-2017

Website replica detection with distant supervision

Authors: Cristiano Carvalho, Edleno Silva de Moura, Adriano Veloso, Nivio Ziviani



Abstract

Duplicate content on the Web occurs within the same website or across multiple websites. The latter is mainly associated with website replicas, that is, sites that are perceptibly similar. Replication may be accidental, intentional or malicious, but whatever the reason, search engines suffer greatly, either from unnecessarily storing and moving duplicate data or from returning search results that offer no real value to users. In this paper, we model the detection of website replicas as a pairwise classification problem with distant supervision. Obvious replica and non-replica cases are easy to find heuristically, but learning effective classifiers requires a representative set of non-obvious labeled examples, which are hard to obtain. We employ efficient Expectation-Maximization (EM) algorithms to find non-obvious examples starting from obvious ones, enlarging the training set and improving the classifiers iteratively. Our classifiers employ association rules and can therefore be updated incrementally as the EM process iterates, which makes our algorithms time-efficient. Experiments show that: (1) replicas are fully eliminated at a false-positive rate lower than 0.005, yielding a 19% reduction in the number of duplicate URLs, (2) the reduction increases to 21% when our site-level algorithms are used in conjunction with existing URL-level algorithms, and (3) our classifiers are more than two orders of magnitude faster than alternative semi-supervised solutions.
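
The iterative idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature items (`ip=same`, `content=high`, `paths=high`), the single-item rules, the averaging of rule confidences, and the 0.65 threshold are all assumptions chosen for illustration. The loop is a hard-assignment, EM-style procedure: it scores unlabeled site pairs with the current rules, moves the confident ones into the training set, and updates the rule counts incrementally instead of retraining from scratch.

```python
from collections import Counter


class RuleClassifier:
    """Toy associative classifier built from single-item rules `item -> label`."""

    def __init__(self):
        self.item_label = Counter()  # (item, label) -> number of training pairs
        self.item = Counter()        # item -> number of training pairs

    def update(self, items, label):
        # Incremental update: only counts change, nothing is retrained.
        for it in items:
            self.item_label[(it, label)] += 1
            self.item[it] += 1

    def replica_probability(self, items):
        # Average confidence of the matching rules `item -> replica`.
        confs = [self.item_label[(it, "replica")] / self.item[it]
                 for it in items if self.item[it] > 0]
        return sum(confs) / len(confs) if confs else 0.5


def em_expand(seed, unlabeled, threshold=0.65, max_iters=10):
    """Grow the training set from obvious (seed) pairs to non-obvious ones."""
    clf = RuleClassifier()
    for items, label in seed:
        clf.update(items, label)

    pending, labeled = list(unlabeled), []
    for _ in range(max_iters):
        newly, still_pending = [], []
        for items in pending:
            p = clf.replica_probability(items)      # E-step: score the pair
            if p >= threshold:
                newly.append((items, "replica"))
            elif p <= 1 - threshold:
                newly.append((items, "non-replica"))
            else:
                still_pending.append(items)
        if not newly:
            break                                   # nothing confident left
        for items, label in newly:                  # M-step: update the rules
            clf.update(items, label)
        labeled.extend(newly)
        pending = still_pending
    return clf, labeled, pending


# Obvious cases found heuristically (hypothetical discretized features).
seed = [
    (frozenset({"ip=same", "content=high", "paths=high"}), "replica"),
    (frozenset({"ip=same", "content=high", "paths=high"}), "replica"),
    (frozenset({"ip=diff", "content=low", "paths=low"}), "non-replica"),
    (frozenset({"ip=diff", "content=low", "paths=low"}), "non-replica"),
]
# Non-obvious candidate pairs with no label.
unlabeled = [
    frozenset({"ip=diff", "content=high", "paths=high"}),
    frozenset({"ip=same", "content=low", "paths=low"}),
    frozenset({"ip=diff", "content=high"}),
]

clf, labeled, pending = em_expand(seed, unlabeled)
for items, label in labeled:
    print(sorted(items), "->", label)
```

In this toy run the third candidate is too ambiguous to label in the first pass; it becomes confidently labeled only after the first candidate (a replica hosted on a different IP) has been added to the training set, which is the kind of enlargement from obvious to non-obvious examples the abstract refers to.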


Footnotes
1
The terms “duplicate” and “near-duplicate” are used interchangeably.
 
3
According to an October 2014 survey from netcraft.com.
 
4
An Expectation-Maximization algorithm is a general approach to iterative computation of maximum-likelihood estimates when the observations can be viewed as incomplete data.
 
5
These numbers were obtained by considering only replica candidates, and not all possible pairs of websites in the collection.
 
6
These labels are only used to evaluate the classifiers, and are not used to learn them.
 
7
Due to the large number of candidates, we cannot guarantee the absence of false negatives, but we can guarantee the absence of false positives.
 
Metadata
Title
Website replica detection with distant supervision
Authors
Cristiano Carvalho
Edleno Silva de Moura
Adriano Veloso
Nivio Ziviani
Publication date
29-11-2017
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 4/2018
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-017-9320-z
