Discover Computing

Published: 29-11-2017

Website replica detection with distant supervision

Authors: Cristiano Carvalho, Edleno Silva de Moura, Adriano Veloso, Nivio Ziviani

Published in: Discover Computing | Issue 4/2018

Abstract

Duplicate content on the Web occurs within the same website or across multiple websites. The latter is mainly associated with website replicas, that is, sites that are perceptibly similar. Replication may be accidental, intentional or malicious, but whatever the reason, search engines suffer greatly, either from unnecessarily storing and moving duplicate data or from returning search results that offer no real value to users. In this paper, we model the detection of website replicas as a pairwise classification problem with distant supervision. That is, finding obvious replica and non-replica cases heuristically is trivial, but learning effective classifiers requires a representative set of non-obvious labeled examples, which are hard to obtain. We employ efficient Expectation-Maximization (EM) algorithms to find non-obvious examples from obvious ones, enlarging the training set and improving the classifiers iteratively. Our classifiers are based on association rules, so they can be updated incrementally as the EM process iterates, which makes our algorithms time-efficient. Experiments show that: (1) replicas are fully eliminated at a false-positive rate lower than 0.005, yielding a 19% reduction in the number of duplicate URLs, (2) the reduction increases to 21% when our site-level algorithms are used in conjunction with existing URL-level algorithms, and (3) our classifiers are more than two orders of magnitude faster than alternative semi-supervised solutions.
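The iterative idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two pair features (`path_overlap`, `content_sim`), the seed thresholds, the `margin` parameter and the nearest-centroid classifier are all assumptions made for illustration, and the classifier stands in for the association-rule classifiers used in the paper. The loop is a hard-EM-style self-training scheme: obvious pairs are labeled heuristically, a classifier is fitted on them, confidently classified pairs are added to the training set, and the process repeats until nothing new is added.

```python
from statistics import mean

def seed_labels(pairs, hi=0.9, lo=0.1):
    """Distant supervision: label only the obvious pairs (thresholds are assumed)."""
    labeled = {}
    for i, (path_overlap, content_sim) in enumerate(pairs):
        if path_overlap >= hi and content_sim >= hi:
            labeled[i] = 1  # obvious replica
        elif path_overlap <= lo and content_sim <= lo:
            labeled[i] = 0  # obvious non-replica
    return labeled

def centroid(points):
    """Component-wise mean of a non-empty list of feature tuples."""
    return tuple(mean(dim) for dim in zip(*points))

def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def em_self_training(pairs, margin=0.2, max_iter=20):
    """Grow the training set from obvious examples to less obvious ones."""
    labeled = seed_labels(pairs)
    if set(labeled.values()) != {0, 1}:
        raise ValueError("need at least one obvious example of each class")
    for _ in range(max_iter):
        # M-step analogue: refit the (stand-in) classifier on the current labels.
        pos = centroid([pairs[i] for i, y in labeled.items() if y == 1])
        neg = centroid([pairs[i] for i, y in labeled.items() if y == 0])
        added = False
        # E-step analogue: label unlabeled pairs the classifier is confident about.
        for i, p in enumerate(pairs):
            if i in labeled:
                continue
            d_pos, d_neg = dist2(p, pos), dist2(p, neg)
            if abs(d_pos - d_neg) >= margin:
                labeled[i] = 1 if d_pos < d_neg else 0
                added = True
        if not added:
            break  # converged: no new examples were added
    return labeled

# Hypothetical candidate site pairs: (path_overlap, content_sim).
candidates = [(0.95, 0.97), (0.92, 0.93), (0.05, 0.02), (0.08, 0.06),
              (0.70, 0.80), (0.30, 0.20), (0.50, 0.50)]
labels = em_self_training(candidates)
```

In this toy run the first four pairs are seeded heuristically, the fifth and sixth are picked up in the first iteration, and the ambiguous last pair stays unlabeled because neither centroid is clearly closer.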

The terms “duplicate” and “near-duplicate” are used interchangeably.

https://blog.seoprofiler.com/matt-cutts/.

According to an October 2014 survey from netcraft.com.

An Expectation-Maximization algorithm is a general approach to iterative computation of maximum-likelihood estimates when the observations can be viewed as incomplete data.

These numbers were obtained by considering only replica candidates, and not all possible pairs of websites in the collection.

These labels are only used to evaluate the classifiers, and are not used to learn them.

Due to the large number of candidates, we cannot guarantee the absence of false negatives, but we do guarantee the absence of false positives.

Title: Website replica detection with distant supervision
Authors: Cristiano Carvalho, Edleno Silva de Moura, Adriano Veloso, Nivio Ziviani
Publication date: 29-11-2017
Publisher: Springer Netherlands
Published in: Discover Computing / Issue 4/2018
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-017-9320-z
