
2018 | OriginalPaper | Chapter

Hadoop Based Parallel Deduplication Method for Web Documents

Authors: Junjie Song, Jin Liu, Yuhui Zheng

Published in: Advances in Computer Science and Ubiquitous Computing

Publisher: Springer Singapore


Abstract

This paper proposes a method for removing duplicate web pages using tf-idf and a splay tree. Pages that may be duplicates of one another are first grouped according to keywords extracted by TextRank, and the pages within each group are then compared using the tf-idf and splay tree method. Three MapReduce jobs are used to compute tf-idf and remove the duplicate pages. The experimental results show that the algorithm removes duplicate web pages efficiently and accurately.
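The abstract does not spell out how the three jobs are divided, so the following is only a minimal single-machine sketch of the general idea, not the authors' implementation. It assumes a common three-pass decomposition of tf-idf (term counts per document, document frequency per term, then the joined weights), a smoothed idf formula, and cosine similarity with a hypothetical threshold of 0.9 as the duplicate test. The TextRank grouping step and the splay tree are omitted, and all function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption); a real pipeline would
    # strip HTML markup and stop words first.
    return text.lower().split()


def tfidf_vectors(docs):
    """docs: dict of doc_id -> text. Returns doc_id -> {term: weight}."""
    # Pass 1: term counts per document
    # (map: emit ((doc, term), 1); reduce: sum).
    term_counts = {d: Counter(tokenize(t)) for d, t in docs.items()}

    # Pass 2: number of documents containing each term
    # (map: emit (term, doc); reduce: count distinct docs).
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    # Pass 3: join tf with df and emit the tf-idf weight per (doc, term).
    n_docs = len(docs)
    vectors = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        vectors[doc_id] = {
            # Smoothed idf (assumption): log((1 + N) / (1 + df)) + 1.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def deduplicate(docs, threshold=0.9):
    """Keep a page only if it is not too similar to any page kept so far."""
    vectors = tfidf_vectors(docs)
    kept = []
    for doc_id in docs:
        if all(cosine(vectors[doc_id], vectors[k]) < threshold for k in kept):
            kept.append(doc_id)
    return kept
```

Calling `deduplicate` on a dict of page texts returns the ids of the pages that survive, in input order. On a cluster, each pass would plausibly become its own MapReduce job, and the pairwise comparison would run only within a keyword group rather than over all pages.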


Metadata

Title: Hadoop Based Parallel Deduplication Method for Web Documents
Authors: Junjie Song, Jin Liu, Yuhui Zheng
Copyright Year: 2018
Publisher: Springer Singapore
DOI: https://doi.org/10.1007/978-981-10-7605-3_82