2021 | OriginalPaper | Book chapter

A New Shingling Similar Text Detection Algorithm

Written by: Peng Li, Tianling Qiao, Yongxing Guang, Lan Zhang

Published in: Advances in Simulation and Process Modelling

Publisher: Springer Singapore

Abstract

When the traditional Shingling algorithm is used to remove duplicate text from Chinese web pages, the amount of data to be processed is particularly large, so the algorithm is inefficient. In this paper, a new duplicate-text detection algorithm is constructed by modifying the traditional Shingling algorithm. First, meaningless function words (dummy words) are deleted from the text. Then the text is segmented according to semantics. Finally, a text similarity formula is used to calculate the similarity between texts. The work focuses on removing meaningless words from the text. In this way, the speed of the similarity calculation can be improved, along with the accuracy and the recall (complete query rate) of text queries. The simulation results show that the algorithm is simple and feasible, calculates text similarity well, and has certain advantages.
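
The three steps named in the abstract (drop meaningless words, segment the text, compute a similarity score) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list `STOP_WORDS`, the whitespace tokenizer standing in for semantic segmentation of Chinese text, the shingle width `w`, and the use of the Jaccard coefficient as the similarity formula are all assumptions made for the example.

```python
from typing import List, Set, Tuple

# Hypothetical stop-word list; the chapter's actual list is not given in the abstract.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}

Shingle = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace splitting stands in for semantic word segmentation.
    return text.lower().split()


def remove_stop_words(tokens: List[str]) -> List[str]:
    # Drop words that carry little meaning before building shingles.
    return [t for t in tokens if t not in STOP_WORDS]


def shingles(tokens: List[str], w: int = 2) -> Set[Shingle]:
    # A shingle is a run of w consecutive tokens; short texts yield one shingle.
    if not tokens:
        return set()
    if len(tokens) < w:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    # Jaccard coefficient: size of the intersection over size of the union.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a: str, text_b: str, w: int = 2) -> float:
    # Full pipeline: tokenize, filter stop words, shingle, compare.
    sa = shingles(remove_stop_words(tokenize(text_a)), w)
    sb = shingles(remove_stop_words(tokenize(text_b)), w)
    return jaccard(sa, sb)
```

Filtering stop words before shingling shrinks each shingle set, which is where the abstract's claimed speed-up would come from. In this sketch, `similarity("the cat sat on the mat", "a cat sat on a mat")` evaluates to 1.0, because the two texts differ only in words that are filtered out.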

Metadata
Title
A New Shingling Similar Text Detection Algorithm
Written by
Peng Li
Tianling Qiao
Yongxing Guang
Lan Zhang
Copyright year
2021
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-33-4575-1_9