2021 | OriginalPaper | Book chapter

A New Shingling Similar Text Detection Algorithm

Authors: Peng Li, Tianling Qiao, Yongxing Guang, Lan Zhang

Published in: Advances in Simulation and Process Modelling

Publisher: Springer Singapore

Abstract

When the traditional Shingling algorithm is used to remove duplicate text from Chinese web pages, the amount of data to be processed is particularly large and the efficiency of the algorithm is low. This paper constructs a new duplicate-text detection algorithm by modifying the traditional Shingling algorithm. First, meaningless function words ("dummy words") are deleted from the text. The text is then segmented according to its semantics. Finally, a text similarity formula is used to compute the similarity between texts. The key step is the removal of meaningless words, which speeds up the similarity calculation and improves both the precision and the recall (the "complete query rate") of text queries. Simulation results show that the algorithm is simple and feasible, computes text similarity effectively, and has certain advantages.
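
The three steps named in the abstract (stop-word removal, segmentation, similarity calculation) can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the abstract does not give the similarity formula, so the sketch assumes the classic Shingling resemblance measure (Jaccard similarity over sets of w-shingles). The stop-word list `STOP_WORDS`, the shingle width `w`, and the use of whitespace splitting in place of a real Chinese semantic segmenter are all hypothetical choices made for illustration.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical stop-word list; the chapter's actual "dummy word" list is not given.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"the", "a", "an", "of", "is", "in", "and", "to"}
)

Shingle = Tuple[str, ...]


def remove_stop_words(
    tokens: Iterable[str], stop_words: FrozenSet[str] = STOP_WORDS
) -> List[str]:
    """Drop tokens that carry no content (step 1 in the abstract)."""
    return [t for t in tokens if t.lower() not in stop_words]


def shingles(tokens: List[str], w: int = 2) -> Set[Shingle]:
    """Return the set of contiguous w-token shingles of a token list."""
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Resemblance |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a: str, text_b: str, w: int = 2) -> float:
    """Stop-word removal, then shingling, then Jaccard resemblance.

    Whitespace splitting stands in for the semantic segmentation step;
    Chinese text would need a proper word segmenter instead.
    """
    set_a = shingles(remove_stop_words(text_a.split()), w)
    set_b = shingles(remove_stop_words(text_b.split()), w)
    return jaccard(set_a, set_b)
```

With these assumptions, `similarity("the cat sat on the mat", "a cat sat on a mat")` returns 1.0, because both texts reduce to the same content tokens once stop words are removed, while `similarity("cat sat on mat", "cat sat on rug")` returns 0.5 (two shared shingles out of four distinct ones). Removing stop words before shingling also shrinks the shingle sets, which is the efficiency gain the abstract points to.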

Title: A New Shingling Similar Text Detection Algorithm
Authors: Peng Li, Tianling Qiao, Yongxing Guang, Lan Zhang
Publisher: Springer Singapore
Book: Advances in Simulation and Process Modelling
Print ISBN: 978-981-334-574-4
Electronic ISBN: 978-981-334-575-1
Copyright year: 2021
DOI: https://doi.org/10.1007/978-981-33-4575-1_9
