Skip to main content
Top
Published in:

27-04-2024

TS-Finder: privacy enhanced web crawler detection model using temporal–spatial access behaviors

Authors: Jing Zhao, Rui Chen, Pengcheng Fan

Published in: The Journal of Supercomputing | Issue 12/2024

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Web crawler detection is critical for preventing unauthorized extraction of valuable information from websites. Current methods rely on heuristics, leading to time-consuming processes and inability to detect novel crawlers. Privacy protection and communication burdens during training are overlooked, resulting in potential privacy leaks. To address these issues, we propose a federated deep learning crawler detection model that analyzes access behaviors while preserving privacy. First, individual clients locally host website data, while the central server aggregates information for detection model parameters, eliminating raw user data transmission or access. We then develop an innovative algorithm constructing access path trees from user logs, effectively extracting temporal and spatial behavior features. Additionally, we propose a novel time series model with fused additive attention, enabling effective web crawler detection while preserving privacy and reducing data transmission. Finally, comprehensive evaluations on public datasets demonstrate robust privacy protection and effective detection of emerging crawler types.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Acien A, Morales A, Fierrez J et al (2022) BeCAPTCHA-Mouse: synthetic mouse trajectories and improved bot detection. Pattern Recognit 127:108643CrossRef Acien A, Morales A, Fierrez J et al (2022) BeCAPTCHA-Mouse: synthetic mouse trajectories and improved bot detection. Pattern Recognit 127:108643CrossRef
4.
go back to reference Chen G, Chen P, Shi Y et al (2019) Rethinking the usage of batch normalization and dropout in the training of deep neural networks. arxiv 2019. arXiv preprint arXiv:1905.05928 Chen G, Chen P, Shi Y et al (2019) Rethinking the usage of batch normalization and dropout in the training of deep neural networks. arxiv 2019. arXiv preprint arXiv:​1905.​05928
5.
go back to reference Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:​1406.​1078
6.
go back to reference Chu Z, Gianvecchio S, Wang H (2018) Bot or human? a behavior-based online bot detection system. In: From database to cyber security. Springer, pp 432–449 Chu Z, Gianvecchio S, Wang H (2018) Bot or human? a behavior-based online bot detection system. In: From database to cyber security. Springer, pp 432–449
8.
go back to reference Devlin J, Chang MW, Lee K et al (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 Devlin J, Chang MW, Lee K et al (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:​1810.​04805
9.
go back to reference Dey R, Salem FM (2017) Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, pp 1597–1600 Dey R, Salem FM (2017) Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, pp 1597–1600
10.
go back to reference Doran D, Gokhale SS (2016) An integrated method for real time and offline web robot detection. Expert Syst 33(6):592–606CrossRef Doran D, Gokhale SS (2016) An integrated method for real time and offline web robot detection. Expert Syst 33(6):592–606CrossRef
11.
go back to reference Eswaran S, Rani V, Ramakrishnan J et al (2022) An enhanced network intrusion detection system for malicious crawler detection and security event correlations in ubiquitous banking infrastructure. Int J Pervasive Comput Commun 18(1):59–78CrossRef Eswaran S, Rani V, Ramakrishnan J et al (2022) An enhanced network intrusion detection system for malicious crawler detection and security event correlations in ubiquitous banking infrastructure. Int J Pervasive Comput Commun 18(1):59–78CrossRef
12.
go back to reference Gao Y, Feng Z, Wang X et al (2023) Reinforcement learning based web crawler detection for diversity and dynamics. Neurocomputing 520:115–128CrossRef Gao Y, Feng Z, Wang X et al (2023) Reinforcement learning based web crawler detection for diversity and dynamics. Neurocomputing 520:115–128CrossRef
13.
go back to reference Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780CrossRef Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780CrossRef
14.
go back to reference Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. PMLR, pp 448–456 Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. PMLR, pp 448–456
15.
go back to reference Joulin A, Cissé M, Grangier D et al (2017) Efficient softmax approximation for GPUs. In: International Conference on Machine Learning. PMLR, pp 1302–1310 Joulin A, Cissé M, Grangier D et al (2017) Efficient softmax approximation for GPUs. In: International Conference on Machine Learning. PMLR, pp 1302–1310
16.
go back to reference Kayan H, Nunes M, Rana O et al (2022) Cybersecurity of industrial cyber-physical systems: a review. ACM Comput Surv (CSUR) 54(11s):1–35CrossRef Kayan H, Nunes M, Rana O et al (2022) Cybersecurity of industrial cyber-physical systems: a review. ACM Comput Surv (CSUR) 54(11s):1–35CrossRef
17.
go back to reference Kwak N, Choi CH, Choi JY (2001) Feature extraction using ICA. In: International Conference on Artificial Neural Networks. Springer, pp 568–573 Kwak N, Choi CH, Choi JY (2001) Feature extraction using ICA. In: International Conference on Artificial Neural Networks. Springer, pp 568–573
18.
go back to reference Lagopoulos A, Tsoumakas G (2020) Content-aware web robot detection. Appl Intell 50(11):4017–4028CrossRef Lagopoulos A, Tsoumakas G (2020) Content-aware web robot detection. Appl Intell 50(11):4017–4028CrossRef
19.
go back to reference Lan Z, Chen M, Goodman S et al (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 Lan Z, Chen M, Goodman S et al (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:​1909.​11942
20.
go back to reference Li S, Lee CH, Eun DY (2020) Trapping malicious crawlers in social networks. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp 775–784 Li S, Lee CH, Eun DY (2020) Trapping malicious crawlers in social networks. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp 775–784
21.
go back to reference Li X, Azad BA, Rahmati A et al (2021) Good bot, bad bot: Characterizing automated browsing activity. In: 2021 IEEE Symposium on Security and Privacy (sp). IEEE, pp 1589–1605 Li X, Azad BA, Rahmati A et al (2021) Good bot, bad bot: Characterizing automated browsing activity. In: 2021 IEEE Symposium on Security and Privacy (sp). IEEE, pp 1589–1605
22.
go back to reference Lu WZ, Yu SZ (2006) Web robot detection based on hidden Markov model. In: 2006 International Conference on Communications, Circuits and Systems. IEEE, pp 1806–1810 Lu WZ, Yu SZ (2006) Web robot detection based on hidden Markov model. In: 2006 International Conference on Communications, Circuits and Systems. IEEE, pp 1806–1810
23.
go back to reference McMahan B, Moore E, Ramage D et al (2017) Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. PMLR, pp 1273–1282 McMahan B, Moore E, Ramage D et al (2017) Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. PMLR, pp 1273–1282
24.
go back to reference Menshchikov A, Komarova A, Gatchin Y et al (2017) A study of different web-crawler behaviour. In: 2017 20th Conference of Open Innovations Association (FRUCT). IEEE, pp 268–274 Menshchikov A, Komarova A, Gatchin Y et al (2017) A study of different web-crawler behaviour. In: 2017 20th Conference of Open Innovations Association (FRUCT). IEEE, pp 268–274
26.
go back to reference Radford A, Wu J, Child R et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford A, Wu J, Child R et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
27.
go back to reference Rahman RU, Tomar DS (2020) New biostatistics features for detecting web bot activity on web applications. Comput Secur 97:102001CrossRef Rahman RU, Tomar DS (2020) New biostatistics features for detecting web bot activity on web applications. Comput Secur 97:102001CrossRef
30.
go back to reference Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681CrossRef Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681CrossRef
31.
go back to reference Shkapenyuk V, Suel T (2002) Design and implementation of a high-performance distributed web crawler. In: Proceedings 18th International Conference on Data Engineering. IEEE, pp 357–368 Shkapenyuk V, Suel T (2002) Design and implementation of a high-performance distributed web crawler. In: Proceedings 18th International Conference on Data Engineering. IEEE, pp 357–368
32.
go back to reference Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958MathSciNet Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958MathSciNet
33.
go back to reference Suchacka G, Motyka I (2018) Efficiency analysis of resource request patterns in classification of web robots and humans. In: ECMS, pp 475–481 Suchacka G, Motyka I (2018) Efficiency analysis of resource request patterns in classification of web robots and humans. In: ECMS, pp 475–481
34.
go back to reference Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, vol 27 Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, vol 27
35.
go back to reference Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30 Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30
36.
go back to reference Wan S, Li Y, Sun K (2017) Protecting web contents against persistent distributed crawlers. In: 2017 IEEE International Conference on Communications (ICC). IEEE, pp 1–6 Wan S, Li Y, Sun K (2017) Protecting web contents against persistent distributed crawlers. In: 2017 IEEE International Conference on Communications (ICC). IEEE, pp 1–6
37.
go back to reference Xia W, Zhao F, Wang H et al (2021) Crawler detection in location-based services using attributed action net. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp 4234–4242 Xia W, Zhao F, Wang H et al (2021) Crawler detection in location-based services using attributed action net. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp 4234–4242
38.
go back to reference Yu L, Li Y, Zeng Q et al (2020) Summary of web crawler technology research. In: Journal of Physics: Conference Series. IOP Publishing, p 012036 Yu L, Li Y, Zeng Q et al (2020) Summary of web crawler technology research. In: Journal of Physics: Conference Series. IOP Publishing, p 012036
40.
go back to reference Zhuang Z, Kong X, Elke R et al (2019) Attributed sequence embedding. In: 2019 IEEE International Conference on Big Data (big data). IEEE, pp 1723–1728 Zhuang Z, Kong X, Elke R et al (2019) Attributed sequence embedding. In: 2019 IEEE International Conference on Big Data (big data). IEEE, pp 1723–1728
Metadata
Title
TS-Finder: privacy enhanced web crawler detection model using temporal–spatial access behaviors
Authors
Jing Zhao
Rui Chen
Pengcheng Fan
Publication date
27-04-2024
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 12/2024
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-024-06133-6

Premium Partner