Skip to main content
Erschienen in:

27.04.2024

TS-Finder: privacy enhanced web crawler detection model using temporal–spatial access behaviors

verfasst von: Jing Zhao, Rui Chen, Pengcheng Fan

Erschienen in: The Journal of Supercomputing | Ausgabe 12/2024

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Web crawler detection is critical for preventing unauthorized extraction of valuable information from websites. Current methods rely on heuristics, leading to time-consuming processes and inability to detect novel crawlers. Privacy protection and communication burdens during training are overlooked, resulting in potential privacy leaks. To address these issues, we propose a federated deep learning crawler detection model that analyzes access behaviors while preserving privacy. First, individual clients locally host website data, while the central server aggregates information for detection model parameters, eliminating raw user data transmission or access. We then develop an innovative algorithm constructing access path trees from user logs, effectively extracting temporal and spatial behavior features. Additionally, we propose a novel time series model with fused additive attention, enabling effective web crawler detection while preserving privacy and reducing data transmission. Finally, comprehensive evaluations on public datasets demonstrate robust privacy protection and effective detection of emerging crawler types.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Acien A, Morales A, Fierrez J et al (2022) BeCAPTCHA-Mouse: synthetic mouse trajectories and improved bot detection. Pattern Recognit 127:108643CrossRef Acien A, Morales A, Fierrez J et al (2022) BeCAPTCHA-Mouse: synthetic mouse trajectories and improved bot detection. Pattern Recognit 127:108643CrossRef
4.
Zurück zum Zitat Chen G, Chen P, Shi Y et al (2019) Rethinking the usage of batch normalization and dropout in the training of deep neural networks. arxiv 2019. arXiv preprint arXiv:1905.05928 Chen G, Chen P, Shi Y et al (2019) Rethinking the usage of batch normalization and dropout in the training of deep neural networks. arxiv 2019. arXiv preprint arXiv:​1905.​05928
5.
Zurück zum Zitat Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:​1406.​1078
6.
Zurück zum Zitat Chu Z, Gianvecchio S, Wang H (2018) Bot or human? a behavior-based online bot detection system. In: From database to cyber security. Springer, pp 432–449 Chu Z, Gianvecchio S, Wang H (2018) Bot or human? a behavior-based online bot detection system. In: From database to cyber security. Springer, pp 432–449
8.
Zurück zum Zitat Devlin J, Chang MW, Lee K et al (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 Devlin J, Chang MW, Lee K et al (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:​1810.​04805
9.
Zurück zum Zitat Dey R, Salem FM (2017) Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, pp 1597–1600 Dey R, Salem FM (2017) Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, pp 1597–1600
10.
Zurück zum Zitat Doran D, Gokhale SS (2016) An integrated method for real time and offline web robot detection. Expert Syst 33(6):592–606CrossRef Doran D, Gokhale SS (2016) An integrated method for real time and offline web robot detection. Expert Syst 33(6):592–606CrossRef
11.
Zurück zum Zitat Eswaran S, Rani V, Ramakrishnan J et al (2022) An enhanced network intrusion detection system for malicious crawler detection and security event correlations in ubiquitous banking infrastructure. Int J Pervasive Comput Commun 18(1):59–78CrossRef Eswaran S, Rani V, Ramakrishnan J et al (2022) An enhanced network intrusion detection system for malicious crawler detection and security event correlations in ubiquitous banking infrastructure. Int J Pervasive Comput Commun 18(1):59–78CrossRef
12.
Zurück zum Zitat Gao Y, Feng Z, Wang X et al (2023) Reinforcement learning based web crawler detection for diversity and dynamics. Neurocomputing 520:115–128CrossRef Gao Y, Feng Z, Wang X et al (2023) Reinforcement learning based web crawler detection for diversity and dynamics. Neurocomputing 520:115–128CrossRef
13.
Zurück zum Zitat Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780CrossRef Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780CrossRef
14.
Zurück zum Zitat Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. PMLR, pp 448–456 Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. PMLR, pp 448–456
15.
Zurück zum Zitat Joulin A, Cissé M, Grangier D et al (2017) Efficient softmax approximation for GPUs. In: International Conference on Machine Learning. PMLR, pp 1302–1310 Joulin A, Cissé M, Grangier D et al (2017) Efficient softmax approximation for GPUs. In: International Conference on Machine Learning. PMLR, pp 1302–1310
16.
Zurück zum Zitat Kayan H, Nunes M, Rana O et al (2022) Cybersecurity of industrial cyber-physical systems: a review. ACM Comput Surv (CSUR) 54(11s):1–35CrossRef Kayan H, Nunes M, Rana O et al (2022) Cybersecurity of industrial cyber-physical systems: a review. ACM Comput Surv (CSUR) 54(11s):1–35CrossRef
17.
Zurück zum Zitat Kwak N, Choi CH, Choi JY (2001) Feature extraction using ICA. In: International Conference on Artificial Neural Networks. Springer, pp 568–573 Kwak N, Choi CH, Choi JY (2001) Feature extraction using ICA. In: International Conference on Artificial Neural Networks. Springer, pp 568–573
18.
Zurück zum Zitat Lagopoulos A, Tsoumakas G (2020) Content-aware web robot detection. Appl Intell 50(11):4017–4028CrossRef Lagopoulos A, Tsoumakas G (2020) Content-aware web robot detection. Appl Intell 50(11):4017–4028CrossRef
19.
Zurück zum Zitat Lan Z, Chen M, Goodman S et al (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 Lan Z, Chen M, Goodman S et al (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:​1909.​11942
20.
Zurück zum Zitat Li S, Lee CH, Eun DY (2020) Trapping malicious crawlers in social networks. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp 775–784 Li S, Lee CH, Eun DY (2020) Trapping malicious crawlers in social networks. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp 775–784
21.
Zurück zum Zitat Li X, Azad BA, Rahmati A et al (2021) Good bot, bad bot: Characterizing automated browsing activity. In: 2021 IEEE Symposium on Security and Privacy (sp). IEEE, pp 1589–1605 Li X, Azad BA, Rahmati A et al (2021) Good bot, bad bot: Characterizing automated browsing activity. In: 2021 IEEE Symposium on Security and Privacy (sp). IEEE, pp 1589–1605
22.
Zurück zum Zitat Lu WZ, Yu SZ (2006) Web robot detection based on hidden Markov model. In: 2006 International Conference on Communications, Circuits and Systems. IEEE, pp 1806–1810 Lu WZ, Yu SZ (2006) Web robot detection based on hidden Markov model. In: 2006 International Conference on Communications, Circuits and Systems. IEEE, pp 1806–1810
23.
Zurück zum Zitat McMahan B, Moore E, Ramage D et al (2017) Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. PMLR, pp 1273–1282 McMahan B, Moore E, Ramage D et al (2017) Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. PMLR, pp 1273–1282
24.
Zurück zum Zitat Menshchikov A, Komarova A, Gatchin Y et al (2017) A study of different web-crawler behaviour. In: 2017 20th Conference of Open Innovations Association (FRUCT). IEEE, pp 268–274 Menshchikov A, Komarova A, Gatchin Y et al (2017) A study of different web-crawler behaviour. In: 2017 20th Conference of Open Innovations Association (FRUCT). IEEE, pp 268–274
26.
Zurück zum Zitat Radford A, Wu J, Child R et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford A, Wu J, Child R et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
27.
Zurück zum Zitat Rahman RU, Tomar DS (2020) New biostatistics features for detecting web bot activity on web applications. Comput Secur 97:102001CrossRef Rahman RU, Tomar DS (2020) New biostatistics features for detecting web bot activity on web applications. Comput Secur 97:102001CrossRef
30.
Zurück zum Zitat Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681CrossRef Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681CrossRef
31.
Zurück zum Zitat Shkapenyuk V, Suel T (2002) Design and implementation of a high-performance distributed web crawler. In: Proceedings 18th International Conference on Data Engineering. IEEE, pp 357–368 Shkapenyuk V, Suel T (2002) Design and implementation of a high-performance distributed web crawler. In: Proceedings 18th International Conference on Data Engineering. IEEE, pp 357–368
32.
Zurück zum Zitat Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958MathSciNet Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958MathSciNet
33.
Zurück zum Zitat Suchacka G, Motyka I (2018) Efficiency analysis of resource request patterns in classification of web robots and humans. In: ECMS, pp 475–481 Suchacka G, Motyka I (2018) Efficiency analysis of resource request patterns in classification of web robots and humans. In: ECMS, pp 475–481
34.
Zurück zum Zitat Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, vol 27 Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, vol 27
35.
Zurück zum Zitat Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30 Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30
36.
Zurück zum Zitat Wan S, Li Y, Sun K (2017) Protecting web contents against persistent distributed crawlers. In: 2017 IEEE International Conference on Communications (ICC). IEEE, pp 1–6 Wan S, Li Y, Sun K (2017) Protecting web contents against persistent distributed crawlers. In: 2017 IEEE International Conference on Communications (ICC). IEEE, pp 1–6
37.
Zurück zum Zitat Xia W, Zhao F, Wang H et al (2021) Crawler detection in location-based services using attributed action net. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp 4234–4242 Xia W, Zhao F, Wang H et al (2021) Crawler detection in location-based services using attributed action net. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp 4234–4242
38.
Zurück zum Zitat Yu L, Li Y, Zeng Q et al (2020) Summary of web crawler technology research. In: Journal of Physics: Conference Series. IOP Publishing, p 012036 Yu L, Li Y, Zeng Q et al (2020) Summary of web crawler technology research. In: Journal of Physics: Conference Series. IOP Publishing, p 012036
39.
40.
Zurück zum Zitat Zhuang Z, Kong X, Elke R et al (2019) Attributed sequence embedding. In: 2019 IEEE International Conference on Big Data (big data). IEEE, pp 1723–1728 Zhuang Z, Kong X, Elke R et al (2019) Attributed sequence embedding. In: 2019 IEEE International Conference on Big Data (big data). IEEE, pp 1723–1728
Metadaten
Titel
TS-Finder: privacy enhanced web crawler detection model using temporal–spatial access behaviors
verfasst von
Jing Zhao
Rui Chen
Pengcheng Fan
Publikationsdatum
27.04.2024
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 12/2024
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-024-06133-6