
2019 | OriginalPaper | Book chapter

A Hybrid Approach for Recognizing Web Crawlers

Written by: Weiping Zhu, Hang Gao, Zongjian He, Jiangbo Qin, Bo Han

Published in: Wireless Algorithms, Systems, and Applications

Publisher: Springer International Publishing


Abstract

In recent years, web crawlers have been widely used to collect data from the Internet. Recognizing them accurately makes it possible to accommodate friendly crawlers while blocking malicious ones. Existing research on web crawler recognition has difficulty handling newer kinds of crawlers, such as distributed crawlers, proxy-based crawlers, and browser-engine-based crawlers. Moreover, it is non-trivial to achieve high identification accuracy and a short response time simultaneously. To tackle these issues, we propose a novel approach to web crawler recognition that combines real-time recognition based on heuristic rules with offline recognition based on machine learning. This combination addresses the problems above and improves both accuracy and efficiency. We built a website and analyzed its web access log using the proposed method. According to the results, the proposed approach achieves desirable performance in both accuracy and efficiency.
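The abstract does not list the concrete rules or features used, so the following Python sketch only illustrates the general shape of such a two-stage design: a cheap rule check that can run per request, and a feature extractor whose output could later feed an offline classifier. The `Request` record, the rule thresholds, the user-agent tokens, and the three features are all assumptions made for this illustration and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    """One access-log entry (hypothetical, simplified record)."""
    ip: str
    user_agent: str
    timestamp: float  # seconds since some reference point
    path: str


# Illustrative rule parameters; these values are assumptions, not the paper's.
KNOWN_BOT_TOKENS = ("bot", "crawler", "spider")
MAX_REQUESTS_PER_WINDOW = 20
WINDOW_SECONDS = 10.0


def realtime_verdict(request, recent):
    """Real-time stage: apply cheap heuristic rules to a single request.

    `recent` is a list of earlier Request objects. Returns "crawler" when a
    rule fires and "unknown" otherwise, leaving the case to offline analysis.
    """
    ua = request.user_agent.lower()
    # Rule 1: the user agent announces itself as a crawler.
    if any(token in ua for token in KNOWN_BOT_TOKENS):
        return "crawler"
    # Rule 2: too many requests from the same IP inside a short window.
    window_start = request.timestamp - WINDOW_SECONDS
    in_window = [
        r for r in recent
        if r.ip == request.ip and r.timestamp >= window_start
    ]
    if len(in_window) + 1 > MAX_REQUESTS_PER_WINDOW:
        return "crawler"
    return "unknown"


def session_features(requests):
    """Offline stage: summarize each IP's requests as a feature dictionary.

    The resulting features could be passed to any classifier; training and
    prediction are deliberately left out of this sketch.
    """
    by_ip = defaultdict(list)
    for r in requests:
        by_ip[r.ip].append(r)

    features = {}
    for ip, reqs in by_ip.items():
        reqs = sorted(reqs, key=lambda r: r.timestamp)
        gaps = [b.timestamp - a.timestamp for a, b in zip(reqs, reqs[1:])]
        features[ip] = {
            "count": len(reqs),
            "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
            "distinct_paths": len({r.path for r in reqs}),
        }
    return features
```

In a design like this, the rule stage keeps response time low because it inspects only one request and a short history, while the offline stage can afford richer per-session statistics. Grouping by IP alone would miss distributed and proxy-based crawlers, so a real system would need a different session key.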


Footnotes
2
Weka is a collection of machine learning algorithms for data mining tasks. Its website is https://www.cs.waikato.ac.nz/ml/weka/.
 
Metadata
Title
A Hybrid Approach for Recognizing Web Crawlers
Written by
Weiping Zhu
Hang Gao
Zongjian He
Jiangbo Qin
Bo Han
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-030-23597-0_41
