Published in: Journal of Computer Virology and Hacking Techniques 3/2021

01 April 2021 | Original Paper

Detection of crawler traps: formalization and implementation—defeating protection on internet and on the TOR network

Authors: Baptiste David, Maxence Delong, Eric Filiol

Abstract

In web security, websites strive to protect themselves against data gathering performed by automated programs called bots. Crawler traps are an efficient brake against such programs. By dynamically creating similar pages or random content, crawler traps feed fake information to the bot and make it waste time and resources. At present, no available bot is able to detect the presence of a crawler trap. Our aim was to find a generic solution to escape any type of crawler trap. Since the random generation is potentially endless, crawler trap detection can only be performed on the fly. Using machine learning, it is possible to compare datasets of webpages extracted from regular websites with those generated by crawler traps. Since machine learning requires the use of distances, we designed our system using information theory. We compared widely used distances with a new one designed to take heterogeneous data into account. Indeed, two pages do not necessarily contain the same words, and it is operationally impossible to know all possible words in advance. To solve this problem, our new distance compares two webpages, and the results showed that it is more accurate than the other tested distances. By extension, our distance has a much larger potential range than crawler trap detection alone. This opens many new possibilities in the scope of data classification and data mining.
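
To make the idea of comparing two webpages by a distance more concrete, here is a minimal Python sketch. It is an illustration only and does not reproduce the new distance proposed in the paper, which the abstract does not define. It assumes that a page is reduced to its word-frequency distribution and uses two standard measures, the Jaccard distance and the Hellinger distance. The function names, the tokenization rule and the sample pages are all hypothetical.

```python
import math
import re
from collections import Counter


def word_distribution(text):
    """Return the relative frequency of each lowercase word in `text`."""
    # Assumption: a "word" is a run of ASCII letters or digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def jaccard_distance(p, q):
    """One minus the ratio of shared words to all words seen in either page."""
    a, b = set(p), set(q)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def hellinger_distance(p, q):
    """Hellinger distance computed over the union of both vocabularies.

    A word missing from one page simply gets probability 0 for that page,
    so the full vocabulary does not have to be known in advance.
    """
    vocab = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2
        for w in vocab
    )
    return math.sqrt(squared / 2.0)


# Hypothetical sample pages: two regular pages and one random-looking page.
page_a = word_distribution("welcome to our shop see our products and prices")
page_b = word_distribution("welcome to our shop see our contact page")
page_c = word_distribution("qzx lorem vbn random tokens generated endlessly")

print("a vs b:", jaccard_distance(page_a, page_b), hellinger_distance(page_a, page_b))
print("a vs c:", jaccard_distance(page_a, page_c), hellinger_distance(page_a, page_c))
```

Both measures range from 0 (identical word usage) to 1 (no word in common). Working on the union of the two vocabularies is one simple way to handle pages that do not share the same words; how the authors' distance deals with this heterogeneity is not stated here.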

Metadata
Title
Detection of crawler traps: formalization and implementation—defeating protection on internet and on the TOR network
Authors
Baptiste David
Maxence Delong
Eric Filiol
Publication date
01 April 2021
Publisher
Springer Paris
Published in
Journal of Computer Virology and Hacking Techniques / Issue 3/2021
Electronic ISSN: 2263-8733
DOI
https://doi.org/10.1007/s11416-021-00380-4