2019 | Original Paper | Book Chapter

Classifying Pastebin Content Through the Generation of PasteCC Labeled Dataset

Authors: Adrián Riesco, Eduardo Fidalgo, Mhd Wesam Al-Nabki, Francisco Jáñez-Martino, Enrique Alegre

Published in: Hybrid Artificial Intelligent Systems

Publisher: Springer International Publishing

Abstract

Online notepad services allow users to upload and share free text anonymously. On Pastebin, one of the most popular of these services, it is possible to find textual content that could be related to illegal activities, such as leaks of personal information or hyperlinks to multimedia files containing child sexual abuse images or videos. An automatic approach to monitoring and detecting these activities in such an active and dynamic environment could help Law Enforcement Agencies fight cybercrime. In this work, we present Pastes Content Classification 17K (PasteCC_17K), a dataset of 17640 textual samples crawled from Pastebin and classified into 15 categories, 6 of which are suspected of being related to illegal activities. We used PasteCC_17K to evaluate two well-known text representation techniques, each combined with three different supervised classifiers, for classifying Pastebin pastes. The best performance is achieved by combining TF-IDF encoding with Logistic Regression, which obtains an accuracy of \(98.63\%\). The proposed model could assist the authorities in detecting suspicious content shared on Pastebin.
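To make the best-performing combination concrete, the following is a minimal sketch of TF-IDF encoding followed by a binary logistic regression classifier, written in plain Python. It is not the authors' implementation: the abstract does not state the preprocessing, the TF-IDF variant, or the hyperparameters, so the smoothed IDF formula, the L2 normalisation, the learning rate, the epoch count, the two-class setup (the paper uses 15 categories), and the toy texts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy pastes; these are NOT samples from PasteCC_17K.
# Label 1 = suspicious, 0 = benign (an assumed two-class simplification).
TRAIN = [
    ("email password dump leaked accounts", 1),
    ("leaked password list email accounts", 1),
    ("python function returns list of values", 0),
    ("java class with main function", 0),
]

def fit_idf(texts):
    """Smoothed inverse document frequency per token (one common variant)."""
    n = len(texts)
    df = Counter(tok for text in texts for tok in set(text.split()))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}

def tfidf(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unknown tokens are skipped."""
    counts = Counter(tok for tok in text.split() if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def predict_proba(vec, weights, bias):
    """Probability of the positive class under the logistic model."""
    z = bias + sum(weights.get(tok, 0.0) * v for tok, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, idf, epochs=300, lr=1.0):
    """Batch gradient descent on the mean log-loss (assumed settings)."""
    weights = {tok: 0.0 for tok in idf}
    bias = 0.0
    vectors = [(tfidf(text, idf), label) for text, label in samples]
    for _ in range(epochs):
        grad_w = defaultdict(float)
        grad_b = 0.0
        for vec, label in vectors:
            error = predict_proba(vec, weights, bias) - label
            grad_b += error
            for tok, v in vec.items():
                grad_w[tok] += error * v
        for tok, g in grad_w.items():
            weights[tok] -= lr * g / len(vectors)
        bias -= lr * grad_b / len(vectors)
    return weights, bias

idf = fit_idf([text for text, _ in TRAIN])
weights, bias = train(TRAIN, idf)

def classify(text):
    """Return 1 if the paste is predicted suspicious, else 0."""
    return int(predict_proba(tfidf(text, idf), weights, bias) > 0.5)

print(classify("leaked email accounts"), classify("java function"))
```

The sparse dictionaries keep the sketch dependency-free; a real multi-class experiment at the scale of PasteCC_17K would more plausibly rely on an established machine learning library.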

Metadata
Title
Classifying Pastebin Content Through the Generation of PasteCC Labeled Dataset
Authors
Adrián Riesco
Eduardo Fidalgo
Mhd Wesam Al-Nabki
Francisco Jáñez-Martino
Enrique Alegre
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-29859-3_39
