Published in: The Journal of Supercomputing 10/2018

16.04.2018

Towards ontology-based multilingual URL filtering: a big data problem

Authors: Mubashar Hussain, Mansoor Ahmed, Hasan Ali Khattak, Muhammad Imran, Abid Khan, Sadia Din, Awais Ahmad, Gwanggil Jeon, Alavalapati Goutham Reddy


Abstract

Web content filtering is one of many techniques for limiting exposure to selected content on the Internet. It has become straightforward over time, yet filtering multilingual web content remains difficult, especially in a big data landscape. The sheer volume of data makes it harder to develop an effective content filtering system that works in real time. Several systems use artificial intelligence techniques to filter URLs and identify sites with objectionable content, but most of them classify only English-language URLs. When they process multilingual URLs, these systems either fail to respond or over-block. This paper introduces a filtering system that classifies multilingual URLs based on predefined criteria for the URL, title, and metadata of a web page. Ontological approaches, together with local multilingual dictionaries, serve as the knowledge base for the challenging task of blocking URLs that do not meet the filtering criteria. The proposed system classifies multilingual URLs into two categories, white and black, with high accuracy. An evaluation conducted on a large dataset shows that it achieves promising accuracy, on a par with the state-of-the-art literature on semantic-based URL filtering.
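As a rough illustration of the idea described in the abstract, the following minimal Python sketch labels a page "white" or "black" by matching tokens from its URL, title, and metadata against per-language term dictionaries. It is not the authors' implementation: the dictionary contents, the function names (`tokenize`, `url_tokens`, `classify`), and the rule that a single matching term is enough to block are all assumptions made for this example, and the ontology layer the paper relies on is not modelled at all.

```python
import re
from urllib.parse import urlsplit, unquote

# Hypothetical per-language dictionaries of blocked terms.
# These placeholder entries are assumptions, not taken from the paper.
BLOCKED_TERMS = {
    "en": {"gambling", "casino"},
    "de": {"glücksspiel"},
    "es": {"apuestas"},
}


def tokenize(text):
    """Lowercase the text and return its set of Unicode letter-only tokens."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def url_tokens(url):
    """Split a URL into host, path, and query, percent-decode, and tokenize."""
    parts = urlsplit(url)
    # Percent-decoding turns UTF-8 escapes such as %C3%BC back into "ü",
    # so non-English path segments can be matched against the dictionaries.
    decoded = unquote(f"{parts.netloc} {parts.path} {parts.query}")
    return tokenize(decoded)


def classify(url, title="", metadata=""):
    """Return "black" if any dictionary term occurs in the URL, title,
    or metadata tokens; otherwise return "white"."""
    tokens = url_tokens(url) | tokenize(title) | tokenize(metadata)
    for terms in BLOCKED_TERMS.values():
        if tokens & terms:
            return "black"
    return "white"


if __name__ == "__main__":
    print(classify("https://example.org/news/weather", title="Weather today"))
    print(classify("https://example.org/gl%C3%BCcksspiel/start"))
```

Exact token matching of this kind is deliberately simple. It would over-block pages that merely mention a listed term, which is the sort of weakness the semantic, ontology-based approach described above is meant to reduce.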


Footnotes
2
All tests were carried out using Java SE and the Protégé 4.3 ontology editor. The system was a Core i7, 2.2-GHz laptop with 8 GB RAM and a 1 Mb internet connection.
 
Metadata
Title
Towards ontology-based multilingual URL filtering: a big data problem
Authors
Mubashar Hussain
Mansoor Ahmed
Hasan Ali Khattak
Muhammad Imran
Abid Khan
Sadia Din
Awais Ahmad
Gwanggil Jeon
Alavalapati Goutham Reddy
Publication date
16.04.2018
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 10/2018
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-018-2338-1
