
2019 | OriginalPaper | Chapter

Open-Set Web Genre Identification Using Distributional Features and Nearest Neighbors Distance Ratio

Authors: Dimitrios Pritsos, Anderson Rocha, Efstathios Stamatatos

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing


Abstract

Web genre identification can boost information retrieval systems by providing rich descriptions of documents and enabling more specialized queries. The open-set scenario is more realistic for this task, as web genres evolve over time and it is not feasible to define a universally agreed genre palette. In this work, we introduce a novel approach to web genre identification based on distributional features acquired by doc2vec and a recently proposed open-set classification algorithm, the nearest neighbors distance ratio classifier. We report experimental results using a benchmark corpus and a strong baseline, and demonstrate that the proposed approach is highly competitive, especially when emphasis is placed on precision.
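
The open-set decision rule named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Euclidean distance, a hypothetical threshold of 0.8 (in practice this value would be tuned on held-out data), and small hand-written 2-D vectors standing in for doc2vec document embeddings. The function name `nndr_predict` and the genre labels are likewise illustrative. A query is assigned the class of its nearest neighbor only if that neighbor is sufficiently closer than the nearest sample of any other class; otherwise the query is rejected as unknown.

```python
import numpy as np


def nndr_predict(X_train, y_train, x, threshold=0.8, unknown="unknown"):
    """Open-set prediction with a nearest neighbors distance ratio rule.

    threshold and unknown are illustrative assumptions, not values
    taken from the chapter.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from the query to every training vector.
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    # Nearest training sample t and its class label.
    t = int(np.argmin(dists))
    label_t = y_train[t]
    # Nearest training sample u whose class differs from that of t.
    other = y_train != label_t
    if not other.any():
        return unknown  # the ratio is undefined with a single known class
    d_t = dists[t]
    d_u = dists[other].min()
    ratio = d_t / d_u if d_u > 0 else float("inf")
    # Accept the class of t only if t is clearly closer than u.
    return str(label_t) if ratio <= threshold else unknown


# Toy stand-ins for document embeddings of two known genres.
X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y_train = ["blog", "blog", "eshop", "eshop"]

print(nndr_predict(X_train, y_train, [0.2, 0.1]))  # close to the "blog" samples
print(nndr_predict(X_train, y_train, [2.5, 3.1]))  # roughly between both genres
```

Lowering the threshold makes the rule reject more queries as unknown, which trades recall for precision on the known genres.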


Metadata
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-15719-7_1