Distance Metrics in Open-Set Classification of Text Documents by Local Outlier Factor and Doc2Vec

  • Conference paper
Advances and Trends in Artificial Intelligence. From Theory to Practice (IEA/AIE 2019)

Abstract

In this paper, we investigate how the choice of distance metric influences the results of open-set subject classification of text documents. We use the Local Outlier Factor (LOF) algorithm to extend a closed-set classifier (here, a multilayer perceptron) with an additional class that identifies outliers. The analyzed documents are represented by averaged word embeddings, computed with the fastText method on the training data. In experiments on two different text corpora, we show how the distance metric chosen for LOF (Euclidean or cosine) and a transformation of the feature space (the vector representation of documents) both influence the open-set classification results. Overall, the results suggest that the cosine distance outperforms the Euclidean distance for open-set classification of text documents.
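The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synthetic Gaussian vectors stand in for averaged fastText embeddings, and the `OUTLIER` label, the helper names `fit_open_set` and `predict_open_set`, the network size, and `n_neighbors=20` are all hypothetical choices. It also assumes scikit-learn 0.20 or later, where `LocalOutlierFactor` accepts `novelty=True` for scoring unseen samples.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.neural_network import MLPClassifier

OUTLIER = -1  # hypothetical label for the additional "unknown" class


def fit_open_set(X_train, y_train, metric="cosine", n_neighbors=20):
    """Fit a closed-set classifier and a LOF novelty detector on the same data."""
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    # novelty=True lets LOF score unseen samples; the metric is the setting
    # the abstract compares ("euclidean" versus "cosine").
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, metric=metric, novelty=True)
    lof.fit(X_train)
    return clf, lof


def predict_open_set(clf, lof, X):
    """Return the classifier's label for inliers and OUTLIER for the rest."""
    labels = clf.predict(X)
    is_inlier = lof.predict(X) == 1  # LOF returns +1 (inlier) or -1 (outlier)
    return np.where(is_inlier, labels, OUTLIER)


# Synthetic stand-in for document vectors (an assumption): two known classes
# clustered around different directions in a 10-dimensional space.
rng = np.random.default_rng(0)
dim = 10
centers = 5.0 * np.eye(dim)[:2]
X_train = np.vstack([c + 0.3 * rng.standard_normal((100, dim)) for c in centers])
y_train = np.repeat([0, 1], 100)

# One sample near class 0 and one pointing away from both known classes.
X_test = np.vstack([centers[0] + 0.1, -(centers[0] + centers[1])])

for metric in ("euclidean", "cosine"):
    clf, lof = fit_open_set(X_train, y_train, metric=metric)
    print(metric, predict_open_set(clf, lof, X_test))
```

Keeping the classifier and the outlier detector separate means the distance metric can be swapped without retraining the multilayer perceptron, which is what makes a metric comparison of this kind straightforward to set up.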

Notes

  1. https://scikit-learn.org/0.19/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

  2. http://qwone.com/~jason/20Newsgroups/

  3. https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

Acknowledgement

This work was sponsored by the National Science Centre, Poland (grant 2016/21/B/ST6/02159).

Author information

Correspondence to Tomasz Walkowiak.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Walkowiak, T., Datko, S., Maciejewski, H. (2019). Distance Metrics in Open-Set Classification of Text Documents by Local Outlier Factor and Doc2Vec. In: Wotawa, F., Friedrich, G., Pill, I., Koitz-Hristov, R., Ali, M. (eds) Advances and Trends in Artificial Intelligence. From Theory to Practice. IEA/AIE 2019. Lecture Notes in Computer Science, vol 11606. Springer, Cham. https://doi.org/10.1007/978-3-030-22999-3_10

  • DOI: https://doi.org/10.1007/978-3-030-22999-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-22998-6

  • Online ISBN: 978-3-030-22999-3

  • eBook Packages: Computer Science, Computer Science (R0)
