Published in: Knowledge and Information Systems 8/2020

26-03-2020 | Regular Paper

Simple supervised dissimilarity measure: Bolstering iForest-induced similarity with class information without learning

Authors: Jonathan R. Wells, Sunil Aryal, Kai Ming Ting


Abstract

Existing distance metric learning methods require optimisation to learn a feature space in which to transform data, which makes them computationally expensive on large datasets. In classification tasks, they use class information to learn an appropriate feature space. In this paper, we present a simple supervised dissimilarity measure that requires neither learning nor optimisation. It uses class information to measure the dissimilarity of two data instances directly in the input space. It is a supervised version of an existing data-dependent dissimilarity measure called \(m_\mathrm{e}\). Our empirical results in k-NN and LVQ classification tasks show that the proposed simple supervised dissimilarity measure generally produces predictive accuracy better than, or at least as good as, existing state-of-the-art supervised and unsupervised dissimilarity measures.
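
The construction of the proposed supervised measure is given in the full text. As background only, the following is a minimal sketch of the unsupervised mass-based dissimilarity \(m_\mathrm{e}\) [1, 3, 20] that the proposal builds on, implemented over completely random (iForest-style) trees. The function and variable names (`build_itree`, `m_e`) are ours, chosen for illustration; the height limit, subsampling, and edge-case handling of the actual implementations may differ.

```python
import numpy as np

def build_itree(X, height_limit, rng):
    """Grow one completely random (iForest-style) tree over the rows of X.
    Each node records its mass: the number of training points it covers."""
    def grow(idx, depth):
        node = {"mass": len(idx)}
        if depth >= height_limit or len(idx) <= 1:
            return node
        q = rng.integers(X.shape[1])          # random attribute
        col = X[idx, q]
        lo, hi = col.min(), col.max()
        if lo == hi:                          # cannot split a constant column
            return node
        p = rng.uniform(lo, hi)               # random split value
        node.update(q=q, p=p,
                    left=grow(idx[col < p], depth + 1),
                    right=grow(idx[col >= p], depth + 1))
        return node
    return grow(np.arange(len(X)), 0)

def m_e(x, y, trees, n):
    """Mass-based dissimilarity: the relative mass of the deepest node
    that still contains both x and y, averaged over the trees."""
    total = 0.0
    for tree in trees:
        node = tree
        while "q" in node:
            bx, by = x[node["q"]] < node["p"], y[node["q"]] < node["p"]
            if bx != by:                      # x and y part ways at this split
                break
            node = node["left"] if bx else node["right"]
        total += node["mass"] / n             # n = size of the (sub)sample
    return total / len(trees)
```

For example, with `rng = np.random.default_rng(0)` and `trees = [build_itree(X, 8, rng) for _ in range(100)]`, `m_e(x, y, trees, len(X))` lies in \((0, 1]\) and is small for pairs that stay together deep into the trees, i.e. pairs covered only by low-mass regions.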


Footnotes
1
Used to explain the behaviour of Random Forest, RF similarity aims to track the sign of the margin of x (defined as \(P(+1|x)-P(-1|x)\), where \(+1\) and \(-1\) are the two class labels) [4]. In contrast, iForest-based similarity aims to measure the similarity of two points such that two points in a sparse region are more similar than two points of the same inter-point distance in a dense region [22].
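
As a concrete reading of this contrast, here is a minimal sketch (our names, reusing the hypothetical tree structure from the sketch above) of an iForest-induced similarity in the spirit of the isolation kernel [22]: the fraction of random-tree partitionings in which two points fall into the same leaf cell. Since leaf cells tend to be larger in sparse regions, two points there share a cell more often than two equally distant points in a dense region.

```python
def isolation_similarity(x, y, trees):
    """Isolation-kernel-style similarity [22]: the fraction of
    random-tree partitionings in which x and y share a leaf cell."""
    same = 0
    for tree in trees:
        node = tree
        while "q" in node:
            bx, by = x[node["q"]] < node["p"], y[node["q"]] < node["p"]
            if bx != by:
                break                          # separated by this split
            node = node["left"] if bx else node["right"]
        else:
            same += 1                          # reached a common leaf cell
    return same / len(trees)
```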
 
2
Path length was used by iForest [14] as the anomaly score for the purpose of anomaly detection; path length is a proxy to mass in mass estimation (see Section 4 in Ting et al. [19]). Mass-based dissimilarity [20], mentioned earlier, is an extension of mass estimation implemented using completely random trees such as those in iForest. Though based on RF, the path-length-based similarity of [28] can be viewed as a variant of mass-based dissimilarity implemented using classification trees rather than completely random trees.
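
To make the path-length connection concrete, here is a sketch of iForest's scoring [14] over the same hypothetical tree structure as above; \(c(m)\) is the standard average-path-length adjustment and `psi` denotes the subsample size per tree. This is an illustrative reading of the published formulas, not the authors' code.

```python
import math

EULER_GAMMA = 0.5772156649

def c(m):
    """iForest's average-path-length adjustment for an external node
    covering m training points: c(m) = 2 H(m-1) - 2(m-1)/m [14]."""
    if m <= 1:
        return 0.0
    if m == 2:
        return 1.0
    return 2.0 * (math.log(m - 1) + EULER_GAMMA) - 2.0 * (m - 1) / m

def path_length(x, node, depth=0):
    """Depth at which x reaches an external node, plus c(mass) to
    account for the subtree truncated by the height limit."""
    if "q" not in node:
        return depth + c(node["mass"])
    nxt = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, nxt, depth + 1)

def anomaly_score(x, trees, psi):
    """s(x) = 2^(-E[h(x)] / c(psi)): shorter average paths (easier
    isolation, i.e. lower-mass regions) push the score towards 1."""
    eh = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-eh / c(psi))
```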
 
Literature
1. Aryal S (2017) A data-dependent dissimilarity measure: an effective alternative to distance measures. PhD thesis, Monash University, Clayton
2. Aryal S, Ting KM, Haffari G, Washio T (2014) \(m_p\)-dissimilarity: a data dependent dissimilarity measure. In: Proceedings of the IEEE international conference on data mining, IEEE, pp 707–712
3. Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506
4. Breiman L (2000) Some infinity theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley
5. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
6. Davies A, Ghahramani Z (2014) The random forest kernel and creating other kernels for big data from random partitions. arXiv:1402.4293
7. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
10. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor Newsl 11(1):10–18
11. Kohonen T (1995) Learning vector quantization. Springer, Berlin, pp 175–189
12. Krumhansl CL (1978) Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463
14. Liu F, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the eighth IEEE international conference on data mining, pp 413–422
15. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297
16. Nebel D, Hammer B, Frohberg K, Villmann T (2015) Median variants of learning vector quantization for learning of dissimilarity data. Neurocomputing 169:295–305
17. Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
20. Ting KM, Zhu Y, Carman M, Zhu Y, Washio T, Zhou Z-H (2019) Lowest probability mass neighbour algorithms: relaxing the metric constraint in distance-based neighbourhood algorithms. Mach Learn 108(2):331–376
21. Ting KM, Zhu Y, Carman M, Zhu Y, Zhou Z-H (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1205–1214
22. Ting KM, Zhu Y, Zhou Z-H (2018) Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 2329–2337
24. Wang F, Sun J (2015) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Discov 29(2):534–564
25. Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
26. Yang L (2006) Distance metric learning: a comprehensive survey. Technical report, Michigan State University
27. Zadeh PH, Hosseini R, Sra S (2016) Geometric mean metric learning. In: Proceedings of the 33rd international conference on machine learning, vol 48, pp 2464–2471
28. Zhu X, Loy CC, Gong S (2014) Constructing robust affinity graphs for spectral clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1450–1457
Metadata
Title
Simple supervised dissimilarity measure: Bolstering iForest-induced similarity with class information without learning
Authors
Jonathan R. Wells
Sunil Aryal
Kai Ming Ting
Publication date
26-03-2020
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 8/2020
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-020-01454-3
