Published in: Knowledge and Information Systems 8/2020

26-03-2020 | Regular Paper

Simple supervised dissimilarity measure: Bolstering iForest-induced similarity with class information without learning

Authors: Jonathan R. Wells, Sunil Aryal, Kai Ming Ting


Abstract

Existing distance metric learning methods require optimisation to learn a feature space in which to transform data, which makes them computationally expensive on large datasets. In classification tasks, they use class information to learn an appropriate feature space. In this paper, we present a simple supervised dissimilarity measure that requires neither learning nor optimisation. It uses class information to measure the dissimilarity of two data instances directly in the input space. It is a supervised version of an existing data-dependent dissimilarity measure called \(m_\mathrm{e}\). Our empirical results in k-NN and LVQ classification tasks show that the proposed simple supervised dissimilarity measure generally produces predictive accuracy better than, or at least as good as, existing state-of-the-art supervised and unsupervised dissimilarity measures.
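
The construction of the proposed supervised measure is given in the full text. As background only, the following is a minimal sketch of the unsupervised mass-based dissimilarity \(m_\mathrm{e}\) [1, 3, 20] that the proposal builds on, implemented over completely random (iForest-style) trees. The function and variable names (`build_itree`, `m_e`) are ours, chosen for illustration; the height limit, subsampling, and edge-case handling of the actual implementations may differ.

```python
import numpy as np

def build_itree(X, height_limit, rng):
    """Grow one completely random (iForest-style) tree over the rows of X.
    Each node records its mass: the number of training points it covers."""
    def grow(idx, depth):
        node = {"mass": len(idx)}
        if depth >= height_limit or len(idx) <= 1:
            return node
        q = rng.integers(X.shape[1])          # random attribute
        col = X[idx, q]
        lo, hi = col.min(), col.max()
        if lo == hi:                          # cannot split a constant column
            return node
        p = rng.uniform(lo, hi)               # random split value
        node.update(q=q, p=p,
                    left=grow(idx[col < p], depth + 1),
                    right=grow(idx[col >= p], depth + 1))
        return node
    return grow(np.arange(len(X)), 0)

def m_e(x, y, trees, n):
    """Mass-based dissimilarity: the relative mass of the deepest node
    that still contains both x and y, averaged over the trees."""
    total = 0.0
    for tree in trees:
        node = tree
        while "q" in node:
            bx, by = x[node["q"]] < node["p"], y[node["q"]] < node["p"]
            if bx != by:                      # x and y part ways at this split
                break
            node = node["left"] if bx else node["right"]
        total += node["mass"] / n             # n = size of the (sub)sample
    return total / len(trees)
```

For example, with `rng = np.random.default_rng(0)` and `trees = [build_itree(X, 8, rng) for _ in range(100)]`, `m_e(x, y, trees, len(X))` lies in \((0, 1]\) and is small for pairs that stay together deep into the trees, i.e. pairs covered only by low-mass regions.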


Footnotes
1
Used to explain the behaviour of Random Forest, RF similarity aims to track the sign of the margin of x (defined as \(P(+1|x)-P(-1|x)\), where \(+1\) and \(-1\) are the two class labels) [4]. In contrast, iForest-based similarity aims to measure the similarity of two points such that two points in a sparse region are more similar than two points of the same inter-point distance in a dense region [22].
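
As a concrete reading of this contrast, here is a minimal sketch (our names, reusing the hypothetical tree structure from the sketch above) of an iForest-induced similarity in the spirit of the isolation kernel [22]: the fraction of random-tree partitionings in which two points fall into the same leaf cell. Since leaf cells tend to be larger in sparse regions, two points there share a cell more often than two equally distant points in a dense region.

```python
def isolation_similarity(x, y, trees):
    """Isolation-kernel-style similarity [22]: the fraction of
    random-tree partitionings in which x and y share a leaf cell."""
    same = 0
    for tree in trees:
        node = tree
        while "q" in node:
            bx, by = x[node["q"]] < node["p"], y[node["q"]] < node["p"]
            if bx != by:
                break                          # separated by this split
            node = node["left"] if bx else node["right"]
        else:
            same += 1                          # reached a common leaf cell
    return same / len(trees)
```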
 
2
Path length was used by iForest [14] as the anomaly score for the purpose of anomaly detection; path length is a proxy to mass in mass estimation (see Section 4 in Ting et al. [19]). Mass-based dissimilarity [20], mentioned earlier, is an extension of mass estimation implemented using completely random trees such as those in iForest. Though based on RF, the path-length-based similarity of [28] can be viewed as a variant of mass-based dissimilarity implemented using classification trees rather than completely random trees.
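
To make the path-length connection concrete, here is a sketch of iForest's scoring [14] over the same hypothetical tree structure as above; \(c(m)\) is the standard average-path-length adjustment and `psi` denotes the subsample size per tree. This is an illustrative reading of the published formulas, not the authors' code.

```python
import math

EULER_GAMMA = 0.5772156649

def c(m):
    """iForest's average-path-length adjustment for an external node
    covering m training points: c(m) = 2 H(m-1) - 2(m-1)/m [14]."""
    if m <= 1:
        return 0.0
    if m == 2:
        return 1.0
    return 2.0 * (math.log(m - 1) + EULER_GAMMA) - 2.0 * (m - 1) / m

def path_length(x, node, depth=0):
    """Depth at which x reaches an external node, plus c(mass) to
    account for the subtree truncated by the height limit."""
    if "q" not in node:
        return depth + c(node["mass"])
    nxt = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, nxt, depth + 1)

def anomaly_score(x, trees, psi):
    """s(x) = 2^(-E[h(x)] / c(psi)): shorter average paths (easier
    isolation, i.e. lower-mass regions) push the score towards 1."""
    eh = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-eh / c(psi))
```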
 
Literature
1. Aryal S (2017) A data-dependent dissimilarity measure: an effective alternative to distance measures. PhD thesis, Monash University, Clayton
2. Aryal S, Ting KM, Haffari G, Washio T (2014) \(m_p\)-dissimilarity: a data dependent dissimilarity measure. In: Proceedings of the IEEE international conference on data mining, IEEE, pp 707–712
3. Aryal S, Ting KM, Washio T, Haffari G (2017) Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl Inf Syst 53(2):479–506
4. Breiman L (2000) Some infinity theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley
5. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
6. Davies A, Ghahramani Z (2014) The random forest kernel and creating other kernels for big data from random partitions. arXiv:1402.4293
7. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
10. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor Newsl 11(1):10–18
11. Kohonen T (1995) Learning vector quantization. Springer, Berlin, pp 175–189
12. Krumhansl CL (1978) Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463
14. Liu F, Ting KM, Zhou Z-H (2008) Isolation forest. In: Proceedings of the eighth IEEE international conference on data mining, pp 413–422
15. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297
16. Nebel D, Hammer B, Frohberg K, Villmann T (2015) Median variants of learning vector quantization for learning of dissimilarity data. Neurocomputing 169:295–305
17. Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
20. Ting KM, Zhu Y, Carman M, Zhu Y, Washio T, Zhou Z-H (2019) Lowest probability mass neighbour algorithms: relaxing the metric constraint in distance-based neighbourhood algorithms. Mach Learn 108(2):331–376
21. Ting KM, Zhu Y, Carman M, Zhu Y, Zhou Z-H (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1205–1214
22. Ting KM, Zhu Y, Zhou Z-H (2018) Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 2329–2337
24. Wang F, Sun J (2015) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Discov 29(2):534–564
25. Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
26. Yang L (2006) Distance metric learning: a comprehensive survey. Technical report, Michigan State University
27. Zadeh PH, Hosseini R, Sra S (2016) Geometric mean metric learning. In: Proceedings of the 33rd international conference on machine learning, vol 48, pp 2464–2471
28. Zhu X, Loy CC, Gong S (2014) Constructing robust affinity graphs for spectral clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1450–1457
Metadata
Title
Simple supervised dissimilarity measure: Bolstering iForest-induced similarity with class information without learning
Authors
Jonathan R. Wells
Sunil Aryal
Kai Ming Ting
Publication date
26-03-2020
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 8/2020
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-020-01454-3
