Published in: International Journal of Data Science and Analytics 4/2020

31.03.2020 | Regular Paper

Unsupervised extra trees: a stochastic approach to compute similarities in heterogeneous data

Authors: Kevin Dalleau, Miguel Couceiro, Malika Smail-Tabbone

Abstract

In this paper, we present a method to compute similarities on unlabelled data, based on extremely randomised trees. The main idea of our method, unsupervised extremely randomised trees (UET), is to randomly split the data in an iterative fashion until a stopping criterion is met, and to compute a similarity based on the co-occurrence of instances in the leaves of each generated tree. We evaluate our method on synthetic and real-world datasets by comparing the mean similarities between instances with the same label and the mean similarities between instances with different labels. These metrics are analogous to intracluster and intercluster similarities, and they are used to assess the computed similarities instead of a clustering algorithm’s results. Our empirical study shows that the method effectively gives distinct similarity values between instances belonging to different clusters and gives indiscernible values when there is no cluster structure. We also assess some interesting properties such as invariance under monotonic transformations of variables and robustness to correlated variables and noise. Finally, we perform hierarchical agglomerative clustering on synthetic and both real-world homogeneous and heterogeneous datasets and compare the results obtained by using UET and by using standard similarity measures. Our experiments show that our approach based on UET outperforms existing methods in some cases and reduces the amount of preprocessing typically needed when dealing with real-world datasets.
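To make the idea in the abstract concrete, here is a minimal sketch of the UET-style similarity: each tree is grown by picking a random feature and a random threshold at every node until leaves are small enough, and the similarity of two instances is the fraction of trees in which they fall into the same leaf. This is an illustrative reconstruction, not the authors' implementation; the parameters `min_leaf`, `max_depth`, and `n_trees` are arbitrary choices for the example.

```python
import numpy as np

def random_tree_leaves(X, rng, min_leaf=3, max_depth=8):
    """Assign each row of X a leaf id via fully random recursive splits
    (one unsupervised extremely randomised tree, sketched)."""
    leaves = np.zeros(len(X), dtype=int)
    counter = [0]

    def recurse(idx, depth):
        if depth >= max_depth or len(idx) <= min_leaf:
            leaves[idx] = counter[0]          # stopping criterion met: make a leaf
            counter[0] += 1
            return
        f = rng.integers(X.shape[1])          # random feature
        col = X[idx, f]
        lo, hi = col.min(), col.max()
        if lo == hi:                          # constant feature on this node: leaf
            leaves[idx] = counter[0]
            counter[0] += 1
            return
        t = rng.uniform(lo, hi)               # random split point
        left, right = idx[col <= t], idx[col > t]
        if len(left) == 0 or len(right) == 0:
            leaves[idx] = counter[0]
            counter[0] += 1
            return
        recurse(left, depth + 1)
        recurse(right, depth + 1)

    recurse(np.arange(len(X)), 0)
    return leaves

def uet_similarity(X, n_trees=50, seed=0):
    """Similarity = fraction of trees in which two instances share a leaf."""
    rng = np.random.default_rng(seed)
    S = np.zeros((len(X), len(X)))
    for _ in range(n_trees):
        leaves = random_tree_leaves(X, rng)
        S += (leaves[:, None] == leaves[None, :]).astype(float)
    return S / n_trees
```

On data with a clear cluster structure, the mean similarity within a cluster should exceed the mean similarity between clusters, which is exactly the evaluation criterion the abstract describes.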


Metadata
Title
Unsupervised extra trees: a stochastic approach to compute similarities in heterogeneous data
Authors
Kevin Dalleau
Miguel Couceiro
Malika Smail-Tabbone
Publication date
31.03.2020
Publisher
Springer International Publishing
Published in
International Journal of Data Science and Analytics / Issue 4/2020
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI
https://doi.org/10.1007/s41060-020-00214-4
