Published in: International Journal of Data Science and Analytics 4/2020

31.03.2020 | Regular Paper

Unsupervised extra trees: a stochastic approach to compute similarities in heterogeneous data

Authors: Kevin Dalleau, Miguel Couceiro, Malika Smail-Tabbone

Abstract

In this paper, we present a method to compute similarities on unlabelled data, based on extremely randomised trees. The main idea of our method, unsupervised extremely randomised trees (UET), is to randomly split the data in an iterative fashion until a stopping criterion is met, and to compute a similarity based on the co-occurrence of instances in the leaves of each generated tree. We evaluate our method on synthetic and real-world datasets by comparing the mean similarities between instances with the same label and the mean similarities between instances with different labels. These metrics are analogous to intracluster and intercluster similarities, and they are used to assess the computed similarities instead of a clustering algorithm’s results. Our empirical study shows that the method effectively gives distinct similarity values between instances belonging to different clusters and gives indiscernible values when there is no cluster structure. We also assess some interesting properties such as invariance under monotonic transformations of variables and robustness to correlated variables and noise. Finally, we perform hierarchical agglomerative clustering on synthetic and both real-world homogeneous and heterogeneous datasets and compare the results obtained by using UET and by using standard similarity measures. Our experiments show that our approach based on UET outperforms existing methods in some cases and reduces the amount of preprocessing typically needed when dealing with real-world datasets.
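To make the idea in the abstract concrete, here is a minimal sketch of the UET-style similarity: each tree is grown by picking a random feature and a random threshold at every node until leaves are small enough, and the similarity of two instances is the fraction of trees in which they fall into the same leaf. This is an illustrative reconstruction, not the authors' implementation; the parameters `min_leaf`, `max_depth`, and `n_trees` are arbitrary choices for the example.

```python
import numpy as np

def random_tree_leaves(X, rng, min_leaf=3, max_depth=8):
    """Assign each row of X a leaf id via fully random recursive splits
    (one unsupervised extremely randomised tree, sketched)."""
    leaves = np.zeros(len(X), dtype=int)
    counter = [0]

    def recurse(idx, depth):
        if depth >= max_depth or len(idx) <= min_leaf:
            leaves[idx] = counter[0]          # stopping criterion met: make a leaf
            counter[0] += 1
            return
        f = rng.integers(X.shape[1])          # random feature
        col = X[idx, f]
        lo, hi = col.min(), col.max()
        if lo == hi:                          # constant feature on this node: leaf
            leaves[idx] = counter[0]
            counter[0] += 1
            return
        t = rng.uniform(lo, hi)               # random split point
        left, right = idx[col <= t], idx[col > t]
        if len(left) == 0 or len(right) == 0:
            leaves[idx] = counter[0]
            counter[0] += 1
            return
        recurse(left, depth + 1)
        recurse(right, depth + 1)

    recurse(np.arange(len(X)), 0)
    return leaves

def uet_similarity(X, n_trees=50, seed=0):
    """Similarity = fraction of trees in which two instances share a leaf."""
    rng = np.random.default_rng(seed)
    S = np.zeros((len(X), len(X)))
    for _ in range(n_trees):
        leaves = random_tree_leaves(X, rng)
        S += (leaves[:, None] == leaves[None, :]).astype(float)
    return S / n_trees
```

On data with a clear cluster structure, the mean similarity within a cluster should exceed the mean similarity between clusters, which is exactly the evaluation criterion the abstract describes.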


Metadata
Title
Unsupervised extra trees: a stochastic approach to compute similarities in heterogeneous data
Authors
Kevin Dalleau
Miguel Couceiro
Malika Smail-Tabbone
Publication date
31.03.2020
Publisher
Springer International Publishing
Published in
International Journal of Data Science and Analytics / Issue 4/2020
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI
https://doi.org/10.1007/s41060-020-00214-4
