Published in: Soft Computing 14/2020

4 December 2019 | Methodologies and Application

Multi-view heterogeneous fusion and embedding for categorical attributes on mixed data

Authors: Qiude Li, Qingyu Xiong, Shengfen Ji, Min Gao, Yang Yu, Chao Wu


Abstract

Categorical attributes are ubiquitous in real-world data. However, such attributes lack a well-defined distance metric and cannot be manipulated directly with algebraic operations, so many data mining algorithms cannot work on them directly. Learning an appropriate metric or an effective numerical embedding is vital yet challenging for categorical attributes with multi-view heterogeneous data characteristics. This paper proposes a novel multi-view heterogeneous fusion model (MVHF) that first captures basic coupling information for each view and then fuses the heterogeneous information from the different views by multi-kernel metric learning, in order to measure the intrinsic distances between the values of such categorical attributes. Based on these measured distances, a manifold learning method is then used to learn a high-quality numerical embedding for each categorical value. Experiments on 33 mixed data sets demonstrate that MVHF-enabled classification significantly improves performance compared with state-of-the-art distance metrics and embedding competitors.
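
The pipeline outlined in the abstract (per-view coupling information, kernel fusion, embedding from distances) can be illustrated with a small sketch. This is not the authors' MVHF implementation, and every name in it is hypothetical. It makes three assumptions: each view is the conditional distribution of a context variable given the categorical value, the kernel weights are fixed by hand instead of being learned by multi-kernel metric learning, and classical multidimensional scaling stands in for the manifold learning step.

```python
import numpy as np

def conditional_profile(values, context):
    """Rows: categories of `values`; columns: P(context = c | value = v)."""
    cats, v_idx = np.unique(values, return_inverse=True)
    ctxs, c_idx = np.unique(context, return_inverse=True)
    counts = np.zeros((len(cats), len(ctxs)))
    np.add.at(counts, (v_idx, c_idx), 1.0)  # co-occurrence counts
    return cats, counts / counts.sum(axis=1, keepdims=True)

def view_distance(profile):
    """Pairwise Euclidean distance between category profiles in one view."""
    diff = profile[:, None, :] - profile[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gaussian_kernel(dist, gamma=1.0):
    """Turn a distance matrix into a Gaussian kernel matrix."""
    return np.exp(-gamma * dist ** 2)

def fuse_kernels(kernels, weights):
    """Convex combination of view kernels (weights are an assumption here,
    a learned multi-kernel metric would replace them)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * k for wi, k in zip(w, kernels))

def kernel_to_distance(kernel):
    """Distance induced by a kernel: d_ij^2 = k_ii + k_jj - 2 k_ij."""
    diag = np.diag(kernel)
    sq = diag[:, None] + diag[None, :] - 2.0 * kernel
    return np.sqrt(np.clip(sq, 0.0, None))

def classical_mds(dist, dim=2):
    """Classical MDS: eigendecomposition of the double-centered squared distances."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dim]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

# Toy mixed data: embed the values of `color` using two views.
color = np.array(["red", "red", "blue", "blue", "green", "green", "red", "blue"])
shape = np.array(["sq", "sq", "ci", "ci", "ci", "sq", "sq", "ci"])
label = np.array([1, 1, 0, 0, 0, 1, 1, 0])

cats, by_label = conditional_profile(color, label)  # view 1: class coupling
_, by_shape = conditional_profile(color, shape)     # view 2: attribute coupling

kernels = [gaussian_kernel(view_distance(p)) for p in (by_label, by_shape)]
fused_dist = kernel_to_distance(fuse_kernels(kernels, weights=[0.5, 0.5]))
embedding = classical_mds(fused_dist, dim=2)        # one row per categorical value
```

Each row of `embedding` is a numerical vector for one value of `color` (ordered as in `cats`), which could then be concatenated with the numerical attributes of a mixed data set before training a standard classifier.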

Metadata
Title
Multi-view heterogeneous fusion and embedding for categorical attributes on mixed data
Authors
Qiude Li
Qingyu Xiong
Shengfen Ji
Min Gao
Yang Yu
Chao Wu
Publication date
4 December 2019
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 14/2020
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-019-04586-z
