Skip to main content
Top
Published in: Data Mining and Knowledge Discovery 4/2019

27-03-2019

A new class of metrics for learning on real-valued and structured data

Authors: Ruiyu Yang, Yuxiang Jiang, Scott Mathews, Elizabeth A. Housworth, Matthew W. Hahn, Predrag Radivojac

Published in: Data Mining and Knowledge Discovery | Issue 4/2019

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with f-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct empirical investigation to demonstrate the effectiveness on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski (\(L^p\)) family, the fractional \(L^p\) family, two f-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Appendix
Available only for authorised users
Footnotes
1
K-means algorithm aims to group the data so as to minimize the sum-of-squared-errors objective; i.e., the sum of squared Euclidean distances between data points and their respective centroids (Tan et al. 2006).
 
Literature
go back to reference Aggarwal CC et al (2001) On the surprising behavior of distance metrics in high dimensional space. Proc Int Conf Database Theory (ICDT) 2001:420–434MATH Aggarwal CC et al (2001) On the surprising behavior of distance metrics in high dimensional space. Proc Int Conf Database Theory (ICDT) 2001:420–434MATH
go back to reference Ashburner M et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29CrossRef Ashburner M et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29CrossRef
go back to reference Bairoch A et al (2005) The universal protein resource (UniProt). Nucleic Acids Res 33(Databse issue):D154–D159 Bairoch A et al (2005) The universal protein resource (UniProt). Nucleic Acids Res 33(Databse issue):D154–D159
go back to reference Baraty S et al (2011) The impact of triangular inequality violations on medoid-based clustering. Proc Int Symp Methodol Intell Syst (ISMIS) 2011:280–289 Baraty S et al (2011) The impact of triangular inequality violations on medoid-based clustering. Proc Int Symp Methodol Intell Syst (ISMIS) 2011:280–289
go back to reference Ben-David S, Ackerman M (2009) Measures of clustering quality: a working set of axioms for clustering. Adv Neural Inf Process Syst (NIPS) 2009:121–128 Ben-David S, Ackerman M (2009) Measures of clustering quality: a working set of axioms for clustering. Adv Neural Inf Process Syst (NIPS) 2009:121–128
go back to reference Beyer K et al (1999) When is “nearest neighbor” meaningful? Proc Int Conf Database Theory (ICDT) 1999:217–235 Beyer K et al (1999) When is “nearest neighbor” meaningful? Proc Int Conf Database Theory (ICDT) 1999:217–235
go back to reference Bilenko M et al (2004) Integrating constraints and metric learning in semi-supervised clustering. Proc Int Conf Mach Learn (ICML) 2004:81–88 Bilenko M et al (2004) Integrating constraints and metric learning in semi-supervised clustering. Proc Int Conf Mach Learn (ICML) 2004:81–88
go back to reference Cao M et al (2013) Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS ONE 8(10):e76339CrossRef Cao M et al (2013) Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS ONE 8(10):e76339CrossRef
go back to reference Cardoso-Cachopo A (2007) Improving methods for single-label text categorization. Ph.D. thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa Cardoso-Cachopo A (2007) Improving methods for single-label text categorization. Ph.D. thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa
go back to reference Clark WT, Radivojac P (2013) Information-theoretic evaluation of predicted ontological annotations. Bioinformatics 29(13):i53–i61CrossRef Clark WT, Radivojac P (2013) Information-theoretic evaluation of predicted ontological annotations. Bioinformatics 29(13):i53–i61CrossRef
go back to reference Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27MATHCrossRef Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27MATHCrossRef
go back to reference Cover TM, Thomas JA (2006) Elements of information theory. Wiley, HobokenMATH Cover TM, Thomas JA (2006) Elements of information theory. Wiley, HobokenMATH
go back to reference Csiszár I (1967) Information-type measure of difference of probability distributions and indirect observations. Studia Sci Math Hungar 2:299–318MathSciNetMATH Csiszár I (1967) Information-type measure of difference of probability distributions and indirect observations. Studia Sci Math Hungar 2:299–318MathSciNetMATH
go back to reference Dalkilic MM et al (2006) Using compression to identify classes of inauthentic papers. Proc SIAM Int Conf Data Min (SDM) 2006:604–608MathSciNet Dalkilic MM et al (2006) Using compression to identify classes of inauthentic papers. Proc SIAM Int Conf Data Min (SDM) 2006:604–608MathSciNet
go back to reference Elkan C (2003) Using the triangle inequality to accelerate k-means. Proc Int Conf Mach Learn (ICML) 2003:147–153 Elkan C (2003) Using the triangle inequality to accelerate k-means. Proc Int Conf Mach Learn (ICML) 2003:147–153
go back to reference Goldfarb L (1992) What is distance and why do we need the metric model for pattern learning? Pattern Recognit 25(4):431–438CrossRef Goldfarb L (1992) What is distance and why do we need the metric model for pattern learning? Pattern Recognit 25(4):431–438CrossRef
go back to reference Greene D, Cunningham P (2006) Practical solutions to the problem of diagonal dominance in kernel document clustering. Proc Int Conf Mach Learn (ICML) 2006:377–384 Greene D, Cunningham P (2006) Practical solutions to the problem of diagonal dominance in kernel document clustering. Proc Int Conf Mach Learn (ICML) 2006:377–384
go back to reference Grosshans M et al (2014) Joint prediction of topics in a URL hierarchy. Proc Joint Eur Conf Mach Learn Knowl Disc Databases (ECML/PKDD) 2014:514–529CrossRef Grosshans M et al (2014) Joint prediction of topics in a URL hierarchy. Proc Joint Eur Conf Mach Learn Knowl Disc Databases (ECML/PKDD) 2014:514–529CrossRef
go back to reference Guntuboyina A (2011) Lower bounds for the minimax risk using \(f\)-divergences, and applications. IEEE Trans Inform Theory 57(4):2386–2399MathSciNetMATHCrossRef Guntuboyina A (2011) Lower bounds for the minimax risk using \(f\)-divergences, and applications. IEEE Trans Inform Theory 57(4):2386–2399MathSciNetMATHCrossRef
go back to reference Hamerly G (2010) Making k-means even faster. Proc SIAM Int Conf Data Min (SDM) 2010:130–140 Hamerly G (2010) Making k-means even faster. Proc SIAM Int Conf Data Min (SDM) 2010:130–140
go back to reference Hassanzadeh FF, Milenkovic O (2014) An axiomatic approach to constructing distances for rank comparison and aggregation. IEEE Trans Inf Theory 60(10):6417–6439MathSciNetMATHCrossRef Hassanzadeh FF, Milenkovic O (2014) An axiomatic approach to constructing distances for rank comparison and aggregation. IEEE Trans Inf Theory 60(10):6417–6439MathSciNetMATHCrossRef
go back to reference Hinneburg A et al (2000) What is the nearest neighbor in high dimensional spaces? Proc Int Conf Very Large Databases (VLDB) 2000:506–515 Hinneburg A et al (2000) What is the nearest neighbor in high dimensional spaces? Proc Int Conf Very Large Databases (VLDB) 2000:506–515
go back to reference Jarvis RA, Patrick EA (1973) Clustering using a similarity measure based on shared nearest neighbors. IEEE Trans Comput C–22(11):1025–1034CrossRef Jarvis RA, Patrick EA (1973) Clustering using a similarity measure based on shared nearest neighbors. IEEE Trans Comput C–22(11):1025–1034CrossRef
go back to reference Jiang Y et al (2014) The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective. Bioinformatics 30(17):i609–i616CrossRef Jiang Y et al (2014) The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective. Bioinformatics 30(17):i609–i616CrossRef
go back to reference Kryszkiewicz M, Lasek P (2010) TI-DBSCAN: clustering with DBSCAN by means of the triangle inequality. Proc Int Conf Rough Sets Curr Trends Comput (RSCTC) 2010:60–69CrossRef Kryszkiewicz M, Lasek P (2010) TI-DBSCAN: clustering with DBSCAN by means of the triangle inequality. Proc Int Conf Rough Sets Curr Trends Comput (RSCTC) 2010:60–69CrossRef
go back to reference Kumar R, Vassilvitskii S (2010) Generalized distances between rankings. Proc Int Conf World Wide Web (WWW) 2010:571–580CrossRef Kumar R, Vassilvitskii S (2010) Generalized distances between rankings. Proc Int Conf World Wide Web (WWW) 2010:571–580CrossRef
go back to reference Liese F, Vajda I (2006) On divergences and informations in statistics and information theory. IEEE Trans Inform Theory 52(10):4394–4412MathSciNetMATHCrossRef Liese F, Vajda I (2006) On divergences and informations in statistics and information theory. IEEE Trans Inform Theory 52(10):4394–4412MathSciNetMATHCrossRef
go back to reference Marczewski E, Steinhaus H (1958) On a certain distance of sets and the corresponding distance of functions. Colloq Math 6:319–327MathSciNetMATHCrossRef Marczewski E, Steinhaus H (1958) On a certain distance of sets and the corresponding distance of functions. Colloq Math 6:319–327MathSciNetMATHCrossRef
go back to reference Moore AW (2000) The anchors hierarchy: using the triangle inequality to survive high dimensional data. Proc Conf Uncertain Artif Intell (UAI) 2000:397–405 Moore AW (2000) The anchors hierarchy: using the triangle inequality to survive high dimensional data. Proc Conf Uncertain Artif Intell (UAI) 2000:397–405
go back to reference Movshovitz-Attias Y et al (2015) Ontological supervision for fine grained classification of street view storefronts. IEEE Conf Comput Vis Pattern Recognit (CVPR) 2015:1693–1702 Movshovitz-Attias Y et al (2015) Ontological supervision for fine grained classification of street view storefronts. IEEE Conf Comput Vis Pattern Recognit (CVPR) 2015:1693–1702
go back to reference Nehrt NL et al (2011) Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol 7(6):e1002073MathSciNetCrossRef Nehrt NL et al (2011) Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol 7(6):e1002073MathSciNetCrossRef
go back to reference Pang B, Lee L (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the annual meeting on association for computational linguistics (ACL) 2004 Pang B, Lee L (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the annual meeting on association for computational linguistics (ACL) 2004
go back to reference Pinsker MS (1964) Information and information stability of random variables and processes. Holden-Day Pinsker MS (1964) Information and information stability of random variables and processes. Holden-Day
go back to reference Rachev ST, Römisch W (2002) Quantitative stability in stochastic programming: the method of probability metrics. Math Oper Res 27(4):792–818MathSciNetMATHCrossRef Rachev ST, Römisch W (2002) Quantitative stability in stochastic programming: the method of probability metrics. Math Oper Res 27(4):792–818MathSciNetMATHCrossRef
go back to reference Radovanović M et al (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531MathSciNetMATH Radovanović M et al (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531MathSciNetMATH
go back to reference Robinson PN, Bauer S (2011) Introduction to bio-ontologies. CRC Press, Boca RatonCrossRef Robinson PN, Bauer S (2011) Introduction to bio-ontologies. CRC Press, Boca RatonCrossRef
go back to reference Rogers MF, Ben-Hur A (2009) The use of gene ontology evidence codes in preventing classifier assessment bias. Bioinformatics 25(9):1173–1177CrossRef Rogers MF, Ben-Hur A (2009) The use of gene ontology evidence codes in preventing classifier assessment bias. Bioinformatics 25(9):1173–1177CrossRef
go back to reference Schölkopf B (2000) The kernel trick for distances. Adv Neural Inf Process Syst (NIPS) 2000:301–307 Schölkopf B (2000) The kernel trick for distances. Adv Neural Inf Process Syst (NIPS) 2000:301–307
go back to reference Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, CambridgeMATHCrossRef Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, CambridgeMATHCrossRef
go back to reference Tan PN et al (2006) Introduction to data mining. Pearson, New York Tan PN et al (2006) Introduction to data mining. Pearson, New York
go back to reference Ting KM et al (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. Proc Int Conf Knowl Discov Data Min (KDD) 2016:1205–1214 Ting KM et al (2016) Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. Proc Int Conf Knowl Discov Data Min (KDD) 2016:1205–1214
go back to reference Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244MATH Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244MATH
go back to reference Wu D et al (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees. PLoS ONE 6(3):e18011CrossRef Wu D et al (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees. PLoS ONE 6(3):e18011CrossRef
go back to reference Wu X, Kumar V (2009) The top ten algorithms in data mining. CRC Press, Boca RatonCrossRef Wu X, Kumar V (2009) The top ten algorithms in data mining. CRC Press, Boca RatonCrossRef
go back to reference Xing EP et al (2003) Distance metric learning with application to clustering with side-information. Adv Neural Inf Process Syst (NIPS) 2003:521–528 Xing EP et al (2003) Distance metric learning with application to clustering with side-information. Adv Neural Inf Process Syst (NIPS) 2003:521–528
go back to reference Yang L, Jin R (2006) Distance metric learning: a comprehensive survey. Mich State Univ 2(2):4 Yang L, Jin R (2006) Distance metric learning: a comprehensive survey. Mich State Univ 2(2):4
go back to reference Yujian L, Bo L (2007) A normalized Levenshtein distance metric. IEEE Trans Pattern Anal Mach Intell 29(6):1091–1095CrossRef Yujian L, Bo L (2007) A normalized Levenshtein distance metric. IEEE Trans Pattern Anal Mach Intell 29(6):1091–1095CrossRef
Metadata
Title
A new class of metrics for learning on real-valued and structured data
Authors
Ruiyu Yang
Yuxiang Jiang
Scott Mathews
Elizabeth A. Housworth
Matthew W. Hahn
Predrag Radivojac
Publication date
27-03-2019
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 4/2019
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-019-00622-6

Other articles of this Issue 4/2019

Data Mining and Knowledge Discovery 4/2019 Go to the issue

Premium Partner