Skip to main content
Erschienen in: Knowledge and Information Systems 3/2015

01.12.2015 | Regular Paper

Multiplicative distance: a method to alleviate distance instability for high-dimensional data

verfasst von: Jafar Mansouri, Morteza Khademi

Erschienen in: Knowledge and Information Systems | Ausgabe 3/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Recently, it has been shown that under a broad set of conditions, the commonly used distance functions will become unstable in high-dimensional data space; i.e., the distance to the farthest data point approaches the distance to the nearest data point of a given query point with increasing dimensionality. It has been shown that if dimensions are independently distributed, and normalized to have zero mean and unit variance, instability happens. In this paper, it is shown that the normalization condition is not necessary, but all appropriate moments must be finite. Furthermore, a new distance function, namely multiplicative distance, is introduced. It is theoretically proved that this function is stable for data with independent dimensions (with identical or nonidentical distribution). In contrast to usual distance functions which are based on the summation of distances over all dimensions (distance components), the multiplicative distance is based on the multiplication of distance components. Experimental results show the stability of the multiplicative distance for data with independent and correlated dimensions in the high-dimensional space and the superiority of the multiplicative distance over the norm distances for the high-dimensional data.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Aggarwal CC, Hinneburg A, Keim DA (2001) Parametric detection of meaningless distances in high dimensional data. In: Proceedings international conference on database theory (ICDT), pp 420–434 Aggarwal CC, Hinneburg A, Keim DA (2001) Parametric detection of meaningless distances in high dimensional data. In: Proceedings international conference on database theory (ICDT), pp 420–434
2.
Zurück zum Zitat Aggarwal CC, Yu PS (2000) The IGrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: Proceedings sixth ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’00), pp 119–129 Aggarwal CC, Yu PS (2000) The IGrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: Proceedings sixth ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’00), pp 119–129
3.
Zurück zum Zitat Bassani HF, Araujo AFR (2012) Dimension selective self-organizing maps for clustering high dimensional data. In: Proceedings international conference on neural networks (IJCNN), pp 1–8 Bassani HF, Araujo AFR (2012) Dimension selective self-organizing maps for clustering high dimensional data. In: Proceedings international conference on neural networks (IJCNN), pp 1–8
4.
Zurück zum Zitat Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbors meaningful? In: Proceedings seventh international conference on database theory (ICDT ’99), vol. 1540, pp 217–235 Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbors meaningful? In: Proceedings seventh international conference on database theory (ICDT ’99), vol. 1540, pp 217–235
5.
Zurück zum Zitat Chang J, Lee A (2008) Parallel high-dimensional index structure for content-based information retrieval. In: Proceedings 8th IEEE international conference on computer and information technology, pp 101–106 Chang J, Lee A (2008) Parallel high-dimensional index structure for content-based information retrieval. In: Proceedings 8th IEEE international conference on computer and information technology, pp 101–106
6.
Zurück zum Zitat Cheng Q (2010) A sparse learning machine for high-dimensional data with application to microarray gene analysis. IEEE/ACM Trans Comput Biol Bioinform 7(4):636–646CrossRef Cheng Q (2010) A sparse learning machine for high-dimensional data with application to microarray gene analysis. IEEE/ACM Trans Comput Biol Bioinform 7(4):636–646CrossRef
7.
Zurück zum Zitat Chu Y-H, Huang J-W, Chuang K-T, Yang D-N, Chen M-S (2010) Density conscious subspace clustering for high-dimensional data. IEEE Trans Knowl Data Eng 22(1):16–30CrossRef Chu Y-H, Huang J-W, Chuang K-T, Yang D-N, Chen M-S (2010) Density conscious subspace clustering for high-dimensional data. IEEE Trans Knowl Data Eng 22(1):16–30CrossRef
8.
Zurück zum Zitat Cui J, Xiao B, Yin Z (2010) Speed up linear scan in high-dimensions using extended B+-tree. In: Proceedings 2nd international workshop on database technology and applications (DBTA), pp 1–4 Cui J, Xiao B, Yin Z (2010) Speed up linear scan in high-dimensions using extended B+-tree. In: Proceedings 2nd international workshop on database technology and applications (DBTA), pp 1–4
9.
Zurück zum Zitat Deegalla S, Bostrom H (2006) Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In: Proceedings international conference on machine learning and applications, pp 245–250 Deegalla S, Bostrom H (2006) Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In: Proceedings international conference on machine learning and applications, pp 245–250
10.
Zurück zum Zitat Demartines P (1994) Analyse de donne’es par re’seaux de neurones auto-organise’s. PhD dissertation, Institut Nat’l Polytechnique de Grenoble, Grenoble, France (in French) Demartines P (1994) Analyse de donne’es par re’seaux de neurones auto-organise’s. PhD dissertation, Institut Nat’l Polytechnique de Grenoble, Grenoble, France (in French)
11.
Zurück zum Zitat Durrant RJ, Kaban A (2009) When is ‘nearest neighbour’ meaningful: a converse theorem and implications. J Complex 25(4):385–397MathSciNetCrossRefMATH Durrant RJ, Kaban A (2009) When is ‘nearest neighbour’ meaningful: a converse theorem and implications. J Complex 25(4):385–397MathSciNetCrossRefMATH
12.
Zurück zum Zitat Ertoz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings SIAM international conference on data mining Ertoz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings SIAM international conference on data mining
13.
Zurück zum Zitat Fauvel M, Chanussot J, Benediktsson JA, Villa A (2013) Parsimonious Mahalanobis kernel for the classification of high dimensional data. Pattern Recognit 46(3):845–854CrossRef Fauvel M, Chanussot J, Benediktsson JA, Villa A (2013) Parsimonious Mahalanobis kernel for the classification of high dimensional data. Pattern Recognit 46(3):845–854CrossRef
14.
Zurück zum Zitat Francois D, Wertz M-V, Verleysen SM-M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873–886CrossRef Francois D, Wertz M-V, Verleysen SM-M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873–886CrossRef
15.
Zurück zum Zitat Gu X, Zhang Y, Zhang L, Zhang D, Li J (2013) An improved method of locality sensitive hashing for indexing large-scale and high-dimensional features. Signal Process 93(8):2244–2255CrossRef Gu X, Zhang Y, Zhang L, Zhang D, Li J (2013) An improved method of locality sensitive hashing for indexing large-scale and high-dimensional features. Signal Process 93(8):2244–2255CrossRef
16.
Zurück zum Zitat Hasan A, Adnan MA (2012) High dimensional microarray data classification using correlation based feature selection. In: Proceedings international conference on biomedical engineering (ICoBE), pp 319–321 Hasan A, Adnan MA (2012) High dimensional microarray data classification using correlation based feature selection. In: Proceedings international conference on biomedical engineering (ICoBE), pp 319–321
17.
Zurück zum Zitat He Q, Wang Q, Du C-Y, Ma X-D, Shi Z-Z (2010) A parallel hyper-surface classifier for high dimensional data. In: Proceedings international symposium on knowledge acquisition and modeling, pp 338–343 He Q, Wang Q, Du C-Y, Ma X-D, Shi Z-Z (2010) A parallel hyper-surface classifier for high dimensional data. In: Proceedings international symposium on knowledge acquisition and modeling, pp 338–343
18.
Zurück zum Zitat Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings 26th international conference on very large data bases, pp 506–515 Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings 26th international conference on very large data bases, pp 506–515
19.
Zurück zum Zitat Hsu C-M, Chen M-S (2009) On the design and applicability of distance functions in high-dimensional data space. IEEE Trans Knowl Data Eng 21(4):523–536MathSciNetCrossRef Hsu C-M, Chen M-S (2009) On the design and applicability of distance functions in high-dimensional data space. IEEE Trans Knowl Data Eng 21(4):523–536MathSciNetCrossRef
20.
Zurück zum Zitat Huang S-C, Wu, T-K (2012) Robust semi-supervised SVM on kernel partial least discriminant space for high dimensional data mining. In: Proceedings international conference on information science and applications (ICISA), pp 1–6 Huang S-C, Wu, T-K (2012) Robust semi-supervised SVM on kernel partial least discriminant space for high dimensional data mining. In: Proceedings international conference on information science and applications (ICISA), pp 1–6
21.
Zurück zum Zitat Jagadish HV, Ooi BC, Tan K-L, Yu C, Zhang R (2005) iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans Database Syst 30(2):364–397CrossRef Jagadish HV, Ooi BC, Tan K-L, Yu C, Zhang R (2005) iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans Database Syst 30(2):364–397CrossRef
22.
Zurück zum Zitat Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef
23.
Zurück zum Zitat Kabán A (2011) On the distance concentration awareness of certain data reduction techniques. Pattern Recognit 44(2):265–277CrossRefMATH Kabán A (2011) On the distance concentration awareness of certain data reduction techniques. Pattern Recognit 44(2):265–277CrossRefMATH
24.
Zurück zum Zitat Kabán A (2012) Non-parametric detection of meaningless distances in high dimensional data. Stat Comput 22:375–385MathSciNetCrossRef Kabán A (2012) Non-parametric detection of meaningless distances in high dimensional data. Stat Comput 22:375–385MathSciNetCrossRef
25.
Zurück zum Zitat Koudas N, Ooi BC, Shen HT, Tung AKH (2004) LDC: enabling search by partial distance in a hyper-dimensional space. In: Proceedings international conference on data engineering, pp 6–17 Koudas N, Ooi BC, Shen HT, Tung AKH (2004) LDC: enabling search by partial distance in a hyper-dimensional space. In: Proceedings international conference on data engineering, pp 6–17
26.
Zurück zum Zitat Ledoux M (2001) The concentration of measure phenomenon. Mathematical Surveys and Monographs, vol 89. American Mathematical Society Ledoux M (2001) The concentration of measure phenomenon. Mathematical Surveys and Monographs, vol 89. American Mathematical Society
27.
Zurück zum Zitat Lee G, Rodriguez C, Madabhushi A (2008) Investigating the efficacy of nonlinear dimensionality reduction schemes in classifying gene and protein expression studies. IEEE/ACM Trans Comput Biol Bioinform 5(3):368–384CrossRef Lee G, Rodriguez C, Madabhushi A (2008) Investigating the efficacy of nonlinear dimensionality reduction schemes in classifying gene and protein expression studies. IEEE/ACM Trans Comput Biol Bioinform 5(3):368–384CrossRef
28.
Zurück zum Zitat Lejsek H, Asmundsson FH, Jonsson BT, Amsaleg L (2009) NV-Tree: an efficient disk-based index for approximate search in very large high-dimensional collections. IEEE Trans Pattern Anal Mach Intell 31(5):869–883CrossRef Lejsek H, Asmundsson FH, Jonsson BT, Amsaleg L (2009) NV-Tree: an efficient disk-based index for approximate search in very large high-dimensional collections. IEEE Trans Pattern Anal Mach Intell 31(5):869–883CrossRef
29.
Zurück zum Zitat Li Y, Luo C, Chung SM (2008) Text clustering with feature selection by using statistical data. IEEE Trans Knowl Data Eng 20(5):641–652CrossRef Li Y, Luo C, Chung SM (2008) Text clustering with feature selection by using statistical data. IEEE Trans Knowl Data Eng 20(5):641–652CrossRef
30.
Zurück zum Zitat Liang J, Vaishnavi VK, Vandenberg A (2006) Clustering of LDAP directory schemas to facilitate information resources interoperability across organizations. IEEE Trans Syst Man Cybern A Syst Hum 36(4):631–642CrossRef Liang J, Vaishnavi VK, Vandenberg A (2006) Clustering of LDAP directory schemas to facilitate information resources interoperability across organizations. IEEE Trans Syst Man Cybern A Syst Hum 36(4):631–642CrossRef
31.
Zurück zum Zitat Liu H, Wei R-X, Jiang G-P (2012) Similarity measurement for data with high-dimensional and mixed feature values through fuzzy clustering. In: Proceedings international conference on computer science and automation engineering (CSAE), vol 3, pp 617–62 Liu H, Wei R-X, Jiang G-P (2012) Similarity measurement for data with high-dimensional and mixed feature values through fuzzy clustering. In: Proceedings international conference on computer science and automation engineering (CSAE), vol 3, pp 617–62
32.
Zurück zum Zitat Mo D, Huang SH (2012) Fractal-based intrinsic dimension estimation and its application in dimensionality reduction. IEEE Transa Knowl Data Eng 24(1):59–71CrossRef Mo D, Huang SH (2012) Fractal-based intrinsic dimension estimation and its application in dimensionality reduction. IEEE Transa Knowl Data Eng 24(1):59–71CrossRef
33.
Zurück zum Zitat Okada S, Nishida T (2011) Online incremental clustering with distance metric learning for high dimensional data. In: Proceedings international joint conference on neural networks, pp 2047–2054 Okada S, Nishida T (2011) Online incremental clustering with distance metric learning for high dimensional data. In: Proceedings international joint conference on neural networks, pp 2047–2054
34.
Zurück zum Zitat Radovanovi’c M, Nanopoulos A, Ivanovi’c M (2010) On the existence of obstinate results in vector space models. In: Proceedings 33rd international ACM SIGIR conference on research and development in information retrieval, New York, pp 186–193 Radovanovi’c M, Nanopoulos A, Ivanovi’c M (2010) On the existence of obstinate results in vector space models. In: Proceedings 33rd international ACM SIGIR conference on research and development in information retrieval, New York, pp 186–193
35.
Zurück zum Zitat Samiappan S, Prasad S, Bruce LM (2013) Non-uniform random feature selection and kernel density scoring with SVM based ensemble classification for hyperspectral image analysis. IEEE J Sel Top Appl Earth Obs Remote Sens 6(2):792–800CrossRef Samiappan S, Prasad S, Bruce LM (2013) Non-uniform random feature selection and kernel density scoring with SVM based ensemble classification for hyperspectral image analysis. IEEE J Sel Top Appl Earth Obs Remote Sens 6(2):792–800CrossRef
36.
Zurück zum Zitat Triguero I, Derrac J, Herrera F (2012) A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Trans Syst Man Cybern C Appl Rev 42(1):86–100CrossRef Triguero I, Derrac J, Herrera F (2012) A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Trans Syst Man Cybern C Appl Rev 42(1):86–100CrossRef
37.
Zurück zum Zitat Turkay C, Lundervold A, Lundervold AJ, Hauser H (2012) Representative factor generation for the interactive visual analysis of high-dimensional data. IEEE Trans Vis Comput Gr 18(12):2621–2630CrossRef Turkay C, Lundervold A, Lundervold AJ, Hauser H (2012) Representative factor generation for the interactive visual analysis of high-dimensional data. IEEE Trans Vis Comput Gr 18(12):2621–2630CrossRef
38.
Zurück zum Zitat Varshney KR, Willsky AS (2011) Linear dimensionality reduction for margin-based classification: high-dimensional data and sensor networks. IEEE Trans Signal Process 59(6):2496–2512MathSciNetCrossRef Varshney KR, Willsky AS (2011) Linear dimensionality reduction for margin-based classification: high-dimensional data and sensor networks. IEEE Trans Signal Process 59(6):2496–2512MathSciNetCrossRef
39.
Zurück zum Zitat Weber R, Schek H-J, Blott S (1998) A quantitative analysis and performance study for similarity search methods in high-dimensional spaces. In: Proceedings very large data base conference (VLDB’98), pp 194–205 Weber R, Schek H-J, Blott S (1998) A quantitative analysis and performance study for similarity search methods in high-dimensional spaces. In: Proceedings very large data base conference (VLDB’98), pp 194–205
40.
Zurück zum Zitat Wei B, Guan T, Yu J (2014) Projected residual vector quantization for ANN search. IEEE Multimed 21(3):41–51CrossRef Wei B, Guan T, Yu J (2014) Projected residual vector quantization for ANN search. IEEE Multimed 21(3):41–51CrossRef
41.
Zurück zum Zitat Wu C, Yang H, Zhu J, Zhang J, King I, Lyu RM (2013) Sparse Poisson coding for high dimensional document clustering. In: Proceedings IEEE international conference on big data, pp 512–517 Wu C, Yang H, Zhu J, Zhang J, King I, Lyu RM (2013) Sparse Poisson coding for high dimensional document clustering. In: Proceedings IEEE international conference on big data, pp 512–517
42.
Zurück zum Zitat Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37CrossRef Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37CrossRef
43.
Zurück zum Zitat Xiong H, Wu J, Chen J (2009) K-Means clustering versus validation measures: a data-distribution perspective. IEEE Trans Syst Man Cybern B Cybern 39(2):318–331CrossRef Xiong H, Wu J, Chen J (2009) K-Means clustering versus validation measures: a data-distribution perspective. IEEE Trans Syst Man Cybern B Cybern 39(2):318–331CrossRef
44.
Zurück zum Zitat Xu M, Chen H, Varshney PK (2013) Dimensionality reduction for registration of high-dimensional data sets. IEEE Trans Image Process 22(8):3041–3049MathSciNetCrossRef Xu M, Chen H, Varshney PK (2013) Dimensionality reduction for registration of high-dimensional data sets. IEEE Trans Image Process 22(8):3041–3049MathSciNetCrossRef
45.
Zurück zum Zitat Yasen Z, Xinwei Z, Ge L, Xian S, Hongqi W, Kun F (2014) Semi-supervised manifold learning based multigraph fusion for high-resolution remote sensing image classification. IEEE Lett Geosci Remote Sens 11(2):464–468CrossRef Yasen Z, Xinwei Z, Ge L, Xian S, Hongqi W, Kun F (2014) Semi-supervised manifold learning based multigraph fusion for high-resolution remote sensing image classification. IEEE Lett Geosci Remote Sens 11(2):464–468CrossRef
46.
Zurück zum Zitat Yu J, Wang M, Tao D (2012) Semisupervised multiview distance metric learning for cartoon synthesis. IEEE Trans Image Process 21(11):4636–4648MathSciNetCrossRef Yu J, Wang M, Tao D (2012) Semisupervised multiview distance metric learning for cartoon synthesis. IEEE Trans Image Process 21(11):4636–4648MathSciNetCrossRef
47.
Zurück zum Zitat Zhang H, Li N (2012) K-Dvd-Tree: a high dimensional data index applying to SIFT feature matching. In: Proceedings fifth international symposium on computational intelligence and design (ISCID), vol 2, pp 14–17 Zhang H, Li N (2012) K-Dvd-Tree: a high dimensional data index applying to SIFT feature matching. In: Proceedings fifth international symposium on computational intelligence and design (ISCID), vol 2, pp 14–17
Metadaten
Titel
Multiplicative distance: a method to alleviate distance instability for high-dimensional data
verfasst von
Jafar Mansouri
Morteza Khademi
Publikationsdatum
01.12.2015
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 3/2015
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-014-0813-4

Weitere Artikel der Ausgabe 3/2015

Knowledge and Information Systems 3/2015 Zur Ausgabe