Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 1/2017

11.05.2016

SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption

verfasst von: Thilak L. Fernando, Geoffrey I. Webb

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 1/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering for different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are interval scale. In the interval scale, it is assumed that a unit difference has the same meaning irrespective of the magnitudes of the values separated. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure which does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications: DBScan clustering and content based multimedia information retrieval with real world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
A monotonic transform, \(f:{\mathbb {R}} \rightarrow {\mathbb {R}}\) is either \(\forall x > y \Leftrightarrow f(x) \ge f(y)\) or \(\forall x > y \Leftrightarrow f(x) \le f(y)\). Such a transform can produce ambiguities in the order. A strictly monotonic transform is either \(\forall x > y \Leftrightarrow f(x) > f(y)\) or \(\forall x > y \Leftrightarrow f(x) < f(y)\). Hence, a strictly monotonic transform can guarantee either order preservation or order reversal.
 
2
Datasets are generally subjected to min-max normalization. As a result, linear order preserving transforms do not alter the similarity scores.
 
3
Only the top k instances are important to the user in information retrieval.
 
4
ReFeat works only with imbalanced trees. Sample size 2 can only produce balanced trees.
 
Literatur
Zurück zum Zitat Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410CrossRef Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410CrossRef
Zurück zum Zitat Ashby FG, Ennis DM (2007) Similarity measures. Scholarpedia 2(12):4116CrossRef Ashby FG, Ennis DM (2007) Similarity measures. Scholarpedia 2(12):4116CrossRef
Zurück zum Zitat Cha SH (2007) Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Models Methods Appl Sci 1(4):300–307MathSciNet Cha SH (2007) Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Models Methods Appl Sci 1(4):300–307MathSciNet
Zurück zum Zitat Conover WJ (1980) Practical nonparametric statistics. Wiley, New York Conover WJ (1980) Practical nonparametric statistics. Wiley, New York
Zurück zum Zitat Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second ACM international conference on knowledge discovery and data mining, pp 226–231 Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second ACM international conference on knowledge discovery and data mining, pp 226–231
Zurück zum Zitat Faith DP, Minchin PR, Belbin L (1987) Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69(1–3):57–68CrossRef Faith DP, Minchin PR, Belbin L (1987) Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69(1–3):57–68CrossRef
Zurück zum Zitat Giacinto G, Roli F (2005) Instance-based relevance feedback for image retrieval. Adv Neural Inf Process Syst 17:489–496 Giacinto G, Roli F (2005) Instance-based relevance feedback for image retrieval. Adv Neural Inf Process Syst 17:489–496
Zurück zum Zitat Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann, BurlingtonMATH Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann, BurlingtonMATH
Zurück zum Zitat He J, Li M, Zhang HJ, Tong H, Zhang C (2004) Manifold-ranking based image retrieval. In: Proceedings of the 12th annual ACM international conference on multimedia, MULTIMEDIA ’04, ACM, New York, pp 9–16 He J, Li M, Zhang HJ, Tong H, Zhang C (2004) Manifold-ranking based image retrieval. In: Proceedings of the 12th annual ACM international conference on multimedia, MULTIMEDIA ’04, ACM, New York, pp 9–16
Zurück zum Zitat Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New YorkCrossRefMATH Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New YorkCrossRefMATH
Zurück zum Zitat Osborne J (2002) Notes on the use of data transformations. Pract Assess Res Eval 8(6):1–8 Osborne J (2002) Notes on the use of data transformations. Pract Assess Res Eval 8(6):1–8
Zurück zum Zitat Osborne JW (2010) Improving your data transformations: applying the box-cox transformation. Pract Assess Res Eval 15(12):1–9 Osborne JW (2010) Improving your data transformations: applying the box-cox transformation. Pract Assess Res Eval 15(12):1–9
Zurück zum Zitat Petitjean F, Gançarski P (2012) Summarizing a set of time series by averaging: from steiner sequence to compact multiple alignment. Theor Comput Sci 414(1):76–91MathSciNetCrossRefMATH Petitjean F, Gançarski P (2012) Summarizing a set of time series by averaging: from steiner sequence to compact multiple alignment. Theor Comput Sci 414(1):76–91MathSciNetCrossRefMATH
Zurück zum Zitat Rocchio JJ (1971) Relevance feedback in information retrieval. In: Salton G (ed) The SMART retrieval system: experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, pp 313–323 Rocchio JJ (1971) Relevance feedback in information retrieval. In: Salton G (ed) The SMART retrieval system: experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, pp 313–323
Zurück zum Zitat Shi T, Horvath S (2006) Unsupervised learning with random forest predictors. J Comput Gr Stat 15(1):118–138MathSciNetCrossRef Shi T, Horvath S (2006) Unsupervised learning with random forest predictors. J Comput Gr Stat 15(1):118–138MathSciNetCrossRef
Zurück zum Zitat Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity search: the metric space approach, advances in database systems, vol 32. Springer, BerlinMATH Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity search: the metric space approach, advances in database systems, vol 32. Springer, BerlinMATH
Zurück zum Zitat Zhang R, Zhang ZM (2006) BALAS: empirical bayesian learning in the relevance feedback for image retrieval. Image Vis Comput 24(3):211–223CrossRef Zhang R, Zhang ZM (2006) BALAS: empirical bayesian learning in the relevance feedback for image retrieval. Image Vis Comput 24(3):211–223CrossRef
Zurück zum Zitat Zhou G, Ting K, Liu F, Yin Y (2012) Relevance feature mapping for content-based multimedia information retrieval. Pattern Recognit 45(4):1707–1720CrossRef Zhou G, Ting K, Liu F, Yin Y (2012) Relevance feature mapping for content-based multimedia information retrieval. Pattern Recognit 45(4):1707–1720CrossRef
Zurück zum Zitat Zhou ZH, Dai HB (2006) Query-sensitive similarity measure for content-based image retrieval. In: Proceedings of the sixth international conference on data mining, ICDM ’06, IEEE Computer Society, Washington, DC, pp 1211–1215 Zhou ZH, Dai HB (2006) Query-sensitive similarity measure for content-based image retrieval. In: Proceedings of the sixth international conference on data mining, ICDM ’06, IEEE Computer Society, Washington, DC, pp 1211–1215
Metadaten
Titel
SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption
verfasst von
Thilak L. Fernando
Geoffrey I. Webb
Publikationsdatum
11.05.2016
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 1/2017
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-016-0463-0

Weitere Artikel der Ausgabe 1/2017

Data Mining and Knowledge Discovery 1/2017 Zur Ausgabe