Skip to main content
Top
Published in: Pattern Analysis and Applications 3/2017

16-12-2015 | Theoretical Advances

A novel supervised cluster adjustment method using a fast exact nearest neighbor search algorithm

Authors: Ali Zaghian, Fakhroddin Noorbehbahani

Published in: Pattern Analysis and Applications | Issue 3/2017

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Supervised clustering is a new research area that aims to improve unsupervised clustering algorithms exploiting supervised information. Today, there are several clustering algorithms, but the effective supervised cluster adjustment method which is able to adjust the resulting clusters, regardless of applied clustering algorithm has not been presented yet. In this paper, we propose a new supervised cluster adjustment method which can be applied to any clustering algorithm. Since the adjustment method is based on finding the nearest neighbors, a novel exact nearest neighbor search algorithm is also introduced which is significantly faster than the classic one. Several datasets and clustering evaluation metrics are employed to examine the effectiveness of the proposed cluster adjustment method and the proposed fast exact nearest neighbor algorithm comprehensively. The experimental results show that the proposed algorithms are significantly effective in improving clusters and accelerating nearest neighbor searches.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41:578–588CrossRefMATH Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41:578–588CrossRefMATH
3.
go back to reference Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, pp 59–68 Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, pp 59–68
4.
go back to reference Eick CF, Zeidat NM, Zhao Z (2004) Supervised clustering—algorithms and benefits. In: ICTAI, pp 774–776 Eick CF, Zeidat NM, Zhao Z (2004) Supervised clustering—algorithms and benefits. In: ICTAI, pp 774–776
5.
go back to reference Vilalta R, Achari M, Eick CF (2004) Piece-wise model fitting using local data patterns. In: Proceedings of the 16th European conference on artificial intelligence, ECAI’2004, Vol 16, p 559 Vilalta R, Achari M, Eick CF (2004) Piece-wise model fitting using local data patterns. In: Proceedings of the 16th European conference on artificial intelligence, ECAI’2004, Vol 16, p 559
6.
go back to reference Eick CF, Rouhana A, Bagherjeiran A, Vilalta R (2006) Using clustering to learn distance functions for supervised similarity assessment. Eng Appl Artif Intell 19(4):395–401CrossRef Eick CF, Rouhana A, Bagherjeiran A, Vilalta R (2006) Using clustering to learn distance functions for supervised similarity assessment. Eng Appl Artif Intell 19(4):395–401CrossRef
7.
go back to reference Basu S, Bilenko M, Mooney M (2003) Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering. In: Proceedings of the ICML-2003 workshop on the continuum from labeled to unlabeled data, pp 42–49 Basu S, Bilenko M, Mooney M (2003) Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering. In: Proceedings of the ICML-2003 workshop on the continuum from labeled to unlabeled data, pp 42–49
8.
go back to reference Cohn D, Caruana R, McCallum A (2003) Semi-supervised clustering with user feedback. Constrained Clust Adv Algorithms Theory Appl 4(1):17–32MATH Cohn D, Caruana R, McCallum A (2003) Semi-supervised clustering with user feedback. Constrained Clust Adv Algorithms Theory Appl 4(1):17–32MATH
9.
go back to reference Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: International conference machine learning, pp 307–314 Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: International conference machine learning, pp 307–314
10.
go back to reference Xing EP, Ng AY, Jordan MI, Russell S (2003) Distance metric learning, with application to clustering with side-information. In: Advances in neural information processing systems 15, pp 505–512 Xing EP, Ng AY, Jordan MI, Russell S (2003) Distance metric learning, with application to clustering with side-information. In: Advances in neural information processing systems 15, pp 505–512
11.
go back to reference Bilenko M, Mooney RJ (2003) Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. Washington, DC, pp 39–48. doi:10.1145/956755.956759 Bilenko M, Mooney RJ (2003) Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. Washington, DC, pp 39–48. doi:10.​1145/​956755.​956759
12.
go back to reference Basu S, Banerjee A, Mooney R (2002) Semi-supervised clustering by seeding. In: Proceedings of the 19th international conference on machine learning (ICML-2002), pp 19–26 Basu S, Banerjee A, Mooney R (2002) Semi-supervised clustering by seeding. In: Proceedings of the 19th international conference on machine learning (ICML-2002), pp 19–26
13.
go back to reference Demiriz A, Bennett K, Embrechts MJ (1999) Semi-supervised clustering using genetic algorithms. In: Artificial neural networks in engineering (ANNIE-99), pp 809–814 Demiriz A, Bennett K, Embrechts MJ (1999) Semi-supervised clustering using genetic algorithms. In: Artificial neural networks in engineering (ANNIE-99), pp 809–814
14.
go back to reference Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of 17th international conference machine learning (ICML 2000), pp 1103–1110 Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of 17th international conference machine learning (ICML 2000), pp 1103–1110
15.
go back to reference Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan G, Ng A, Liu B, Yu P, Zhou Z-H, Steinbach M, Hand D, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37CrossRef Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan G, Ng A, Liu B, Yu P, Zhou Z-H, Steinbach M, Hand D, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37CrossRef
16.
go back to reference Kolahdouzan MR, Shahabi C (2004) Voronoi-based K nearest neighbor search for spatial network databases. In: VLDB, pp 840–851 Kolahdouzan MR, Shahabi C (2004) Voronoi-based K nearest neighbor search for spatial network databases. In: VLDB, pp 840–851
17.
go back to reference Toussaint G (2002) Proximity graphs for nearest neighbor decision rules: recent progress. In: Proceedings of the 34th symposium on the INTERFACE, pp 17–20 Toussaint G (2002) Proximity graphs for nearest neighbor decision rules: recent progress. In: Proceedings of the 34th symposium on the INTERFACE, pp 17–20
19.
go back to reference de Berg M, Cheong O, van Kreveld M, Overmars M (2008) Computational geometry algorithms and applications, 3rd edn. Springer, HeidelbergMATH de Berg M, Cheong O, van Kreveld M, Overmars M (2008) Computational geometry algorithms and applications, 3rd edn. Springer, HeidelbergMATH
20.
go back to reference Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: International conference on very large data bases (VLDB), pp 426–435 Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: International conference on very large data bases (VLDB), pp 426–435
21.
go back to reference Guttmann A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of ACM-SIGMOD, pp 47–57 Guttmann A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of ACM-SIGMOD, pp 47–57
22.
go back to reference Kim YJ, Patel JM (2010) Performance comparison of the R*-tree and the quadtree for knn and distance join queries. IEEE Trans Knowl Data Eng 22(7):1014–1027CrossRef Kim YJ, Patel JM (2010) Performance comparison of the R*-tree and the quadtree for knn and distance join queries. IEEE Trans Knowl Data Eng 22(7):1014–1027CrossRef
23.
go back to reference Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: VLDB’99 proceedings of the 25th international conference on very large data bases, pp 518–529 Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: VLDB’99 proceedings of the 25th international conference on very large data bases, pp 518–529
24.
go back to reference Wang X (2011) A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. In: IJCNN, pp 1293–1299 Wang X (2011) A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. In: IJCNN, pp 1293–1299
26.
go back to reference Amigó E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr Boston 12(4):461–486CrossRef Amigó E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr Boston 12(4):461–486CrossRef
27.
go back to reference Zhao Y (2001) Criterion functions for document clustering: experiments and analysis (technical report), Department of Computer Science, University of Minnesota, pp 1–30 Zhao Y (2001) Criterion functions for document clustering: experiments and analysis (technical report), Department of Computer Science, University of Minnesota, pp 1–30
28.
go back to reference Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques (00-34), Technical report, University of Minnesota Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques (00-34), Technical report, University of Minnesota
29.
go back to reference Meila M (2003) Comparing clusterings by the variation of information. Learning theory and kernel machines. Springer, Washington, pp 173–187CrossRef Meila M (2003) Comparing clusterings by the variation of information. Learning theory and kernel machines. Springer, Washington, pp 173–187CrossRef
30.
go back to reference Bagga A, Baldwin B (1998) Entity-based cross-document coreferencing using the vector space model. In: Proceedings of the 17th international conference on computational linguistics, vol 1, pp 79–85 Bagga A, Baldwin B (1998) Entity-based cross-document coreferencing using the vector space model. In: Proceedings of the 17th international conference on computational linguistics, vol 1, pp 79–85
31.
go back to reference Gennari JH, Langley P, Fisher DH (1989) Models of incremental concept formation. Artif Intell 40(1–3):11–61CrossRef Gennari JH, Langley P, Fisher DH (1989) Models of incremental concept formation. Artif Intell 40(1–3):11–61CrossRef
32.
go back to reference MacQueen JB (1966) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297 MacQueen JB (1966) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297
33.
go back to reference Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer, New YorkCrossRefMATH Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer, New YorkCrossRefMATH
34.
go back to reference Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035 Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035
36.
go back to reference Rousseeuw L, Kaufman L (1987) Clustering by means of medoids. Statistical data analysis based L1-norm related methods. First international conference, 405, 405–416 Rousseeuw L, Kaufman L (1987) Clustering by means of medoids. Statistical data analysis based L1-norm related methods. First international conference, 405, 405–416
37.
go back to reference Powers DMW (2011) Evaluation: from precision, recall and f-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2(1):37–63MathSciNet Powers DMW (2011) Evaluation: from precision, recall and f-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2(1):37–63MathSciNet
Metadata
Title
A novel supervised cluster adjustment method using a fast exact nearest neighbor search algorithm
Authors
Ali Zaghian
Fakhroddin Noorbehbahani
Publication date
16-12-2015
Publisher
Springer London
Published in
Pattern Analysis and Applications / Issue 3/2017
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-015-0527-6

Other articles of this Issue 3/2017

Pattern Analysis and Applications 3/2017 Go to the issue

Premium Partner