Skip to main content
Top

2017 | OriginalPaper | Chapter

Pivot-Based Distributed K-Nearest Neighbor Mining

Authors : Caitlin Kuhlman, Yizhou Yan, Lei Cao, Elke Rundensteiner

Published in: Machine Learning and Knowledge Discovery in Databases

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

k-nearest neighbor (kNN) search is a fundamental data mining task critical to many data analytics methods. Yet no effective techniques to date scale kNN search to large datasets. In this work we present PkNN, an exact distributed method that by leveraging modern distributed architectures for the first time scales kNN search to billion point datasets. The key to the PkNN strategy is a multi-round kNN search that exploits pivot-based data partitioning at each stage. This includes an outlier-driven partition adjustment mechanism that effectively minimizes data duplication and achieves a balanced workload across the compute cluster. Aggressive data-driven bounds along with a tiered support assignment strategy ensure correctness while limiting computation costs. Our experimental study on multi-dimensional real-world data demonstrates that PkNN achieves significant speedup over the state-of-the-art and scales effectively in data cardinality. Code and data related to this chapter are available at: http://​solar-10.​wpi.​edu/​cakuhlman/​PkNN.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
Literature
2.
go back to reference Afrati, F.N., Sarma, A.D., Salihoglu, S., Ullman, J.D.: Upper and lower bounds on the cost of a map-reduce computation. Proc. VLDB Endow. 6(4), 277–288 (2013)CrossRef Afrati, F.N., Sarma, A.D., Salihoglu, S., Ullman, J.D.: Upper and lower bounds on the cost of a map-reduce computation. Proc. VLDB Endow. 6(4), 277–288 (2013)CrossRef
3.
go back to reference Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: ordering points to identify the clustering structure, pp. 49–60. ACM Press (1999) Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: ordering points to identify the clustering structure, pp. 49–60. ACM Press (1999)
4.
go back to reference Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: ACM SIGMOD International Conference on Management of Data, pp. 93–104. ACM, New York (2000) Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: ACM SIGMOD International Conference on Management of Data, pp. 93–104. ACM, New York (2000)
5.
go back to reference Chatzimilioudis, G., Costa, C., Zeinalipour-Yazti, D., Lee, W.C., Pitoura, E.: Distributed in-memory processing of all k nearest neighbor queries. IEEE Trans. Knowl. Data Eng. 28(4), 925–938 (2016)CrossRef Chatzimilioudis, G., Costa, C., Zeinalipour-Yazti, D., Lee, W.C., Pitoura, E.: Distributed in-memory processing of all k nearest neighbor queries. IEEE Trans. Knowl. Data Eng. 28(4), 925–938 (2016)CrossRef
6.
go back to reference Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of 23rd International Conference on Very Large Data Bases, pp. 426–435 (1997) Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of 23rd International Conference on Very Large Data Bases, pp. 426–435 (1997)
7.
go back to reference Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)CrossRefMATH Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)CrossRefMATH
9.
go back to reference Eldawy, A., Mokbel, M.F.: SpatialHadoop: a MapReduce framework for spatial data. In: International Conference on Data Engineering, pp. 1352–1363. IEEE (2015) Eldawy, A., Mokbel, M.F.: SpatialHadoop: a MapReduce framework for spatial data. In: International Conference on Data Engineering, pp. 1352–1363. IEEE (2015)
10.
go back to reference Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp. 226–231 (1996) Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp. 226–231 (1996)
11.
go back to reference Guttman, A.: R-trees: a dynamic index structure for spatial searching, vol. 14. ACM (1984) Guttman, A.: R-trees: a dynamic index structure for spatial searching, vol. 14. ACM (1984)
12.
go back to reference Haghani, P., Michel, S., Cudré-Mauroux, P., Aberer, K.: LSH at large-distributed KNN search in high dimensions. In: International Workshop on the Web and Databases (2008) Haghani, P., Michel, S., Cudré-Mauroux, P., Aberer, K.: LSH at large-distributed KNN search in high dimensions. In: International Workshop on the Web and Databases (2008)
13.
go back to reference Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces (survey article). ACM Trans. Database Syst. 28(4), 517–580 (2003)CrossRef Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces (survey article). ACM Trans. Database Syst. 28(4), 517–580 (2003)CrossRef
14.
go back to reference Lin, K.I., Jagadish, H.V., Faloutsos, C.: The TV-tree: an index structure for high-dimensional data. Int. J. Very Large Data Bases 3(4), 517–542 (1994)CrossRef Lin, K.I., Jagadish, H.V., Faloutsos, C.: The TV-tree: an index structure for high-dimensional data. Int. J. Very Large Data Bases 3(4), 517–542 (1994)CrossRef
15.
go back to reference Lu, W., Shen, Y., Chen, S., Ooi, B.C.: Efficient processing of k nearest neighbor joins using mapreduce. Proc. VLDB Endow. 5(10), 1016–1027 (2012)CrossRef Lu, W., Shen, Y., Chen, S., Ooi, B.C.: Efficient processing of k nearest neighbor joins using mapreduce. Proc. VLDB Endow. 5(10), 1016–1027 (2012)CrossRef
16.
go back to reference Maillo, J., Ramírez, S., Triguero, I., Herrera, F.: kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowl.-Based Syst. 117, 3–15 (2017)CrossRef Maillo, J., Ramírez, S., Triguero, I., Herrera, F.: kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowl.-Based Syst. 117, 3–15 (2017)CrossRef
17.
go back to reference Maillo, J., Triguero, I., Herrera, F.: A mapreduce-based k-nearest neighbor approach for big data classification. In: Trustcom/BigDataSE/ISPA, vol. 2, pp. 167–172. IEEE (2015) Maillo, J., Triguero, I., Herrera, F.: A mapreduce-based k-nearest neighbor approach for big data classification. In: Trustcom/BigDataSE/ISPA, vol. 2, pp. 167–172. IEEE (2015)
18.
go back to reference Novak, D., Zezula, P.: M-Chord: a scalable distributed similarity search structure. In: Proceedings of the 1st International Conference on Scalable Information Systems, p. 19. ACM (2006) Novak, D., Zezula, P.: M-Chord: a scalable distributed similarity search structure. In: Proceedings of the 1st International Conference on Scalable Information Systems, p. 19. ACM (2006)
19.
go back to reference Ramaswamy, S., Rose, K.: Adaptive cluster distance bounding for high-dimensional indexing. IEEE Trans. Knowl. Data Eng. 23(6), 815–830 (2011)CrossRef Ramaswamy, S., Rose, K.: Adaptive cluster distance bounding for high-dimensional indexing. IEEE Trans. Knowl. Data Eng. 23(6), 815–830 (2011)CrossRef
20.
go back to reference Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets, vol. 29, pp. 427–438. ACM (2000) Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets, vol. 29, pp. 427–438. ACM (2000)
21.
go back to reference Sarma, A.D., Afrati, F.N., Salihoglu, S., Ullman, J.D.: Upper and lower bounds on the cost of a map-reduce computation. In: Proceedings of the VLDB Endowment, vol. 6, pp. 277–288. VLDB Endowment (2013) Sarma, A.D., Afrati, F.N., Salihoglu, S., Ullman, J.D.: Upper and lower bounds on the cost of a map-reduce computation. In: Proceedings of the VLDB Endowment, vol. 6, pp. 277–288. VLDB Endowment (2013)
22.
go back to reference Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: IEEE 26th Symposium on Mass Storage Systems and Technologies, pp. 1–10. IEEE (2010) Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: IEEE 26th Symposium on Mass Storage Systems and Technologies, pp. 1–10. IEEE (2010)
24.
go back to reference Xie, D., Li, F., Yao, B., Li, G., Zhou, L., Guo, M.: Simba: efficient in-memory spatial analytics. In: ACM SIGMOD International Conference on Management of Data, pp. 1071–1085 (2016) Xie, D., Li, F., Yao, B., Li, G., Zhou, L., Guo, M.: Simba: efficient in-memory spatial analytics. In: ACM SIGMOD International Conference on Management of Data, pp. 1071–1085 (2016)
25.
go back to reference Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10 (2010) Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10 (2010)
26.
go back to reference Zhang, C., Li, F., Jestes, J.: Efficient parallel KNN joins for large data in mapreduce. In: Proceedings of the 15th International Conference on Extending Database Technology, pp. 38–49. ACM (2012) Zhang, C., Li, F., Jestes, J.: Efficient parallel KNN joins for large data in mapreduce. In: Proceedings of the 15th International Conference on Extending Database Technology, pp. 38–49. ACM (2012)
Metadata
Title
Pivot-Based Distributed K-Nearest Neighbor Mining
Authors
Caitlin Kuhlman
Yizhou Yan
Lei Cao
Elke Rundensteiner
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-71246-8_51

Premium Partner