Skip to main content
Erschienen in: Computing 1/2019

26.04.2018

The out-of-core KNN awakens: the light side of computation force on large datasets

verfasst von: Javier Olivares, Anne-Marie Kermarrec, Nitin Chiluka

Erschienen in: Computing | Ausgabe 1/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

K-nearest neighbors (KNN) is a crucial tool for many applications, e.g. recommender systems, image classification and web-related applications. However, KNN is a resource greedy operation particularly for large datasets. We focus on the challenge of KNN computation over large datasets on a single commodity PC with limited memory. We propose a novel approach to compute KNN on large datasets by leveraging both disk and main memory efficiently. The main rationale of our approach is to minimize random accesses to disk, maximize sequential accesses to data and efficient usage of only the available memory. We evaluate our approach on large datasets, in terms of performance and memory consumption. The evaluation shows that our approach requires only 7% of the time needed by an in-memory baseline to compute a KNN graph.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Fußnoten
1
The term ‘pons’ is Latin for ‘bridge’.
 
Literatur
1.
Zurück zum Zitat Ahmed A, Shervashidze N, Narayanamurthy S, Josifovski V, Smola AJ (2013) Distributed large-scale natural graph factorization. In: Proceedings of the 22nd international conference on world wide web, WWW’13. International World Wide Web Conferences Steering Committee, pp 37–48 Ahmed A, Shervashidze N, Narayanamurthy S, Josifovski V, Smola AJ (2013) Distributed large-scale natural graph factorization. In: Proceedings of the 22nd international conference on world wide web, WWW’13. International World Wide Web Conferences Steering Committee, pp 37–48
2.
Zurück zum Zitat Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 97–104 Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 97–104
3.
Zurück zum Zitat Boiman O, Shechtman E, Irani M (2008) In defense of nearest-neighbor based image classification. In: IEEE conference on computer vision and pattern recognition, pp 1–8 Boiman O, Shechtman E, Irani M (2008) In defense of nearest-neighbor based image classification. In: IEEE conference on computer vision and pattern recognition, pp 1–8
4.
Zurück zum Zitat Boutet A, Frey D, Guerraoui R, Jegou A, Kermarrec AM (2013) Whatsup: a decentralized instant news recommender. In: IEEE 27th international symposium on parallel distributed processing (IPDPS), pp 741–752 Boutet A, Frey D, Guerraoui R, Jegou A, Kermarrec AM (2013) Whatsup: a decentralized instant news recommender. In: IEEE 27th international symposium on parallel distributed processing (IPDPS), pp 741–752
5.
Zurück zum Zitat Boutet A, Frey D, Guerraoui R, Jegou A, Kermarrec AM (2014) Privacy-preserving distributed collaborative filtering. In: Noubir G, Raynal M (eds) Networked systems, LNCS, vol 8593. Springer, Berlin, pp 169–184 Boutet A, Frey D, Guerraoui R, Jegou A, Kermarrec AM (2014) Privacy-preserving distributed collaborative filtering. In: Noubir G, Raynal M (eds) Networked systems, LNCS, vol 8593. Springer, Berlin, pp 169–184
6.
Zurück zum Zitat Boutet A, Frey D, Guerraoui R, Kermarrec AM, Patra R (2014) Hyrec: leveraging browsers for scalable recommenders. In: Proceedings of the 15th international middleware conference. ACM, pp 85–96 Boutet A, Frey D, Guerraoui R, Kermarrec AM, Patra R (2014) Hyrec: leveraging browsers for scalable recommenders. In: Proceedings of the 15th international middleware conference. ACM, pp 85–96
7.
Zurück zum Zitat Chen J, Fang H, Saad Y (2009) Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection. J Mach Learn Res 10:1989–2012MathSciNetMATH Chen J, Fang H, Saad Y (2009) Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection. J Mach Learn Res 10:1989–2012MathSciNetMATH
8.
Zurück zum Zitat Chiluka N, Kermarrec AM, Olivares J (2014) Scaling kNN computation over large graphs on a PC. In: Proceedings of the posters and demos session, middleware’14. ACM, pp 9–10 Chiluka N, Kermarrec AM, Olivares J (2014) Scaling kNN computation over large graphs on a PC. In: Proceedings of the posters and demos session, middleware’14. ACM, pp 9–10
9.
Zurück zum Zitat Debatty T, Michiardi P, Thonnard O, Mees W (2014) Building k-NN graphs from large text data. In: IEEE international conference on big data, pp 573–578 Debatty T, Michiardi P, Thonnard O, Mees W (2014) Building k-NN graphs from large text data. In: IEEE international conference on big data, pp 573–578
10.
Zurück zum Zitat Dong W, Moses C, Li K (2011) Efficient \(k\)-nearest neighbor graph construction for generic similarity measures. In: Proceedings of the 20th international conference on world wide web. ACM, pp 577–586 Dong W, Moses C, Li K (2011) Efficient \(k\)-nearest neighbor graph construction for generic similarity measures. In: Proceedings of the 20th international conference on world wide web. ACM, pp 577–586
11.
Zurück zum Zitat Fukunaga K, Narendra PM (1975) A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans Comput C–24(7):750–753CrossRefMATH Fukunaga K, Narendra PM (1975) A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans Comput C–24(7):750–753CrossRefMATH
12.
Zurück zum Zitat Han WS, Lee S, Park K, Lee JH, Kim MS, Kim J, Yu H (2013) Turbograph: a fast parallel graph engine handling billion-scale graphs in a single PC. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 77–85 Han WS, Lee S, Park K, Lee JH, Kim MS, Kim J, Yu H (2013) Turbograph: a fast parallel graph engine handling billion-scale graphs in a single PC. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 77–85
13.
Zurück zum Zitat Jégou H, Tavenard R, Douze M, Amsaleg L (2011) Searching in one billion vectors: re-rank with source coding. In: IEEE international conference on acoustics, speech and signal processing, pp 861–864 Jégou H, Tavenard R, Douze M, Amsaleg L (2011) Searching in one billion vectors: re-rank with source coding. In: IEEE international conference on acoustics, speech and signal processing, pp 861–864
14.
Zurück zum Zitat Katayama N, Satoh S (1997) The SR-tree: An index structure for high-dimensional nearest neighbor queries. In: Proceedings of the international conference on management of data. ACM, pp 369–380 Katayama N, Satoh S (1997) The SR-tree: An index structure for high-dimensional nearest neighbor queries. In: Proceedings of the international conference on management of data. ACM, pp 369–380
15.
Zurück zum Zitat Kermarrec AM, Mittal N, Olivares J (2017) Multithreading approach to process real-time updates in kNN algorithms. In: 5th international conference on networked systems. Springer, pp 109–114 Kermarrec AM, Mittal N, Olivares J (2017) Multithreading approach to process real-time updates in kNN algorithms. In: 5th international conference on networked systems. Springer, pp 109–114
16.
Zurück zum Zitat Kyrola A, Blelloch G, Guestrin C (2012) Graphchi: large-scale graph computation on just a PC. In: 10th USENIX symposium on operating systems design and implementation (OSDI 12). USENIX, pp 31–46 Kyrola A, Blelloch G, Guestrin C (2012) Graphchi: large-scale graph computation on just a PC. In: 10th USENIX symposium on operating systems design and implementation (OSDI 12). USENIX, pp 31–46
18.
Zurück zum Zitat Lin Z, Kahng M, Sabrin K, Chau D, Lee H, Kang U (2014) Mmap: fast billion-scale graph computation on a PC via memory mapping. In: IEEE international conference on big data, pp 159–164 Lin Z, Kahng M, Sabrin K, Chau D, Lee H, Kang U (2014) Mmap: fast billion-scale graph computation on a PC via memory mapping. In: IEEE international conference on big data, pp 159–164
19.
Zurück zum Zitat McRoberts RE, Nelson MD, Wendt DG (2002) Stratified estimation of forest area using satellite imagery, inventory data, and the \(k\)-nearest neighbors technique. Remote Sens Environ 82(2):457–468CrossRef McRoberts RE, Nelson MD, Wendt DG (2002) Stratified estimation of forest area using satellite imagery, inventory data, and the \(k\)-nearest neighbors technique. Remote Sens Environ 82(2):457–468CrossRef
20.
Zurück zum Zitat Roy A, Mihailovic I, Zwaenepoel W (2013) X-stream: edge-centric graph processing using streaming partitions. In: Proceedings of the 24th ACM symposium on operating systems principles. ACM, pp 472–488 Roy A, Mihailovic I, Zwaenepoel W (2013) X-stream: edge-centric graph processing using streaming partitions. In: Proceedings of the 24th ACM symposium on operating systems principles. ACM, pp 472–488
21.
Zurück zum Zitat Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y (2010) Locality-constrained linear coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 3360–3367 Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y (2010) Locality-constrained linear coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 3360–3367
22.
Zurück zum Zitat Wong WK, Cheung DWl, Kao B, Mamoulis N (2009) Secure kNN computation on encrypted databases. In: Proceedings of the international conference on management of data. ACM, pp 139–152 Wong WK, Cheung DWl, Kao B, Mamoulis N (2009) Secure kNN computation on encrypted databases. In: Proceedings of the international conference on management of data. ACM, pp 139–152
23.
Zurück zum Zitat Zhu X, Han W, Chen W (2015) Gridgraph: large-scale graph processing on a single machine using 2-level hierarchical partitioning. In: USENIX annual technical conference. USENIX Association, pp 375–386 Zhu X, Han W, Chen W (2015) Gridgraph: large-scale graph processing on a single machine using 2-level hierarchical partitioning. In: USENIX annual technical conference. USENIX Association, pp 375–386
Metadaten
Titel
The out-of-core KNN awakens: the light side of computation force on large datasets
verfasst von
Javier Olivares
Anne-Marie Kermarrec
Nitin Chiluka
Publikationsdatum
26.04.2018
Verlag
Springer Vienna
Erschienen in
Computing / Ausgabe 1/2019
Print ISSN: 0010-485X
Elektronische ISSN: 1436-5057
DOI
https://doi.org/10.1007/s00607-018-0616-7

Weitere Artikel der Ausgabe 1/2019

Computing 1/2019 Zur Ausgabe