nach oben

The VLDB Journal

Erschienen in:

12.09.2022 | Regular Paper

Learning-based query optimization for multi-probe approximate nearest neighbor search

verfasst von: Pengcheng Zhang, Bin Yao, Chao Gao, Bin Wu, Xiao He, Feifei Li, Yuanfei Lu, Chaoqun Zhan, Feilong Tang

Erschienen in: The VLDB Journal | Ausgabe 3/2023

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Approximate nearest neighbor search (ANNS) is a fundamental problem that has attracted widespread attention for decades. Multi-probe ANNS is one of the most important classes of ANNS methods, playing crucial roles in disk-based, GPU-based, and distributed scenarios. The state-of-the-art multi-probe ANNS approaches typically perform in a fixed-configuration manner. For example, each query is dispatched to a fixed number of partitions to run ANNS algorithms locally, and the results will be merged to obtain the final result set. Our observation shows that such fixed configurations typically lead to a non-optimal accuracy–efficiency trade-off. To further optimize multi-probe ANNS, we propose to generate efficient configurations for each query individually. By formalizing the per-query optimization as a 0–1 knapsack problem and its variants, we identify that the kNN distribution (the proportion of k nearest neighbors of a query placed in each partition) is essential to the optimization. Then we develop LEQAT (LEarned Query-Aware OpTimizer), which leverages kNN distribution to seek optimal configurations for each query. LEQAT comes with (i) a machine learning model to learn and estimate kNN distributions based on historical or sample queries and (ii) efficient query optimization algorithms to determine the partitions to probe and the number of searching neighbors in each partition. We apply LEQAT to three state-of-the-art ANNS methods IVF, HNSW, and SSG under clustering-based partitioning, evaluating the overall performance on several real-world datasets. The results show that LEQAT consistently reduces the latency by up to 58% and improves the throughput by up to 3.9 times.

Vorheriger Artikel Multi-constraint shortest path using forest hop labeling

Nächster Artikel Toward maintenance of hypercores in large-scale dynamic hypergraphs

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

https://github.com/FALCONN-LIB/FALCONN

https://github.com/nmslib/hnswlib

https://github.com/ZJULearning/SSG

Adeniyi, D.A., Wei, Z., Yongquan, Y.: Automated web usage data mining and recommendation system using \(k\)-nearest neighbor (\(k\)NN) classification method. Appl. Comput. Inform. 12, 90–108 (2016)CrossRef

Anagnostopoulos, C., Triantafillou, P.: Learning set cardinality in distance nearest neighbours. In: ICDM, pp. 691–696. IEEE (2015)

Andoni, A., Indyk P., Laarhoven T., Razenshteyn I., Schmidt, L.: Practical and optimal LSH for angular distance. arXiv preprint arXiv:1509.02897 (2015)

Babenko, A., Lempitsky, V.: The inverted multi-index. IEEE Trans. Pattern Anal. Mach. Intell. 37(6), 1247–1260 (2014)CrossRef

Babenko, A., Lempitsky, V.: Efficient indexing of billion-scale datasets of deep descriptors. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2055–2063 (2016)

Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509–517 (1975)CrossRefMATH

Bijalwan, V., Kumar, V., Kumari, P., Pascual, J.: \(k\)NN based machine learning approach for text and document mining. Int. J. Database Theory Appl. 7, 61–70 (2014)CrossRef

Chauvin, Y., Rumelhart, D.E.: Backpropagation: Theory, Architectures, and Applications. Psychology Press, London (1995)

Chen, L., Özsu, M.T., Oria, V.: Robust and fast similarity search for moving object trajectories. In: SIGMOD, pp. 491–502 (2005)

10.

Chen, Q., Wang, H., Li, M., Ren, G., Li, S., Zhu, J., Li, J., Liu, C., Zhang, L., Wang, J.: SPTAG: a library for fast approximate nearest neighbor search (2018)

11.

Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Symposium on Computational Geometry, pp. 253–262 (2004)

12.

Dong, Y., Indyk, P., Razenshteyn, I., Wagner, T.: Learning space partitions for nearest neighbor search. In: ICLR (2020)

13.

Dudziński, K., Walukiewicz, S.: Exact methods for the knapsack problem and its generalizations. Eur. J. Oper. Res. 28(1), 3–21 (1987)MathSciNetCrossRefMATH

14.

Fu, C., Wang, C., Cai, D.: Satellite system graph: towards the efficiency up-boundary of graph-based approximate nearest neighbor search. CoRR, arXiv:1907.06146 (2019)

15.

Fu, C., Xiang, C., Wang, C., Cai, D.: Fast approximate nearest neighbor search with the navigating spreading-out graph. VLDB 12, 461–474 (2019)

16.

Gardner, M.W., Dorling, S.: Artificial neural networks (the multilayer perceptron)-a review of applications in the atmospheric sciences. Atmos. Environ. 32(14–15), 2627–2636 (1998)CrossRef

17.

Ge, T., He, K., Ke, Q., Sun, J.: Optimized product quantization for approximate nearest neighbor search. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2946–2953 (2013)

18.

Gionis, A., Indyk, P., Motwani, R., et al.: Similarity search in high dimensions via hashing. VLDB 99, 518–529 (1999)

19.

Gray, R.: Vector quantization. IEEE Assp Mag. 1, 4–29 (1984)CrossRef

20.

Guo, G., Wang, H., Bell, D., Bi, Y., Greer, K.: \(k\)NN model-based approach in classification. In: OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”, pp. 986–996 (2003)

21.

Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: ACM (1984)

22.

Hartigan, J.A., Wong, M.A.: Algorithm as 136: a \(k\)-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 28, 100–108 (1979)MATH

23.

Hilbe, J.M.: Logistic Regression Models. Chapman and Hall/CRC, Boca Raton (2009)CrossRefMATH

24.

Huang, Q., Feng, J., Zhang, Y., Fang, Q., Ng, W.: Query-aware locality-sensitive hashing for approximate nearest neighbor search. VLDB 9, 1–12 (2015)

25.

Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM Symposium on Theory of Computing, pp. 604–613 (1998)

26.

Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2010)CrossRef

27.

Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 117–128 (2011)CrossRef

28.

Jégou, H., Tavenard, R., Douze, M., Amsaleg, L.: Searching in one billion vectors: re-rank with source coding. In: ICASSP, pp. 861–864 (2011)

29.

Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Trans. Big Data 7(3), 535–547 (2019)CrossRef

30.

Kalantidis, Y., Avrithis, Y.: Locally optimized product quantization for approximate nearest neighbor search. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2321–2328 (2014)

31.

Kellerer, H., Pferschy, U., Pisinger, D.: The multiple-choice knapsack problem. In: Knapsack Problems, pp. 317–347. Springer, Berlin, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24777-7_11

32.

Kraska, T., Beutel, A., Chi, E.H., Dean, J., Polyzotis, N.: The case for learned index structures. In: SIGMOD, pp. 489–504 (2018)

33.

Li, C., Zhang, M., Andersen, D.G., He, Y.: Improving approximate nearest neighbor search through learned adaptive early termination. In: SIGMOD, pp. 2539–2554 (2020)

34.

Li, P., Lu, H., Zheng, Q., Yang, L., Pan, G.: Lisa: a learned index structure for spatial data. In: SIGMOD, pp. 2119–2133 (2020)

35.

Li, W., Zhang, Y., Sun, Y., Wang, W., Li, M., Zhang, W., Lin, X.: Approximate nearest neighbor search on high dimensional data-experiments, analyses, and improvement. IEEE Trans. Knowl. Data Eng. 32, 1475–1488 (2019)CrossRef

36.

Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe LSH: efficient indexing for high-dimensional similarity search. In: 33rd International Conference on Very Large Data Bases, VLDB 2007, pp. 950–961. Association for Computing Machinery, Inc. (2007)

37.

Malkov, Y., Ponomarenko, A., Logvinov, A., Krylov, V.: Approximate nearest neighbor algorithm based on navigable small world graphs. Inf. Syst. 45, 61–68 (2014)CrossRef

38.

Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 824–836 (2018)CrossRef

39.

Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36, 2227–2240 (2014)CrossRef

40.

Omohundro, S.M.: Five balltree construction algorithms. In: International Computer Science Institute Berkeley (1989)

41.

Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)

42.

Qin, J., Wang, W., Xiao, C., Zhang, Y.: Similarity query processing for high-dimensional data. VLDB 13(12), 3437–3440 (2020)

43.

Robinson, J.T.: The KDB-tree: a search structure for large multidimensional dynamic indexes. In: SIGMOD, pp. 10–18 (1981)

44.

Rui, Y., Huang, T.S., Chang, S.-F.: Image retrieval: current techniques, promising directions, and open issues. J. Vis. Commun. Image Represent. 10, 39–62 (1999)CrossRef

45.

Shamos, M.I., Hoey, D.: Closest-point problems. In: Symposium on Foundations of Computer Science, pp. 151–162 (1975)

46.

Suchal, J., Návrat, P.: Full text search engine as scalable \(k\)-nearest neighbor recommendation system. In: International Conference on Artificial Intelligence in Theory and Practice, pp. 165–173 (2010)

47.

Sun, J., Li, G., Tang, N.: Learned cardinality estimation for similarity queries. In: SIGMOD, pp. 1745–1757 (2021)

48.

Sun, Y., Wang, W., Qin, J., Zhang, Y., Lin, X.: SRS: solving c-approximate nearest neighbor queries in high dimensional Euclidean space with a tiny index. VLDB 8, 1–12 (2014)

49.

Sundaram, N., Turmukhametova, A., Satish, N., Mostak, T., Indyk, P., Madden, S., Dubey, P.: Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. VLDB 6, 1930–1941 (2013)

50.

Wang, H., Fu, X., Xu, J., Lu, H.: Learned index for spatial queries. In: MDM, pp. 569–574. IEEE (2019)

51.

Wang, Y., Xiao, C., Qin, J., Cao, X., Sun, Y., Wang, W., Onizuka, M.: Monotonic cardinality estimation of similarity selection: a deep learning approach. In: SIGMOD, pp. 1197–1212 (2020)

52.

Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. VLDB 98, 194–205 (1998)

53.

Wei, C., Wu, B., Wang, S., Lou, R., Zhan, C., Li, F., Cai, Y.: Analyticdb-v: a hybrid analytical engine towards query fusion for structured and unstructured data. VLDB 13, 3152–3165 (2020)

54.

Xie, D., Li, F., Yao, B., Li, G., Zhou, L., Guo, M.: Simba: efficient in-memory spatial analytics. In: SIGMOD, pp. 1071–1085 (2016)

55.

Yang, W., Li, T., Fang, G., Wei, H.: Pase: postgresql ultra-high-dimensional approximate nearest neighbor search extension. In: SIGMOD, pp. 2241–2253 (2020)

56.

Yao, B., Li, F., Kumar, P.: \(K\) nearest neighbor queries and \(k\)NN-joins in large relational databases (almost) for free. In: IEEE International Conference on Data Engineering (ICDE), pp. 4–15 (2010)

57.

Yu, J., Wu, J., Sarwat, M.: Geospark: a cluster computing framework for processing large-scale spatial data. In: ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 1–4 (2015)

58.

Zhang, S., Li, X., Zong, M., Zhu, X., Cheng, D.: Learning \(k\) for \(k\)NN classification. ACM Trans. Intell. Syst. Technol. (TIST) 8, 1–19 (2017)

59.

Zhang, W., Li, D., Xu, Y., Zhang, Y.: Shuffle-efficient distributed locality sensitive hashing on spark. In: IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 766–767 (2016)

Titel: Learning-based query optimization for multi-probe approximate nearest neighbor search
verfasst von: Pengcheng Zhang
Bin Yao
Chao Gao
Bin Wu
Xiao He
Feifei Li
Yuanfei Lu
Chaoqun Zhan
Feilong Tang
Publikationsdatum: 12.09.2022
Verlag: Springer Berlin Heidelberg
Erschienen in: The VLDB Journal / Ausgabe 3/2023
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-022-00762-0

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 3/2023

Reverse spatial top-k keyword queries

Efficient kNN query for moving objects on time-dependent road networks

PARROT: pattern-based correlation exploitation in big partitioned data series

Toward maintenance of hypercores in large-scale dynamic hypergraphs

A Pareto optimal Bloom filter family with hash adaptivity

Multi-constraint shortest path using forest hop labeling