Skip to main content
Erschienen in: Distributed and Parallel Databases 2/2020

11.04.2019

High-dimensional similarity searches using query driven dynamic quantization and distributed indexing

verfasst von: Gheorghi Guzun, Guadalupe Canahuate

Erschienen in: Distributed and Parallel Databases | Ausgabe 2/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions, and the number of compute nodes.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Aggarwal, C.C., Yu, P.S.: The igrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 119–129 (2000) Aggarwal, C.C., Yu, P.S.: The igrid index: reversing the dimensionality curse for similarity indexing in high dimensional space. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 119–129 (2000)
2.
Zurück zum Zitat Amato, G., Esuli, A., Falchi, F.: A comparison of pivot selection techniques for permutation-based indexing. Inf. Syst. 52, 176–188 (2015)CrossRef Amato, G., Esuli, A., Falchi, F.: A comparison of pivot selection techniques for permutation-based indexing. Inf. Syst. 52, 176–188 (2015)CrossRef
3.
Zurück zum Zitat Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5, 4308 (2014)CrossRef Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5, 4308 (2014)CrossRef
4.
Zurück zum Zitat Barrios, J.M., Bustos, B., Skopal, T.: Analyzing and dynamically indexing the query set. Inf. Syst. 45, 37–47 (2014)CrossRef Barrios, J.M., Bustos, B., Skopal, T.: Analyzing and dynamically indexing the query set. Inf. Syst. 45, 37–47 (2014)CrossRef
5.
Zurück zum Zitat Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful? In: International Conference on Database Theory, Springer, New York, pp. 217–235 (1999) Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful? In: International Conference on Database Theory, Springer, New York, pp. 217–235 (1999)
7.
Zurück zum Zitat Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pp. 1–8 (2008) Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pp. 1–8 (2008)
8.
Zurück zum Zitat Chambi, S., Lemire, D., Kaser, O., Godin, R.: Better bitmap performance with roaring bitmaps (2014). arXiv:1402.6407 Chambi, S., Lemire, D., Kaser, O., Godin, R.: Better bitmap performance with roaring bitmaps (2014). arXiv:1402.6407
9.
Zurück zum Zitat Chen, L., Gao, Y., Zheng, B., Jensen, C.S., Yang, H., Yang, K.: Pivot-based metric indexing. Proc. VLDB Endow. 10(10), 1058–1069 (2017)CrossRef Chen, L., Gao, Y., Zheng, B., Jensen, C.S., Yang, H., Yang, K.: Pivot-based metric indexing. Proc. VLDB Endow. 10(10), 1058–1069 (2017)CrossRef
10.
Zurück zum Zitat Donoho, D.L.: High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math. Chall. Lect. 1, 32 (2000) Donoho, D.L.: High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math. Chall. Lect. 1, 32 (2000)
11.
Zurück zum Zitat Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96, 226–231 (1996) Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96, 226–231 (1996)
12.
Zurück zum Zitat Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Joint Conference on Artificial Intelligence, pp. 1022–1027 (1993) Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Joint Conference on Artificial Intelligence, pp. 1022–1027 (1993)
13.
Zurück zum Zitat Gao, J., Jagadish, H.V., Lu, W., Ooi, B.C.: Dsh: data sensitive hashing for high-dimensional k-nn search. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ACM, pp. 1127–1138 (2014) Gao, J., Jagadish, H.V., Lu, W., Ooi, B.C.: Dsh: data sensitive hashing for high-dimensional k-nn search. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ACM, pp. 1127–1138 (2014)
14.
Zurück zum Zitat García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)CrossRef García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)CrossRef
15.
Zurück zum Zitat Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., pp. 518–529 (1999) Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., pp. 518–529 (1999)
16.
Zurück zum Zitat Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (New York, NY, USA), SIGMOD ’84, ACM, pp. 47–57 (1984) Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (New York, NY, USA), SIGMOD ’84, ACM, pp. 47–57 (1984)
17.
Zurück zum Zitat Guzun, G., Canahuate, G.: Hybrid query optimization for hard-to-compress bit-vectors. VLDB J 25, 339–354 (2015)CrossRef Guzun, G., Canahuate, G.: Hybrid query optimization for hard-to-compress bit-vectors. VLDB J 25, 339–354 (2015)CrossRef
18.
Zurück zum Zitat Guzun, G., Canahuate, G.: Supporting dynamic quantization for high-dimensional data analytics. In: Proceedings of the ExploreDB’17 (New York, NY, USA), ExploreDB ’17, ACM, pp. 6:1–6:6 (2017) Guzun, G., Canahuate, G.: Supporting dynamic quantization for high-dimensional data analytics. In: Proceedings of the ExploreDB’17 (New York, NY, USA), ExploreDB ’17, ACM, pp. 6:1–6:6 (2017)
19.
Zurück zum Zitat Guzun, G., Canahuate, G.: Distributed query-aware quantization for high-dimensional similarity searches. In: Advances in Database Technology: Proceedings. International Conference on Extending Database Technology, vol. 2018, pp. 373–384 (2018) Guzun, G., Canahuate, G.: Distributed query-aware quantization for high-dimensional similarity searches. In: Advances in Database Technology: Proceedings. International Conference on Extending Database Technology, vol. 2018, pp. 373–384 (2018)
20.
Zurück zum Zitat Guzun, G., Canahuate, G., Chiu, D., Sawin, J.: A tunable compression framework for bitmap indices. In: IEEE 30th International Conference on Data Engineering (ICDE), IEEE, 2014, pp. 484–495 (2014) Guzun, G., Canahuate, G., Chiu, D., Sawin, J.: A tunable compression framework for bitmap indices. In: IEEE 30th International Conference on Data Engineering (ICDE), IEEE, 2014, pp. 484–495 (2014)
21.
Zurück zum Zitat Guzun, G., Tosado, J., Canahuate, G.: Slicing the dimensionality: top-k query processing for high-dimensional spaces. In: Transactions on Large-Scale Data-and Knowledge-Centered Systems XIV. Springer, New York, pp. 26–50 (2014) Guzun, G., Tosado, J., Canahuate, G.: Slicing the dimensionality: top-k query processing for high-dimensional spaces. In: Transactions on Large-Scale Data-and Knowledge-Centered Systems XIV. Springer, New York, pp. 26–50 (2014)
22.
Zurück zum Zitat Guzun, G., Canahuate, G., Chiu, D.: A two-phase Mapreduce algorithm for scalable preference queries over high-dimensional data. In: Proceedings of the 20th International Database Engineering & Applications Symposium, IDEAS (2016) Guzun, G., Canahuate, G., Chiu, D.: A two-phase Mapreduce algorithm for scalable preference queries over high-dimensional data. In: Proceedings of the 20th International Database Engineering & Applications Symposium, IDEAS (2016)
23.
Zurück zum Zitat Guzun, G., McClurg, J.C., Canahuate, G., Mudumbai, R.: Power efficient big data analytics algorithms through low-level operations. In: 2016 IEEE International Conference on Big Data (Big Data), IEEE, pp. 355–361 (2016) Guzun, G., McClurg, J.C., Canahuate, G., Mudumbai, R.: Power efficient big data analytics algorithms through low-level operations. In: 2016 IEEE International Conference on Big Data (Big Data), IEEE, pp. 355–361 (2016)
24.
25.
Zurück zum Zitat Huang, Q., Feng, J., Zhang, Y., Fang, Q., Ng, W.: Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proc. VLDB Endow. 9(1), 1–12 (2015)CrossRef Huang, Q., Feng, J., Zhang, Y., Fang, Q., Ng, W.: Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proc. VLDB Endow. 9(1), 1–12 (2015)CrossRef
26.
Zurück zum Zitat Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)CrossRef Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)CrossRef
27.
Zurück zum Zitat Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus (2017). arXiv:1702.08734 Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus (2017). arXiv:1702.08734
28.
Zurück zum Zitat Kamel, I., Faloutsos, C.: Hilbert r-tree: an improved r-tree using fractals. In: Proceedings of the 20th International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’94, Morgan Kaufmann Publishers Inc., pp. 500–509 (1994) Kamel, I., Faloutsos, C.: Hilbert r-tree: an improved r-tree using fractals. In: Proceedings of the 20th International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’94, Morgan Kaufmann Publishers Inc., pp. 500–509 (1994)
29.
Zurück zum Zitat Katayama, N., Satoh, S.: The sr-tree: an index structure for high-dimensional nearest neighbor queries. SIGMOD Rec. 26(2), 369–380 (1997)CrossRef Katayama, N., Satoh, S.: The sr-tree: an index structure for high-dimensional nearest neighbor queries. SIGMOD Rec. 26(2), 369–380 (1997)CrossRef
30.
Zurück zum Zitat Kerber, R.: Chimerge: discretization of numeric attributes. In: Proceedings of the 10th National Conference on Artificial Intelligence. San Jose, CA, July 12–16, pp. 123–128 (1992) Kerber, R.: Chimerge: discretization of numeric attributes. In: Proceedings of the 10th National Conference on Artificial Intelligence. San Jose, CA, July 12–16, pp. 123–128 (1992)
31.
Zurück zum Zitat Kurgan, L.A., Cios, K.J.: Caim discretization algorithm. IEEE Trans. Knowl. Data Eng. 16(2), 145–153 (2004)CrossRef Kurgan, L.A., Cios, K.J.: Caim discretization algorithm. IEEE Trans. Knowl. Data Eng. 16(2), 145–153 (2004)CrossRef
32.
Zurück zum Zitat Lemire, D., Kaser, O., Gutarra, E.: Reordering rows for better compression: beyond the lexicographic order. ACM Trans. Datab. Syst. 37(3), 20:1–20:29 (2012) Lemire, D., Kaser, O., Gutarra, E.: Reordering rows for better compression: beyond the lexicographic order. ACM Trans. Datab. Syst. 37(3), 20:1–20:29 (2012)
33.
Zurück zum Zitat Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)CrossRef Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)CrossRef
34.
Zurück zum Zitat Liu, F., Lee, H.J.: Use of social network information to enhance collaborative filtering performance. Expert Syst. Appl. 37(7), 4772–4778 (2010)CrossRef Liu, F., Lee, H.J.: Use of social network information to enhance collaborative filtering performance. Expert Syst. Appl. 37(7), 4772–4778 (2010)CrossRef
35.
Zurück zum Zitat O’Neil, P., Quass, D.: Improved query performance with variant indexes. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of data, ACM Press, pp. 38–49 (1997) O’Neil, P., Quass, D.: Improved query performance with variant indexes. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of data, ACM Press, pp. 38–49 (1997)
36.
Zurück zum Zitat Pareto, V.: Manual of Political Economy. Augustus M. Kelley Publishers, New York (1906) Pareto, V.: Manual of Political Economy. Augustus M. Kelley Publishers, New York (1906)
37.
Zurück zum Zitat Phung, S.L., Bouzerdoum, A., Chai, D.: Skin segmentation using color pixel classification: analysis and comparison. IEEE Trans. Pattern Anal. Mach. Intell. 27(1), 148–154 (2005)CrossRef Phung, S.L., Bouzerdoum, A., Chai, D.: Skin segmentation using color pixel classification: analysis and comparison. IEEE Trans. Pattern Anal. Mach. Intell. 27(1), 148–154 (2005)CrossRef
38.
Zurück zum Zitat Rinfret, D.: Answering preference queries with bit-sliced index arithmetic. In: Proceedings of the 2008 C3S2E Conference, ACM, pp. 173–185 (2008) Rinfret, D.: Answering preference queries with bit-sliced index arithmetic. In: Proceedings of the 2008 C3S2E Conference, ACM, pp. 173–185 (2008)
39.
Zurück zum Zitat Rinfret, D., O’Neil, P., O’Neil, E.: Bit-sliced index arithmetic. In: ACM SIGMOD Record, vol. 30, ACM, pp. 47–57 (2001) Rinfret, D., O’Neil, P., O’Neil, E.: Bit-sliced index arithmetic. In: ACM SIGMOD Record, vol. 30, ACM, pp. 47–57 (2001)
40.
Zurück zum Zitat Samet, H.: The Design and Analysis of Spatial Data Structures, vol. 199. Addison-Wesley, Reading, MA (1990) Samet, H.: The Design and Analysis of Spatial Data Structures, vol. 199. Addison-Wesley, Reading, MA (1990)
41.
Zurück zum Zitat Shy Goh, K., Li, B., Chang, E.: Dyndex: a dynamic and non-metric space indexer. In: In ACM Multimedia, pp. 466–475 (2002) Shy Goh, K., Li, B., Chang, E.: Dyndex: a dynamic and non-metric space indexer. In: In ACM Multimedia, pp. 466–475 (2002)
42.
Zurück zum Zitat Tsai, C., Lee, C., Yang, W.: A discretization algorithm based on class-attribute contingency coefficient. Inf. Sci. 178(3), 714–731 (2008)CrossRef Tsai, C., Lee, C., Yang, W.: A discretization algorithm based on class-attribute contingency coefficient. Inf. Sci. 178(3), 714–731 (2008)CrossRef
43.
Zurück zum Zitat Tung, A., K.H., Zhang, R., Koudas, N., Ooi, B.C.: Similarity search: a matching based approach. In: VLDB’2006: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB Endowment, pp. 631–642 (2006) Tung, A., K.H., Zhang, R., Koudas, N., Ooi, B.C.: Similarity search: a matching based approach. In: VLDB’2006: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB Endowment, pp. 631–642 (2006)
44.
Zurück zum Zitat Weber, R.: Parallel va-file. In: Proceeding of ECDL, Springer, New York, pp. 83–92 (2000) Weber, R.: Parallel va-file. In: Proceeding of ECDL, Springer, New York, pp. 83–92 (2000)
45.
Zurück zum Zitat Weber, R., Blott, S.: An approximation based data structure for similarity search. Technical Report (1997) Weber, R., Blott, S.: An approximation based data structure for similarity search. Technical Report (1997)
46.
Zurück zum Zitat Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the 24rd International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’98, Morgan Kaufmann Publishers Inc., pp. 194–205 (1998) Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the 24rd International Conference on Very Large Data Bases (San Francisco, CA, USA), VLDB ’98, Morgan Kaufmann Publishers Inc., pp. 194–205 (1998)
47.
Zurück zum Zitat Wu, K., Otoo, E.J., Shoshani, A., Nordberg, H.: Notes on design and implementation of compressed bit vectors. Technical Report LBNL/PUB-3161, Lawrence Berkeley National Laboratory (2001) Wu, K., Otoo, E.J., Shoshani, A., Nordberg, H.: Notes on design and implementation of compressed bit vectors. Technical Report LBNL/PUB-3161, Lawrence Berkeley National Laboratory (2001)
48.
Zurück zum Zitat Wu, K., Otoo, E.J., Shoshani, A.: Compressing bitmap indexes for faster search operations. In: Proceedings of the 2002 International Conference on Scientific and Statistical Database Management Conference (SSDBM’02), pp. 99–108 (2002) Wu, K., Otoo, E.J., Shoshani, A.: Compressing bitmap indexes for faster search operations. In: Proceedings of the 2002 International Conference on Scientific and Statistical Database Management Conference (SSDBM’02), pp. 99–108 (2002)
49.
Zurück zum Zitat Zipf, G.K.: Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Ravenio Books, Cambridge (2016) Zipf, G.K.: Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Ravenio Books, Cambridge (2016)
Metadaten
Titel
High-dimensional similarity searches using query driven dynamic quantization and distributed indexing
verfasst von
Gheorghi Guzun
Guadalupe Canahuate
Publikationsdatum
11.04.2019
Verlag
Springer US
Erschienen in
Distributed and Parallel Databases / Ausgabe 2/2020
Print ISSN: 0926-8782
Elektronische ISSN: 1573-7578
DOI
https://doi.org/10.1007/s10619-019-07266-x

Weitere Artikel der Ausgabe 2/2020

Distributed and Parallel Databases 2/2020 Zur Ausgabe