Skip to main content
Erschienen in: The VLDB Journal 3/2013

01.06.2013 | Regular Paper

Similarity queries: their conceptual evaluation, transformations, and processing

verfasst von: Yasin N. Silva, Walid G. Aref, Per-Ake Larson, Spencer S. Pearson, Mohamed H. Ali

Erschienen in: The VLDB Journal | Ausgabe 3/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Many application scenarios can significantly benefit from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, for example join and selection, to be aware of data similarities, there has not been much study on the role and implementation of similarity-aware operations as first-class database operators. Furthermore, very little work has addressed the problem of evaluating and optimizing queries that combine several similarity operations. The focus of this paper is the study of similarity queries that contain one or multiple first-class similarity database operators such as Similarity Selection, Similarity Join, and Similarity Group-by. Particularly, we analyze the implementation techniques of several similarity operators, introduce a consistent and comprehensive conceptual evaluation model for similarity queries, and present a rich set of transformation rules to extend cost-based query optimization to the case of similarity queries.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
1.
Zurück zum Zitat Silva, Y.N., Aref, W.G, Ali, M.H.: Similarity group-by. In: Proceedings of the 2009 IEEE International Conference on Data, Engineering, 2009 Silva, Y.N., Aref, W.G, Ali, M.H.: Similarity group-by. In: Proceedings of the 2009 IEEE International Conference on Data, Engineering, 2009
2.
Zurück zum Zitat Silva, Y.N., Aref, W.G., Ali, M.H.: The similarity join database operator. In: Proceedings of the 2010 IEEE International Conference on Data, Engineering, 2010 Silva, Y.N., Aref, W.G., Ali, M.H.: The similarity join database operator. In: Proceedings of the 2010 IEEE International Conference on Data, Engineering, 2010
3.
Zurück zum Zitat Silva, Y.N., Arshad, M.U., Aref, W.G.: Exploiting similarity-aware grouping in decision support systems. In: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, 2009 Silva, Y.N., Arshad, M.U., Aref, W.G.: Exploiting similarity-aware grouping in decision support systems. In: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, 2009
4.
Zurück zum Zitat Silva, Y.N., Aly, A.M., Aref, W.G., Larson, P.-A.: Simdb: a similarity-aware database system. In: Proceedings of the 2010 International Conference on Management of data, 2010 Silva, Y.N., Aly, A.M., Aref, W.G., Larson, P.-A.: Simdb: a similarity-aware database system. In: Proceedings of the 2010 International Conference on Management of data, 2010
5.
Zurück zum Zitat Guha, S., Rastogi, R., Shim, K.: Cure: an efficient clustering algorithm for large databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of data, 1998 Guha, S., Rastogi, R., Shim, K.: Cure: an efficient clustering algorithm for large databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of data, 1998
6.
Zurück zum Zitat Zhang, T., Ramakrishnan, R., Livny, M.: Birch: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of data, 1996 Zhang, T., Ramakrishnan, R., Livny, M.: Birch: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of data, 1996
7.
Zurück zum Zitat Zhang, C., Huang, Y.: Cluster by: a new sql extension for spatial data aggregation. In: Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic, Information Systems, 2007 Zhang, C., Huang, Y.: Cluster by: a new sql extension for spatial data aggregation. In: Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic, Information Systems, 2007
8.
Zurück zum Zitat Li, C., Wang, M., Lim, L., Wang, H., Chang, K.C.-C.: Supporting ranking and clustering as generalized order-by and group-by. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of data, 2007 Li, C., Wang, M., Lim, L., Wang, H., Chang, K.C.-C.: Supporting ranking and clustering as generalized order-by and group-by. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of data, 2007
9.
Zurück zum Zitat Schallehn, E., Sattler, K.-u., Saake, G.: Extensible grouping and aggregation for data reconciliation. In: In Proceedings of 4th International Workshop on Engineering Federated, Information Systems, EFIS01, 2001 Schallehn, E., Sattler, K.-u., Saake, G.: Extensible grouping and aggregation for data reconciliation. In: In Proceedings of 4th International Workshop on Engineering Federated, Information Systems, EFIS01, 2001
10.
Zurück zum Zitat Schallehn, E., Sattler, K.-U., Saake, G.: Efficient similarity-based operations for data integration. Data Knowl. Eng. 48, 361–387 (2004)CrossRef Schallehn, E., Sattler, K.-U., Saake, G.: Efficient similarity-based operations for data integration. Data Knowl. Eng. 48, 361–387 (2004)CrossRef
11.
Zurück zum Zitat Jacox, E.H., Samet, H.: Metric space similarity joins. ACM Trans. Datab. Syst. 33, 7:1–7:38 (2008) Jacox, E.H., Samet, H.: Metric space similarity joins. ACM Trans. Datab. Syst. 33, 7:1–7:38 (2008)
12.
Zurück zum Zitat Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of data, 1998 Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of data, 1998
13.
Zurück zum Zitat Böhm, C., Krebs, F.: The k-nearest neighbour join: turbo charging the kdd process. Knowl. Inf. Syst. 6, 728–749 (2004)CrossRef Böhm, C., Krebs, F.: The k-nearest neighbour join: turbo charging the kdd process. Knowl. Inf. Syst. 6, 728–749 (2004)CrossRef
14.
Zurück zum Zitat Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data, Engineering, 2006 Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data, Engineering, 2006
15.
Zurück zum Zitat Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: Proceedings of the 27th International Conference on Very Large Data, Bases, 2001 Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: Proceedings of the 27th International Conference on Very Large Data, Bases, 2001
16.
Zurück zum Zitat Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast indexes and algorithms for set similarity selection queries. In: Proceedings of the 2008 IEEE 24th International Conference on Data, Engineering, 2008 Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast indexes and algorithms for set similarity selection queries. In: Proceedings of the 2008 IEEE 24th International Conference on Data, Engineering, 2008
17.
Zurück zum Zitat Yang, X., Wang, B., Li, C.: Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of data, 2008 Yang, X., Wang, B., Li, C.: Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of data, 2008
18.
Zurück zum Zitat Wichterich, M., Assent, I., Kranen, P., Seidl, T.: Efficient emd-based similarity search in multimedia databases via flexible dimensionality reduction. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of data, 2008 Wichterich, M., Assent, I., Kranen, P., Seidl, T.: Efficient emd-based similarity search in multimedia databases via flexible dimensionality reduction. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of data, 2008
19.
Zurück zum Zitat Adali, S., Bonatti, P., Sapino, M.L., Subrahmanian, V.S.: A multi-similarity algebra. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998 Adali, S., Bonatti, P., Sapino, M.L., Subrahmanian, V.S.: A multi-similarity algebra. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998
20.
Zurück zum Zitat Ferreira, M.R.P., Traina, C., Jr., Traina, A.J.M.: An efficient framework for similarity query optimization. In: Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic, Information Systems, 2007 Ferreira, M.R.P., Traina, C., Jr., Traina, A.J.M.: An efficient framework for similarity query optimization. In: Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic, Information Systems, 2007
21.
Zurück zum Zitat Traina, C. Jr., Traina, A.J.M., Vieira, M.R., Arantes, A.S., Faloutsos, C.: Efficient processing of complex similarity queries in rdbms through query rewriting, In: Proceedings of the 15th ACM International Conference on Information and, Knowledge Management, 2006 Traina, C. Jr., Traina, A.J.M., Vieira, M.R., Arantes, A.S., Faloutsos, C.: Efficient processing of complex similarity queries in rdbms through query rewriting, In: Proceedings of the 15th ACM International Conference on Information and, Knowledge Management, 2006
22.
Zurück zum Zitat Barioni, M.C.N., Razente, H., Traina, A., Traina, C. Jr.: Siren: a similarity retrieval engine for complex data. In: Proceedings of the 32nd International Conference on Very Large Data Bases, 2006 Barioni, M.C.N., Razente, H., Traina, A., Traina, C. Jr.: Siren: a similarity retrieval engine for complex data. In: Proceedings of the 32nd International Conference on Very Large Data Bases, 2006
23.
Zurück zum Zitat Baioco, G.B., Traina, A.J.M., Traina, C. Jr.: Mamcost: Global and local estimates leading to robust cost estimation of similarity queries. In: Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007 Baioco, G.B., Traina, A.J.M., Traina, C. Jr.: Mamcost: Global and local estimates leading to robust cost estimation of similarity queries. In: Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007
26.
Zurück zum Zitat Chaudhuri, S., Shim, K.: Including group-by in query optimization. In: Proceedings of the 20th International Conference on Very Large Data, Bases, 1994 Chaudhuri, S., Shim, K.: Including group-by in query optimization. In: Proceedings of the 20th International Conference on Very Large Data, Bases, 1994
27.
Zurück zum Zitat Yan, W.P., Larson, P.-A.: Eager aggregation and lazy aggregation. In: Proceedings of the 21th International Conference on Very Large Data, Bases, 1995 Yan, W.P., Larson, P.-A.: Eager aggregation and lazy aggregation. In: Proceedings of the 21th International Conference on Very Large Data, Bases, 1995
28.
Zurück zum Zitat Graefe, G.: The cascades framework for query optimization. IEEE Data Eng. Bull. 18(3), 19–29 (1995) Graefe, G.: The cascades framework for query optimization. IEEE Data Eng. Bull. 18(3), 19–29 (1995)
29.
Zurück zum Zitat Graefe, G., McKenna, W.J.: The volcano optimizer generator: Extensibility and efficient search. In: Proceedings of the Ninth International Conference on Data Engineering, pp. 209–218. IEEE Computer Society, Washington, DC (1993) Graefe, G., McKenna, W.J.: The volcano optimizer generator: Extensibility and efficient search. In: Proceedings of the Ninth International Conference on Data Engineering, pp. 209–218. IEEE Computer Society, Washington, DC (1993)
30.
Zurück zum Zitat Ciaccia, P., Patella, M., Zezula, P.: A cost model for similarity queries in metric spaces. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 59–68. ACM, New York, NY (1998) Ciaccia, P., Patella, M., Zezula, P.: A cost model for similarity queries in metric spaces. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 59–68. ACM, New York, NY (1998)
31.
Zurück zum Zitat Lee, H., Ng, R.T., Shim, K.: Similarity join size estimation using locality sensitive hashing. Proc. VLDB Endow. 4, 338–349 (2011) Lee, H., Ng, R.T., Shim, K.: Similarity join size estimation using locality sensitive hashing. Proc. VLDB Endow. 4, 338–349 (2011)
Metadaten
Titel
Similarity queries: their conceptual evaluation, transformations, and processing
verfasst von
Yasin N. Silva
Walid G. Aref
Per-Ake Larson
Spencer S. Pearson
Mohamed H. Ali
Publikationsdatum
01.06.2013
Verlag
Springer-Verlag
Erschienen in
The VLDB Journal / Ausgabe 3/2013
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-012-0296-4

Weitere Artikel der Ausgabe 3/2013

The VLDB Journal 3/2013 Zur Ausgabe