research-article

Memory-efficient hash joins

Authors:
R. Barber

IBM Research, Almaden

IBM Research, Almaden
View Profile

,
G. Lohman

IBM Research, Almaden

IBM Research, Almaden
View Profile

,
I. Pandis

IBM Research, Almaden

IBM Research, Almaden
View Profile

,
V. Raman

IBM Research, Almaden

IBM Research, Almaden
View Profile

,
R. Sidle

IBM Research, Almaden

IBM Research, Almaden
View Profile

,
G. Attaluri

IBM Software Group

IBM Software Group
View Profile

,
N. Chainani

IBM Software Group

IBM Software Group
View Profile

,
S. Lightstone

IBM Software Group

IBM Software Group
View Profile

,
D. Sharpe

IBM Software Group

IBM Software Group
View Profile

Proceedings of the VLDB Endowment Volume 8 Issue 4pp 353–364https://doi.org/10.14778/2735496.2735499

Published:01 December 2014Publication History

Proceedings of the VLDB Endowment

Abstract

We present new hash tables for joins, and a hash join based on them, that consumes far less memory and is usually faster than recently published in-memory joins. Our hash join is not restricted to outer tables that fit wholly in memory. Key to this hash join is a new concise hash table (CHT), a linear probing hash table that has 100% fill factor, and uses a sparse bitmap with embedded population counts to almost entirely avoid collisions. This bitmap also serves as a Bloom filter for use in multi-table joins.

We study the random access characteristics of hash joins, and renew the case for non-partitioned hash joins. We introduce a variant of partitioned joins in which only the build is partitioned, but the probe is not, as this is more efficient for large outer tables than traditional partitioned joins. This also avoids partitioning costs during the probe, while at the same time allowing parallel build without latching overheads. Additionally, we present a variant of CHT, called a concise array table (CAT), that can be used when the key domain is moderately dense. CAT is collision-free and avoids storing join keys in the hash table.

We perform a detailed comparison of CHT and CAT against leading in-memory hash joins. Our experiments show that we can reduce the memory usage by one to three orders of magnitude, while also being competitive in performance.

References

Full experimental results. http://researcher.ibm.com/person/us-ravijay.Google Scholar
Understanding hash joins. Microsoft TechNet SQL Server.Google Scholar
D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. R. Madden. Materialization strategies in a column-oriented DBMS. In ICDE, 2007.Google ScholarCross Ref
M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 5(10), 2012. Google ScholarDigital Library
C. Balkesen, G. Alonso, J. Teubner, and T. Özsu. Multi-core, main-memory joins: Sort vs. hash revisited. PVLDB, 2014. Google ScholarDigital Library
C. Balkesen, J. Teubner, G. Alonso, and T. Özsu. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In ICDE, 2013. Google ScholarDigital Library
S. K. Begley, Z. He, and Y.-P. P. Chen. Mcjoin: A memory-constrained join for column-store main-memory databases. In SIGMOD, 2012. Google ScholarDigital Library
S. Blanas, Y. Li, and J. M. Patel. Design and evaluation of main memory hash join algorithms for multi-core cpus. In SIGMOD, 2011. Google ScholarDigital Library
S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving hash join performance through prefetching. ACM Trans. Database Syst., 32(3), 2007. Google ScholarDigital Library
D. J. DeWitt et al. Implementation techniques for main memory database systems. In SIGMOD, 1984. Google ScholarDigital Library
C. Kim et al. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. PLVDB, 2(2), 2009. Google ScholarDigital Library
M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. New Generation Computing, 1(1), 1983.Google Scholar
H. Lang et al. Massively parallel NUMA-aware hash joins. In IMDM, 2013.Google Scholar
V. Leis et al. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In SIGMOD, 2014. Google ScholarDigital Library
Y. Li et al. NUMA-aware algorithms: the case of data shuffling. In CIDR, 2013.Google Scholar
L. F. Mackert and G. M. Lohman. R* Optimizer Validation and Performance Evaluation for Distributed Queries. In VLDB, 1986. Google ScholarDigital Library
S. Manegold, P. Boncz, and M. Kersten. Optimizing database architecture for the new bottleneck: memory access. VLDB Journal, 9(3), 2000. Google ScholarDigital Library
S. Manegold, P. A. Boncz, and M. L. Kersten. What happens during a join? Dissecting CPU and memory optimization effects. In VLDB, 2000. Google ScholarDigital Library
R. Pagh and F. F. Rodler. Cuckoo hashing. In ESA, 2001. Google ScholarDigital Library
V. Raman et al. DB2 with BLU Acceleration: So much more than just a column store. PVLDB, 6, 2013. Google ScholarDigital Library
K. A. Ross. Efficient hash probes on modern processors. In ICDE, 2007.Google ScholarCross Ref

Index Terms

Memory-efficient hash joins
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Recommendations

Double-Block-Length Hash Function for Minimum Memory Size
Advances in Cryptology – ASIACRYPT 2021
Abstract
Sharing a common primitive for multiple functionalities is essential for lightweight cryptography, and NIST’s lightweight cryptography competition (LWC) considers the integration of hashing to AEAD. While permutations are natural primitive choices ...
Read More
Efficient joins with compressed bitmap indexes
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

We present a new class of adaptive algorithms that use compressed bitmap indexes to speed up evaluation of the range join query in relational databases. We determine the best strategy to process a join query based on a fast sub-linear time computation ...
Read More
Applying Segmented Right-Deep Trees to Pipelining Multiple Hash Joins

The pipelined execution of multijoin queries in a multiprocessor-based database system is explored in this paper. Using hash-based joins, multiple joins can be pipelined so that the early results from a join, before the whole join is completed, are sent ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 8, Issue 4
December 2014
132 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 December 2014
Published in pvldb Volume 8, Issue 4
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 46
  Total Citations
  View Citations
- 512
  Total Downloads
- Downloads (Last 12 months)84
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Memory-efficient hash joins

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Double-Block-Length Hash Function for Minimum Memory Size

Efficient joins with compressed bitmap indexes

Applying Segmented Right-Deep Trees to Pipelining Multiple Hash Joins

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Memory-efficient hash joins

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Double-Block-Length Hash Function for Minimum Memory Size

Efficient joins with compressed bitmap indexes

Applying Segmented Right-Deep Trees to Pipelining Multiple Hash Joins

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media