Abstract
We present new hash tables for joins, and a hash join based on them, that consumes far less memory and is usually faster than recently published in-memory joins. Our hash join is not restricted to outer tables that fit wholly in memory. Key to this hash join is a new concise hash table (CHT), a linear probing hash table that has 100% fill factor, and uses a sparse bitmap with embedded population counts to almost entirely avoid collisions. This bitmap also serves as a Bloom filter for use in multi-table joins.
We study the random access characteristics of hash joins, and renew the case for non-partitioned hash joins. We introduce a variant of partitioned joins in which only the build is partitioned, but the probe is not, as this is more efficient for large outer tables than traditional partitioned joins. This also avoids partitioning costs during the probe, while at the same time allowing parallel build without latching overheads. Additionally, we present a variant of CHT, called a concise array table (CAT), that can be used when the key domain is moderately dense. CAT is collision-free and avoids storing join keys in the hash table.
We perform a detailed comparison of CHT and CAT against leading in-memory hash joins. Our experiments show that we can reduce the memory usage by one to three orders of magnitude, while also being competitive in performance.
- Full experimental results. http://researcher.ibm.com/person/us-ravijay.Google Scholar
- Understanding hash joins. Microsoft TechNet SQL Server.Google Scholar
- D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. R. Madden. Materialization strategies in a column-oriented DBMS. In ICDE, 2007.Google ScholarCross Ref
- M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 5(10), 2012. Google ScholarDigital Library
- C. Balkesen, G. Alonso, J. Teubner, and T. Özsu. Multi-core, main-memory joins: Sort vs. hash revisited. PVLDB, 2014. Google ScholarDigital Library
- C. Balkesen, J. Teubner, G. Alonso, and T. Özsu. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In ICDE, 2013. Google ScholarDigital Library
- S. K. Begley, Z. He, and Y.-P. P. Chen. Mcjoin: A memory-constrained join for column-store main-memory databases. In SIGMOD, 2012. Google ScholarDigital Library
- S. Blanas, Y. Li, and J. M. Patel. Design and evaluation of main memory hash join algorithms for multi-core cpus. In SIGMOD, 2011. Google ScholarDigital Library
- S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving hash join performance through prefetching. ACM Trans. Database Syst., 32(3), 2007. Google ScholarDigital Library
- D. J. DeWitt et al. Implementation techniques for main memory database systems. In SIGMOD, 1984. Google ScholarDigital Library
- C. Kim et al. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. PLVDB, 2(2), 2009. Google ScholarDigital Library
- M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. New Generation Computing, 1(1), 1983.Google Scholar
- H. Lang et al. Massively parallel NUMA-aware hash joins. In IMDM, 2013.Google Scholar
- V. Leis et al. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In SIGMOD, 2014. Google ScholarDigital Library
- Y. Li et al. NUMA-aware algorithms: the case of data shuffling. In CIDR, 2013.Google Scholar
- L. F. Mackert and G. M. Lohman. R* Optimizer Validation and Performance Evaluation for Distributed Queries. In VLDB, 1986. Google ScholarDigital Library
- S. Manegold, P. Boncz, and M. Kersten. Optimizing database architecture for the new bottleneck: memory access. VLDB Journal, 9(3), 2000. Google ScholarDigital Library
- S. Manegold, P. A. Boncz, and M. L. Kersten. What happens during a join? Dissecting CPU and memory optimization effects. In VLDB, 2000. Google ScholarDigital Library
- R. Pagh and F. F. Rodler. Cuckoo hashing. In ESA, 2001. Google ScholarDigital Library
- V. Raman et al. DB2 with BLU Acceleration: So much more than just a column store. PVLDB, 6, 2013. Google ScholarDigital Library
- K. A. Ross. Efficient hash probes on modern processors. In ICDE, 2007.Google ScholarCross Ref
Index Terms
- Memory-efficient hash joins
Recommendations
Double-Block-Length Hash Function for Minimum Memory Size
Advances in Cryptology – ASIACRYPT 2021AbstractSharing a common primitive for multiple functionalities is essential for lightweight cryptography, and NIST’s lightweight cryptography competition (LWC) considers the integration of hashing to AEAD. While permutations are natural primitive choices ...
Efficient joins with compressed bitmap indexes
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge managementWe present a new class of adaptive algorithms that use compressed bitmap indexes to speed up evaluation of the range join query in relational databases. We determine the best strategy to process a join query based on a fast sub-linear time computation ...
Applying Segmented Right-Deep Trees to Pipelining Multiple Hash Joins
The pipelined execution of multijoin queries in a multiprocessor-based database system is explored in this paper. Using hash-based joins, multiple joins can be pipelined so that the early results from a join, before the whole join is completed, are sent ...
Comments