ABSTRACT
Analytical databases are continuously adapting to the underlying hardware in order to saturate all sources of parallelism. At the same time, hardware evolves in multiple directions to explore different trade-offs. The MIC architecture, one such example, strays from the mainstream CPU design by packing a larger number of simpler cores per chip, relying on SIMD instructions to fill the performance gap. Databases have been attempting to utilize the SIMD capabilities of CPUs. However, mainstream CPUs have only recently adopted wider SIMD registers and more advanced instructions, since they do not rely primarily on SIMD for efficiency. In this paper, we present novel vectorized designs and implementations of database operators, based on advanced SIMD operations, such as gathers and scatters. We study selections, hash tables, and partitioning; and combine them to build sorting and joins. Our evaluation on the MIC-based Xeon Phi co-processor as well as the latest mainstream CPUs shows that our vectorization designs are up to an order of magnitude faster than the state-of-the-art scalar and vector approaches. Also, we highlight the impact of efficient vectorization on the algorithmic design of in-memory database operators, as well as the architectural design and power efficiency of hardware, by making simple cores comparably fast to complex cores. This work is applicable to CPUs and co-processors with advanced SIMD capabilities, using either many simple cores or fewer complex cores.
- M.-C. Albutiu et al. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB, 5(10):1064--1075, June 2012. Google ScholarDigital Library
- P. Bakkum et al. Accelerating SQL database operations on a GPU with CUDA. In GPGPU, pages 94--103, 2010. Google ScholarDigital Library
- C. Balkesen et al. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In ICDE, pages 362--373, 2013. Google ScholarDigital Library
- C. Balkesen et al. Multicore, main-memory joins: Sort vs. hash revisited. PVLDB, 7(1):85--96, Sept. 2013. Google ScholarDigital Library
- S. Blanas et al. Design and evaluation of main memory hash join algorithms for multi-core CPUs. In SIGMOD, pages 37--48, 2011. Google ScholarDigital Library
- P. Boncz et al. MonetDB/X100: Hyper-pipelining query execution. In CIDR, 2005.Google Scholar
- J. Chhugani et al. Efficient implementation of sorting on multi-core SIMD CPU architecture. In VLDB, pages 1313--1324, 2008. Google ScholarDigital Library
- J. Cieslewicz et al. Automatic contention detection and amelioration for data-intensive operations. In SIGMOD, pages 483--494, 2010. Google ScholarDigital Library
- D. J. DeWitt et al. Materialization strategies in a column-oriented DBMS. In ICDE, pages 466--475, 2007.Google Scholar
- J. Hofmann et al. Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips. CoRR, arXiv:1401.7494, 2014.Google ScholarDigital Library
- H. Inoue et al. AA-sort: A new parallel sorting algorithm for multi-core SIMD processors. In PACT, pages 189--198, 2007. Google ScholarCross Ref
- S. Jha et al. Improving main memory hash joins on Intel Xeon Phi processors: An experimental approach. PVLDB, 8(6):642--653, Feb. 2015. Google ScholarDigital Library
- T. Kaldewey et al. GPU join processing revisited. In DaMoN, 2012. Google ScholarDigital Library
- C. Kim et al. Sort vs. hash revisited: fast join implementation on modern multicore CPUs. In VLDB, pages 1378--1389, 2009. Google ScholarDigital Library
- C. Kim et al. Fast: fast architecture sensitive tree search on modern CPUs and GPUs. In SIGMOD, pages 339--350, 2010. Google ScholarDigital Library
- K. Krikellas et al. Generating code for holistic query evaluation. In ICDE, pages 613--624, 2010.Google ScholarCross Ref
- S. Larsen et al. Exploiting superword level parallelism with multimedia instruction sets. In PLDI, pages 145--156, 2000. Google ScholarDigital Library
- V. Leis et al. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In SIGMOD, pages 743--754, 2014. Google ScholarDigital Library
- S. Manegold et al. Optimizing database architecture for the new bottleneck: Memory access. J. VLDB, 9(3):231--246, Dec. 2000. Google ScholarDigital Library
- S. Manegold et al. What happens during a join? dissecting CPU and memory optimization effects. In VLDB, pages 339--350, 2000. Google ScholarDigital Library
- S. Manegold et al. Cache-conscious radix-decluster projections. In VLDB, pages 684--695, 2004. Google ScholarDigital Library
- T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, June 2011. Google ScholarDigital Library
- R. Pagh et al. Cuckoo hashing. J. Algorithms, 51(2):122--144, May 2004. Google ScholarDigital Library
- H. Pirk et al. Accelerating foreign-key joins using asymmetric memory channels. In ADMS, 2011.Google Scholar
- O. Polychroniou et al. High throughput heavy hitter aggregation for modern SIMD processors. In DaMoN, 2013. Google ScholarDigital Library
- O. Polychroniou et al. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In SIGMOD, pages 755--766, 2014. Google ScholarDigital Library
- O. Polychroniou et al. Vectorized Bloom filters for advanced SIMD processors. In DaMoN, 2014. Google ScholarDigital Library
- V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11):1080--1091, Aug. 2013. Google ScholarDigital Library
- K. A. Ross. Selection conditions in main memory. ACM Trans. Database Systems, 29(1):132--161, Mar. 2004. Google ScholarDigital Library
- K. A. Ross. Efficient hash probes on modern processors. In ICDE, pages 1297--1301, 2007.Google ScholarCross Ref
- N. Satish et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD, pages 351--362, 2010. Google ScholarDigital Library
- B. Schlegel et al. Scalable frequent itemset mining on many-core processors. In DaMoN, 2013. Google ScholarDigital Library
- L. Seiler et al. Larrabee: A many-core x86 architecture for visual computing. ACM Trans. Graphics, 27(3), Aug. 2008. Google ScholarDigital Library
- J. Shin. Introducing control flow into vectorized code. In PACT, pages 280--291, 2007. Google ScholarDigital Library
- L. Sidirourgos et al. Column imprints: A secondary index structure. In SIGMOD, pages 893--904, 2013. Google ScholarDigital Library
- E. A. Sitaridi et al. Optimizing select conditions on GPUs. In DaMoN, 2013. Google ScholarDigital Library
- M. Stonebraker et al. C-store: A column-oriented DBMS. In VLDB, pages 553--564, 2005. Google ScholarDigital Library
- J. Wassenberg et al. Engineering a multi core radix sort. In EuroPar, pages 160--169, 2011. Google ScholarDigital Library
- T. Willhalm et al. SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. PVLDB, 2(1):385--394, Aug. 2009. Google ScholarDigital Library
- Y. Ye et al. Scalable aggregation on multicore processors. In DaMoN, 2011. Google ScholarDigital Library
- J. Zhou et al. Implementing database operations using SIMD instructions. In SIGMOD, pages 145--156, 2002. Google ScholarDigital Library
- M. Zukowski et al. Architecture-conscious hashing. In DaMoN, 2006. Google ScholarDigital Library
Index Terms
- Rethinking SIMD Vectorization for In-Memory Databases
Recommendations
Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and ManycoresAchieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...
Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and ManycoresAchieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...
Vectorizing unstructured mesh computations for many-core architectures
Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and ...
Comments