Efficient data management for incoherent ray tracing

https://doi.org/10.1016/j.asoc.2012.07.002Get rights and content

Abstract

To obtain good performance on the GPU hardware, it is necessary to design algorithms to manage data, access memory under GPU memory hierarchy, and schedule more efficient threads. In this paper, we propose an efficient data management and task management designed for GPU based ray tracing. Due to the dynamic and uncertainty in ray tracing, we design data-management layer and task-management layer combined with fuzzy spatial analysis, use the two-level ray sorting and a ray bucket structure to reorganize ray data, then a warp's threads can be scheduled to access coherent geometry and nodes data, reduce memory bandwidth, and dispatch the data locally. We schedule tasks in data-driven execution according to coherent data, propose an adaptive ray compaction to eliminate inactive threads, maintain task efficiency of threads in a warp, and design two heuristics to decrease the compaction cost. On the basis of it, we also introduce a memory-optimized dynamic traversal management to reduce incoherent memory access, and avoid frequent sorting computation and compaction operations. Our experiments demonstrate all of these work combined can achieve good performance.

Graphical abstract

  1. Download : Download full-size image

As shown in the figure, we introduce the bucket structure and two-level ray resorting at the data-management layer. Then, we schedule tasks in data-driven execution according to coherent data, and use adaptive ray compaction and memory-optimized dynamic traversal management at the management layer.

Highlights

► An efficient data management designed for GPU based ray tracing is proposed. ► Data-management layer and task-management layer process the uncertainty in ray tracing. ► We introduce the two-level ray sorting and a ray bucket structure to reorganize ray data. ► We schedule tasks in data-driven execution according to coherent data. ► The memory-optimized dynamic traversal management reduces incoherent memory access.

Introduction

Since the introduction of ray tracing in its classic form [1] among the two decades, exponential growth in the available compute power and a variety of algorithmic developments have been used to realize real-time ray tracing on commodity processors. Current research on real-time ray tracing focuses on optimizing primary and shadow rays. The results are encouraging, approaching similar frame rates compared with rasterization-based techniques. However, the attractive power of ray tracing is the ability to easily produce photorealistic effects by secondary rays, such as soft shadows, glossy reflections, motion blur, depth of field, and diffuse global illumination. Thus, how to generate these fuzzy effects should be a valuable topic.

Different from primary rays, which have a common origin (the camera/eye) and may traverse the same nodes and intersect the same primitives, secondary rays here mean that those rays have multiple bounces of reflections and refractions, multiple shadow rays per light source. Current interactive ray tracers try to use SIMD instructions and packets to achieve high performance. However, these techniques explicitly rely on high coherence, either SIMD or packet, to provide similar benefit with traditional ray tracers using single rays. But primary rays are just the minority of rays in high quality rendering. That is, for renderings that send multiple shadow rays per light source or compute multiple bounces of reflections or refractions, the first level of rays does not account for a large percentage of the rendering time. On the other hand, despite the secondary rays are not completely random, it is quite obvious that they do not have the same manner as coherent primary rays: two secondary rays generated from near points may intersect with objects that are far away in the scene. Therefore, efficient computation of secondary rays is a harder problem than for primary rays.

Coupled with current architecture trends, it is necessary that we may apply highly parallel algorithms to leverage throughput-oriented architectures such as GPUs and to scale well as processors become increasingly parallel. For a highly parallel algorithm such as ray tracing, efficiently exploiting many cores is very important. GPU-based ray tracers have recently achieved a performance equal and even better than that of CPU implementations, various techniques have been proposed to use GPU to accelerate ray tracing. Though some optimizations have been applied to obtain benefits, the limitations of the underlying current architecture programming model have hampered the efficiency and performance of these algorithms. We focus specifically on designing algorithms for CUDA [2] that take advantage of the massively multi-threaded design of modern NVIDIA GF110 GPUs [3]. NVIDIA GF110 architecture contains up to 512 process units, and is commonly viewed as a massively multi-threaded scalar architecture.

Because of some hardware limitations of current GPU architecture, if to obtain good performance, it is necessary to design some special algorithms to manage data, access memory based on the hierarchy of GPU memory, and schedule more efficient threads execution. To exploit the high utilization in maximum, we should follow two principles according to the features of current parallel hardware architecture, that is not very important to the algorithm implemented on CPU: First, high performance can only be obtained when all of a warp's threads mostly execute the same instruction; Secondly, the warp's threads should access local memory as much as possible, here “local” means to a partial continuous area relative to the whole memory area. However, complex lighting algorithms for photo-realistic images may contain many incoherent memory accesses and divergent execution, and the implementation of the dynamic and uncertainty in ray tracing violates these two principles, finally causes low efficiency on GPU.

As discussed above, it is necessary to reorganize the data, so that a warp's threads can access coherent data, reduce memory bandwidth usage, and dispatch the data locally. Then, we can schedule task executions according to the coherent data. In this paper, we propose an efficient data management especially designed for GPU based ray tracing according to the two optimization principles mentioned above. Our approach is especially efficient for GPU memory features, and exploits the hardware parallel computation heavily.

We provide a way to cluster and manage geometry and ray data, then threads can execute almost same instructions and access local memory in maximum. We design two-level ray resorting, first group rays by direction vectors to make them into eight buckets, then within each bucket, further to divide subsets based ray origins and acceleration structure information. Also, the tasks are distributed among CPU and GPU based on features of computation capabilities of these two devices. Specially, we propose an adaptive ray compaction to eliminate inactive threads, and maintain task efficiency of threads in a warp, and design two heuristics to decrease the compaction cost. On the basis of it, considering the dynamic and unpredictable feature of secondary rays during traversal, we introduce a memory-optimized dynamic traversal management to reduce incoherent memory access, and avoid frequent sorting computation and compaction operation. Intersection information is stored in hierarchy among ray generations. Acceleration structure and geometry data are stored on texture units to improve cache utilization. Our experiments demonstrate all of these work combined can achieve good performance.

Section 2 presents previous work in this area, while Section 3 introduces more formal definitions of our approach. Section 4 discusses our algorithm implementation, analyzes the performance of our implementation and compares it with current work. Finally, in Section 5, we offer some thoughts on future work.

Section snippets

Previous work

In this section, we give a brief overview of prior work about ray tracing.

Hardware architecture

The main computational unit on CUDA is the thread. As opposed to other GPU architectures, threads on CUDA can read and write freely GPU memory and can synchronize and communicate with each other. To enable communication and synchronization, the threads on CUDA are logically grouped in blocks. Threads in a block synchronize by setting barriers and communicate through a small high-speed low-latency on-chip memory (a.k.a. shared memory). Threads are physically processed in chunks of size 32 in

Experiments and results

To evaluate the performance of our method, in this section, we test and verify our algorithm in a variety of experiments. The experiments include comparison with single ray tracing and ray packet methods, and cover frame rate performance, hardware utilization, optimization on block size, data traffic, and GPU utilization. All experiments are done under single light source. The described algorithm has been tested on an Intel Xeon 3.7 GHz CPU with an NVIDIA GTX 580 graphics card. We use OpenGL to

Conclusions and future work

In this paper, we propose an efficient data management and task management designed for GPU based ray tracing, and our algorithm is especially for the features of current parallel hardware architecture. Due to the dynamic and uncertainty in ray tracing, we design the data-management layer and task-management layer combined with fuzzy spatial analysis, use two-level ray resorting and a ray bucket structure to reorganize ray data, then a warp's threads can be scheduled to access coherent geometry

Xin Yang is a PhD student in Zhejiang University, China and is generally interested in most aspects of computer graphics, and computer vision and image processing techniques that can be applied to graphics problems. He is particularly interested in parallel computing with a focus on real-time rendering and GPU data structure design.

References (22)

  • Z. Wu et al.

    SAH KD-tree construction on GPU

  • J. Nickolls et al.

    Scalable parallel programming with CUDA

    Queue

    (2008)
  • E. Lindholm et al.

    NVIDIA Tesla: a unified graphics and computing architecture

    IEEE Micro

    (2008)
  • I. Wald et al.

    Interactive rendering with coherent ray tracing

    Computer Graphics Forum

    (2001)
  • I. Wald et al.

    Ray tracing deformable scenes using dynamic bounding volume hierarchies

    ACM Transactions on Graphics

    (2007)
  • A. Reshetov

    Omnidirectional ray tracing traversal algorithm for kd-trees

  • A. Reshetov

    Faster ray packets-triangle intersection through vertex culling

  • M. Pharr et al.

    Rendering complex scenes with memory-coherent ray tracing

  • H. Dammertz et al.

    Shallow bounding volume hierarchies for fast SIMD ray tracing of incoherent rays

    Computer Graphics Forum

    (2008)
  • I. Wald et al.

    SIMD Ray Stream Tracing-SIMD Ray Traversal with Generalized Ray Packets and On-the-fly Re-ordering

    (2007)
  • R. Overbeck et al.

    Large ray packets for real-time whitted ray tracing

  • Cited by (5)

    Xin Yang is a PhD student in Zhejiang University, China and is generally interested in most aspects of computer graphics, and computer vision and image processing techniques that can be applied to graphics problems. He is particularly interested in parallel computing with a focus on real-time rendering and GPU data structure design.

    Duan-qing Xu is a professor in Zhejiang University, China. He is interested in broad topics in the field of computer systems, in particular computer graphics.

    Lei Zhao is an assistant professor in Zhejiang University, China. His research mainly focuses on computer graphics related, in particular those that use innovative hardware and software working together to solve challenging engineering problems.

    View full text