Abstract

This paper presents a literature survey on existing disparity map algorithms. It focuses on the four main stages of processing proposed by Scharstein and Szeliski in their 2002 taxonomy and evaluation of dense two-frame stereo correspondence algorithms. To assist future researchers in developing their own stereo matching algorithms, a summary of the existing algorithms developed for every stage of processing is also provided. The survey also notes the implementation of previous software-based and hardware-based algorithms. Generally, the main processing module for a software-based implementation uses only a central processing unit (CPU). By contrast, a hardware-based implementation requires one or more additional processors for its processing module, such as a graphics processing unit (GPU) or a field-programmable gate array (FPGA). This literature survey also presents a method of qualitative measurement that is widely used by researchers in the area of stereo vision disparity mappings.

1. Introduction

Computer vision is currently an important field of research. It includes methods such as image acquisition, processing, analysis, and understanding [1]. Computer vision techniques attempt to model a complex visual environment using various mathematical methods. One of the purposes of computer vision is to define the world that we see based on one or more images and to restructure its properties, such as its illumination, shape, and color distributions. Stereo vision is an area within the field of computer vision that addresses an important research problem: the reconstruction of the three-dimensional coordinates of points for depth estimation. A stereo vision system consists of a stereo camera, namely, two cameras placed horizontally (i.e., one on the left and the other on the right). The two images captured simultaneously by these cameras are then processed for the recovery of visual depth information [2]. The challenge is to determine the best method of approximating the differences between the views shown in the two images to map (i.e., plot) the correspondence (i.e., disparity) of the environment. Intuitively, a disparity map represents corresponding pixels that are horizontally shifted between the left image and right image. New methods and techniques for solving this problem are developed every year and exhibit a trend toward improvement in accuracy and time consumption.

Another device that is used to acquire depth information is a time-of-flight (ToF) or structured light sensor. Such a device is a type of active sensor, unlike a classic stereo vision camera. Devices of this type such as the Microsoft Kinect are cheap and have led to increased interest in computer vision applications. However, these active sensors suffer from certain characteristic problems [3]. First, they are subject to systematic errors such as noise and ambiguity, which are related to the particular sensor that is used. Second, they are subject to nonsystematic errors such as scattering and motion blur. According to the comparative analyses performed by Foix et al. [4], Kim et al. [5], and Zhang et al. [6], ToF devices perform satisfactorily only up to a maximum distance of approximately 5–7 meters and are too sensitive to be used in outdoor environments, especially in very bright areas. Because of these limitations of ToF sensors, stereo vision sensors (i.e., passive sensors) are more reliable and robust; they are able to produce high-resolution disparity maps and are suitable for both indoor and outdoor environments [7].

In stereo vision disparity map processing, the number of calculations required increases with an increasing number of pixels per image. This phenomenon causes the matching problem to be computationally complex [8]. The improvements to and reduction in computational complexity that have been achieved with recent advances in hardware technology have been beneficial for the advancement of research in the stereo vision field. Thus, the main motivation for hardware-based implementation is to achieve real time processing [9]. In real time stereo vision applications, such as autonomous driving, 3D gaming, and autonomous robotic navigation, fast but accurate depth estimations are required [10]. Additional processing hardware is therefore necessary to improve the processing speed.

An updated survey on stereo vision disparity map algorithms would be valuable to those who are interested in this research area. Figures 1(a) and 1(b) illustrate the quantity of original contributions published in this area over the past ten years (i.e., 2005–2014) from the databases of ScienceDirect and IEEE Xplore. The keywords used were stereo vision/stereo vision algorithm, and the components that were searched were the title, abstract, and keywords/index terms of the papers in the databases. All of these papers may represent contributions to fundamental algorithm development, analysis, or application of stereo vision algorithms. In both figures, the trendlines are increasing, indicating that the field of stereo vision remains active in research and development and has become an interesting and challenging area of research. This paper provides a brief introduction to the state-of-the-art developments accomplished in the context of such algorithms. This work reviews the latest published stereo vision algorithms and categorizes them into different stages of processing, which are based on the taxonomy proposed by Scharstein and Szeliski [11]. This paper also discusses two types of implementation platforms for these algorithms (i.e., software-based and hardware-based). In software-based platforms, the techniques are implemented only on a standard CPU, without any additional processing hardware. By contrast, in hardware-based platforms, the algorithms are executed on a CPU together with a GPU or an FPGA, or on such hardware as a standalone system.

The remainder of the paper is organized as follows. Previous review papers related to stereo vision disparity map algorithms are discussed in Section 2. Then, the taxonomy for the stages of processing performed in stereo vision disparity map algorithms is presented in Section 3. It consists of four subsections (i.e., matching cost computation, cost aggregation, disparity selection and optimization, and disparity refinement). Section 4 presents a review of algorithms implemented through software-based platforms, and Section 5 discusses real time stereo vision disparity map algorithms based on additional hardware (i.e., FPGAs and GPUs). A method of measuring the accuracy of stereo vision algorithms is explained in Section 6, and the conclusion is presented in Section 7.

2. Previous Reviews of Stereo Vision Disparity Map Algorithms

Numerous methods of implementation for stereo vision disparity mapping have been established in the past few years. This can be observed from the review papers listed in Table 1. The contents of these review papers are also summarized in this table. Among these review papers, the main focus was to summarize and compare the accuracy level and execution time of each cited algorithm. However, none of these reviews provided a detailed discussion of the stages of implementation based on the taxonomy proposed by Scharstein and Szeliski [11], as does the survey presented in this paper. Furthermore, this paper also reviews the latest algorithms implemented using two different types of platforms (i.e., software-based and hardware-based).

3. A Taxonomy for the Processing Stages of Stereo Vision Disparity Map Algorithms

Most stereo vision disparity map algorithms have been implemented using multistage techniques. These techniques, as codified by Scharstein and Szeliski, consist of four main steps as shown in Figure 2 [11]. In this figure, the input images are obtained from stereo vision sensors (i.e., from at least two cameras). Commonly, these cameras are arranged horizontally and set up to produce two or more corresponding images. For the explanation of the process as described by the adopted taxonomy, these input images are assumed to be rectified images. Next, the image pair to be analysed passes through all of the blocks, in sequence, beginning with Step 1 and ending with Step 4. The output of this process should be a smooth disparity map. In essence, each block represents one or more algorithms whose performance can be measured based on the expected output. This taxonomy has been widely used by many current developers of stereo vision disparity map algorithms [8, 12].

In general, stereo vision disparity map algorithms can be classified into local or global approaches. A local approach is also known as an area based or window based approach, because the disparity computation at a given point (or pixel) depends only on the intensity values within a predefined support window. Thus, such a method considers only local information and therefore has a low computational complexity and a short run time. Local methods include all four steps of the taxonomy. Examples of implementation of such methods are provided by the work of Mattoccia et al. [13], Arranz et al. [14], and Xu et al. [15]. The matching cost is aggregated via a sum or an average over the support window. The disparity map value assignment is then achieved through winner takes all (WTA) optimization: for each pixel, the disparity value with the minimum aggregated cost is assigned to that pixel.

By contrast, a global method treats disparity assignment as a problem of minimizing a global energy function for all disparity values. Such a method is formulated as an energy minimization process with two terms in the objective function (i.e., a data term, which penalizes solutions that are inconsistent with the target data, and a smoothness term, which enforces the piecewise smoothness assumption with neighboring pixels). The smoothness term is designed to retain smoothness in disparity among pixels in the same region. The disparity map is produced by assigning similar depth values to neighboring pixels. Global methods produce good results but are computationally expensive. Therefore, they are impractical for use in real time systems. Global methods typically skip Step 2 of the taxonomy depicted in Figure 2 (i.e., they do not perform cost aggregation and therefore contain only three steps) [16–18]. Markov random field (MRF) modelling is the most common approach used in global methods. This type of modelling uses an iterative framework to ensure smooth disparity maps and high similarity between matching pixels.

3.1. Matching Cost Computation

All stereo matching algorithms require a cost criterion to measure the extent of matching between two pixels. The matching cost computation is the stage that determines whether the values of two pixels correspond to the same point in a scene. Therefore, the stereo matching cost computation can be defined as a method of determining the parallax values of each point between the left and right images [19]. The matching cost is computed at each pixel for all disparities under consideration. The difference in horizontal position between a pair of matching pixels in the two images is called the disparity and can be associated with depth values through three-dimensional (3D) projection.
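The link between disparity and depth mentioned above can be sketched for the common case of a rectified pinhole stereo pair, where depth equals focal length times baseline divided by disparity. The function name, the units (focal length in pixels, baseline in meters), and the treatment of zero disparity are assumptions made for this illustration and are not taken from the surveyed works.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d.

    Assumes a pinhole model with the focal length in pixels and the
    camera baseline in meters (illustrative conventions).
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity (or no match).
        return float("inf")
    return focal_px * baseline_m / disparity_px


# A larger disparity means the point is closer to the cameras.
near = depth_from_disparity(70, focal_px=700.0, baseline_m=0.5)
far = depth_from_disparity(35, focal_px=700.0, baseline_m=0.5)
```

Because depth is inversely proportional to disparity, a one-pixel disparity error matters far more for distant points than for near ones.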

The matching points must lie on the epipolar lines, as shown in Figure 3. This matching can be performed via a one-dimensional horizontal search if the stereo pairs are accurately calibrated [20]. A target point in the scene is viewed from the optical centers of the two cameras and is projected onto one left image plane and one corresponding right image plane, one from each of the two cameras. The two projections are the matching pixels in the left and right image planes, respectively, for the same scene point. Therefore, the targeted left and right matching points must be located on the same horizontal line. In a stereo vision system, this requirement is imposed in the form of an epipolar constraint [21]. This epipolar constraint plays a significant role in stereo matching, because the search for correspondences can be limited to a line instead of the entire image space, thereby reducing the required time and search range. Early stereo vision disparity map algorithms used pixel based techniques for the matching cost computation task [12]. These algorithms are the methods of absolute differences (AD) and squared differences (SD); adaptations of the former include the methods of sampling-insensitive absolute differences and truncated absolute differences. These algorithms can be applied to grayscale or color images.

Area based or window based techniques are capable of offering richer data than matching techniques based on individual pixels or features. Such techniques can be more accurate because the matching process considers the entire set of pixels associated with image regions. Common algorithms for window based techniques include the sum of absolute differences (SAD), the sum of squared differences (SSD), normalized cross correlation (NCC), rank transforms (RT), and census transforms (CT) [12]. The matching cost is calculated over a support region. This support region, which is commonly referred to as a support or aggregating window, may be square or rectangular and may be fixed or adaptive in size. The major shortcoming of window based techniques is that they commonly assume that all pixels within a support window have similar disparity values. This is not necessarily true for pixels near depth discontinuities or edges. Hence, an improper selection of the size and shape of the matching window can lead to poor depth estimations.

3.1.1. Absolute Differences (AD)

The AD algorithm computes the difference in luminance (or intensity value) between a pixel in the left image and the corresponding pixel in the right image, as given by

$$\mathrm{AD}(x, y, d) = \left| I_l(x, y) - I_r(x - d, y) \right| \tag{1}$$

In this equation, $(x, y, d)$ represents the disparity map coordinates, where $(x, y)$ are the coordinates of the pixel of interest and $d$ is the disparity (or depth) value. Typically, in the matching process, the left image $I_l$ is used as the reference image and the right image $I_r$ represents the target (or candidate) image. The AD algorithm is the simplest among matching cost algorithms. Because of its low complexity, Wang et al. [22] used this algorithm for real time stereo matching using graphics hardware (GPU). The AD algorithm functions satisfactorily in regions with little texture, but, for highly textured images, this algorithm is not capable of producing a smooth disparity map. To overcome this difficulty, the truncated version of the AD algorithm was developed. The truncated absolute difference (TAD) algorithm, as implemented by Min et al. [23] and Pham and Jeon [24], is able to minimize the errors in disparity maps. Furthermore, the TAD algorithm uses the colors and gradients at matching pixels to improve its robustness against variations in illumination.
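As a minimal sketch of the pixel based costs described above, the following functions evaluate the AD cost and a truncated variant for one pixel and one candidate disparity. Images are assumed to be grayscale and stored as lists of rows; the function names and the truncation threshold `tau` are illustrative, and the color and gradient terms used in the cited TAD implementations are omitted.

```python
def ad_cost(left, right, x, y, d):
    """Absolute difference between left pixel (x, y) and right pixel (x - d, y)."""
    if not 0 <= x - d < len(right[y]):
        raise IndexError("candidate pixel lies outside the right image")
    return abs(left[y][x] - right[y][x - d])


def tad_cost(left, right, x, y, d, tau):
    """Truncated AD: the cost is capped at tau to limit the effect of outliers."""
    return min(ad_cost(left, right, x, y, d), tau)


# One-row toy images: the right view is the left view shifted by one pixel.
left = [[10, 20, 90, 40]]
right = [[20, 90, 40, 40]]

match = ad_cost(left, right, x=2, y=0, d=1)          # correct disparity
mismatch = ad_cost(left, right, x=2, y=0, d=0)       # wrong disparity
capped = tad_cost(left, right, x=2, y=0, d=0, tau=30)
```

Truncation leaves small costs unchanged and only limits how much a single badly matching pixel can contribute.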

3.1.2. Squared Differences (SD)

The SD algorithm computes the squared differences between the reference pixels in the left image $I_l$ and the candidate pixels in the right image $I_r$, as described in

$$\mathrm{SD}(x, y, d) = \left| I_l(x, y) - I_r(x - d, y) \right|^2 \tag{2}$$

Yang et al. [25] implemented the SD algorithm for their matching cost computation in a subpixel estimation method for disparity mapping. Considerable noise was generated at the boundaries in their initial disparity maps. However, Yang et al. applied a bilateral filter (i.e., a type of edge preserving filter) to improve the flattening of edges and to smooth areas near depth discontinuities. Recently, Miron et al. [26] tested various matching cost functions in their stereo disparity map algorithms for intelligent vehicle applications. They concluded that the SD algorithm produced the largest error. The SD algorithm is highly sensitive to brightness differences and noise, especially in real time environments.

3.1.3. Feature Based Techniques

The matching cost function can also be constructed using feature based techniques. This approach attempts to establish correspondences only for similar feature points that can be unambiguously matched. Common methods of feature extraction include those based on visual features (e.g., edges, shapes, textures, segmentation, and gradient peaks), statistical characteristics (e.g., minima, medians, and histograms), and transformation features (e.g., Hough transforms, wavelet transforms, and Gabor transforms) [27]. As an example, Sharma et al. [28] developed a new disparity map algorithm using features derived from the scale invariant feature transform (SIFT) for autonomous vehicle navigation. In their implementation, they modified the SIFT algorithm to use a self-organizing map to achieve more efficient performance in the feature matching process. Their results indicate that the computation time of their method is reduced compared with the conventional SIFT algorithm. Because only feature points are correlated, the computational cost is significantly reduced, but the complete disparity map cannot be obtained, and the feature matching accuracy is still low. Sparse disparity maps are produced because only matching points derived from the targeted object features are used [29]. Furthermore, feature based matching techniques exhibit low accuracy and are rather insensitive to occlusion and textureless areas. Liu et al. [30] used a combination of image segmentation and edge detection for the matching cost function. The implementation was fast, but the accuracy level remained low in regions of discontinuity. Ekstrand et al. [31] used a segmentation process that created one-dimensional segments containing information on color and edge coordinates. A segment correspondence estimation was performed for every row of the image to reduce the possibility of mismatched segments. However, errors still occurred as a result of using limited vertical support segments during the matching process. Thus, feature based techniques are not preferred, and their usage by researchers for the development of disparity map algorithms remains low.

3.1.4. Sum of Absolute Differences (SAD)

The SAD algorithm, defined in (3), considers the absolute difference between the intensity of each pixel in the reference block and that of the corresponding pixel in the target block:

$$\mathrm{SAD}(x, y, d) = \sum_{(i, j) \in w} \left| I_l(i, j) - I_r(i - d, j) \right| \tag{3}$$

The differences are summed over the aggregated support window $w$, centered on the pixel $(x, y)$, to generate a simple metric of block similarity from which the disparity map is derived. The SAD algorithm is a well-known algorithm for matching cost computation. It is able to function in real time implementations because of its low computational complexity. This was demonstrated by Tippetts et al. [32], who calculated and evaluated SAD performance for real time human pose images in a resource limited system. Lee and Sharma [33] implemented real time disparity map estimation using the sliding window technique to calculate matching costs with the SAD algorithm. Their algorithm uses parallel processing via a graphics processing unit (GPU). By virtue of applying this technique at the matching cost stage, the accuracy of stereo vision processing can be increased while simultaneously improving the speed.

Gupta and Cho [34] implemented a new technique using two different sizes of correlation windows in the SAD algorithm. At the first level, the initial cost aggregation is determined, and, at the second level, the object boundaries are improved using a smaller window size compared with the first-level implementation. The results produced are more accurate in terms of pixel matching but still require the implementation of multiple loops at different window sizes. The SAD algorithm is fast, but the quality of the initial disparity map that is produced is low because of the noise at object boundaries and in textureless regions.
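A single-window SAD cost can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of any cited work: images are grayscale lists of rows, the window is a square of half-width `radius`, and samples that fall outside either image are simply skipped, which biases costs near the image border.

```python
def sad_cost(left, right, x, y, d, radius):
    """Sum of |I_l(i, j) - I_r(i - d, j)| over a square window centered on (x, y)."""
    height, width = len(left), len(left[0])
    total = 0
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            # Skip samples outside either image (illustrative border policy).
            if 0 <= j < height and 0 <= i < width and 0 <= i - d < width:
                total += abs(left[j][i] - right[j][i - d])
    return total


# Toy pair: the right view equals the left view shifted by two pixels.
left = [[5, 5, 10, 80, 30, 60] for _ in range(3)]
right = [[10, 80, 30, 60, 60, 60] for _ in range(3)]

# Evaluate candidate disparities 0..2 for the pixel (3, 1) with a 3 x 3 window.
costs = [sad_cost(left, right, x=3, y=1, d=d, radius=1) for d in range(3)]
best_disparity = costs.index(min(costs))
```

The two-level scheme of Gupta and Cho would call such a function twice with different window sizes; that refinement is not reproduced here.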

3.1.5. Sum of Squared Differences (SSD)

Equation (4) presents the SSD algorithm, in which the summation is performed over the squared differences in pixel intensity values between two corresponding pixels in the aggregated support window $w$:

$$\mathrm{SSD}(x, y, d) = \sum_{(i, j) \in w} \left| I_l(i, j) - I_r(i - d, j) \right|^2 \tag{4}$$

An early implementation of the SSD algorithm for the matching cost calculation stage was achieved by Fusiello et al. [36]. They tested the SSD algorithm on multiple fixed window blocks to reduce the incidence of occlusion errors. The purpose of using multiple blocks of windows is to search for the smallest error in order to select an appropriate pixel of interest in the disparity map. Yang and Pollefeys [37] used a technique similar to that presented in [36], but their algorithm was implemented on a platform with a GPU. They achieved good results in terms of speed compared with Fusiello’s work. Currently, there is still relatively little research on the use of the SSD algorithm in stereo vision disparity map algorithms compared with that on other matching cost algorithms. This is evident from the previous review papers [8, 12] on stereo vision disparity map algorithms and is also shown in Table 2.
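To illustrate how squaring changes the behavior of a window cost, the sketch below computes the sum of absolute differences and the sum of squared differences over the same window. The function name is hypothetical, the images are grayscale lists of rows, and the window is assumed to lie fully inside both images.

```python
def window_costs(left, right, x, y, d, radius):
    """Return (SAD, SSD) over a square window centered on (x, y) at disparity d."""
    sad = ssd = 0
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            diff = left[j][i] - right[j][i - d]
            sad += abs(diff)
            ssd += diff * diff  # squaring amplifies large deviations
    return sad, ssd


# Identical flat patches, except that one right-image pixel carries noise of +10.
left = [[50, 50, 50] for _ in range(3)]
right = [[50, 50, 50] for _ in range(3)]
right[1][1] = 60

sad, ssd = window_costs(left, right, x=1, y=1, d=0, radius=1)
```

In this toy case a single noisy pixel raises the absolute-difference sum by 10 but the squared-difference sum by 100, which is one way to see why squared costs react more strongly to outliers.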

3.1.6. Normalized Cross Correlation (NCC)

The NCC algorithm is another method of determining the correspondence between two windows around a pixel of interest. The normalization within the window compensates for differences in gain and bias [38]. Equation (5) specifies the formula for the NCC technique:

$$\mathrm{NCC}(x, y, d) = \frac{\sum_{(i, j) \in w} I_l(i, j)\, I_r(i - d, j)}{\sqrt{\sum_{(i, j) \in w} I_l(i, j)^2 \, \sum_{(i, j) \in w} I_r(i - d, j)^2}} \tag{5}$$

However, the NCC algorithm tends to blur regions of discontinuity more than other matching cost algorithms [38]. This is because any outliers lead to large errors in the NCC calculations. A new method for matching low-dimensional image features using NCC has been proposed by Satoh [39]. The NCC algorithm was chosen because of its robustness to intensity offsets and changes in contrast. The results achieved in Satoh’s work exhibit high accuracy, but considerable computational resources are required. Additionally, Cheng et al. [40] implemented their matching cost calculation using a zero mean normalized cross correlation (ZNCC) in which the pixels at which edges are located are manipulated via a multiple-window strategy. This method relies on a neural network model. Furthermore, the least-mean-square delta rule is used for training and for the determination of the proper window shape and size for each support region. These techniques offer improved accuracy, but their computational requirements are still high.
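A plain (non-zero-mean) normalized cross correlation over a window can be sketched as below. Unlike the cost measures discussed earlier, the returned value is a similarity score, so higher is better. The function name is illustrative, the window is assumed to lie inside both images, and the example checks only invariance to a gain change.

```python
import math


def ncc_score(left, right, x, y, d, radius):
    """Normalized cross correlation of two windows; 1.0 means a perfect match."""
    num = sum_l2 = sum_r2 = 0.0
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            a = left[j][i]
            b = right[j][i - d]
            num += a * b
            sum_l2 += a * a
            sum_r2 += b * b
    denom = math.sqrt(sum_l2 * sum_r2)
    # A window of all zeros carries no information; report no correlation.
    return num / denom if denom > 0 else 0.0


left = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Same pattern with doubled gain (e.g., a brighter exposure).
right = [[2 * v for v in row] for row in left]

score = ncc_score(left, right, x=1, y=1, d=0, radius=1)
```

The three running sums per window hint at why NCC is more expensive than SAD or SSD, which need only one.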

3.1.7. Rank Transform (RT)

The matching cost for RT is given by (6) and is calculated based on the absolute difference between two ranks (i.e., from the reference image and from the target image):

$$\mathrm{RT}(x, y, d) = \left| \mathrm{Rank}_{\mathrm{ref}}(x, y) - \mathrm{Rank}_{\mathrm{tar}}(x - d, y) \right| \tag{6}$$

In this equation, $\mathrm{Rank}_{\mathrm{ref}}$ and $\mathrm{Rank}_{\mathrm{tar}}$ are calculated as shown in

$$\mathrm{Rank}(x, y) = \sum_{(i, j) \in w(x, y)} L(i, j) \tag{7}$$

with $L(i, j)$ as defined in

$$L(i, j) = \begin{cases} 1, & I(i, j) > I(x, y) \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

where $(i, j)$ are the coordinates of a neighboring pixel and $(x, y)$ are the coordinates of the pixel of interest. Together, (7) and (8) compute the number of neighboring pixels that have values larger than that of the central pixel $(x, y)$. A new model of a disparity map algorithm that uses the RT algorithm to achieve improved accuracy has been proposed by Gac et al. [41]. Through careful selection of the window sizes, a reliable initial disparity map is efficiently obtained. At the time of its publication, this method demonstrated the highest correct initial matching rate of any matching cost algorithm to date. The RT algorithm is typically effective for coping with brightness differences and image distortions. Sometimes when the RT algorithm is used, a matching pixel may look extremely similar to a neighboring pixel, leading to matching ambiguity. In [42], a new extension of the RT approach was developed to reduce this matching ambiguity using a Bayesian model. This model considers not only the similarities between the left and right image pixels but also the level of ambiguity within each image independently. The results of experiments on images exhibiting variations in intensity and brightness, as reported by the authors of that study, indicate a reduction in matching ambiguities.
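The rank computation described above can be sketched as follows, counting the neighbors that are brighter than the center pixel and comparing the ranks of the two candidate pixels. The function names and the list-of-rows image layout are assumptions of this sketch, and the windows are assumed to lie inside the images.

```python
def rank(image, x, y, radius):
    """Number of neighbors in the window whose value exceeds the center pixel."""
    center = image[y][x]
    count = 0
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            if (i, j) != (x, y) and image[j][i] > center:
                count += 1
    return count


def rt_cost(left, right, x, y, d, radius):
    """Absolute difference between the ranks of the two candidate pixels."""
    return abs(rank(left, x, y, radius) - rank(right, x - d, y, radius))


left = [[1, 9, 3], [7, 5, 2], [8, 4, 6]]
# The same patch under a uniform brightness offset of +100.
right = [[v + 100 for v in row] for row in left]

center_rank = rank(left, x=1, y=1, radius=1)
cost = rt_cost(left, right, x=1, y=1, d=0, radius=1)
```

Because only the ordering of intensities matters, the brightness offset leaves the rank unchanged and the cost stays at zero, which is consistent with the robustness to brightness differences noted above.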

3.1.8. Census Transform (CT)

The CT algorithm translates the results of comparisons between a center pixel and its neighboring pixels within a window into a bit string, as shown in

$$\mathrm{Census}(x, y) = \operatorname{Bitstring}_{(i, j) \in w}\big( I(i, j) \ge I(x, y) \big) \tag{9}$$

The matching cost is calculated using the Hamming distance between the census bit strings of the corresponding match candidates, as given by

$$\mathrm{CT}(x, y, d) = \operatorname{Hamming}\big( \mathrm{Census}_{\mathrm{ref}}(x, y),\ \mathrm{Census}_{\mathrm{tar}}(x - d, y) \big) \tag{10}$$

where $\mathrm{Census}_{\mathrm{ref}}$ represents the census bit string from the reference image and $\mathrm{Census}_{\mathrm{tar}}$ represents the census bit string from the target image. The CT algorithm is rather robust to disparity discontinuities because of its good outlier tolerance, as described by Humenberger et al. [43]. This claim was supported by performance comparisons between the CT algorithm and the SAD algorithm. The disparity maps produced by the CT algorithm exhibited higher matching quality at object borders than those produced by the SAD algorithm. The disadvantage of the CT algorithm is its tendency to produce incorrect matches in regions with repetitive structures. This shortcoming was mitigated by Ma et al. [44] through their modifications to the CT algorithm. They implemented additional bits to represent the differences between the pixel of interest and the neighborhood pixels. According to their results, the accuracy of the disparity map was improved and the incorrect matching problem was alleviated by this modification. In addition, the proposed algorithm demonstrated greater robustness when applied to a noisy image compared with the conventional CT algorithm.
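The census bit string and its Hamming-distance cost can be sketched as below. The comparison direction (a bit is set when the neighbor is at least as bright as the center) and the row-major bit order are assumptions of this sketch; implementations differ on both, and the modification by Ma et al. is not reproduced.

```python
def census(image, x, y, radius):
    """Pack center-versus-neighbor comparisons into an integer bit string."""
    center = image[y][x]
    bits = 0
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            if (i, j) == (x, y):
                continue  # the center pixel is not compared with itself
            bits = (bits << 1) | (1 if image[j][i] >= center else 0)
    return bits


def ct_cost(left, right, x, y, d, radius):
    """Hamming distance between the census strings of the two candidates."""
    diff = census(left, x, y, radius) ^ census(right, x - d, y, radius)
    return bin(diff).count("1")


left = [[1, 9, 3], [7, 5, 2], [8, 4, 6]]
# Same patch with one outlier: the bright neighbor 9 becomes 0.
right = [[1, 0, 3], [7, 5, 2], [8, 4, 6]]

signature = census(left, x=1, y=1, radius=1)
cost = ct_cost(left, right, x=1, y=1, d=0, radius=1)
```

A single outlier flips at most one bit regardless of its magnitude, which illustrates the outlier tolerance attributed to the CT algorithm above.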

Several researchers have also developed matching cost methods based on a combination of two algorithms. A combination of the AD and CT algorithms, as shown by Mei et al. [45], successfully reduces the occurrence of errors. The reason for combining these two methods is to compensate for their respective limitations. The CT algorithm tends to produce incorrect matches in regions with repetitive local structures, whereas the AD algorithm does not perform well on large, textureless regions. Similarly, a combination of the SAD and CT algorithms will also lead to higher performance but will incur an increase in computational complexity [46]. The SAD and CT cost measures are obtained individually, and the final cost function is constructed as a linear combination of both cost measures based on a weighting factor. The accuracy improvement achieved by Zhang et al. [47] was accomplished by means of a cost measure combining the SAD approach and arm length differences (ALD). The use of ALD was inspired by the similarity of the support regions of matching pixels in the vertical direction, which results from the pixels being located on the same horizontal line. This combination is able to reduce errors in most regions, especially those containing repeated colors and shapes. Lee et al. [48] combined the CT and gradient difference approaches to achieve a higher matching cost quality. However, according to them, matching ambiguities can occur in certain regions as a result of similar or repetitive texture patterns.

3.2. Cost Aggregation

Cost aggregation is the most important stage for determining the general performance of a stereo vision disparity map algorithm, especially for local methods. The purpose of cost aggregation is to minimize matching uncertainties. Cost aggregation is needed because the information obtained for a single pixel upon calculating the matching cost is not sufficient for precise matching. Local methods aggregate the matching costs by summing them over a support region [11]. This support region is typically defined by a square window centered on the current pixel of interest, as shown in Figure 4(a). The most straightforward aggregation method is to apply a simple low-pass filter in the square support window. The fixed-size window (FW) technique (e.g., with uniform (box), binomial, or Gaussian filters) suffers an increased error rate when the size of the support window is increased over a certain threshold. Moreover, this method requires the parameters to be set to values suitable for the particular input dataset. Otherwise, it tends to blur object boundaries [49]. To avoid fattening artifacts near depth discontinuities, methods using shifting windows or multiple windows (MW) as well as methods using adaptive windows (AW), windows with adaptive sizes, or adaptive support weights (ASW) have been developed.
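Fixed-window aggregation with a uniform (box) filter can be sketched as follows for a single disparity slice of the cost volume. The function name is illustrative, the slice is a list of rows, and the window is clipped at the image border so that border pixels average over fewer samples.

```python
def box_aggregate(cost, radius):
    """Average a per-pixel cost slice over a square window around each pixel."""
    height, width = len(cost), len(cost[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            count = 0
            # Clip the window to the image so border pixels stay defined.
            for j in range(max(0, y - radius), min(height, y + radius + 1)):
                for i in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += cost[j][i]
                    count += 1
            out[y][x] = total / count
    return out


# A single high cost at the center spreads to every pixel whose window covers it.
cost_slice = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
aggregated = box_aggregate(cost_slice, radius=1)
```

The spreading visible in this toy example is the same mechanism that blurs object boundaries when a fixed window straddles a depth discontinuity.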

In the MW technique, multiple windows are selected from among a number of candidates based on the support windows that produce smaller matching costs. This method was implemented by Hirschmüller et al. [50] and Veksler [51] in their previous studies of real time stereo vision disparity map algorithms. However, their experimental results reveal difficulties in preserving dedicated pixel arrangements in disparity maps, especially at object boundaries. This occurs because of the shape of the support windows. This approach is imperfect for a small number of candidates. To resolve this problem, the AW technique was developed to reduce the errors in the disparity map caused by boundary problems. In this method, the support regions are constructed as approximations to the local image structures. Figure 4(b) illustrates the application of this method with five subwindows with dimensions of 3 × 3. These subwindows must be located near the target pixel as shown in Figure 4. The cost aggregation with the minimum matching cost value for this pixel is calculated. For example, the cost can be calculated as the summation over the target pixel subwindow and any two other adjacent subwindows. The chosen shape of the valid matching windows for aggregation can therefore be any of the shapes shown in Figure 4(d). In practice, the shape of the adaptive window is adaptively varied to reflect the local image content, such as corners and edges.

The AW technique was implemented by Lu et al. [52], who achieved high quality results both near depth discontinuities and in homogeneous regions. Lu’s work was improved upon by Zhang et al. [53] through a modification to the concept of adaptive support regions. They developed support regions with arbitrarily adaptive shapes and implemented the algorithm on a GPU for real time applications. The shapes of these support regions are more flexible and are not restricted to rectangles. These authors achieved high matching accuracy with real time implementation. In this AW technique, the algorithm attempts to find support windows that fit the shape or size of each region, while preventing them from crossing object boundaries. Furthermore, this technique is able to reduce computational costs, as discussed by Chen and Su [54]. These authors proposed a shape adaptive low complexity technique for eliminating computational redundancy between stereo image pairs in pixel matching. They grouped pixels with the same depth value to reduce the number of computations.

A comparative study of the use of different support region techniques in the cost aggregation stage was performed by Fang et al. [55]. This study addressed the FW, AW, and ASW approaches. The authors concluded that the most advantageous technique for cost aggregation is the ASW approach. In this technique, each pixel in the support region is assigned a support weight, which depends on its intensity dissimilarity and spatial distance from the anchor pixel, as shown in Figure 4(c). The pixels surrounding the target pixel, which is located at the center, are assigned different weights depending on their distance, as indicated by the different color tones. Generally, for typical ASW techniques, (11) is used to aggregate the matching costs at pixel $p$ and disparity $d$:

$$C'(p, d) = \sum_{q \in w_p} W(p, q)\, C(q, d) \tag{11}$$

where $w_p$ is a square support window centered on pixel $p$. The window size is a user defined parameter. The value of the function $W(p, q)$ represents the possibility that a pixel $q$ will possess a disparity value similar to that of the window’s center pixel $p$. $C(q, d)$ represents the matching cost of a target pixel $q$ with a disparity value $d$. Ideally, $W(p, q)$ should return a value of “1” if pixels $p$ and $q$ have equal disparity values and “0” otherwise. Chen et al. [56] developed a trilateral filter based on the ASW approach using a bilateral filter. They also added a new weighted term to increase the robustness against object boundaries.

Essentially, in ASW applications, a higher weight will be allocated to a pixel if its intensity is more similar to that of the anchor pixel and if it is located at a smaller distance from the anchor pixel, as implemented by Zhang et al. [57]. This method is able to produce a disparity map in which the object boundaries are well preserved and the accuracy is very high compared with the previous methods reported in the literature. Hosni et al. [58] presented an extensive evaluation of ASW regions. They performed their test on a GPU to evaluate whether the speed and computational efficiency were sufficient for real time responses. Their evaluations indicated that the ASW approach produces outstanding results in terms of both computational efficiency and the quality of the generated disparity maps. Nalpantidis and Gasteratos [59] developed a new approach based on the ASW technique. They combined it with the quantified gestalt law to calculate a weighting factor. In general, a correlation weight reflects the proximity, similarity, and continuity between both input images (i.e., left and right images).
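The weighting principle described above can be sketched with an exponential weight that decays with both intensity dissimilarity and spatial distance. This exponential form, the parameters `gamma_c` and `gamma_s`, and the normalization by the sum of weights are assumptions chosen for illustration; the cited works define their own weight functions.

```python
import math


def support_weight(image, p, q, gamma_c, gamma_s):
    """Weight of pixel q for anchor p: high if q is similar in intensity and close."""
    (px, py), (qx, qy) = p, q
    intensity_diff = abs(image[py][px] - image[qy][qx])
    spatial_dist = math.hypot(px - qx, py - qy)
    return math.exp(-(intensity_diff / gamma_c + spatial_dist / gamma_s))


def asw_aggregate(image, cost, p, radius, gamma_c=10.0, gamma_s=5.0):
    """Weighted average of one cost slice over the window around anchor p."""
    px, py = p
    height, width = len(image), len(image[0])
    num = den = 0.0
    for qy in range(max(0, py - radius), min(height, py + radius + 1)):
        for qx in range(max(0, px - radius), min(width, px + radius + 1)):
            w = support_weight(image, p, (qx, qy), gamma_c, gamma_s)
            num += w * cost[qy][qx]
            den += w
    return num / den


# An intensity edge: the left column belongs to a different surface.
image = [[0, 200, 200] for _ in range(3)]
# Costs on the far side of the edge are high and should be ignored.
cost_slice = [[100, 1, 1] for _ in range(3)]

aggregated = asw_aggregate(image, cost_slice, p=(1, 1), radius=1)
```

In this toy case an unweighted average over the same window would give 34, whereas the weighted result stays near 1 because pixels across the intensity edge receive almost no weight.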

3.3. Disparity Computation and Optimization

Generally, a stereo matching algorithm represents one of the two major optimization approaches: the local approach or the global approach. In the local approach, when the final disparities are computed, the disparity for each pixel is essentially selected using a local winner takes all (WTA) strategy, as defined by

$$d_p = \arg\min_{d \in D} C'(p, d).$$

The disparity associated with the minimum aggregated cost at each pixel is chosen. $C'(p, d)$ represents the aggregated cost obtained after the matching cost calculation, and $D$ denotes the set of all allowed discrete disparities. The WTA strategy is utilized in this stage in local algorithms such as those implemented by Cigla and Alatan [46], Zhang et al. [53], and Lee et al. [60]. According to their findings, the disparity maps obtained at this stage still contain errors in the form of unmatched pixels or occluded regions. Because the aggregation in local methods is performed through summation or averaging over support regions, their accuracy is sensitive to noise and unclear regions. This occurs because only local information from a small number of pixels surrounding the pixel of interest is utilized to make each decision. Therefore, the accuracy of a local method at this stage depends on the matching cost computation and cost aggregation stages. Subsequently, in the disparity refinement stage, the errors are reduced using several filtering techniques.
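
As a minimal illustration of the WTA selection described above, the following sketch picks the disparity index with the lowest aggregated cost for every pixel. The array layout (height, width, disparity) and the toy cost values are assumptions made for this example.

```python
import numpy as np

def winner_takes_all(aggregated_cost):
    """Select, for every pixel, the disparity with the minimum aggregated cost.

    aggregated_cost : array (H, W, D); entry [y, x, d] is the cost of
                      assigning disparity d to pixel (y, x).
    Returns an (H, W) integer disparity map.
    """
    return np.argmin(aggregated_cost, axis=2)

# Toy cost volume: one row, two pixels, three candidate disparities.
cost = np.array([[[0.9, 0.2, 0.7],
                  [0.1, 0.5, 0.6]]])
disparity = winner_takes_all(cost)  # disparity is [[1, 0]]
```

Because each pixel is decided independently, a single noisy cost value can flip the selected disparity, which is the sensitivity to noise noted in the text.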

By contrast, in a global approach, certain assumptions are made about the depth of the scene, which are usually expressed in an energy minimization framework. The bulk of the effort in a global method is expended during the disparity computation phase, and the aggregation step is often skipped [11]. The most commonly adopted assumption is that the scene is locally smooth except at object boundaries, and thus neighboring pixels should have very similar disparities. This constraint is referred to as a smoothness constraint in the stereo vision literature. In the typical global stereo vision formulation, the objective is to find an optimal disparity assignment function $d = d(x, y)$ that minimizes the energy

$$E(d) = E_{\mathrm{data}}(d) + \beta E_{\mathrm{smooth}}(d),$$

where $E_{\mathrm{data}}(d)$ represents the matching costs at the coordinates $(x, y)$; the smoothness energy $E_{\mathrm{smooth}}(d)$ encourages neighboring pixels to have similar disparities based on the previously stated assumption, and β is a weighting factor.
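
To make the energy formulation concrete, the following sketch evaluates the energy of a given disparity map as a data term plus a weighted smoothness term. The absolute-difference smoothness penalty over the 4-connected neighborhood is an assumption chosen for simplicity; actual global methods use a variety of penalty functions.

```python
import numpy as np

def stereo_energy(cost, disparity, beta=1.0):
    """Energy of a disparity assignment: data term + beta * smoothness term.

    cost      : array (H, W, D) of matching costs
    disparity : signed integer array (H, W) with values in [0, D - 1]
    beta      : weighting factor of the smoothness term
    """
    rows, cols = np.indices(disparity.shape)
    # Data term: matching cost of the disparity chosen at every pixel.
    data = cost[rows, cols, disparity].sum()
    # Smoothness term (assumed form): absolute disparity differences
    # between vertical and horizontal neighbours.
    smooth = (np.abs(np.diff(disparity, axis=0)).sum()
              + np.abs(np.diff(disparity, axis=1)).sum())
    return data + beta * smooth
```

A global optimizer such as BP or GC searches for the disparity map that minimizes this quantity, whereas a local method never evaluates the smoothness term at all.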

A global method such as the belief propagation (BP) approach requires large amounts of computational resources and memory for the storage of the image data and the execution of the algorithm. Even for the improved BP technique of Liang et al. [61], which was implemented on a GPU, the processing time is still large compared with that required by local methods. Wang et al. [62] implemented a global approach using a graph cut (GC) algorithm to optimize the energy function. Their method selects disparity values with a lower energy value. Another well-known global technique is dynamic programming (DP). DP is executed for each scan line (row) independently, resulting in polynomial complexity. The assumption adopted in DP is that of an ordering constraint between neighboring pixels of the same row. Recently, the multiresolution energy minimization framework introduced by Arranz et al. [14] achieved real time performance while maintaining the resolution of the produced disparity maps. The advantage of this framework is the reduction in computational complexity that is achieved through the multiresolution technique. However, for images of higher resolution and with many more disparity levels, the framework is unable to perform at a real time frame rate (30 frames per second).
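
The per-scan-line nature of DP can be sketched as follows. This is a simplified illustration that minimizes the matching cost plus a linear penalty on disparity changes along one row; it does not enforce the ordering constraint mentioned above, and the `penalty` parameter is an assumption.

```python
import numpy as np

def scanline_dp(row_cost, penalty=1.0):
    """Optimal disparities for one scan line by dynamic programming.

    row_cost : array (W, D), matching cost of each pixel and disparity
    penalty  : assumed cost per unit of disparity change between neighbours
    Runs in O(W * D^2) time for a row of W pixels and D disparities.
    """
    width, levels = row_cost.shape
    d = np.arange(levels)
    transition = penalty * np.abs(d[:, None] - d[None, :])  # [previous, current]
    total = row_cost[0].astype(float)
    back = np.zeros((width, levels), dtype=int)
    for x in range(1, width):
        candidates = total[:, None] + transition
        back[x] = np.argmin(candidates, axis=0)   # best predecessor per disparity
        total = candidates.min(axis=0) + row_cost[x]
    # Backtrack from the cheapest final state.
    path = np.empty(width, dtype=int)
    path[-1] = int(np.argmin(total))
    for x in range(width - 1, 0, -1):
        path[x - 1] = back[x, path[x]]
    return path
```

With `penalty=0.0` the result reduces to a per-pixel WTA decision; increasing the penalty removes isolated outliers along the row, but because rows are processed independently, streak artifacts between rows remain possible.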

3.4. Disparity Map Refinement

The purpose of the disparity refinement stage is to reduce noise and improve the disparity maps. Typically, the refinement step consists of regularization and occlusion filling or interpolation. The regularization process reduces the overall noise by filtering inconsistent pixels and small variations among pixels in the disparity map. The occlusion filling or interpolation process is responsible for approximating the disparity values in areas in which the disparity is unclear. Typically, occluded regions are filled with disparities similar to those of the background or textureless areas. Usually, the occluded regions are detected using a left-right consistency check, such as those implemented by Yang et al. [63] and Heo et al. [64]. If the matching algorithm rejects disparities with low confidence, then the interpolation algorithm estimates approximations to the correct disparities based on the local neighborhoods. The disparity refinement step normally combines local information from the neighborhood of each measurement with a confidence metric.
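
A left-right consistency check of the kind mentioned above can be sketched as follows. The sketch assumes rectified images, integer disparity maps computed separately with the left and the right image as reference, and a hypothetical tolerance `tol` of one disparity level; it is not the specific procedure of the cited works.

```python
import numpy as np

def left_right_check(disp_left, disp_right, tol=1):
    """Return a boolean mask of pixels that pass the consistency check.

    disp_left  : integer array (H, W), disparities with the left image as reference
    disp_right : integer array (H, W), disparities with the right image as reference
    A left pixel at column x with disparity d is assumed to match the right
    pixel at column x - d; the two disparities should agree within tol.
    """
    h, w = disp_left.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    target = cols - disp_left                    # matching column in the right image
    in_view = (target >= 0) & (target < w)
    target = np.clip(target, 0, w - 1)
    agree = np.abs(disp_left - disp_right[rows, target]) <= tol
    return in_view & agree
```

Pixels where the mask is false are candidates for occlusion filling or interpolation from their neighborhoods.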

Two classic and common techniques for local disparity refinement are Gaussian convolution and the median filter. In Gaussian convolution, disparities are estimated in combination with those of neighboring pixels in accordance with weights defined by a Gaussian distribution. The primary purpose of this method is to reduce the noise in the disparity map, but the Gaussian filter also reduces the amount of fine detail present in the final disparity map. A technique developed by Vijayanagar et al. [65] uses the weights defined by a Gaussian filter to improve disparity maps by approximating missing disparity values from nearby high confidence disparity pixels, which serve as a guide to prevent filtering across object boundaries. Meanwhile, the median filter is able to remove small, isolated mismatches in disparity by virtue of its edge preserving property, and it is suitable for real time implementation because of its low computational complexity. This filter selects the median value within a window of pixels as the final result for the central pixel. In a study by Michael et al. [66], a disparity map refinement approach using median filtering was developed for a real time stereo vision algorithm. Furthermore, the median filter was modified by Ma et al. [67] using the constant time weighted technique. Their modification achieves high accuracy in removing noise and error while maintaining the edges in disparity maps.
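
The behavior of the median filter on a disparity map can be illustrated with a short sketch. The square window, the `radius` parameter, and the edge-replicating border handling are assumptions made for this example.

```python
import numpy as np

def median_filter(disparity, radius=1):
    """Replace every disparity by the median of its (2r+1) x (2r+1) window."""
    h, w = disparity.shape
    padded = np.pad(disparity, radius, mode="edge")
    size = 2 * radius + 1
    # Stack all shifted copies of the map, one per window position.
    windows = [padded[dy:dy + h, dx:dx + w]
               for dy in range(size)
               for dx in range(size)]
    return np.median(np.stack(windows, axis=0), axis=0)
```

For example, a single outlier in an otherwise constant region is removed entirely, while a straight step edge between two disparity levels passes through unchanged, which is the edge preserving property referred to above.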

The diffusion technique performs a function similar to that of Gaussian convolution. Moreover, there exists an adaptation of this approach called anisotropic diffusion. Unlike Gaussian convolution, which destroys edges and fine details, anisotropic diffusion applies smoothing without crossing any edges, as implemented by Banno and Ikeuchi [68] in the disparity map refinement stage of their disparity map algorithm. Their approach was improved upon by Vijayanagar et al. [65], yielding a method called multiresolution anisotropic diffusion. In this method, the disparity map is downsampled using three different resolution factors. At each resolution, 35 iterations of the anisotropic diffusion process are performed. The result of the proposed algorithm is free of occlusion errors, and the edges in the disparity map are refined. Another approach, which was developed by Zhang et al. [6], employs a two-step process to further refine the estimated disparity map. The authors presented the results they achieved through a color image guided depth matting process in a framework based on Bayesian matting and 2D polynomial regression smoothing techniques. This technique was found to effectively preserve the discontinuities at object boundaries while achieving smoothing in flat regions.
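
The edge-stopping behavior of anisotropic diffusion can be sketched with a single-resolution, Perona-Malik style update. This is an illustrative simplification and not the multiresolution method of Vijayanagar et al. [65]; the exponential conduction function and the parameters `kappa` and `step` are assumptions, and the default of 35 iterations merely mirrors the iteration count quoted above.

```python
import numpy as np

def anisotropic_diffusion(disparity, iterations=35, kappa=2.0, step=0.2):
    """Smooth a disparity map while limiting diffusion across strong edges.

    kappa : assumed edge threshold; differences much larger than kappa
            are treated as edges and receive almost no smoothing
    step  : update step size (values up to 0.25 keep the scheme stable)
    """
    d = disparity.astype(float)
    for _ in range(iterations):
        padded = np.pad(d, 1, mode="edge")
        # Differences to the four direct neighbours (zero at the border).
        diffs = [padded[:-2, 1:-1] - d, padded[2:, 1:-1] - d,
                 padded[1:-1, :-2] - d, padded[1:-1, 2:] - d]
        # Conduction is close to 1 in flat areas and close to 0 across edges.
        d = d + step * sum(np.exp(-(g / kappa) ** 2) * g for g in diffs)
    return d
```

Small fluctuations inside a region are averaged out over the iterations, whereas a large disparity jump between two objects is left nearly untouched.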

4. Software-Based Stereo Vision Disparity Map Algorithms

This section reviews several software-based implementations of global and local methods for the generation of disparity maps. These algorithms were developed and tested using only a CPU as the processing hardware, as shown in Figure 5. A software-based implementation is designed to use the CPU to interface with API software. The API software provides a set of libraries, such as the Open Source Computer Vision (OpenCV), Open Computing Language (OpenCL), and Open Graphics Library (OpenGL) libraries. A previous summary of software-based stereo vision disparity map algorithms and their performance was presented by Brown et al. [19]. The discussion also addressed the corresponding methods and occlusion handling techniques. In essence, the differences among these algorithms lie in the cost aggregation stage and in the optimization of the building blocks, which determine the main characteristics of the developed algorithms. Several researchers, such as Park et al. [69] and Cigla et al. [70], have developed new algorithms by taking advantage of both local and global methods for handling occlusions, object boundaries, and untextured regions. These techniques follow an iterative process for allocating disparities that spread into certain segments by applying pixel similarity constraints, considering overlapping regions, enforcing smoothness between similarly colored neighboring segments, and penalizing occlusions. Several other researchers have applied other global techniques with various modifications, resulting in a semiglobal approach. This method involves dynamic programming optimization, such as that implemented by Hirschmüller [71] and Salmen et al. [16]. The results reported by these authors indicate low computational complexity.

4.1. Global Approaches

The best known approaches among the global methods are the belief propagation (BP) and graph cut (GC) algorithms. Pérez and Sánchez [17] used the BP approach to develop a real time, high-definition algorithm that outperformed classical BP by implementing two BP algorithms in their 3D telepresence systems. The first instance of BP classifies the pixels into areas designated as reliable, containing occlusion errors, or textureless to reduce the number of memory accesses required for these three groups of pixels. The second BP process is used to decrease memory traffic by generating the final disparity map with a reduced number of iterations, using information from previous BP iterations. The experimental results demonstrated improved performance. The authors compared this approach with classical BP and observed a 90% improvement in efficiency. Wang and Yang [72] implemented ground control points (GCP) influenced by the MRF model. In their method, GCP-based regularization for the optimization framework is performed using a Bayesian rule. Meanwhile, the energy minimization technique for finding an optimal solution to the inference problem is implemented using the GC approach. Evaluations of this method demonstrated that it effectively improves disparity map reconstruction by regularizing incorrect stereo matches. The approaches developed by Pérez and Sánchez [17] and Wang and Yang [72] both produce accurate disparity maps, but their computational costs remain high. However, Wang et al. [18] proposed a hierarchical bilateral disparity structure (HBDS) algorithm based on a GC technique to reduce computational complexity and improve the accuracy of the generated disparity maps. These authors divide all disparity levels hierarchically into a series of bilateral disparity structures to increase the fineness of the disparity map. During the refinement stage, any fattened regions are recalibrated based on the disparity values of all nearby pixels. The evaluation results indicated good performance with reduced processing time and improved disparity map accuracy.

A new technique was proposed by Chen and Lai [73] based on augmenting paths and the adoption of a push-relabelling scheme. The augmenting path algorithm functions by using multiple threads to calculate each block individually and to select another completed nearby block with which to merge. The proposed method identifies the independent processing loops in the GC approach and isolates the computation of each loop. Each image is sliced into smaller image segments, which are then processed in parallel. The proposed method decreases the execution time by a factor of 4.7 compared with the original GC approach, but considerable programming effort is required. Kolmogorov et al. [74] developed four different energy terms (i.e., data, smoothness, occlusion, and uniqueness) to improve the accuracy of their results. The objectives of their method are to reduce the errors in occluded areas and increase the efficiency during postprocessing. Wang et al. [75] developed an algorithm using the MRF framework to eliminate holes and misaligned pixels. Their work produced high quality disparity maps but also required complex programming. Ploumpis et al. [76] developed a new stereo matching approach based on particle filters and scattered control landmarks. The proposed method consists of three steps. First, multiple disparity maps are used to acquire a set of features or landmarks and then segment the images. Afterward, to estimate the best disparity values, scan line particle filtering is applied. In the last step, a Markov chain model is employed to reduce the computational redundancy of the particle filtering process. Using this method, high quality disparity maps can be produced.

4.2. Local Approaches

In local methods, pixel correspondences are generated by measuring the correspondence and similarity between image regions, and very effective implementations can be produced using this approach [77]. The assignment of disparity values is achieved by applying the WTA strategy after calculating each candidate disparity value individually. The matching cost function is aggregated via a summation or an averaging over a support region. The disparity value with the minimum cost for each pixel is assigned to that pixel. An algorithm based on an efficient cost aggregation strategy was proposed by Mattoccia et al. [13]. These authors used joint bilateral filtering and expanded the calculation structures to allow for the efficient and accurate generation of disparity maps. The idea behind adopting the selected bilateral filtering approach in the developed algorithm is to combine a geometric constraint (i.e., a spatial filter) with a color proximity constraint (i.e., a range filter). The performance of the proposed approach was tested for noise and accuracy using the Middlebury dataset. Another refinement technique, which does not use image segmentation or plane fitting, was proposed by Psota et al. [78]. Instead, the algorithm performs iterative refinement of the results of adaptive weight stereo matching. In each iteration of disparity refinement, the algorithm uses the ASW approach to penalize disparity differences in local windows. A total of eight iterations on the Middlebury dataset were performed by Psota et al., and the correspondence error percentage was observed to decrease from 1.46% to 0.83%. A new technique for local cost aggregation for stereo matching was proposed by Yang [79]. In this technique, the matching cost values are aggregated adaptively based on a tree structure derived from a graph of the image. The nodes of this graph are all of the image pixels, and its edges connect all nearest neighboring pixels. A spanning tree can be computed by removing unwanted edges. Edges with high weights are removed during spanning tree construction. Then, the minimum spanning tree (MST), that is, the spanning tree with the minimum sum of edge weights among all spanning trees, is used for the aggregation. Yang’s method offers low computational complexity and high accuracy but has not been tested for use in real time implementation.

Xu et al. [80] proposed an algorithm that calculates the aggregation cost via the joint optimization of both the left and right matching costs. The authors assign reasonable weighting coefficients and exclude occluded pixels while preserving sufficient support windows for accurate matching. The result is the ability to reduce unwanted pixels in the foreground and increase accuracy in highly textured regions. Furthermore, Lee et al. [60] developed an algorithm based on a local approach with no iteration, using three-mode cross CT with noise buffering to increase the robustness against image noise in textureless areas. This technique also provides two bits of cross CT within three modes of implementation to increase the reliability of the census measure. Most disparity methods encounter difficulties when confronted with fast-moving objects, but Lee’s algorithm addresses this problem by using the concept of optical flow to support weight computations within a localized window. Koo et al. [81] used a gradient based matching technique to reduce the radiometric errors and improved the matching cost function by using a Gaussian-based weighting function. The reference image is divided into two difference images corresponding to low and high frequencies. Then, the Difference of Gaussian (DoG) function is employed to reduce the errors that arise during the matching process. The authors demonstrated a reduction of errors on a sample set of images acquired in an outdoor environment. Matsuo et al. [82] used a local approach based on the AD algorithm and the Sobel operator in the matching cost calculation stage and box filtering in the cost aggregation stage, followed by WTA optimization. They used a weighted joint bilateral filter (JBF) in the refinement stage. They produced accurate disparity maps using several iterations and a fixed window size for the JBF. Nalpantidis and Gasteratos [83] developed a new stereo matching algorithm that employs the AD algorithm and performs aggregation by considering the gestalt laws of proximity, similarity, and continuity within a psychophysically based weight assignment framework. Their proposed algorithm yielded accurate results when applied to the Middlebury dataset.

5. Real Time Stereo Vision Disparity Map Algorithms Using Additional Hardware

The ability to implement stereo matching algorithms in real time represents a new research area in the field of computer vision. The results of the online Middlebury benchmarking system established by Scharstein and Szeliski [35] indicate that algorithms developed using a parallel processing approach or additional hardware are able to deliver processing times among the best achieved on a standard benchmarking dataset. Real time stereo vision algorithms are able to achieve rates of greater than 30 frames per second in their disparity mapping output. In this section, the discussion is limited to platforms that use FPGAs and GPUs for the real time implementation of stereo vision algorithms. Figure 5 shows a basic block diagram for such a hardware-based implementation on an FPGA or GPU. The FPGA consists of a CPU, a multi-input/output port, and a large set of configurable logic blocks (CLB) that can be configured according to the developer’s preferred design. The features of FPGAs include versatility and the flexibility to operate either as standalone systems or as coprocessors on expansion cards for a computer. FPGAs are most often programmed using a hardware description language (HDL), as discussed in [84] and by Bacon et al. [85]. By contrast, a GPU is a dedicated coprocessor with a fixed architecture that accelerates the rendering of 2D and 3D graphics by offloading the related processes from the CPU. Recent GPU designs have evolved from being dedicated graphics rendering processors to more general parallel processors. A GPU consists of multiple processors with more than a hundred cores, depending on the model. A GPU is able to operate in combination with a CPU and with libraries and frameworks such as OpenCL, OpenCV, and Compute Unified Device Architecture (CUDA) [86].

5.1. Comparative Studies of FPGAs and GPUs

Several papers related to performance evaluations of FPGAs, GPUs, and CPUs were reviewed for this literature survey, as specified in Table 3. An early comparison among the FPGA, GPU, and CPU was conducted by Gac et al. [41]. The results indicated that the GPU achieved the highest absolute performance in terms of reconstruction time. The authors applied the back projection technique in their global algorithm via 3D tomography image reconstruction. Kalarot and Morris [87] compared the performance of the DP algorithm implemented on an FPGA and a GPU when applied to their own rectified images for different disparity ranges. Their results indicated that the FPGA offered faster processing than the GPU for a disparity range below 128 but that the FPGA was unable to handle a disparity range of greater than 256, unlike the GPU. This finding can be attributed to the memory limitations of standalone FPGA systems, which prevent their use for processing large images. Another structured evaluation and comparison of the FPGA and GPU was performed by Pauwels et al. [88] in the context of a real time analysis of optical flow, local image features, and stereo vision applications. The authors applied their method to the Middlebury dataset. The comparison was performed based on the hardware architecture, speed, data dependency, accuracy, and time required to design the structure of the algorithms. The presented results demonstrated that the GPU implementation was superior in all respects and yielded more accurate and faster results when implemented as a real time stereo vision system compared with the FPGA implementation. Xu et al. [89] compared the speeds achieved using a CPU and a GPU for their pyramidal stereo algorithm. Their results indicated that the GPU was 12 times faster than the CPU.

Russo et al. [90] performed a performance comparison of an FPGA and a GPU for image convolution processing. They reported that the GPU exhibited better performance in terms of execution time and number of clock cycles. This finding was attributed to the fact that the GPU tended to be better able to handle the extremely large amounts of data contained within high-resolution images. Moreover, the characteristic features of a GPU, such as multiple pipelines and high bandwidth, assist in enhancing its performance. Fowers et al. [91] conducted a performance and energy comparison of a GPU, an FPGA, and a CPU with a multicore architecture. They used the SAD algorithm with the sliding window technique as the algorithm implemented for the comparison. It was found that the FPGA provided the best energy efficiency, whereas the GPU delivered the best performance. Jin and Maruyama [92] compared implementations of their algorithm on an FPGA and a GPU based on speed and accuracy. They also proposed a method of improving the circuit design for FPGAs to reduce the required memory resources while maintaining accuracy. However, the results reported in all of the reviewed papers are subject to the processing performance, which depends on the available hardware resources and the computational requirements of the considered task. A recent study of FPGAs was conducted by Lentaris et al. [93] for their ongoing projects SPARTAN, SEXTANT, and COMPASS to improve the behavior of autonomous planetary exploration rovers. The study focused on the potential use of FPGAs for implementing a variety of stereo correspondence, feature extraction, and visual odometry algorithms.
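
The SAD measure with a sliding window, as used in the comparison by Fowers et al. [91], can be sketched on a CPU as follows. This is not their implementation: the summed-area table used for the window sums, the edge-replicating borders, and the `invalid_cost` assigned to pixels without a counterpart in the right image are assumptions made for illustration.

```python
import numpy as np

def box_sum(values, radius):
    """Sum over every (2r+1) x (2r+1) window using a summed-area table."""
    k = 2 * radius + 1
    padded = np.pad(values, radius, mode="edge")
    table = np.pad(padded, ((1, 0), (1, 0)), mode="constant")
    table = table.cumsum(axis=0).cumsum(axis=1)
    return table[k:, k:] - table[:-k, k:] - table[k:, :-k] + table[:-k, :-k]

def sad_disparity(left, right, max_disp, radius=1, invalid_cost=255.0):
    """Disparity map from SAD block matching with WTA selection."""
    left = left.astype(float)
    right = right.astype(float)
    h, w = left.shape
    best_cost = np.full((h, w), np.inf)
    best_disp = np.zeros((h, w), dtype=int)
    for d in range(max_disp + 1):
        # Left pixel x is compared with right pixel x - d.
        diff = np.full((h, w), invalid_cost)
        diff[:, d:] = np.abs(left[:, d:] - right[:, :w - d])
        sad = box_sum(diff, radius)
        better = sad < best_cost
        best_cost[better] = sad[better]
        best_disp[better] = d
    return best_disp
```

Every disparity level and every window can be evaluated independently, which is the kind of regular data parallelism that both FPGAs and GPUs exploit.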

5.2. Global Approaches

Several global methods have been implemented on FPGAs and GPUs for the development of real time stereo vision disparity map algorithms. A global optimization algorithm for stereo matching based on an improvement to the BP approach and implemented on a GPU was presented by Xiang et al. [9]. Their technique involves the integration of color-weighted correlations to improve hierarchical BP. Occlusion problems are resolved by combining a uniqueness constraint and a similarity constraint for the detection of occluded regions. The approach of Xiang et al. outperforms other BP methods with regard to real time implementation on GPUs. However, its results in discontinuous regions of the disparity maps are somewhat poor, and it requires more complex computations. The approach also suffers an increased time delay when the algorithm attempts to generate accurate results for such discontinuous regions. An improvement to the poor quality of the disparity maps was achieved by Wang et al. [94]. They used the AD-CT algorithm in the matching cost calculation stage with a semiglobal optimization framework on an FPGA board. Semiglobal optimization involves optimizing the smoothness of the disparity map along different directions separately. The designated directions are along lines traveling to the right, bottom, bottom right, and bottom left. This system was found to be able to adjust the image resolution and degree of parallelism to achieve maximum efficiency. The result was the ability to produce high quality disparity maps from high-definition images. In [95], a new cost function was developed for the matching of corresponding pixels. The authors proposed a parallel approach to a variant of a global matching cost calculation method implemented on a GPU for symmetric stereo images. A bank of log-Gabor wavelets was developed for the analysis of such symmetric images in the spectral domain. Using a GPU, the authors achieved real time disparity estimations for high-resolution images.
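
The idea of optimizing smoothness along one direction at a time can be sketched with the path-wise recurrence commonly used in semiglobal matching. The sketch handles only the direction traveling to the right and is not the hardware design of Wang et al. [94]; the penalties `p1` (disparity change of one level) and `p2` (larger changes) are assumed parameters.

```python
import numpy as np

def aggregate_left_to_right(cost, p1=1.0, p2=4.0):
    """Accumulate path costs along each row, moving from left to right.

    cost : array (H, W, D) of matching costs
    p1   : assumed penalty for a disparity change of one level
    p2   : assumed penalty for any larger disparity change
    """
    h, w, _ = cost.shape
    out = np.empty(cost.shape, dtype=float)
    out[:, 0] = cost[:, 0]
    for x in range(1, w):
        prev = out[:, x - 1]                          # shape (H, D)
        prev_min = prev.min(axis=1, keepdims=True)
        padded = np.pad(prev, ((0, 0), (1, 1)), mode="constant",
                        constant_values=np.inf)
        best_prev = np.minimum.reduce([
            prev,                                     # same disparity
            padded[:, :-2] + p1,                      # from disparity d - 1
            padded[:, 2:] + p1,                       # from disparity d + 1
            np.broadcast_to(prev_min + p2, prev.shape),  # any other disparity
        ])
        # Subtracting prev_min keeps the accumulated values bounded.
        out[:, x] = cost[:, x] + best_prev - prev_min
    return out
```

A full semiglobal scheme would run the same recurrence along the other directions, sum the resulting volumes, and then apply a WTA selection to the sum.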

An implementation of a new algorithm using the GC method on a GPU was presented by Choi and Park [96]. They built their algorithm to operate in three stages, using a graph construction method to accelerate the convergence of the GC calculation. A reordering heuristic and an initialization method were employed to further increase the execution speed based on the proposed graph construction method. Then, a repetitive block based push and relabel method was used to increase the data transfer efficiency. Finally, they used a low-overhead global relabelling algorithm to increase the GPU occupancy. They achieved an improved execution time compared with typical global methods at the cost of considerable programming effort. Yao et al. [97] treated image warping as an energy minimization problem. First, they developed a sparse disparity map by means of a stereo matching process. Then, the map was warped using an energy minimization function with three independent terms (i.e., a disparity constraint, a structure constraint, and a temporal constraint). They applied their method to test images with different resolutions and evaluated the results based on the execution times required by the GPU and CPU implementations. The GPU implementation was 24 times faster than the CPU implementation, satisfying the requirements of real time operation.

5.3. Local Approaches

Current FPGA technology offers thousands of small logic blocks embedded in a connection matrix. This allows arbitrary computation blocks to be constructed from basic computing blocks through parallel circuit connections. A detailed summary of the advantages and disadvantages of real time implementation of stereo vision algorithms on FPGAs has been provided by Samarawickrama [98]. Kalarot and Morris [87] implemented an algorithm on an FPGA using a fast and simple approach that combines the distortion removal and alignment correction tasks in a single step by means of lookup tables. However, a problem was encountered in the case of images of more than 1 megapixel in size, which the FPGA was unable to process because of its very limited onboard memory. This memory limitation makes FPGAs unsuitable for the processing of high-definition images unless external memory is used to support them. Mattoccia [99] performed a comparison of three different algorithms, namely, a fixed window algorithm, an ASW algorithm, and a semiglobal algorithm, implemented on FPGAs. Mattoccia’s results demonstrated that the output disparity maps were fairly accurate for all tested algorithms. Recently, Colodro-Conde et al. [100] implemented an area based stereo matching algorithm on an FPGA board and tested its performance using the Middlebury dataset. They developed the algorithm to use the SAD approach for the matching cost calculation and the median filter to refine the disparity maps. Their architecture design involved multiple buffers for temporary memory storage. In this design, when the window size is increased, the buffers also need to be enlarged for the parallel processing of the allocated memory. However, the memory size and inherent frequency of an FPGA limit its suitability for such tasks and applications, especially for real time applications. Nalpantidis et al. [20] used an FPGA to prepare an efficient implementation of their hierarchical matching algorithm on uncalibrated stereo vision images. In their approach, a two-dimensional correspondence search is performed using a hierarchical technique. Then, the intermediate results are refined by three-dimensional cellular automata (CA). The final disparity value is defined in terms of the distance between the matching positions. This proposed algorithm is able to process uncalibrated and nonrectified stereo images when implemented on the FPGA.

Excessive time consumption is the main challenge facing the implementation of real time algorithms because of their computational complexity. The reasons that real time vision algorithms are generally suitable for implementation on GPUs have been explained and discussed by Kim et al. [107]. A GPU is able to run the same instructions on multiple sets of data simultaneously. Based on this functionality, Mei et al. [45] developed a stereo matching algorithm for implementation on a GPU with good performance in terms of both accuracy and speed. The matching cost value was initialized using the AD measure and CT. The cost was aggregated in dynamic cross based regions and updated in a multidirection scan line optimization. Several researchers have developed algorithms based on domain transformations. This technique was introduced by Gastal and Oliveira [108], who used a transformation technique that enables the aggregation of 2D cost data using a sequence of 1D filters. This technique was improved upon by Pham and Jeon [24] by means of a dimensionality reduction technique. The advantage of this technique is that it reduces the complexity of the computational requirements compared with a 2D cost aggregation calculation. A multiresolution anisotropic diffusion approach based on a disparity refinement algorithm that can be executed in a real time environment was proposed by Vijayanagar et al. [65]. This algorithm exploits the image pyramid concept to gradually enhance the disparity map at different levels of resolution and to align the object boundaries in color images. This technique allows smoothing to be achieved without loss of edges, making it a useful tool for improving image segmentation.
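
The CT part of such a matching cost can be sketched as follows. The sketch shows a plain census transform with a Hamming distance and is not the AD and CT combination of Mei et al. [45]; the square window, the `radius` parameter, and the boolean bit layout are assumptions made for illustration.

```python
import numpy as np

def census_transform(image, radius=1):
    """Encode each pixel by comparing it with its window neighbours.

    Returns a boolean array (H, W, n_bits); bit i is True where the i-th
    neighbour is darker than the center pixel.
    """
    h, w = image.shape
    padded = np.pad(image, radius, mode="edge")
    size = 2 * radius + 1
    bits = []
    for dy in range(size):
        for dx in range(size):
            if dy == radius and dx == radius:
                continue                      # skip the center pixel itself
            bits.append(padded[dy:dy + h, dx:dx + w] < image)
    return np.stack(bits, axis=-1)

def hamming_cost(census_a, census_b):
    """Matching cost: number of differing census bits per pixel."""
    return np.count_nonzero(census_a != census_b, axis=-1)
```

Because only the ordering of intensities within the window matters, the census code is unchanged by gain and offset differences between the two cameras, and every pixel can be processed independently, which suits the data-parallel execution model of a GPU.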

A novel local method for stereo matching using a GPU was presented by Kowalczuk et al. [104]. The algorithm begins with an approximation based on ASW aggregation and a low-complexity iterative disparity refinement technique. The probabilistic framework combines the summation term into a matching cost minimization via a series of approximations and facilitates interactive processing to improve the accuracy of the disparity map. The refinement algorithm operates by calculating the estimated disparity value of each pixel during the current iteration using nearby pixel disparities from previous iterations. The implementation of this method of cost aggregation and iterative disparity refinement was performed by Yoon and Kweon [102]. Instead of searching for the matching window with the optimal size and shape, it is possible to aggregate costs after local smoothing within a corresponding window to reduce matching noise. Noise can usually be reduced effectively by applying a linear filter such as a Gaussian filter, but the resulting disparity map typically exhibits edge fattening. Therefore, to address this problem of mismatched pixels or noise around regions of discontinuity in disparity maps, Lin et al. [105] proposed a new algorithm based on an edge-preserving filter for the ASW method computed using a hierarchical clustering algorithm. This algorithm used a novel cost aggregation block to compute the corresponding response for all the corresponding pixels in a set of sampling points. Tippetts et al. [106] used an intensity profile shape-matching algorithm implemented on an FPGA to achieve real-time estimations for microscale unmanned vehicles (e.g., helicopters). The algorithm consists of three stages. In the first stage, filtering is performed using a Gaussian kernel. Then, the shapes of target objects are identified on a row-by-row basis. In the last stage, after the entire image has been processed, a vertical smoothing filter is applied to reduce the remaining noise. The authors also presented designs for FPGA blocks for each stage of implementation.
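The ASW aggregation on which several of the above methods build can be sketched as follows. This is an illustrative simplification rather than the code of any cited work: it assumes grayscale intensities in place of color differences, an absolute difference as the raw cost, and an exponential weight that decays with intensity difference and spatial distance. The parameter values `gamma_c` and `gamma_p` and the function names are hypothetical.

```python
import math

def support_weight(img, p, q, gamma_c=10.0, gamma_p=5.0):
    """Weight of neighbour q in the window centred on p: large when q is
    similar in intensity and spatially close to p."""
    (py, px), (qy, qx) = p, q
    delta_c = abs(img[py][px] - img[qy][qx])   # intensity difference
    delta_g = math.hypot(py - qy, px - qx)     # spatial distance
    return math.exp(-(delta_c / gamma_c + delta_g / gamma_p))

def asw_cost(left, right, y, x, d, radius=1):
    """Weighted average of absolute differences over the support window,
    using the product of the weights from both images. Assumes the
    window stays inside both images."""
    num = den = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            qy, qx = y + dy, x + dx
            w = (support_weight(left, (y, x), (qy, qx)) *
                 support_weight(right, (y, x - d), (qy, qx - d)))
            num += w * abs(left[qy][qx] - right[qy][qx - d])
            den += w
    return num / den

# Toy pair: the right image is the left image shifted by one pixel.
left = [
    [10, 20, 30, 40, 50],
    [15, 90, 35, 45, 55],
    [20, 30, 70, 50, 60],
]
right = [row[1:] + [row[-1]] for row in left]

cost_true = asw_cost(left, right, 1, 2, 1)    # true disparity
cost_wrong = asw_cost(left, right, 1, 2, 0)   # wrong disparity
```

Because neighbours that differ strongly from the window center receive small weights, pixels on the far side of an intensity edge contribute little to the aggregated cost, which is how this family of methods limits edge fattening.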

5.4. The Challenges of Implementing Stereo Vision Algorithms on GPUs and FPGAs

Over the past decade, developments in computing architectures have exhibited a clear trend toward increased heterogeneity and parallelism, with most mainstream microprocessors now possessing multiple cores and robust system architectures [109]. At the same time, the growing number of accelerator options has considerably increased the complexity of application design because the available design space must be explored extensively when choosing a suitable device. Although GPUs with CUDA have come into common use as accelerators because of their low cost, ready availability, and simple programming model compared with that of FPGAs, Ekstrand et al. [110], Perez-Patricio et al. [111], Stein [112], and Long et al. [113] have all presented results showing that different devices are better suited for different applications. Therefore, sufficient exploration of the available devices for each application is critical to prevent researchers from selecting unsuitable devices during the design phase. The performance comparisons between GPUs and FPGAs summarized in this survey vary among implementations and application domains. Neither platform appears to be universally superior; the preferred design depends on the specifications of the target platform. However, the use of GPUs and FPGAs can facilitate increased speeds and reduced execution times. The challenge for a new researcher in this field is to determine how to develop an algorithm that is appropriate to a specific application and to identify the most suitable platform for it.

6. Accuracy Measurement

Several academic research centers provide qualitative accuracy assessments of disparity maps through online submissions. The datasets used for these assessments can also be downloaded from the associated web pages, for example, the Middlebury Computer Vision pages [35], the KITTI Vision Benchmark [114], and DIBRIS [115]. These datasets include both static and dynamic scenes. Most of the articles reviewed in this survey use the qualitative accuracy measurements provided by [35]. Thus, this paper uses the same resource to report the accuracy performance of a stereo vision algorithm in terms of the percentage of bad pixels. According to Scharstein and Szeliski [11], the evaluation of the accuracy level for each image is based on three attributes, namely, the percentages of bad pixels among all pixels in nonoccluded regions (nonocc), all pixels detected as valid pixels (all), and pixels in regions near depth discontinuities and occluded regions (disc). Four standard benchmarking images are used in this evaluation: Tsukuba, Venus, Teddy, and Cones. The original images and ground-truth images for each are shown in Figure 6. Figures 7 and 8 show the nonocc, all, and disc results obtained on these images for approximately 60 algorithms selected from both local and global optimization methods. The bad pixel percentages of these algorithms are among the lowest values represented in the database of [35].
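The percentage of bad pixels can be computed as in the following minimal sketch. It assumes that disparity maps are stored as nested lists and that a pixel is counted as bad when its absolute disparity error exceeds a threshold, here defaulted to 1.0 as an assumption. The optional mask, the function name, and the toy data are hypothetical and serve only to show how a region-restricted score such as nonocc could be obtained.

```python
def bad_pixel_percentage(disparity, ground_truth, mask=None, threshold=1.0):
    """Percentage of evaluated pixels whose absolute disparity error
    exceeds `threshold`. `mask` (optional) marks the pixels to evaluate
    with True, e.g. a nonoccluded region."""
    bad = total = 0
    for y, row in enumerate(ground_truth):
        for x, truth in enumerate(row):
            if mask is not None and not mask[y][x]:
                continue
            total += 1
            if abs(disparity[y][x] - truth) > threshold:
                bad += 1
    return 100.0 * bad / total if total else 0.0

# Toy data: two estimated disparities are off by more than one level.
ground_truth = [[5, 5, 6, 6],
                [5, 5, 6, 6]]
estimate     = [[5, 7, 6, 6],
                [5, 5, 9, 6]]
# Hypothetical mask that excludes the last column as occluded.
nonocc_mask  = [[True, True, True, False],
                [True, True, True, False]]

all_error = bad_pixel_percentage(estimate, ground_truth)
nonocc_error = bad_pixel_percentage(estimate, ground_truth, nonocc_mask)
```

With this toy input, two of the eight pixels are bad over the whole image and two of the six pixels are bad inside the mask, which shows that the same disparity map can receive different scores depending on the region evaluated.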

Because this section discusses only accuracy measurements cited from [35], which are based on online submissions, the implementations of the algorithms in Figures 7 and 8 are not specified as either software-based or hardware-based. Accuracy may be improved through implementation on additional hardware, as shown by Pauwels et al. [88]. However, as reported by Kalarot and Morris [87], the primary advantage of hardware-based (i.e., FPGA and GPU) implementation is that the speed or execution time can be tremendously improved compared with implementation using only a CPU. Figure 9 shows the average errors of local and global methods. Here, the algorithms are represented by numbers, which correspond to the algorithms represented at the same x-axis positions in Figure 7 (for local methods) and Figure 8 (for global methods). This figure shows only the accuracy performances of existing methods and is intended as guidance or a reference for those who wish to develop their own algorithms.

7. Conclusion

The stereo matching problem remains a challenge for computer vision researchers. A literature survey of the latest stereo vision disparity map algorithms is provided here, and all cited algorithms are categorized according to the processing steps with which they are associated in the taxonomy of Scharstein and Szeliski. Becoming familiar with the state-of-the-art algorithms for stereo vision disparity mapping is a time-consuming task. To assist in this task, this survey has presented the latest developments in stereo matching algorithms, the processing steps composing such algorithms, and their software-based as well as hardware-based implementations. The qualitative measurement of the accuracy of such algorithms was also discussed. To assist the reader in navigating the numerous works presented, Table 2 provides a summary. It specifies the steps and computational platforms used in each approach as a reference for the development of new algorithms.

Nomenclature

:Left image
:Right image
:Disparity value
:Pixel coordinates
:Support window
:Element of
:Neighboring pixel coordinates
:Support window centered on pixel
:Pixel of interest
:Smallest pixel value
:Set of disparity values
:Weighting factor
:Energy function
:Neighboring pixels in the support window.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported in part by the Universiti Sains Malaysia Research University Individual (RUI) with Account no. 1001/PELECT/814169 and the Universiti Teknikal Malaysia Melaka.