“for the Entwives desired order, and plenty, and peace (by which they meant that things should remain where they had set them).”—J. R. R. Tolkien, The Lord of the Rings
Feature Descriptor Families
- Local Binary Descriptors. These sample point-pairs in a local region and create a binary-coded bit vector, 1 bit per compare, amenable to Hamming distance feature matching. Examples include LBP, FREAK, ORB, BRISK, and Census.
- Spectra Descriptors. These use a wide range of spectra values, such as gradients and region averages. There is no practical limit to the spectra that could be used with these features. One of the most common spectra used in detectors is the local region gradient, as in SIFT. Gradients are also used in several interest point and edge detectors, such as Harris and Sobel.
- Basis Space Descriptors. These methods encode the feature vector into a set of basis functions, such as the familiar Fourier series of sine and cosine magnitude and phase. In addition, existing and novel basis features are being devised in the form of sparse codebooks and visual vocabularies (we use the term basis space loosely).
- Polygon Shape Descriptors. These take the shape of objects as measured by statistical metrics, such as area, perimeter, and centroid. Typically, the shapes are extracted using a morphological vision pipeline and regional algorithms, which can be more complex than localized algorithms for feature detectors and feature descriptors (as will be discussed in Chapter 8). Image moments [518] is a term often used in the literature to describe shape features.
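As a concrete illustration of the local binary descriptor idea, the following sketch computes an LBP-style 8-bit code for a 3x3 patch (1 bit per neighbor-versus-center compare) and matches codes by Hamming distance; this is illustrative only, not the exact formulation of any one published method.

```python
import numpy as np

def lbp_descriptor(patch):
    """Basic 8-bit LBP-style code for a 3x3 patch: each neighbor is
    compared against the center pixel, yielding 1 bit per compare."""
    center = patch[1, 1]
    # Neighbors in clockwise order starting at top-left.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for bit, n in enumerate(neighbors):
        if n >= center:
            code |= 1 << bit
    return code

def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

patch = np.array([[10, 52, 30],
                  [40, 35, 60],
                  [25, 90, 15]])
code = lbp_descriptor(patch)  # -> 170 (bits 1, 3, 5, 7 set)
```

Because the descriptor is a bit string, matching costs only an XOR and a population count, which is why the local binary family is attractive on low-power hardware.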
Prior Work on Computer Vision Taxonomies
- Affine Covariant Interest Point Detectors. A good taxonomy is provided by Mikolajczyk et al. [153] for affine covariant interest point detectors. Also, Lindeberg [150] has studied the area of scale-independent interest point methods extensively. We seek a much richer taxonomy, however, to cover design principles for feature descriptors, and we have developed our taxonomy around families of descriptor methods with common design characteristics.
- Annotated Computer Vision Bibliography. Maintained by Keith Price at USC, this resource provides a detailed breakdown of computer vision into several branches, as well as links to key research in the field and computer vision resources.1
- CVonline: The Evolving, Distributed, Non-Proprietary, On-Line Compendium of Computer Vision. This provides a comprehensive and detailed list of topics in computer vision. The website is maintained by Robert Fisher and indexes the key Wikipedia articles. This may be one of the best online resources currently available.2
- Local Invariant Feature Detectors: A Survey. Prepared by Tinne Tuytelaars and Krystian Mikolajczyk [107], this reference provides a good overview of several feature description methods, as well as a discussion of the literature on local features, performance and accuracy evaluations of several methods, types of methods (corner detectors, blob detectors, feature detectors), and implementation details.
Robustness and Accuracy
General Robustness Taxonomy
- Illumination
- Color
- Incompleteness
- Resolution and distance
- Geometric distortion
- Discrimination and uniqueness
Illumination
- Uneven illumination: the image contains dark and bright regions, sometimes obscuring a feature that depends on a certain range of pixel intensities.
- Brightness: there is too much or too little total light, affecting feature detection and matching.
- Contrast: intensity bands are too narrow, too wide, or contained in several bands.
- Vignette: light is distributed unevenly, such as darkening around the edges.
Color Criteria
- Color space accuracy: which color space should be used—RGB, YIQ, HSV, or a perceptually accurate color space such as CIECAM02 Jch or Jab? Each color space has accuracy and utility considerations, such as the ease of transforming colors to and from other color spaces.
- Color channels: since cameras typically provide RGB data, extracting the gray scale intensity from the RGB data is often important. There are many methods for converting RGB color to gray scale intensity, and many color spaces to choose from.
- Color bit depth: color information, when used, must be accurate enough for the application. For example, 8-bit color may be suitable for most applications, but when fine color discrimination is necessary, higher-precision color using 10, 12, 14, or 16 bits per channel may be needed.
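To illustrate that the gray scale conversion method matters, here are two common RGB-to-gray conversions: the naive equal-weight average, and the standard ITU-R BT.601 luma weights, which track perceived brightness more closely.

```python
import numpy as np

def rgb_to_gray_average(rgb):
    """Naive gray scale: equal-weight average of R, G, B."""
    return rgb.mean(axis=-1)

def rgb_to_gray_bt601(rgb):
    """ITU-R BT.601 luma weights; green dominates because the eye
    is most sensitive to it."""
    return rgb @ np.array([0.299, 0.587, 0.114])

pixel = np.array([[200.0, 100.0, 50.0]])  # one orange-ish pixel
```

The two methods disagree by several gray levels on the same pixel, which can shift gradient magnitudes and therefore the descriptors built from them.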
Incompleteness
- Clutter: the feature is obscured by surrounding image features, and the feature aliases and blends into the surrounding pixels.
- Occlusion: the feature is partially hidden; in many cases the application will encounter occluded features or sets of features.
- Outliers, proximity: sometimes only features in certain regions are used, and outlying features must be detected and ignored.
- Noise: noise can come from rain, bad image sensors, and many other sources. A constant problem, it can be compensated for, if it is understood, using a wide range of filter methods during pre-processing.
- Motion blur: if it is measured and understood, motion blur can be compensated for using filtering during pre-processing.
- Jitter, judder: a motion artifact, jitter or judder can sometimes be corrected, but not always; this can be a difficult robustness criterion to meet.
Resolution and Accuracy
- Location accuracy or position: how accurately does the metric need to provide a coordinate location under scale, rotation, noise, and other criteria? Is pixel accuracy or sub-pixel accuracy needed? Regional methods of feature description cannot determine positional accuracy as well; for example, methods that use HAAR-like features and integral images can suffer the most, since in computing the HAAR rectangle all pixels in the rectangle are summed together, throwing away discrimination of individual pixel locations. Pixel-accurate feature location can also be challenging, since as features move and rotate they distort, and the pixel sampling artifacts create uncertainty.
- Shape and thickness distortion: distance, resolution, and rotation combine to distort the pixel sample shapes, so a feature may appear thicker or thinner than it really is. Distortion is a type of sampling artifact.
- Focal plane or depth: depending on distance, the area covered by each pixel changes size. In this case, depth sensors can provide some help when used along with RGB or other sensors.
- Pixel depth resolution: for example, processing color channels using float or unsigned short int as a minimum may be required to preserve the bit accuracy.
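As noted under location accuracy, HAAR-like features sum every pixel in a rectangle via an integral image, which is why they lose individual pixel discrimination; a minimal sketch of that summation:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img[0:y, 0:x]; padded with a
    leading row and column of zeros so region sums need no edge cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def region_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from just 4 lookups, regardless of
    region size; every pixel in the rectangle is merged into one value."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
```

Any rectangle sum costs the same four lookups, which is the speed advantage; the cost is exactly the positional discrimination loss described above.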
Geometric Distortion
- Scale: distance from the viewpoint, a commonly addressed robustness criterion.
- Rotation: important in many applications, such as industrial inspection.
- Geometric warp: a key area of research in the fields of activity recognition and dynamic texture analysis, as discussed in Chapters 4 and 6.
- Reflection: mirroring the image, such as a flip about the horizontal or vertical axis.
- Radial distortion: a key problem in depth sensing and also for 2D camera geometry in general, since depth fields are not uniform or simple; see Chapter 1.
- Polar distortion: a key problem in depth sensing geometry; see Chapter 1.
Efficiency Variables, Costs and Benefits
Discrimination and Uniqueness
General Vision Metrics Taxonomy
- Feature Descriptor Family
- Spectra Dimensions
- Spectra Type
- Interest Point
- Storage Formats
- Data Types
- Descriptor Memory
- Feature Shapes
- Feature Pattern
- Feature Density
- Feature Search Methods
- Pattern Pair Sampling
- Pattern Region Size
- Distance Function
- Run-Time Compute
Feature Descriptor Family
- Local Binary Descriptors
- Spectra Descriptors
- Basis Space Descriptors
- Polygon Shape Descriptors
Spectra Dimensions
- Single-variate: stores a single value, such as an integral image or region average, or just a simple set of pixel gradients.
- Multivariate: multiple spectra are stored—for example, a combination of spectra such as color information, gradient magnitude and direction, and other values.
Spectra Type
- Gradient magnitude: a measure of local region texture or difference, used by a wide range of patch-based feature descriptor methods. It is well known [248] that the human visual system responds to gradient information in a scale- and rotationally invariant manner across the retina, as demonstrated in SIFT and many other feature description methods; thus the use of gradients is a preferred method in computer vision.
- Gradient direction: some descriptor methods compute a gradient direction and others do not. A simple region gradient direction method is used by several feature descriptors and edge detection methods, including Sobel and SIFT, to provide rotational invariance.
- Orientation vector: some descriptors are oriented and others are not. Orientation can be computed by methods other than a simple gradient—for example, SURF samples many gradient directions to compute the dominant gradient orientation of the entire patch region as the orientation vector. In the RIFF method, a radial relative orientation is computed. In the SIFT method, any orientation peak within 80 percent of the dominant orientation peak results in an additional interest point being generated, so the same descriptor may allow multiple interest points differing only in orientation.
- Sensor data: data such as accelerometer or GPS information is added to the descriptor. In the GAFD method, a gravity vector computed from an accelerometer is used for orientation.
- Multigeometry: multiple geometric transforms of the descriptor data are stored together in the descriptor, such as several different perspective transforms of the same data as used in the RFM2.3 descriptor; the latter contains the same patch computed over various geometric transforms to increase scale, rotation, and geometric robustness.
- Multiscale: instead of relying on a scale-space pyramid, the descriptor stores a copy of several scaled representations. The multi-resolution histogram method described in Chapter 4 is one such method of approximating feature description over a range of scales, where scale is approximated using a range of Gaussian blur functions and the resulting histograms are stored as the multi-scale descriptor.
- Fourier magnitude: both the sine and cosine basis functions from the Fourier series can be used in the descriptor—for example, in the polygon shape family of descriptors, as illustrated in Figure 6-29. The magnitude of the sine or cosine alone, without the phase, is a revealing shape factor, as illustrated in Figure 6-6, which shows the histogram of LBPs run through a Fourier series to produce the power spectrum. This illustrates how the LBP histogram power spectrum provides rotational invariance. Other methods related to the Fourier series may use alternative arrangements of the computation, such as the discrete cosine transform (DCT), which uses only the cosine component and is amenable to integer computations and hardware acceleration, as commonly done for media applications.
- Fourier phase: phase information has been shown to be valuable for creating a blur-invariant feature descriptor, as demonstrated in the LPQ method discussed in Chapter 6.
- Other basis functions: these can also be used for feature description. Wavelets are commonly used in place of Fourier methods owing to greater control over the function window and tuning of the basis functions derived from the mother wavelet into the family of related wavelets. See Chapter 2 for a discussion of wavelets compared to other basis functions.
- Morphological shape metrics: predominantly used in the polygon shape descriptor family, composed of shape factors, and referred to as image moments in some literature. They are computed over the gross features of a polygon image region, such as area, perimeter, centroid, and many others. The vision pipeline and image pre-processing used for polygon shape description may include morphological and texture operators, rather than local interest point and descriptor computations.
- Learned binary descriptors: created by running ground truth data through a training step, as in ORB and FREAK, to create a set of statistically optimized binary sampling point-pair patterns.
- Dictionary, codebook, vocabulary from feature learning methods: these build up a visual vocabulary, dictionary, or sparse codebook as a sparse set of unique features, using a wide range of descriptor methods such as simple image correlation patches or SIFT descriptors. When combined as a sparse set, these are representative of the features found in a set of ground truth data for an application domain, such as automobile recognition or face recognition.
- Region histogram 2D: used for several types of information, such as binning gradient direction, as in CARD, RFM2.3, and SURF, or binning local binary patterns, as in LBP. The SIFT method of histogramming gradient information uses a fairly large histogram bin region, which provides some translation invariance, similar to the human visual system's treatment of the 3D position of gradients across the retina [248].
- 3D histogram: used in methods such as SIFT, which represents gradient magnitude and orientation together as a 3D histogram.
- Cartesian bins: a common method of binning local region information into the descriptor simply based on the Cartesian position of pixels in a patch—for example, histogramming the pixel intensity magnitude of each point in the region.
- Log polar bins: instead of binning local region feature information in Cartesian rectangular arrangements, some descriptors, such as GLOH, use a log polar coordinate system to prepare values for histogram binning, with the goal of adding better rotational invariance to the descriptor.
- Region sum: such as an integral image—a method used to quickly sum the local region pixel values—or a HAAR feature. The region sum is stored into the feature, representing the total value of all the pixels in the region. Note that region summation may be good for coarse-feature description of an area, but the summation process eliminates fine local texture detail.
- Region average: the average value of the pixels in a region, also referred to as a box filter, which may be computed from a convolution operation, a scaled integral image, or by simply adding up the pixel values in the array.
- Region statistical: such as region moments, like standard deviation, variance, or max or min values.
- Binary pattern: such as a vector of binary values, or bits—for example, stored as the result of pixel pair compare computations of local neighborhood pixel values, as used in the local binary descriptor family, including LBP, Census, and ORB.
- DoG (1-bit quantized): as used in the FREAK descriptor—a set of DoG or bandpass filter features of different sizes, taken over a local region in a retinal sampling pattern similar to the human visual system, compared in pairs, and quantized to a single bit each in a bit vector.
- DoG (multi-bit): a type of bandpass filter implemented in many variations, where a Gaussian blur filter is applied to the image and the image is then subtracted from (a) a shifted copy of itself, (b) a copy of itself at another Gaussian blur level, or (c) a copy of itself at another image scale, as in the SIFT descriptor method.
- Bit vector of values: a bit string containing a sequence of values quantized to a single bit, such as by a threshold.
- 3D surface normals: the 3D analog of 2D gradients, used in the HON4D method [198] to describe the surface of a 3D object location in the feature descriptor.
- Line segment metric: as in the CCH method, used to describe the line segments composing an object perimeter. Or, as used as a shape factor for objects, where the lengths of a set of radial line segments originating at the centroid and extending to the perimeter are recorded in the descriptor, which can be fed into a Fourier transform to yield a power spectrum signature, as shown in Figure 6-29.
- Color space info: many descriptors do not take advantage of color information, which in many cases can provide added discrimination and accuracy. Both the use of simple RGB channels, as in the RGB-D methods [75,118], and color space conversions into more accurate spaces can be invaluable. For example, face recognition has problems distinguishing faces from different cultures, and since skin tone varies across regions, the color value can be measured and added to the descriptor. Several descriptors do make use of color information, such as S-LBP, which operates in a colorimetrically accurate color space such as CIE-Lab; F-LBP, which computes a Fourier spectrum of color distances from the center pixel to adjacent pixels; and color variants of SIFT, among many others.
- Gray scale info: the gray scale or color intensity value is the default spectra in almost all descriptors. However, the method used to create the gray scale from color, and the image pre-processing used to prepare intensity for analysis and measurement, are critical for the vision pipeline; these were discussed in Chapter 2.
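Gradient magnitude and direction, the most common spectra above, reduce to simple per-pixel arithmetic. A sketch using central differences (rather than a full Sobel kernel, which adds smoothing across the perpendicular direction):

```python
import numpy as np

def gradients(img):
    """Central-difference gradients: magnitude and direction per pixel,
    the raw material for SIFT/HOG-style orientation histograms."""
    img = img.astype(float)
    gy, gx = np.gradient(img)        # dI/dy (rows), dI/dx (columns)
    magnitude = np.hypot(gx, gy)     # sqrt(gx^2 + gy^2)
    direction = np.arctan2(gy, gx)   # radians, in (-pi, pi]
    return magnitude, direction

# A vertical step edge: expect a strong, purely horizontal gradient.
img = np.zeros((5, 5))
img[:, 3:] = 10.0
mag, ang = gradients(img)  # mag peaks near column 3, ang is 0 there
```

Orientation histograms then simply bin `ang` weighted by `mag` over a patch, which is the core of the gradient-based descriptor family.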
Interest Point
- Point, edge, or corner: these methods typically start with locating local region maxima and minima; methods used include gradients, local curvature, Harris methods, blob detectors, and edge detectors.
- Contour based, perimeter: some methods do not start feature description at maxima and minima, and instead look for structure in the image, such as a contour or perimeter; this is true mainly for the morphological shape-based methods.
- Other: there are other possibilities for determining interest point location, such as prediction of likely interest point or feature positions, or using grid or tile regions.
- No interest point: some methods do not use any interest points at all.
Storage Formats
- Spectra vector: may be a set of histograms, a set of color values, or a set of basis vectors.
- Bit vector: local binary patterns use bit vector data types; some programming languages include bit vector constructs, and some instruction sets include bit vector handling instructions.
- Multivariate collection: a set of values, such as statistical moments or shape factors.
Data Types
- Float: many applications require floating point for accuracy. For example, a Fourier transform of images requires at least 64-bit double precision (larger images require more precision); other applications, like target tracking, may require 32-bit floating point for precision trajectory computations.
- Integer: pixel values are commonly represented with 8-bit values, with 16 bits per pixel becoming common as image sensors provide better data. At least 32-bit integers are needed for many data structures and numerical results, such as integral images.
- Fixed point: this is an alternative representation to floating point, which saves data space and can be implemented more efficiently in silicon. Most modern GPUs support several fixed-point formats, as do some CPUs. Fixed-point formats include 8-, 16-, and 24-bit representations. Accuracy may be close enough using fixed point, depending on the application. In addition to fixed-point data types, GPUs and some processors also provide various normalized data types (see manufacturer information).
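As an illustration of the fixed-point idea, here is an unsigned Q8.8 format (8 integer bits, 8 fractional bits), chosen for simplicity; actual GPU fixed-point and normalized formats vary by vendor.

```python
def to_fixed_q8_8(x):
    """Encode a float as 16-bit unsigned Q8.8: scale by 2^8 and round."""
    return int(round(x * 256)) & 0xFFFF

def from_fixed_q8_8(f):
    """Decode Q8.8 back to float: divide by 2^8."""
    return f / 256.0

def fixed_mul_q8_8(a, b):
    """Multiply two Q8.8 values: the raw product carries 16 fractional
    bits, so shift right by 8 to renormalize to Q8.8."""
    return ((a * b) >> 8) & 0xFFFF

a = to_fixed_q8_8(1.5)            # 0x0180
b = to_fixed_q8_8(2.25)           # 0x0240
product = fixed_mul_q8_8(a, b)    # decodes back to 3.375
```

All arithmetic is integer shifts and multiplies, which is exactly why fixed point maps so cheaply onto silicon compared with floating point.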
Descriptor Memory
- Fixed length or variable length: some descriptors allow for alternative representations.
- Byte count: the length of all data in the descriptor.
Feature Shapes
- Rectangle block patch: a simple x, y, dx, dy range.
- Symmetric polygon region: may be an octagon, as in the CenSurE method, or a circular region, as in FREAK or DAISY.
- Irregular segmented region: such as computed using morphological methods following segmented regions or a thresholded perimeter.
- Volumetric region: some features make use of stacks of images resembling a volume structure. As shown in Figure 6-12, the VLBP or Volume LBP and the LBP-TOP make use of volumetric data structures. The dynamic texture methods and activity recognition methods often use sets of three adjacent patches from the current frame plus two past frames, organized in a spatio-temporal image frame history, similar to a volume.
- Deformable: most features use a rigid shape, such as a fixed-size rectangle or a circle; however, some descriptors are designed with deformation in mind, such as scale deformations [345,346] and affine or homographic deformations [220], to enable more robust matching.
Feature Pattern
- Rectangular kernel: some methods use a kernel to define which elements in the region are included in the sample; see Figure 5-3 (left image), showing a kernel that does not use the corner pixels in the region; see also Figure 4-10.
- Binary compare pattern: such as FREAK, ORB, and BRISK, where specific pixels in a region are paired to form a complex sampling pattern.
- D-NETS line sample strip set: where points along a line segment are sampled densely; see Figure 4-8.
- Radial line sampling pattern: where points on radial line segments originating at a center point are sampled densely—for example, used to compute Fourier descriptors for polygon region shape; see Figure 6-29.
- Perimeter or contour edge: where points around the edge of a shape or region are sampled densely.
- Sample weighting pattern: as shown in Figure 6-17, SIFT uses a circular weighting pattern in the histogram bins to decrease the contribution of points farther from the center of the patch. The D-NETS method uses binary weighting of samples along the line strips, favoring points away from the endpoints and ignoring points close to them. Weighting patterns can provide invariance to noise and occlusion.
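The circular weighting idea can be sketched as a Gaussian falloff window over a square patch; this is illustrative only, as SIFT's actual sigma, bin layout, and trilinear interpolation differ.

```python
import numpy as np

def gaussian_weight_window(size, sigma):
    """Circular Gaussian weighting over a square patch: samples near
    the center contribute more to the descriptor than those at the
    edges, softening the effect of noise and partial occlusion."""
    half = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    d2 = (x - half) ** 2 + (y - half) ** 2   # squared distance to center
    return np.exp(-d2 / (2.0 * sigma ** 2))

w = gaussian_weight_window(16, sigma=8.0)
```

Each sample's contribution (a gradient magnitude, say) is multiplied by its weight before binning, so an occluded patch corner perturbs the histogram far less than an occluded center would.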
Feature Density
- Global: covers the entire image, each pixel in the image.
- Regional: covers fairly large regions of the image, typically on a grid, or around a segmented structure or region, not anchored at interest points.
- Sparse: may be taken at interest points, or in small regions at selected points, such as random points in the BRIEF descriptor, trained points in FREAK and ORB, or a sparse sampling grid as in the RFM2.3 descriptor.
Feature Search Methods
- Coarse-to-fine image pyramid: a multi-scale search, using a pyramid of coarser-resolution copies of the original image.
- Scale space pyramid: a variation of the regular coarse-to-fine image pyramid, where a Gaussian blur function is computed over each pyramid scale image [547] to create a more uniform search space; see Figure 4-17.
- Pyramid scale factor: captures pyramid scale intervals, such as octaves or other scales—for example, ORB uses a ∼1.41x scale.
- Dense sliding window: where the search is made over each pixel in the image, often within a sliding rectangular region centered at each pixel.
- Grid block search: where the image is divided into a fixed grid or tiles, so the search can be faster but does not discriminate as well as dense methods. For example, see Figure 6-17 describing the PHOG method, which computes descriptors at different grid resolutions across the entire image.
- Window search: a dense search limited to particular regions, such as in stereo matching between two L/R frames, where the correspondence search range is limited to expected locations.
- Sparse at interest points: where a corner detector or other detector is used to determine where valid features may be found.
- Sparse at predicted points: such as in tracking and mapping algorithms like PTAM, where the location of interest points is predicted based on motion or trajectory, and then a feature search begins at the predicted points.
- Sparse in segmented regions: for example, when morphological shape segmentation methods or thresholding segmentation methods define a region, and a second pass is made through the region looking for features.
- Depth segmented regions (Z): when depth camera information is used to threshold the image into foreground and background, and only the foreground regions are searched for features.
- Super-pixel search: similar to the image pyramid method, but a multi-scale representation of the image is created by combining pixel values using super-pixel integration methods, as discussed in Chapter 2.
- Sub-pixel search: where sub-pixel accuracy is needed—for example, with region correlation, so several searches are made around a single pixel, with sub-pixel offsets computed for each compare; in some cases, geometric transforms of the pattern are made prior to feature matching.
- Double-scale first pyramid level: in the SIFT scale-space pyramid method, the lowest level of the pyramid is computed from a doubled, 2x linearly interpolated version of the full-scale image, which preserves high-frequency information in the lowest level of the image pyramid and increases the number of stable keypoints by about four times, which is quite significant. Otherwise, computing the Gaussian blur across the original image would throw away most of the high-frequency details.
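The coarse-to-fine pyramid above can be sketched as follows; this toy version uses 2x2 block averaging per level (a crude low-pass) rather than the per-level Gaussian blur a true scale-space pyramid applies.

```python
import numpy as np

def image_pyramid(img, levels):
    """Coarse-to-fine pyramid: each level is a 2x downsample of the
    previous one, here via 2x2 block averaging. Scale-space methods
    would instead Gaussian-blur each level before subsampling."""
    pyramid = [img.astype(float)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2
        blocks = prev[:h, :w].reshape(h // 2, 2, w // 2, 2)
        pyramid.append(blocks.mean(axis=(1, 3)))
    return pyramid

pyr = image_pyramid(np.ones((64, 64)), levels=4)
```

A matcher then searches the coarsest level first and refines candidate locations down the pyramid, trading a small accuracy loss per level for a large reduction in search area.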
Pattern Pair Sampling
- Center-boundary pair: such as in the LBP family and the Census transform.
- Random pair points: such as in BRIEF, and semi-random in ORB.
- Foveal-centered trained pairs: such as in FREAK and DAISY.
- Trained pairs: many methods train the point-pairs using ground truth data to meet objective criteria, such as FREAK and ORB.
- Symmetric pairs: such as BRISK, which provides short and long line segments spaced symmetrically for point-pair comparisons.
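Random pair sampling in the BRIEF style can be sketched as below; note this is illustrative, as BRIEF draws its pairs from a Gaussian distribution centered on the patch, while uniform sampling is used here for brevity.

```python
import numpy as np

def random_pairs(patch_size, n_pairs, seed=0):
    """Generate n_pairs random point pairs inside a square patch.
    The seed is fixed so the SAME pattern is reused for every feature,
    which is what makes the resulting bit vectors comparable."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, patch_size, size=(n_pairs, 2, 2))

def binary_descriptor(patch, pairs):
    """1 bit per pair: is the intensity at point a below that at b?"""
    return [int(patch[ya, xa] < patch[yb, xb])
            for (ya, xa), (yb, xb) in pairs]

pairs = random_pairs(patch_size=8, n_pairs=32)
patch = np.arange(64).reshape(8, 8)
desc = binary_descriptor(patch, pairs)
```

Trained variants such as ORB keep this structure but replace the random pattern with pairs selected from ground truth data for low correlation and high variance.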
Pattern Region Size
- Bounding box (x size, y size): for example, the bounding box around a rectangular region, circular region, or polygon shape region.
Distance Function
Euclidean or Cartesian Distance Family
- Euclidean distance
- Squared Euclidean distance
- Cosine similarity
- SAD (L1 norm)
- SSD (L2 norm)
- Correlation distance
- Hellinger distance
Grid Distance Family
- Manhattan distance
- Chessboard or Chebyshev distance
Statistical Distance Family
- Earth mover's distance
- Mahalanobis distance
- Bray-Curtis difference
- Canberra distance
Binary or Boolean Distance Family
- L0 norm
- Hamming distance
- Jaccard similarity
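Minimal numpy sketches of representative distances from the families above, one per family:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean family: L2 norm of the difference vector."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """Grid family: L1 / city-block distance."""
    return np.sum(np.abs(a - b))

def chebyshev(a, b):
    """Grid family: chessboard distance, the largest per-axis gap."""
    return np.max(np.abs(a - b))

def hamming(a, b):
    """Binary family: count of differing positions in two bit vectors."""
    return int(np.sum(a != b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
```

Which family applies follows from the descriptor's storage format: spectra vectors pair naturally with Euclidean-family distances, while bit vectors pair with Hamming distance.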
Feature Metric Evaluation
Efficiency Variables, Costs and Benefits
- Costs: compute, memory, time, power
- Benefits: accuracy, robustness, and invariance attributes provided
- Efficiency: benefits vs. costs
Image Reconstruction Efficiency Metric
Example Feature Metric Evaluations
SIFT Example
VISION METRIC TAXONOMY FME
- Name: SIFT
- Feature Family: Spectra
- Spectra dimensions: Multivariate
- Spectra: Gradient magnitude and direction, DoG scale space maxima
- Storage format: Orientation and position, gradient orientation histograms
- Data type: Float, integer
- Descriptor Memory: 128 bytes for descriptor histogram
- Feature shape: Rectangular region
- Search method: Dense sliding window in 2D & 3D 3x3x3 image pyramid
- Feature density: Local
- Feature pattern: Rectangular and pyramid-cubic
- Pattern pair sampling: -
- Pattern region size: 16x16
- Distance function: Euclidean distance
GENERAL ROBUSTNESS ATTRIBUTES
- Total: 5 (scale, illumination, rotation, affine transforms, noise)
LBP Example
VISION METRIC TAXONOMY FME
- Name: LBP
- Feature Family: Local Binary
- Spectra dimensions: Single-variate
- Spectra: Pixel pair compares with center pixel
- Storage format: Binary bit vector
- Data type: Integer
- Descriptor Memory: 1 byte
- Feature shape: Square centered at center pixel
- Search method: Dense sliding window
- Feature density: Local
- Feature pattern: Rectangular kernel
- Pattern pair sampling: Center-boundary pairs
- Pattern region size: 3x3 or more
- Distance function: Hamming distance
GENERAL ROBUSTNESS ATTRIBUTES
- Total: 3 (brightness, contrast, rotation using RILBP)
Shape Factors Example
VISION METRIC TAXONOMY FME
- Name: Shape Factors
- Feature Family: Polygon Shape
- Spectra dimensions: Multivariate
- Spectra: Perimeter following, area, perimeter, centroid, other image moments
- Storage format: Complex data structure
- Data type: Float, integer
- Descriptor Memory: Variable, several hundred bytes possible
- Feature shape: Polygon shapes, rectangular bounding box region
- Search method: Dense, recursive
- Feature density: Regional
- Feature pattern: Perimeter contour or edge
- Pattern pair sampling: -
- Pattern region size: Entire image
- Distance function: Multiple methods, multiple comparisons
GENERAL ROBUSTNESS ATTRIBUTES
- Total: 8 or more (scale, rotation, occlusion, shape, affine, reflection, noise, illumination)