“for the Entwives desired order, and plenty, and peace (by which they meant that things should remain where they had set them).”—J. R. R. Tolkien, The Lord of the Rings
Feature Descriptor Families
- Local Binary Descriptors. These sample point-pairs in a local region and create a binary-coded bit vector, 1 bit per compare, amenable to Hamming distance feature matching. Examples include LBP, FREAK, ORB, BRISK, and Census.
- Spectra Descriptors. These use a wide range of spectra values, such as gradients and region averages. There is no practical limit to the spectra that could be used with these features. One of the most common spectra used in detectors is the local region gradient, as in SIFT. Gradients are also used in several interest point and edge detectors, such as Harris and Sobel.
- Basis Space Descriptors. These methods encode the feature vector into a set of basis functions, such as the familiar Fourier series of sine and cosine magnitude and phase. In addition, existing and novel basis features are being devised in the form of sparse codebooks and visual vocabularies (we use the term basis space loosely).
- Polygon Shape Descriptors. These take the shape of objects as measured by statistical metrics, such as area, perimeter, and centroid. Typically, the shapes are extracted using a morphological vision pipeline and regional algorithms, which can be more complex than localized algorithms for feature detectors and feature descriptors (as will be discussed in Chapter 8). Image moments [518] is a term often used in the literature to describe shape features.
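As a concrete illustration of the local binary descriptor idea, the following sketch computes an LBP-style 8-bit code for a 3x3 patch (1 bit per neighbor-versus-center compare) and matches codes by Hamming distance; this is illustrative only, not the exact formulation of any one published method.

```python
import numpy as np

def lbp_descriptor(patch):
    """Basic 8-bit LBP-style code for a 3x3 patch: each neighbor is
    compared against the center pixel, yielding 1 bit per compare."""
    center = patch[1, 1]
    # Neighbors in clockwise order starting at top-left.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for bit, n in enumerate(neighbors):
        if n >= center:
            code |= 1 << bit
    return code

def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

patch = np.array([[10, 52, 30],
                  [40, 35, 60],
                  [25, 90, 15]])
code = lbp_descriptor(patch)  # -> 170 (bits 1, 3, 5, 7 set)
```

Because the descriptor is a bit string, matching costs only an XOR and a population count, which is why the local binary family is attractive on low-power hardware.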
Prior Work on Computer Vision Taxonomies
- Affine Covariant Interest Point Detectors. A good taxonomy is provided by Mikolajczyk et al. [153] for affine covariant interest point detectors. Also, Lindeberg [150] has studied the area of scale-independent interest point methods extensively. We seek a much richer taxonomy, however, to cover design principles for feature descriptors, and we have developed our taxonomy around families of descriptor methods with common design characteristics.
- Annotated Computer Vision Bibliography. Maintained by Keith Price at USC, this resource provides a detailed breakdown of computer vision into several branches, as well as links to key research in the field and computer vision resources.1
- CVonline: The Evolving, Distributed, Non-Proprietary, On-Line Compendium of Computer Vision. This provides a comprehensive and detailed list of topics in computer vision. The website is maintained by Robert Fisher and indexes the key Wikipedia articles. This may be one of the best online resources currently available.2
- Local Invariant Feature Detectors: A Survey. Prepared by Tinne Tuytelaars and Krystian Mikolajczyk [107], this reference provides a good overview of several feature description methods, as well as a discussion of the literature on local features, performance and accuracy evaluations of several methods, types of methods (corner detectors, blob detectors, feature detectors), and implementation details.
Robustness and Accuracy
General Robustness Taxonomy
- Illumination
- Color
- Incompleteness
- Resolution and distance
- Geometric distortion
- Discrimination and uniqueness
Illumination
- Uneven illumination: the image contains dark and bright regions, sometimes obscuring a feature that depends on a certain range of pixel intensities.
- Brightness: there is too much or too little total light, affecting feature detection and matching.
- Contrast: intensity bands are too narrow, too wide, or contained in several bands.
- Vignette: light is distributed unevenly, such as darkening around the edges.
Color Criteria
- Color space accuracy: which color space should be used—RGB, YIQ, HSV, or a perceptually accurate color space such as CIECAM02 Jch or Jab? Each color space has accuracy and utility considerations, such as the ease of transforming colors to and from other color spaces.
- Color channels: since cameras typically provide RGB data, extracting the gray scale intensity from the RGB data is often important. There are many methods for converting RGB color to gray scale intensity, and many color spaces to choose from.
- Color bit depth: color information, when used, must be accurate enough for the application. For example, 8-bit color may be suitable for most applications, but when fine color discrimination is necessary, higher-precision color using 10, 12, 14, or 16 bits per channel may be needed.
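To illustrate that the gray scale conversion method matters, here are two common RGB-to-gray conversions: the naive equal-weight average, and the standard ITU-R BT.601 luma weights, which track perceived brightness more closely.

```python
import numpy as np

def rgb_to_gray_average(rgb):
    """Naive gray scale: equal-weight average of R, G, B."""
    return rgb.mean(axis=-1)

def rgb_to_gray_bt601(rgb):
    """ITU-R BT.601 luma weights; green dominates because the eye
    is most sensitive to it."""
    return rgb @ np.array([0.299, 0.587, 0.114])

pixel = np.array([[200.0, 100.0, 50.0]])  # one orange-ish pixel
```

The two methods disagree by several gray levels on the same pixel, which can shift gradient magnitudes and therefore the descriptors built from them.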
Incompleteness
- Clutter: the feature is obscured by surrounding image features, and the feature aliases and blends into the surrounding pixels.
- Occlusion: the feature is partially hidden; in many cases the application will encounter occluded features or sets of features.
- Outliers, proximity: sometimes only features in certain regions are used, and outlying features must be detected and ignored.
- Noise: noise can come from rain, bad image sensors, and many other sources. A constant problem, it can be compensated for, if it is understood, using a wide range of filter methods during pre-processing.
- Motion blur: if it is measured and understood, motion blur can be compensated for using filtering during pre-processing.
- Jitter, judder: a motion artifact, jitter or judder can sometimes be corrected, but not always; this can be a difficult robustness criterion to meet.
Resolution and Accuracy
- Location accuracy or position: how accurately does the metric need to provide a coordinate location under scale, rotation, noise, and other criteria? Is pixel accuracy or sub-pixel accuracy needed? Regional methods of feature description cannot determine positional accuracy as well; for example, methods that use HAAR-like features and integral images can suffer the most, since in computing the HAAR rectangle all pixels in the rectangle are summed together, throwing away discrimination of individual pixel locations. Pixel-accurate feature location can also be challenging, since as features move and rotate they distort, and the pixel sampling artifacts create uncertainty.
- Shape and thickness distortion: distance, resolution, and rotation combine to distort the pixel sample shapes, so a feature may appear thicker or thinner than it really is. Distortion is a type of sampling artifact.
- Focal plane or depth: depending on distance, the area covered by each pixel changes size. In this case, depth sensors can provide some help when used along with RGB or other sensors.
- Pixel depth resolution: for example, processing color channels using float or unsigned short int as a minimum may be required to preserve the bit accuracy.
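As noted under location accuracy, HAAR-like features sum every pixel in a rectangle via an integral image, which is why they lose individual pixel discrimination; a minimal sketch of that summation:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img[0:y, 0:x]; padded with a
    leading row and column of zeros so region sums need no edge cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def region_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from just 4 lookups, regardless of
    region size; every pixel in the rectangle is merged into one value."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
```

Any rectangle sum costs the same four lookups, which is the speed advantage; the cost is exactly the positional discrimination loss described above.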
Geometric Distortion
- Scale: distance from the viewpoint, a commonly addressed robustness criterion.
- Rotation: important in many applications, such as industrial inspection.
- Geometric warp: a key area of research in the fields of activity recognition and dynamic texture analysis, as discussed in Chapters 4 and 6.
- Reflection: mirroring the image, such as a flip about the horizontal or vertical axis.
- Radial distortion: a key problem in depth sensing and also for 2D camera geometry in general, since depth fields are not uniform or simple; see Chapter 1.
- Polar distortion: a key problem in depth sensing geometry; see Chapter 1.
Efficiency Variables, Costs and Benefits
Discrimination and Uniqueness
General Vision Metrics Taxonomy
- Feature Descriptor Family
- Spectra Dimensions
- Spectra Type
- Interest Point
- Storage Formats
- Data Types
- Descriptor Memory
- Feature Shapes
- Feature Pattern
- Feature Density
- Feature Search Methods
- Pattern Pair Sampling
- Pattern Region Size
- Distance Function
- Run-Time Compute
Feature Descriptor Family
- Local Binary Descriptors
- Spectra Descriptors
- Basis Space Descriptors
- Polygon Shape Descriptors
Spectra Dimensions
- Single-variate: stores a single value, such as an integral image or region average, or just a simple set of pixel gradients.
- Multivariate: multiple spectra are stored—for example, a combination of spectra such as color information, gradient magnitude and direction, and other values.
Spectra Type
- Gradient magnitude: a measure of local region texture or difference, used by a wide range of patch-based feature descriptor methods. It is well known [248] that the human visual system responds to gradient information in a scale- and rotationally invariant manner across the retina, as demonstrated in SIFT and many other feature description methods; thus the use of gradients is a preferred method in computer vision.
- Gradient direction: some descriptor methods compute a gradient direction and others do not. A simple region gradient direction method is used by several feature descriptors and edge detection methods, including Sobel and SIFT, to provide rotational invariance.
- Orientation vector: some descriptors are oriented and others are not. Orientation can be computed by methods other than a simple gradient—for example, SURF samples many gradient directions to compute the dominant gradient orientation of the entire patch region as the orientation vector. In the RIFF method, a radial relative orientation is computed. In the SIFT method, any orientation peak within 80 percent of the dominant orientation peak results in an additional interest point being generated, so the same descriptor may allow multiple interest points differing only in orientation.
- Sensor data: data such as accelerometer or GPS information is added to the descriptor. In the GAFD method, a gravity vector computed from an accelerometer is used for orientation.
- Multigeometry: multiple geometric transforms of the descriptor data are stored together in the descriptor, such as several different perspective transforms of the same data as used in the RFM2.3 descriptor; the latter contains the same patch computed over various geometric transforms to increase scale, rotation, and geometric robustness.
- Multiscale: instead of relying on a scale-space pyramid, the descriptor stores a copy of several scaled representations. The multi-resolution histogram method described in Chapter 4 is one such method of approximating feature description over a range of scales, where scale is approximated using a range of Gaussian blur functions and the resulting histograms are stored as the multi-scale descriptor.
- Fourier magnitude: both the sine and cosine basis functions from the Fourier series can be used in the descriptor—for example, in the polygon shape family of descriptors, as illustrated in Figure 6-29. The magnitude of the sine or cosine alone, without the phase, is a revealing shape factor, as illustrated in Figure 6-6, which shows the histogram of LBPs run through a Fourier series to produce the power spectrum. This illustrates how the LBP histogram power spectrum provides rotational invariance. Other methods related to the Fourier series may use alternative arrangements of the computation, such as the discrete cosine transform (DCT), which uses only the cosine component and is amenable to integer computations and hardware acceleration, as commonly done for media applications.
- Fourier phase: phase information has been shown to be valuable for creating a blur-invariant feature descriptor, as demonstrated in the LPQ method discussed in Chapter 6.
- Other basis functions: these can also be used for feature description. Wavelets are commonly used in place of Fourier methods owing to greater control over the function window and tuning of the basis functions derived from the mother wavelet into the family of related wavelets. See Chapter 2 for a discussion of wavelets compared to other basis functions.
- Morphological shape metrics: predominantly used in the polygon shape descriptor family, composed of shape factors, and referred to as image moments in some literature. They are computed over the gross features of a polygon image region, such as area, perimeter, centroid, and many others. The vision pipeline and image pre-processing used for polygon shape description may include morphological and texture operators, rather than local interest point and descriptor computations.
- Learned binary descriptors: created by running ground truth data through a training step, as in ORB and FREAK, to create a set of statistically optimized binary sampling point-pair patterns.
- Dictionary, codebook, vocabulary from feature learning methods: these build up a visual vocabulary, dictionary, or sparse codebook as a sparse set of unique features, using a wide range of descriptor methods such as simple image correlation patches or SIFT descriptors. When combined as a sparse set, these are representative of the features found in a set of ground truth data for an application domain, such as automobile recognition or face recognition.
- Region histogram 2D: used for several types of information, such as binning gradient direction, as in CARD, RFM2.3, and SURF, or binning local binary patterns, as in LBP. The SIFT method of histogramming gradient information uses a fairly large histogram bin region, which provides some translation invariance, similar to the human visual system's treatment of the 3D position of gradients across the retina [248].
- 3D histogram: used in methods such as SIFT, which represents gradient magnitude and orientation together as a 3D histogram.
- Cartesian bins: a common method of binning local region information into the descriptor simply based on the Cartesian position of pixels in a patch—for example, histogramming the pixel intensity magnitude of each point in the region.
- Log polar bins: instead of binning local region feature information in Cartesian rectangular arrangements, some descriptors, such as GLOH, use a log polar coordinate system to prepare values for histogram binning, with the goal of adding better rotational invariance to the descriptor.
- Region sum: such as an integral image—a method used to quickly sum the local region pixel values—or a HAAR feature. The region sum is stored into the feature, representing the total value of all the pixels in the region. Note that region summation may be good for coarse-feature description of an area, but the summation process eliminates fine local texture detail.
- Region average: the average value of the pixels in a region, also referred to as a box filter, which may be computed from a convolution operation, a scaled integral image, or by simply adding up the pixel values in the array.
- Region statistical: such as region moments, like standard deviation, variance, or max or min values.
- Binary pattern: such as a vector of binary values, or bits—for example, stored as the result of pixel pair compare computations of local neighborhood pixel values, as used in the local binary descriptor family, including LBP, Census, and ORB.
- DoG (1-bit quantized): as used in the FREAK descriptor—a set of DoG or bandpass filter features of different sizes, taken over a local region in a retinal sampling pattern similar to the human visual system, compared in pairs, and quantized to a single bit each in a bit vector.
- DoG (multi-bit): a type of bandpass filter implemented in many variations, where a Gaussian blur filter is applied to the image and the image is then subtracted from (a) a shifted copy of itself, (b) a copy of itself at another Gaussian blur level, or (c) a copy of itself at another image scale, as in the SIFT descriptor method.
- Bit vector of values: a bit string containing a sequence of values quantized to a single bit, such as by a threshold.
- 3D surface normals: the 3D analog of 2D gradients, used in the HON4D method [198] to describe the surface of a 3D object location in the feature descriptor.
- Line segment metric: as in the CCH method, used to describe the line segments composing an object perimeter. Or, as used as a shape factor for objects, where the lengths of a set of radial line segments originating at the centroid and extending to the perimeter are recorded in the descriptor, which can be fed into a Fourier transform to yield a power spectrum signature, as shown in Figure 6-29.
- Color space info: many descriptors do not take advantage of color information, which in many cases can provide added discrimination and accuracy. Both the use of simple RGB channels, as in the RGB-D methods [75,118], and color space conversions into more accurate spaces can be invaluable. For example, face recognition has problems distinguishing faces from different cultures, and since skin tone varies across regions, the color value can be measured and added to the descriptor. Several descriptors do make use of color information, such as S-LBP, which operates in a colorimetrically accurate color space such as CIE-Lab; F-LBP, which computes a Fourier spectrum of color distances from the center pixel to adjacent pixels; and color variants of SIFT, among many others.
- Gray scale info: the gray scale or color intensity value is the default spectra in almost all descriptors. However, the method used to create the gray scale from color, and the image pre-processing used to prepare intensity for analysis and measurement, are critical for the vision pipeline; these were discussed in Chapter 2.
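Gradient magnitude and direction, the most common spectra above, reduce to simple per-pixel arithmetic. A sketch using central differences (rather than a full Sobel kernel, which adds smoothing across the perpendicular direction):

```python
import numpy as np

def gradients(img):
    """Central-difference gradients: magnitude and direction per pixel,
    the raw material for SIFT/HOG-style orientation histograms."""
    img = img.astype(float)
    gy, gx = np.gradient(img)        # dI/dy (rows), dI/dx (columns)
    magnitude = np.hypot(gx, gy)     # sqrt(gx^2 + gy^2)
    direction = np.arctan2(gy, gx)   # radians, in (-pi, pi]
    return magnitude, direction

# A vertical step edge: expect a strong, purely horizontal gradient.
img = np.zeros((5, 5))
img[:, 3:] = 10.0
mag, ang = gradients(img)  # mag peaks near column 3, ang is 0 there
```

Orientation histograms then simply bin `ang` weighted by `mag` over a patch, which is the core of the gradient-based descriptor family.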
Interest Point
- Point, edge, or corner: these methods typically start with locating local region maxima and minima; methods used include gradients, local curvature, Harris methods, blob detectors, and edge detectors.
- Contour based, perimeter: some methods do not start feature description at maxima and minima, and instead look for structure in the image, such as a contour or perimeter; this is true mainly for the morphological shape-based methods.
- Other: there are other possibilities for determining interest point location, such as prediction of likely interest point or feature positions, or using grid or tile regions.
- No interest point: some methods do not use any interest points at all.
Storage Formats
- Spectra vector: may be a set of histograms, a set of color values, or a set of basis vectors.
- Bit vector: local binary patterns use bit vector data types; some programming languages include bit vector constructs, and some instruction sets include bit vector handling instructions.
- Multivariate collection: a set of values, such as statistical moments or shape factors.
Data Types
- Float: many applications require floating point for accuracy. For example, a Fourier transform of images requires at least 64-bit double precision (larger images require more precision); other applications, like target tracking, may require 32-bit floating point for precision trajectory computations.
- Integer: pixel values are commonly represented with 8-bit values, with 16 bits per pixel becoming common as image sensors provide better data. At least 32-bit integers are needed for many data structures and numerical results, such as integral images.
- Fixed point: this is an alternative representation to floating point, which saves data space and can be implemented more efficiently in silicon. Most modern GPUs support several fixed-point formats, as do some CPUs. Fixed-point formats include 8-, 16-, and 24-bit representations. Accuracy may be close enough using fixed point, depending on the application. In addition to fixed-point data types, GPUs and some processors also provide various normalized data types (see manufacturer information).
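As an illustration of the fixed-point idea, here is an unsigned Q8.8 format (8 integer bits, 8 fractional bits), chosen for simplicity; actual GPU fixed-point and normalized formats vary by vendor.

```python
def to_fixed_q8_8(x):
    """Encode a float as 16-bit unsigned Q8.8: scale by 2^8 and round."""
    return int(round(x * 256)) & 0xFFFF

def from_fixed_q8_8(f):
    """Decode Q8.8 back to float: divide by 2^8."""
    return f / 256.0

def fixed_mul_q8_8(a, b):
    """Multiply two Q8.8 values: the raw product carries 16 fractional
    bits, so shift right by 8 to renormalize to Q8.8."""
    return ((a * b) >> 8) & 0xFFFF

a = to_fixed_q8_8(1.5)            # 0x0180
b = to_fixed_q8_8(2.25)           # 0x0240
product = fixed_mul_q8_8(a, b)    # decodes back to 3.375
```

All arithmetic is integer shifts and multiplies, which is exactly why fixed point maps so cheaply onto silicon compared with floating point.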
Descriptor Memory
- Fixed length or variable length: some descriptors allow for alternative representations.
- Byte count: the length of all data in the descriptor.
Feature Shapes
- Rectangle block patch: a simple x, y, dx, dy range.
- Symmetric polygon region: may be an octagon, as in the CenSurE method, or a circular region, as in FREAK or DAISY.
- Irregular segmented region: such as computed using morphological methods following segmented regions or a thresholded perimeter.
- Volumetric region: some features make use of stacks of images resembling a volume structure. As shown in Figure 6-12, the VLBP or Volume LBP and the LBP-TOP make use of volumetric data structures. The dynamic texture methods and activity recognition methods often use sets of three adjacent patches from the current frame plus two past frames, organized in a spatio-temporal image frame history, similar to a volume.
- Deformable: most features use a rigid shape, such as a fixed-size rectangle or a circle; however, some descriptors are designed with deformation in mind, such as scale deformations [345,346] and affine or homographic deformations [220], to enable more robust matching.
Feature Pattern
- Rectangular kernel: some methods use a kernel to define which elements in the region are included in the sample; see Figure 5-3 (left image), showing a kernel that does not use the corner pixels in the region; see also Figure 4-10.
- Binary compare pattern: such as FREAK, ORB, and BRISK, where specific pixels in a region are paired to form a complex sampling pattern.
- D-NETS line sample strip set: where points along a line segment are sampled densely; see Figure 4-8.
- Radial line sampling pattern: where points on radial line segments originating at a center point are sampled densely—for example, used to compute Fourier descriptors for polygon region shape; see Figure 6-29.
- Perimeter or contour edge: where points around the edge of a shape or region are sampled densely.
- Sample weighting pattern: as shown in Figure 6-17, SIFT uses a circular weighting pattern in the histogram bins to decrease the contribution of points farther from the center of the patch. The D-NETS method uses binary weighting of samples along the line strips, favoring points away from the endpoints and ignoring points close to them. Weighting patterns can provide invariance to noise and occlusion.
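The circular weighting idea can be sketched as a Gaussian falloff window over a square patch; this is illustrative only, as SIFT's actual sigma, bin layout, and trilinear interpolation differ.

```python
import numpy as np

def gaussian_weight_window(size, sigma):
    """Circular Gaussian weighting over a square patch: samples near
    the center contribute more to the descriptor than those at the
    edges, softening the effect of noise and partial occlusion."""
    half = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    d2 = (x - half) ** 2 + (y - half) ** 2   # squared distance to center
    return np.exp(-d2 / (2.0 * sigma ** 2))

w = gaussian_weight_window(16, sigma=8.0)
```

Each sample's contribution (a gradient magnitude, say) is multiplied by its weight before binning, so an occluded patch corner perturbs the histogram far less than an occluded center would.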
Feature Density
- Global: covers the entire image, each pixel in the image.
- Regional: covers fairly large regions of the image, typically on a grid, or around a segmented structure or region, not anchored at interest points.
- Sparse: may be taken at interest points, or in small regions at selected points, such as random points in the BRIEF descriptor, trained points in FREAK and ORB, or a sparse sampling grid as in the RFM2.3 descriptor.
Feature Search Methods
- Coarse-to-fine image pyramid: a multi-scale search, using a pyramid of coarser-resolution copies of the original image.
- Scale space pyramid: a variation of the regular coarse-to-fine image pyramid, where a Gaussian blur function is computed over each pyramid scale image [547] to create a more uniform search space; see Figure 4-17.
- Pyramid scale factor: captures pyramid scale intervals, such as octaves or other scales—for example, ORB uses a ∼1.41x scale.
- Dense sliding window: where the search is made over each pixel in the image, often within a sliding rectangular region centered at each pixel.
- Grid block search: where the image is divided into a fixed grid or tiles, so the search can be faster but does not discriminate as well as dense methods. For example, see Figure 6-17 describing the PHOG method, which computes descriptors at different grid resolutions across the entire image.
- Window search: a dense search limited to particular regions, such as in stereo matching between two L/R frames, where the correspondence search range is limited to expected locations.
- Sparse at interest points: where a corner detector or other detector is used to determine where valid features may be found.
- Sparse at predicted points: such as in tracking and mapping algorithms like PTAM, where the location of interest points is predicted based on motion or trajectory, and then a feature search begins at the predicted points.
- Sparse in segmented regions: for example, when morphological shape segmentation methods or thresholding segmentation methods define a region, and a second pass is made through the region looking for features.
- Depth segmented regions (Z): when depth camera information is used to threshold the image into foreground and background, and only the foreground regions are searched for features.
- Super-pixel search: similar to the image pyramid method, but a multi-scale representation of the image is created by combining pixel values using super-pixel integration methods, as discussed in Chapter 2.
- Sub-pixel search: where sub-pixel accuracy is needed—for example, with region correlation, so several searches are made around a single pixel, with sub-pixel offsets computed for each compare; in some cases, geometric transforms of the pattern are made prior to feature matching.
- Double-scale first pyramid level: in the SIFT scale-space pyramid method, the lowest level of the pyramid is computed from a doubled, 2x linearly interpolated version of the full-scale image, which preserves high-frequency information in the lowest level of the image pyramid and increases the number of stable keypoints by about four times, which is quite significant. Otherwise, computing the Gaussian blur across the original image would throw away most of the high-frequency details.
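The coarse-to-fine pyramid above can be sketched as follows; this toy version uses 2x2 block averaging per level (a crude low-pass) rather than the per-level Gaussian blur a true scale-space pyramid applies.

```python
import numpy as np

def image_pyramid(img, levels):
    """Coarse-to-fine pyramid: each level is a 2x downsample of the
    previous one, here via 2x2 block averaging. Scale-space methods
    would instead Gaussian-blur each level before subsampling."""
    pyramid = [img.astype(float)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2
        blocks = prev[:h, :w].reshape(h // 2, 2, w // 2, 2)
        pyramid.append(blocks.mean(axis=(1, 3)))
    return pyramid

pyr = image_pyramid(np.ones((64, 64)), levels=4)
```

A matcher then searches the coarsest level first and refines candidate locations down the pyramid, trading a small accuracy loss per level for a large reduction in search area.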
Pattern Pair Sampling
- Center-boundary pair: such as in the LBP family and the Census transform.
- Random pair points: such as in BRIEF, and semi-random in ORB.
- Foveal-centered trained pairs: such as in FREAK and DAISY.
- Trained pairs: many methods train the point-pairs using ground truth data to meet objective criteria, such as FREAK and ORB.
- Symmetric pairs: such as BRISK, which provides short and long line segments spaced symmetrically for point-pair comparisons.
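Random pair sampling in the BRIEF style can be sketched as below; note this is illustrative, as BRIEF draws its pairs from a Gaussian distribution centered on the patch, while uniform sampling is used here for brevity.

```python
import numpy as np

def random_pairs(patch_size, n_pairs, seed=0):
    """Generate n_pairs random point pairs inside a square patch.
    The seed is fixed so the SAME pattern is reused for every feature,
    which is what makes the resulting bit vectors comparable."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, patch_size, size=(n_pairs, 2, 2))

def binary_descriptor(patch, pairs):
    """1 bit per pair: is the intensity at point a below that at b?"""
    return [int(patch[ya, xa] < patch[yb, xb])
            for (ya, xa), (yb, xb) in pairs]

pairs = random_pairs(patch_size=8, n_pairs=32)
patch = np.arange(64).reshape(8, 8)
desc = binary_descriptor(patch, pairs)
```

Trained variants such as ORB keep this structure but replace the random pattern with pairs selected from ground truth data for low correlation and high variance.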
Pattern Region Size
- Bounding box (x size, y size): for example, the bounding box around a rectangular region, circular region, or polygon shape region.
Distance Function
Euclidean or Cartesian Distance Family
- Euclidean distance
- Squared Euclidean distance
- Cosine similarity
- SAD (L1 norm)
- SSD (L2 norm)
- Correlation distance
- Hellinger distance
Grid Distance Family
- Manhattan distance
- Chessboard or Chebyshev distance
Statistical Distance Family
- Earth mover's distance
- Mahalanobis distance
- Bray-Curtis difference
- Canberra distance
Binary or Boolean Distance Family
- L0 norm
- Hamming distance
- Jaccard similarity
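Minimal numpy sketches of representative distances from the families above, one per family:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean family: L2 norm of the difference vector."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """Grid family: L1 / city-block distance."""
    return np.sum(np.abs(a - b))

def chebyshev(a, b):
    """Grid family: chessboard distance, the largest per-axis gap."""
    return np.max(np.abs(a - b))

def hamming(a, b):
    """Binary family: count of differing positions in two bit vectors."""
    return int(np.sum(a != b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
```

Which family applies follows from the descriptor's storage format: spectra vectors pair naturally with Euclidean-family distances, while bit vectors pair with Hamming distance.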
Feature Metric Evaluation
Efficiency Variables, Costs and Benefits
- Costs: compute, memory, time, power
- Benefits: accuracy, robustness, and invariance attributes provided
- Efficiency: benefits vs. costs
Image Reconstruction Efficiency Metric
Example Feature Metric Evaluations
SIFT Example
VISION METRIC TAXONOMY FME
- Name: SIFT
- Feature Family: Spectra
- Spectra dimensions: Multivariate
- Spectra: Gradient magnitude and direction, DoG scale space maxima
- Storage format: Orientation and position, gradient orientation histograms
- Data type: Float, integer
- Descriptor Memory: 128 bytes for descriptor histogram
- Feature shape: Rectangular region
- Search method: Dense sliding window in 2D & 3D 3x3x3 image pyramid
- Feature density: Local
- Feature pattern: Rectangular and pyramid-cubic
- Pattern pair sampling: -
- Pattern region size: 16x16
- Distance function: Euclidean distance
GENERAL ROBUSTNESS ATTRIBUTES
- Total: 5 (scale, illumination, rotation, affine transforms, noise)
LBP Example
VISION METRIC TAXONOMY FME
- Name: LBP
- Feature Family: Local Binary
- Spectra dimensions: Single-variate
- Spectra: Pixel pair compares with center pixel
- Storage format: Binary bit vector
- Data type: Integer
- Descriptor Memory: 1 byte
- Feature shape: Square centered at center pixel
- Search method: Dense sliding window
- Feature density: Local
- Feature pattern: Rectangular kernel
- Pattern pair sampling: Center-boundary pairs
- Pattern region size: 3x3 or more
- Distance function: Hamming distance
GENERAL ROBUSTNESS ATTRIBUTES
- Total: 3 (brightness, contrast, rotation using RILBP)
Shape Factors Example
VISION METRIC TAXONOMY FME
- Name: Shape Factors
- Feature Family: Polygon Shape
- Spectra dimensions: Multivariate
- Spectra: Perimeter following, area, perimeter, centroid, other image moments
- Storage format: Complex data structure
- Data type: Float, integer
- Descriptor Memory: Variable, several hundred bytes possible
- Feature shape: Polygon shapes, rectangular bounding box region
- Search method: Dense, recursive
- Feature density: Regional
- Feature pattern: Perimeter contour or edge
- Pattern pair sampling: -
- Pattern region size: Entire image
- Distance function: Multiple methods, multiple comparisons
GENERAL ROBUSTNESS ATTRIBUTES
- Total: 8 or more (scale, rotation, occlusion, shape, affine, reflection, noise, illumination)