Methods that train complex models in an end-to-end manner tend to pursue either AEs [27] or knowledge distillation (KD) [13]. Both approaches are based on the assumption that the trained DL model is well-behaved only on images that originate from the normal data distribution. For the AE, this means that the image reconstruction fails for anomalies, whereas for KD, this means that the regression of the teacher's features by the student network fails. While the AE can be easily applied to multi/hyperspectral images [35], KD-based approaches are limited by their need for a suitable teacher model. Since CNNs pre-trained on ILSVRC2012 (a subset of ImageNet [36]) are commonly used as teacher models, this limits KD approaches to images that are castable to the RGB image format used in ImageNet, i.e. RGB or grayscale images. While a randomly initialized CNN might potentially be used as the teacher to circumvent this problem (similar approaches have been pursued successfully in reinforcement learning [37]), its efficacy has not yet been demonstrated for AVI.
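The scoring step shared by both paradigms can be sketched as follows (a minimal NumPy illustration, not taken from the cited works; the reconstruction and the teacher/student feature maps are assumed to be given by already-trained models):

```python
import numpy as np

def ae_anomaly_map(image, reconstruction):
    """Per-pixel AE anomaly score: squared reconstruction error.

    Assumes the AE was trained on normal data only, so the error is
    expected to be large where the image deviates from normality."""
    return ((image - reconstruction) ** 2).mean(axis=-1)

def kd_anomaly_map(teacher_features, student_features):
    """Per-location KD anomaly score: squared regression error of the
    student against the (frozen) teacher's features."""
    return ((teacher_features - student_features) ** 2).sum(axis=-1)

# Toy example: a normal (all-zero) image with one defective pixel.
image = np.zeros((4, 4, 3))
image[2, 1] = 1.0                      # simulated defect
reconstruction = np.zeros_like(image)  # AE reproduces only normality
score = ae_anomaly_map(image, reconstruction)
assert score.argmax() == np.ravel_multi_index((2, 1), (4, 4))
```

In both cases the resulting map is thresholded to localize anomalies; the failure of reconstruction (AE) or feature regression (KD) on anomalous regions is what makes the score discriminative.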
As an alternative to AE and KD, the concentration assumption
can be used to formulate learning objectives such as the patch support vector data description [38], which directly learn feature representations where the normal data is concentrated/clustered around a fixed point. Anomalies are then expected to have a larger distance to the cluster center than normal data.
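In its simplest form, such a concentration objective penalizes the squared distance of normal embeddings to a fixed center, and that distance itself serves as the anomaly score at test time (a hedged NumPy sketch; the feature extractor producing `features` and the choice of `center` are assumptions, and details such as the patch-wise handling in [38] are omitted):

```python
import numpy as np

def concentration_loss(features, center):
    # Training objective: pull normal embeddings toward the fixed center.
    return np.mean(np.sum((features - center) ** 2, axis=1))

def anomaly_score(features, center):
    # Test-time score: distance to the center; larger means more anomalous.
    return np.linalg.norm(features - center, axis=1)

center = np.zeros(4)
normal = np.full((5, 4), 0.1)   # embeddings clustered near the center
anomaly = np.full((1, 4), 2.0)  # embedding far from the center
assert anomaly_score(anomaly, center)[0] > anomaly_score(normal, center).max()
```

Minimizing this loss over normal data only is what produces the concentration; no anomalous samples are required during training.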
The main advantage of methods that train complex models in an end-to-end manner is their applicability to any data type, including multi/hyperspectral images. Their main drawbacks are that training is compute-intensive, so these methods do not conform with the requirement of low/limited training effort imposed by the manufacturing industry, and that they tend to produce worse results than hybrid approaches on RGB-castable images. As a potential explanation, it has been hypothesized that discriminative features are inherently difficult to learn from scratch using normal data only [21]. Moreover, it was shown that AEs tend to generalize to anomalies in AVI [27]. To improve results, two approaches are currently pursued in the literature: (I) Initializing the method with a model that was pre-trained on a large-scale natural image dataset [39]. However, this restricts approaches to grayscale/RGB images due to the lack of large-scale multi/hyperspectral image datasets. Furthermore, its effectiveness is limited by catastrophic forgetting [22], which, in AD, refers to a loss of initially present, discriminative features. Therefore, this technique is often combined with the second approach, where (II) surrogates for the anomaly distribution are provided via anomaly synthesis [31]. This requires either access to representative anomalies as a basis for synthesis, or an exhaustive understanding of the underlying manufacturing process and the visual appearance of occurring defects. Thus, anomaly synthesis violates the assumption of an ill-defined anomaly distribution in the same manner as supervised approaches, and is expected to incur a similar bias.
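One common family of synthesis strategies creates surrogate anomalies by locally corrupting normal images, for example by copying a patch of the image onto another location (an illustrative sketch only; the patch size and the copy-based corruption are arbitrary choices for this example, not the concrete method of [31]):

```python
import numpy as np

def synthesize_anomaly(image, rng, patch=8):
    """Create a surrogate anomaly by pasting a random patch of the
    normal image onto another random location (cut-and-paste style)."""
    h, w = image.shape[:2]
    out = image.copy()
    ys, xs = rng.integers(0, h - patch), rng.integers(0, w - patch)
    yd, xd = rng.integers(0, h - patch), rng.integers(0, w - patch)
    out[yd:yd + patch, xd:xd + patch] = image[ys:ys + patch, xs:xs + patch]
    return out

rng = np.random.default_rng(0)
normal = np.linspace(0.0, 1.0, 32 * 32).reshape(32, 32)
surrogate = synthesize_anomaly(normal, rng)
assert surrogate.shape == normal.shape
```

Such surrogates are labeled as anomalous during training, which is precisely why the resulting model inherits a bias toward the chosen corruption type rather than the true, ill-defined anomaly distribution.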