
Open Access 09.06.2025 | Original Article

Multi-model anomaly detection for industrial inspection with dynamic loss weighting and soft-hard features loss

Authors: Willy Fitra Hendria, Hanbi Kim, Daeho Seo

Published in: Neural Computing and Applications


Abstract

Anomaly detection in industrial inspection is critical for maintaining high quality standards and identifying defects that can compromise product integrity. Traditional single-model approaches often struggle with the variability and rarity of defects, leading to suboptimal detection performance. This article addresses these challenges by introducing a multi-model framework that leverages dynamic loss weighting and a soft-hard feature loss. Dynamic loss weighting adaptively adjusts the influence of multiple teacher models based on their reliability, optimizing the training process without extensive hyperparameter tuning. The soft-hard feature loss, in turn, enables a more comprehensive learning process by retaining relevant lower-error features while emphasizing high-error regions. The proposed method achieves state-of-the-art results on benchmark datasets, demonstrating improved accuracy and efficiency in anomaly detection. The article also provides qualitative and quantitative analyses that highlight the method's ability to reduce false detections and localize anomalies. By integrating these advanced techniques, the framework offers a robust solution for industrial inspection, ensuring reliable and efficient defect detection across diverse manufacturing environments.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Anomaly detection in image data [1] is a critical yet challenging task, particularly in industrial inspection [2], where defects are rare and exhibit significant variations in appearance. Traditional single-model approaches based on supervised learning [3] often struggle to generalize across diverse anomaly patterns due to their reliance on a single feature representation. Consequently, there is a growing need for multi-model frameworks that leverage complementary features to enhance anomaly detection performance. However, image-based anomaly detection presents several unique challenges, including high intra-class variability, where anomalies appear in diverse shapes, sizes, and textures, as well as the limited availability of labeled data, which restricts the effectiveness of supervised learning approaches. Additionally, subtle and localized anomalies require high-resolution feature extraction, making it difficult for traditional methods to achieve accurate detection. These challenges highlight the need for more robust approaches capable of capturing complex anomaly patterns while maintaining generalization across different datasets.
To address these challenges, multi-model approaches [4–8] have been explored, leveraging multiple models to improve robustness and generalization. Among these, student–teacher frameworks [4] have demonstrated strong performance in unsupervised anomaly detection by training a student network to mimic the teacher’s output on normal images, with discrepancies serving as an anomaly indicator. Extensions of this framework include multiple student networks [4], four-model student–teacher architectures for detecting both structural and logical anomalies [5], and three-model setups integrating a teacher, student, and autoencoder [6]. While these approaches improve performance, they do not effectively balance the contributions of different models, leading to suboptimal training dynamics.
Manual hyperparameter tuning can partially address this issue, but it becomes impractical as the number of models increases due to the exponential growth in tuning complexity (e.g., testing five candidate values for each of five loss weights requires up to 5^5 = 3,125 experiments). To overcome this limitation, we propose dynamic loss weighting, which adaptively adjusts the influence of multiple teacher models based on their reliability, optimizing model contributions without extensive tuning.
Additionally, anomaly localization remains a challenge due to hard feature loss methods [6], which focus only on high-error regions while disregarding a significant portion of image information. This can lead to missed anomalies and incorrect detections. To address this, we introduce soft-hard feature loss, which preserves high-error features while retaining relevant lower error features, ensuring a more comprehensive learning process.
In our framework, we employ a three-model anomaly detection method: one model serves as the student, and the other two serve as teacher models that guide the student. Specifically, we dynamically adjust the importance, i.e., the loss weights, of the two teacher models in the overall student training process. Following the state-of-the-art method [6], we utilize a pre-trained convolutional model as our static teacher, i.e., a frozen model, and an autoencoder-based model as the trainable teacher. The student model follows the same architecture as the static teacher. While the static teacher guides the student based on its knowledge of the pretraining dataset, i.e., ImageNet [9], the trainable teacher adapts to the target dataset, e.g., MVTec AD [2], and guides the student accordingly. To further improve model performance, we introduce a soft-hard feature loss that helps the student learn the teacher's representation of normal images better while preventing the student from generalizing beyond normal images.
In summary, our main contributions are listed as follows:
  • We propose a dynamic loss weighting technique to dynamically adjust the importance of different models during the training process. To the best of our knowledge, this is the first work in which a dynamic loss weighting technique is employed for multi-model unsupervised anomaly detection.
  • We introduce a novel soft-hard feature loss to reduce the impact of lower error values while maintaining those with the largest error values. This loss function helps the student model learn the representation of normal images better while inhibiting it from imitating the teacher on anomalous images.
  • We achieve state-of-the-art results on the MVTec AD and VisA datasets without increasing computational requirements at inference time.

2 Related work

In the field of anomaly detection, methods are divided into two main categories: traditional and deep learning-based. Traditional methods include statistical-based [10, 11], proximity-based [12, 13], and shallow machine learning models [14, 15]. Such traditional methods often have limitations in capturing complex relationships and require manual effort to craft algorithms or features.

2.1 Deep learning-based anomaly detection methods

Deep learning-based methods [4, 16, 17] address these issues by capturing complex relationships across multiple layers of abstraction and automatically learning the feature representations. Based on label availability during training, anomaly detection methods can be further categorized into supervised [18, 19], semi-supervised [16, 20], and unsupervised [4, 17] methods. In the context of anomaly detection, supervised and semi-supervised methods require labeled instances for both normal and abnormal cases during training, while unsupervised methods solely rely on normal data. Due to the challenges in collecting abnormal data in manufacturing, unsupervised learning has been extensively utilized across various benchmark datasets [2, 21] in the manufacturing domain. In this paper, we specifically focus on improving unsupervised deep learning methods for anomaly detection.

2.2 Unsupervised deep anomaly detection

In unsupervised learning, deep learning-based methods for anomaly detection can be broadly categorized into reconstruction-based and pre-trained-based algorithms. VAE-based [22, 23] and GAN-based [17, 24] algorithms aim to reconstruct images and detect anomalies from scratch by analyzing differences between the reconstructed and input images in pixel space. This approach may not be optimal due to straightforward per-pixel comparisons or suboptimal reconstructions [4]. VAEs often produce overly smoothed outputs, making it difficult to preserve the fine-grained details necessary for detecting subtle anomalies. GANs are prone to mode collapse, leading to unreliable reconstructions and difficulty in capturing diverse anomaly patterns. On the other hand, pre-trained-based methods [4–6, 25, 26] extract features from models trained on large datasets like ImageNet and perform anomaly detection by analyzing the differences in extracted features between normal and abnormal instances. Utilizing features from pre-trained networks tends to yield better results than methods relying on autoencoders or GANs built from scratch [27]. However, pre-trained models are typically trained on large-scale natural image datasets like ImageNet, which may not fully align with the characteristics of industrial anomaly detection datasets. Recently, AAND [28] proposed a two-stage knowledge distillation framework for industrial anomaly detection that enhances feature discrepancy through Anomaly Amplification and Normality Distillation (AAND). Its Residual Anomaly Amplification (RAA) module increases sensitivity to anomalies while preserving the integrity of the pre-trained model, and its hard feature distillation loss facilitates the reconstruction of subtle or rare patterns in normal samples.
In this work, we adopt a recently popular student–teacher-based approach [4–6], which falls under the broader umbrella of pre-trained-based methods: the teacher model can be seen as a pre-trained model that guides the learning of the student model for detecting anomalies. Specifically, our approach utilizes a student–teacher-based model in which multiple models, i.e., one student and two teacher models, are incorporated to detect anomalies.

2.3 Loss weighting techniques in anomaly detection

Loss weighting techniques have been widely utilized in deep learning-based approaches that incorporate multiple loss functions. These techniques aim to balance the contributions of different loss functions, each corresponding to distinct objectives or models, to optimize overall performance. In semi-supervised learning methods [29, 30], where supervised and unsupervised learning occur concurrently, adjusting the weight between the supervised and unsupervised losses plays a critical role in overall performance. The total loss \(L_\text{total}\) is formulated as a weighted sum of the supervised loss \(L_\text{supervised}\) and the unsupervised loss \(L_\text{unsupervised}\), where \(\mathrm {\alpha }\) represents the weighting coefficient controlling the contribution of the unsupervised loss term. This can be expressed as follows:
$$\begin{aligned} L_\text{total} = L_\text{supervised} + \mathrm {\alpha } L_\text{unsupervised} \end{aligned}$$
(1)
The weighting factor \(\mathrm {\alpha }\) plays a crucial role in balancing the impact of supervised and unsupervised learning components, ensuring that the model effectively leverages both labeled and unlabeled data during training.
Recent semi-supervised learning approaches have employed a single fixed hyperparameter to regulate the influence of the unsupervised loss on the total loss function [29, 30]. A similar strategy has been adopted in anomaly detection, where a single fixed hyperparameter is used to adjust the importance of the distillation loss in a multitask learning framework [31]. Likewise, a fixed hyperparameter has been utilized to control the contribution of a cosine similarity-based loss within an overall loss function that also incorporates a Euclidean distance-based loss [32]. A fixed weighting strategy may not be optimal across different datasets or anomaly types, as some tasks require stronger global feature representations, while others necessitate higher sensitivity to localized anomalies.
While manually adjusting the hyperparameters through extensive experiments is possible, finding optimal values requires testing and comparing a comprehensive range of values. In many anomaly detection methods where multiple models or objectives are employed [4–6, 33], a balancing mechanism is often absent, resulting in equal weighting for all losses and neglecting the importance of balancing between different models or objectives. In contrast to these previous methods, we dynamically adjust the importance of multiple losses corresponding to different teacher models. Instead of fixing the importance of the two teacher networks with different functionalities as constants, dynamic weighting is applied to enable efficient fusion, effectively transferring specialized knowledge from each teacher to the student network. Dynamically controlling the loss weights during the training process helps our framework achieve better overall performance without requiring extensive hyperparameter tuning.

2.4 Student–teacher methods in anomaly detection

Recently, using a student–teacher framework [4, 5, 32, 34] for spotting anomalies in computer vision has emerged as a promising approach. The student–teacher framework initially emerged from knowledge distillation [35], a model compression technique in which a complex teacher model transfers its learned representations to a more compact and computationally efficient student model while preserving performance. While traditionally designed to enhance model efficiency through size reduction, recent applications of this framework in anomaly detection have been adapted to address the challenges of unsupervised anomaly detection. In unsupervised anomaly detection, the student network is not necessarily smaller than the teacher network. Instead, it is trained using only normal images to mimic the output of the pre-trained teacher network. When presented with data not part of the training set, including anomalous images, the output of the student and the teacher could diverge. A higher discrepancy, especially on anomalous images, arises because the student model is only trained on anomaly-free data.
While some student–teacher-based methods focus on minimizing discrepancies across multiple intermediate layers between the student and teacher networks, many approaches [4–6, 34] primarily minimize the discrepancy between the output features of the student and teacher networks. In the EfficientAD [6] method, however, to avoid outliers on normal images, only a small percentage of the largest pixel-level errors, i.e., the discrepancies between the output features of the student and the teacher network, are considered. Given that a significant amount of image information is disregarded, the student network may fail to capture essential details. In contrast, instead of entirely neglecting the majority of relatively lower pixel-level errors, our approach down-weights these error values while retaining the largest ones. This not only avoids outliers on normal images but also enables the capture of more relevant details, resulting in improved pixel-level detection performance.

3 Methods

As illustrated in Fig. 1, our proposed method consists of multiple models, i.e., a static teacher, a trainable teacher, and a student model, with dynamic loss weighting controlling the importance of the two teacher models. The student model is trained to mimic the output of the static teacher and the trainable teacher on normal images, with weights that change dynamically depending on the performance of the trainable teacher during training. The soft-hard feature loss adjusts the loss values between the student and the static teacher so that features with the largest error values carry more weight than those with lower values. Simultaneously, the trainable teacher is trained to mimic the output of the static teacher on the target dataset, for example MVTec AD, using augmented images. During inference, two anomaly maps are produced and then averaged following the procedure in [6].
Fig. 1
Our proposed framework consists of multiple teachers with a single student, where the losses are dynamically weighted during the training process. The student network outputs a set of feature maps, where half of them mimic the output of the static teacher, i.e., the patch description network (PDN) [6], and the other half mimic the output of the trainable teacher, i.e., an autoencoder-based model [6], via \(L_i^\mathrm {\Omega }\) and \(L_i^\mathrm {\Psi }\), respectively. \(L_i^\mathrm {\Omega }\) includes our proposed soft-hard feature loss. For clarity, this illustration omits the augmentation process, i.e., 20% random alteration of brightness, contrast, or saturation, used to train the trainable teacher, following the approach of the EfficientAD [6] method. It also excludes the training process involving pretraining images, i.e., the flow of the penalty loss computation

3.1 Teacher networks

3.1.1 Static teacher

A static teacher \(\mathcal {T}_\text{static}\) is a pre-trained model whose parameters are fixed during the student's training process. This model guides the student model in learning on normal images based on its knowledge of the pretraining dataset. Some previous works [32, 36] used a model pre-trained on ImageNet, for example, VGG [37], ResNet [38], or EfficientNet [39], as the static teacher. In other works [4–6], an additional pretraining process is employed to further reduce the size of the teacher via the knowledge distillation framework [35]. In our work, to ensure a fair comparison with [6], we use the patch description network (PDN) [6], which was initially introduced in the EfficientAD method. It is a relatively compact fully convolutional network that functions as a teacher model after undergoing knowledge distillation from a larger pre-trained model, such as WideResNet-101 [40]. We adopt the PDN as our static teacher and pretrain it on the ImageNet training dataset \(\mathcal {D}_\text{imagenet} = \{\hat{I}_1, \hat{I}_2,\ldots , \hat{I}_M\}\) by distilling knowledge from a pre-trained WideResNet-101, denoted as \(\mathcal {P}\). Specifically, we use the medium variant of the PDN, which consists of six convolutional layers, as illustrated in Fig. 2.
Fig. 2
The architecture of our static teacher, i.e., the medium variant of PDN [6]. The student network uses the same architecture, except in the last two convolutional layers, where 768 kernels are used instead of 384. Note that, in line with the reference, we also opt not to employ padding in the convolutional layers to enhance the speed of the forward pass. In the figure, the symbols s, k, p, and o represent stride, kernel, padding, and output dimension, respectively. The kernel tuple k, such as (256, 4 \(\times \) 4), denotes the number and size of kernels. In this example, 256 refers to the number of kernels, while 4 \(\times \) 4 represents the kernel size. The output dimension o is presented in the format "Channel \(\times \) Height \(\times \) Width."
During the pretraining process, \(\mathcal {T}_\text{static}\) is trained to mimic the output of \(\mathcal {P}\) on the ImageNet dataset by minimizing the mean squared difference of the outputs over all tuples (j, k, l):
$$\begin{aligned} L_i^\text{pre} = \frac{1}{CWH} \sum _{j=1}^{C} \sum _{k=1}^{W} \sum _{l=1}^{H} (\mathcal {T}_\text{static}(\hat{I}_i)_{j, k, l} - \mathcal {P}(\hat{I}_i)_{j,k,l})^2 \end{aligned}$$
(2)
where \(L_i^\text{pre}\) is the pretraining loss of the i-th training image in a minibatch, \(\hat{I}_i\) is the training image, \(\mathcal {T}_\text{static}(.) \in \mathbb {R}^{ C \times W \times H}\) is the operation yielding the output of the static teacher, and \(\mathcal {P}(.) \in \mathbb {R}^{ C \times W \times H}\) is the operation yielding the output of the pre-trained model. The symbols C, W, and H denote the channel, width, and height of the output dimensions. Averaging \(L_i^\text{pre}\) over all images in each minibatch gives us the final loss for the respective batch. By minimizing the mean squared error between the output of \(\mathcal {T}_\text{static}\) and the output of \(\mathcal {P}\), the static teacher \(\mathcal {T}_\text{static}\) is expected to extract similar features as the pre-trained model \(\mathcal {P}\), despite its smaller architecture.
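For concreteness, the pretraining loss of Eq. (2) can be sketched in PyTorch as follows; teacher_static and pretrained_extractor are hypothetical callables standing in for \(\mathcal {T}_\text{static}\) and \(\mathcal {P}\), and the default mean reduction of F.mse_loss performs exactly the averaging over all (j, k, l) tuples and over the minibatch described above.

    import torch
    import torch.nn.functional as F

    def pretraining_loss(teacher_static, pretrained_extractor, images):
        """Eq. (2): mean squared difference between the static teacher (PDN)
        and the frozen WideResNet-101 features on a batch of ImageNet images."""
        with torch.no_grad():                      # P is frozen
            target = pretrained_extractor(images)  # P(I), shape (B, C, W, H)
        output = teacher_static(images)            # T_static(I), same shape
        # Mean reduction averages over channels, width, height, and the batch.
        return F.mse_loss(output, target)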
As commonly known within the knowledge distillation framework [35], under strict latency requirements or in low-resource environments, one may wish to further reduce the static teacher's architecture to achieve faster inference. However, this comes with the trade-off of potentially losing information, as a smaller network might capture less information than a larger one. Conversely, one may opt for a larger static teacher to improve detection performance, at the expense of inference speed. For the details of the architecture and the pretraining process of the static teacher, we follow the procedure for pretraining the patch description network (PDN) described in [6]. Since our \(\mathcal {T}_\text{static}\) model adopts the PDN architecture, which extracts features of local image regions, it allows the model to capture local structural anomalies [4–6]. While the static teacher, trained on a large-scale dataset such as ImageNet, provides strong feature extraction capabilities, it may not fully adapt to the specific characteristics of the target dataset. This limitation arises because the static teacher does not undergo fine-tuning on the industrial dataset, which can lead to suboptimal performance in detecting domain-specific anomalies. To address this, we introduce a trainable teacher, which learns directly from the target dataset, thereby improving adaptability and enhancing anomaly detection performance.

3.1.2 Trainable teacher

Because the parameters of \(\mathcal {T}_\text{static}\) are fixed during the student's training process, detection performance relies heavily on how transferable the features learned on the pretraining dataset are to the target dataset. To obtain features that are more adaptable to the target dataset, we adopt an additional teacher, i.e., a trainable teacher \(\mathcal {T}_\text{trainable}\). While \(\mathcal {T}_\text{static}\) helps exploit the knowledge of the pretraining dataset, \(\mathcal {T}_\text{trainable}\) helps the framework adapt more effectively to the target dataset, for example, MVTec AD. This is because the parameters of \(\mathcal {T}_\text{trainable}\) are trainable and adjusted during the training process based on the target dataset, potentially allowing it to better adapt to and learn from the specific characteristics of the data. This aligns with the optimization principle of adjusting model parameters to minimize errors on the given dataset. When there is a high discrepancy between the pretraining dataset and the target dataset, we expect this additional teacher to contribute more effectively to improving overall model performance.
Given the training dataset of anomaly detection \(\mathcal {D}_\text{ad} = \{I_1, I_2,\ldots , I_N\}\), the randomly initialized \(\mathcal {T}_\text{trainable}\) network learns to mimic the output of \(\mathcal {T}_\text{static}\) on augmented images by minimizing the mean squared difference of the outputs:
$$\begin{aligned} L_i^\mathrm {\Theta } = \frac{1}{CWH} \sum _{j=1}^{C} \sum _{k=1}^{W} \sum _{l=1}^{H} (\mathcal {T}_\text{trainable}(I_i^\text{aug})_{j, k, l} - \mathcal {T}_\text{static}(I_i^\text{aug})_{j,k,l})^2 \end{aligned}$$
(3)
where \(L_i^\mathrm {\Theta }\) is the trainable teacher loss at the i-th training step, \(I_i^\text{aug}\) is a randomly augmented training image of the anomaly detection dataset at the i-th training step, and \(\mathcal {T}_\text{trainable}(.) \in \mathbb {R}^{ C \times W \times H}\) is the operation yielding the output of the trainable teacher. We adopt an autoencoder-based model for our trainable teacher, which processes the entire input image to capture global context, as in [5, 6]. Specifically, we use the same architecture as in [6], allowing our model to also detect global logical anomalies [5, 6]. Augmented images are used to train \(\mathcal {T}_\text{trainable}\) to help the model learn about variations in images of the target dataset. The architecture of the trainable teacher, a convolutional autoencoder, is illustrated in Fig. 3.
Fig. 3
The architecture of our trainable teacher, i.e., an autoencoder-based model [8]. In the figure, the symbols s, k, p, r, and o represent stride, kernel, padding, resizing size, and output dimension, respectively. The kernel tuple k, such as (32, 4 \(\times \) 4), denotes the number and size of kernels. Here, 32 represents the number of kernels, while 4 \(\times \) 4 indicates the kernel size. The output dimension o is presented in the format "Channel \(\times \) Height \(\times \) Width." The resizing size r defines the target size for bilinear interpolation. The padding value p indicates the number of zero-padding pixels added to each border of the input features. The final output dimension is adjusted to match that of the static teacher, i.e., \(384 \times 56\times 56\), for loss computation between the trainable and the static teachers

3.2 Student networks

Given the dataset \(\mathcal {D}_\text{ad}\), a student network \(\mathcal {S}\) learns from the static teacher and the trainable teacher on normal images by mimicking the output of \(\mathcal {T}_\text{static}\) and \(\mathcal {T}_\text{trainable}\), respectively. \(\mathcal {S}\) shares a similar architecture with \(\mathcal {T}_\text{static}\) to keep inference time low, since the PDN model [6] used for \(\mathcal {T}_\text{static}\) has low computational complexity, consisting of only six convolutional layers. Note that the output layer of \(\mathcal {S}\) has twice the number of channel dimensions compared to \(\mathcal {T}_\text{static}\) and \(\mathcal {T}_\text{trainable}\), i.e., \(\mathcal {S}(.) \in \mathbb {R}^{ 2C \times W \times H}\). This expansion allows \(\mathcal {S}\) to simultaneously learn from both \(\mathcal {T}_\text{static}\) and \(\mathcal {T}_\text{trainable}\): half of the total channels correspond to the static teacher, while the remaining half corresponds to the trainable teacher. The student architecture, which is similar to the static teacher architecture, is shown in Fig. 4.
Fig. 4
The architecture of our student network follows the exact structure of the static teacher network, except for the last two convolutional layers, where the number of channels is doubled from 384 in the static teacher to 768 in the student network

3.2.1 Student-static teacher loss

The common approach to the student–teacher loss [4, 34, 36] considers the error between the student and the static teacher by simply averaging all pixel-level errors over the entire output feature map. Another work [6] introduced a hard features loss in which only a small percentage of the errors are averaged, while the rest are ignored. The hard features loss for the i-th image is defined as follows:
$$\begin{aligned} L_i^\mathrm {\Omega \_{\tiny {hard}}} = \frac{1}{CWH} \sum _{j=1}^{C} \sum _{k=1}^{W} \sum _{l=1}^{H} p_{j,k,l}^{\text{hard}}(\mathcal {S}(I_i)_{j, k, l} - \mathcal {T}_\text{static}(I_i)_{j,k,l})^2 \end{aligned}$$
(4)
where \(p_{j,k,l}^{\text{hard}} = 1\) if \(\mathcal {E}_{j, k, l}^\mathrm {\Omega } = (\mathcal {S}(I_i)_{j, k, l} - \mathcal {T}_\text{static}(I_i)_{j,k,l})^2\) with \(\mathcal {E}_{j, k, l}^\mathrm {\Omega } \in \mathcal {E}^\mathrm {\Omega }\), i.e., the squared difference between the output of the student and the static teacher, belongs to the top 0.1% largest values of the entire feature map over \(j=1,\ldots ,C\), \(k=1,\ldots ,W\), and \(l=1,\ldots ,H\); otherwise, \(p_{j,k,l}^{\text{hard}} = 0\). The authors introduced this loss with the motivation of considering only the relevant parts of images, indicated by the highest error values, while avoiding outliers during training. However, this approach may completely ignore important information in the images, because the majority of the data points, i.e., 99.9%, are not considered in the loss calculation of a single backpropagation step.
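As an illustration, the hard feature loss of Eq. (4) can be sketched in PyTorch as follows; tensor names are ours, and torch.quantile is used as one way to obtain the per-image top-0.1% threshold.

    import torch

    def hard_feature_loss(student_out, teacher_out, top_fraction=0.001):
        """Eq. (4): only the top 0.1% largest squared errors per image are
        kept; the sum is still divided by the full C*W*H grid size."""
        err = (student_out - teacher_out) ** 2   # E^Omega, shape (B, C, W, H)
        flat = err.flatten(1)                    # (B, C*W*H)
        # Per-image threshold at the (1 - top_fraction) quantile of the errors.
        thresh = torch.quantile(flat, 1.0 - top_fraction, dim=1, keepdim=True)
        p_hard = (flat >= thresh).float()        # binary mask p^hard
        return ((p_hard * flat).sum(dim=1) / flat.shape[1]).mean()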

3.2.2 Student-trainable teacher loss

As mentioned before, in addition to mimicking the output of \(\mathcal {T}_\text{static}\), the student is also trained to mimic the output of \(\mathcal {T}_\text{trainable}\):
$$\begin{aligned} L_i^\mathrm {\Psi } = \frac{1}{CWH} \sum _{j=C+1}^{2C} \sum _{k=1}^{W} \sum _{l=1}^{H} (\mathcal {S}(I_i^\text{aug})_{j, k, l} - \mathcal {T}_\text{trainable}(I_i^\text{aug})_{j-C,k,l})^2 \end{aligned}$$
(5)
The \(L_i^\mathrm {\Psi }\) loss encourages the student to mimic the trainable teacher on augmented images. Augmented images are used here because the \(\mathcal {T}_\text{trainable}\) model is also trained on them. At test time, however, only non-augmented images are needed for all of our models. Since the student network has twice the number of channels as the trainable teacher, the outputs must be aligned: the student network's channels range from \(\mathrm {C+1}\) to 2C, while the trainable teacher's channel indices are obtained by subtracting C so that they match correctly. This transformation ensures that both networks operate on comparable feature representations despite the difference in their channel dimensions. One may notice that the \(L_i^\mathrm {\Psi }\) loss is calculated without the soft-hard feature loss and the penalty loss. This is due to the adaptability of the \(\mathcal {T}_\text{trainable}\) parameters, which are modified during the training process, and because the model is not initially trained on a pretraining dataset. Those two loss functions address challenges specific to the student's learning from a static teacher, whose parameters are fixed and pre-trained on a separate pretraining dataset.
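The channel alignment of Eq. (5) reduces to a slice of the student's output; a minimal sketch, with illustrative names:

    import torch.nn.functional as F

    def student_trainable_loss(student_out, trainable_out):
        """Eq. (5): channels C..2C-1 of the student are compared against
        channels 0..C-1 of the trainable teacher (the j - C index shift)."""
        C = trainable_out.shape[1]
        return F.mse_loss(student_out[:, C:], trainable_out)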

3.3 Soft-hard feature loss

The hard feature loss, a traditionally used approach in anomaly detection, focuses on the highest error values during backpropagation, thereby prioritizing the most critical regions for anomaly detection. While effective in cases where the difference between normal and abnormal features is significant, this approach becomes less reliable when anomalies exhibit subtle deviations, as it may fail to capture minor but important variations. In contrast, a soft feature loss, which normalizes and propagates all feature values without filtering, may retain excessive information, potentially leading to an increased false-positive rate by misclassifying normal samples as anomalies. To overcome this issue, as illustrated in Fig. 5, we introduce a soft-hard feature loss that avoids completely ignoring most of the data points while maintaining the importance of the most relevant parts of the image. The soft-hard feature loss is a hybrid loss function designed to balance the trade-off between hard feature loss and soft feature loss in anomaly detection tasks. It aims to enhance feature-level anomaly representation by selectively focusing on high-error regions while still incorporating global feature information.
Fig. 5
Hard feature loss emphasizes only a small portion of the Gaussian function, making it effective for highly distinguishable anomalies where the difference between normal and abnormal features is significant. Soft feature loss, on the other hand, utilizes the entire feature space (grid) with varying intensities, making it more suitable for subtle or dispersed anomalies that are harder to isolate. Soft-hard feature loss takes a hybrid approach, where high-error regions are emphasized, but lower error regions are also considered. This dynamic balance achieves robust anomaly detection by effectively capturing both strong and weak anomaly signals, ensuring adaptability to various anomaly types
Our soft-hard feature loss is defined as follows:
$$\begin{aligned} & L_i^\mathrm {\Omega \_{\tiny {softhard}}} = \alpha L_i^\mathrm {\Omega \_{\tiny {soft}}} + (1-\alpha )L_i^\mathrm {\Omega \_{\tiny {hard}}} \end{aligned}$$
(6)
$$\begin{aligned} & L_i^\mathrm {\Omega \_{\tiny {soft}}} = \frac{1}{CWH} \sum _{j=1}^{C} \sum _{k=1}^{W} \sum _{l=1}^{H} p_{j,k,l}^{\text{soft}}\mathcal {E}_{j,k,l}^\mathrm {\Omega } \end{aligned}$$
(7)
$$\begin{aligned} & p_{j,k,l}^{\text{soft}} = \frac{\mathcal {E}_{j,k,l}^\mathrm {\Omega } - \min (\mathcal {E}^\mathrm {\Omega })}{\max (\mathcal {E}^\mathrm {\Omega }) - \min (\mathcal {E}^\mathrm {\Omega })} \end{aligned}$$
(8)
where the loss \(L_i^\mathrm {\Omega \_{\tiny {softhard}}}\) consists of the hard features loss and the soft features loss, with the \(\alpha \) parameter controlling the contribution of the two losses. The soft features loss \(L_i^\mathrm {\Omega \_{\tiny {soft}}}\) is weighted by the min–max normalized errors. Intuitively, this gives higher weights to higher error values and lower weights to lower error values. By extending the hard features loss to the soft-hard features loss, we ensure that a broader range of loss values contributes to the optimization process, while concurrently mitigating the influence of less relevant parts or outliers. This approach allows the model to focus on minimizing errors where they are most significant while still considering the entire loss distribution, thereby enhancing the model's pixel-level performance.
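A sketch of Eqs. (6)-(8) in PyTorch follows, reusing the hard-loss logic above; the small epsilon in the min-max normalization is our own safeguard against division by zero and is not part of Eq. (8).

    import torch

    def soft_hard_feature_loss(student_out, teacher_out, alpha=0.5,
                               top_fraction=0.001):
        """Eqs. (6)-(8): alpha blends the min-max-weighted soft loss with
        the top-0.1% hard loss."""
        err = ((student_out - teacher_out) ** 2).flatten(1)  # E^Omega, (B, N)
        n = err.shape[1]
        # Soft term (Eqs. (7)-(8)): weight each error by its normalized magnitude.
        e_min = err.min(dim=1, keepdim=True).values
        e_max = err.max(dim=1, keepdim=True).values
        p_soft = (err - e_min) / (e_max - e_min + 1e-12)  # epsilon: our safeguard
        loss_soft = (p_soft * err).sum(dim=1) / n
        # Hard term (Eq. (4)): only the largest errors contribute.
        thresh = torch.quantile(err, 1.0 - top_fraction, dim=1, keepdim=True)
        loss_hard = ((err >= thresh).float() * err).sum(dim=1) / n
        return (alpha * loss_soft + (1.0 - alpha) * loss_hard).mean()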
The soft-hard feature loss technique is expected to be particularly effective in scenarios where there are many areas in the images considered hard or difficult to train, such as background clutter, object variability, and other complexities inherent in real-world image data. Only considering the top 0.1% error values as in the hard features loss may not adequately address the complexities present in the whole area. We provide an example highlighting the influence of the soft-hard feature loss in Fig. 6.
Fig. 6
Visualization of the loss masks at a specific training step for three different approaches: hard, soft, and soft-hard. The pixel intensity within each mask reflects the contribution of a pixel’s feature vector to the backpropagation process. The hard feature loss might miss important information during backpropagation, while the soft feature loss might include many irrelevant details, for example, shadow, background, etc. Our proposed soft-hard feature loss is designed to strike a balance, capturing and emphasizing important features while suppressing less relevant information. In this visualization, the hard mask method ignores 99.9% of the error values based on the approach in [6], and for the soft-hard mask, the parameter \(\alpha \) is set to 0.5
In addition to \(L_i^\mathrm {\Omega \_{\tiny {softhard}}}\) loss, we also incorporate penalty loss \(L_i^\mathrm {\Omega \_{\tiny {penalty}}}\) [6] which is defined as follows:
$$\begin{aligned} L_i^\mathrm {\Omega \_{\tiny {penalty}}} = \frac{1}{CWH} \sum _{j=1}^{C} \sum _{k=1}^{W} \sum _{l=1}^{H} (\mathcal {S}(\hat{I}_i)_{j, k, l})^2 \end{aligned}$$
(9)
where \(\hat{I}_i\) is a randomly selected image from the pretraining dataset, i.e., ImageNet, at the i-th training step. This penalty loss discourages the student from imitating the teacher on images beyond the normal images in the training set. Hence, the final loss between the student and the static teacher is formulated as:
$$\begin{aligned} L_i^\mathrm {\Omega } = L_i^\mathrm {\Omega \_{\tiny {softhard}}} + L_i^\mathrm {\Omega \_{\tiny {penalty}}} \end{aligned}$$
(10)
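Combining the two terms, Eqs. (9)-(10) can be sketched as follows, reusing soft_hard_feature_loss from above; passing in only the static-teacher half of the student's output channels is our assumption, since Eq. (9) sums over C channels.

    def student_static_loss(student_out, teacher_out, student_out_imagenet,
                            alpha=0.5):
        """Eq. (10): soft-hard loss on normal target images plus the
        penalty of Eq. (9) on randomly drawn ImageNet images."""
        l_softhard = soft_hard_feature_loss(student_out, teacher_out, alpha)
        # Eq. (9): pushing the student's response toward zero on out-of-domain
        # images discourages imitating the teacher beyond normal data.
        l_penalty = (student_out_imagenet ** 2).mean()
        return l_softhard + l_penalty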

3.4 Dynamic loss weighting

Dynamic loss weighting is a mechanism that automatically adjusts the contribution of different teacher models during training. Instead of using fixed weighting, this approach dynamically balances the importance of the static and trainable teachers based on their reliability at each iteration. The model ensures that when the trainable teacher is unreliable in early training stages, the static teacher takes precedence. Conversely, as the trainable teacher improves, it gradually contributes more to the final loss calculation. This adaptive strategy eliminates the need for extensive hyperparameter tuning while improving learning stability and performance. During training on the \(\mathcal {D}_\text{ad}\) dataset, we train the \(\mathcal {S}\) and \(\mathcal {T}_\text{trainable}\) models simultaneously, and aggregate each of the loss terms with a weighted sum:
$$\begin{aligned} L_i = w_i^\mathrm {\Omega } L_i^\mathrm {\Omega } + w_i^\mathrm {\Psi } L_i^\mathrm {\Psi } + L_i^\mathrm {\Theta } \end{aligned}$$
(11)
where \(w_i^\mathrm {\Omega }\) and \(w_i^\mathrm {\Psi }\) control the weighting at the i-th training step, allowing the student model to dynamically balance the importance of the static teacher and the trainable teacher, respectively. Previous works [5, 6] did not consider balancing the importance of multiple models based on their reliability; the individual loss terms of the different models were treated equally. Although the weight parameters can be manually tuned with a single fixed value over the training iterations, the number of experiments over exhaustive hyperparameter combinations grows rapidly as the number of models or parameters increases. In our approach, we introduce a novel dynamic loss weighting technique in which the weight hyperparameters are dynamically adjusted during training based on the reliability of the related model. Since we are dealing with an unsupervised problem, where there is no ground truth to validate the actual detection performance on the training or validation set, our framework uses the \(L_i^\mathrm {\Theta }\) loss, i.e., the trainable teacher loss between the trainable teacher and the static teacher, as the reference to adjust the weight parameters, defined as follows:
$$\begin{aligned} & w_i^\mathrm {\Omega } = \frac{L_i^\mathrm {\Theta }}{1 + L_i^\mathrm {\Theta }} \end{aligned}$$
(12)
$$\begin{aligned} & w_i^\mathrm {\Psi } = \frac{1}{1 + L_i^\mathrm {\Theta }} \end{aligned}$$
(13)
During the early stages of training, the \(L_i^\mathrm {\Theta }\) loss tends to be relatively high, as the trainable teacher has not yet been sufficiently trained. Consequently, it is necessary to dynamically decrease \( w_i^\mathrm {\Psi }\) and increase \(w_i^\mathrm {\Omega }\) to mitigate the influence of the trainable teacher when its reliability is low, allowing the model to rely more on the static teacher for stable guidance. As training progresses and the trainable teacher becomes more reliable, the weight distribution gradually shifts in the opposite direction, increasing the contribution of the trainable teacher. Notably, the sum of \(w_i^\mathrm {\Omega }\) and \(w_i^\mathrm {\Psi }\) is always equal to 1, ensuring a balanced distribution of contributions throughout training. The proposed dynamic loss weighting mechanism effectively adjusts these weights in an adaptive manner, optimizing the balance between the static and trainable teachers without requiring extensive hyperparameter tuning.
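The weighting of Eqs. (11)-(13) reduces to a few lines; detaching \(L_i^\mathrm {\Theta }\) before computing the weights is our assumption, made so that the weighting itself does not feed gradients back into the trainable teacher loss.

    def total_loss(l_omega, l_psi, l_theta):
        """Eqs. (11)-(13): weights derived from the trainable teacher loss.
        Early in training l_theta is large, so w_omega -> 1 and the static
        teacher dominates; as l_theta shrinks, weight shifts toward w_psi."""
        lt = l_theta.detach()         # assumption: weights treated as constants
        w_omega = lt / (1.0 + lt)     # Eq. (12)
        w_psi = 1.0 / (1.0 + lt)      # Eq. (13); w_omega + w_psi == 1
        return w_omega * l_omega + w_psi * l_psi + l_theta  # Eq. (11)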
The dynamic loss weighting approach is grounded in the principle of adaptive optimization [41], emphasizing the model's capability to adjust its learning strategy dynamically in response to the information available at each training iteration. Through dynamic weight adjustment of the losses, the method ensures that the model prioritizes guidance from the more reliable teacher, leading to better optimization. In scenarios where time and resources for hyperparameter optimization are limited, the dynamic loss weighting technique eliminates the need for extensive manual tuning of the hyperparameters that balance the teachers. In addition, we conducted experiments using a single fixed value for the weight parameters and compared the performance with our proposed dynamic loss weighting.
Algorithm 1 presents the training procedure of the student–teacher network under the proposed framework. Instead of employing a frozen pre-trained model as the static teacher, our approach utilizes the patch description network (PDN), which is derived through knowledge distillation from a larger pre-trained model. This process follows the methodology of EfficientAD [6], ensuring consistency with prior works while leveraging the benefits of distilled knowledge for enhanced anomaly detection performance.
Algorithm 1 Training procedure of the proposed method
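Since the body of Algorithm 1 is not reproduced here, the following is a condensed sketch of one training iteration consistent with the description above, reusing student_static_loss and total_loss from the sketches in Sect. 3; the model handles, the detaching of the trainable teacher's output in \(L_i^\mathrm {\Psi }\), and the omission of output normalization are our assumptions.

    import torch

    def train_step(student, teacher_static, teacher_trainable, optimizer,
                   image, image_aug, imagenet_image, alpha=0.5):
        """One iteration of the proposed training procedure (a sketch)."""
        with torch.no_grad():                    # the static teacher is frozen
            t_static = teacher_static(image)     # (B, C, W, H)
            t_static_aug = teacher_static(image_aug)
        s_out = student(image)                   # (B, 2C, W, H)
        C = t_static.shape[1]
        # L^Omega (Eq. (10)): soft-hard + penalty loss vs. the static teacher.
        l_omega = student_static_loss(s_out[:, :C], t_static,
                                      student(imagenet_image)[:, :C], alpha)
        # L^Theta (Eq. (3)): the trainable teacher mimics the static teacher.
        t_train = teacher_trainable(image_aug)
        l_theta = torch.mean((t_train - t_static_aug) ** 2)
        # L^Psi (Eq. (5)): the student mimics the trainable teacher.
        l_psi = torch.mean((student(image_aug)[:, C:] - t_train.detach()) ** 2)
        loss = total_loss(l_omega, l_psi, l_theta)  # Eqs. (11)-(13)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()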

4 Experiments

4.1 Dataset and evaluation metrics

The MVTec dataset [2] is widely recognized as a prominent benchmark in the field of anomaly detection in computer vision. It consists of 15 categories, comprising 10 distinct objects and 5 unique textures. The training data exclusively consists of normal (anomaly-free) data, with an average of 240 image samples per category. The test dataset encompasses a variety of abnormal cases and a single normal case, with each case organized into different folders. Additionally, each abnormal case is paired with a corresponding masking image, serving as the ground truth.
The VisA dataset [21] is more recent and larger than the MVTec dataset. This dataset comprises 12 categories with a total of 10,821 high-resolution images. Among these images, 9621 represent normal samples, while 1200 represent abnormal samples, with image-level annotations provided in CSV files. Similarly to the MVTec dataset, the VisA dataset also includes annotations of ground truth masks for abnormal images.
The availability of both image-level and pixel-level labeling in both the MVTec and VisA datasets facilitates the evaluation of models for two distinct tasks: the detection task and the localization or segmentation task. The detection task involves identifying anomalies as we would in an image classification problem, distinguishing between normal and abnormal instances at the image level. In contrast, the localization or segmentation task focuses on precisely locating anomalies at the pixel level, providing detailed information about the specific pixels that make up anomalous regions.
We evaluate our method using two widely used metrics in anomaly detection, i.e., AU-ROC and AU-PRO. Specifically, we evaluate detection performance with AU-ROC and localization performance with AU-PRO. As suggested by [4, 6, 42], the AU-PRO performance is computed up to a false-positive rate of 30%. In addition, we evaluate the computational performance of our method using several efficiency metrics, including latency, throughput, CUDA memory usage, number of model parameters, and floating-point operations (FLOPs). Latency measures the time taken to infer detection results for a single input image, while throughput quantifies the number of input images processed per second through batch processing. To compute CUDA memory usage and FLOPs, we utilize PyTorch's official profiling framework [43].

4.2 Implementation details

For \(\mathcal {T}_\text{static}\) and \(\mathcal {S}\) networks, we employ the medium variant of the PDN architecture [6]. To speed up the forward pass, we also disable the padding in the convolutional layers of the PDN architecture. Meanwhile, for \(\mathcal {T}_\text{trainable}\), we utilize an autoencoder-based model following the same reference. Given the structural similarity between our approach and the EfficientAD method [6], we establish the EfficientAD method as the baseline for our work.
Before training, we initialized the convolutional layer parameters using PyTorch’s default initialization method [43]. The image input is resized to a resolution of \(256 \times 256\) using bilinear interpolation, and then normalized using the mean and standard deviation values of the torchvision [43] pre-trained models for the RGB channels, that is [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225], respectively. For all models, we set the output dimensions to \(C=384\), \(W=56\), and \(H=56\). For the output of the \(\mathcal {T}_\text{static}\), we follow EfficientAD to normalize the output channel-wise using the training set of the anomaly detection dataset.
For each experiment, the training process consists of 70,000 iterations, with a random training image selected at each iteration. The Adam [44] optimizer is utilized with an initial learning rate of 1e-4 and a weight decay factor of 1e-5. After 95% of the total iterations, the learning rate is reduced to 1e-5. For the hyperparameters \(\beta _1\) and \(\beta _2\) of the Adam optimizer, we use the default values of 0.9 and 0.999, respectively. For the augmentation applied when training the trainable teacher, as shown in Fig. 7, we randomly change either brightness, contrast, or saturation by up to 20% in either direction.
Fig. 7
Visualization of the augmentation techniques applied to the trainable teacher. We randomly adjust brightness, contrast, or saturation by up to ±20%. The capsule examples from the MVTec AD dataset illustrate the variations introduced through augmentation, demonstrating how the model learns more robust representations by encountering diverse transformations during training
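A sketch of this augmentation with torchvision follows; the use of ColorJitter and the sampling of exactly one property per image are our reading of the description.

    import random
    from torchvision import transforms

    def augment(image, strength=0.2):
        """Randomly alter exactly one of brightness, contrast, or saturation
        by up to +/-20%, as used to train the trainable teacher."""
        jitter = random.choice([
            transforms.ColorJitter(brightness=strength),
            transforms.ColorJitter(contrast=strength),
            transforms.ColorJitter(saturation=strength),
        ])
        return jitter(image)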
During inference, we utilize quantile-based normalization as described in [6]. This process generates the final anomaly map, which is a combination of two anomaly maps: one based on the mean squared difference between the output of \(\mathcal {S}\) and \(\mathcal {T}_\text{static}\) and another based on the mean squared difference between the output of \(\mathcal {S}\) and \(\mathcal {T}_\text{trainable}\). The combined anomaly map is then resized to match the original image input size utilizing bilinear interpolation.
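A sketch of this inference step is shown below; the quantile-based normalization of each map [6] is omitted, and all names are illustrative.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def anomaly_map(student, teacher_static, teacher_trainable, image, out_size):
        """Average the student-vs-static and student-vs-trainable anomaly
        maps, then resize to the input resolution."""
        s_out = student(image)              # (B, 2C, W, H)
        t_static = teacher_static(image)    # (B, C, W, H)
        t_train = teacher_trainable(image)  # (B, C, W, H)
        C = t_static.shape[1]
        # Per-pixel mean squared difference against each teacher.
        map_static = ((s_out[:, :C] - t_static) ** 2).mean(dim=1, keepdim=True)
        map_train = ((s_out[:, C:] - t_train) ** 2).mean(dim=1, keepdim=True)
        combined = 0.5 * (map_static + map_train)
        return F.interpolate(combined, size=out_size, mode="bilinear",
                             align_corners=False)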
We implemented our method in PyTorch and used a computer with an NVIDIA RTX 3090 GPU, an AMD Ryzen 9 5900X CPU, and 64 GB of RAM. With these settings, training takes about 2 h for the 70,000 iterations.

4.3 Quantitative results

In this section, we quantitatively compare our method with the baseline and other existing methods on the MVTec AD and VisA datasets. Besides, we conduct an ablation study of the proposed methods on the MVTec AD dataset. Finally, we provide computational performance results for our method compared to the baseline.

4.3.1 Comparison with existing methods

We compare our method with the baseline and with existing methods that also incorporate the student–teacher framework for a fair comparison. The baseline likewise uses a three-model architecture with the medium variant of PDN [6]. In Table 1, we can see that the proposed method, which utilizes dynamic loss weighting and the soft-hard feature loss, achieves state-of-the-art results on the MVTec AD and VisA datasets for both detection and localization, measured by AU-ROC and AU-PRO, respectively. Compared to the baseline method, we achieved a 0.6% gain in AU-PRO and 0.2% in AU-ROC on the MVTec AD dataset. Similarly, on the VisA dataset, we observed improvements of 0.3% in AU-PRO and 0.4% in AU-ROC relative to the baseline. It is noteworthy that we consistently achieved stable performance across multiple trials, as indicated by the standard deviations shown in Table 1. While these gains may appear modest, we achieved notable improvements in several categories within the MVTec AD dataset. Moreover, in practical applications, model performance is influenced by factors beyond accuracy, including training time, inference speed, and computational efficiency. Therefore, achieving even a small performance improvement without additional computational cost compared to the baseline is significant. This is particularly crucial in industrial manufacturing, where even a slight increase in accuracy can have a substantial impact due to the large-scale production environment.
Table 1
Comparison with existing student–teacher-based methods on the MVTec AD and VisA datasets

Method | MVTec AD AU-ROC (Det.) | MVTec AD AU-PRO (Loc.) | VisA AU-ROC (Det.) | VisA AU-PRO (Loc.)
PaDiM [25] | 89.7 | 91.2 | 86.3 | 87.2
PatchCore [26] | 99.1 | 93.5 | 92.0 | 87.7
FastFlow [45] | 95.0 | 91.4 | 92.5 | 90.0
MKDAD [32] | 87.7 | – | – | –
STFPM [36] | 95.5 | 92.1 | – | –
GCAD [5] | 89.1 | 91.0 | 83.7 | 83.7
S-T [4] | 93.2 | 92.4 | 94.6 | 93.0
AST [34] | 98.9 | 81.2 | 94.9 | 81.5
R.D. [46] | 90.8 | 89.5 | 91.4 | 90.0
EfficientAD [6] | 99.1 | 93.5 | 98.1 | 94.0
Proposed | 99.3 (±0.0883) | 94.1 (±0.1009) | 98.5 (±0.1124) | 94.3 (±0.0757)

The performance values of EfficientAD, GCAD, S-T, and AST are retrieved from [6], while the others are retrieved from the original papers. Det. and Loc. abbreviate Detection and Localization, respectively. The mean and standard deviation reported in the table are based on five runs. A dash (–) indicates that the value was not reported
Table 2
Comparison with the baseline model for each category of the MVTec AD dataset

Category | Baseline AU-ROC (Det.) | Proposed AU-ROC (Det.) | Baseline AU-PRO (Loc.) | Proposed AU-PRO (Loc.)
Tile | 99.9 | 99.9 | 88.5 | 88.7
Wood | 99.4 | 99.6 | 90.7 | 91.2
Carpet | 99.5 | 99.6 | 92.7 | 93.4
Grid | 100 | 100 | 89.2 | 89.2
Leather | 100 | 100 | 98.2 | 98.3
Average Texture | 99.76 | 99.82 | 91.86 | 92.16
Capsule | 97.9 | 98.4 | 96.9 | 97.7
Hazelnut | 100 | 100 | 95.3 | 95.4
Screw | 95.7 | 97.5 | 96.6 | 97.0
Cable | 94.5 | 95.6 | 90.8 | 91.1
Pill | 98.4 | 99.3 | 96.5 | 96.5
Metal Nut | 99.7 | 99.7 | 94.0 | 94.5
Toothbrush | 100 | 100 | 95.0 | 95.6
Transistor | 99.9 | 99.9 | 91.0 | 92.5
Bottle | 100 | 100 | 95.7 | 95.8
Zipper | 99.6 | 99.8 | 92.1 | 94.3
Average Object | 98.6 | 99.0 | 94.4 | 95.0

Bold text represents the best result
The MVTec AD dataset comprises 15 distinct categories, which can be broadly classified into five texture categories and ten object categories. In this study, we conduct experiments to compare the performance of our proposed method against the baseline across all categories. As observed in the results, certain categories, such as Grid, Leather, and Hazelnut, have already reached saturation at 100% detection performance. However, Table 2 highlights notable improvements in anomaly localization, particularly for the Zipper and Transistor categories, with AU-PRO gains of 2.2% and 1.5%, respectively. Additionally, for image-level anomaly detection, significant improvements are observed in the Screw and Cable categories, achieving AU-ROC gains of 1.8% and 1.1%, respectively. Furthermore, the results indicate consistent performance enhancements, or at the very least comparable outcomes, across all dataset categories compared to the baseline method, underscoring the robustness and effectiveness of our proposed approach.
All of these performance gains are achieved without affecting the inference speed, since we adopt the same architecture, i.e., the medium variant of PDN and the autoencoder-based model, and the same inference process as the baseline method. These experimental results indicate that our proposed method detects anomalies better.

4.3.2 Ablation study of the proposed methods

In this section, we conduct an ablation study to explore the impact of each of the proposed techniques, specifically dynamic loss weighting and the soft-hard feature loss. From Table 3, we can see that incorporating the soft-hard feature loss enhances anomaly localization. This is expected because the soft-hard feature loss utilizes pixel information more comprehensively than the hard feature loss. The same table also demonstrates that dynamic loss weighting improves both detection and localization, validating the idea that dynamically adjusting the importance of multiple models based on their reliability outperforms simple equal weighting. Finally, combining both proposed approaches yields even better performance, leveraging the advantages of both methods.
Table 3
Ablation study of the proposed methods on the MVTec AD dataset

Dynamic weighting | Soft-hard loss | AU-ROC (Detection) | AU-PRO (Localization)
– | – | 99.1 | 93.5
– | ✓ | 99.1 | 93.7
✓ | – | 99.2 | 93.8
✓ | ✓ | 99.3 | 94.1

4.3.3 The impact of \(\alpha \) on the soft-hard feature loss

In Table 4, we conducted experiments with various values of \(\alpha \) to assess its impact. As shown in the table, when utilizing the soft-hard feature loss with varying \(\alpha \), the localization performance consistently exceeds 93.5%, outperforming the localization performance achieved by the baseline method, i.e., the hard feature loss used in EfficientAD. This verifies the ability of the introduced soft-hard loss to capture pixel-level information more thoroughly, regardless of the value of \(\alpha \). While utilizing the soft-hard feature loss alone does not improve detection, i.e., image-level performance, integrating it with dynamic loss weighting contributes to enhancing the overall performance of our final model, as demonstrated in Table 3. From the same table, we can also observe no significant differences when adjusting the parameter \(\alpha \). We believe this is because the soft and hard feature losses contribute roughly equally to the optimization process on the MVTec AD dataset, resulting in a balance that is maintained across a range of \(\alpha \) values.
Table 4
Experiment on different \(\alpha \) values of the soft-hard feature loss on the MVTec AD dataset

Method | \(\alpha \) | AU-ROC (Detection) | AU-PRO (Localization)
Baseline (Hard) | 1.0 | 99.1 | 93.5
Soft-hard | 0.1 | 99.0 | 93.7
Soft-hard | 0.3 | 99.1 | 93.7
Soft-hard | 0.5 | 99.1 | 93.7
Soft-hard | 0.7 | 99.1 | 93.7
Soft-hard | 0.9 | 99.0 | 93.8

In this experiment, the dynamic loss weighting technique is not utilized

4.3.4 Comparison of fixed weighting versus dynamic weighting mechanism

We additionally conduct experiments comparing our proposed dynamic loss weighting with a fixed weighting technique, as shown in Table 5. For the fixed weighting approach, we experiment with unequal weights to investigate the impact of the different models, instead of using equal weights as in the baseline method. Although there are many possible combinations of fixed weight values, to simplify the experiment, we focused on four specific values: 0.1, 0.3, 0.7, and 0.9, with the constraint that \(w_i^\mathrm {\Omega }\) and \(w_i^\mathrm {\Psi }\) sum to 1. From the table, we can see that giving more weight to the \(L_i^\mathrm {\Psi }\) loss, i.e., the squared difference between the student and the trainable teacher, is generally better than giving more weight to the \(L_i^\mathrm {\Omega }\) loss. We believe this is because the trainable teacher can adapt more flexibly to the student's learning than a static teacher with fixed parameters, resulting in improved overall performance. While a meticulously tuned fixed weighting approach might achieve good performance, our dynamic weighting method achieves comparable results without the need for extensive tuning across weight combinations. This advantage becomes especially valuable when computational resources and time are limited.
Table 5
Experiment on different weighting mechanisms to balance the importance of the static and the trainable teacher on the MVTec AD dataset

Method | \(w_i^\mathrm {\Omega }\) | \(w_i^\mathrm {\Psi }\) | AU-ROC (Detection) | AU-PRO (Localization)
Fixed | 0.1 | 0.9 | 99.3 | 93.7
Fixed | 0.3 | 0.7 | 99.2 | 93.7
Fixed | 0.7 | 0.3 | 98.9 | 93.6
Fixed | 0.9 | 0.1 | 99.1 | 93.6
Dynamic | – | – | 99.2 | 93.8

In this experiment, the soft-hard feature loss is not incorporated

4.3.5 Computational performance metrics

Our method is expected to perform as efficiently as the baseline method, i.e., EfficientAD [6], in terms of computational performance metrics such as latency, throughput, FLOPs, number of model parameters, and GPU Memory. This expectation arises from the fact that our model adopts the same architecture as the baseline method, in which, for a fair comparison, we utilized the medium variant of the patch description network (PDN) and an autoencoder-based model. Among the existing methods listed in Table 1, EfficientAD is worth noting as the only one that specifically focuses on achieving better computational efficiency and is recognized as one of the state-of-the-art methods in terms of efficiency.
Table 6
Comparison of our method with the baseline method on computational performance metrics during inference using an NVIDIA RTX 3090

| Method   | Latency (ms)           | Throughput (image/s) | CUDA Memory (MB)        | Parameters (\(\times 10^6\)) | FLOPs (\(\times 10^9\)) |
|----------|------------------------|----------------------|-------------------------|------------------------------|-------------------------|
| Baseline | 6.389 (\(\pm 0.004\))  | 206 (\(\pm 1\))      | 71.767 (\(\pm 0.161\))  | 20.738 (\(\pm 0\))           | 235.410 (\(\pm 0\))     |
| Proposed | 6.387 (\(\pm 0.005\))  | 206 (\(\pm 1\))      | 71.885 (\(\pm 0.118\))  | 20.738 (\(\pm 0\))           | 235.410 (\(\pm 0\))     |

The mean and standard deviation reported in the table are based on five runs of 1000 forward passes. Following [6], we utilize half precision to measure computational performance. Note that switching to half precision during inference of our method does not change the other experimental results evaluated in this paper.
To provide evidence for our claim regarding computational performance relative to the baseline method, we measure the inference performance of our method on various computational performance metrics and compare it with the baseline, as shown in Table 6. From Table 6, we can see that the number of parameters, FLOPs, and throughput of our method remain identical to the baseline, while the other metrics vary only slightly. For each metric, we conduct 1000 initial forward passes as a warm-up phase, followed by computing the mean performance values of the subsequent 1000 forward passes. We measure throughput using a batch size of 16, while for the other metrics we use a batch size of 1 for each forward pass; the throughput metric is thus specifically assessed in the context of batched image processing.
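As a concrete illustration of this protocol, the following sketch measures per-image latency the same way: 1000 warm-up passes followed by 1000 timed passes, in half precision on a GPU. The input resolution and model construction are placeholder assumptions; only the measurement procedure mirrors our setup.

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model: torch.nn.Module,
                       input_shape=(1, 3, 256, 256),  # assumed resolution
                       warmup: int = 1000,
                       runs: int = 1000) -> float:
    device = torch.device("cuda")
    model = model.to(device).half().eval()
    x = torch.randn(*input_shape, device=device, dtype=torch.half)

    for _ in range(warmup):        # warm-up forward passes (not timed)
        model(x)
    torch.cuda.synchronize()       # drain queued GPU work before timing

    start = time.perf_counter()
    for _ in range(runs):          # timed forward passes
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0  # mean ms/pass
```

Throughput can be measured analogously with a batch size of 16, dividing the batch size by the mean batch latency.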
Fig. 8
Qualitative results of the proposed method versus the baseline method on three samples of the object (left) and texture (right) images on the MVTec AD dataset. For the object images, from top to bottom: transistor, zipper, and capsule. For the texture images, from top to bottom: tile, wood, and carpet. A higher anomaly score is represented closer to red, while a lower score is represented closer to blue
Fig. 9
Qualitative results of the proposed method versus the baseline method on six samples from the VisA dataset. For the left images, from top to bottom: cashew, chewing gum, and macaroni1. For the right images, from top to bottom: pipe_fryum, fryum, and pcb1. A higher anomaly score is represented closer to red, while a lower score is represented closer to blue

4.4 Qualitative results

We qualitatively analyze and compare our proposed method with the baseline model on six samples from each of the MVTec AD and VisA datasets, as illustrated in Figs. 8 and 9, respectively. As shown in the transistor and zipper images in Fig. 8 and the cashew, pipe_fryum, and pcb1 images in Fig. 9, our proposed method avoids falsely detecting anomalies in image regions where no anomaly is present, while still detecting the truly anomalous parts equally well. We believe this is because, by utilizing the proposed soft-hard feature loss, our model does not completely neglect low-error regions of the image while still assigning higher importance to high-error regions.
Fig. 10
Comparison of sample images from different datasets. The left panel shows 9 randomly selected images from the ImageNet dataset [47], while the right panel displays 9 representative images from the MVTec AD dataset. This visual comparison highlights the domain gap between natural images and industrial anomaly detection images
From the capsule and chewing gum images, we can see that the prediction of the proposed method is more localized than that of the baseline method, with a smaller detected anomaly area. In the carpet image, the proposed method covers more of the anomalous area than the baseline. From the tile, wood, macaroni1, and fryum images, we can observe that our proposed method predicts the anomaly with greater confidence, indicated by more red coloring in the anomaly maps. We attribute these behaviors to the dynamic loss weighting, which balances the importance of the models and thereby optimizes the final detection performance.

5 Limitations and future work

While the MVTec AD dataset consists of 15 categories and the VisA dataset includes 12 (27 categories in total across our experiments), these datasets still fall short of representing the vast diversity of real-world industrial environments, where countless defect types can occur. Furthermore, due to the inherent domain gap, illustrated in Fig. 10, between manufacturing data and commonly used natural-image datasets such as ImageNet, challenges remain in fully generalizing the model to unseen scenarios. Although the proposed method demonstrates strong performance in controlled benchmark settings, its scalability to high-volume industrial production lines and its adaptability to highly diverse and subtle anomalies require further investigation.
To address these limitations, future work will focus on constructing large-scale, diverse manufacturing-specific datasets and evaluating the proposed model under real-world constraints to enhance its robustness, efficiency, and generalizability in practical applications. In cases where building such manufacturing datasets is not feasible, an alternative strategy would be to fine-tune the early layers of pre-trained models (e.g., those trained on ImageNet) so that the low-level features can be better adapted to the specific characteristics of manufacturing data; a minimal sketch of this strategy is given below. This approach could serve as a practical compromise to reduce the domain gap and improve performance in industrial scenarios with limited data availability.
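In the sketch, torchvision's ResNet-18 serves as a stand-in ImageNet-pretrained backbone; the backbone choice, the layer split, and the learning rate are illustrative assumptions, not part of our method.

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

# Load an ImageNet-pretrained backbone (a stand-in for illustration).
model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)

# Freeze all parameters ...
for p in model.parameters():
    p.requires_grad = False

# ... then unfreeze only the early, low-level layers so they can adapt
# to the statistics of manufacturing images.
for module in (model.conv1, model.bn1, model.layer1):
    for p in module.parameters():
        p.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```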
We limit our experiments solely to static image datasets to maintain clarity and focus in our research. However, our methods, with or without some extensions, could potentially be applied to other types of data, such as audio, video, or other modalities. Investigating these modalities could be an interesting path for future research.
To ensure a fair comparison, we employed a convolutional architecture in our experiments. While alternative architectures, such as dense networks, attention mechanisms, and diffusion models, exist, most anomaly detection methodologies remain CNN-based due to their computational efficiency. In contrast, architectures like Transformers and diffusion models typically require significantly higher computational resources. Investigating the applicability of our proposed method to these architectures presents a promising direction for future research.
Through our ablation study, we identified that certain \(\alpha \) values yield the best results. However, the explored range of \(\alpha \) values was relatively narrow, limiting our ability to comprehensively assess its impact on performance. In this study, we employed a grid search within a constrained range to identify suitable \(\alpha \) values (sketched below). As future work, we plan to explore broader and more adaptive optimization techniques, such as Bayesian optimization or gradient-based tuning, to determine the optimal \(\alpha \) value. Furthermore, comparing these static optima with our dynamic loss weighting approach could provide deeper insights into the trade-offs between fixed and adaptive weighting strategies in balancing the soft and hard loss components. This could enhance both the stability and the detection accuracy of the proposed model across various scenarios.
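For reference, the constrained grid search used in our ablation amounts to the loop below; `train_and_evaluate` is a hypothetical stand-in for the full training and evaluation pipeline, and a Bayesian or gradient-based search would replace the explicit loop.

```python
def train_and_evaluate(alpha: float) -> float:
    """Hypothetical stand-in: train with the given alpha and return a
    validation score (e.g., an average of AU-ROC and AU-PRO)."""
    raise NotImplementedError  # replace with the actual pipeline

best_alpha, best_score = None, float("-inf")
for alpha in (0.1, 0.3, 0.5, 0.7, 0.9):  # constrained grid from Table 4
    score = train_and_evaluate(alpha)
    if score > best_score:
        best_alpha, best_score = alpha, score
```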
Training multiple models simultaneously poses challenges, especially with limited computational resources. In our experimental setup, we restricted ourselves to a single model for each architecture type, namely the static teacher, the trainable teacher, and the student network. With greater computational resources, however, exploring multiple models per architecture type could be a promising direction for future studies.

6 Conclusion

Anomaly detection and localization are essential for industrial inspection, ensuring quality, reducing defects, and optimizing efficiency. These techniques minimize reliance on manual inspection, prevent defective products, and enable real-time corrective actions, thereby enhancing reliability and production standards.
In this paper, we propose a novel approach to enhance anomaly detection by leveraging the collective strength of multiple models. Our method introduces a dynamic loss weighting technique that adaptively balances the contributions of multiple teachers throughout training, leading to improved performance in both anomaly detection and localization. Additionally, to facilitate more comprehensive feature learning, we introduce a soft-hard feature loss, further enhancing overall performance. As the proposed method does not modify the model architecture, it can be seamlessly applied to various multi-model approaches.
Experimental results demonstrate state-of-the-art performance on the MVTec AD and VisA datasets, achieving up to a 0.6% increase in AU-PRO on MVTec AD and a 0.4% increase in AU-ROC on VisA compared to baseline methods, underscoring the effectiveness of our approach. Due to the large-scale production nature of the manufacturing industry, even a relatively small performance improvement can have a substantial impact, potentially resulting in significant economic benefits in real-world industrial applications. Furthermore, qualitative analysis highlights the method’s capability to reduce false detections and improve anomaly localization within images.

Declarations

Conflict of interest

The authors declare that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

References
2. Bergmann P, Fauser M, Sattlegger D, Steger C (2019) MVTec AD—a comprehensive real-world dataset for unsupervised anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9592–9600
3. Kotsiantis SB (2007) Supervised machine learning: a review of classification techniques. In: Proceedings of the 2007 conference on emerging artificial intelligence applications in computer engineering: real word AI systems with applications in eHealth, HCI, information retrieval and pervasive technologies. IOS Press, NLD, pp 3–24
4. Bergmann P, Fauser M, Sattlegger D, Steger C (2020) Uninformed students: student–teacher anomaly detection with discriminative latent embeddings. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4183–4192
5. Bergmann P, Batzner K, Fauser M, Sattlegger D, Steger C (2022) Beyond dents and scratches: logical constraints in unsupervised anomaly detection and localization. Int J Comput Vis 130(4):947–969
6. Batzner K, Heckler L, König R (2023) EfficientAD: accurate visual anomaly detection at millisecond-level latencies. arXiv:2303.14535 [cs.CV]
7. Wang Y, Peng J, Zhang J, Yi R, Wang Y, Wang C (2023) Multimodal industrial anomaly detection via hybrid fusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8032–8041
8. Thomine S, Snoussi H, Soua M (2023) MixedTeacher: knowledge distillation for fast inference textural anomaly detection. arXiv preprint arXiv:2306.09859
10. Han ML, Lee J, Kang AR, Kang S, Park JK, Kim HK (2015) A statistical-based anomaly detection method for connected cars in internet of things environment. In: Hsu C-H, Xia F, Liu X, Wang S (eds) Internet of vehicles—safe and intelligent mobility. Springer, Cham, pp 89–97
11. Dewaele G, Fukuda K, Borgnat P, Abry P, Cho K (2007) Extracting hidden anomalies using sketch and non gaussian multiresolution statistical detection procedures. In: Proceedings of the 2007 workshop on large scale attack defense. LSAD'07. Association for Computing Machinery, New York, NY, USA, pp 145–152. https://doi.org/10.1145/1352664.1352675
14. Amer M, Goldstein M, Abdennadher S (2013) Enhancing one-class support vector machines for unsupervised anomaly detection. In: Proceedings of the ACM SIGKDD workshop on outlier detection and description. ODD'13. Association for Computing Machinery, New York, NY, USA, pp 8–15. https://doi.org/10.1145/2500853.2500857
16. Akcay S, Atapour-Abarghouei A, Breckon TP (2019) GANomaly: semi-supervised anomaly detection via adversarial training. In: Jawahar CV, Li H, Mori G, Schindler K (eds) Computer vision—ACCV 2018. Springer, Cham, pp 622–637
17. Schlegl T, Seeböck P, Waldstein SM, Schmidt-Erfurth U, Langs G (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International conference on information processing in medical imaging. Springer, pp 146–157
18. Görnitz N, Kloft M, Rieck K, Brefeld U (2013) Toward supervised anomaly detection. J Artif Intell Res 46(1):235–262
20. Minhas MS, Zelek J (2020) Semi-supervised anomaly detection using autoencoders. J Comput Vis Imaging Syst 5(1):3
21. Zou Y, Jeong J, Pemula L, Zhang D, Dabeer O (2022) Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In: Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T (eds) Computer vision—ECCV 2022. Springer, Cham, pp 392–408
22. Baur C, Wiestler B, Albarqouni S, Navab N (2019) Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. In: Brainlesion: glioma, multiple sclerosis, stroke and traumatic brain injuries: 4th international workshop, BrainLes 2018, held in conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, revised selected papers, part I. Springer, Berlin, pp 161–169
23. Vasilev A, Golkov V, Meissner M, Lipp I, Sgarlata E, Tomassini V, Jones DK, Cremers D (2020) q-Space novelty detection with variational autoencoders. In: Computational diffusion MRI: MICCAI workshop, Shenzhen, China, October 2019. Springer, Berlin, pp 113–124
24. Schlegl T, Seeböck P, Waldstein SM, Langs G, Schmidt-Erfurth U (2019) f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Med Image Anal 54:30–44
25. Defard T, Setkov A, Loesch A, Audigier R (2021) PaDiM: a patch distribution modeling framework for anomaly detection and localization. In: International conference on pattern recognition. Springer, pp 475–489
26. Roth K, Pemula L, Zepeda J, Schölkopf B, Brox T, Gehler P (2022) Towards total recall in industrial anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14318–14328
27.
28. Tang C, Zhou S, Li Y, Dong Y, Wang L (2024) Advancing pre-trained teacher: towards robust feature discrepancy for anomaly detection. arXiv preprint arXiv:2405.02068
29. Sohn K, Berthelot D, Carlini N, Zhang Z, Zhang H, Raffel CA, Cubuk ED, Kurakin A, Li C-L (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. Adv Neural Inf Process Syst 33:596–608
30. Zhang B, Wang Y, Hou W, Wu H, Wang J, Okumura M, Shinozaki T (2021) FlexMatch: boosting semi-supervised learning with curriculum pseudo labeling. Adv Neural Inf Process Syst 34:18408–18419
31.
32. Salehi M, Sadjadi N, Baselizadeh S, Rohban MH, Rabiee HR (2021) Multiresolution knowledge distillation for anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14902–14912
34. Rudolph M, Wehrbein T, Rosenhahn B, Wandt B (2023) Asymmetric student–teacher networks for industrial anomaly detection. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2592–2602
36. Wang G, Han S, Ding E, Huang D (2021) Student–teacher feature pyramid matching for anomaly detection. In: The British Machine Vision Conference (BMVC)
37. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, conference track proceedings. http://arxiv.org/abs/1409.1556
39. Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning. PMLR, pp 6105–6114
40. Zagoruyko S, Komodakis N (2016) Wide residual networks. In: BMVC
42. Bergmann P, Batzner K, Fauser M, Sattlegger D, Steger C (2021) The MVTec anomaly detection dataset: a comprehensive real-world dataset for unsupervised anomaly detection. Int J Comput Vis 129(4):1038–1059
43. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems 32, pp 8024–8035
44. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, conference track proceedings. http://arxiv.org/abs/1412.6980
45. Yu J, Zheng Y, Wang X, Li W, Wu Y, Zhao R, Wu L (2021) FastFlow: unsupervised anomaly detection and localization via 2D normalizing flows. arXiv preprint arXiv:2111.07677
46. Deng H, Li X (2022) Anomaly detection via reverse distillation from one-class embedding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9737–9746
47. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255
Metadata
Title: Multi-model anomaly detection for industrial inspection with dynamic loss weighting and soft-hard features loss
Authors: Willy Fitra Hendria, Hanbi Kim, Daeho Seo
Publication date: 09.06.2025
Publisher: Springer London
Published in: Neural Computing and Applications
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-025-11367-3