Multiresolution Knowledge Distillation for Anomaly Detection

Paper: https://arxiv.org/pdf/2011.11108.pdf
Official code: https://github.com/Niousha12/Knowledge_Distillation_AD
Reimplementation: https://github.com/HibikiJie/Multiresolution-Knowledge-Distillation-for-Anomaly-Detection/tree/master


Abstract

Unsupervised representation learning has proved to be a critical component of anomaly detection/localization in images. The challenges of learning such a representation are twofold. Firstly, the sample size is often not large enough to learn a rich, generalizable representation through conventional techniques. Secondly, while only normal samples are available at training, the learned features should be discriminative of normal and anomalous samples. Here, we propose to use the "distillation" of features at various layers of an expert network, pre-trained on ImageNet, into a simpler cloner network to tackle both issues. We detect and localize anomalies using the discrepancy between the expert and cloner networks' intermediate activation values given the input data. We show that considering multiple intermediate hints in distillation leads to better exploitation of the expert's knowledge and a more distinctive discrepancy compared to solely utilizing the last layer's activation values. Notably, previous methods either fail at precise anomaly localization or need expensive region-based training. In contrast, with no need for any special or intensive training procedure, we incorporate interpretability algorithms into our novel framework for localization of anomalous regions. Despite the striking contrast between some test datasets and ImageNet, we achieve competitive or significantly superior results compared to SOTA methods on MNIST, F-MNIST, CIFAR-10, MVTecAD, Retinal-OCT, and two medical datasets, on both anomaly detection and localization.


1. Introduction

Anomaly detection (AD) aims to recognize test-time inputs that look abnormal or novel to the model, according to the normal samples seen during training. It is a vital and demanding task in computer vision with various applications, such as industrial image-based product quality control [27, 7] or health monitoring [26]. These tasks also require pixel-precise localization of the anomalous regions, called defects. This is pivotal for comprehending the dynamics of monitored procedures, triggering the appropriate countermeasures, and providing proper data for downstream models in industrial settings. Traditionally, the AD problem has been approached in a one-class setting, where the anomalies represent a broadly different class from the normal samples. Recently, considering subtle anomalies has attracted attention. This new setting further necessitates precise anomaly localization. However, excelling in both settings across various datasets, while highly desirable, has not yet been fully achieved. Due to the unsupervised nature of the AD problem and the restricted data access (only normal data is available during training), the majority of methods [36, 31, 40, 18, 34] model the normal data abstraction by extracting semantically meaningful latent features. These methods perform well on only one of the two mentioned cases. This problem, called the generality problem [39], greatly reduces trust in them on unseen future datasets. Moreover, anomaly localization is either impossible or poor in most of them [36, 31, 33] and leads to intensive computations that hurt their real-time performance. Additionally, many earlier works [33, 31] suffer from unstable training, requiring unprincipled early stopping to achieve acceptable results.


Figure 1: Our precise heatmaps localizing anomalous features in MVTecAD (top two rows) and normal features in MNIST and CIFAR-10 (bottom two rows).


Using pre-trained networks, though not fully explored in the AD context, could potentially be an alternative track. This is especially helpful when the sample size is small and the normal class shows large variations. Some earlier studies [4, 12, 28, 29] try to train their models based on the pre-trained features of normal data. These methods either miss anomaly localization [4, 12] or tackle the problem in a region-based fashion [28, 53], i.e. splitting images into smaller patches to determine sub-regional abnormality. This is computationally expensive and often leads to inaccurate localization. To evade this issue, Bergmann et al. [8] train an ensemble of student networks to mimic the last layer of a teacher network on anomaly-free data. However, the region-based approach in this work not only makes it rely heavily on the size of the cropped patches, and hence be susceptible to changes in this size, but also increases the training cost severely. Furthermore, imitating only the last layer fails to fully exploit the knowledge of the teacher network [32]. This forces them to complicate their model and employ other complementary techniques, such as self-supervised learning, in parallel.


Lately, Zhang et al. [52] have demonstrated that the activation values of the intermediate layers of neural networks are a firm perceptual representation of the input images. On this premise, we propose a novel knowledge distillation method designed to distill the comprehensive knowledge of an ImageNet pre-trained source network, solely on the normal training data, into a simpler cloner network. This happens by forcing the cloner's intermediate embeddings of normal training data at several critical layers to conform to those of the source. Consequently, the cloner learns the manifold of the normal data thoroughly, yet gains no knowledge from the source about other possible input data. Hence, the cloner will behave differently from the source when fed anomalous data. Furthermore, a simpler cloner architecture helps avoid distraction by non-distinguishing features and enhances the discrepancy between the two networks' behavior on anomalies.

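To make the setup concrete, the following PyTorch sketch outlines the two networks under stated assumptions: the source is torchvision's VGG-16 and the critical layers are its first three max-pool outputs; `ClonerNet` is a hypothetical compact architecture, not necessarily the paper's exact one.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# Frozen expert (source) network S: ImageNet pre-trained VGG-16 features.
source = vgg16(pretrained=True).features.eval()
for p in source.parameters():
    p.requires_grad_(False)

# Activations of S at the assumed critical points, obtained by slicing
# the sequential feature extractor at its first three pool layers.
def source_acts(x):
    a1 = source[:5](x)      # through pool1 -> 64 channels
    a2 = source[5:10](a1)   # through pool2 -> 128 channels
    a3 = source[10:17](a2)  # through pool3 -> 256 channels
    return [a1, a2, a3]

class ClonerNet(nn.Module):
    """Hypothetical compact cloner C; each block's output is compared
    against the source activation of the matching critical layer."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        ])

    def forward(self, x):
        acts = []
        for block in self.blocks:
            x = block(x)
            acts.append(x)  # one activation per critical layer
        return acts
```

Only the cloner's parameters are trained; each normal image thus costs just one extra forward pass through the frozen source, matching the cost claim below.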

In addition, we derive precise anomaly localization heatmaps, without expensive region-based training and testing, by exploiting input gradients. We evaluate our method on a comprehensive set of datasets across various anomaly detection/localization tasks, exceeding the SOTA in both localization and detection. Our training is highly stable and needs no dataset-dependent fine-tuning. As we only train the cloner's parameters, we require just one extra forward pass of the inputs through the source compared to standard network training on the normal data. We also investigate our method through exhaustive ablation studies. Our main contributions are summarized as follows:

  1. Enabling a more comprehensive transfer of the pre-trained expert network's knowledge to the cloner. Distilling the knowledge into a more compact network also helps concentrate solely on the features that distinguish normal from anomalous samples.
  2. Our method has a computationally inexpensive and stable training process compared to earlier work.
  3. Our method allows real-time and precise anomaly localization based on computing gradients of the discrepancy loss with respect to the input (a sketch follows this list).
  4. Conducting a large number of diverse experiments, outperforming previous SOTA models by a large margin on many datasets while staying competitive on the rest.

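Contribution 3 is simple to sketch: since the discrepancy loss is differentiable in the input, one backward pass yields a per-pixel saliency map. Below, `total_loss_fn` is a stand-in for the paper's source/cloner discrepancy loss, and the channel aggregation is a simplified assumption.

```python
import torch

def localization_map(x, total_loss_fn):
    """Backpropagate the discrepancy loss to the input pixels.

    x: a (1, 3, H, W) test image; total_loss_fn: a callable returning
    the scalar discrepancy between source and cloner for x (assumed).
    """
    x = x.clone().requires_grad_(True)
    loss = total_loss_fn(x)
    loss.backward()
    # Aggregate gradient magnitudes over channels into an (H, W) map.
    return x.grad.abs().sum(dim=1).squeeze(0)
```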

2. Related Work

Previous Methods: Autoencoder (AE)-based methods use the idea that, by learning normal latent features, abnormal inputs are not reconstructed as precisely as normal ones, resulting in higher reconstruction error for anomalies. To better learn these normal latent features, LSA [1] trains an autoregressive model in its latent space, and OCGAN [31] attempts to force abnormal inputs to be reconstructed as normal ones. These methods fail on industrial or complex datasets [38]. SSIM-AE [10] trains an AE with the SSIM loss [54] instead of MSE, which improves performance only on defect segmentation. Gradient-based VAE [15] introduces an energy criterion that is minimized at test time by an iterative procedure. Neither of the last two methods performs well in one-class settings, such as CIFAR-10 [23].
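The reconstruction-error criterion shared by these AE-based methods fits in a few lines; this is a generic sketch of the scoring rule, not any specific paper's implementation:

```python
import torch

def anomaly_score(autoencoder, x):
    """Per-image reconstruction MSE over a batch x of shape (B, C, H, W);
    higher scores indicate inputs farther from the normal manifold."""
    with torch.no_grad():
        recon = autoencoder(x)
    return torch.mean((recon - x) ** 2, dim=(1, 2, 3))
```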

GAN-based approaches, like AnoGAN [41], f-AnoGAN [40], and GANomaly [3], attempt to find a specific latent space where the generator's reconstructions, obtained from samplings of this space, are analogous to the normal data. f-AnoGAN and GANomaly add an extra encoder to the generator to reduce AnoGAN's inference time. Despite their acceptable performance in localizing and detecting subtle anomalies, they fail in one-class settings.


Methods like uninformed-students [9], GT [18], and DSVDD [33] keep only the useful information of the normal data by building a compact latent feature space, in contrast to AE-based ones, which try to lose as little normal-data information as possible. To achieve this, they use self-supervised learning methods or one-class techniques. However, since we only have access to normal samples in an unsupervised setting, the optimization here is harder than in AE-based methods and usually converges to trivial solutions. To address this, unprincipled early stopping is used, which lowers trust in these models on unseen future datasets. For example, GT fails on subtle-anomaly datasets like MVTecAD while performing well in one-class settings.


Figure 2: Visualized summary of our proposed framework. A smaller cloner network, C, is trained to imitate the whole behavior of a source network, S (VGG-16), on normal data. The discrepancy of their intermediate behavior is formulated as a total loss function and used to detect anomalies at test time. A hypothetical example of distance vectors between the activations of C and S on anomalous and normal data is also depicted. Interpretability algorithms are employed to yield pixel-precise anomaly localization maps.


Using Pre-trained Features: Some previous methods use a pre-trained VGG's last layer to solve the representation problem [14, 35]. However, [14] gets stuck in bad local minima as it uses only the last layer. [35] attempts to solve this by extracting many different patches from normal images and then fitting a Gaussian distribution to the VGG embeddings of the patches. Although this may alleviate the problem, these methods fail to provide good localization or detection on diverse datasets because of the unimodal Gaussian distribution and the hand-engineered patch size.
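The patch-Gaussian idea in [35] amounts to fitting a single Gaussian to pre-trained embeddings of normal patches and scoring new patches by Mahalanobis distance; a rough sketch of that baseline (patch extraction and the exact embedding layer are omitted assumptions):

```python
import torch

def fit_gaussian(embeddings):
    """Fit a mean and inverse covariance to (N, D) normal-patch embeddings."""
    mu = embeddings.mean(dim=0)
    centered = embeddings - mu
    cov = centered.T @ centered / (embeddings.shape[0] - 1)
    # Small ridge term keeps the covariance invertible.
    return mu, torch.linalg.inv(cov + 1e-6 * torch.eye(cov.shape[0]))

def mahalanobis_score(z, mu, cov_inv):
    """Anomaly score of a single (D,) embedding z."""
    d = z - mu
    return torch.sqrt(d @ cov_inv @ d)
```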


Interpretability Methods: Determining the contribution of input elements to a deep network's output is the subject of interpretability methods. Gradient-based methods compute a pixel's importance using gradients as a proxy. While Gradients [42] uses raw gradients, GuidedBackprop (GBP) [45] filters out negative backpropagated gradients to consider only elements with a positive contribution. As Gradients' maps can be noisy, SmoothGrad [44] adds small noise to the input and averages the maps obtained by Gradients over the noisy inputs. Several methods [2, 30] reveal flaws in GBP by demonstrating that it reconstructs the image instead of explaining the output function.
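SmoothGrad's noise-averaging step is easy to state in code; a minimal sketch, where the noise scale `sigma` and sample count are assumed hyperparameters and `model` returns the scalar being explained:

```python
import torch

def smoothgrad(model, x, n_samples=25, sigma=0.15):
    """Average input gradients over noisy copies of x (SmoothGrad [44])."""
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        model(noisy).backward()
        grads += noisy.grad
    return grads / n_samples
```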


3. Method

3.1. Our Approach

Given a training dataset $D_{train} = \{x_1, \ldots, x_n\}$ consisting only of normal images (i.e. with no anomalies in them), we ultimately train a cloner network, C, that detects anomalous images in the test set, $D_{test}$, and localizes anomalies in those images with the help of a pre-trained network. As C needs to predict the deviation of each sample from the manifold of normal data, it needs to know this manifold quite well. Therefore, it is trained to mimic the comprehensive behavior of an expert network, called the source network S. Earlier work in knowledge distillation has devoted considerable effort to transferring one network's knowledge to another, smaller one, to save computational cost and memory. Many of these works strive to teach C only the output of S. We, however, aim to also transfer S's intermediate knowledge of the normal training data to C.


In [32], it is shown that by using a single intermediate-level hint from the source, a thinner but deeper cloner can even outperform the source on classification tasks. In this work, we provide C with multiple intermediate hints from S by encouraging C to learn S's knowledge of normal samples, conforming its intermediate representations at a number of critical layers to S's representations. It is known that the layers of a neural network correspond to features at various abstraction levels; for instance, first-layer filters act as simple edge detectors, while later layers represent more semantic features. Therefore, mimicking different layers educates C at various abstraction levels, which leads to a more thorough final understanding of the normal data. In contrast, using only the final layer shares only a small portion of S's knowledge with C and causes the optimization to get stuck in irrelevant local minima. Using several intermediate hints instead turns the ill-posed problem into a more well-posed one. The effect of considering different layers is investigated further in Sec. 3.3.1.

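Reading out several intermediate hints from a pre-trained network is commonly done with forward hooks; below is a minimal sketch where the indices 4, 9, 16, and 23 (the first four max-pool outputs of torchvision's VGG-16) are an assumed choice of critical layers:

```python
import torch
from torchvision.models import vgg16

source = vgg16(pretrained=True).features.eval()
critical_ids = [4, 9, 16, 23]   # assumed critical layers (pool outputs)
activations = {}

def make_hook(idx):
    def hook(module, inputs, output):
        activations[idx] = output  # cache this layer's activation
    return hook

for idx in critical_ids:
    source[idx].register_forward_hook(make_hook(idx))

with torch.no_grad():
    source(torch.randn(1, 3, 224, 224))  # dummy input
for idx in critical_ids:
    print(idx, tuple(activations[idx].shape))
```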

In what follows, we refer to the $i$-th critical layer in the networks as $CP_i$ ($CP_0$ stands for the raw input), to the source's activation values at that critical layer as $a_s^{CP_i}$, and to the cloner's as $a_c^{CP_i}$. As discussed in the knowledge distillation literature [32, 50], the notion of knowledge can be seen as the value of activation functions. We define the notion of knowledge as both the value and the direction of all $a^{CP_i}$s to intensify the full knowledge transfer from S to C. Hence, we define two losses, $L_{val}$ and $L_{dir}$, to represent each aspect. The first, $L_{val}$, aims to minimize the Euclidean distance between C's and S's activation values at each $CP_i$. Thus, $L_{val}$ is formulated as

$$L_{val} = \sum_{i=1}^{N_{CP}} \frac{1}{N_i} \sum_{j=1}^{N_i} \left( a_s^{CP_i}(j) - a_c^{CP_i}(j) \right)^2$$

where $N_i$ indicates the number of neurons in layer $CP_i$, $a^{CP_i}(j)$ is the value of the $j$-th activation in layer $CP_i$, and $N_{CP}$ represents the total number of critical layers.
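In code, $L_{val}$ is simply a per-layer mean squared error summed over the critical layers; a direct sketch, assuming matched lists of source and cloner activations (e.g. as produced by the snippets above):

```python
import torch

def loss_val(source_acts, cloner_acts):
    """L_val: sum over critical layers of the mean squared difference
    between source and cloner activations (equation above)."""
    total = 0.0
    for a_s, a_c in zip(source_acts, cloner_acts):
        # (1/N_i) * sum_j (a_s(j) - a_c(j))^2 == element-wise MSE
        total = total + torch.mean((a_s - a_c) ** 2)
    return total
```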

