8 Knowledge distillation

For an easy-to-follow explanation of how knowledge distillation works, see the video by 同济子豪兄: https://www.bilibili.com/video/BV1gS4y1k7vj/?spm_id_from=333.788&vd_source=32f93e4a4ba268a7391b3d329ced107d


The core idea of knowledge distillation is to extract a small model from a large one.

A core concept in knowledge distillation is softness. Take three target classes, horse, donkey, and car, and an image whose content is a horse. The hard-target similarity vector is [1,0,0]: this is a horse, not a donkey and not a car, and it is "not a donkey" to exactly the same degree that it is "not a car" (even though a horse is actually quite similar to a donkey, certainly more so than to a car). A soft-target similarity vector might instead be [0.7,0.25,0.05]: similarity 0.7 to horse, 0.25 to donkey, and only 0.05 to car. Soft targets like these are generally not something a human would annotate in advance; the similarities are computed by the teacher model, i.e., a model that evaluates the image beforehand and judges it to be a horse with probability 0.7, a donkey with probability 0.25, and a car with probability 0.05. The teacher model's output is therefore a similarity distribution. A distillation temperature $T$ is also introduced to control the degree of softness: the higher the temperature, the softer the distribution.

These soft targets are then used as the training labels for the student model.
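To make this concrete, here is a minimal NumPy sketch (with made-up teacher logits for the horse/donkey/car example above) showing how a temperature-scaled softmax turns hard, peaked predictions into soft targets:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax with temperature T; larger T yields a softer distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for the classes [horse, donkey, car]
teacher_logits = [5.0, 3.0, -2.0]

hard_target = [1, 0, 0]                                  # one-hot ground truth
print(softmax_with_temperature(teacher_logits, T=1))     # peaked: ~[0.88, 0.12, 0.00]
print(softmax_with_temperature(teacher_logits, T=4))     # softer: ~[0.56, 0.34, 0.10]
```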

[Figure 1: model diagram 1]

As shown in the figure above, with the teacher and student run at the same temperature, we compute Loss 1 (the distillation loss) between their softened outputs; the smaller the better, and this part is the student learning from the teacher. The student is also run at temperature 1 and compared against the labels to compute Loss 2 (the student loss), again the smaller the better; this part is like the student learning from the textbook. Note that the "Bigger model" in the figures is the teacher model and has already been trained, and that the True Targets are also called the Ground Truth Labels. Finally, Loss 1 and Loss 2 are added together according to a weight, and the result is backpropagated to train the student, as shown below:

[Figure 2: model diagram 2]
The computation proceeds as in the figure below:
[Figure 3: worked example of the computation]
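In symbols (using $\lambda$ as a hypothetical weighting coefficient, since the figures only speak of a generic weight):

$$\mathcal{L}_{total} = \lambda \, \mathcal{L}_{distill} + (1 - \lambda) \, \mathcal{L}_{student},$$

where $\mathcal{L}_{distill}$ (Loss 1) compares the student's softened output at temperature $T$ with the teacher's soft targets, and $\mathcal{L}_{student}$ (Loss 2) compares the student's regular output ($T = 1$) with the Ground Truth Label.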

In the end, it is the student model that gets deployed, e.g., to mobile devices.

Application scenarios of knowledge distillation:

  1. Model compression

    The student model is smaller and faster.

  2. Model optimization: it helps prevent overfitting and acts as a regularizer.

  3. Mining effectively unlimited, unlabeled data

    Large numbers of (unlabeled) images can be crawled from the web and fed to the teacher model to produce soft targets, which are then used to train the student; in this case there are simply no Ground Truth Labels.

  4. Few-shot and zero-shot learning

    Because the soft targets produced by the teacher record how similar one class is to the other classes, the student can classify a category it has never seen.

    For example: the student has never seen a donkey but has seen a horse. The teacher says a donkey looks 7/10 like a horse, nothing like a car, and 3/10 like a cow… so the first time the student sees a donkey it can still recognize it as one.

8.1 Introduction

However, most (if not all) state-of-the-art DL architectures are largely overparameterized and the additional parameters/complexity in many cases just help to discover better solutions, instead of merely increasing their representational capacity. This means that in many cases the over-parametrized DL models could be potentially “compressed” into smaller ones, if we had the appropriate tools for training them.

The paragraph above argues that DL architectures carry additional parameters and complexity that are not strictly needed, and that with the appropriate tools the models could be compressed to remove them.

Among the most well-known methods for improving the effectiveness of the training process for DL models is knowledge distillation, also known as knowledge transfer or neural network distillation.

These methods are capable of improving the effectiveness of the training process by transferring the knowledge encoded in a large and complex neural network into a smaller one.

Typically, the larger model is called the teacher model, while the smaller one is called the student model, to highlight the similarity with the anthropocentric training approaches, where a teacher transfers its knowledge to a student using the most appropriate teaching techniques.

In the case of neural networks, the knowledge encoded by the teacher model, which is also known as dark knowledge, describes various aspects regarding the inner workings of the larger models, which the student model should mimic.

The paragraphs above introduce some terminology and the core idea of knowledge distillation.

Knowledge distillation improves the effectiveness of training by transferring the knowledge encoded in a large, complex neural network into a smaller one. The larger model is called the teacher model and the smaller one the student model. The knowledge encoded by the teacher is also called dark knowledge, and it describes the aspects of the teacher that the student model should mimic.

Even though a solid theoretical underpinning of why and when knowledge distillation methods work has not been fully established yet, there are several explanations on why this process is so effective. First, knowledge distillation acts as a regularizer, encoding our prior knowledge and beliefs regarding the regions of the hypothesis space that should be considered. Therefore the teacher model steers the student toward a potentially better region of the optimization landscape, often leading to a better solution and improved generalization. Note that, as we mentioned earlier, the overparameterization of the teacher model sometimes only helps to discover a better solution, thus improving the effective capacity of the teacher, even though the correct hypotheses can be described by a model with lower representational capacity.

Furthermore, knowledge distillation also has information-theoretic consequences, since the student model is trained by using more information compared with regular training using only the ground truth information. This can be better understood by an example. Consider the case of a neural network trained to classify handwritten digits, as shown in Fig. 8.1. Regular supervised learning algorithms typically aim to maximize the response of the neuron that corresponds to the correct class, ignoring that the similarities of each digit with the rest of the classes might also provide useful information to the model. On the other hand, knowledge distillation methods, such as distillation between the last (decision) layers of the networks, extract such information from the teacher model and then train the student model using both the ground truth labels, as well as the implicit (dark) knowledge extracted from the teacher. Even though this dark knowledge can be provided in various forms, most methods employ either similarities or relations between samples and/or classes, as shown in Fig. 8.1.

[Fig. 8.1: knowledge distillation for a handwritten digit classifier, with dark knowledge in the form of similarities/relations between samples and classes]

The paragraphs above discuss why knowledge distillation is effective (its advantages):

  1. It acts as a regularizer.
  2. It has information-theoretic consequences: in addition to the similarity of the input to the correct class, the similarities of each class to all the other classes are also taken into account.

    The knowledge distillation setup is shown in the figure above. Most methods provide the dark knowledge as similarities or relations between samples and/or classes.

8.2 Neural network distillation

This section introduces the seminal neural network distillation method:

G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv preprint, arXiv:1503.02531, 2015.


Distillation was motivated by the observation that training a single model that mimics the behavior of a larger and more complicated ensemble of models typically leads to better accuracy than directly training the same single model for the task at hand.

The paragraph above gives the motivation for distillation.

Distillation targets DL models trained for classification tasks and works by using the teacher model to extract implicit similarities between the training samples and the categories, as shown in Fig. 8.2. These similarities, which are also called soft targets to distinguish them from the hard ground truth targets, reveal more information both regarding the way the teacher models the knowledge, as well as regarding the structure of the data. However, extracting this kind of information from DL models is not always straightforward, since they are typically structured and trained to suppress the similarities with the labels apart from the correct class. For example, note that DL models trained for classification tasks typically use the softmax activations for producing a probability distribution over the classes, which is designed to suppress all the activations apart from the highest one. Neural network distillation overcomes this issue by introducing a temperature hyperparameter $T$ that allows for producing a softer distribution over the candidate categories, as also shown in Fig. 8.3 ($T = 1$ refers to the regular output of the network).

[Fig. 8.2: neural network distillation using the teacher's soft targets]

The paragraph above first states the goal of distillation: to train a DL model for a classification task by using the teacher model to extract implicit similarities between the training samples and the categories. These similarities are called soft targets and reveal more information both about how the teacher models its knowledge and about the structure of the data.

It then points out why extracting this information is difficult: DL models are typically structured and trained to suppress the similarities to every label other than the correct class, since the softmax activation used to produce the class probability distribution suppresses all activations except the highest one.

Finally, it gives the remedy: introduce a temperature hyperparameter $T$ that produces a softer distribution over the candidate categories ($T = 1$ corresponds to the regular output of the network).

The output of the teacher is calculated as

where the notation $[\boldsymbol{x}]_i$ is used to refer to the $i$th element of the vector $\boldsymbol{x}$ and $N_C$ is the number of categories. The vector $\boldsymbol{x}$ denotes an input sample that is fed into the teacher DL model $f(\cdot)$, and $f^{logits}(\cdot)$ denotes the logits of the model, that is, the activations of the final layer before they are fed to the softmax layer. Note that $f(\boldsymbol{x}) \in \mathbb{R}^{N_C}$ and $f^{logits}(\boldsymbol{x}) \in \mathbb{R}^{N_C}$.

The passage above defines some notation and the formula for computing the teacher model's output.

The key concept here is the logits: a vector that is normally fed directly into a softmax/sigmoid. The output of the softmax gives the class probabilities for the classification task; its input is the logits layer, which typically produces values anywhere in $(-\infty, +\infty)$, and the softmax maps them to values between 0 and 1. Logits can take any real value: a positive value pushes towards assigning the sample to that class, a negative value pushes away from it. For a more detailed explanation, see: logits是什么? (what are logits?)

Here $f^{logits}(\cdot)$ denotes the teacher model's pre-softmax output.

Distillation introduces the temperature hyperparameter $T$ that allows for controlling the smoothness of the output distribution of the teacher $p_i$ as

where $T$ is set to $1$ during regular operation. During the distillation process, we can fine-tune the value of $T$ in order to obtain a distribution with the appropriate entropy. More specifically, higher values of $T$ make the distribution softer, as shown in Fig. 8.3, while smaller values make the distribution more peaked.

[Fig. 8.3: effect of the temperature $T$ on the output distribution]

The above gives the teacher model's output distribution after introducing the hyperparameter $T$. With $T = 1$ we recover the regular output (as is clear from the formula), and the larger $T$ is, the flatter (softer) the output distribution becomes.
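A reconstruction of this formula from the surrounding definitions (the displayed equation is not reproduced in these notes) is the standard temperature-scaled softmax:

$$p_i = \frac{\exp\left([f^{logits}(\boldsymbol{x})]_i / T\right)}{\sum_{j=1}^{N_C} \exp\left([f^{logits}(\boldsymbol{x})]_j / T\right)},$$

which for $T = 1$ reduces to the regular softmax output $[f(\boldsymbol{x})]_i$.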

Similarly, the softened output of the student model $g_{\boldsymbol{W}}(\cdot)$ is defined as

where $g_{\boldsymbol{W}}^{logits}$ is similarly defined and $\boldsymbol{W}$ refers to the trainable parameters of the model.

The passage above gives the formula for the student model's softened output.

Then the student model can be trained to minimize the KL divergence between the output of the student and the ground truth labels, as well as the softened output of the student and the softened output of the teacher. Therefore, the loss function used in distillation is defined as

where $\boldsymbol{q}=[q_1,\dots,q_{N_C}]^T$, $\boldsymbol{p}=[p_1,\dots,p_{N_C}]^T$, and $\boldsymbol{t}$ refers to the one-hot encoded ground truth vector for each sample, while the KL divergence $D_{KL}$ is defined as

Note that the gradients for the KL divergence term for the soft targets scale by $T^2$. Therefore this term in Eq.(4) is multiplied by $\frac{1}{T^2}$ to ensure that both KL terms have about the same weight during the optimization. If needed, the weight of the soft targets can be increased or decreased by altering the value of hyperparameter $\alpha$ in Eq.(4).

The passage above gives the loss function used in distillation, which is built from the KL divergence. For a detailed explanation see: Kullback-Leibler Divergence.
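A minimal PyTorch-style sketch of this loss (the function name `kd_loss` is ours; the $1/T^2$ rescaling of the soft term and the weight $\alpha$ follow the description above, while many reference implementations scale the soft term by $T^2$ instead):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=1.0):
    """Distillation loss: KL between the softened student and teacher outputs
    plus the usual cross-entropy with the ground truth labels."""
    # Softened distributions at temperature T
    log_q_T = F.log_softmax(student_logits / T, dim=1)  # student (log-probabilities)
    p_T = F.softmax(teacher_logits / T, dim=1)          # teacher soft targets

    # KL(p^(T) || q^(T)), averaged over the batch, rescaled as described in the text
    soft_term = F.kl_div(log_q_T, p_T, reduction="batchmean") / (T * T)

    # Cross-entropy with the hard labels, i.e. the KL term against the one-hot targets
    hard_term = F.cross_entropy(student_logits, targets)

    return alpha * soft_term + hard_term
```

During training, `kd_loss` simply replaces the usual cross-entropy in the optimization loop, with the teacher's logits computed in advance or on the fly with gradients disabled.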

The student model is trained by minimizing the loss defined in Eq.(4):

where $\boldsymbol{x}_i$ and $\boldsymbol{t}_i$ refer to the samples and targets of the training set and $N$ is the number of samples in the training set.

The passage above gives the update rule for training the student model with the loss function defined above.

Unfortunately, there is no rule for selecting the most appropriate value for $T$, except for the observation that typically values between 2 and 10 work the best for most practical problems.

For more difficult problems, smaller values of $T$ that are closer to 1 sometimes work better.

More specifically, it has been shown that if the logits have been zero-meaned and the temperature is high enough, then minimizing $D_{KL}(\boldsymbol{p}^{(T)}, \boldsymbol{q}^{(T)})$ is equivalent to minimizing $\sum_{i=1}^{N_C}([g_{\boldsymbol{W}}^{logits}(\boldsymbol{x})]_i − [f^{logits}(\boldsymbol{x})]_i)^2$ (up to a scaling factor).

The passage above gives some practical observations and results concerning the hyperparameter $T$.

8.3 Probabilistic knowledge transfer

This section presents a generalization of the method of the previous section (neural network distillation); it provides a general probabilistic view of knowledge distillation that goes beyond classification tasks and overcomes some significant limitations of the earlier approach.


This section focuses on Probabilistic Knowledge Transfer (PKT), which first models the probability distribution of the samples in the teacher space and then trains a student which mimics this distribution. More specifically, PKT works by modeling the pairwise relations between samples, that is, the distribution that expresses how probable it is to select the sample $i$ if we have already selected the sample $j$. Therefore, PKT models the probability $p_{i|j}$ instead of the probability $P_{c|i}$, that is, the probability that the $i$-th sample belongs to class $c$, which is used in the regular neural network distillation process. In this way, PKT removes the dependence on class labels and can be applied to any kind of problem, ranging from metric learning to regression problems.

Then any divergence metric can be used to measure the distance between the teacher and student distributions. Note that PKT is a fully unsupervised method, even though it can be combined with any other supervised loss.

In this chapter, we will first analytically derive the PKT method and discuss how the student model can be trained in order to mimic the behavior of the teacher model. Then we provide further insight into the way the PKT method works by discussing connections of PKT with an information-theoretic metric, the mutual information.

The passage above introduces Probabilistic Knowledge Transfer (PKT), which trains the student model to mimic the teacher's behavior by modeling probabilistic relations between samples. PKT does not depend on class labels and can be applied to a wide range of problems; it is a fully unsupervised method that can nevertheless be combined with any supervised loss, and different divergence measures can be used to quantify the distance between the teacher and student distributions.

The way the PKT method works is summarized in Fig. 8.4.

[Fig. 8.4: overview of the PKT method]

First, we employ a teacher network $f(\cdot) \in \mathbb{R}^{N_t}$ to extract a representation for each input sample. The student model is denoted by $g_{\boldsymbol{W}}(\cdot) \in \mathbb{R}^{N_s}$. The dimensionality of the embedding space of the student ($N_s$) and of the teacher model ($N_t$) can be different.

This prohibits the use of simpler methods for transferring the knowledge from the teacher model into the student, for example, by directly matching the representation of the teacher to the representation of the student, highlighting the need for more advanced methods that can transfer the knowledge between spaces of different dimensionality.

During the process of knowledge distillation the parameters $\boldsymbol{W}$ of the student model are optimized to “mimic” the behavior of the teacher model. It is also worth noting that there is actually no constraint on what the functions $f(\cdot)$ and $g_{\boldsymbol{W}}(\cdot)$ are. We only need the output value of the teacher $f(\cdot)$ for every sample fed to the student $g_{\boldsymbol{W}}(\cdot)$ and a way to optimize the parameters of $g_{\boldsymbol{W}}(\cdot)$ to minimize the given objective function.

The passage above summarizes how PKT works.

To simplify the notation used in this section, we let $\boldsymbol{x}_i = f(\boldsymbol{t}_i)$ denote the representation extracted from the teacher model and $\boldsymbol{y}_i = g_{\boldsymbol{W}}(\boldsymbol{t}_i)$ denote the representation extracted from the student model, where $\boldsymbol{t}_i$ is a transfer sample contained in the transfer set $\mathcal{T} = \{\boldsymbol{t}_1, \boldsymbol{t}_2, \dots, \boldsymbol{t}_N\}$ used for knowledge distillation. Note that, as before, $N$ denotes the number of samples used for knowledge distillation. Also, we use two continuous random variables $X$ and $Y$ to model the distribution of the teacher and student networks, respectively.

The passage above introduces more compact notation.

These joint densities in the embedding space of the student and teacher can be estimated using Kernel Density Estimation (KDE) as

Note that a kernel function $K(\boldsymbol{a}, \boldsymbol{b}; \sigma^2_t)$, with width $\sigma_t$, that measures the similarity between two vectors $\boldsymbol{a}$ and $\boldsymbol{b}$ is needed for estimating these probabilities. Minimizing the divergence between the probability distributions defined by the representation of the samples in the student and teacher space allows for transferring the knowledge from the teacher to the student model, since it allows for learning a feature space that maintains the same relations that appear in the teacher model. In other words, minimizing the divergence between the distribution of the teacher $\mathcal{P}$ and the distribution of the student $\mathcal{Q}$ aims to learn a space where each sample will have the same neighbors in both the teacher and student spaces. Therefore this process implicitly models the manifold of the data in the teacher space and tries to recreate it in the (potentially lower dimensional) student space. Also, note that class labels are not required during this process, effectively allowing for performing this process with fully unlabeled transfer sets.

The passage above gives the kernel density estimator for the joint densities in the student and teacher embedding spaces, using a kernel of width $\sigma_t$. Minimizing the divergence between the two distributions transfers the knowledge from the teacher to the student, and since no class labels are needed, the process is fully unsupervised.

PKT employs the conditional probability distribution instead of the joint one. To understand why this makes the optimization easier, despite that in both of these cases the divergence between the probability distributions is minimized when the kernel similarities are equal for both models, we need to consider the following. The conditional probability distribution models how probable it is for each sample to select each of its neighbors, which in turn allows for more precisely describing the local regions between the samples, instead of the global ones. This is also the reason why the conditional probabilities have been employed for dimensionality reduction methods, such as the t-SNE algorithm.

The passage above explains that PKT actually uses the conditional probability distribution rather than the joint one, because the conditional probabilities model how likely each sample is to select each of its neighbors, which describes the local regions around the samples more precisely.

Therefore, the teacher’s conditional probability distribution can be estimated as

and the student’s distribution as

Note that these probabilities are limited to the range $[0,1]$ and they sum to $1$, that is, $\sum_{i=1, i\neq j}^{N} p_{i|j} = 1$ and $\sum_{i=1, i\neq j}^{N} q_{i|j} = 1$.

The above gives the formulas for the conditional probability distributions of the teacher and student models.

Selecting the most appropriate kernel for estimating these probabilities is critical for the effectiveness of the process of knowledge transfer. Typically, the Gaussian kernel is used to this end, which is defined as

where the notation $\Vert\cdot\Vert_2$ is used for the $l^2$ norm of a vector, while the width (scaling factor) of the kernel is denoted by $\sigma$.

All the probability formulas above contain a kernel $K(\boldsymbol{a},\boldsymbol{b};\sigma)$. The passage above notes that the Gaussian kernel, defined here, is the most common choice in general (although, as discussed next, it is not the kernel PKT actually uses).
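For reference, the Gaussian kernel is usually written in the standard form (consistent with the notation above):

$$K(\boldsymbol{a}, \boldsymbol{b}; \sigma) = \exp\left( -\frac{\Vert \boldsymbol{a} - \boldsymbol{b} \Vert_2^2}{2\sigma^2} \right).$$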

This formulation leads to the regular Kernel Density Estimation (KDE) method for estimating the conditional probabilities, which, unfortunately, is not very effective in high-dimensional spaces, requiring the tedious and careful tuning of the widths of the kernels. Several heuristics have been proposed to overcome this issue. PKT mitigates this issue by employing a more robust formulation that does not require extensive tuning. Therefore, instead of using a Gaussian kernel, a cosine-based affinity metric is used:

This formulation allows for avoiding finetuning the kernel’s bandwidth, while providing a more robust affinity estimation, since it is established that for many tasks Euclidean metrics are inferior to cosine-based ones.

The passage above points out that using a Gaussian kernel to estimate the conditional distributions is not very effective in high-dimensional spaces and requires tedious, careful tuning of the kernel widths. PKT therefore uses a cosine-based affinity instead of a Gaussian kernel; this avoids tuning the kernel bandwidth and is more robust than Euclidean metrics.

Furthermore, there are several ways to calculate the divergence between the teacher distribution $\mathcal{P}$ and student distribution $\mathcal{Q}$, which is required to effectively transfer the knowledge between the models. PKT employs the Kullback–Leibler (KL) divergence:

The above gives the divergence used between the student and teacher distributions, namely the KL divergence, together with its formula.

Note that in practice, we employ the transfer set to approximate the distributions $\mathcal{P}$ and $\mathcal{Q}$. Therefore the loss function that can be used for optimizing the student is defined as

The above gives the loss function used in practice which, on closer inspection, is simply the KL divergence evaluated over the transfer set.
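A minimal PyTorch-style sketch of this loss for a single batch (the cosine-based affinity $K(\boldsymbol{a}, \boldsymbol{b}) = \frac{1}{2}(\cos(\boldsymbol{a}, \boldsymbol{b}) + 1)$ and the small constants for numerical stability follow the common PKT reference implementation and should be treated as assumptions):

```python
import torch
import torch.nn.functional as F

def pkt_loss(teacher_feats, student_feats, eps=1e-7):
    """Probabilistic Knowledge Transfer loss for one batch.

    teacher_feats: (N, N_t) representations from the (frozen) teacher
    student_feats: (N, N_s) representations from the student (N_s may differ from N_t)
    """
    # Cosine-based affinities in [0, 1]: K(a, b) = 0.5 * (cos(a, b) + 1)
    t = F.normalize(teacher_feats, p=2, dim=1)
    s = F.normalize(student_feats, p=2, dim=1)
    K_t = 0.5 * (t @ t.t() + 1.0)
    K_s = 0.5 * (s @ s.t() + 1.0)

    # Zero out self-similarities so that the conditional probabilities run over i != j
    mask = 1.0 - torch.eye(K_t.size(0), device=K_t.device)
    K_t = K_t * mask
    K_s = K_s * mask

    # Conditional probabilities: each row is normalized to sum to 1
    p = K_t / (K_t.sum(dim=1, keepdim=True) + eps)  # teacher distribution p_{i|j}
    q = K_s / (K_s.sum(dim=1, keepdim=True) + eps)  # student distribution q_{i|j}

    # KL(P || Q), averaged over the batch
    return (p * torch.log((p + eps) / (q + eps))).sum(dim=1).mean()
```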

Note that minimizing the divergence for pairs of points that are closer in the teacher space is more important than for the distant ones, as shown in Eq.(13), since KL is an asymmetric distance metric. Therefore, keeping the local geometry of the embedding space is more important in this formulation than accurately mimicking the relations between all the samples (including very distant ones). This provides more flexibility during the training process, which is especially important when more lightweight student models, which are not capable of fully recreating the global geometry of the teacher, are used. If the whole geometry should be maintained, then symmetric distance metrics could be used, for example, the quadratic divergence measure:

The metrics can also be further adapted to the needs of each application.

The passage above explains that minimizing the divergence for pairs of points that are close in the teacher space matters more than for distant pairs, because KL is an asymmetric divergence. In this formulation, preserving the local geometry of the embedding space is therefore more important than accurately reproducing the relations between all samples (including very distant ones), which gives the training process more flexibility.

This is especially important when more lightweight student models are used, since they are not capable of fully recreating the teacher's global geometry. If the whole geometry should be preserved, a symmetric distance metric can be used instead, for example the quadratic divergence measure, whose formula is given above.

Again, similar to the neural network distillation methods presented in Section 8.2, the student model is trained by minimizing the loss defined in Eq.(13):


where $\frac{\partial \boldsymbol{y}_l}{\partial \boldsymbol{W}}$ is the partial derivative of the student model with respect to its parameters.

The above describes how the PKT student model is trained and the corresponding backpropagation step.

In practice, calculating the loss by employing the whole training set is infeasible, since it has quadratic complexity and requires quadratic space with respect to the number of training samples. Therefore PKT calculates the loss function described in Eq. (13) in small batches (typically of 64–256 samples). This process is similar to Nyström-based methods that approximate the full similarity matrix of the data, which can indeed speed up convergence, without negatively affecting the optimization. However, note that it is crucial to shuffle the training set during each epoch in order to ensure that different pairs of samples are used for each optimization epoch when calculating the loss function.

The passage above notes that, because the loss has quadratic complexity, it is not computed (and backpropagated) over the whole training set; instead it is computed in small batches, typically of 64–256 samples. This is similar to Nyström-based approximations of the full similarity matrix and can speed up convergence without hurting the optimization. The important point is to shuffle the training set every epoch so that different pairs of samples are used when computing the loss.
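In practice this can look like the following sketch (it assumes the `pkt_loss` function from the sketch above, a pretrained `teacher`, a `student` to be trained, and a `transfer_set` Dataset; `shuffle=True` implements the per-epoch reshuffling mentioned above):

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(transfer_set, batch_size=128, shuffle=True)  # reshuffled every epoch
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
num_epochs = 50  # illustrative value

teacher.eval()
for epoch in range(num_epochs):
    for x in loader:
        with torch.no_grad():
            t_feats = teacher(x)            # teacher representations (kept fixed)
        s_feats = student(x)                # student representations
        loss = pkt_loss(t_feats, s_feats)   # batch-wise approximation of the PKT loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```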

More specifically, PKT also aims to maintain the same amount of mutual information (MI), which is a measure of dependence between random variables, between the representation learned by the student model and a set of (possibly unknown) labels as the teacher model does. To understand this, let us use the notation $C$ to denote a discrete random variable that corresponds to a possibly unknown attribute of the samples. For example, in the case of classification, this can be the labels of the samples. Therefore, for each representation extracted from the teacher model $\boldsymbol{x}$ drawn from $X$ we also have an associated label $c$. MI allows us to quantify how much the uncertainty regarding the class labels is reduced after observing $\boldsymbol{x}$. To define MI, we need to also define the probability density function of the joint distribution between the extracted representations and the labels $p(\boldsymbol{x}, c)$. Then we can measure MI for the teacher model as

It is worth noting that MI can be also viewed as the KL divergence between the joint density $p(\boldsymbol{x}, c)$ and the product of $p(\boldsymbol{x})$ and $P (c)$.

The passage above states that PKT also aims to maintain the same amount of mutual information, a measure of the dependence between random variables, between the learned representation and a set of (possibly unknown) labels as the teacher model maintains. MI quantifies how much the uncertainty about the class labels is reduced after observing $\boldsymbol{x}$. The teacher model's mutual information is then defined using the joint probability density between the extracted representations and the labels.

MI can also be viewed as the KL divergence between the joint density $p(\boldsymbol{x},c)$ and the product of $p(\boldsymbol{x})$ and $P(c)$.

Another variant of MI, the Quadratic Mutual Information (QMI) has been proposed to more efficiently estimate MI between a continuous variable and a discrete variable. QMI is calculated by using a quadratic divergence measure instead of KL divergence as

It is easy to see that

The above introduces a variant of MI, the Quadratic Mutual Information (QMI), and gives its formula.

QMI uses a quadratic divergence measure instead of the KL divergence, which makes it possible to estimate the MI between a continuous variable and a discrete variable more efficiently.

Then we can define certain quantities, called information potentials, based on the terms that appear in this formula. These quantities are defined as

These information potentials express different aspects regarding the interactions between the data samples, which will become more clear later in this section. As a result, QMI can be expressed in terms of these quantities as

The above groups the different terms appearing in this formula into named quantities called information potentials.
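For orientation, in the standard quadratic mutual information formulation (our reconstruction, with $I_T$ denoting the teacher's QMI; the notation of the book's equations may differ slightly) these quantities read

$$I_T(X, C) = \sum_{p=1}^{N_C} \int \left( p(\boldsymbol{x}, c_p) - P(c_p)\, p(\boldsymbol{x}) \right)^2 d\boldsymbol{x},$$

$$V_{IN} = \sum_{p=1}^{N_C} \int p(\boldsymbol{x}, c_p)^2\, d\boldsymbol{x}, \quad V_{ALL} = \sum_{p=1}^{N_C} P(c_p)^2 \int p(\boldsymbol{x})^2\, d\boldsymbol{x}, \quad V_{BTW} = \sum_{p=1}^{N_C} P(c_p) \int p(\boldsymbol{x}, c_p)\, p(\boldsymbol{x})\, d\boldsymbol{x},$$

so that the QMI decomposes as $I_T(X, C) = V_{IN} + V_{ALL} - 2 V_{BTW}$.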

If we assume that there is a total of $N_C$ different attributes (classes) for the data and that there is a total of $J_p$ samples for each class, then we can estimate the class prior as

Also, using kernel density estimation we can estimate the joint densities as

where we used the notation $\boldsymbol{x}_{pj}$ to denote the $j$th sample of the $p$th class. The probability density of the representation can be trivially estimated as

The above introduces some further assumptions about the data (the number of classes and the number of samples per class) and defines the estimators for $p(\boldsymbol{x}, c_p)$ and $p(\boldsymbol{x})$ that are used in the calculations that follow.

By using these estimations, the information potentials for the teacher model can be calculated as:

The kernel function $K(\boldsymbol{x}_i, \boldsymbol{x}_j ; \sigma^2)$ measures the interaction between two samples $i$ and $j$. Also, the information potentials actually calculate a weighted average over the interactions between these data pairs. That is, $V_{IN}$ measures the in-class interactions, $V_{ALL}$ the interactions among all the samples, while $V_{BTW}$ quantifies the interactions of all the samples against each class. Following a similar approach, we can also define the information potentials for the student network,

Of course, in the case of regular KDE, different widths $\sigma_t$ and $\sigma_s$ should be used for the teacher and the student model, after appropriately tuning them.

The above gives the detailed formulas for computing the QMI information potentials from the estimators introduced earlier (number of classes, samples per class).

It explains the meaning of $V_{IN}$, $V_{ALL}$, and $V_{BTW}$.

It also gives the corresponding formulas for both the teacher and the student model; when regular KDE is used, the kernel widths of the two models differ and must be tuned separately.

The knowledge can be transferred between the teacher and student models by maintaining the same amount of MI between the two random variables $X$ and $Y$ with respect to their class labels, that is, $I (X, C) = I (Y, C)$. If we employ the aforementioned QMI-based formulation, then the respective information potentials must be equal between the two models. This in turn implies that the joint densities provided in Eq.(7) must also be equal to each other, which can be achieved by solving the same minimization problem as in Eq.(13). It is worth noting that this is not the only configuration that maintains the same MI between the models. Other configurations might exist as well. However, it is enough to use one of them to transfer the knowledge.

The passage above explains that, with the QMI-based formulation, the corresponding information potentials of the two models must be equal, which in turn implies that the densities $p_{i|j}$ and $q_{i|j}$ must also match; this can be achieved by solving the same minimization problem as before. Preserving the same MI can also be achieved with other configurations, but using any one of them is enough to transfer the knowledge.

There are a number of other related representation-learning based distillation methods that have been proposed following PKT, which model different kinds of interactions between the transfer samples. For example, in [1] instead of measuring the conditional probabilities, the proposed relative teacher approach employs the distance between the training samples. Then the pairwise distances between the training samples are matched using a divergence measure. Therefore the following loss function is used:

A similar approach was also proposed in [2], where the cosine similarity was used instead of the Euclidean distance, and the Huber loss was used for the similarity matching, as well as in [3], where the Frobenius norm was used instead.

The passage above lists several related methods, which essentially differ in the loss function they use.

8.4 Multilayer knowledge distillation

The methods of the previous two sections transfer knowledge through a single layer; this section introduces multilayer distillation, which further improves effectiveness.


The two methods presented in the previous two sections focus on transferring the knowledge using only the output layers of the teacher and student networks. However, DL models encode useful knowledge in their intermediate layers as well. This has led to the development of a number of methods that focus on employing these intermediate layers to further improve the effectiveness of knowledge distillation. In this section, we present different methods that can perform this kind of multilayer distillation.

The above states the main content of this section: different methods for multilayer distillation.

8.4.1 Hint-based distillation

Hint-based distillation was among the first methods capable of performing knowledge distillation by employing several intermediate layers of a network [4]. Hint-based distillation is summarized in Fig. 8.5 and works by

  • selecting a number of intermediate layers from the teacher and student
  • imposing an appropriate loss in order to distill the knowledge encoded in this intermediate layer of the teacher to the corresponding layer of the student.

[Fig. 8.5: overview of hint-based distillation]

This is achieved by introducing the following loss term:

where the notation $\boldsymbol{x}_i^{(l)}$ is used to refer to the $l$th layer of the teacher network, $\boldsymbol{y}_i^{(m)}$ to the $m$th layer of the student network, and $r(\cdot)$ is a regressor that allows for matching the dimensionality between $\boldsymbol{x}_i^{(l)}$ and $\boldsymbol{y}_i^{(m)}$.

The above explains how hint-based distillation works and gives the loss term that implements it.

For example, when transferring the knowledge from a fully connected layer with $N_l$ dimensions to a layer with $N_m$ dimensions, the regressor $r(\cdot)$ can be defined as

where the projection matrix of the regressor is defined as $\boldsymbol{W}_r \in \mathbb{R}^{N_m \times N_l}$. The matrix $\boldsymbol{W}_r$ is optimized along with the rest of the parameters of the student model.

Note that the additional parameters introduced by $\boldsymbol{W}_r$ to the student model are not part of it, since they merely exist to guide (or provide hints to) the learning process. After the student network has been trained, these parameters can be removed. The loss provided by Eq.(29) is typically combined with the regular neural network distillation loss.

The above gives the formula for the regressor $r(\cdot)$ and explains the role of the matrix $\boldsymbol{W}_r$.
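A minimal sketch of this setup for fully connected representations (the class and function names are ours; here the regressor is applied on the student side and maps its representation into the teacher's hint space, which is one common convention for placing $r(\cdot)$):

```python
import torch
import torch.nn as nn

class HintRegressor(nn.Module):
    """Linear regressor r(.) mapping the student's m-th layer output
    (n_student dims) into the teacher's l-th layer space (n_teacher dims).
    Its parameters only guide training and are discarded afterwards."""
    def __init__(self, n_student: int, n_teacher: int):
        super().__init__()
        self.proj = nn.Linear(n_student, n_teacher, bias=False)  # projection matrix

    def forward(self, y_m: torch.Tensor) -> torch.Tensor:
        return self.proj(y_m)

def hint_loss(x_l: torch.Tensor, y_m: torch.Tensor, regressor: HintRegressor) -> torch.Tensor:
    """Squared L2 distance between the teacher hint x^(l) and the regressed
    student representation r(y^(m)), averaged over the batch."""
    return ((x_l - regressor(y_m)) ** 2).sum(dim=1).mean()
```

For convolutional feature maps, the linear layer would typically be replaced by a 1x1 convolution that matches the number of channels, as noted later in this subsection.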

Selecting the most appropriate sets of layers $l$ and $m$ is crucial for improving the performance of the student network when hint-based distillation is used.

If an early layer of the teacher model is matched with a too late layer of the student network, then the student network might suffer from overregularization.

On the other hand, if a layer of the teacher is matched with a too early layer of the student, then the student might collapse the representation in this layer [5], removing useful information too early, and thus reducing the accuracy of the model.

Multiple pairs of intermediate layers can also be used, to further improve the performance of the student.

However, it should be again stressed that using too many pairs of layers imposes a strong regularization effect on the network, which can eventually lead to decreasing the performance of the network [4].

Finally, using fully connected regressors can be prohibitively expensive in the case of convolutional neural networks. In this case, a convolutional regressor is used aiming to match the number of channels and size of the feature maps extracted from both the teacher and student. Afterwards, Eq.(29) can be directly applied on the extracted feature maps.

The above explains that choosing suitable layers $l$ and $m$ is crucial for improving the student model's performance with hint-based distillation, and gives two reasons why.

Multiple pairs of intermediate layers can be used to further improve the student's performance, but using too many pairs imposes a strong regularization effect on the network.

For convolutional networks, fully connected regressors are prohibitively expensive; a convolutional regressor is used instead, which matches the number of channels and the size of the feature maps extracted from the teacher and the student.

8.4.2 Flow of solution procedure distillation

Another approach to introduce knowledge regarding the way the teacher model works to the student is by employing the Flow of Solution Procedure (FSP) matrix [6]. The FSP matrix describes how a network transforms its representation across two different layers.

The above gives the core idea of the FSP method.

Typically, the FSP matrix is calculated using the output of two convolutional layers. The FSP matrix between the $l$th and the $m$th layers of the teacher network is defined as

where $x^{(l)} \in \mathbb{R}^{H \times W \times N_l}$ is the output feature map of the $l$th layer of the teacher model and $x^{(m)} \in \mathbb{R}^{H \times W \times N_m}$ is the output feature map of the $m$th layer of the teacher model.

Note that in contrast with the hint-based approach, where the notation $l$ and $m$ was used to refer to the layers of the teacher and student, respectively, in this section we use this notation to refer to layers of the same (teacher) network.

The dimensionality of the FSP matrix depends on the number of channels in both feature maps, that is, $\boldsymbol{G}^{(T)} \in \mathbb{R}^{N_l \times N_m}$. Note that in this way FSP matrix provides information regarding the way the teacher transforms the channels across intermediate layers, encoding in this way knowledge regarding the way the teacher model works.

The above gives the expression for the FSP matrix of the teacher network.

Similarly, we can calculate the FSP matrix $\boldsymbol{G}^{(S)}$ for the student model. Then the FSP loss is defined as

where the notation $\boldsymbol{G}^{(T)}(\boldsymbol{t}_i)$ and $\boldsymbol{G}^{(S)}(\boldsymbol{t}_i)$ is used to refer to the FSP matrix calculated when the $i$th transfer sample is fed to the network and $\Vert\cdot\Vert_F$ refers to the Frobenius norm of a matrix.

This process is summarized in Fig. 8.6. Similar to Eq.(29), this loss can be defined for multiple $l-m$ pairs of the teacher network and $l’-m’$ pairs of the student network.

[Fig. 8.6: overview of FSP-based distillation]

The above defines $\boldsymbol{G}^{(S)}$, the FSP matrix of the student network (defined analogously to the teacher's), and gives the FSP loss based on the two matrices.
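A sketch of how the FSP matrix and the corresponding loss could be computed for a pair of feature maps (the normalization by $H \cdot W$ and the squared Frobenius distance follow the usual formulation of this method and should be treated as assumptions):

```python
import torch

def fsp_matrix(feat_a, feat_b):
    """FSP matrix between two feature maps with the same spatial size.

    feat_a: (B, C_a, H, W) output of one layer
    feat_b: (B, C_b, H, W) output of a later layer
    returns: (B, C_a, C_b) batch of FSP matrices (channel inner products
             averaged over spatial positions)
    """
    b, c_a, h, w = feat_a.shape
    _, c_b, _, _ = feat_b.shape
    a = feat_a.reshape(b, c_a, h * w)
    c = feat_b.reshape(b, c_b, h * w)
    return torch.bmm(a, c.transpose(1, 2)) / (h * w)

def fsp_loss(teacher_a, teacher_b, student_a, student_b):
    """Squared Frobenius distance between the teacher and student FSP matrices
    (their channel dimensions must match, as discussed below)."""
    g_t = fsp_matrix(teacher_a, teacher_b)
    g_s = fsp_matrix(student_a, student_b)
    return ((g_t - g_s) ** 2).sum(dim=(1, 2)).mean()
```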

A limitation of this approach is that it requires the pairs of layers between which the knowledge is transferred to have the same number of channels between the teacher and student. That is, the $l$th layer of the teacher must have the same dimensionality (number of channels) as the $l'$th layer of the student (and likewise the $m$th layer of the teacher and the $m'$th layer of the student), since the loss provided in Eq.(32) can only be calculated between FSP matrices with the same dimensions.

In practice, this often limits the application of the FSP approach to residual neural networks [7], where we employ the FSP loss between sets of residual modules. For example, five residual modules could be used for calculating the FSP matrix for the teacher and two residual modules could be used for the student, as further discussed in [6], in order to ensure that the FSP matrix of the student and the teacher will have the same dimensions.

Furthermore, it is usually suggested to apply the FSP procedure in two steps:

  1. first optimize the networks using the FSP loss
  2. apply the loss that corresponds to the final task [6].

The above states the limitation of the FSP method (the paired layers must have the same dimensionality), which in practice restricts its application mostly to residual networks, and describes the usual way of working around this limitation.

8.4.3 Other multilayer distillation methods

Apart from hint-based distillation and FSP-based distillation, several other methods have been proposed to exploit the knowledge provided by the intermediate layers of the teacher model.

For example, attention transfer distills the knowledge of intermediate layers by training a student model that mimics the attention maps of the teacher model [8].

Another approach includes the so-called Factor Transfer (FT), which employs a paraphraser module to extract the knowledge factors from any layer of the teacher that are then transferred through an additional translator module [9].

Also, in [10] has been proposed to employ the activation boundaries, that is, the separating hyperplanes formed by the neurons, to transfer the knowledge from the teacher to student.

Furthermore, in [11], the PKT method has been extended by allowing for effectively matching the distribution of intermediate layers.

For all the aforementioned methods, it is crucial to correctly match the layer of the teacher from which the knowledge is mined to the layer of the student to which the knowledge is transferred. This can be trivial for compatible architectures, that is, ResNets with the same number of blocks [7], but challenging for more heterogeneous distillation setups. Indeed, as demonstrated in [5], an incorrect matching between these layers can either over-regularize the student or lead to an early representation collapse, negatively impacting the performance of the student model. To overcome this issue, an intermediate network, which has an architecture compatible with the student but is more powerful than it, is employed in [5], allowing for reducing the gap between the layers.

8.5 Teacher training strategies

This section presents more advanced strategies for training the teacher model, which enable more effective distillation schemes, for example online distillation, where the teacher and student are trained simultaneously, and self-distillation, which does not use a separate teacher model and instead reuses knowledge extracted from the student itself.


The previous sections focused on deriving more effective distillation approaches for extracting the knowledge from the teacher model and transferring it to the student model.

However, the teacher model can also significantly affect the effectiveness of the distillation process. For example, recent evidence suggests that the teacher-student gap can indeed affect the distillation efficiency [40]. In other words, using a very powerful teacher is not the optimal choice when training a very lightweight student. In these cases, a smaller teacher, which is possibly less accurate, yet simpler, is usually a better choice.

The above points out that improving the effectiveness of distillation also requires paying attention to the teacher model itself.

For a very lightweight student, a simpler teacher (even if its accuracy is somewhat lower) is usually the better choice.

There are three main categories of distillation methods based on how the teacher model is trained:

  • offline distillation
  • online distillation
  • self-distillation.

These approaches are depicted in Fig. 8.7.

Offline distillation refers to the standard setup that we have considered until now, that is, the teacher model is trained beforehand, its weights are fixed and then the knowledge is transferred to the student model, while in online distillation the teacher is trained along with the student model. A special case of offline distillation has been proposed in [12], where several teachers are trained one-by-one using distillation, aiming at gradually arriving to a smaller student model, reducing in this way the teacher-student gap. Furthermore, ensembles of multiple teacher models have also been considered for improving the effectiveness of distillation, by increasing the diversity of the knowledge encoded by the teacher [13].

Finally, in self-distillation the student itself is used to extract useful knowledge for improving its accuracy, that is, the student also acts as a teacher, as shown in Fig. 8.7. There are several recent self-distillation methods proposed in the literature. In its simplest form, self-distillation methods transfer the knowledge from a previous version of the student to a newer generation [14], while this process can be repeated multiple times, potentially further increasing the accuracy of the later generations of the network. Also, in [15], the knowledge is transferred from the later layers of a DL model into its earlier layers, allowing for learning more discriminative early layers. A similar approach is also followed in [16], where it is combined with an early exit strategy [17], which enables dynamic adaptations of the computational graph of the model to the available computational resources [18].

[Fig. 8.7: offline distillation, online distillation, and self-distillation]

The above presents the three categories of distillation schemes based on how the teacher model is trained.

References

[1] L. Yu, V.O. Yazici, X. Liu, J.v.d. Weijer, Y. Cheng, A. Ramisa, Learning metrics from teachers: compact networks for image embedding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2907–2916.

[2] W. Park, D. Kim, Y. Lu, M. Cho, Relational knowledge distillation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3967–3976.

[3] F. Tung, G. Mori, Similarity-preserving knowledge distillation, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1365–1374.

[4] N. Passalis, M. Tzelepi, A. Tefas, Heterogeneous knowledge distillation using information flow modeling, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2339–2348.

[5] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 630–645.

[6] J. Yim, D. Joo, J. Bae, J. Kim, A gift from knowledge distillation: fast optimization, network minimization and transfer learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4133–4141.

[7] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 630–645.

[8] S. Zagoruyko, N. Komodakis, Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer, in: Proceedings of the International Conference on Learning Representations 2017, 2017.

[9] J. Kim, S. Park, N. Kwak, Paraphrasing complex network: network compression via factor transfer, in: Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 2760–2769.

[10] B. Heo, M. Lee, S. Yun, J.Y. Choi, Knowledge transfer via distillation of activation boundaries formed by hidden neurons, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3779–3787.

[11] N. Passalis, M. Tzelepi, A. Tefas, Multilayer probabilistic knowledge transfer for learning image representations, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2020, pp. 1–5.

[12] S.I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, H. Ghasemzadeh, Improved knowledge distillation via teacher assistant, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 5191–5198.

[13] G. Panagiotatos, N. Passalis, A. Iosifidis, M. Gabbouj, A. Tefas, Curriculum-based teacher ensemble for robust neural network distillation, in: Proceedings of the European Signal Processing Conference, 2019, pp. 1–5.

[14] T. Furlanello, Z. Lipton, M. Tschannen, L. Itti, A. Anandkumar, Born again neural networks, in: Proceedings of the International Conference on Machine Learning, 2018, pp. 1607–1616.

[15] L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, K. Ma, Be your own teacher: improve the performance of convolutional neural networks via self distillation, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3713–3722.

[16] M. Phuong, C.H. Lampert, Distillation-based training for multi-exit architectures, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1355–1364.

[17] N. Passalis, J. Raitoharju, A. Tefas, M. Gabbouj, Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits, Pattern Recognition 105 (2020) 107346.

[18] N. Passalis, J. Raitoharju, A. Tefas, M. Gabbouj, Adaptive inference using hierarchical convolutional bag-of-features for low-power embedded platforms, in: Proceedings of the IEEE International Conference on Image Processing, 2019, pp. 3048–3052.