The English passages are excerpts from the book; the rest of the text is my own understanding and summary.

0 Preface

With this book the reader will be able to:

  • understand deep learning principles and methodologies
  • understand the principles of applying end-to-end learning in robotics applications
  • design and train deep learning models
  • apply deep learning in several robot vision tasks such as object recognition, image classification, video analysis, and others
  • understand how to use robotic simulation environments for training deep learning models
  • understand how to use deep learning for different tasks, ranging from planning and navigation to biosignal analysis

1 Introduction

1.1 Artificial intelligence and machine learning

What are the real-world problems that AI can help us solve?

  • One approach to solve real world problems with machines is by trying to mimic the way humans are solving these problems.

    it is not the focus of this book.

  • The second approach is to perceive the environment using available sensors (e.g., cameras, microphones, etc.) and use this raw data in order to represent the world and make decisions and take actions.

    This approach is mostly used in statistical machine learning and pattern recognition

The deep learning approach is based on building large computational models that use as building block the artificial neuron or perceptron and its variants and are able to adapt to the data by changing their parameters in order to solve real world problems


This chapter covers what deep learning is and what artificial intelligence is, and explains which real-world problems AI can help us solve (and how): input, processing (make a decision and take an action), output.

It also presents two approaches to solving real-world problems: the symbolic approach (mimicking the way humans solve problems) and the statistical approach (statistical analysis of raw data). This book mainly uses the latter.

Finally, it notes that deep learning has already produced results in many fields such as computer vision, speech recognition, and machine translation, and expresses the hope of unifying this research and deploying it on robots.

1.2 Real world problems representation

the way the real world can be represented in order to be able to apply ML for solving specific problems

An actor is able to perform actions usually based on specific input that can be as simple as “someone turned on the machine” or more complex (e.g., video, lidar, etc.) that leads to actions such as “stop the car because a pedestrian was detected to cross the road.”

The environment from which data are acquired and in which all the actions take place provides the context of the task and in many cases should be also represented for solving complex tasks (e.g., a 3D map can be used for robot navigation).

The learning tasks of an actor are related with the real world problems that the actor will have to solve:

  • supervised learning
  • unsupervised learning
  • reinforcement learning

This section is mainly about how to translate real-world problems into corresponding ML problems.

First, a specific input is needed, together with the context of the task (the environment) and the raw data.

It then states that the learning tasks of an actor are related to the real-world problems it must solve, and introduces three learning paradigms: supervised learning, unsupervised learning, and reinforcement learning, with a brief explanation of each.

1.3 Machine learning tasks

We refer to this mapping as f (x) = y, where the sign “=” denotes assignment of the value that f (·) takes when receiving as input x to the variable y.

where $l(·, ·)$ is a loss function quantifying the error corresponding to a mismatch between the output of the parametric function $f (·; \Theta)$ when receiving as input $x_i$ and its (known) class $y_i$​.

This is achieved through a process called training.

Another way to approach the above image classification problem is to formulate it as a regression problem.

Both the above approaches belong to the supervised machine learning category, where the parameters of the model are estimated based on human supervision, that is, each sample in the training set is followed by an expert-defined target (label or vector).

One example of unsupervised machine learning problems is that of data clustering.

Another example of unsupervised learning models is that of the Autoencoder.

The above-described process is the quintessential example of representation learning.

The main idea in self-supervised learning is that an algorithm can use the input data to devise an auxiliary learning task in which supervision is provided by the data itself


This section explains machine learning tasks in detail. It first describes the input (e.g., vectorizing an image) and formulates a machine learning task as a mapping $f(\cdot)$: an input goes through a process that produces an output.

It then details classification and regression as two supervised learning approaches, clustering and autoencoders as two unsupervised learning approaches, and finally introduces the ideas behind self-supervised learning and reinforcement learning.

1.4 Shallow and deep learning

This section first discusses learning (training), starting from the original mapping equation:

It then introduces the loss function:

then gradient descent:

and then, extending to more dimensions, MSE (mean squared error):

The advantages of using the loss function in Eq. (5) include the existence of a unique solution for both linear and nonlinear classification problems and its easy extension to multiclass classification problems by the use of class-indicator vectors $t_i, i=1,\dots,C$

It then introduces a nonlinear function, the sigmoid function $\sigma(\alpha) = \frac{1}{1 + \exp(-\alpha)}$, which maps values into the interval $[0, 1]$; it can be interpreted probabilistically and viewed as a Bayesian model:

where $\alpha_i = \ln \frac{p(\hat{x}_i |C_1)p(C_1)}{p(\hat{x}_i |C_2)p(C_2)} = \ln \frac{p(C_1|\hat{x}_i )}{1−p(C_1|\hat{x}_i )}$ represents the logarithm of the ratio of the posterior probabilities and is known as log odds.

To optimize this model, the cross-entropy loss function is introduced:

This is then extended to multiclass logistic regression (the softmax function) and the multiclass cross-entropy loss:

Finally, it describes the structure of first computing a linear function and then transforming it with a nonlinear function, and points out that this structure is a neural network composed of two layers of neurons. In the neural network context, the nonlinear function of each neuron is called an activation function.
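To make the above concrete, here is a minimal numpy sketch (my own illustration, not code from the book) of binary logistic regression: a linear function followed by the sigmoid, trained with the cross-entropy loss by gradient descent. The data, names (`X`, `y`, `w`, `lr`), and number of steps are all made up.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(p, y):
    # mean binary cross-entropy between predictions p and labels y in {0, 1}
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy data: 4 samples with 2 features each, hypothetical labels
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.5          # learning rate

for _ in range(100):
    p = sigmoid(X @ w + b)           # linear function followed by the nonlinearity
    grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy w.r.t. w
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(cross_entropy(sigmoid(X @ w + b), y))
```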

1.5 Robotics and deep learning

Some of the obstacles for integrating deep learning in robotics are explained below.

  1. The available deep learning open frameworks (e.g., Tensorflow, PyTorch, etc.) are not easily employed in robotics since they have a long learning curve and radically different methodology (end-to-end data-driven learning, from sensing to acting, etc.) than conventional robotics.
  2. The already available deep learning software modules are implemented in order to be deployed on large and expensive GPUs and they rarely perform in real-time even for low resolution input.
  3. Another obstacle for applying deep learning in robotics is the importance of simulation in deep robotic learning and the lack of available open-source robotics simulation environments that allow for deep learning training.

The major tasks of a robotic system can be categorized as follows.

  1. In the first category belong the tasks that are related to robot perception.

    That is, the robot should be able to interact with people and the environment, and thus should be able to perceive people and the environment and acquire the data that will help in representing them with numbers, vectors, graphs, etc.

    The most important tasks that are related to person and environment perception are person/object detection and tracking

    Another important task is the semantic scene segmentation

    The localization and tracking of objects in the 3D space

    Person activity recognition methods

  2. In the second category, we can find tasks that are related to ability of the robot for action, planning, navigation, manipulation, and cognition.


This section first explains the obstacles to integrating deep learning into robotics, mainly:

  1. No ready-made, convenient development frameworks
  2. Compute that cannot support real-time operation
  3. A lack of open-source robotics simulation environments (and of open real-world datasets)

It then explains the categorization of the tasks of a robotic system into two classes:

  1. Robot perception
  2. Robot action, planning, navigation, manipulation, and cognition

2 Neural networks and backpropagation

2.1 Introduction

This section mainly introduces some terminology:

  • GPU —— Graphics Processing Units
  • TPU —— Tensor Processing Units
  • neuron —— also known as node or unit; neurons are found at the input and output of the model, with connections between them
  • layer —— hidden layer
  • decision/prediction —— the output neurons compose the final result

  • classes —— The number of input neurons is the same as the number of data features and receive only a single value, while the number of output neurons is the same as the number of categories to be predicted, known as classes.

  • weight/bias —— Each connection between neurons carries a weight, while each neuron is equipped with an additional bias term.
  • training —— At first, the weights are randomly initialized and they update through an operation called training.
  • backpropagation —— Training ends up finding the appropriate weights and biases, utilizing the backpropagation algorithm, which calculates the derivative of each layer’s function after every pass of the data through the network, in order to determine the changes that need to be made to the network’s weights.
  • iteration/epoch/overfitting —— A single pass of a sample (or a batch of samples) is called an iteration, while a full pass of all the training data is called an epoch. The number of epochs can affect the quality of the predictions, since if it is small the network can underfit the data, while if it is too large, it can lead to overfitting.

The output of a neuron $i$ in layer $l$ is calculated as follows:

where $\alpha_i^l$ is an activation function, which calculates the final output of the neuron $i$ in layer $l$ at a given time. Also, $u_i^l$ is called the propagation function and calculates the total input value at a given time, known as the neuron's state, by adding all the $m$ individual inputs $o^{l-1}$ it receives, after first multiplying them with their corresponding weights $W_i^l$ and adding the corresponding bias $b_i^l$.
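A minimal numpy sketch of this neuron computation (my own illustration; the weight, bias, and input values are made up): the propagation function is a weighted sum plus bias, followed by an activation function.

```python
import numpy as np

def neuron_output(o_prev, w, b, activation=np.tanh):
    """Output of one neuron: activation applied to the propagation function."""
    u = np.dot(w, o_prev) + b   # neuron state: weighted sum of inputs plus bias
    return activation(u)

o_prev = np.array([0.2, -0.5, 0.1])  # outputs of the m neurons in layer l-1
w = np.array([0.4, 0.3, -0.8])       # weights of the connections into neuron i
b = 0.1                              # bias of neuron i
print(neuron_output(o_prev, w, b))
```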

2.2 Activation functions

This section mainly introduces the various activation functions.

Sigmoid:

TanH:

Softmax:

ReLU:

ELU:
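A quick numpy sketch of these activation functions (my own reference implementations, not from the book); the ELU slope parameter `alpha` is assumed to be 1.0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / np.sum(e)

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, softmax, relu, elu):
    print(f.__name__, f(x))
```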

2.3 Cost functions

they aim to minimize a cost function (also known as loss function)

This section mainly introduces the objective (loss) functions.

Mean Squared Error(MSE):

where $|| \cdot||_2$ denotes the $l_2$ norm of a vector. Mean Absolute Error(MAE) is an alternative, which employs the $l_1$​ norm instead.

Mean Absolute Error(MAE):

Huber:

It smooths out the big changes caused by large errors that result from using MSE. It uses both MSE and MAE as follows:

where $\delta$ defines the proportion by which the error approaches the output of MSE and MAE.

Cross Entropy(CE):

For Multi-class problems is defined as:

with $\Psi$ indicating the number of classes.

For binary classification, it is defined as:

Kullback-Leibler(KL):

KL is a measure of how one probability distribution is different from another. It calculates the loss of information that results in the attempt of the network to approach the expected values. Evidently, it also needs probabilities as inputs in order to function properly:
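A compact numpy sketch of these loss functions (my own illustration, not the book's code). `y` denotes targets and `p` predictions; for CE and KL both are assumed to be probability distributions, and the Huber threshold `delta` is a hypothetical value.

```python
import numpy as np

def mse(y, p):
    return np.mean((y - p) ** 2)

def mae(y, p):
    return np.mean(np.abs(y - p))

def huber(y, p, delta=1.0):
    err = y - p
    quad = 0.5 * err ** 2                       # MSE-like part for small errors
    lin = delta * (np.abs(err) - 0.5 * delta)   # MAE-like part for large errors
    return np.mean(np.where(np.abs(err) <= delta, quad, lin))

def cross_entropy(y, p, eps=1e-12):
    return -np.sum(y * np.log(p + eps))

def kl_divergence(y, p, eps=1e-12):
    return np.sum(y * np.log((y + eps) / (p + eps)))

y = np.array([0.0, 1.0, 0.0])        # one-hot target
p = np.array([0.1, 0.7, 0.2])        # predicted distribution
print(mse(y, p), mae(y, p), huber(y, p), cross_entropy(y, p), kl_divergence(y, p))
```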

2.4 Backpropagation

which calculates the gradient of the cost function with respect to the bias and the weights of a network. It iterates through the layers backwards – right after the data have been forward passed – calculating the gradients for each neuron of each layer.

Let $\epsilon^L$ be the gradients of the cost function $\mathcal{J}$ with respect to the output vector $o^L$ in the last layer $L$. This is called the error of $L$, because if it converges to 0 it means that the network's predictions are the same as the expected values, and it is given by:

where $\alpha'(\cdot)$ denotes the derivative of the activation function, which is applied element-wise on the input vector ($\odot$ refers to element-wise multiplication). In a similar way, the error in any of the previous layers can be calculated as:

where $(W^{\mathscr{l}+1})^T$ is the transpose of the weight matrix $W^{\mathscr{l}+1}$ for the $\mathscr{l}+1$ layer. Intuitively, it can be thought as moving the error backwards through the network. Weights indicate to what extent each neuron affects the outcome, so it can be seen as “distributing responsibility”, attributing most of it to the neurons that have the largest influence.

Using the above, the gradients with respect to the bias can be easily calculated in any layer, since it is the error itself:

Finally, the gradients with respect to the weights can be easily calculated as:


This section presents the computation of the weight gradients in backpropagation.

In essence, the core of backpropagation is simply updating the weights $W$.

Backpropagation: detailed explanation and computation process
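A minimal numpy sketch (my own illustration, not the book's code) of the backpropagation equations above for a two-layer network with sigmoid activations and MSE loss: compute the output error, propagate it backwards through the transposed weights, and read off the weight and bias gradients.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input vector
t = np.array([0.0, 1.0])        # target vector

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1 parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2 parameters

# forward pass
u1 = W1 @ x + b1; o1 = sigmoid(u1)
u2 = W2 @ o1 + b2; o2 = sigmoid(u2)

# backward pass (MSE loss J = 0.5 * ||o2 - t||^2)
eps2 = (o2 - t) * o2 * (1 - o2)          # error at the output layer
eps1 = (W2.T @ eps2) * o1 * (1 - o1)     # error propagated to the hidden layer

grad_W2, grad_b2 = np.outer(eps2, o1), eps2   # gradients w.r.t. layer 2 weights/bias
grad_W1, grad_b1 = np.outer(eps1, x), eps1    # gradients w.r.t. layer 1 weights/bias
print(grad_W1.shape, grad_W2.shape)
```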

2.5 Optimizers and training

In this section, for simplicity, the notation $\epsilon$ represents a vector containing all the gradients of a network, which have been calculated using the backpropagation algorithm. The most basic optimizers are described below ($\theta^M$ denotes the parameters):

  • Gradient Descent(GD)

    $\eta$ is called learning rate

  • Stochastic Gradient Descent(SGD)

    $\beta$​ is the momentum of the optimizer

  • Adaptive Gradient(AdaGrad)

    using the same update rule with SGD

    It is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training -> update learning rate

  • Root Mean Square Propagation(RMSProp)

    Instead of cumulative sum, it uses Exponential Moving Average (EMA), replacing Eq. (30) with:

    where $\rho$ determines the importance of the current errors in relation to the previous ones.

  • Adaptive Moment Estimation(Adam)

    It essentially combines RMSProp with momentum(SGD).

    $m^{(t)}$ is the result of $(\theta^M)' = \theta^M - \bar{\epsilon}^{(t)}$ (Eq. (28)),

    $u^{(t)}$ is the result of $\gamma_i^{(t)} = (1-\rho)\gamma_i^{(t-1)} + \rho\epsilon_i^2$ (Eq. (31))

    Adam uses the update rule of Eq.(32):

    (figure: Adam update rule)

DNN Training Algorithm:

(figure: DNN training algorithm)


This section is where backpropagation is actually combined with an optimizer to update the weights $W$ and biases $b$.

It first introduces five optimizers: GD, SGD, AdaGrad, RMSProp, and Adam (Adam is the most widely used in practice).

GD updates the parameters directly with the errors computed in the previous section. SGD adds momentum (inertia) on top of GD, partially keeping the direction of the previous step. AdaGrad is otherwise the same as SGD, but adapts the learning rate according to the accumulated errors. RMSProp is otherwise the same as AdaGrad, but changes how the learning rate is updated, adding a factor that weighs how strongly the current errors influence the new learning rate. Adam combines SGD (with momentum) and RMSProp.

It then describes the DNN training procedure: initialize the weight matrices with random values and the biases with zeros; forward propagate according to the weights (computing each layer's output); compute each layer's error; and finally update the weight matrices $W$ and bias vectors $b$ according to the errors.
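A small numpy sketch (my own illustration, assuming the standard textbook forms rather than the book's exact equations) of the SGD-with-momentum and Adam update rules for a single parameter vector; `lr`, `beta`, `beta1`, `beta2`, and `eps` are hypothetical hyperparameter names.

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    """SGD with momentum: keep part of the previous update direction."""
    velocity = beta * velocity + lr * grad
    return theta - velocity, velocity

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum on the gradient plus an RMSProp-style scaling of the step."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (EMA of squared grads)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.zeros(3)
vel, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
grad = np.array([0.2, -0.1, 0.05])              # gradient from backpropagation
theta, vel = sgd_momentum_step(theta, grad, vel)
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```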

2.6 Overfitting

A well-known problem with NNs is their tendency to overfit, that is, to memorize training data, while failing to generalize their knowledge to unseen data.

The true ability of a model lies in its potential to correctly predict the labels of unseen samples, which is called generalization in ML.

In addition to carefully selecting hyperparameters, there are many solutions to overfitting, some of which are presented in this section.

2.6.1 Early stopping

the data are randomly divided in two subsets, one for training and one for validation. Observing the loss of a network after each epoch for both sets, the training process is interrupted as soon as it stops improving or gets worse.

2.6.2 Regularization

Low bias and high variance could cause overfitting. In contrast, high bias and low variance could cause underfitting, meaning that the model is not able to obtain a sufficiently low error in the training data.

One way to achieve this is by using regularization techniques. The main idea is to add an extra regularization term, which includes the network’s parameters, in the cost function. At the same time, a parameter $\lambda$ is used, which defines the importance of the regularization term.

$l_1$ regularization:

$l_2$ regularization:

2.6.3 Dropout

each neuron of the network is “turned off” on every training step by setting its output to 0, with a probability p usually set to 0.5.

This prevents hidden units from co-adapting to others, making them more generally useful and it is achieved with a very small change in the NN’s algorithm.

2.6.4 Batch normalization

Each unit’s input is subtracted with the mean and divided by the standard deviation of all the unit’s inputs $o^{\mathscr{l}−1}$ for a mini-batch of size $\mathcal{B}$. Backpropagation takes it into account learning two extra parameters $\gamma$ and $\zeta$ , for each unit, which scale and shift the resulting normalized values:

where

with every normalized unit’s input given by:

where

There are more normalization techniques, such as Layer Normalization, Instance Normalization, Group Normalization and Deep Adaptive Input Normalization.


This section first introduces the concept of overfitting (the error on the validation set starts to grow).

It then presents four ways to mitigate overfitting:

  1. Early stopping: stop training as soon as the error on the validation set starts to increase.
  2. Adding a regularization term: add a penalty term to the original cost (loss) function. Besides the $l_1$ and $l_2$ regularizers mentioned above, there is also Elastic Net regularization, which combines the $l_1$ and $l_2$ terms through a weighting factor $\rho$.
  3. Dropout: on each training step, randomly set the output of some neurons to 0 (or to $-\infty$ if it passes through a softmax), which effectively removes those neurons for that step. This prevents neurons across layers from co-adapting.
  4. Batch normalization: its main effect is actually to speed up training; whether it prevents overfitting is debatable.
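A small numpy sketch of a batch-normalization forward pass as described in Section 2.6.4 (my own illustration; `gamma` and `zeta` stand for the learned scale and shift parameters, and `eps` is a hypothetical small constant for numerical stability).

```python
import numpy as np

def batch_norm_forward(o_prev, gamma, zeta, eps=1e-5):
    """Normalize each unit's input over the mini-batch, then scale and shift."""
    mean = o_prev.mean(axis=0)                  # per-unit mean over the batch
    var = o_prev.var(axis=0)                    # per-unit variance over the batch
    o_hat = (o_prev - mean) / np.sqrt(var + eps)
    return gamma * o_hat + zeta

batch = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(8, 4))  # B=8, 4 units
gamma = np.ones(4)   # learned scale
zeta = np.zeros(4)   # learned shift
out = batch_norm_forward(batch, gamma, zeta)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 mean, ~1 std per unit
```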

3 Convolutional neural networks

3.1 Introduction

The tasks currently performed by CNNs are numerous. In the simplest regression tasks, the CNN should predict a single number, for example, given a picture of a house, predict its price. In multiclass classification tasks, the goal is to predict the correct class for a given input, for example, given a picture of a bird, predict its species. In object detection, the CNN should find objects in the image along with their positions, typically given as bounding boxes around the objects. In semantic segmentation, the task is to label each image pixel with its class (e.g., sky, building). The prediction may also be a new image, for example, given a facial image, predict how this person will look 30 years later. The task at hand affects many design choices related to CNNs. To focus on the most basic elements, this chapter assumes either a regression or classification task, while some of the other tasks are discussed in later chapters of this book.

3.2 Structure of convolutional neural networks

3.2.1 Convolutional layers

This section mainly covers the following concepts:

  • receptive field

  • padding, zero padding

  • strided

  • Dilated convolution

  • Atrous convolution

  • horizontal Sobel kernel

  • average kernel

  • weight and bias sharing

  • grouped convolution

    Grouped convolutions were applied in the original AlexNet to parallelize the convolution operations across two separate GPUs. The main idea is to split filters so that they only operate on a subset of input channels. Later, it was proposed to shuffle the channels between layers of grouped convolutions.

  • transposed convolution

    Transposed convolutions are used when it is necessary to upsample the feature maps. This is common in tasks such as semantic segmentation or creation of superresolution images, but not in regression or classification.

  • separable convolution

    Separable convolutions split the standard convolutional filters into smaller filters so that the resulting operation approximates the original operation, but is computationally more efficient.
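A minimal numpy sketch (my own illustration, not the book's code) of a plain 2D convolution (implemented as cross-correlation, as usual in CNNs) with zero padding and stride, using the horizontal Sobel kernel mentioned above as the example filter.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2D convolution (cross-correlation) with zero padding."""
    if padding > 0:
        image = np.pad(image, padding)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # receptive field times filter weights
    return out

sobel_h = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)   # horizontal Sobel kernel
image = np.arange(36, dtype=float).reshape(6, 6)
print(conv2d(image, sobel_h, stride=1, padding=1).shape)  # (6, 6) with "same" padding
```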

3.2.2 Activation functions

(figures: common activation functions and their formulas)

This section distinguishes between the common activation functions and covers their formulas and roles. Besides the activation functions mentioned above, there is also GELU (Gaussian Error Linear Unit).

3.2.3 Pooling layers

Pooling layers reduce the spatial size of the feature maps extracted by convolutional layers. This saves computation costs and allows the following convolutional layer to extract features at a different scale.

The two main approaches to pooling are max pooling and average pooling:

(figure: max pooling and average pooling)

global average pooling

3.2.4 Fully connected and output layers

Fully connected layers consist of standard neurons and they connect each input element to each output element with a separate weight.

An output layer is a fully connected layer, whose output is the network’s prediction for the given input. The output format depends on the task at hand. In simple regression tasks, the output is just a single number. In such cases, the output layer has a single neuron that forms the weighted sum of all the features and a linear activation function is typically used.

3.3 Training convolutional neural networks

3.3.1 Backpropagation formulas on CNNs

This section again derives the backpropagation update formulas used to train convolutional neural networks. See also: Backpropagation: detailed explanation and computation process.

3.3.1.1 Backpropagation on convolutional layers

Because each convolutional filter affects only a single output feature map, the error is backpropagated only via the corresponding local error map in $\Delta^l$. Here, $\Delta_p^l[m, n] = \frac{\partial E}{\partial Y_p^l[m,n]} \;\forall m \in 1,\cdots, M$ and $n \in 1,\cdots, N$, where $M \times N$ is the output feature map size and $p$ refers to the $p$th convolutional filter. **The main things needed to apply backpropagation on convolutional layers are solving the previous layer's local error $\Delta^{l-1}$ and the error gradients with respect to the kernel weights, $\nabla_{K_p}E$, when $\Delta_j^l$ is known.**

where $K[i,j,c]$ is the kernel weight at position $(i,j)$ for input channel $c$, and $Y[m,n]$ is the output feature map of size $M \times N$.

(equation figure)

Inserting this to Eq.(39), gives:

To solve $\Delta^{l-1}$, it can be seen from $Y[m,n] = \sum\limits_{i=1}^{K}\sum\limits_{j=1}^{K}\sum\limits_{c=1}^{C} K[i,j,c]\,X[m-1+i,\,n-1+j,\,c]$ that:

Changing the indices as $u = m-1+i$ (so $m = u+1-i$) and $v = n-1+j$ (so $n = v+1-j$):

This, in turn, can be used to express $\frac{\partial E}{\partial x[u,v,c]}$:

Finally, the local errors at layer $l-1$ can be obtained by inserting Eq. (44) into $\delta^{l-1} = \frac{\partial E}{\partial y^{l-1}} = \frac{\partial E}{\partial y^{l}}\frac{\partial y^{l}}{\partial y^{l-1}} = \delta^{l}\frac{\partial y^{l}}{\partial x^{l}} f'(y^{l-1}) = \frac{\partial E}{\partial x^{l}} f'(y^{l-1})$:


This subsection points out that the key quantity in backpropagation is the previous layer's error $\Delta^{l-1}$.

Applying the backpropagation formulas derived above to convolutional layers, it discusses in detail how to take partial derivatives with respect to the filters and gives the final error recurrence.

3.3.1.2 Backpropagation on pooling layers

  • Max Pooling:

    Therefore, during training, the error is backpropagated only via the maximum element, but it is not changed.

  • Average Pooling:

    As average pooling can be presented as a 2D convolution with an averaging kernel having all the weights equal to $1/K^2$, the error update can be obtained directly from Eq. (45):
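A small numpy sketch (my own illustration) of how the error is routed backwards through 2×2 max pooling and average pooling with stride 2: max pooling passes the error only to the maximum element of each window, while average pooling spreads it uniformly with weight $1/K^2$.

```python
import numpy as np

def max_pool_backward(x, grad_out, k=2):
    """Route each output gradient to the position of the window maximum."""
    grad_in = np.zeros_like(x)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            win = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            r, c = np.unravel_index(np.argmax(win), win.shape)
            grad_in[i * k + r, j * k + c] = grad_out[i, j]
    return grad_in

def avg_pool_backward(x, grad_out, k=2):
    """Spread each output gradient uniformly over its k x k window."""
    grad_in = np.zeros_like(x)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            grad_in[i * k:(i + 1) * k, j * k:(j + 1) * k] = grad_out[i, j] / (k * k)
    return grad_in

x = np.arange(16, dtype=float).reshape(4, 4)
grad_out = np.ones((2, 2))
print(max_pool_backward(x, grad_out))
print(avg_pool_backward(x, grad_out))
```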

3.3.2 Loss functions

This subsection points out that the loss function is very important in CNNs (since it is what is differentiated during backpropagation).

It also lists which loss functions are suitable for different tasks:

(table: recommended loss functions for different tasks)

3.3.3 Batch training and optimizers

3.3.3.1 Batch training

This subsection points out that so far everything assumed training on the whole dataset at once, which does not work well; the large dataset should therefore be split into smaller batches.

3.3.3.2 Optimizers

(table: unified formulation of the optimizers)

This subsection gives a unified formulation of the optimizers (as in the table above), compares them, and concludes that Adam is the best fit for CNNs.

3.3.4 Typical challenges in CNN training

  • Overfitting

  • Long training time

  • Vanishing and exploding gradients

  • Internal covariate shift

    In a CNN, the updates in the lower layers’ weights during training will change the input distribution of the subsequent layers. This phenomenon is called internal covariate shift

3.3.5 Solutions to CNN training challenges

3.3.5.1 Learning rate scheduling

Too small learning rate will lead to very slow learning or even inability to learn at all, while too large learning rate can lead to exploding or oscillating performance over the training epochs and to a lower final performance. It is typical to start training with a higher learning rate and then decrease it either after a certain number of iterations or gradually at every step. This allows the network to first take faster steps toward the optimal solution, and when converging, focus on finer details without oscillations.

The learning rate for larger batch sizes should generally be larger.
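A tiny sketch (my own illustration) of a step-decay learning-rate schedule matching the description above: start with a higher rate and decrease it after a fixed number of epochs; `base_lr`, `drop`, and `epochs_per_drop` are hypothetical values.

```python
def step_decay(epoch, base_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return base_lr * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 9, 10, 25, 40):
    print(epoch, step_decay(epoch))
```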

3.3.5.2 Data augmentation

One of the commonly used techniques to avoid overfitting is data augmentation, where the amount of training data and its variability is increased by applying different transformations, such as rotation, scaling, cropping, and adding noise, to the original images. Such augmentation is often performed manually, but also techniques for learning suitable augmentation policies automatically have been proposed [9].
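A minimal numpy sketch (my own illustration) of the manual augmentations mentioned above: horizontal flip, random crop, and additive noise applied to an image array; sizes and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop_size=28, noise_std=0.05):
    """Random horizontal flip, random crop, and Gaussian noise."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                       # horizontal flip
    h, w = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    image = image[top:top + crop_size, left:left + crop_size]   # random crop
    return image + rng.normal(0.0, noise_std, image.shape)      # additive noise

image = rng.random((32, 32))
print(augment(image).shape)   # (28, 28)
```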

3.3.5.3 Transfer learning

In transfer learning, some of the CNN layers are pretrained using a larger similar dataset, which can help the network learn better features that are suitable also for the new task and then the network is only finetuned by training shortly with the data set at hand. This can help the network avoid overfitting and improve the performance. It is common to use as a starting point a publicly available network model that has been pretrained with one of the large data sets, such as ImageNet.

This makes the training much faster and more economic as the most time-consuming part can be reused in multiple applications.

3.3.5.4 Weight regularization

L1/L2 regularization adds to the loss function a term that depends on L1/L2 norm of the weights.

3.3.5.5 Dropout

Dropout can be applied on any network layer except the output layer and the probability of dropping the units is typically set to 0.5 for hidden layers and to a larger value, such as 0.8, for the input layer.

Dropout can prevent network units from coadapting too much by making the presence of other units unreliable, which makes the network more robust.

3.3.5.6 Normalization

(figure: normalization techniques)

3.3.5.7 Skip connections

(figure: skip connections / residual block)

In practice, this is the idea behind ResNet.

4 Graph convolutional networks

Introductory article from Distill: https://distill.pub/2021/gnn-intro/

Introductory article in Chinese: https://luweikxy.gitbook.io/machine-learning-notes/graph-neural-networks/graph-convolutional-networks/gcn-comprehensive-understand

This chapter mainly requires understanding three tasks:

  • node classification
  • edge classification and edge prediction
  • whole-graph classification

Understanding the graph representation: the Laplacian matrix

The Fourier transform on graphs:

  • forward Fourier transform: from the spatial domain to the spectral domain
  • inverse Fourier transform: from the spectral domain to the spatial domain

Convolution on graphs

Graph attention networks

The two sampling schemes of graph convolutional networks

Commonly used datasets and libraries

4.1 Introduction

The Spatiotemporal Graph Convolutional Networks (ST-GCN) are able to process spatiotemporal graphs. They are multilayer neural networks, which capture both spatial structure and temporal dynamics of the underlying spatiotemporal graph by applying spatial and temporal graph convolutions at each layer.

Using graph structured data sets, three main learning tasks can be formulated, which are briefly described as follows:

  • Node classification: The goal of node classification methods is to predict a label for each graph node in a supervised or semisupervised setting. In these methods, the GCN model extracts high-level feature representation for each node in a supervised setting or propagates the label information from labeled nodes through the whole graph in a semisupervised setting. Using the extracted high-level node features, a label will be predicted for each node by a classification layer.
  • Graph classification: The methods targeting graph classification tasks use a pooling layer before the classification layer in order to produce a feature vector aggregating information of the whole graph, which is then introduced to the classification layer predicting a label. The pooling layer can be a simple global average pooling, which produces a mean vector of all the node features.
  • Edge prediction and classification: The goal of edge prediction and classification tasks is either to generate a graph by predicting the edges between nodes and building a new graph adjacency matrix, or to learn feature representations for the graph edges.

4.1.1 Graph definition

The graph is defined via its adjacency matrix, denoted $A$ in the text: entry $(i, j)$ is 1 if an edge exists between nodes $i$ and $j$, and 0 otherwise (also 0 when $i = j$).

The degree matrix $D$ is also defined.

4.2 Spectral graph convolutional network

Given the graph adjacency and the graph degree matrices, the graph Laplacian matrix is defined as $L = D − A$ so that:

The degree of each node $deg(ν_i)$ indicates the number of its neighbors, which are directly connected to it with an edge. $\varepsilon$ is a graph edge set. $L$ encodes the graph connectivity and determines the smoothness of the graph function.

Laplacian matrix

In a smooth graph function, its value does not vary a lot across connected nodes. This can be expressed formally by the Dirichlet energy of a graph function:

where $f$ denotes the graph function, which receives as input a node of the graph and provides as output a real value.

The smoothness of the graph function can also be expressed by the eigenvectors and eigenvalues of the normalized graph Laplacian matrix $L’$, which is defined as

and its elements are
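A small numpy sketch (my own illustration, assuming the usual symmetric normalization $L' = D^{-1/2} L D^{-1/2}$) that builds the Laplacian and the normalized Laplacian from an adjacency matrix; the example graph is made up.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # adjacency matrix of a small undirected graph

D = np.diag(A.sum(axis=1))                  # degree matrix: deg(v_i) on the diagonal
L = D - A                                   # graph Laplacian

d_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = d_inv_sqrt @ L @ d_inv_sqrt        # normalized Laplacian (assumed symmetric form)

eigvals = np.linalg.eigvalsh(L_norm)        # eigenvalues relate to graph smoothness
print(L)
print(np.round(eigvals, 3))
```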

5 Recurrent neural networks

  1. RNN basics: what RNNs are and how they are used to process sequential data.
  2. Vanilla RNN: the most basic RNN structure, its internals, and how it operates on sequential data.
  3. Long short-term memory (LSTM): how the LSTM works, including how its gating mechanism helps with vanishing and exploding gradients.
  4. Gated recurrent unit (GRU): the GRU structure, its similarities to the LSTM, and the differences between them.
  5. Other RNN variants: there are many RNN variants, each with its own application scenarios.
  6. Applications of RNNs: natural language processing, speech recognition, time-series forecasting, and other domains.
  7. Challenges in training RNNs: common problems such as vanishing and exploding gradients, and the techniques used to address them.
  8. Backpropagation through time (BPTT): how BPTT is used to train RNNs and how it propagates error gradients through sequential data.
  9. Optimization and regularization techniques: techniques for improving RNN performance, such as gradient clipping and regularization methods.

5.1 Introduction

This section first introduces the concept of time-series data and related processing methods, and then leads into recurrent neural networks.


Many types of data, such as natural language and sound tracks, contain additional information in the way they are ordered, forming a sequence of data points. This type of sequence is usually referred to as a time series.

Indeed, the most common way to process sequence data using deep learning is by employing Recurrent Neural Networks (RNNs). As the name suggests, RNNs have a recurrent aspect in their operation, which is the repeated processing and updating of their internal state. Using their internal representation, RNNs are capable of exhibiting dynamic temporal behavior and processing sequences of arbitrary length, while being able to model the long-term relationships between distant points in a sequence. Their internal state is a vector of values $h$ colloquially referred to as their hidden state. The hidden state is the RNN’s only way of detecting patterns across the data sequence, so all latent information extracted at every step is included in that hidden state. In Fig. 5.1 a visualization of the basic RNN functionality is shown.

(Fig. 5.1: basic RNN functionality)

5.2 Vanilla RNN

This section uses the vanilla RNN (an RNN model with a simple architecture) to introduce the important RNN concepts.

It first explains that at every iteration the recurrent network updates its state based on the current step's input and the previous step's hidden state, and that the initial hidden state is a null vector (all zeros). It then gives the two trainable weight matrices of the vanilla RNN, which act on the current input $x_t$ and the previous hidden state $h_{t-1}$ respectively; the two results are summed and passed through an activation function (usually tanh or sigmoid) to obtain the current hidden state $h_t$.

After the vanilla RNN update process, it describes the other uses of RNNs (predicting only a final result, predicting a result at every step, or at some of the steps). In particular, a prediction can be made at every step, i.e., every $h_t$ can be used.

It then introduces BPTT (Backpropagation Through Time).

Based on backpropagation, it then raises the vanishing and exploding gradient problems that RNNs may face and gives suggested remedies.

  • Exploding gradients: the long-term components of the RNN gradients become exponentially larger for longer sequences than for shorter ones.
  • Vanishing gradients: the long-term components of the RNN gradients become exponentially close to 0 for longer sequences than for shorter ones.

The first iteration of recurrent neural networks directly applied the logic of updating a hidden state based on the input of the current step and the hidden state of the previous step. The hidden state is usually initialized as a null vector, meaning all elements are zero. The main trainable parameters of a “vanilla” RNN are the two weight matrices $W_{hh} \in \mathbb{R}^{m_h\times m_h}$ and $W_{xh} \in \mathbb{R}^{m_x \times m_h}$, where $m_h$ is the number of hidden neurons of the RNN and $m_x$ is the number of features of the input, that is, the dimensionality of the time series measurements. These parameter matrices are used to transform the previous hidden activation and the input, respectively.

This passage gives the RNN update logic and the two trainable parameters of the vanilla RNN.

The input sequence matrix is denoted $X \in \mathbb{R}^{n\times m_x}$, where $n$ is the number of steps in the sequence. At each processing step of the sequence a nonlinear transformation is performed on the current input step $x_t \in \mathbb{R}^{m_x}$ using $W_{xh}$ and another one on $h_{t−1}$ using $W_{hh}$. The result of both transformations has the same size $m_h$, so the two results are combined using an elementwise summation as shown in Fig. 5.2. This can also be described with notation as

(Fig. 5.2: vanilla RNN update)

This passage explains the role of the two trainable parameters: $W_{xh}$ weighs the current input and $W_{hh}$ weighs the hidden state. The iteration is illustrated in the figure.
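A minimal numpy sketch (my own illustration) of the vanilla RNN recurrence $h_t = \tanh(W_{xh}^T x_t + W_{hh}^T h_{t-1})$ over a toy sequence; the bias term is omitted and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
m_x, m_h, n = 3, 5, 4                 # input features, hidden units, sequence length

W_xh = rng.normal(scale=0.1, size=(m_x, m_h))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(m_h, m_h))   # hidden-to-hidden weights

X = rng.normal(size=(n, m_x))         # input sequence, one row per time step
h = np.zeros(m_h)                     # initial hidden state: null vector

hidden_states = []
for t in range(n):
    h = np.tanh(X[t] @ W_xh + h @ W_hh)   # combine current input and previous state
    hidden_states.append(h)

print(np.array(hidden_states).shape)  # (n, m_h): one hidden state per step
```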

The potential uses for RNNs vary from simply parsing a sequence and predicting something after processing it completely (e.g., classifying music genres, determining a comment’s sentiment), to making a prediction for every step of the sequence processed (e.g., detecting anomalies throughout a time series, autoregressive prediction of a sequence), and generating a new sequence at the end of the current sequence (e.g., generating text after an initial prompt).

(figure: typical RNN usage patterns)

Similar to backpropagation, BPTT utilizes the chain rule to efficiently propagate the error gradient to the trainable weight to apply gradient descent. Using the unrolled RNN computation graph, the gradients are propagated through the computation steps of the hidden state as shown in Fig. 5.4. Formally, the gradient of the activation at step $n$ w.r.t the error (loss) function $\mathcal{L}$ is defined as

(Fig. 5.4: gradient propagation through the unrolled computation graph)

To utilize the error propagation and train the network by executing gradient descent, the gradients of the weight parameters $W_{hh}$ are needed. The accumulated gradients that the weights of RNN incur across all time steps (considering a total of $T$ time steps) are calculated as

where we consider that at every time step the RNN produces a prediction, which is used to calculate some loss $\mathcal{L}$. To accommodate the calculation of gradients across multiple time steps from the loss of a single time step, we use the chain rule to calculate the gradient at each step as the error propagates backwards as shown in Fig. 5.5. This propagation of errors to every previous step of the RNN’s computation graph is the reason that this type of network can manage to learn dependencies across multiple time steps and take advantage of the sequential nature of the input data.

(Fig. 5.5: backward propagation of the error across time steps)

The above gives the concept and formulas of BPTT, and the chain rule used to compute the gradient at each step as the error propagates backwards.

Long term dependency learning signals need to be propagated through multiple steps of the hidden activations, which may be difficult depending on the properties of the $W_{hh}$​ matrix.

A solution proposed for the exploding gradient problem is called “gradient clipping.” To avoid the rapid increase of the scale of gradient values, the norm of the gradient is used to clip the values to be within a threshold:

where $\hat{g}$ is the gradient $\frac{\partial \mathcal{L}}{\partial h_t}$.
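A tiny sketch (my own illustration) of gradient clipping by norm: if the gradient norm exceeds a threshold, rescale the gradient to lie on that threshold.

```python
import numpy as np

def clip_by_norm(grad, threshold=5.0):
    """Rescale the gradient if its norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

print(clip_by_norm(np.array([30.0, 40.0])))   # norm 50 -> rescaled to norm 5
```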

For the vanishing gradient case, a regularizer is suggested that preserves the error norm as it travels backwards through the sequence.

This passage addresses the problem of very long sequences and, by a limit-style argument, relates it to the value of $\|W_{hh}\|_2$: when it lies in $[0, 1]$ the gradients may vanish, and when it lies in $[1, \infty)$ they may explode.

It then gives the remedies above.

5.3 Long-short term memory

This section first explains that the LSTM model was proposed to solve the vanishing and exploding gradient problems, and introduces its three gate units:

  • forget $f_t$

    decides which cell elements are forgotten, via a vector recording what is kept

  • input $i_t$

    decides which part of the new information is added to the cell

  • output $o_t$

    decides which cell elements take part in the forward pass and form the network's output

It then presents the forward pass and the equations of the full LSTM model.

Finally, based on the LSTM gradients, it explains why the LSTM model can address the vanishing and exploding gradient problems.


Long-Short Term Memory (LSTM) was first proposed by [11] to solve the problem of vanishing and exploding gradients by proposing a different architectural approach to RNNs. This is achieved by protecting its hidden activation using gates between each of its transaction points with the rest of its layer. The hidden activation that is protected is called the cell state. The protection of the cell state is undertaken by the three LSTM gates, namely, the forget, input, and output gates.

The passage above states that the LSTM was proposed to address vanishing and exploding gradients, explains that the protected hidden activation is called the cell state, and names the three gates that protect it: the forget, input, and output gates.

During the forward pass, the first gate that acts upon the cell is the forget gate. It determines which of the cell’s activations are forgotten and by how much.

It achieves this by multiplying all the cell’s elements with a vector $f_t \in (0, 1)^{m_h}$. If the forget gate emits a value close to zero, the corresponding element in the cell state will be wiped and set to zero, whereas if it outputs a value close to one, then the cell will fully retain the value of that element.

The passage above says the first gate is the forget gate, whose role is to decide which cell elements are forgotten.

The second paragraph gives the mechanism: a vector is emitted whose values close to 1 mean the element is retained, and values close to 0 mean it is forgotten (set to zero).

The next gate to act upon the cell is the input gate, which determines what portion of new information will be added to the protected state. This happens in combination with the calculation of a new candidate cell state $c_t’$ .

Similar to the forget gate, the input gate $i_t \in (0, 1)^{m_h}$ is multiplied by the candidate state $c_t’$ and added to the cell state. This prevents unnecessary additions to the cell state.

The passage above says the second gate is the input gate, which decides which parts of the new information are added to the protected state; this happens together with the computation of a new candidate cell state.

The input gate is likewise implemented as a vector indicating which parts of the input are added.

Finally, there is the output gate, which is important for the backwards pass (backpropagation). It determines which parts of the cell state need to be propagated forward and be included in the output of the network.

The passage above says the third gate is the output gate, which matters for the backward pass (backpropagation); it decides which parts of the cell state are propagated forward and included in the output.

The following equations describe the behavior of the LSTM model:

where $f_t$, $i_t$, and $o_t$ are the activations of the input, forget, and output gates at timestep $t$, which control how much of the input and the previous state will be considered, and how much of the cell state will be included in the hidden activation of the network. The protected cell activation at time step $t$ is denoted by $c_t$, whereas $h_t$ is the activation that will be given to the next layer. The notations of the weight matrices $W_{(\cdot\cdot)}$ are explicitly denoted with subscripts of their transform. $b_{(\cdot)}$ denotes the bias vector of each gate. $W_{xf}$ and $W_{hf}$ are the matrices that transform the input $x_t$ and hidden state $h_{t-1}$, respectively, to the forget gate dimension, $W_{xi}$ and $W_{hi}$ transform the input and hidden state to the input gate dimension, and so on. There is a simplified notation combining the input and hidden state into a single matrix, thus having a single matrix per gate $W_f, W_i, W_c, W_o$.

The forward flow of information across the described gates is shown in Fig. 5.6.

(Fig. 5.6: forward flow of information through the LSTM gates)

The passage above gives the actual computation of the LSTM model, including the values produced by the forget, input, and output gates at the current time step, and provides a diagram.

From the diagram:

  1. The previous hidden state $h_{t-1}$ and the current input $x_t$ enter the forget gate (Eq. 57), giving $f_t$.
  2. $h_{t-1}$ and $x_t$ enter the input gate (Eq. 58), giving $i_t$.
  3. $h_{t-1}$ and $x_t$ are used to compute the candidate state $c'_t$ via Eq. 59.
  4. The input gate selects the new information $i_t c'_t$, which is added to the retained part of the previous cell state $f_t c_{t-1}$ (Eq. 60), giving $c_t$.
  5. $h_{t-1}$ and $x_t$ pass through the output gate (Eq. 61), giving $o_t$.
  6. Finally, the output gating $o_t$ is multiplied with (the activation of) the current cell state $c_t$ (Eq. 62), giving the current hidden state $h_t$.
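A minimal numpy sketch (my own illustration, using the simplified single-matrix-per-gate notation with sigmoid/tanh activations; the book's exact equation forms may differ slightly) of one LSTM step following the six steps listed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([x_t, h_prev])          # combined input and previous hidden state
    f_t = sigmoid(W_f @ z + b_f)               # forget gate
    i_t = sigmoid(W_i @ z + b_i)               # input gate
    c_cand = np.tanh(W_c @ z + b_c)            # candidate cell state
    c_t = f_t * c_prev + i_t * c_cand          # additive cell state update
    o_t = sigmoid(W_o @ z + b_o)               # output gate
    h_t = o_t * np.tanh(c_t)                   # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
m_x, m_h = 3, 4
W = {k: rng.normal(scale=0.1, size=(m_h, m_x + m_h)) for k in "fico"}
b = {k: np.zeros(m_h) for k in "fico"}
h, c = np.zeros(m_h), np.zeros(m_h)
h, c = lstm_step(rng.normal(size=m_x), h, c,
                 W["f"], W["i"], W["c"], W["o"], b["f"], b["i"], b["c"], b["o"])
print(h.shape, c.shape)
```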

Although not obvious, the additive nature of the gates to modify the cell’s state has a great impact on the effects of vanishing and exploding gradients. To understand why this happens, we must expand the gradient calculations of the LSTM’s cell state.

The LSTM’s gradient has four terms that are added together and some of them are not multiplied directly with a weight matrix. This allows for each term to have a different gradient factors at each step, either increasing or decreasing the gradient step norm, thus avoiding exponentially increasing or decreasing the gradient values across steps.

The passage above finally returns to explaining why the LSTM model can mitigate the vanishing and exploding gradient problems.

Because the LSTM gradient is a sum of four terms, some of which are not multiplied directly with a weight matrix, each term can contribute a different gradient factor at each step, either increasing or decreasing the gradient norm, thereby avoiding exponentially growing or shrinking gradient values across steps.

5.4 Gated recurrent unit

The Gated Recurrent Unit (GRU) proposed in [13] is similar to the LSTM but has a much simpler gating mechanism. It utilizes two gates, namely, the reset and update gates. No protected cell state is used, hence the gates are applied directly to the hidden state. This difference makes the GRU more computationally efficient than the LSTM.

The passage above notes that the GRU has only two gates (reset and update) and no protected cell state, applying the gates directly to the hidden state; it is therefore more computationally efficient than the LSTM.

6 Deep reinforcement learning

See also this Zhihu article: https://zhuanlan.zhihu.com/p/466455380

6.1 Introduction

This section introduces the concept of reinforcement learning; explains the differences between RL and supervised/unsupervised learning; introduces the basic components of RL; explains the similarities, differences, and drawbacks of value-based and policy-based methods; and finally notes that modern developments (the notable successes of supervised and unsupervised learning, the TensorFlow and PyTorch frameworks, and progress in GPUs and ASICs) have allowed RL to improve further and tackle tasks that standard machine learning cannot solve.


Indeed, the most succinct and simple explanation of learning is the maximization of a reward signal through trial-and-error. This simple concept that exists throughout nature as a mechanism that helps humans and animals to survive is systematically explored in the field of Reinforcement Learning (RL). The interaction with the environment and the feedback received, shapes our behavior depending on the reward signal we receive. When a positive reward is received, such as finding food and enjoying eating, the behavior (or policy) leading up to that feedback is reinforced. Similarly, when a negative feedback is experienced, such as falling and hurting ourselves, the behavior that leads to it is diminished or avoided in the future.

This paragraph introduces the core idea of reinforcement learning: interacting with the environment and the feedback received, adjusting the behavior (policy) to maximize reward through trial and error; behavior is reinforced after positive feedback and diminished or avoided after negative feedback.

Reinforcement learning is comprised of the following basic components: the policy, reward signal, and value function. Some methods also include a model of the observable environment and are called model-based approaches. These methods allow RL agents to improve its policy without interacting with a real environment, thus enabling a form of planning based on the understanding of the environment. When an environment can be simulated adequately without concern about the number of interaction from which an agent can learn from, then a model-free approach can be used.

(figure: the RL agent-environment interaction loop)

This paragraph introduces the basic components of reinforcement learning: the policy, the reward signal, and the value function. It also mentions RL agents, and discusses model-based versus model-free methods and when to use each.

Both value- and policy-based methods assign a value to the state of the environment, but value-based methods aim to select the actions that will lead to higher-value states, while policy-based methods directly learn a policy, utilizing state values in an indirect way. Value-based methods have been shown to be less sample-efficient and slower to train, while policy-based methods can be trapped in local minima because the policy itself is used to sample the environment.

This paragraph presents the similarities and differences between value-based and policy-based methods:

  • Similarity: both assign a value to the state of the environment
  • Differences:

    • Value-based: aims to select the actions that lead to higher-value states
    • Policy-based: directly learns a policy, using state values only indirectly

    The drawbacks of both are also discussed:

  • Value-based: less sample-efficient and slower to train

  • Policy-based: can get trapped in local minima

6.2 Value-based methods

This section mainly introduces value-based reinforcement learning. See also this Zhihu article: https://zhuanlan.zhihu.com/p/73083240

It first introduces the notion of state in RL and gives the return $G_t$, the return at time (step) $t$.

It then gives the optimal value function (also called the cost function or Bellman expectation equation) $V_\pi^*(s)$, the expected future return when starting from state $s_t$ and following policy $\pi(a|s)$.

On top of the Bellman expectation equation it then presents the Bellman optimality equation for $V(s_t)$, and afterwards introduces a learning rate.

Finally, it gives the notion of the temporal-difference error used in temporal-difference learning.


When mentioning the term value in a RL setting, we usually refer to the state value. That is an arbitrary value that can be assigned to a specific state we may find ourselves in. This value is usually based or tries to estimate the long term discounted return $G_t$ of a specific state which is defined as

where $r_t$ is the reward received at time step $t$, $γ$ is the exponential decaying factor of future rewards.

This passage defines the state value in RL. From the definition, rewards further in the future are discounted more heavily, while rewards closer to the current time contribute more.

Having an accurate estimate of the expected returns would allow a policy to select the optimal action for any given state $s_t$. The optimal value function is defined as

where $V^∗_\pi (s)$ is the state value function for policy $\pi$, and $s$ denotes the state.

The passage above gives the definition of the optimal value function in RL (also called the cost function or Bellman Expectation Equation), where $\mathbb{E}[\cdot]$ denotes the expected value.

Expected value:

Let $X$ be a random variable with a finite number of outcomes $x_1, x_2, \cdots, x_k$ occurring with probabilities $p_1, p_2, \cdots, p_k$, respectively.

The expected value of $X$, $\mathbb{E}[X]$, is defined as $\mathbb{E}[X] = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k = \sum\limits_{i=1}^k x_i p_i$

From the formula, the optimal value function is the expected future return obtained by starting from state $s_t$ and executing policy $\pi(a|s)$.

  • State-value function: measures how good the current state is under a given policy.
  • Action-value function: measures how good taking an action is, given a state and a policy.

This value estimation of each state is needed in RL because there is no explicit label on each state that informs us whether we are in a good position or not. To find the optimal value of each state the Bellman optimality equation can be used:

The passage above builds on the Bellman expectation equation and introduces the Bellman optimality equation.

Since our estimates $V(s)$ at the beginning of training are not accurate, we introduce a learning rate or “step” denoted as $\alpha \in [0, 1]$ to gradually improve the estimates

The passage above notes that, because the estimates $V(s)$ are inaccurate at the start of training, a learning rate is introduced to improve the estimates gradually.

This is part of the Time Difference (TD) learning methods, where in this case $r_{t+1} + \gamma V(s_{t+1})$ is an unbiased estimate of $V(s)$. As a shorthand, we define

as the temporal difference error $\delta_t$. By utilizing the value of the estimate of the next step to train the current step, we are bootstraping the value estimator. This allows us to start training an estimator before knowing exactly how an episode concludes.

The passage above introduces the temporal-difference error used in temporal-difference learning.
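A small sketch (my own illustration) of the tabular TD(0) value update $V(s_t) \leftarrow V(s_t) + \alpha\,[\,r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\,]$ described above; the states, reward, and hyperparameters are made up.

```python
# Tabular TD(0) update: bootstrap V(s_t) from the estimate of the next state.
alpha, gamma = 0.1, 0.9     # learning rate and discount factor
V = {"s0": 0.0, "s1": 0.0}  # value estimates for two hypothetical states

# one observed transition: s0 --(reward 1.0)--> s1
s, r, s_next = "s0", 1.0, "s1"
td_error = r + gamma * V[s_next] - V[s]   # temporal-difference error delta_t
V[s] += alpha * td_error
print(V)
```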

6.2.1 Q-learning

It first presents the core idea of Q-learning: maximizing the Q-value.

It gives the action-value function, analogous to the state-value function (Eq. 65):

$Q(s_t, a_t) = \mathbb{E}[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})]$

and a training rule similar to the one for the state-value function (Eq. 66):

It then introduces the Q-table, the problems it has, and how to address them.


In its most basic form, Q-learning employs a “Q-table”, where each row of the table represents a state and each column represents an action. As state-action tuples are sampled from an environment, the corresponding cells of the matrix are updated using Eq. (68). The structure of the Q-table is shown in Table 6.1.

(Table 6.1: structure of the Q-table)

This is obviously very costly if the distinct action-state space is large and quickly becomes infeasible to manage. Constructing a table is also only possible when actions are discrete. To remedy this problem, we can replace the Q-table with a parametrizable approximator that can be trained using an update rule similar to Eq. (68).

The passage above first introduces the Q-table: a table whose rows are states, whose columns are actions, and whose entries are Q-values.

It then raises the problem that a large state-action space quickly becomes infeasible to manage, and proposes as a remedy a parameterizable approximator (introduced in the next subsection).
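A small sketch (my own illustration) of the tabular Q-learning update with the Q-table stored as a 2D array; states, actions, rewards, and hyperparameters are all made up.

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))   # Q-table: rows are states, columns are actions
alpha, gamma = 0.1, 0.9

def q_update(s, a, r, s_next):
    """Q-learning update toward the greedy (max) value of the next state."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2)   # one sampled state-action transition
print(Q)
```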

6.2.2 Deep Q-learning

This subsection first explains that a DNN, with a simple modification, can serve as a Q-value approximator replacing the Q-table, observing the current state and the possible actions.

It then introduces the DQN (Deep Q-Network) and the loss function used to train it.

It then points out that DQN has problems and presents solutions to alleviate them.

Finally, it mentions that the memory used during training can be modeled as a priority queue, prioritizing transitions according to a measure of their importance (e.g., the temporal-difference error).

6.3 Policy-based methods

This section first describes how a policy is represented, and the advantages and disadvantages of policy-based versus value-based methods (policy-based methods converge faster).


A policy can be described as a mapping from states to actions or to action probability distributions, which depends on whether the policy is deterministic or stochastic. In general, to ensure exploration, the policy function is always stochastic, emitting a probability distribution denoted as $\pi(a|s)$. A policy usually outputs a vector of preferences over actions and is passed through a softmax activation function to bound it to a probability distribution:

where $h$ is the preference function, which, in the case of a neural network, can be the raw activations of the output neurons before the application of the softmax function. Since policy $\pi$ is a trainable approximator with tunable parameters $\theta$, it can also be written as $\pi_\theta$.

The passage above explains how a policy is represented: as a mapping from states to actions (deterministic policy) or to action probability distributions (stochastic policy).

In general, to ensure exploration, the policy function is stochastic, and $\pi(a|s)$ denotes the probability of choosing action $a$ in state $s$. For trainability, a preference function $h(s,a)$ is used, and since the output must be a probability distribution, a softmax is applied to $h(s,a)$.

Avoiding building a policy indirectly from an action-state value approximator and instead building it directly from the observation of the state can yield better results in many situations.

6.3.1 Policy gradient


To learn a policy when using a neural network, an objective function must be defined and gradient ascent to be employed to improve said policy. We denote this objective function as $J(\theta)$ and define it as

where $p(s', r|s, a)$ is the probability of transitioning from state $s$ with action $a$ to state $s'$ and receiving reward $r$.

The passage above notes that when a neural network is used, an objective function must be defined and gradient ascent employed to improve the policy.

It also gives the objective function $J(\theta)$.

The term $\mu$ is the stationary distribution given policy $\pi$, defined as

This is true under the assumption that the stationary distribution does not change as long as the actions are selected under said policy $\pi_\theta$ with parameters $\theta$:

The passage above explains the meaning and formula of $\mu(s)$ in the objective function,

as well as the computation of $\mu(s')$.

One simple example of the gradient ascent path we need to follow come from the REINFORCE Monte Carlo policy gradient on-policy algorithm based on episodic samples:

where $G_t$ denotes the return defined in Eq. 63. The return of the whole episode is needed so sampled trajectories must be completed before being able to perform the gradient computation, which is one reason it is considered a Monte Carlo method.

The passage above gives the gradient formula of the REINFORCE Monte Carlo on-policy policy-gradient algorithm, and notes that sampled trajectories must be completed before the gradient computation can be performed.

To account for reward schedules that might be consistently higher or lower than 0 (among other configurations) a simple kind of “normalization” can be performed on the factor $G_t$ of our gradient step. By subtracting an appropriate baseline $b(s)$, we can ensure that better rewards contribute more in the gradient step compared to smaller positive rewards. The only requirement for this variable is that it must be independent of action $a$. The resulting update function of the parameters with this baseline is

where $\alpha$ is the step size parameter (learning rate). A simple baseline that can be used is a state value approximator $\hat{v}(s_t, w)$ trained to predict $G_t$.

The passage above gives the parameter update rule.
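A small sketch (my own illustration) of the two computations described above: turning one episode's rewards into discounted returns $G_t$, and subtracting a baseline before the returns scale each policy-gradient step; the policy and the gradient of its log-probability are left abstract.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * G_{t+1}, computed backwards over one episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]        # rewards of one completed episode
G = discounted_returns(rewards)
baseline = G.mean()                        # simple baseline b(s) (made-up choice)
advantages = G - baseline                  # factors that scale each gradient step
print(G.round(3), advantages.round(3))
```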

6.3.2 Actor-critic methods

This subsection first presents the drawbacks of Monte Carlo methods such as REINFORCE (high variance, and slower convergence than bootstrapping methods that train the policy directly with state values).

It then introduces actor-critic methods, where the actor is the policy and the critic is a state-value approximator.

It then gives the update rule for the policy and the ways of determining how strongly each step counts during training.


By utilizing a learned state value $V_\pi$ based on their policy the update rule for policy $\pi$ becomes

where $\delta_t$ is the temporal difference residual or error defined in Eq. (67).

The most recently used instantiation though is the advantage function:

where $A_\pi$ denotes the advantage under policy $\pi$ of the transition from state $s_t$ with action $a_t$.

The passage above presents two representative quantities used to determine how strongly each step is weighted during training.

6.3.3 Deep policy gradient-based methods

This subsection first points out the great promise of deep policy-gradient-based methods, and notes that the advantage of actor-critic is that both the state-value approximator and the policy are implemented as deep neural networks.

6.3.3.1 Actor-critic

This was already covered in Section 6.3.2 (Actor-critic methods) above; here it is only summarized again:

The actor network processes a given state and produces a distribution of probabilities over actions, usually with the same number of actions as neurons with the softmax activations applied on the network’s output. The critic network is similarly a neural network that predicts the value of the state usually having a linear activation at the last layer to allow flexibility in the ranges it can to the state value.

6.3.3.2 Trust region policy optimization

这一节主要讲置信域策略优化算法-TRPO,可以参考知乎文章:https://zhuanlan.zhihu.com/p/384334291

原论文地址:https://arxiv.org/abs/1502.05477

7 Lightweight deep learning

This chapter requires an in-depth understanding of the various methods for improving the performance of lightweight deep networks in multiple tasks, mainly classification, regression, and object detection and tracking.

7.1 Introduction

It first acknowledges the achievements of deep convolutional neural networks: models with parameter-heavy architectures can already solve many practical tasks, thanks to the collection of large labeled datasets and the development of GPUs.

However, these models are large and power-hungry, which prevents their use on mobile robots and embedded systems. At the same time, applications involving visual analysis are becoming increasingly popular, raising the demand for deployment on embedded devices.

Current solutions fall into two categories:

  • optimization during the training phase: train small models
  • optimization after the training phase: compress large models after they have converged

Algorithmic approaches such as depthwise separable convolutions can also be used to reduce the computational burden. MobileNets are introduced (which use both of the above categories and were designed specifically for feature extraction), as is SqueezeNet (which modifies standard convolutional layers into fire modules and uses smaller and fewer filters in the conventional layers, achieving AlexNet-level accuracy on the ImageNet classification dataset with only 1/50 of the parameters). Other small networks are also mentioned, including flattened and factorized networks, both of which use factorized convolutions.

Concrete examples of the two approaches follow: one is optimization after training (knowledge distillation); the other is quantization methods, i.e., converting the network's parameters to lower-precision bit widths (typically single-precision floating point (FP32), half-precision (FP16), and low-precision eight-bit integer (INT8)). The typical drawbacks of quantization are a loss of inference accuracy, dependence on software libraries, and the fact that the throughput of quantized tensors depends heavily on the underlying hardware.

Section 7.4 introduces a learnable bag-of-features module, which improves the performance of deep networks and facilitates the use of lighter architectures in tasks such as regression and visual object tracking. Finally, Section 7.5 presents early-exit architectures for deep networks, which provide adaptive inference capabilities and can be adjusted at run time for specific scenarios.

7.2 Lightweight convolutional neural network architectures

This section mainly introduces lightweight architectures and how to use them effectively for classification and detection tasks in robotics applications, even running in real time on embedded devices.

7.2.1 Lightweight CNNs for classification

It first presents the requirements of real-time deployment:

Real-time deployment with high-resolution input is critical for robotics applications, for example person detection in images captured by drones. In such cases the objects are very small, and state-of-the-art object detectors, in order to reach a speed that satisfies the constraints of onboard deployment, resize the input images, shrinking the objects of interest even further, sometimes to the point where detection becomes impossible.

To achieve real-time deployment, the resolution of the input frames must be reduced to 224 × 224 at the cost of accuracy. 3×3 convolutions can also be replaced with 1×1 convolutions.

One way of meeting these requirements while using a lightweight CNN is to produce a heatmap that provides a control signal indicating where the robot should look in order to obtain a better view of the object under consideration.

Then, using a binary classification problem as an example, two architectures consisting of only five convolutional layers are proposed by discarding the deepest layers of the VGG-16 model and pruning filters: the first can run in real time on 720p images ($1280\times720$) on an NVIDIA Jetson TX2, abbreviated VGG-720p; the second can run in real time on 1080p images ($1920\times1080$), abbreviated VGG-1080p. See the original text for the exact architectures.

7.2.2 Lightweight object detection

7.2.2.1 Real-time generic object detection on embedded devices

Using a batch size larger than 1 is computationally more efficient, but in real-time applications images must be processed sequentially, one by one, as they are captured.

Larger input sizes produce larger heatmaps and denser object detections, but impose heavy memory and computational constraints; smaller input sizes are processed faster but lead to coarser and less accurate predictions.

For object detection, single-stage detectors such as SSD and YOLO are clearly faster than region-based two-stage detectors (although region-based detectors such as Faster R-CNN are more accurate).

It then introduces the SSD, YOLO, and Tiny YOLO (half of YOLO's convolutional layers, trading accuracy for speed) models; YOLO outperforms SSD.

7.2.2.2 Real-time face detection

Two approaches are first presented:

  • a lightweight method for face detection is proposed, trained on input images of size 32 × 32, using a small neural network with only a little over 76 thousand parameters.
  • a fully convolutional neural network with seven layers is trained for the binary face/nonface classification task on 32 × 32 image patches, and deployed on full size images during the testing phase, producing heatmaps of face existence probabilities. The architecture is summarized in Table 7.3.(见原文)

It then notes that lightweight architectures require careful tuning of the training phase, because they are more prone to false positive detections than heavier ones. The proposed remedy: One way to address this issue is via progressively adapting the training set, which consists of positive (face) and negative (nonface) samples, to facilitate the training process, similar to curriculum learning methods.

More concretely, this approach works as follows:

  1. In the early stages of training, easier positive examples are preferred, guiding the network to learn a general representation of the corresponding class;
  2. the balance between positive and negative examples must also be considered;
  3. once the network has converged and learned to recognize the easy samples correctly, a progressive positive-example mining scheme is used to augment the training set with slightly harder samples.

Regarding the sample difficulty mentioned above, difficulty is assessed from the probability the network outputs: samples with high probability are considered easy, and samples with low probability are considered hard.

For balancing positive and negative examples: negative samples are collected together with the positive ones, but sorted by difficulty, and harder negative samples are progressively added to the training set to guide the training process.

7.3 Regularization of lightweight convolutional neural networks

This section introduces various regularization techniques and their importance in training lightweight architectures, which, due to their reduced number of parameters, may converge to local minima more easily than heavier architectures.

Lightweight models are usually less accurate than more complex ones. This can be a major drawback in applications where the speed gained from the lightweight nature of the model, at the cost of accuracy, matters as much as the accuracy itself, for example person detection for crowd avoidance. To control overfitting and improve the performance of lightweight deep learning models, several regularization methods presented earlier are reviewed: L1/L2 regularization, dropout, early stopping, and multitask learning.

Building on the idea of multitask learning, this section then proposes three regularization methods for improving the performance of lightweight deep learning models.

7.3.1 Graph embedded-based regularizer

Its core idea is to train the deep network with multiple loss functions.

In general, a Euclidean loss (sum of squares) can be added to each layer.

Suppose the original loss is the softmax loss, with the expression:

where $k$ denotes the $k$th class and $p_{i,k}$ denotes the softmax-predicted probability that sample $i$ belongs to class $k$.

Under the GE scheme, the Euclidean loss is expressed as:

The final loss function is then:

where $N_L$ is the number of layers and $\eta_i \in [0,1]$ is the weight of the Euclidean loss regularization of the $i$th layer.

The following subsections introduce the Discriminant Analysis regularization approach, the Minimum Enclosing Ball regularization, and the Locally Linear Embedding inspired regularization; each defines its objective using an appropriate target representation in the Euclidean loss.

7.3.1.1 Discriminant analysis regularization

Inspired by the linear discriminant analysis (LDA) method, which maximizes between-class separability while minimizing within-class variability by projecting the training samples of the different classes into a new, lower-dimensional space, the DA-regularized training approach includes, in addition to the softmax loss that preserves between-class separability, a Euclidean loss that pulls training samples of the same class closer to the class centroid.

The additional objective of this approach is to minimize the squared distance between $y_i^L$ and the mean representation of its class.

Then the additional goal is defined by the following optimization problem:

where $\mu_c^i = \frac{1}{K^i}\sum_k c_k$ denotes the mean representation of its class, and $c_k$ are the $K^i$ representations of images belonging to the same class as the $i$th image.

7.3.1.2 Minimum enclosing ball regularization

The objective is to find a projection that minimizes the variance between the training samples. The underlying idea is based on radius-margin based Support Vector Machines (SVM).

It has been shown that the generalization error of a maximum-margin SVM depends not only on the squared separating margin $\gamma^2$ but also on the radius-margin ratio $R^2/\gamma^2$. For a fixed feature space the radius $R$ is constant, so the dependence of the error bound on the radius can be ignored during optimization.

However, when $R$ is defined by the minimum enclosing ball of the training data, the model risks increasing the margin simply by expanding the minimum enclosing ball of the training data in the feature space. To address this, an algorithm was proposed, in the context of multiple kernel learning, that optimizes the error bound considering both the margin and the radius. Following this direction, since the softmax loss aims to discriminate the feature representations of training samples belonging to different classes, and since in the feature space produced by a neural layer the representations of the negative class in particular can be very spread out, the MEB regularizer aims to shrink the radius of the minimum enclosing ball of the training samples.

7.3.1.3 LLE-inspired regularization


where $\mu_{ck} = \frac{1}{|\mathcal{S}^{ck}|}\sum_{y_j^L \in \mathcal{S}^{ck}} y_j^L$, $|\mathcal{S}^{ck}|$ is the cardinality of set $\mathcal{S}^{ck}$, and $\mu_{ck}^i = \mu_{ck}$ if $y_i^L \in \mathcal{S}^{ck}$. $\mathcal{S}^{ck}$ is the set of all image representations $y_i^L$ that belong to the $k$th subclass of the $c$th class.

7.3.2 Class-specific discriminant regularizer

This regularization method aims to exploit the nature of binary classification problems in which one class describes a specific concept while the other describes anything but that concept (i.e., a genuine class versus an impostor class).

Specifically, while the classifier aims to discriminate samples belonging to different classes, it is proposed to strengthen the genuine-class discrimination by requiring the representations of the genuine class, in the feature space produced by a specific deep neural layer, to be closer to the class centroid.

The following objective is defined:

where $\mu_c = \frac{1}{|\mathcal{S}|}\sum_{y_j^L \in \mathcal{S}} y_j^L$.

The Euclidean loss (sum of squares) is used to implement the regularizer. The regularizer can be attached to one or multiple neural layers. Thus, for a deep neural model of $N_L$ layers, the total loss in the regularized training scheme is computed by summing the above losses:

where the parameter $\eta_i \in [0,1]$ controls the relative importance of the Euclidean loss, $L_e$, of each deep layer.

Eq. (84) can thus be modified:

and thus the proposed regularizer can also be implemented in terms of mini-batch training.

7.3.3 Mutual information regularizer

This regularizer is driven by the quadratic mutual information ($QMI$) criterion. Specifically, in addition to the classification loss, a regularization loss derived from the information potentials of $QMI$ is introduced. The additional objective is to maximize the $QMI$, expressed via information potentials, between the data representations and the corresponding class labels.

That is, the MI measures how much the uncertainty for the class label $c$ is reduced by observing the feature vector $y$. Let $p(c)$ be the probability of observing the class label $c$, and $p(y, c)$ the probability density function of the corresponding joint distribution.

The MI between the two random variables is defined as:

where $P(c) = \int_y p(y, c) dy$. MI can also be interpreted as a Kullback–Leibler divergence between the joint probability density $p(y, c)$ and the product of marginal probabilities $p(y)$ and $P(c)$.

The above explains the concept of mutual information: intuitively, mutual information is the overlap between the entropies of two random variables. It is symmetric, $I(X;Y) = I(Y;X)$, and $I(X;Y) = H(X) - H(X\mid Y)$.

QMI is derived by replacing the Kullback–Leibler divergence by the quadratic divergence measure. That is,

And thus, by expanding Eq. (88) one can arrive at the following equation:

The above describes the information potentials; the expression above can be rewritten as $QMI = V_{IN} + V_{ALL} - 2V_{BTW}$.

The pairwise interactions described above between the samples can be interpreted as follows:

  • $V_{IN}$ expresses the interactions between pairs of samples inside each class
  • $V_{ALL}$ expresses the interactions between all pairs of samples, regardless of the class membership
  • $V_{BTW}$ expresses the interactions between samples of each class against all other samples

The total loss for the network training is defined as:

where the parameter $\eta \in [0,1]$ controls the relative importance of $L_{MI}$.

where:

7.4 Bag-of-features for improved representation learning

they rely on the previous layers of the network to provide scale invariance, they lead to loss of valuable information regarding the distribution of the feature vectors, etc.

The earlier part of this section describes the benefits of pooling; here its drawbacks are raised, and bag-of-features is proposed as a way to address them.

  1. A feature extractor, such as SIFT extractor, is used to process an input image and extract feature vectors that describe various of its regions. Note that the number of feature vectors that are extracted depend on various parameters (e.g., the size of the image, the type of interest point detector that is used, etc.) and in many cases this leads to a different number of feature vectors per image.
  2. Then the dictionary learning process follows, where a set of prototype feature vectors is learned, typically called codewords. The set of all codewords is also called codebook or dictionary.
  3. Finally, in order to extract a constant length representation for each image each of the extracted feature vectors is quantized using the codewords. Then the final histogram representation of each input image can be extracted simply by counting the number of feature vectors that were assigned to each codeword.

The above describes the BoF pipeline: extract features, cluster the features to obtain codewords, and represent each image as a histogram (vector) over the codewords. A minimal sketch of this pipeline is given below.
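A minimal sketch of this classical pipeline, assuming dense local descriptors are already available and using scikit-learn's `KMeans` for dictionary learning (the descriptor extractor, vocabulary size, and hard assignment are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_sets, n_codewords=64):
    """Cluster all local descriptors to learn the codewords (the dictionary)."""
    all_descriptors = np.vstack(descriptor_sets)    # images may contribute different numbers of descriptors
    return KMeans(n_clusters=n_codewords, n_init=10).fit(all_descriptors)

def bof_histogram(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword and count the assignments."""
    assignments = codebook.predict(descriptors)
    hist = np.bincount(assignments, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()                         # constant-length representation of the image

# Hypothetical usage (extract_descriptors is a placeholder, e.g., a SIFT extractor):
#   codebook = build_codebook([extract_descriptors(img) for img in train_images])
#   x = bof_histogram(extract_descriptors(image), codebook)
```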

BoF, when used in DL models, is formulated using two neural layers.

The first one measures the similarity of the input feature vectors to the codewords and quantizes the input feature vectors. The second one performs the accumulation of these similarity vectors and creates the final histogram representation.

These two layers compose a unified architecture that is differentiable and can be used with any DL model.

the output of the $k$th codeword neuron $[\phi(x)]_k$, that is, the neuron that measures the similarity with the $k$th codeword, is defined as

where $x$ is a feature vector and $v_k$ is the corresponding codeword (the center of the RBF neuron). Typically, a scaling factor $\sigma_k$ is used to adjust the RBF kernel to the actual distribution of the feature vectors.

note that usually the output of this layer is normalized in order to provide a distribution over the similarity to the $N_K$ codewords as

the first layer extracts one membership vector for each input feature vector. These vectors are then accumulated in the second layer of the BoF model simply by aggregating them as

where $\phi(x) = ([\phi(x)]_1, \cdots, [\phi(x)]_{N_K})$ denotes the output vector of the RBF layer. The histogram $s$ provides a distribution over the codewords, each of which corresponds to a different semantic prototype, and as a result, describes the semantic content of each image. This vector can then be fed to the fully connected layers of a DL architecture or directly used for other representation learning tasks.

The above shows that the BoF module is split into two layers; the formula for the codeword neurons of the first layer and the aggregation formula for the histogram are given. A minimal sketch of such a module follows.
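A minimal PyTorch sketch of this two-layer neural BoF module. The random codeword initialization and the exact form of the RBF kernel are assumptions; only the overall structure (RBF similarity, normalization over the codewords, average accumulation) mirrors the description above.

```python
import torch
import torch.nn as nn

class NeuralBoF(nn.Module):
    def __init__(self, feature_dim, n_codewords):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(n_codewords, feature_dim))   # the codewords v_k
        self.log_sigma = nn.Parameter(torch.zeros(n_codewords))                # per-codeword scaling sigma_k

    def forward(self, x):
        # x: (batch, n_feature_vectors, feature_dim), the feature vectors extracted from each image
        codewords = self.codewords.unsqueeze(0).expand(x.shape[0], -1, -1)
        d = torch.cdist(x, codewords)                      # distance of every feature vector to every codeword
        phi = torch.exp(-d / self.log_sigma.exp())         # RBF similarities [phi(x)]_k
        phi = phi / phi.sum(dim=-1, keepdim=True)          # normalize into membership distributions
        return phi.mean(dim=1)                             # accumulate into the histogram s over N_K codewords
```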

An intuitive illustration of the histogram:

7.4.1 Convolutional feature histograms for real-time tracking

using a similarity metric, such as the normalized Euclidean similarity:

where $g_{ijm}$ is the membership of the $(i,j)$th feature vector to the $m$th codeword.

The above gives the similarity (membership) computation, where $f_{ij}$ denotes a feature vector and $b_m$ denotes a codeword.

After all memberships have been calculated, the histogram representation can be extracted by aggregating the membership vectors spatially. Formally, a single histogram $q \in \mathbb{R}^M$ may be extracted as

image-20240517120335889

The above gives the formula for extracting the histogram representation, together with an illustration.

The LBoF module, formulated as a normalized convolutional layer, is a better match. In this case, the memberships are computed as the absolute value of the inner product between each feature vector and each codeword, and subsequently normalized by the sum of all memberships by spatial location:

Formally, using a sliding window of size $p\times p$ with stride $1$, multiple histogram representations $Q \in \mathbb{R}^{H_q \times W_q \times M}$ can be obtained as

where $//$ denotes integer division, $k=1, \dots, H_q$ and $l=1, \dots, W_q$. This can be translated into a standard average pooling layer. Stride can be introduced into Eq. (98) to produce fewer or even nonoverlapping histograms.

The above gives the convolutional formulation; a sketch of the corresponding module follows.
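A hedged PyTorch sketch of the LBoF module as a normalized convolution followed by average pooling over $p \times p$ windows; the window size, stride, and the small constant added for numerical stability are illustrative choices.

```python
import torch.nn as nn

class LBoF(nn.Module):
    def __init__(self, in_channels, n_codewords, window=3, stride=1):
        super().__init__()
        # 1x1 convolution computes the inner product of each feature vector f_ij with every codeword b_m
        self.codewords = nn.Conv2d(in_channels, n_codewords, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(window, stride=stride)

    def forward(self, feature_map):
        g = self.codewords(feature_map).abs()               # memberships |<f_ij, b_m>|
        g = g / (g.sum(dim=1, keepdim=True) + 1e-8)         # normalize over the codewords at each location
        return self.pool(g)                                 # Q: one M-dimensional histogram per window
```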

7.5 Early exits for adaptive inference

Perhaps the easiest way to achieve this is by including additional early exit layers on top of specific layers of the network, as shown in the following figure:

image-20240518104724418

The above introduces a model with an adaptive computation graph, obtained by adding extra early exit layers on top of specific layers of the network.

Unfortunately, using early exits, especially on earlier layers of a network, is not always straightforward, due to the enormous dimensionality of the respective feature maps. Some methods employ aggressive subsampling methods, for example, global average pooling [69], which, however, ignore both the spatial information and the distribution of the extracted feature vectors, reducing the accuracy of early exits. This problem is partly addressed in [67], where a series of densely connected structures is employed. However, this requires a significant number of structural changes in the architecture of a network and cannot be easily used with existing neural networks. On the other hand, in [64], this problem is addressed using additional dimensionality reduction layers, which can alleviate this issue. However, these extra layers can also increase the computational complexity of using early exits.

The above paragraph argues that early exits are not straightforward to use on layers with high-dimensional feature maps. Some methods apply aggressive subsampling (e.g., global average pooling), but this introduces new problems: it ignores the spatial information and the distribution of the extracted feature vectors, reducing the accuracy of the early exits. Partial remedies include employing a series of densely connected structures, or adding extra dimensionality reduction layers (in short, reducing the dimensionality), but these in turn require structural changes to the network or increase the computational complexity.

image-20240518110713003

image-20240518111042811

The above presents the structure this section introduces (Fig. 7.8). It is also an early exit approach, built on the BoF module from the previous section. It can estimate the final output of the network at various points of the inference process without feedforwarding through the entire network, while retaining more information about the distribution of the features extracted at each layer and encoding more spatial information. It also enables an efficient hierarchical aggregation scheme that incrementally refines the network's estimate using the information from previous layers.

7.5.1 Early exits using bag-of-features

This section presents an early exit method based on bag-of-features.


Therefore an input image is denoted as $x \in \mathbb{R}^{W \times H \times C}$, where $W$ is its width, $H$ is its height, and $C$ is the number of channels. Also, let $y^{(i)} = f_W(x, i) \in \mathbb{R}^{W_i \times H_i \times C_i}$ be the output feature map of the $i$th convolutional layer. Again, $W_i$ , $H_i$ , and $C_i$ refer to the width, height, and number of channels of the corresponding output. Also, the final output of the DL model, which is used to estimate the probability of each sample to belong to a different class (out of a total of $N_C$ classes), is denoted by $y = f_W(x, m) \in \mathbb{R}^{N_C}$.

The above paragraph introduces the notation.

DL models are typically trained using back-propagation to minimize a loss function $\mathcal{L}$:

where $\mathcal{X} = \{x_1, x_2, \dots, x_N\}$ is a training set of $N$ images and each image is accompanied by a target (ground truth annotation) vector $t_i \in \mathbb{R}^{N_C}$.

The above restates the core idea of back-propagation; the loss $\mathcal{L}$ used in the text is the cross-entropy loss, and its formula is given.

Early exits propose using an additional estimator $g_{W_i}^{(i)} (\cdot)$ on the feature maps extracted at the $i$th layer of a model as

where $W_i$ refers to the parameters of the corresponding estimator. This formulation allows for estimating the output of the network at different levels, without feedforwarding the whole network. Therefore the total inference time can be adjusted by selecting a different estimator and stopping the inference process there.

The above paragraph introduces the additional estimators used by early exits; such an estimator can produce an output prediction at intermediate layers. A minimal sketch of a naive estimator head is shown below.
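For illustration only, an early-exit estimator attached to an intermediate feature map could be as simple as the following global-average-pooling plus linear head; this is the naive baseline whose shortcomings motivate the BoF-based summary below, and its structure is an assumption rather than the book's exact estimator.

```python
import torch.nn as nn

class EarlyExitHead(nn.Module):
    """A naive estimator g^(i): global average pooling followed by a linear classifier."""
    def __init__(self, channels, n_classes):
        super().__init__()
        self.classifier = nn.Linear(channels, n_classes)

    def forward(self, feature_map):
        # feature_map: (batch, C_i, H_i, W_i), the output y^(i) of the i-th layer
        pooled = feature_map.mean(dim=(2, 3))              # aggressive spatial subsampling
        return self.classifier(pooled).softmax(dim=-1)     # class probabilities at this exit
```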

for each feature vector $y^{(i)}_{kl}$ a membership vector can be extracted, which is calculated as

where $(k,l)$ is the location from which the feature vector was extracted and $K(\cdot)$ is a kernel function that is used to measure the similarity between a feature vector and a codeword. In other words, the membership vectors measure how similar each feature vector is to each prototypical concept (note that $||u_{ikl}||_1 = 1$, where $||\cdot||_1$ denotes the $l_1$ norm of a vector).

The above paragraph gives the formula for computing the membership vectors.

Each codeword is denoted by $v_{ij}$, where $i$ indexes the BoF layer (i.e., the $i$th exit) and $j$ indexes the codeword.

To compile the final constant length summary histogram for each image the membership vectors can simply be aggregated as

This histogram vector provides a semantic summary of the features extracted from each layer. As a result, this compact vector can be used to fit the early exit estimator instead of performing naive pooling operations (e.g., average or sum pooling) that lead to losing valuable discriminative information, or using more complex convolutional dimensionality reduction layers. Note that spatial segmentation schemes can also be used to keep more spatial information in the extracted representation, if needed.

The above gives the formula for aggregating the membership vectors into a constant length summary histogram for each image; this histogram provides a semantic summary of the features extracted from each layer (and can therefore replace the naive pooling used in plain early exits).

To further improve the performance of models that use early exits a hierarchical inference structure can be built, as shown in Fig. 7.7 (above), by incrementally concatenating the current BoF representation with the ones extracted from previous layers. As a result, the final representation $s^{(i,h)}$ can be extracted from the $i$th layer as

where the operator $a \frown b$ is used to refer to the concatenation of vectors $a$ and $b$. By caching and reusing the representations extracted from the previous layers the classification accuracy can increase at virtually no additional cost, apart from the use of estimators with more parameters. The number of parameters can be further reduced by using an additive formulation for building the histogram representations (Fig. 7.8) as

where $\alpha$ is a decay factor, used to perform exponential averaging over the histograms. This also enables us to reuse the same estimator, further reducing the number of parameters required for early exits by building an implicit common histogram representation space, which is incrementally refined as more early exits are used.

The above presents aggregation formulations that further improve the performance of the model; both schemes are sketched below.
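Both aggregation schemes can be sketched as follows, assuming each exit has already produced its BoF histogram; the exact additive recursion is inferred from the description above and should be treated as an assumption.

```python
import torch

def hierarchical_histogram(prev_hist, curr_hist):
    """s^(i,h): concatenate the cached representation of previous exits with the current one."""
    return curr_hist if prev_hist is None else torch.cat([prev_hist, curr_hist], dim=-1)

def additive_histogram(prev_hist, curr_hist, alpha=0.5):
    """Additive variant: exponential averaging with decay factor alpha keeps a fixed-size
    histogram, so the same estimator can be reused across exits."""
    return curr_hist if prev_hist is None else alpha * prev_hist + (1 - alpha) * curr_hist
```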

7.5.2 Adaptive inference with early exits

This section discusses different ways of using adaptive inference with early exits to speed up DL models.


Apart from using early exits for adapting the computational graph to the available resources, early exits can also be used to stop the inference process early for easier input samples. However, in order to do so, an appropriate stopping criterion should be defined. Perhaps the most straightforward solution is to estimate the difficulty of each sample based on the uncertainty of the network $r(x, i)$ for a given sample $x$ at the $i$th exit, using the confidence of the network for the class with the highest activation. This uncertainty can be estimated as

where $k = \arg\max_{k'} [g_{W_i}^{(i)}(y^{(i)})]_{k'}$.

The above paragraph defines a stopping criterion that allows the inference process to terminate early.

The most straightforward approach is to base it on an estimate of the network's uncertainty $r(x, i)$, whose formula is given above.

Then the threshold for stopping the inference process at the $i$th early exit can be calculated based on the mean activation of the winning neurons:

where $g_{W_i}^{(i)}$ denotes the output of the $i$th exit and $k=\arg\max g_{W_i}^{(i)}(y_j)$.

The above paragraph gives the way the stopping threshold is computed.

Therefore the computation graph is adapted to the difficulty of the input sample by stopping the inference process on an early exit when the confidence on a class is higher than the corresponding threshold, that is,

where $\alpha$ is an inference hyperparameter. That is, using larger values for this parameter leads to more accurate results, but more layers of the network will be used, increasing the inference time. On the other hand, using smaller values for the $\alpha$ parameter can increase the speed, but can also lead to less accurate classification results, since the inference process will use only a few early layers of the network.

Even though the aforementioned process is easy to implement, the value of the inference hyperparameter $\alpha$ is not bounded, requiring a significant amount of experimentation in order to select the most appropriate value. More recent early exit formulations can overcome this issue by calculating this threshold while taking into account both the mean activation of a layer $\mu_i$ (to provide a lower bound for the confidence), as well as the output of the model $\mu_C$ (to provide an upper bound). As a result, the threshold can be estimated as

for $\alpha \in [0,1]$. This allows for having more interpretable values for $\alpha$, since this hyperparameter is now bounded between 0 and 1.

The above paragraphs introduce the hyperparameter $\alpha$, which applies the computed threshold to the network's outputs as the final stopping criterion. A sketch of the resulting adaptive inference loop is given below.
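Putting everything together, a hedged sketch of the adaptive inference loop with the bounded threshold could look as follows. The interpolation $\mu_i + \alpha(\mu_C - \mu_i)$ is an assumption consistent with the described lower and upper bounds, and `layers`, `exits`, `mu_layers`, and `mu_final` are hypothetical names (one exit per layer is assumed for simplicity).

```python
def adaptive_inference(layers, exits, x, mu_layers, mu_final, alpha=0.5):
    """Feed-forward layer by layer and stop at the first exit whose confidence
    exceeds its threshold; x is a single sample (batch of one)."""
    feats = x
    probs = None
    for layer, exit_head, mu_i in zip(layers, exits, mu_layers):
        feats = layer(feats)                                 # y^(i)
        probs = exit_head(feats)                             # g^(i)_{W_i}(y^(i))
        confidence = probs.max().item()                      # activation of the winning class
        threshold = mu_i + alpha * (mu_final - mu_i)         # bounded threshold for this exit
        if confidence >= threshold:                          # easy sample: stop the inference early
            return probs
    return probs                                             # hardest samples traverse the whole network
```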