Lectures

Lecture 9: Scanning for patterns (aka Convolutional Networks)

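Start with a problem about shift invariance: decide whether the word "Welcome" occurs anywhere in an audio clip. A plain MLP would have to learn every case separately. In the simplest setting, suppose the audio is split into segments and encoded as a vector with one element per segment, set to 1 where that segment contains "Welcome" and 0 otherwise. The number of cases to cover then grows combinatorially with the number of segments, and training on a dataset that large is not realistic.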

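Shift-invariance problems like this come up all the time in practice, for example deciding whether an image contains a given pattern. The answer usually should not depend on where the pattern sits, and a plain MLP does not have that property.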

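With no good strategy for the input as a whole, we can think locally instead, which is the scan strategy: train a network that only decides whether one segment of audio contains "welcome", slide it over the audio segment by segment, and combine the results once the whole clip has been covered. The network is evaluated many times, but every evaluation shares the same parameters, so this is far more feasible than the previous idea, both in the cost of building the network and in implementing it. Training works as before; the one thing to watch is the shared parameters, whose gradients need an extra accumulation step over every position where the parameter is used.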

The next idea is the important one: it shows how to build an equivalent network. The core trick is to change the order of the loops: The operations in scanning the input with a full network can be equivalently reordered as scanning the input with individual neurons in the first layer to produce scanned “maps” of the input, and then jointly scanning the “map” of outputs by all neurons in the previous layers by neurons in subsequent layers.

'''
K = width of "patch" evaluated by MLP
W is the width of image
H is the height of image
L is the number of layers
'''
# the original order
for x = 1:W-K+1
    for y = 1:H-K+1
        for l = 1:L # layers operate on vector at (x, y)
            if(l == 1) # first layer operates on input
                Y(0, x, y) = img(1:C, x:x+K-1, y:y+K-1)
            z(l, x, y) = W(l)Y(l-1, x, y) + b(l)
            Y(l, x, y) = activation(z(l, x, y))
Y = softmax(Y(L, 1, 1) .. Y(L, W-K+1, H-K+1))

# now change the order
for l = 1:L # layers operate on vector at (x, y)
    for x = 1:W-K+1
        for y = 1:H-K+1
            if(l == 1) # first layer operates on input
                Y(0, x, y) = img(1:C, x:x+K-1, y:y+K-1)
            z(l, x, y) = W(l)Y(l-1, x, y) + b(l)
            Y(l, x, y) = activation(z(l, x, y))
Y = softmax(Y(L, 1, 1) .. Y(L, W-K+1, H-K+1))
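The loop swap can be checked numerically. The sketch below is a 1D analogue (a signal of T frames rather than an image), with tanh as an assumed activation and the final softmax left out; the function names and sizes are made up for illustration and are not from the slides.

```python
import numpy as np

def scan_position_first(signal, weights, biases, K):
    """Original order: at each window position, run the whole MLP."""
    n_pos = signal.shape[0] - K + 1
    outputs = []
    for x in range(n_pos):
        y = signal[x:x + K].reshape(-1)      # flatten the K-frame window
        for W, b in zip(weights, biases):    # all layers at this position
            y = np.tanh(W @ y + b)
        outputs.append(y)
    return np.stack(outputs)

def scan_layer_first(signal, weights, biases, K):
    """Reordered: each layer visits every position before the next layer runs."""
    n_pos = signal.shape[0] - K + 1
    maps = [signal[x:x + K].reshape(-1) for x in range(n_pos)]
    for W, b in zip(weights, biases):        # one layer at a time
        maps = [np.tanh(W @ y + b) for y in maps]
    return np.stack(maps)

rng = np.random.default_rng(0)
T, C, K = 20, 3, 4                           # frames, channels, window width
signal = rng.standard_normal((T, C))
sizes = [K * C, 5, 2]                        # layer widths of the scanned MLP
weights = [rng.standard_normal((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(o) for o in sizes[1:]]

out_a = scan_position_first(signal, weights, biases, K)
out_b = scan_layer_first(signal, weights, biases, K)
print(np.allclose(out_a, out_b))
```

Both functions do exactly the same arithmetic; only the nesting of the loops differs, which is why the intermediate results of the second version can be read as per-layer "maps".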

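Once the scan has this layer-by-layer (recursive) form, we can go one step further: treat the output of each first-layer neuron as a picture of extracted features, and let the second layer recognize larger patterns built from the fine-grained features the first layer found. This is the idea of distributing the scan: the job of recognizing the whole pattern is split across the layers of the network, and the network's ability to represent features (try all possible combinations) is used to simplify the model.

The professor also noted that unless this structure is imposed by hand, an ordinary MLP will not learn fine-grained features this way. So "an MLP can learn anything" really is only true in theory.

The figures on the slides are slightly imprecise here: after distributing, the second and third layers should produce fewer outputs. The slides are probably only meant to be schematic.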

In code, the logic is as follows:

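Here one index ranges over the different neurons in each layer. The difference from the two versions above is that the distributed scan does not slice only the image at the first layer: every layer slices the previous layer's output and takes those slices as its input. If only the first layer looks at a window wider than one position and every later layer looks at a single position, the result is equivalent to the original scan strategy.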
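A minimal 1D sketch of a distributed scan, under assumptions of my own: tanh activations, no biases, stride 1, and made-up layer sizes. `distributed_scan`, `make_weights` and `mlp` are hypothetical helper names, not code from the slides. It also checks the special case where only the first layer has a window wider than 1.

```python
import numpy as np

def distributed_scan(signal, weights, kernel_widths):
    """Scan a (T, C) signal layer by layer.

    weights[l] has shape (D_l, K_l * D_{l-1}); row j holds neuron j of layer l.
    Layer l slices windows of K_l positions out of the previous layer's map.
    """
    maps = signal
    for W, K in zip(weights, kernel_widths):
        n_pos = maps.shape[0] - K + 1
        maps = np.stack([np.tanh(W @ maps[x:x + K].reshape(-1))
                         for x in range(n_pos)])
    return maps

rng = np.random.default_rng(0)
T, C = 20, 3
signal = rng.standard_normal((T, C))

def make_weights(kernel_widths, layer_sizes):
    dims = [C] + layer_sizes
    return [rng.standard_normal((dims[l + 1], K * dims[l]))
            for l, K in enumerate(kernel_widths)]

# Distributed: every layer looks at 2 neighbouring positions of the layer below.
dist_K = [2, 2, 2]
dist_out = distributed_scan(signal, make_weights(dist_K, [5, 5, 2]), dist_K)

# Non-distributed: layer 1 sees the whole 4-frame patch, later layers see 1 position.
flat_K = [4, 1, 1]
flat_W = make_weights(flat_K, [5, 5, 2])
flat_out = distributed_scan(signal, flat_W, flat_K)

# Reference: run the same MLP independently on each 4-frame window.
def mlp(v):
    for W in flat_W:
        v = np.tanh(W @ v)
    return v

reference = np.stack([mlp(signal[x:x + 4].reshape(-1)) for x in range(T - 4 + 1)])
print(dist_out.shape, flat_out.shape, np.allclose(flat_out, reference))
```

Both configurations cover 4 input frames per output, so both produce 17 output positions; they differ only in how that window is split across layers.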

This gives us an extremely important concept, convolution:

That is, the idea of scanning with a filter; a network built this way is called a CNN. Below is the vector notation:

The advantages of distributing (see the slides for a detailed explanation):

Distribution forces hierarchical representations with localized patterns in lower layers, which means more generalizable.
Fewer computations, which means reusable computations from lower layers.
Far fewer parameters.

Key intuition: Regardless of the distribution, we can view the network as “scanning” the picture with an MLP. The only difference is the manner in which parameters are shared in the MLP (take the filters into consideration).

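In short, distributing frees the scan from the one-window-in, one-output-out constraint of the MLP: the input can be split further, which reduces the number of shared parameters and lets each layer learn finer-grained features.

As an aside, the mathematical definition of convolution boils down, abstractly, to taking a weighted average around a point.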
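A small illustration of the "weighted average around a point" reading, using `np.convolve`. The signal and weights are made-up values. Note that the textbook definition flips the kernel, which the manual sum below makes explicit; CNN layers usually skip the flip and compute a cross-correlation instead.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
weights = np.array([0.5, 0.3, 0.2])          # sums to 1: a weighted average

# mode="valid" keeps only positions where the kernel fully overlaps the signal
smoothed = np.convolve(signal, weights, mode="valid")

# Same thing written out: output n weights the samples around it,
# with the kernel index running backwards (the flip in the definition).
manual = np.array([
    sum(weights[k] * signal[n + 2 - k] for k in range(3))
    for n in range(3)
])
print(smoothed, manual)
```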

Lecture 10: Models of vision, Convolutional Neural Networks

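This lecture starts from neurobiology to find some heuristic ideas. It first describes an experiment on cats, which leads to the notions of S cells and C cells. Next comes Fukushima's neocognitron, which is mathematically equivalent to the scan strategy. It is unsupervised, which is quite interesting, and is trained with the Hebbian rule; the final results are really quite good.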

The mammalian visual cortex contains S cells, which capture oriented visual patterns, and C cells, which perform a “majority” vote over groups of S cells for robustness to noise and positional jitter
The neocognitron emulates this behavior with planar banks of S and C cells with identical response, to enable shift invariance
- Only S cells are learned
- C cells perform the equivalent of a max over groups of S cells for robustness
- Unsupervised learning results in learning useful patterns
LeCun’s LeNet added external supervision to the neocognitron
- S planes of cells with identical response are modeled by a scan (convolution) over image planes by a single neuron
- C planes are emulated by cells that perform a max over groups of S cells: Reducing the size of the S planes
- Giving us a “Convolutional Neural Network”

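Combining this lecture with the previous one gives the complete CNN architecture.

Next the professor explained pooling, stride, padding, convolution, downsampling and upsampling in some detail. I won't go through them here, since the examples on the slides are thorough. A few points I want to add:

pooling + stride is equivalent to a pooling layer followed by a downsampling layer. After this operation the map is smaller by a factor of stride*stride, so to preserve information we need correspondingly more channels. This is why the number of channels keeps growing in later convolutional layers.

As an example, suppose the input has two channels and is 4x4, and pooling uses a stride of 2. The output is then 2x2 with two channels, so to avoid losing information we would need to go from two channels to eight (2 x 4 x 4 = 32 = 8 x 2 x 2 values).

upsampling + convolution is equivalent to shrinking the convolution's stride by the upsampling factor: upsampling by 2 corresponds to halving the stride.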

Q: What is the relationship between the number of channels in the output of a convolutional layer and the number of neurons in the corresponding layer of a scanning MLP?

A: They are the same.

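So the shift invariance of a CNN shows up in how the channels are computed: one full pass of the MLP scanning over the image is equivalent to computing the channels.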

The number of “channels” in any filter equals the number of input maps (output maps from the previous layer). The number of filters equals the number of output maps.

Lecture 11: Learning in Convolutional Neural Networks

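Training a CNN is similar to training an MLP; only the structure differs. The parameters a CNN has to learn are the filters plus the final MLP.

For the chain rule with channels and filters, the input/output channel maps can be thought of as the inputs and outputs of different neurons in an MLP. An MLP input is one-dimensional (ignoring the batch), whereas a CNN input can have shape (input_channel, data_dim1, data_dim2, ...). A filter plays the role of the MLP weight, except that in a CNN it is two-dimensional.

This view also suggests one way to implement a CNN: with channel-last data, an operation such as img2col is often used to compute the result. I will draw a diagram to illustrate:

P.S. In principle, both channel-first and channel-last layouts can be implemented directly and easily with tensordot, following the procedure on the slides. Some materials use img2col instead, probably for the sake of learning and practice, which amounts to implementing tensordot by hand.

When converting between a CNN and an MLP, consider the first layer and the later layers separately. For image data the first layer's input_channel is usually 3, for the RGB colors; for one-dimensional data the first layer's input_channel can be defined as the dimensionality of the data.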
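To make the comparison concrete, here is a rough sketch of a stride-1 1D convolution written both ways. It assumes a channel-first layout (batch, in_channels, width) and a weight of shape (out_channels, in_channels, kernel_size); the function names are mine, and the im2col variant is only one of several possible ways to lay out the columns.

```python
import numpy as np

def conv1d_tensordot(A, W):
    """Slide over positions; tensordot contracts (in_channels, kernel) at each one."""
    B, C_in, width = A.shape
    C_out, _, K = W.shape
    out = np.empty((B, C_out, width - K + 1))
    for x in range(width - K + 1):
        out[:, :, x] = np.tensordot(A[:, :, x:x + K], W, axes=([1, 2], [1, 2]))
    return out

def conv1d_im2col(A, W):
    """Unroll every patch into a row, then do a single matrix multiply."""
    B, C_in, width = A.shape
    C_out, _, K = W.shape
    n_pos = width - K + 1
    # cols[b, x] is the flattened (C_in * K) patch starting at position x
    cols = np.stack([A[:, :, x:x + K].reshape(B, -1) for x in range(n_pos)], axis=1)
    out = cols @ W.reshape(C_out, -1).T      # (B, n_pos, C_out)
    return out.transpose(0, 2, 1)            # back to (B, C_out, n_pos)

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 9))
W = rng.standard_normal((4, 3, 3))
out_td = conv1d_tensordot(A, W)
out_col = conv1d_im2col(A, W)
print(out_td.shape, np.allclose(out_td, out_col))
```

The two agree because flattening a patch and flattening a filter use the same (channel, kernel position) ordering, so the matrix multiply performs the same contraction that tensordot does.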

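On turning the derivative computation into a convolution: the backward (derivative) map is zero-padded by K-1 on each side.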
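A sketch of that padding trick for a stride-1 1D layer, assuming the channel-first shapes (batch, channels, width) and a weight of shape (out_channels, in_channels, kernel_size). The helper names are hypothetical. The derivative with respect to the input is obtained by convolving the padded output derivative with the filters flipped along the kernel axis and with their channel axes swapped; a direct accumulation is included as a cross-check.

```python
import numpy as np

def conv1d_forward(A, W):
    """Stride-1 scan: Z[b, o, x] = sum over c, k of W[o, c, k] * A[b, c, x + k]."""
    K = W.shape[2]
    n_pos = A.shape[2] - K + 1
    return np.stack([np.tensordot(A[:, :, x:x + K], W, axes=([1, 2], [1, 2]))
                     for x in range(n_pos)], axis=2)

def conv1d_backward_input(dLdZ, W):
    """dL/dA as a convolution of the padded derivative map with flipped filters."""
    K = W.shape[2]
    padded = np.pad(dLdZ, ((0, 0), (0, 0), (K - 1, K - 1)))   # zero-pad by K-1
    flipped = W[:, :, ::-1].transpose(1, 0, 2)                # (C_in, C_out, K)
    return conv1d_forward(padded, flipped)

def backward_input_reference(dLdZ, W, width):
    """Direct accumulation: every output position sends gradient to its K inputs."""
    B, C_out, n_pos = dLdZ.shape
    dLdA = np.zeros((B, W.shape[1], width))
    for x in range(n_pos):
        for k in range(W.shape[2]):
            dLdA[:, :, x + k] += dLdZ[:, :, x] @ W[:, :, k]
    return dLdA

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 10))
W = rng.standard_normal((4, 3, 3))
dLdZ = rng.standard_normal(conv1d_forward(A, W).shape)

dLdA_conv = conv1d_backward_input(dLdZ, W)
dLdA_ref = backward_input_reference(dLdZ, W, A.shape[2])
print(dLdA_conv.shape, np.allclose(dLdA_conv, dLdA_ref))
```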

For convolutional layers:

Given the derivatives for the output activation maps Y(l), how to compute the derivatives w.r.t. the affine maps z(l).
Given the derivatives for the affine maps z(l), how to compute the derivatives w.r.t. Y(l-1) and W(l).

For pooling layers:

How to compute the derivative w.r.t. the input maps Y(l-1), given the derivatives w.r.t. the pooled output Y(l).

Lab HW2P1

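All the code is on GitHub, but since this course's homework is not public I have made the repository private. As for how to get the course code, all I can say is that it is on GitHub; I can't say more than that. I don't have the professor's permission, so I can't share the materials publicly. Sorry.

This part is about hand-building a CNN, continuing with the MyTorch from the previous homework.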

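Multiple-choice questions

The multiple-choice questions are genuinely hard; without the explanation in the Appendix I would not have understood them. Here is how I read a few of them:

The answer to this question is B. The figure is quite misleading; the right way to read it is vertically, like this:

In this example: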

Layer 1: 2 filters of kernel width 2, with stride 2
Layer 2: 3 filters of kernel width 3, with stride 3
Layer 3: 2 filters of kernel width 3, with stride 3

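The number of filters is the number of neurons in the layer. The kernel width is the number of groups of neurons in the layer below that a neuron connects to, where each group holds the different input channels. For the stride, treat the neurons in the layer below as the input and read off the stride from there.

The answer to this question is B; it is computed with the following formula:

The next question is: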

import numpy as np

A = np.arange(30.).reshape(2, 3, 5)
B = np.arange(24.).reshape(3, 4, 2)
# contract A's axes 0 and 1 with B's axes 2 and 0, respectively
C = np.tensordot(A, B, axes=([0, 1], [2, 0]))

For an explanation of tensordot, see: here. C has shape [5, 4], and its first element is C[0,0] = 820.
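One way to see what this tensordot call contracts is to write the same computation as an `einsum` and as explicit loops. This is only an illustrative cross-check; the index letters and the names `C_einsum` and `C_loops` are arbitrary.

```python
import numpy as np

A = np.arange(30.).reshape(2, 3, 5)
B = np.arange(24.).reshape(3, 4, 2)

# axes=([0, 1], [2, 0]): A's axis 0 pairs with B's axis 2 (size 2),
# and A's axis 1 pairs with B's axis 0 (size 3).
C = np.tensordot(A, B, axes=([0, 1], [2, 0]))

# Same contraction with named indices: i and j are summed, k and m survive.
C_einsum = np.einsum("ijk,jmi->km", A, B)

# And as plain loops over the surviving indices.
C_loops = np.zeros((5, 4))
for k in range(5):
    for m in range(4):
        C_loops[k, m] = sum(A[i, j, k] * B[j, m, i]
                            for i in range(2) for j in range(3))

print(C.shape, C[0, 0])
```

The surviving axes keep their original order, A's first and then B's, which is why the result has shape (5, 4).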

Code

The code falls into four parts: Resampling, Convolution, Pooling and ScanMLP. I will go over some of the problems I ran into while implementing them.

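Resampling: upsampling is the simplest part. The forward pass can be implemented with the statement below, and the backward pass is the matching strided slice of the incoming gradient: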

Z = np.insert(A, list(range(1, A.shape[-1])) * (self.upsampling_factor - 1), 0, axis=-1)

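For downsampling, the input length also has to be recorded in the forward pass, because the backward pass cannot recover it from the downsampled width alone.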
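A possible shape for the two resampling layers, as a sketch rather than the homework's reference solution. The class and attribute names here are hypothetical, and the inputs are assumed to be (batch, channels, width). The last lines compare the upsampling forward pass with the `np.insert` one-liner.

```python
import numpy as np

class Upsample1d:
    def __init__(self, factor):
        self.factor = factor

    def forward(self, A):
        B, C, W = A.shape
        # factor - 1 zeros between neighbouring samples, none after the last one
        Z = np.zeros((B, C, self.factor * (W - 1) + 1), dtype=A.dtype)
        Z[:, :, ::self.factor] = A
        return Z

    def backward(self, dLdZ):
        # only the positions that held real samples pass gradient back
        return dLdZ[:, :, ::self.factor]

class Downsample1d:
    def __init__(self, factor):
        self.factor = factor

    def forward(self, A):
        self.width_in = A.shape[-1]          # needed again in backward
        return A[:, :, ::self.factor]

    def backward(self, dLdZ):
        B, C, _ = dLdZ.shape
        dLdA = np.zeros((B, C, self.width_in), dtype=dLdZ.dtype)
        dLdA[:, :, ::self.factor] = dLdZ
        return dLdA

A = np.arange(8.).reshape(1, 2, 4)
up = Upsample1d(3)
Z = up.forward(A)
Z_insert = np.insert(A, list(range(1, A.shape[-1])) * (3 - 1), 0, axis=-1)

down = Downsample1d(3)
X = np.arange(22.).reshape(1, 2, 11)         # width 11 and width 10 both give 4 outputs
dLdX = down.backward(np.ones_like(down.forward(X)))
print(Z.shape, dLdX.shape)
```

Widths 10 and 11 both downsample to 4 samples with a factor of 3, which is why the forward pass has to remember the input width.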

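Convolution: there is not much to say about the forward convolution; tensordot does it directly. The focus here is the convolution across channels in backprop. Start from the forward relation; then, following the example in the writeup (between a single pair of channels):

Extended to all channels, this becomes: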

for every outchannel j:
    for every inchannel i:
        dLdF_ji = conv(dLdO_j, X_i)
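The double loop above can be spelled out in NumPy roughly as follows. This is a sketch for a stride-1 1D layer with shapes I am assuming, (batch, channels, width) for maps and (out_channels, in_channels, kernel_size) for filters; `filter_gradient` is a hypothetical name. One entry is cross-checked against a finite difference on the loss L = sum(Z * G), whose derivative w.r.t. Z is G.

```python
import numpy as np

def conv1d_forward(A, W):
    """Stride-1 scan: Z[b, o, x] = sum over c, k of W[o, c, k] * A[b, c, x + k]."""
    K = W.shape[2]
    n_pos = A.shape[2] - K + 1
    return np.stack([np.tensordot(A[:, :, x:x + K], W, axes=([1, 2], [1, 2]))
                     for x in range(n_pos)], axis=2)

def filter_gradient(A, dLdZ, K):
    """dLdW[j, i] = conv(dLdO_j, X_i): slide output-gradient map j over input channel i."""
    C_out, C_in, n_pos = dLdZ.shape[1], A.shape[1], dLdZ.shape[2]
    dLdW = np.zeros((C_out, C_in, K))
    for j in range(C_out):
        for i in range(C_in):
            for k in range(K):
                # sum over the batch and over all output positions
                dLdW[j, i, k] = np.sum(dLdZ[:, j, :] * A[:, i, k:k + n_pos])
    return dLdW

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 10))
W = rng.standard_normal((4, 3, 3))
G = rng.standard_normal((2, 4, 8))           # plays the role of dL/dZ

dLdW = filter_gradient(A, G, K=3)

# Finite-difference check of a single filter entry.
eps = 1e-6
W_plus = W.copy()
W_plus[1, 2, 0] += eps
numeric = (np.sum(conv1d_forward(A, W_plus) * G)
           - np.sum(conv1d_forward(A, W) * G)) / eps
print(dLdW[1, 2, 0], numeric)
```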

Pooling: information has to be recorded during the forward pass.

I'll write this up when I have time; see the code for details.
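As a placeholder illustration of what "recording information" can mean, here is a 1D stride-1 max pool that stores where each maximum came from. It is a simplified sketch of the general idea, not the homework's pooling layers, and the class name and attributes are made up.

```python
import numpy as np

class MaxPool1dStride1:
    def __init__(self, kernel):
        self.kernel = kernel

    def forward(self, A):
        B, C, W = A.shape
        n_pos = W - self.kernel + 1
        # windows[b, c, x] is the slice A[b, c, x:x+kernel]
        windows = np.stack([A[:, :, x:x + self.kernel] for x in range(n_pos)], axis=2)
        # remember the absolute input index of every maximum for backward
        self.argmax = windows.argmax(axis=3) + np.arange(n_pos)
        self.in_shape = A.shape
        return windows.max(axis=3)

    def backward(self, dLdZ):
        B, C, n_pos = dLdZ.shape
        dLdA = np.zeros(self.in_shape)
        b, c = np.meshgrid(np.arange(B), np.arange(C), indexing="ij")
        for x in range(n_pos):
            # each output position routes its gradient to the input that won the max
            dLdA[b, c, self.argmax[:, :, x]] += dLdZ[:, :, x]
        return dLdA

A = np.array([[[1.0, 3.0, 2.0, 5.0, 4.0]]])
pool = MaxPool1dStride1(2)
Z = pool.forward(A)
dLdA = pool.backward(np.ones_like(Z))
print(Z, dLdA)
```

An input that wins in two overlapping windows receives gradient from both, which is why the backward pass accumulates instead of assigning.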

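ScanMLP: a CNN is equivalent to a scanning MLP. I'll use the example from this homework to illustrate.

For a 128x24 input, suppose the MLP processes 8 vectors at a time with a stride of 4. The network structure is:

The final output is easily seen to be 4x31, because (128 - 8) / 4 + 1 = 31. For this MLP we need to design an equivalent CNN, using three convolutional layers followed by a Flatten. The homework asks for two versions: non-distributed and distributed.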
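The count of window positions is easy to wrap in a helper. The function name below is mine; the numbers are the ones from this example, and no padding is assumed.

```python
def scan_output_length(input_width, kernel_size, stride):
    """Number of window positions when no padding is used."""
    return (input_width - kernel_size) // stride + 1

n_windows = scan_output_length(128, 8, 4)    # 8 vectors per window, step of 4
output_shape = (4, n_windows)                # 4 output neurons per window
print(output_shape)
```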

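non-distributed: the structure is shown below. Because nothing is flattened at the start, the input is two-dimensional (A has shape 1x24x128, i.e. batch size, input_channel, input_width). The first convolutional layer has to express the 8x24 input as (input_channel, kernel_size); the later MLP layers carry over directly, with kernel size and stride both set to 1. To explain why input_channel is 24 and kernel_size is 8: the original MLP processes eight vectors at a time, which corresponds to kernel_size, and each vector has 24 entries, one per input channel.

This also corresponds to one of the multiple-choice questions: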

You must discover the orientation of the initial weight matrix (of the MLP) and convert it for the weights of the Conv1d stride1 layers. This will involve

reshaping a transposed weight matrix into out channels, kernel size, in channels
transposing the weights back into the correct shape for the weights of a Conv1d stride1 instance, out channels, in channels, kernel size.

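Let me explain why it is w.T.reshape(out_channels, kernel_size, in_channels).transpose(0, 2, 1). Take the first layer: the MLP weight has shape (192, 8), and its 192 inputs are the window of A unrolled as 8 vectors of length 24 laid end to end. We want the MLP output and the convolution output to be identical: for the first output channel, over all 24 input channels, the kernel weights of that output channel must equal the weights connecting the MLP's first output to each of its inputs. So 192 has to be split into 8x24, which is why we transpose first and then reshape. reshape can be understood as filling the dimensions from last to first with the original data in order, so the dimension of size 24 is filled first. Finally the weights are transposed into the standard CNN layout (out_channels, in_channels, kernel_size). The difference between transpose and reshape is easy to look up.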
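The conversion can be checked numerically for the first layer. The sketch below assumes the MLP weight is stored as (in_features, out_features) = (192, 8), uses random values in place of the real weights, and leaves out the bias and activation; the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
in_channels, kernel_size, out_channels, stride = 24, 8, 8, 4

A = rng.standard_normal((1, in_channels, 128))                      # (batch, channel, width)
w = rng.standard_normal((kernel_size * in_channels, out_channels))  # MLP weight, (192, 8)

# transpose, split 192 into (kernel_size, in_channels), then move to Conv1d layout
conv_w = w.T.reshape(out_channels, kernel_size, in_channels).transpose(0, 2, 1)

n_pos = (A.shape[2] - kernel_size) // stride + 1
mlp_out = np.empty((1, out_channels, n_pos))
conv_out = np.empty_like(mlp_out)
for p in range(n_pos):
    window = A[:, :, p * stride:p * stride + kernel_size]           # (1, 24, 8)
    # MLP view: 8 vectors of length 24 laid end to end
    flat = window.transpose(0, 2, 1).reshape(1, -1)                 # (1, 192)
    mlp_out[:, :, p] = flat @ w
    # CNN view: contract (in_channels, kernel_size) with every filter
    conv_out[:, :, p] = np.tensordot(window, conv_w, axes=([1, 2], [1, 2]))

print(conv_w.shape, np.allclose(mlp_out, conv_out))
```

The match depends on the flattened window and the reshaped weight agreeing on the order (kernel position first, channel second), which is what the reshape to (out_channels, kernel_size, in_channels) encodes.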