
Lecture 13: Time Series and Recurrent Networks

For problems with temporal or sequential input, we often need to consider the sequence as a whole: earlier parts of the sequence influence later parts, and elements have to be matched against each other across positions.

To handle this, the network must consider not only the current input but also the earlier inputs of the sequence. A simple idea is the Time Delay NN from the CNN lectures, also called a finite response system: something that happens today only affects the output of the system for a finite number of days into the future. The drawback of this design is obvious: a single input cannot influence the output indefinitely, which is why we next introduce infinite response systems.
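
A toy sketch of a finite response system (the window size, weights, and input below are made up): the output at time t depends only on the last K inputs, so a single input stops affecting the output after K steps.

```python
import numpy as np

K = 3
w = np.array([0.5, 0.3, 0.2])            # one weight per delayed input

def finite_response_output(x, t):
    window = x[max(0, t - K + 1): t + 1]           # only the last K inputs matter
    return float(w[:len(window)] @ window[::-1])   # most recent input first

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])            # a single impulse at t = 0
print([round(finite_response_output(x, t), 2) for t in range(len(x))])
# -> [0.5, 0.3, 0.2, 0.0, 0.0]: the impulse is forgotten after K steps
```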

To make a single input influence the output for the rest of time, the simplest idea is to feed the output back into the network; this is called a NARX network, a nonlinear autoregressive network with exogenous inputs. This also introduces the concept of memory: how the current step takes the past into consideration. NARX uses the simplest choice, taking the memory to be the past outputs. We can strengthen this notion of memory by pulling it out as an explicitly defined, separate variable.
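
A toy NARX-style sketch (the weights are made up): because the past output is fed back as input, a single input keeps influencing the output for the rest of time, even though its effect may decay.

```python
import numpy as np

def narx_step(x_t, y_prev, w_x=0.5, w_y=0.8):
    return np.tanh(w_x * x_t + w_y * y_prev)   # the "memory" is the past output

x = [1.0, 0.0, 0.0, 0.0, 0.0]                  # a single impulse, then silence
y = 0.0
for t, x_t in enumerate(x):
    y = narx_step(x_t, y)
    print(t, round(y, 4))                      # the impulse never fully dies out
```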

Different designs of the memory give different networks; the two most famous are introduced below (see the sketch after the following list). The first is the Jordan Network, where the memory is defined as a running average of the past outputs. The second is the Elman Network, which separates the memory cell from the output and defines the memory as a clone of the hidden state, also called a "context" unit. Note that in both networks the memory is a fixed structure with no learnable parameters, and it blocks the gradient during backprop; for the purpose of training, this was approximated as a set of T independent 1-step history nets. These networks are therefore called simple recurrent networks, "simple" (or partially recurrent) because during learning the current error does not actually propagate to the past:

  • “Blocked” at the memory units in Jordan networks
  • “Blocked” at the “context” unit in Elman networks
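
A rough one-step sketch of the two memory designs in PyTorch (the sizes and parameters are made up, and the running-average factor for the Jordan memory is an assumption); `detach()` stands in for the fact that the memory/context units have no learnable parameters and block the gradient:

```python
import torch

torch.manual_seed(0)
d_in, d_h, d_out = 4, 8, 2
Wxh = torch.randn(d_in, d_h, requires_grad=True)
Wmh = torch.randn(d_out, d_h, requires_grad=True)   # memory -> hidden (Jordan)
Wch = torch.randn(d_h, d_h, requires_grad=True)     # context -> hidden (Elman)
Who = torch.randn(d_h, d_out, requires_grad=True)

def jordan_step(x, memory, alpha=0.5):
    # memory = running average of past outputs; the update has no learnable
    # parameters and is detached, so errors do not propagate into the past
    h = torch.tanh(x @ Wxh + memory @ Wmh)
    y = h @ Who
    new_memory = alpha * memory.detach() + (1 - alpha) * y.detach()
    return y, new_memory

def elman_step(x, context):
    # context = a detached clone of the previous hidden state ("context" unit)
    h = torch.tanh(x @ Wxh + context @ Wch)
    y = h @ Who
    return y, h.detach()                             # next step's context

x = torch.randn(d_in)
y_j, mem = jordan_step(x, torch.zeros(d_out))
y_e, ctx = elman_step(x, torch.zeros(d_h))
```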

So to obtain a fully recurrent neural network, we only need to merge the hidden state and the memory cell: the state summarizes information about the entire past, and the model directly embeds the memory in the state. Learning is then no longer blocked, and the influence that any time step has on any later time step can be captured and used to adjust the parameters. This gives the simplest and most general state-space model, which is exactly the familiar RNN structure.

State-space models retain information about the past through recurrent hidden states; these are "fully recurrent" networks. The initial values of the hidden states are generally learnable parameters as well, and state-space models enable the current error to update parameters in the past.
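
Written out in a common notation (the symbols below are mine, with $f$ and $g$ the hidden and output activations), the state-space recursion is:

$$
h_t = f\!\left(W_{hh}\,h_{t-1} + W_{xh}\,x_t + b_h\right), \qquad
y_t = g\!\left(W_{hy}\,h_t + b_y\right),
$$

with the initial state $h_0$ itself treated as a learnable parameter.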

Below is the pseudocode for the forward pass:

[Figure: forward-pass pseudocode from the lecture slides]
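
The slide's pseudocode is not reproduced here; as a rough stand-in, here is a minimal NumPy sketch of the forward pass for the state-space RNN above (the names, shapes, and the tanh/linear activation choices are my own assumptions, not the lecture's exact algorithm):

```python
import numpy as np

def rnn_forward(X, h0, Wxh, Whh, Why, bh, by):
    """Forward pass of a simple state-space RNN:
        h_t = tanh(Wxh @ x_t + Whh @ h_{t-1} + bh)
        y_t = Why @ h_t + by
    X: (T, d_in) input sequence, h0: (d_h,) initial hidden state."""
    H, Y = [], []
    h = h0
    for x in X:
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # recurrent state update
        y = Why @ h + by                       # output at this time step
        H.append(h)
        Y.append(y)
    return np.stack(H), np.stack(Y)            # (T, d_h), (T, d_out)
```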

And for the backward pass:

[Figure: backpropagation-through-time pseudocode from the lecture slides]

The divergence is computed between the actual sequence of outputs and the desired sequence of outputs, and it cannot always be decomposed into a sum of divergences at individual time steps.
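
In the same spirit, here is a minimal NumPy sketch of backpropagation through time for the forward sketch above. For concreteness it assumes the divergence does decompose into a per-time-step squared error, which, as just noted, is not always the case:

```python
import numpy as np

def rnn_backward(X, H, Y, targets, h0, Wxh, Whh, Why, bh, by):
    """BPTT for rnn_forward above, with D = 0.5 * sum_t ||y_t - target_t||^2."""
    T = len(X)
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(h0)                 # gradient arriving from step t+1
    for t in reversed(range(T)):
        dy = Y[t] - targets[t]                  # dD/dy_t
        dWhy += np.outer(dy, H[t]); dby += dy
        dh = Why.T @ dy + dh_next               # dD/dh_t: local + future contributions
        dz = (1.0 - H[t] ** 2) * dh             # back through tanh (h_t = tanh(z_t))
        h_prev = H[t - 1] if t > 0 else h0
        dWxh += np.outer(dz, X[t]); dWhh += np.outer(dz, h_prev); dbh += dz
        dh_next = Whh.T @ dz                    # pass gradient on to h_{t-1}
    return dWxh, dWhh, dWhy, dbh, dby
```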

SGD updates the network one training instance at a time rather than over batches of inputs; for sequence data, the corresponding equivalent for RNNs is to update the network after each time series.
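
A small, self-contained PyTorch sketch of this per-series update scheme (the sizes, synthetic data, and learning rate are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
readout = nn.Linear(8, 2)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

for _ in range(3):                               # a few synthetic "series"
    X = torch.randn(1, 20, 4)                    # one sequence of length 20
    target = torch.randn(1, 20, 2)
    H, _ = rnn(X)                                # hidden states for every step
    loss = ((readout(H) - target) ** 2).mean()   # sequence-level divergence
    opt.zero_grad()
    loss.backward()
    opt.step()                                   # one update per full series
```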

Lecture 14: Stability Analysis and LSTMs

The problems that show up in RNNs can roughly be split into a forward part and a backward part; the two are analyzed separately below.

Forward Part

Stability Analysis

A time-delay network can achieve Bounded Input Bounded Output (BIBO) behavior: since the network only looks at a finite number of past inputs, a bounded input implies a bounded output. For an RNN the situation is much more complicated, so we start from the simplest case, a linear system: the hidden cell passes its net input through unchanged, i.e. the activation is the identity function. From the definition we get:

$$
h_t = W_{hh}\,h_{t-1} + W_{xh}\,x_t
$$

Unrolling this recursion further gives:

$$
h_t = W_{hh}^{\,t}\,h_0 + \sum_{k=1}^{t} W_{hh}^{\,t-k}\,W_{xh}\,x_k
$$

To study whether $h_t$ stays bounded, it is enough to study the term $W_{hh}^{\,t}$. This is a linear recursion: if $W_{hh}$ were a scalar $w$, the term is just the exponential $w^t$; if it is a matrix, we can use the eigen (spectral) decomposition $W_{hh} = U\Lambda U^{-1}$ (assuming it has one), and the expanded expression still contains the powers $\Lambda^t$. For any input, for large $t$ the length of the hidden vector will expand or contract according to the $t$-th power of the largest eigenvalue of the recurrent weight matrix: if $|\lambda_{\max}| > 1$ it will blow up, otherwise it will contract and shrink to 0 rapidly.
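
A quick numerical check of this claim (the matrix size and the values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_rescale(W, rho):
    """Rescale W so that its largest eigenvalue magnitude equals rho."""
    return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))

W = rng.standard_normal((8, 8))
h0 = rng.standard_normal(8)
for rho in (0.9, 1.0, 1.1):
    h = h0.copy()
    Wr = spectral_rescale(W, rho)
    for _ in range(100):
        h = Wr @ h                     # linear recursion with zero input
    print(rho, np.linalg.norm(h))      # ~0.9**100 shrinks to ~0; 1.1**100 blows up
```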

[Figure: response of the linear recursion for different largest eigenvalues]

Summary:

In linear systems, long-term behavior depends entirely on the eigenvalues of the recurrent weights matrix

  • If the largest eigenvalue is greater than 1, the system will “blow up”
  • If it is less than 1, the response will “vanish” very quickly
  • Complex eigenvalues cause an oscillatory response, but with the same overall trends
    • Magnitudes greater than 1 will cause the system to blow up
  • The rate of blow-up or vanishing depends only on the eigenvalues and not on the input

Forgetfulness

With nonlinear activations, since most activations have derivatives between 0 and 1 (with ReLU the effective slope can exceed 1), the blow-up of the linear system generally does not occur, but forgetting does. How to define "forgetting" for a network is actually quite interesting and worth thinking about; my understanding is that as long as the final output still depends on the target input, the network has not forgotten it. The simplest probe is to feed an input at the start, set all subsequent inputs to 0, and watch how the output evolves. A more refined probe is to consider a whole series of inputs, perturb the value at one particular step, and observe how the output changes.

Below we analyze feeding in a single scalar input at the start: as time goes on, for every activation the output ends up depending only on the network's parameters (the recurrent weight and bias) and not on the initial input.
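
A tiny sketch of this probe for a single tanh unit (the weight, bias, and input values below are made up): no matter what the initial input is, the state settles to the same fixed point determined by the parameters.

```python
import numpy as np

w, w_in, b = 0.9, 1.0, 0.25            # recurrent weight, input weight, bias
for x0 in (-2.0, 0.1, 3.0):            # very different one-shot inputs...
    h = 0.0
    for t in range(50):
        x = x0 if t == 0 else 0.0      # input at t = 0 only, zeros afterwards
        h = np.tanh(w * h + w_in * x + b)
    print(x0, round(h, 4))             # ...all settle to the same value (~0.71 here)
```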

[Figures: response over time to a single scalar input, for different activations and weights]

For vector inputs, the conclusion is similar:

[Figure: response over time to a single vector input]

The “memory” of the network also depends on the parameters (and activation) of the hidden units

  • Sigmoid activations saturate and the network becomes unable to retain new information
  • RELU activations blow up or vanish rapidly
  • Tanh activations are slightly more effective at storing memory, but still not for very long

Backward Part

Deep networks suffer from vanishing or exploding gradients, and RNNs even more so, since the effective depth of an RNN is the number of time steps. Below is the backpropagation formula for an RNN:

[Figure: backpropagation-through-time equations from the lecture slides]

Because the activation is applied pointwise, its Jacobian is a diagonal matrix, and the recurrent weight matrix $W$ is the same at every time step, so the gradient reaching an earlier step is a repeated product of the form $\prod_t \mathrm{diag}\!\left(f'(z_t)\right) W$. This leads to the conclusion below:

[Figure: how the repeated product of Jacobians and W makes gradients vanish or blow up]

Of course, "vanishing" in an RNN does not mean that the gradient with respect to some weight is literally zero; what vanishing gradients really mean for an RNN is that the gradient is dominated by short-range contributions, so the model has a hard time learning long-range dependencies.
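
A small PyTorch sketch that makes this concrete (the sizes and weights are arbitrary): a loss measured only at the final step produces a far larger gradient at recent hidden states than at early ones.

```python
import torch

torch.manual_seed(0)
T, d = 50, 8
W = (0.9 * torch.eye(d) + 0.05 * torch.randn(d, d)).requires_grad_()

h, states = torch.zeros(d), []
xs = torch.randn(T, d)
for t in range(T):
    h = torch.tanh(xs[t] + h @ W)
    h.retain_grad()                        # keep per-step gradients for inspection
    states.append(h)

loss = states[-1].sum()                    # divergence measured at the last step only
loss.backward()
print(states[1].grad.norm().item())        # gradient reaching an early state: tiny
print(states[-2].grad.norm().item())       # gradient at a recent state: much larger
```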

LSTMs

There is a lot to say about this part, so I will cover it in a separate blog post. The main references are:

CMU lectures

Different papers

ZhiHu

Link: https://xurui314.github.io/2023/03/13/Hey-this-is-how-LSTM-works/

https://arxiv.org/pdf/1903.00906.pdf

Lecture 15: Sequence Prediction: Alignments and Decoding

Lecture 16:

Skipping this one for now; it covers audio techniques.

Lecture 17: Sequence to Sequence Models: Attention Models

Decouple the attention weights from the information passed to the hidden state.
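
A minimal sketch of that idea (the names and sizes below are mine): the attention weights over the encoder states are computed separately from the values that are actually passed on.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
encoder_states = rng.standard_normal((T, d))     # one vector per input position
query = rng.standard_normal(d)                   # current decoder (hidden) state

scores = encoder_states @ query / np.sqrt(d)     # relevance of each position
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
context = weights @ encoder_states               # weighted sum passed onward
print(weights.round(3), context.shape)
```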

Lab

A note on a very subtle bug that took an hour to debug. It comes from how Python handles references: you need to append a copy of `a` to `b` rather than `a` itself.

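A sketch of the kind of aliasing bug described above (the variable names and values are illustrative):

```python
# Lists (and other mutable objects) are stored by reference, so appending `a`
# and then modifying it also changes what is inside `b`.
a = [1, 2, 3]
b = []
b.append(a)         # b holds a reference to a, not a snapshot of its contents
a.append(4)
print(b)            # [[1, 2, 3, 4]]  <- b changed too

b = []
b.append(a.copy())  # shallow copy (use copy.deepcopy for nested structures)
a.append(5)
print(b)            # [[1, 2, 3, 4]]  <- unaffected by later changes to a
```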