Table of Contents

  • Gradient Descent
  • Backpropagation
  • Vanishing Gradients of the Sigmoid Activation
  • Dying ReLUs
  • Parameter initialization
  • Fine-tuning


Gradient Descent

For a perceptron model with MSE loss:

$$\hat y=\sigma(\mu)=\sigma(\pmb w\cdot\pmb x),\qquad \text{mse} = \frac{1}{2}(\hat y - y)^2$$

Update the parameters with gradient descent:

$$\pmb w^* = \pmb w - \eta\,\Delta\pmb w,\qquad \Delta\pmb w = \frac{\partial L}{\partial\hat y}\frac{\partial\hat y}{\partial\mu}\frac{\partial\mu}{\partial\pmb w}=(\hat y - y)\frac{\partial\sigma}{\partial\mu}\pmb x$$

If $\sigma$ is the sigmoid, then

$$\frac{\partial\sigma(\mu)}{\partial\mu}=\sigma(\mu)(1-\sigma(\mu)) \implies \pmb w^* = \pmb w - \eta\,(\hat y - y)\,\hat y\,(1-\hat y)\,\pmb x$$

A perceptron contains only a single layer of neurons, so it can only handle linearly separable problems; it cannot handle nonlinear ones such as XOR.
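
A minimal sketch of this update rule in raw numpy (the function and variable names are illustrative, not from the original):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def perceptron_step(w, x, y, eta=0.1):
    y_hat = sigmoid(np.dot(w, x))                    # forward pass: y_hat = sigma(w . x)
    delta_w = (y_hat - y) * y_hat * (1 - y_hat) * x  # gradient of 0.5 * (y_hat - y)^2 w.r.t. w
    return w - eta * delta_w                         # gradient descent update w* = w - eta * delta_w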


Backpropagation

Let layers $k-1$ and $k$ have $m$ and $n$ neurons respectively. The input-side weight matrix, the pre-activation input, and the output of hidden layer $k$ are

$$\pmb w^{k}=(\pmb v_1^k, \cdots, \pmb v_m^k)\in\mathbb R^{n\times m},\qquad \pmb z^{k} = \pmb w^{k}f(\pmb z^{k-1}) \in \mathbb R^{n},\qquad f(\pmb z^{k}) \in \mathbb R^{n}$$

By the chain rule, the partial derivative of the (scalar) loss with respect to $w_{ij}^{k}$, the $j$-th input weight of the $i$-th neuron in layer $k$, is

$$\frac{\partial L}{\partial w_{ij}^k}=\frac{\partial L}{\partial z_i^k}\frac{\partial z_i^k}{\partial w_{ij}^k}=f(z^{k-1}_j)\,\delta_i^k$$

where $\delta_i^k$ is the $i$-th error term of layer $k$:

$$\delta_i^k=\sum_t\frac{\partial L}{\partial z_t^{k+1}}\frac{\partial z_t^{k+1}}{\partial f(z_i^k)}\frac{\partial f(z_i^k)}{\partial z_i^k} = f'(z_i^k)\sum_t\delta_t^{k+1}w^{k+1}_{ti} = f'(z_i^k)\,\pmb v_i^{k+1}\cdot\pmb\delta^{k+1}$$

The error terms of layer $k$ are thus accumulated from the error terms of layer $k+1$; this process is called error backpropagation.


Matrix form of backpropagation

In matrix form, the error term of layer $k$ is

$$\pmb\delta^k=f'(\pmb z^k) \odot \left({\pmb w^{k+1}}^T\pmb\delta^{k+1}\right)$$

The element-wise product $\odot$ is equivalent to expanding $f'(\pmb z^k)$ into a diagonal matrix, so

$$\frac{\partial E}{\partial \pmb w^k} =\pmb\delta^k f(\pmb z^{k-1})^T =\left[f'(\pmb z^k) \odot \left({\pmb w^{k+1}}^T\pmb\delta^{k+1}\right)\right] f(\pmb z^{k-1})^T$$

With mean squared error as the loss, the output layer being layer $L$, and the parameters updated by stochastic gradient descent (using a single sample), the output-layer error term is

$$\pmb\delta^L= f'(\pmb z^L)\odot(f(\pmb z^L)-\pmb y)$$

The derivative of a vector with respect to a vector is a Jacobian matrix, and the derivative of a scalar with respect to a vector is a vector; error backpropagation is therefore a sequence of matrix operations and can be parallelized!
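
A minimal numpy sketch of this recursion for a small MLP (the layer widths, the sigmoid activation, and all variable names are my own illustrative choices, not from the original):

import numpy as np

def f(z):                                       # sigmoid activation
    return 1 / (1 + np.exp(-z))

def f_prime(z):                                 # derivative of the sigmoid
    s = f(z)
    return s * (1 - s)

sizes = [4, 5, 3]                               # layer widths: input, hidden, output
ws = [np.random.randn(n, m) for m, n in zip(sizes[:-1], sizes[1:])]
x = np.random.randn(sizes[0])
y = np.random.randn(sizes[-1])

# forward pass: z^k = w^k f(z^{k-1}), with the input x playing the role of f(z^0)
zs, a = [], x
for w in ws:
    z = w.dot(a)
    zs.append(z)
    a = f(z)

# backward pass: delta^L = f'(z^L) * (f(z^L) - y), delta^k = f'(z^k) * (w^{k+1}.T delta^{k+1})
delta = f_prime(zs[-1]) * (f(zs[-1]) - y)
grads = [None] * len(ws)
for k in reversed(range(len(ws))):
    a_prev = x if k == 0 else f(zs[k - 1])
    grads[k] = np.outer(delta, a_prev)          # dE/dw^k = delta^k f(z^{k-1})^T
    if k > 0:
        delta = f_prime(zs[k - 1]) * ws[k].T.dot(delta)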

Visualization of the forward and backward pass through a single neuron:


Vanishing Gradients of the Sigmoid Activation

The tricky part that people might not realize until they think about the backward pass is that if you are sloppy with the weight initialization or data preprocessing, these non-linearities can ‘saturate’ and entirely stop learning: your training loss will be flat and refuse to go down.

A fully connected layer with a sigmoid activation, computed in raw numpy:

import numpy as np

z = 1 / (1 + np.exp(-np.dot(w, x)))   # forward pass: sigmoid of the pre-activation
dx = np.dot(w.T, z * (1 - z))         # backward pass: local gradient for x
dw = np.outer(z * (1 - z), x)         # backward pass: local gradient for w

If your weight matrix is initialized too large, the output of the matrix multiply can have a very large range, which makes all the outputs in the vector z almost binary: either 1 or 0. In that case z * (1 - z) will be close to zero in both cases (“vanish”), making the gradients for both x and w nearly zero. What’s worse, the rest of the backward pass will come out all zero from this point on, due to the multiplications in the chain rule.

On the other hand, the sigmoid’s local gradient z * (1 - z) achieves its maximum of 0.25 at z = 0.5. That means that every time the gradient signal flows through a sigmoid gate, its magnitude is diminished to a quarter or less. If you’re using basic SGD, this makes the lower layers of a network train much more slowly than the higher ones.
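
A quick numpy check of this saturation effect (the shapes and scales below are arbitrary choices for illustration):

import numpy as np

np.random.seed(0)
x = np.random.randn(100)
w_small = 0.01 * np.random.randn(50, 100)   # reasonably small initialization
w_large = 10.0 * np.random.randn(50, 100)   # overly large initialization

for w in (w_small, w_large):
    z = 1 / (1 + np.exp(-np.dot(w, x)))
    print(np.mean(z * (1 - z)))             # roughly 0.25 for the small init, nearly 0 for the large one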


Dying ReLUs

The core of the forward and backward pass for a fully connected layer that uses ReLU:

z = np.maximum(0, np.dot(w, x))   # forward pass
dw = np.outer(z > 0, x)           # backward pass: local gradient for w

If a neuron gets clamped to zero in the forward pass (i.e. z = 0), then its weights will get zero gradient; this is the “dead ReLU” problem. If a ReLU neuron is unfortunately initialized such that it never fires, or if its weights get knocked into that regime by a large update during training, the neuron will remain permanently dead. Sometimes a large fraction (e.g. 40%) of your neurons can be zero the entire time.
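
A minimal numpy sketch of a dead neuron (the shapes, and the way the weights are pushed negative, are contrived for illustration):

import numpy as np

np.random.seed(1)
x = np.abs(np.random.randn(100))   # e.g. non-negative activations coming out of a previous ReLU layer
w = np.random.randn(50, 100)
w[0] = -np.abs(w[0])               # neuron 0 can no longer produce a positive pre-activation
z = np.maximum(0, np.dot(w, x))    # forward pass
dw = np.outer(z > 0, x)            # backward pass: local gradient for w
print(z[0], dw[0].sum())           # both 0.0: neuron 0 is dead and its weights will never update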


Parameter initialization

It is vital to initialize weights to small random values, and to avoid symmetries that prevent learning/specialization:

  • initializing all weights to 0 is too symmetric: every neuron computes the same function and they never learn to do different things;
  • large inputs drive sigmoid units into saturation, where the gradients are small and the parameters are hard to update;
  • biases can be initialized to 0, and you can then watch how the system learns them;

Xavier Initialization

For most models, we would like the values flowing through the network to stay in a small, middle range.

If a weight matrix has big values in it and you multiply a vector by that matrix, the result tends to get bigger; put it through another such layer and it gets bigger again, which is terrible. Xavier initialization is often recommended to avoid this.

Xavier initialization sets the weight variance inversely proportional to the sum of the fan-in $n_{in}$ and fan-out $n_{out}$:

$$\mathrm{Var}(W)=\frac{2}{n_{in}+n_{out}}$$
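
A minimal numpy sketch of this rule (the function name and layer sizes are illustrative):

import numpy as np

def xavier_init(n_in, n_out):
    std = np.sqrt(2.0 / (n_in + n_out))        # Var(W) = 2 / (n_in + n_out)
    return std * np.random.randn(n_out, n_in)

w1 = xavier_init(784, 256)                     # e.g. a 784 -> 256 layer
print(w1.std())                                # close to sqrt(2 / (784 + 256)) ≈ 0.044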


Fine-tuning

With a small dataset, it is usually better not to update the pre-trained word embeddings; the pre-trained vectors may already be good enough.
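
As a sketch of how this is commonly done in practice (this uses PyTorch, which the original text does not mention; the sizes and names are illustrative):

import torch
import torch.nn as nn

pretrained = torch.randn(10000, 300)                          # stand-in for loaded pre-trained word vectors
emb = nn.Embedding.from_pretrained(pretrained, freeze=True)   # freeze=True: the embedding weights get no gradient

ids = torch.tensor([1, 5, 42])
vecs = emb(ids)                                               # (3, 300) lookups; emb.weight stays fixed during training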


Reference:

1. Yes you should understand backprop
