CS231n（4）：最优化

2015-02-13 | 研究学术 | CNN 计算机视觉机器学习基础梯度下降法 BP算法 Python

理解损失函数

图像分类的两个关键步骤包括：（1）通过评分函数将图像映射到类别评分；（2）通过损失函数度量评分。多分类支持向量机采用的是线性评分函数$f\left(\mathbf x_i,\mathbf W\right)=\mathbf W\mathbf x_i$，需要最小化的损失函数为 \[ L={1\over N}\sum_i\sum_{j\neq y_i} \max\left( 0, f\left(\mathbf x_i,\mathbf W\right)_j-f\left(\mathbf x_i,\mathbf W\right)_{y_i}+1 \right) +\alpha R(\mathbf W)。 \]

图像分类的第三个关键步骤：（3）通过最优化求解最小化损失函数的参数$\mathbf W$。

图 1: 损失函数图 [JPG]

将$\mathbf W$视为空间中的一个点，损失函数（不含正则化项）可以用上图表示。$\mathbf W$、$\mathbf W_1$和$\mathbf W_2$是随机产生的矩阵。上图左表示损失函数$L(\mathbf W+a\mathbf W_1)$，横轴为$a$，纵轴为$L$；上图中和右表示损失函数$L(\mathbf W+a\mathbf W_1+b\mathbf W_2)$，横轴和纵轴分别为$a$和$b$，越红损失$L$越大，越蓝损失$L$越小。上图左和中，只计算一张图片的损失；上图右的碗状图是100张图片损失的平均值，相当于100张上图中的平均图。

每张图$i$的损失为（不含正则化项） \[ L_i=\sum_{j\neq y_i}\max\left( 0, \mathbf w_j^\top\mathbf x_i-\mathbf w_{y_i}^\top\mathbf x_i+1 \right)， \] 它是分段线性（piecewise-linear）函数。

图 2: 1维损失函数 [PNG]

三张图片$\mathbf x_0$、$\mathbf x_1$和$\mathbf x_2$分别属于类0、1和2，它们的损失计算如下 \[ \begin{aligned} L_0 &= \max\left(0, \mathbf w_1^\top\mathbf x_0-\mathbf w_{0}^\top\mathbf x_0+1\right)+ \max\left(0, \mathbf w_2^\top\mathbf x_0-\mathbf w_{0}^\top\mathbf x_0+1\right)\\ L_1 &= \max\left(0, \mathbf w_0^\top\mathbf x_1-\mathbf w_{1}^\top\mathbf x_1+1\right)+ \max\left(0, \mathbf w_2^\top\mathbf x_1-\mathbf w_{1}^\top\mathbf x_1+1\right)\\ L_2 &= \max\left(0, \mathbf w_0^\top\mathbf x_2-\mathbf w_{2}^\top\mathbf x_2+1\right)+ \max\left(0, \mathbf w_1^\top\mathbf x_2-\mathbf w_{2}^\top\mathbf x_2+1\right)\\ L &= {1\over 3}\left(L_0+L_1+L_2\right)， \end{aligned} \] 如上图所示，横轴表示权值，纵轴表示损失。

多分类SVM的代价函数是凸函数的一个范例，当采用神经网络的评分函数$f$时，代价函数就是非凸的。损失函数不可微分（non-differentiable），因此梯度未定义，通常使用次梯度（subgradient）代替梯度。本文不区分梯度和次梯度。

损失函数可以评估任意权值$\mathbf W$，最优化的目标是找出最小化损失函数的权值$\mathbf W$。神经网络的优化不能方便地使用凸函数的优化工具。

随机方法

随机搜索

比较糟糕的优化策略是随机搜索（random search）。由于验证参数$\mathbf W$比较简单，随机搜索通过尝试不同的随机权值$\mathbf W$，从中选择最优的。

bestloss = float("inf") # Python assigns the highest possible float value
for num in xrange(1000):
  W = np.random.randn(10, 3073) * 0.0001 # generate random parameters
  loss = L(X_train, Y_train, W) # get the loss over the entire training set
  if loss < bestloss: # keep track of the best solution
    bestloss = loss
    bestW = W
  print 'in attempt %d the loss was %f, best %f' % (num, loss, bestloss)

在CIFAR-10数据集上，经过1000次尝试，随机搜索可以做到约15.5%的精度，优于随机猜想的10%。找到最优的$\mathbf W$很困难甚至不可行，但是找到更好一些的$\mathbf W$就不难么困难了。基于这个思路，优化算法可以从随机的$\mathbf W$开始，不断更新权值，使得每次更新都提升一点性能。随机搜索如同徒步者带上眼罩向山脚走。针对CIFAR-10数据集，这山有30730维，山上的每一点都相当于特定的损失值。

随机局部搜索

从随机初始化的$\mathbf W$开始，若增加扰动$\delta\mathbf W$，损失在$\mathbf W+\delta\mathbf W$比在$\mathbf W$时更低，那么更新$\mathbf W\leftarrow\mathbf W+\delta\mathbf W$。

W = np.random.randn(10, 3073) * 0.001 # generate random starting W
bestloss = float("inf")
for i in xrange(1000):
  step_size = 0.0001
  Wtry = W + np.random.randn(10, 3073) * step_size
  loss = L(Xtr_cols, Ytr, Wtry)
  if loss < bestloss:
    W = Wtry
    bestloss = loss
  print 'iter %d loss is %f' % (i, bestloss)

在CIFAR-10数据集上，经过1000次尝试，精度提高到了21.4%。但这仍然是低效耗时的算法。

梯度下降法

随机搜索对寻找最佳的权值改变方向没有帮助。沿着最佳的方向改变权值，在数学上可以保证损失最速下降（steepest descend）。这个方向和梯度（gradient）相关。

一维函数的斜率（slope）是函数值在某点的瞬时（instantaneous）变化率。梯度是斜率在多维空间函数的推广，它是每维斜率构成的向量，通常也称为导数（derivative）。一维函数的导数为 \[ {df(x)\over dx}=\lim_{h\rightarrow 0}{f(x+h)-f(x)\over h}， \] 将其推广到多变量函数时称为偏导数（partial derivative）。梯度就是每一维偏导数组成的向量，有两种计算方法：

数值梯度（numerical gradient）：近似计算，较慢，但容易；
解析梯度（analytic gradient）：计算快，但是容易出错（error-prone）。

数值梯度

梯度的数值计算方法如下：

def eval_numerical_gradient(f, x):
  """ 
  a naive implementation of numerical gradient of f at x 
  - f should be a function that takes a single argument
  - x is the point (numpy array) to evaluate the gradient at
  """ 

  fx = f(x) # evaluate function value at original point
  grad = np.zeros(x.shape)
  h = 0.00001

  # iterate over all indexes in x
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

    # evaluate function at x+h
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h # increment by h
    fxh = f(x) # evalute f(x + h)
    x[ix] = old_value # restore to previous value (very important!)

    # compute the partial derivative
    grad[ix] = (fxh - fx) / h # the slope
    it.iternext() # step to next dimension

  return grad

上述代码计算损失函数在$\mathbf x$每个维度的偏导数。在实际应用中通常使用中心差分公式（centered difference formula） \[ {df(x)\over dx}=\lim_{h\rightarrow 0}{f(x+h)-f(x-h)\over 2h}。 \]

梯度只表明了最快增长的方向，还需要在这个方向前进的步长（也就是学习率）。步长是神经网络需要设定的重要超参数。

# to use the generic code above we want a function that takes a single argument
# (the weights in our case) so we close over X_train and Y_train
def CIFAR10_loss_fun(W):
  return L(X_train, Y_train, W)

W = np.random.rand(10, 3073) * 0.001 # random weight vector
df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient

loss_original = CIFAR10_loss_fun(W) # the original loss
print 'original loss: %f' % (loss_original, )

# lets see the effect of multiple step sizes
for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]:
  step_size = 10 ** step_size_log
  W_new = W - step_size * df # new position in the weight space
  loss_new = CIFAR10_loss_fun(W_new)
  print 'for step size %f new loss: %f' % (step_size, loss_new)

# prints:
# original loss: 2.200718
# for step size 1.000000e-10 new loss: 2.200652
# for step size 1.000000e-09 new loss: 2.200057
# for step size 1.000000e-08 new loss: 2.194116
# for step size 1.000000e-07 new loss: 2.135493
# for step size 1.000000e-06 new loss: 1.647802
# for step size 1.000000e-05 new loss: 2.844355
# for step size 1.000000e-04 new loss: 25.558142
# for step size 1.000000e-03 new loss: 254.086573
# for step size 1.000000e-02 new loss: 2539.370888
# for step size 1.000000e-01 new loss: 25392.214036

上述代码中，step_size表示步长（学习率），步长小损失函数减小慢，但步长太大损失函数不降反增。当损失函数有30730个参数时，每次更新参数需要计算30731次损失函数，计算复杂度非常高。

解析梯度

数值梯度虽简单但耗时，解析梯度计算高效但易错。在实际应用中，采用解析梯度时，通过比较它与数值梯度确定梯度计算是否正确，这称为梯度校验（gradient check）。

对每个数据，多分类SVM的损失函数为

\[ L_i=\sum_{j\neq y_i}\max\left( 0, \mathbf w_j^\top\mathbf x_i-\mathbf w_{y_i}^\top\mathbf x_i+\Delta \right)， \]

对$\mathbf w_{y_i}$求偏导（梯度）

\[ \nabla_{\mathbf w_{y_i}}L_i= -\left(\sum_{j\neq y_i}\left[\left[\mathbf w_j^\top\mathbf x_i-\mathbf w_{y_i}^\top\mathbf x_i+\Delta>0 \right]\right]\right)\mathbf x_i， \]

对$\mathbf w_{j}$求偏导（梯度）

\[ \nabla_{\mathbf w_{j}}L_i= \left[\left[\mathbf w_j^\top\mathbf x_i-\mathbf w_{y_i}^\top\mathbf x_i+\Delta>0 \right]\right]\mathbf x_i。 \]

利用梯度更新参数的方法称为梯度下降法（gradient descent），通常的形式：

# Vanilla Gradient Descent

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update

梯度下降法是到目前为止最常用的优化神经网络损失函数的方法。

对大数据集上的应用，在整个数据集上计算损失函数的梯度非常耗时，通常从中抽取从中抽取一部分数据计算梯度，这称为mini-batch梯度下降法，例如采用256个样本计算梯度：

# Vanilla Minibatch Gradient Descent

while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update

在实际中，采用mini-batch的方法能提高参数更新频率，更快收敛。若mini-batch梯度下降法只采用一个样本，称为随机梯度下降法（SGD，stochastic gradient descent）或在线（on-line）梯度下降法。但这不太常用，在实际中由于向量化代码的优化，计算100个样本的梯度比1个样本的梯度计算100次高效。虽然人们使用SGD这个称谓，实际通常指的是mini-batch的梯度下降法（MGD／BGD，minbatch/batch gradient descent）。min-batch的样本数量虽是超参数，但很少采用验证法确定，通常根据内存大小来决定，或者直接设定为100左右的值。

BP梯度

图 3: BP梯度计算 [PNG]

对函数$f(x,y,z)=(x+y)z$，令$q=x+y$，那么根据链式法则（chain rule）有${\partial f\over \partial x}={\partial f\over \partial q}{\partial q\over \partial x}$。梯度${\partial f\over \partial x}$从右到左反向计算如上图“电路”所示，通过相邻节点的局部梯度链式相乘得到，每个节点用门（gate）表示。当${\partial f\over \partial q}=-4$和${\partial q\over \partial x}=1$时，${\partial f\over \partial x}=-4\times 1 = -4$。

图 4: BP梯度计算 [PNG]

当 $ f(\mathbf w,\mathbf x)={1\over 1 + e^{-\left(w_0x_0+w_1x_1+w_2\right)}} $ 时，链式计算如上如所示。链式法则的依据是复合函数的求导法则。门可以是任何可导函数（differentiable function），通常选择导数容易计算的函数作为门，如sigmoid函数$\sigma(x)={1\over 1 + e^{-x}}$。将${df\over dx}$简记为dx，${d\sigma(x)\over dx}=(1-\sigma(x))\sigma(x)$，上图的计算过程简化如下：

w = [2,-3,-3] # assume some random weights and data
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function

# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# we're done! we have the gradients on the inputs to the circuit

对于复杂的函数，一次求导很复杂，如果采用链式法则，可以降低计算复杂度。对函数 \[ f(x,y)={x+\sigma(y)\over\sigma(x)+(x+y)^2}， \] 前向计算如下：

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator   #(1)
num = x + sigy # numerator                               #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y                                              #(4)
xpysqr = xpy**2                                          #(5)
den = sigx + xpysqr # denominator                        #(6)
invden = 1.0 / den                                       #(7)
f = num * invden # done!                                 #(8)

反向梯度计算如下：

# backprop f = num * invden
dnum = invden # gradient on numerator                             #(8)
dinvden = num                                                     #(8)
# backprop invden = 1.0 / den 
dden = (-1.0 / (den**2)) * dinvden                                #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                                #(6)
dxpysqr = (1) * dden                                              #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                        #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                   #(4)
dy = (1) * dxpy                                                   #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below  #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                  #(2)
dsigy = (1) * dnum                                                #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                 #(1)
# done! phew