机器学习：泛化理论

2014-12-19 | 研究学术 | 机器学习理论

本节的主要内容来自Hsuan-Tien Lin的机器学习基石课程[1]。

泛化是指在训练集上得到的模型可以推广到整个数据集，也就是$E_{out}\approx E_{in}$，就是要使坏事$\left\lvert E_{in}(h)-E_{out}(h)\right\rvert>\epsilon$发生的概率足够小。

Generelization: $E_{out}\approx E_{in}$ possible, if $m_{\mathcal H}(N)$ breaks somewhere and $N$ large enough.

最大可能的$m_{\mathcal H}(N)$

如何通过断点$k$限定$m_{\mathcal H}(N)$的值？

由前文可得，正射线的断点是$2$，$m_{\mathcal H}(2)＝3$；正区间的断点是$3$，$m_{\mathcal H}(3)＝7$；2维感知器的断点是$4$，$m_{\mathcal H}(4)＝14$。

如果最小断点$k=2$，在$N=1,2,3,\ldots$的情况下，对任意的$\mathcal H$而言，可能得到的最大$m_{\mathcal H}(N)$是多少呢？

若$N=1$，只有1个点，永远不存在2个点能被打碎的情况，因此$m_{\mathcal H}(1)=2^1=2$；
若$N=2$，由于最小断点$k=2$，不能打碎$2$个点，因此$m_{\mathcal H}(2)<2^2=4$，最多为$3$；
若$N=3$，$3$个点中任取$2$个点都不能被打碎时，最多可能的二分法有多少种？
……

当$N=3,k=2$时，如果$m_{\mathcal H}(3)＝3$，可以肯定从3个点中任意抽取2个都不会被打碎，因为打碎2个点至少要4种二分类方法，因此$m_{\mathcal H}(3)$最少为3，可以从$4$开始考察。

图 1: 4个二分类的情况 [PNG]

对于$4$个二分类的情况，如上图所示。上图左的二分类使得$\mathbf x_2$和$\mathbf x_3$被打碎了；上图右修改了最后一种分类方法，任意两个点都没有被打碎，因此$m_{\mathcal H}(3)$最少为4。

还需考察$m_{\mathcal H}(3)=5$的情况。结果表明，$m_{\mathcal H}(3)=5$时总会让其中的$2$个点被打碎。因此，若$N=3,k=2$，最多可能的二分法只有$4$种。

上限函数

期望对$m_{\mathcal H}(N)$进一步进行限定， \[ m_{\mathcal H}(N)\leq\mbox{maximum possible }m_{\mathcal H}(N)\mbox{ given }k\leq\mbox{poly}(N)。 \]

当断点为$k$时，将最大可能的$m_{\mathcal H}(N)$定义为上限函数（bounding function）$B(N,k)$。该上限函数和假设集$\mathcal H$无关，不受感知器等特定分类器的约束，它的取值是所有$\mathcal H$中的最大值。

$m_{\mathcal H}(N)$受假设集（分类器）$\mathcal H$和样本点数目$N$的约束；$B(N,k)$受样本点数目$N$和断点$k$约束，而与假设集$\mathcal H$无关。但是，如果知道假设集$\mathcal H$的断点$k$，就可以用$B(N,k)$对$m_{\mathcal H}(N)$进一步约束，$m_{\mathcal H}(N)\leq B(N,k)$。

图 2: 上限函数计算表 [PNG]

当$N\leq k$时$B(N,k)$的计算公式容易推导，当$N>k$时$B(N,k)$的计算比较复杂， \begin{equation} B(N,k)= \left\{ \begin{aligned} & 2^N & N<k \\ & 2^N-1 & N=k \\ & \sum_{i=1}^{k-1}\binom{N}{i} & N>k \end{aligned} \right. 。 \end{equation}

VC界

通过 \[ P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E_{out}(h)\right\rvert>\epsilon\right]\leq 2m_{\mathcal H}(N)\exp\left(-2\epsilon^2N\right) \] 证明VC界（Vapnik-Chervonenkis bound） \begin{equation} P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E_{out}(h)\right\rvert>\epsilon\right]\leq 4m_{\mathcal H}(2N)\exp\left(-{1\over 8}\epsilon^2N\right) \label{eq:vc-bound} \end{equation} 分三步，对应着3个常数的变化。

第一步：用$E’_{in}$替换$E_{out}$

图 3: 替换掉out-sample误差 [PNG]

另抽取$N$个点的数据集$\mathcal D’ $计算$E’_{in}$来估计$E_{out}$的值，$E’_{in}$和$E_{in}$的取值如上图，以$E_{out}$为中心分布。${1\over 2}P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E_{out}(h)\right\rvert>\epsilon\right]$对应着上图粉色区域的面积（取$1\over 2$是忽略对称的左半区域），在满足该条件下，上图淡绿区域对应着$P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E’_{in}(h)\right\rvert>\epsilon\right]$，于是显然有

\[ \begin{aligned} {1\over 2}P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E_{out}(h)\right\rvert>\epsilon\right] \leq & P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E’_{in}(h)\right\rvert>\epsilon\right] \\ \leq & P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E’_{in}(h)\right\rvert>{\epsilon\over 2}\right] 。 \end{aligned} \]

第二步：利用成长函数约束二分类情况的数量

上一步将问题转化为只考虑$N$个点数据集$\mathcal D $和$N$个点数据集$\mathcal D’ $的问题，最多有$m_{\mathcal H}(2N)$种二分类情况，可进一步得到坏事儿发生概率的上界

\[ \begin{aligned} P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E_{out}(h)\right\rvert>\epsilon\right] \leq & 2P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E’_{in}(h)\right\rvert>{\epsilon\over 2}\right] \\ \leq & 2m_{\mathcal H}(2N)P\left[\mbox{fixed } h\mbox{ s.t. }\left\lvert E_{in}(h)-E’_{in}(h)\right\rvert>{\epsilon\over 2}\right] 。 \end{aligned} \]

第三步：利用Hoeffding without Replacement约束

易知$\left\lvert E_{in}(h)-E’_{in}(h)\right\rvert>{\epsilon\over 2}$和$\left\lvert E_{in}(h)-\frac{E_{in}(h)+E’_{in}(h)}{2}\right\rvert>{\epsilon\over 4}$等价，$\frac{E_{in}(h)+E’_{in}(h)}{2}$可以认为是out-sample数据的误差估计，所以直接利用Hoeffding可得

\[ \begin{aligned} P\left[\exists h\in\mathcal H\mbox{ s.t. }\left\lvert E_{in}(h)-E_{out}(h)\right\rvert>\epsilon\right] \leq & 2m_{\mathcal H}(2N)P\left[\mbox{fixed } h\mbox{ s.t. }\left\lvert E_{in}(h)-E’_{in}(h)\right\rvert>{\epsilon\over 2}\right] \\ = & 2m_{\mathcal H}(2N)P\left[\mbox{fixed } h\mbox{ s.t. }\left\lvert E_{in}(h)-\frac{E_{in}(h)+E’_{in}(h)}{2}\right\rvert>{\epsilon\over 4}\right] \\ \leq & 2m_{\mathcal H}(2N)\cdot 2\exp\left(-2\left(\epsilon\over 4\right)^2N\right)\\ = & 4m_{\mathcal H}(2N)\exp\left(-{1\over 8}\epsilon^2N\right)。 \end{aligned} \]

参考资料

[1]H.-T. Lin, “Lecture 6: Theory of Generalization.” Coursera, 2014. [Online]

脚注

打赏作者

2016-10-24 » NNML（03）：BP 学习
2016-10-16 » NNML（02）：感知器学习
2016-10-09 » 鲁棒及自适应控制（2）：模型
2016-10-09 » NNML（01）：引言
2016-09-24 » 家用监控设备用于电网的可行性分析
2016-09-19 » 无人机电缆隧道巡检可行性调研报告
2016-09-18 » 鲁棒及自适应控制（1）：概论
2016-09-13 » Gradient-Based Learning Applied to Document Recognition

上一篇：机器学习：训练与测试下一篇：机器学习：VC维