Machine Learning Techniques
Course notes for the Machine Learning Techniques course on Coursera.
Week-1
Linear Support Vector Machine
Recall PLA \(h(x) = \text{sign}(w^T x)\) and \(E_\text{out}(w) \leq E_\text{in}(w) + \Omega(\mathcal{H})\).
Among all separating hyperplanes we prefer the fattest one, i.e. the one farthest from every point, because such a hyperplane is more robust and tolerates more noise. Formalizing this idea gives
\[\begin{align} \max_{w}\quad &\text{margin}(w) \\ \text{subject to}\quad &\text{every } y_n (w^T x_n + b) > 0 \\ &\text{margin}(w) = \min_{n} \text{distance}(x_n, w) \end{align}\]where \(\text{distance}(x_n, w) = \frac{1}{\|w\|_2}|w^T x_n + b|\) is the distance from the point \(x_n\) to the separating hyperplane \(w^T x + b = 0\). To see this, let \(x'\) be any point on the hyperplane; the projection of \(x_n - x'\) onto the normal vector \(w\) gives the distance, i.e. \(\text{distance}(x_n, w) = \frac{1}{\|w\|_2}|w^T (x_n - x')|\). Since \(x'\) lies on the hyperplane, \(w^T x' + b = 0\); substituting \(w^T x' = -b\) yields the expression above.
Note that we can rescale \(w, b\) so that \(\min_n y_n(w^T x_n + b)\) takes any positive value without changing the optimal hyperplane, so we may set \(\min_n y_n(w^T x_n + b) = 1\) and obtain
\[\begin{align} \max_{w, b}\quad &\frac{1}{\|w\|_2} \\ \text{subject to}\quad & y_n(w^T x_n + b) > 0,\quad \text{for all } n \\ & \min_n |w^T x_n + b| = 1. \end{align}\]Relaxing the constraints to \(y_n(w^T x_n + b) \geq 1\) for all \(n\) (the optimum still attains equality for some \(n\)) and turning \(\max 1/\|w\|_2\) into a minimization gives
\[\begin{align} \min_{w, b}\quad &\frac{1}{2}\|w\|_2^2 \\ \text{subject to}\quad & y_n(w^T x_n + b) \geq 1,\quad \text{for all } n. \end{align}\]Dual Support Vector Machine
Lagrange Function
\[\mathcal{L}(b, w, \alpha) = \frac{1}{2} w^T w + \sum_{n=1}^N \alpha_n(1 - y_n(w^T z_n + b))\]Primal Problem and Dual Problem
\[\min_{b, w}(\max_{\alpha_n} \mathcal{L}(b, w, \alpha)) \geq \min_{b,w} \mathcal{L}(b, w, \alpha)\]for any fixed \(\alpha_n \geq 0\), because the max is at least the value at any particular \(\alpha\); taking the max over \(\alpha\) on the right then gives
\[\min_{b, w}(\max_{\alpha_n} \mathcal{L}(b, w, \alpha)) \geq \max_{\alpha_n} \min_{b,w} \mathcal{L}(b, w, \alpha)\]The Lagrange Dual Form of SVM
\[\max_{\alpha_n \geq 0}\left( \min_{b, w}\frac{1}{2}w^Tw + \sum_{n=1}^N \alpha_n (1 - y_n(w^T z_n + b))\right)\]By the KKT stationarity conditions the derivatives vanish:
\[\begin{align} \nabla_b \mathcal{L}(b, w, \alpha) &= 0 = - \sum_{n=1}^N \alpha_n y_n \\ \nabla_w \mathcal{L}(b, w, \alpha) &= 0 = w - \sum_{n=1}^N \alpha_n y_n z_n \end{align}\]Substituting these back into the dual form gives
\[\max_{\alpha_n \geq 0, \sum y_n \alpha_n = 0, w = \sum \alpha_n y_n z_n} -\frac{1}{2} \|\sum_{n=1}^N \alpha_n y_n z_n\|^2_2 + \sum_{n=1}^N \alpha_n\]By the KKT complementary slackness condition,
\[\alpha_n (1 - y_n (w^T z_n + b)) = 0\]When \(\alpha_n > 0\), we must have \(1 - y_n (w^T z_n + b) = 0\); that is, the point \(z_n\) associated with \(\alpha_n\) lies exactly on the support boundary. Such points are called support vectors. Since \(w = \sum_{n=1}^N \alpha_n y_n z_n\) and \(\alpha_n \geq 0\), only the support vectors (SVs) contribute to \(w\); in other words, \(w\) can be represented as a linear combination of the support vectors. Likewise, any support vector gives \(b = y_n - w^T z_n\) with \(w = \sum_{n=1}^N \alpha_n y_n z_n\).
Dual Form
\[\min_{\alpha_n \geq 0,\, \sum y_n \alpha_n = 0} \frac{1}{2} \|\sum_{n=1}^N \alpha_n y_n z_n\|^2_2 - \sum_{n=1}^N \alpha_n\]or
\[\begin{eqnarray*} \min_{\boldsymbol\alpha} && \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m \mathbf{z}_n^T \mathbf{z}_m - \sum_{n=1}^N \alpha_n \\ \mathrm{s.t.} && \sum_{n=1}^N y_n \alpha_n = 0 \\ && \alpha_n \geq 0 \ \ \ n=1,\cdots,N \end{eqnarray*}\]After solving for \(\alpha\), use any support vector \((z_s, y_s)\):
\[\begin{aligned} w &= \sum_{n=1}^N \alpha_n y_n z_n \\ b &= y_s - w^T z_s = y_s - \sum_{n=1}^N\alpha_n y_n (z_n^T z_s) \end{aligned}\]Week-2
Kernel Support Vector Machine
In the dual support vector machine, the solution process requires computing \(z_n^T z_m = \Phi(x_n)^T \Phi(x_m)\). Consider the second-order polynomial transform \(\Phi_2(x) = (1, x_1, x_2, \ldots, x_d,x_1^2, x_1 x_2, \ldots, x_1 x_d, x_2 x_1, x_2^2, \ldots, x_2 x_d, \ldots, x_d^2)\); then
\[\begin{aligned} \Phi_2(x)^T \Phi_2(x') &= 1 + \sum_{i=1}^d x_i x_i' + \sum_{i=1}^d \sum_{j=1}^d x_i x_j x_i' x_j' \\ &= 1 + \sum_{i=1}^d x_i x_i' + \sum_{i=1}^d x_i x_i' \sum_{j=1}^d x_j x_j' \\ &= 1 + x^T x' + (x^T x') (x^T x') \end{aligned}\]Consider a more general second-order kernel
\[\Phi_2(x) = (1, \sqrt{2\gamma}x_1, \sqrt{2\gamma}x_2, \ldots,\sqrt{2\gamma} x_d, \gamma x_1^2,\gamma x_1 x_2, \ldots,\gamma x_1 x_d,\gamma x_2 x_1,\gamma x_2^2, \ldots, \gamma x_2 x_d, \ldots,\gamma x_d^2)\] \[\begin{aligned} K_{\Phi_2}(x, x') =& 1 + 2 \gamma x^T x' + \gamma^2 (x^T x')^2 \\ =& (1 + \gamma x^T x')^2 \text{ with } \gamma > 0 \end{aligned}\]Recall
\[b = y_s - w^T z_s = y_s - \left(\sum_{n=1}^N \alpha_n y_n z_n\right)^T z_s = y_s - \sum_{n=1}^N \alpha_n y_n K(x_n, x_s)\]where \(x_s\) is any support vector.
Then, for any test input \(x\),
\[g_\text{SVM}(x) = \text{sign}(w^T \Phi(x) + b) = \text{sign}\left(\sum_{n=1}^N \alpha_n y_n K(x_n, x) + b\right)\]Gaussian Kernel and Infinite Dimensional Transform: for one-dimensional \(x\), consider \(\Phi(x) = \exp(-x^2)\left(1, \sqrt{\frac{2}{1!}} x, \sqrt{\frac{2^2}{2!}} x^2, \ldots \right)\), with \(K(x, x') = \Phi(x)^T \Phi(x')\)
More generally, Gaussian kernel
\[K(x, x') = \exp(-\gamma \| x - x' \|_2^2) \text{ with }\gamma > 0\]also called Radial Basis Function (RBF) kernel.
The larger \(\gamma\) is, the easier it is to overfit; in the extreme case, \(K_{\gamma \rightarrow \infty}(x, x') = I\{x = x'\}\).
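A minimal numpy sketch (my own, not course code) of the Gaussian kernel and the kernel-SVM prediction \(g_\text{SVM}\) above; `SV`, `alpha_y` (holding \(\alpha_n y_n\) for the support vectors) and `b` are assumed to come from an already-solved dual problem:

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2) for every pair of rows
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def svm_predict(X_test, SV, alpha_y, b, gamma=1.0):
    # g_SVM(x) = sign( sum_n alpha_n y_n K(x_n, x) + b )
    scores = gaussian_kernel(X_test, SV, gamma) @ alpha_y + b
    return np.sign(scores)
```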
The dual form of the kernel SVM is
\[\begin{eqnarray*} \min_{\boldsymbol\alpha} && \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m K(\mathbf{x}_n, \mathbf{x}_m) - \sum_{n=1}^N \alpha_n \\ \mathrm{s.t.} && \sum_{n=1}^N y_n \alpha_n = 0 \\ && \alpha_n \geq 0 \ \ \ n=1,\cdots,N \end{eqnarray*}\]After solving for \(\alpha\), use any support vector \((z_s, y_s)\):
\[\begin{aligned} w &= \sum_{n=1}^N \alpha_n y_n z_n \\ b &= y_s - w^T z_s = y_s - \sum_{n=1}^N\alpha_n y_n K(x_n, x_s) \end{aligned}\]Soft Margin Support Vector Machine
Recall pocket \(\min_{b,w} \sum_{n=1}^N I\{y_n \neq \text{sign}(w^T z_n + b)\}\)
Combining pocket with the hard-margin SVM gives
\[\begin{aligned} \min_{w, b}\quad &\frac{1}{2}\|w\|_2^2 + C \cdot \sum_{n=1}^N I\{y_n \neq \text{sign}(w^T z_n + b)\}\\ \text{subject to}\quad & y_n(w^T z_n + b) \geq 1 \quad \text{for correct } n \\ & y_n(w^T z_n + b) \geq -\infty \quad \text{for incorrect } n \end{aligned}\]However, this treats every mistake equally. To record how badly each point violates the margin, introduce slack variables \(\xi_n\):
\[\begin{align} \min_{w, b}\quad &\frac{1}{2}\|w\|_2^2 + C \cdot \sum_{n=1}^N \xi_n \\ \text{subject to}\quad & y_n(w^T x_n + b) \geq 1 - \xi_n \quad \text{for all } n\\ & \xi_n \geq 0 \end{align}\]Its Lagrange function is
\[\begin{aligned}\mathcal{L}(b, w, \xi, \alpha, \beta) = &\frac{1}{2} w^T w + C\cdot \sum_{n=1}^N \xi_n \\ &+ \sum_{n=1}^N \alpha_n(1 - \xi_n - y_n(w^T z_n + b)) + \sum_{n=1}^N \beta_n \cdot (-\xi_n)\end{aligned}\]By the KKT conditions,
\(\nabla_{\xi_n} \mathcal{L} = 0 = C - \alpha_n - \beta_n\); since \(\beta_n \geq 0\), it follows that \(0 \leq \alpha_n \leq C\).
By the KKT complementary slackness conditions,
\[\begin{aligned} \alpha_n(1 - \xi_n - y_n(w^T z_n + b)) &= 0 \\ (C - \alpha_n)\xi_n &= 0 \end{aligned}\]For any \(s\) with \(0 < \alpha_s < C\) (a free support vector), both conditions force \(\xi_s = 0\) and \(y_s(w^T z_s + b) = 1\), so that \((x_s, y_s)\) gives \(b = y_s - \sum_{n} \alpha_n y_n K(x_n, x_s)\).
The dual form of the soft-margin SVM is
\[\begin{eqnarray*} \min_{\boldsymbol\alpha} && \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m K(\mathbf{x}_n, \mathbf{x}_m) - \sum_{n=1}^N \alpha_n \\ \mathrm{s.t.} && \sum_{n=1}^N y_n \alpha_n = 0 \\ && 0 \leq \alpha_n \leq C \ \ \ n=1,\cdots,N \end{eqnarray*}\]Based on the value of \(\alpha_n\), each \((x_n, y_n)\) falls into one of three groups:
- \(\alpha_n = 0\): correctly classified points strictly outside the support boundary (non-SVs)
- \(0 < \alpha_n < C\): free support vectors, lying exactly on the support boundary
- \(\alpha_n = C\): points that violate the support boundary; some may still be correctly classified and some misclassified, with \(\xi_n\) equal to the violation amount
The larger \(C\) is, the thinner the gap between the two support boundaries (\(2/\|w\|\)) becomes and the easier it is to overfit; with the Gaussian kernel, a larger \(\gamma\) likewise makes overfitting easier.
With the kernel trick, \(w\) itself may not be computable explicitly, but \(\|w\|_2\) is: \(\|w\|_2^2 = \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m K(\mathbf{x}_n, \mathbf{x}_m)\). As \(C\) increases, \(\|w\|_2\) increases as well and the solver takes longer; from the dual problem's point of view, the box \(0 \leq \alpha_n \leq C\) that must be searched becomes larger.
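As a concrete illustration (a scikit-learn sketch, not part of the lecture), the three groups above can be read off a fitted soft-margin Gaussian-kernel SVM: `dual_coef_` stores \(\alpha_n y_n\) for the support vectors only, so \(\alpha_n = 0\) points simply do not appear, free SVs have \(0 < \alpha_n < C\), and bounded SVs have \(\alpha_n = C\):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.3 * rng.normal(size=200))  # noisy labels in {-1, +1}
y[y == 0] = 1.0

C, gamma = 1.0, 1.0
clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)

alpha = np.abs(clf.dual_coef_.ravel())   # |alpha_n y_n| = alpha_n for the SVs
free = alpha < C - 1e-8                  # free SVs: 0 < alpha_n < C (on the boundary)
print("free SVs:", int(free.sum()),
      "bounded SVs:", int((~free).sum()),
      "non-SVs:", len(X) - alpha.size)
```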
Week-3
Kernel Logistic Regression
Recall the soft-margin SVM:
\[\begin{array}{ll} \text{minimize } & \frac{1}{2}w^T w + C\sum_{n=1}^N\xi_n \\ \text{subject to} & y_n(w^T z_n + b) \geq 1 - \xi_n \\ & \xi_n \geq 0,\quad n=1,2,\ldots,N \end{array}\]At the optimum, each point's \(\xi_n\) falls into one of two cases:
\[\xi_n = \begin{cases} 1-y_n(w^T z_n + b), & \text{if }\, y_n(w^T z_n + b) < 1 \\ 0, & \text{if }\, y_n(w^T z_n + b) \geq 1 \end{cases}\]which means \(\xi_n = \max(1-y_n(w^T z_n + b), 0)\) so we can cast the constrained optimization problem into an unconstrained one as
\[\text{minimize } \frac{1}{2}w^T w + C\sum_{n=1}^N \max(1-y_n(w^T z_n + b), 0).\]The hinge error \(\max(1 - y \cdot s, 0)\) is an upper bound on \(\text{err}_{0/1}\); if we replace it with another upper bound on \(\text{err}_{0/1}\), namely \(\log_2(1 + \exp(-y s))\), we obtain L2-regularized logistic regression.
In the final SVM solution, \(w = \sum_n (\alpha_n y_n) z_n\). We now claim that for any L2-regularized linear model
\[\min_{w} \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N \text{err} (y_n, w^T z_n)\]the optimal solution has the form \(w_* = \sum_{n=1}^{N} \beta_n z_n\) (the representer theorem). Substituting this representation directly into
\[\min_{w} \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N \log(1 + \exp(-y_n w^T z_n))\]yields kernel logistic regression:
\[\min_{\beta} \frac{\lambda}{N} \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m) + \frac{1}{N} \sum_{n=1}^N \log\left(1 + \exp\left(- y_n \sum_{m=1}^N \beta_m K(x_m, x_n)\right)\right)\]which can be solved for \(\beta\) directly by SGD.
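A minimal numpy sketch of such an SGD solver (my own; the per-step gradient adds the regularizer gradient \((2\lambda/N)K\beta\) to the gradient of one randomly chosen logistic-loss term, and is written for clarity rather than efficiency):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_logreg_sgd(X, y, lam=0.1, gamma=1.0, eta=0.1, epochs=50, seed=0):
    # y is assumed to be in {-1, +1}
    rng = np.random.default_rng(seed)
    N = len(y)
    K = rbf_kernel(X, X, gamma)                 # K[m, n] = K(x_m, x_n)
    beta = np.zeros(N)
    for _ in range(epochs):
        for n in rng.permutation(N):
            s = K[:, n] @ beta                  # sum_m beta_m K(x_m, x_n)
            grad = (2.0 * lam / N) * (K @ beta) \
                   - y[n] * K[:, n] / (1.0 + np.exp(y[n] * s))
            beta -= eta * grad
    return beta      # predict a new x with sign(sum_m beta_m K(x_m, x))
```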
Support Vector Regression
For any L2-regularized linear model
\[\min_{w} \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N \text{err} (y_n, w^T z_n)\]the optimal solution has the form \(w_* = \sum_{n=1}^{N} \beta_n z_n\). For the squared error
\[\text{err}(y, w^T z) = (y - w^T z)^2\]substituting it into the objective gives
\[\min_{w} \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N (y_n - w^T z_n)^2\]and the problem of solving for \(w\) turns into solving for \(\beta\):
\[\min_{\beta} \frac{\lambda}{N} \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m) + \frac{1}{N}\sum_{n=1}^N\left(y_n - \sum_{m=1}^N\beta_m K(x_n, x_m) \right)^2 \\ = \frac{\lambda}{N} \beta^T K \beta + \frac{1}{N}(\beta^T K^T K \beta - 2 \beta^T K^T y + y^T y)\]Setting \(\nabla_{\beta} E = 0\) gives \(\beta = (\lambda I + K)^{-1} y\) (invertible for \(\lambda > 0\) since \(K\) is positive semi-definite).
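A minimal numpy sketch of this closed-form solution (kernel ridge regression); the helper names are my own:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=1.0, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    beta = np.linalg.solve(lam * np.eye(len(y)) + K, y)   # beta = (lambda I + K)^{-1} y
    return beta

def kernel_ridge_predict(X_train, X_test, beta, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ beta      # sum_m beta_m K(x_m, x)
```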
Tube Regression
\[\text{err}(y, s) = \max(0, |s-y| - \epsilon)\] \[\min_{\beta} \frac{\lambda}{N} \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m) + \frac{1}{N}\sum_{n=1}^N\max\left(0, \left|\sum_{m=1}^N \beta_m K(x_n, x_m) - y_n\right| - \epsilon\right)\]To mimic the standard SVM formulation instead, consider
\[\min_{b, w} \frac{1}{2} w^T w + C \sum_{n=1}^N \max(0, |w^T z_n + b - y_n| - \epsilon)\]which can be rewritten as an equivalent problem with linear constraints:
\[\begin{align} \min_{w, b, \xi} & \frac{1}{2} w^T w + C \sum_{n=1}^N (\xi_n^- + \xi_n^+) \\ & -\epsilon - \xi_n^- \leq y_n - w^T z_n - b \leq \epsilon + \xi^+_n \\ & \xi_n^- \geq 0, \xi_n^+ \geq 0 \end{align}\]Deriving the Lagrange dual with the KKT conditions gives
\[\begin{align} \min &\frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N(\alpha_n^+ - \alpha_n^-)(\alpha_m^+ - \alpha_m^-) k_{n,m}\\ &+ \sum_{n=1}^N ((\epsilon - y_n)\alpha_n^+ + (\epsilon + y_n) \alpha_n^-) \\ s.t. & \sum_{n=1}^N 1 \cdot (\alpha_n^+ - \alpha_n^-) = 0 \\ & 0 \leq \alpha_n^+ \leq C, 0 \leq \alpha_n^- \leq C \end{align}\]Solving the dual yields \(w = \sum_{n=1}^N (\alpha_n^+ - \alpha_n^-) z_n\), and complementary slackness gives
\[\begin{align} \alpha_n^+(\epsilon + \xi_n^+ - y_n + w^T z_n + b) = 0 \\ \alpha_n^-(\epsilon + \xi_n^- + y_n - w^T z_n - b) = 0 \end{align}\]For points strictly inside the tube, \(\alpha_n^+ = \alpha_n^- = 0\), hence \(\beta_n = \alpha_n^+ - \alpha_n^- = 0\): such points are not support vectors.
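In practice this dual is what scikit-learn's SVR solves; a small sketch (not from the notes) showing that only points on or outside the \(\epsilon\)-tube become support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, (100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0).fit(X, y)
# dual_coef_ holds alpha_n^+ - alpha_n^- for the support vectors only;
# points strictly inside the tube have both multipliers equal to zero
print("support vectors:", len(reg.support_), "of", len(X))
```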
Week-4
Blending and Bagging
Why might aggregation work:
- feature transform
- regularization
Uniform blending
\[G(x) = \text{sign}\left(\sum_{t=1}^T 1 \cdot g_t(x)\right)\]Linear Blending
\(G(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t \cdot g_t(x) \right)\) with \(\alpha_t \geq 0\)
Adaptive Boosting
error rate \(\epsilon_t = \frac{\sum_{n=1}^N u_n^{(t)} I\{y_n \neq g_t(x_n)\}}{\sum_{n=1}^N u_n^{(t)}}\)
Re-weighting rule: multiply the weights of incorrect examples by a factor \(\propto (1-\epsilon_t)\) and of correct examples by a factor \(\propto \epsilon_t\), so that after re-weighting the correct and incorrect examples carry equal total weight.
define scaling factor \(\blacklozenge_t = \sqrt{\frac{1-\epsilon_t}{\epsilon_t}}\)
\[\begin{align} \text{incorrect}(u_n) &\leftarrow \text{incorrect} \cdot \blacklozenge_t \\ \text{correct}(u_n) &\leftarrow \text{correct} \cdot 1/\blacklozenge_t \end{align}\]if \(\epsilon_t \leq 1/2\), \(\blacklozenge_t \geq 1\), scale up incorrect
\[\alpha_t = \ln(\blacklozenge_t)\] \[G(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t g_t(x) \right)\]When implementing AdaBoost with decision stumps, each round picks the \(g_t\) that minimizes
\[E_{\text{in}}^u(h) = \frac{1}{N} \sum_{n=1}^N u_n \cdot I\{y_n \neq h(x_n)\}\]where
\[h_{s,i,\theta} = s \cdot \mbox{sign}(x_i - \theta)\]Each round scans all dimensions (and thresholds) to pick the best \(h_{s,i,\theta}\). After each iteration the total weight \(\sum_{n=1}^N u_n^{(t)}\) decreases. Too large a \(T\) may cause overfitting.
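A compact numpy sketch of AdaBoost-Stump as described above (my own implementation; labels are assumed to be \(\pm 1\), and the stump search is the plain scan for clarity rather than speed):

```python
import numpy as np

def best_stump(X, y, u):
    # pick s, i, theta minimizing the u-weighted 0/1 error of s * sign(x_i - theta)
    N, d = X.shape
    best_err, best = np.inf, None
    for i in range(d):
        for theta in np.concatenate(([-np.inf], np.sort(X[:, i]))):
            pred = np.where(X[:, i] - theta > 0, 1, -1)
            for s in (+1, -1):
                err = u @ (s * pred != y)
                if err < best_err:
                    best_err, best = err, (s, i, theta)
    return best

def adaboost_stump(X, y, T=30):
    N = len(y)
    u = np.full(N, 1.0 / N)
    models, alphas = [], []
    for _ in range(T):
        s, i, theta = best_stump(X, y, u)
        pred = s * np.where(X[:, i] - theta > 0, 1, -1)
        eps = np.clip((u @ (pred != y)) / u.sum(), 1e-12, 1 - 1e-12)  # epsilon_t
        diamond = np.sqrt((1 - eps) / eps)              # scaling factor
        u = np.where(pred != y, u * diamond, u / diamond)
        models.append((s, i, theta))
        alphas.append(np.log(diamond))                  # alpha_t = ln(diamond_t)
    return models, alphas

def adaboost_predict(X, models, alphas):
    score = sum(a * s * np.where(X[:, i] - theta > 0, 1, -1)
                for a, (s, i, theta) in zip(alphas, models))
    return np.sign(score)
```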
Week-5
Decision Tree
Recursively build the tree
\[G(x) = \sum_{c=1}^C I\{b(x) = c\} G_c(x)\]where the branching criterion is chosen by the impurity of the subsets after the split.
Gini index
\[1 - \sum_{k=1}^K \left(\frac{\sum_{n=1}^N I\{y_n = k\}}{N} \right)^2 = 1 - \sum_{k=1}^K p_k^2\]\(p_k\) is the frequency of the category \(k\).
Choose the split with the smallest Gini index: the smaller the Gini index, the purer (less mixed) the resulting subsets. The Gini index and entropy behave similarly.
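A tiny numpy sketch of the Gini impurity (the entropy below can be computed the same way):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()        # p_k: frequency of class k in the subset
    return 1.0 - np.sum(p ** 2)      # 1 - sum_k p_k^2

print(gini([1, 1, 1, 1]), gini([1, 1, -1, -1]))   # 0.0 (pure) and 0.5 (most mixed, binary)
```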
Entropy
\[-\sum_{k=1}^K p_k \log p_k\]Random Forest
function RandomForest(D)
for t = 1:T
request size-M data D_t by bootstrapping with D
obtain tree g_t by DTree(D_t)
end
G = Uniform({g_t})
end
After \(N\) bootstrap draws, the probability that a given training pair \((x_n, y_n)\) is never selected is \((1 - {\frac 1 N})^N \approx \frac{1}{e}\), so each \(g_t\) has roughly \({\frac N e}\) out-of-bag (OOB) examples.
OOB examples are still training examples: if \(x_n\) was not used when training \(g_t\), then \(x_n\) is OOB for \(g_t\).
| | \(g_1\) | \(g_2\) | \(g_3\) | \(\ldots\) | \(g_T\) |
|---|---|---|---|---|---|
| \((x_1,y_1)\) | \(\tilde{D}_1\) | * | \(\tilde{D}_3\) | \(\ldots\) | \(\tilde{D}_T\) |
| \((x_2,y_2)\) | * | * | \(\tilde{D}_3\) | \(\ldots\) | \(\tilde{D}_T\) |
| \(\ldots\) | | | | | |
| \((x_N,y_N)\) | \(\tilde{D}_1\) | * | * | \(\ldots\) | * |

(\(\tilde{D}_t\): the example was sampled into \(g_t\)'s bootstrap set; *: the example is OOB for \(g_t\).)
\(G_n^-\) aggregates only the trees for which \(x_n\) is OOB; e.g. in the table \(G_N^-(x_N) = \mbox{average}(g_2, g_3, g_T)\), and the OOB error is \(E_\text{oob}(G) = \frac{1}{N} \sum_{n=1}^N \mbox{err}(y_n, G_n^-(x_n))\).
Feature selection: \(\mbox{importance}(i) = E_\text{oob}(G) - E_\text{oob}^{(p)}(G)\), \(E_\text{oob}^{(p)}\) comes from replacing each request of \(x_{n,i}\) by a permuted OOB value.
Assume both \(x_n\) and \(x_{n'}\) are OOB examples of \(g_t\). In \(E_\text{oob}\) we use \(g_t([x_{n,1}, \ldots, x_{n,i}, \ldots, x_{n,d}])\), but in \(E_\text{oob}^{(p)}\) we replace \(x_{n, i}\) with \(x_{n', i}\), i.e. we evaluate \(g_t([x_{n,1}, \ldots, x_{n',i}, \ldots, x_{n,d}])\).
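A sketch of the whole OOB machinery (my own, not course code; labels are assumed to be \(\pm 1\), and the permutation step simply shuffles the whole column of feature \(i\), a common simplification of the per-request OOB permutation described above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, T=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    trees, oob_masks = [], []
    for _ in range(T):
        idx = rng.integers(0, N, N)                     # bootstrap sample D_t
        oob = np.ones(N, dtype=bool); oob[idx] = False  # examples never drawn are OOB
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        oob_masks.append(oob)
    return trees, oob_masks

def oob_error(trees, oob_masks, X, y):
    votes = np.zeros(len(y))
    for tree, oob in zip(trees, oob_masks):
        votes[oob] += tree.predict(X[oob])              # only trees with x_n OOB may vote
    used = votes != 0
    return np.mean(np.sign(votes[used]) != y[used])     # E_oob(G)

def permutation_importance(trees, oob_masks, X, y, i, seed=0):
    rng = np.random.default_rng(seed)
    Xp = X.copy()
    Xp[:, i] = rng.permutation(X[:, i])                 # permute feature i
    # importance(i) = E_oob(G) - E_oob^{(p)}(G), following the definition above
    return oob_error(trees, oob_masks, X, y) - oob_error(trees, oob_masks, Xp, y)
```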
Week-6
Gradient Boost
\[u_n^{(t+1)} = \begin{cases} u_n^{(t)} \cdot \blacklozenge_t \quad \text{incorrect} \\ u_n^{(t)} / \blacklozenge_t \quad \text{correct} \end{cases}\] \[u_n^{(t+1)} = u_n^{(t)} \cdot \blacklozenge_t ^ {-y_n g_t(x_n)} = u_n^{(t)} \cdot \exp(-y_n \alpha_t g_t(x_n))\] \[u_n^{(T+1)} = u_n^{(1)} \cdot \prod_{t=1}^T \exp(-y_n \alpha_t g_t(x_n)) = \frac{1}{N} \cdot \exp\left(-y_n \sum_{t=1}^T \alpha_t g_t(x_n)\right)\]where \(\sum_{t=1}^T \alpha_t g_t(x_n)\) can be viewed as the voting score \(s_n\), and we want \(y_n \cdot s_n\) to be positive and large. AdaBoost can therefore be viewed as minimizing
\[\sum_{n=1}^N u_n ^ {(T+1)} = {\frac 1 N} \sum_{n=1}^N \exp\left(-y_n \sum_{t=1}^T \alpha_t g_t(x_n)\right)\]where \(\exp(-y\cdot s)\) is called the exponential loss (exp-loss).
Gradient Boosting for Arbitrary Error Function
AdaBoost
\[\min_\eta \min_h {\frac 1 N} \sum_{n=1}^N \exp\left(-y_n \left(\sum_{\tau = 1}^{t-1} \alpha_\tau g_\tau (x_n) + \eta h(x_n)\right)\right)\]GradientBoost
\[\min_\eta \min_h {\frac 1 N} \sum_{n=1}^N \text{err}\left(\sum_{\tau = 1}^{t-1} \alpha_\tau g_\tau (x_n) + \eta h(x_n), y_n\right)\]Here \(h\), viewed through its values \(h(x_n)\) as a point (vector) in a Hilbert space, is the descent direction and \(\eta\) is the step size.
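A minimal sketch (my own) of this template for squared-error regression, the common GBDT setting: each round fits a small regression tree \(h\) to the current residuals (the negative gradient) and then chooses \(\eta\) by a one-dimensional least-squares fit:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T=100, max_depth=2):
    s = np.zeros(len(y))                      # current scores sum_tau alpha_tau g_tau(x_n)
    models = []
    for _ in range(T):
        residual = y - s                      # negative gradient of the squared error
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        hx = h.predict(X)
        eta = (hx @ residual) / (hx @ hx)     # best step size by 1-D least squares
        s += eta * hx
        models.append((eta, h))
    return models

def gb_predict(X, models):
    return sum(eta * h.predict(X) for eta, h in models)
```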
Neural Network
Let the input layer be layer 0, and let \(w^{(i)}\) denote the weights connecting layer \(i-1\) to layer \(i\).
\[e_n = (y_n - s_1^{(L)})^2 = \left(y_n - \sum_{i=0}^{d^{(L-1)}} w_{i1}^{(L)} x_i^{(L-1)}\right)^2\]Define \(\delta^{(\ell)} = \frac{\partial e_n}{\partial s^{(\ell)}}\):
- For the output layer: \[\delta_1^{(L)} = -2(y_n - s_1^{(L)})\]
- For every other layer: \[\begin{align}\delta^{(\ell)}_j = \frac{\partial e_n}{\partial s_j^{(\ell)}} &= \sum_k \frac{\partial e_n}{\partial s_k^{(\ell+1)}} \frac{\partial s_k^{(\ell+1)}}{\partial x_j^{(\ell)}} \frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}} \\ &= \sum_k\left(\delta_k^{(\ell+1)}\right)\left(w_{jk}^{(\ell+1)}\right)\left(\tanh'(s_j^{(\ell)})\right)\end{align}\]
SGD:
- randomly pick \(n \in \{1,2,\ldots,N\}\)
- forward compute all \(x_i^{(\ell)}\) with \(x^{(0)}=x_n\)
- backward compute all \(\delta_j^{(\ell)}\)
- gradient descent \(w_{ij}^{(\ell)} \leftarrow w_{ij}^{(\ell)} - \eta x_i^{(\ell-1)} \delta_j^{(\ell)}\)
Sometimes steps 1-3 are done in parallel many times and the average of \(x_i^{(\ell-1)} \delta_j^{(\ell)}\) is used for the update in step 4 (mini-batch gradient descent).
Let \(W\) be the weight matrix connecting layer \(\ell\) to layer \(\ell + 1\), \(x\) the output of layer \(\ell\), and \(\delta^{(\ell+1)}\) the error of layer \(\ell+1\). Then the error of layer \(\ell\) can be written with matrix-vector operations as
\[\delta^{(\ell)} = W \cdot \delta^{(\ell+1)} \odot (1 - x^2)\]where \(\odot\) denotes element-wise multiplication and \(1 - x^2\) is \(\tanh'\) expressed through the layer output \(x\).
The gradient with respect to \(W\) is the vector outer product
\[x \cdot (\delta^{(\ell + 1)})^T\]which in numpy can be computed with np.outer().
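Putting the pieces together, a minimal numpy sketch (my own) of one SGD step for a tanh network with a single hidden layer and squared error; bias terms are omitted for brevity:

```python
import numpy as np

def sgd_step(W1, W2, x, y, eta=0.1):
    # W1: (d, d_hidden), W2: (d_hidden, 1); x: (d,) input, y: scalar target
    # forward pass
    s1 = W1.T @ x                     # hidden pre-activations s^(1)
    x1 = np.tanh(s1)                  # hidden outputs x^(1)
    s2 = W2.T @ x1                    # output score s^(2)
    # backward pass
    delta2 = -2.0 * (y - s2)                      # delta^(L) = -2 (y - s^(L))
    delta1 = (W2 @ delta2) * (1.0 - x1 ** 2)      # W delta^(l+1) ⊙ (1 - x^2)
    # gradient descent with outer-product gradients
    W2 -= eta * np.outer(x1, delta2)
    W1 -= eta * np.outer(x, delta1)
    return W1, W2
```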
Week-7
Deep Learning
Linear autoencoder or PCA
\[{\frac 1 N} \sum_{n=1}^N \|x_n - W W^T x_n \|^2\]- let \(\bar{x} = {\frac 1 N} \sum_{n=1}^N x_n\), and let \(x_n \leftarrow x_n - {\bar x}\)
- calculate \({\tilde d}\) top eigenvectors \(w_1, w_2, \ldots, w_{\tilde d}\) of \(X^T X\)
- return feature transform \(\Phi(x) = W (x - {\bar x})\)
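A minimal numpy sketch of the three steps above (here the top eigenvectors are stored as the columns of `W`, so the transform is \(W^T(x - \bar{x})\)):

```python
import numpy as np

def pca_fit(X, d_tilde):
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                               # x_n <- x_n - x_bar
    _, eigvec = np.linalg.eigh(Xc.T @ Xc)        # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :d_tilde]             # top d_tilde eigenvectors as columns
    return W, x_bar

def pca_transform(X, W, x_bar):
    return (X - x_bar) @ W                       # row-wise W^T (x - x_bar)
```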
Radial Basis Network
Radial Basis Function
\[\phi(x, \mu) = \exp(-\gamma \| x - \mu\|^2)\]Full RBF Network
\[h(x) = \mbox{Output}\left(\sum_{m=1}^M \beta_m \mbox{RBF}(x, \mu_m)\right)\]k-means
\[E_\text{in}(S_1, \ldots, S_M; \mu_1, \ldots, \mu_M) = \sum_{n=1}^N \sum_{m=1}^M I(x_n \in S_m)\|x_n - \mu_m \|^2\]Matrix Factorization
\[R \approx W^T V\]with squared error \(\sum_{(n,m)}(r_{nm} - w_m^T v_n)^2\) over the known ratings \(r_{nm}\) of user \(n\) for movie \(m\), where
\(v_n\) (of dimension \(\tilde{d} \times 1\)) is the topic vector of user \(n\), and \(w_m\) (of dimension \(\tilde{d} \times 1\)) is the topic vector of movie \(m\).
SGD for Matrix Factorization
- initialize \(\tilde{d}\) dimension vectors \(\{w_m\}, \{v_n\}\) randomly
- for t = 0, 1, …, T
- randomly pick \((n,m)\) within all known \(r_{nm}\)
- calculate residual \(\tilde{r}_{nm} = (r_{nm} - w_m^T v_n)\)
- \(v_n \leftarrow v_n + \eta \cdot \tilde{r}_{nm} w_m\)
- \(w_m \leftarrow w_m + \eta \cdot \tilde{r}_{nm} v_n\).
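A minimal numpy sketch of this loop (my own; `ratings` is assumed to be a list of known \((n, m, r_{nm})\) triples):

```python
import numpy as np

def mf_sgd(ratings, N_users, M_movies, d_tilde=10, eta=0.05, steps=100000, seed=0):
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((N_users, d_tilde))    # v_n: user topic vectors
    W = 0.1 * rng.standard_normal((M_movies, d_tilde))   # w_m: movie topic vectors
    for _ in range(steps):
        n, m, r = ratings[rng.integers(len(ratings))]    # pick a known r_nm
        vn, wm = V[n].copy(), W[m].copy()
        residual = r - wm @ vn                           # r_nm - w_m^T v_n
        V[n] += eta * residual * wm
        W[m] += eta * residual * vn
    return V, W
```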