Automatic Differentiation and Higher-Order Derivatives
Automatic differentiation and higher-order derivatives in machine learning.
The chain rule for vector-valued functions
Suppose we have a composed function \(f(\mathbf{x}) = f^2 \circ f^1 \circ f^0 (\mathbf x)\) that maps a vector to a scalar,
\[\mathbf{x} \overset{f^0}{\longrightarrow} \mathbf{h}^1 \overset{f^1}{\longrightarrow} \mathbf{h}^2 \overset{f^2}{\longrightarrow} y\]in which \(\mathbf x, \mathbf h^i \in \mathbb{R}^d\) and \(y \in \mathbb{R}\).
According to the chain rule, we have
\[\frac{\partial y}{\partial x_i} = \sum_{j, k} \frac{\partial y}{\partial h^2_j} \frac{\partial h^2_j}{\partial h^1_k} \frac{\partial h^1_k}{\partial x_i},\]which can be written in matrix form as
\[\frac{\partial y}{\partial \mathbf x} = \frac{\partial y}{\partial \mathbf h^2} \frac{\partial \mathbf h^2}{\partial \mathbf h^1} \frac{\partial \mathbf h^1}{\partial \mathbf x},\]in which \({\partial y}/{\partial \mathbf x}, {\partial y}/{\partial \mathbf h^2} \in \mathbb{R}^{1 \times d}\) and \({\partial \mathbf h^{l+1}}/{\partial \mathbf h^l} \in \mathbb{R}^{d \times d}\) are the Jacobian matrices of the corresponding functions. In most cases, the gradient of \(y\) w.r.t. \(\mathbf{x}\), denoted by \(\nabla_\mathbf{x} y\), is treated as a column vector. It can be obtained by transposing \({\partial y}/{\partial \mathbf x}\),
\[\begin{align} \nabla_\mathbf{x} y = \left(\frac{\partial y}{\partial \mathbf x}\right)^\top &= \left(\frac{\partial \mathbf h^1}{\partial \mathbf x}\right)^\top \left(\frac{\partial \mathbf h^2}{\partial \mathbf h^1}\right)^\top \left(\frac{\partial y}{\partial \mathbf h^2} \right)^\top \\ & = \left(\frac{\partial \mathbf h^1}{\partial \mathbf x}\right)^\top \left(\frac{\partial \mathbf h^2}{\partial \mathbf h^1}\right)^\top \nabla_{\mathbf{h}^2} y. \end{align}\]
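As a quick sanity check of this formula, here is a minimal JAX sketch with three made-up layers (the specific functions `f0`, `f1`, `f2` below are arbitrary choices, not anything defined above): it forms the Jacobians explicitly, multiplies their transposes against \(\nabla_{\mathbf h^2} y\), and compares the result with `jax.grad`.

```python
import jax
import jax.numpy as jnp

# Three arbitrary, made-up smooth maps: R^3 -> R^3 -> R^3 -> R.
f0 = lambda x:  jnp.tanh(x) + 0.5 * x            # h^1 = f^0(x)
f1 = lambda h1: jnp.sin(h1) * jnp.cos(h1[::-1])  # h^2 = f^1(h^1)
f2 = lambda h2: jnp.sum(h2 ** 2)                 # y   = f^2(h^2), a scalar

f = lambda x: f2(f1(f0(x)))

x  = jnp.array([0.3, -1.2, 2.0])
h1 = f0(x)
h2 = f1(h1)

J0 = jax.jacobian(f0)(x)    # dh^1/dx,   shape (3, 3)
J1 = jax.jacobian(f1)(h1)   # dh^2/dh^1, shape (3, 3)
g2 = jax.grad(f2)(h2)       # grad_{h^2} y, shape (3,)

# grad_x y = (dh^1/dx)^T (dh^2/dh^1)^T grad_{h^2} y
grad_manual = J0.T @ (J1.T @ g2)

print(jnp.allclose(grad_manual, jax.grad(f)(x)))  # True
```

The full \(3 \times 3\) Jacobians are materialized here only because the example is tiny; the point of reverse-mode automatic differentiation, discussed below, is to compute the same product without ever forming them.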
This result generalizes directly to \(f(\mathbf{x}) = f^L \circ f^{L-1} \circ \ldots \circ f^0 (\mathbf x)\) with
\[\frac{\partial y}{\partial \mathbf x} = \mathbf{1}^\top \frac{\partial y}{\partial \mathbf h^L} \frac{\partial \mathbf h^{L}}{\partial \mathbf h^{L-1}} \frac{\partial \mathbf h^{L-1}}{\partial \mathbf h^{L-2}}\ldots \frac{\partial \mathbf h^1}{\partial \mathbf x},\]where \(\mathbf 1^\top\) is a row vector of ones with the appropriate dimension (for a scalar \(y\) it is simply the scalar \(1\)). The same chain of Jacobians is also written with different notation in different parts of the literature (with \(\mathbf h^0 \equiv \mathbf x\)),
\[\begin{align} \partial f(\mathbf x) &= \mathbf{1}^\top \partial f^L(\mathbf h^{L})\, \partial f^{L-1}(\mathbf h^{L-1})\, \partial f^{L-2}(\mathbf h^{L-2})\, \ldots \partial f^0(\mathbf h^{0}) \quad\text{(used in machine learning)},\\ D[f(\mathbf x)] &= \mathbf{1}^\top D [f^L(\mathbf h^{L})]\, D [f^{L-1}(\mathbf h^{L-1})]\, D [f^{L-2}(\mathbf h^{L-2})]\, \ldots D [f^{0}(\mathbf h^{0})] \quad\text{(used in math)}. \end{align}\]
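The grouping used to evaluate this product is what makes reverse mode cheap for a scalar output. The sketch below (with an arbitrary, made-up stack of \(L\) layers) accumulates the \(1 \times d\) row vector from left to right, so that every step is a vector-matrix product, and contrasts it with forming the full \(d \times d\) Jacobian chain first; both groupings give the same \(\partial y / \partial \mathbf x\), only the cost differs.

```python
import jax
import jax.numpy as jnp

# A made-up stack of L smooth maps R^d -> R^d, followed by a scalar head y = f^L(h^L).
L, d = 4, 3
layers = [lambda h, s=s: jnp.tanh(h) + 0.1 * (s + 1) * h[::-1] for s in range(L)]
head = lambda h: jnp.sum(jnp.exp(-h))

x = jnp.array([0.5, -0.3, 1.7])

# Forward pass: record every h^l, then the d x d Jacobians dh^{l+1}/dh^l.
hs = [x]
for layer in layers:
    hs.append(layer(hs[-1]))
jacs = [jax.jacobian(layer)(h) for layer, h in zip(layers, hs)]
row = jax.grad(head)(hs[-1])[None, :]   # the 1 x d factor "1^T dy/dh^L"

# Left-to-right grouping: every step is a cheap (1 x d) @ (d x d) product.
acc = row
for J in reversed(jacs):                # dh^L/dh^{L-1}, ..., dh^1/dx
    acc = acc @ J                       # acc is now the row vector dy/dh^l one layer earlier

# Right-to-left grouping: builds the full d x d Jacobian chain before using the row vector.
chain = jnp.eye(d)
for J in jacs:
    chain = J @ chain

print(jnp.allclose(acc, row @ chain))   # True: both orderings equal dy/dx
```

For dense \(d \times d\) Jacobians the left-to-right pass costs \(O(Ld^2)\) in vector-matrix products, while the right-to-left one pays \(O(Ld^3)\) for matrix-matrix products.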
Vector-Jacobian product
The form \(\mathbf{v}^\top \partial f(\mathbf x)\) is known as the vector-Jacobian product, or VJP, and is the workhorse of reverse-mode automatic differentiation. Starting with \(\mathbf 1^\top\) and applying VJPs successively from left to right, we obtain \({\partial y} / {\partial \mathbf h^l}\) for \(l=L, L-1, \ldots, 0\). This row vector is known as the adjoint. The adjoint serves as the key bridge for computing the gradient of \(y\) w.r.t. the parameters of \(f^{l-1}(\mathbf h^{l-1} ; \boldsymbol \theta_{l-1})\) as
\[\frac{\partial y}{\partial \boldsymbol \theta_{l-1}} = \frac{\partial y}{\partial \mathbf h^{l}} \frac{\partial f^{l-1}(\mathbf h^{l-1})}{\partial \boldsymbol \theta_{l-1}}.\]The VJP is very efficient for wide, short Jacobian matrices (many inputs, few outputs), which arise frequently in machine learning. In fact, mainstream AD tools such as PyTorch and JAX compute the gradients of machine learning models via VJPs under the hood.
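To make the adjoint recursion and the parameter-gradient formula concrete, here is a minimal JAX sketch built around `jax.vjp`. The layer `f_layer` (an affine map followed by `tanh`) and its parameters are made-up stand-ins; the pattern of seeding with \(1\), pulling the adjoint back layer by layer, and reading off \(\partial y / \partial \boldsymbol\theta_{l-1}\) along the way follows the structure described above.

```python
import jax
import jax.numpy as jnp

# A made-up parameterized layer h^{l+1} = f^l(h^l; theta_l): an affine map plus tanh.
def f_layer(h, theta):
    W, b = theta
    return jnp.tanh(W @ h + b)

def head(h):                        # the final map h^L -> y (a scalar)
    return jnp.sum(h ** 2)

d = 4
keys = jax.random.split(jax.random.PRNGKey(0), 3)
thetas = [(jax.random.normal(k, (d, d)), jnp.zeros(d)) for k in keys]
x = jnp.arange(1.0, d + 1.0)

# Forward pass, keeping the pullback (VJP closure) of every layer.
h, pullbacks = x, []
for theta in thetas:
    h, vjp_fn = jax.vjp(f_layer, h, theta)
    pullbacks.append(vjp_fn)
y, vjp_head = jax.vjp(head, h)

# Backward pass: seed with 1 and propagate the adjoint dy/dh^l from right to left.
(adjoint,) = vjp_head(jnp.ones_like(y))
theta_grads = []
for vjp_fn in reversed(pullbacks):
    adjoint, theta_grad = vjp_fn(adjoint)   # one VJP yields both dy/dh^{l-1} and dy/dtheta_{l-1}
    theta_grads.append(theta_grad)
theta_grads.reverse()

# Cross-check against jax.grad applied to the whole composition.
def full(x, thetas):
    h = x
    for theta in thetas:
        h = f_layer(h, theta)
    return head(h)

gx, gthetas = jax.grad(full, argnums=(0, 1))(x, thetas)
print(jnp.allclose(adjoint, gx))    # True: the final adjoint is dy/dx
print(jax.tree_util.tree_all(
    jax.tree_util.tree_map(jnp.allclose, theta_grads, gthetas)))  # True: per-layer dy/dtheta match
```

Each `vjp_fn` call multiplies the current adjoint by one Jacobian without ever materializing it, which is exactly the wide-short efficiency argument above; the frameworks mentioned earlier implement the same data flow inside their backward passes.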