Matrix Calculus
The matrix derivative of a scalar function and its applications in machine learning.
A PDF version is also available.
Inner product notation
We will use \(\langle \cdot, \cdot \rangle\) to denote the inner product for vectors or the Frobenius inner product for matrices, and use \(\odot\) to denote the Hadamard product. Recall that the Frobenius inner product of matrices is defined as
\[\langle X, Y\rangle=\operatorname{tr}\left(X^{\top} Y\right)=\sum_{i=1}^m \sum_{j=1}^n X_{i j} Y_{i j}.\]For any \(X \in \mathbb{R}^{m \times n}, Y \in \mathbb{R}^{m \times n}, Z \in \mathbb{R}^{m \times n}, a \in \mathbb{R}\),
- \(\langle X, Y \rangle = \langle Y, X \rangle\).
- \(\langle a X, Y \rangle = \langle X, a Y \rangle = a \langle X, Y \rangle\).
- \(\langle X + Z, Y \rangle = \langle X, Y \rangle + \langle Z, Y \rangle\).
- \(\langle X, Y \odot Z \rangle = \langle X \odot Y, Z \rangle\).
Suppose \(A \in \mathbb{R}^{m \times \ell_1}, C \in \mathbb{R}^{\ell_1 \times n}, B \in \mathbb{R}^{m \times \ell_2}, D \in \mathbb{R}^{\ell_2 \times n}\), then we have
\[\begin{align} \langle AC, BD \rangle &= \langle B^\top AC, D \rangle = \langle C, A^\top BD \rangle \\ &= \langle A C D^\top, B \rangle = \langle A, B D C^\top \rangle \end{align}\]Proof. The first two equations follow directly from the definition, and the last two use the fact that \(\operatorname{tr}(XY) = \operatorname{tr}(YX)\) holds for any two matrices \(X\) and \(Y\) such that \(X^\top\) has the same size as \(Y\).
In particular, when the matrices \(C, D\) reduce to vectors \(\mathbf{c} \in \mathbb{R}^{\ell_1}, \mathbf{d} \in \mathbb{R}^{\ell_2}\), we have
\[\langle A \mathbf{c}, B \mathbf{d} \rangle = \langle B^\top A \mathbf{c}, \mathbf{d} \rangle = \langle \mathbf{c}, A^\top B \mathbf{d} \rangle,\] \[\langle A \mathbf{c}, B \mathbf{d} \rangle = \langle A \mathbf{c} \mathbf{d}^\top, B \rangle.\]Note that the second identity shows that the vector inner product and the Frobenius inner product are compatible.
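These identities are easy to check numerically. Here is a minimal NumPy sketch (the helper `frob` and the random sizes are our own choices, used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def frob(X, Y):
    """Frobenius inner product <X, Y> = tr(X^T Y)."""
    return np.trace(X.T @ Y)

m, l1, l2, n = 3, 4, 5, 6
A = rng.standard_normal((m, l1))
C = rng.standard_normal((l1, n))
B = rng.standard_normal((m, l2))
D = rng.standard_normal((l2, n))

lhs = frob(A @ C, B @ D)
assert np.allclose(lhs, frob(B.T @ A @ C, D))
assert np.allclose(lhs, frob(C, A.T @ B @ D))
assert np.allclose(lhs, frob(A @ C @ D.T, B))
assert np.allclose(lhs, frob(A, B @ D @ C.T))
```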
The scalar function
Let us denote \(f = f(\mathbf{x}) \in \mathbb{R}\) or \(f = f({X}) \in \mathbb{R}\).
First, let us consider the case where the input is a vector \(\mathbf{x} \in \mathbb{R}^n\),
\[df = \sum_{i=1}^n \frac{\partial f}{\partial x_i} dx_i = \left \langle \nabla_{\mathbf{x}}f, d \mathbf{x} \right \rangle.\]When the input is a matrix \({X} \in \mathbb{R}^{m \times n}\),
\[df = \sum_{i=1}^m \sum_{j=1}^n \frac{\partial f}{\partial X_{ij}} d X_{ij} = \left \langle \nabla_X f, d {X} \right\rangle.\]
The matrix differentiation rules
- \(d ({X} \pm {Y}) = d {X} \pm d{Y}\), \(d({X} {Y}) = (d {X}) {Y} + {X} d{Y}\).
- \(d ({X}^\top) = (d {X})^\top\), transpose.
- \(d \operatorname{tr}({X}) = \operatorname{tr}(d {X})\), trace.
- \(d {X} ^ {-1} = - {X}^{-1} (d {X}) {X}^{-1}\), which can be obtained by differentiating \({X} {X}^{-1} = I\).
- \(d \operatorname{det}(X) = \operatorname{tr} (\operatorname{adj}(X) d {X}) = \langle \operatorname{adj}(X)^\top, d {X} \rangle\), where \(\operatorname{adj}(X)\) is the adjugate matrix of \({X}\).
- \(d \operatorname{det}(X) = \operatorname{det}(X) \operatorname{tr} ({X}^{-1} d {X})\) if \({X}\) is invertible (see the numerical check below).
- \(d({X} \odot {Y})=(d {X}) \odot {Y}+{X} \odot d {Y}\), Hadamard product.
- \(d \sigma({X}) = \sigma ' ({X}) \odot d {X}\), where \(\sigma(\cdot)\) is an element-wise function such as \(\operatorname{sigmoid}\), \(\sin\), etc.
Recall that the adjugate matrix \(\operatorname{adj}(X)\) of a matrix \(X \in \mathbb{R}^{n \times n}\) is defined as the transpose of its cofactor matrix \(C\), \(\operatorname{adj}(X) = C^\top\). The cofactor matrix is \(C = [c_{ij}]_{n \times n}\), where \(c_{ij} = (-1)^{i+j} m_{ij}\) is the cofactor of \(X_{ij}\) and \(m_{ij}\) is the determinant of the matrix obtained by deleting row \(i\) and column \(j\) of \(X\). We have the equalities
\[X \operatorname{adj}(X) = \operatorname{det}(X) I\]for the square matrix \(X\) and
\[X^{-1} = \operatorname{adj}(X) / \operatorname{det}(X)\]if \(X\) is invertible.
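As a quick sanity check of the rule \(d \operatorname{det}(X) = \operatorname{det}(X) \operatorname{tr}(X^{-1} d X)\), here is a small NumPy sketch comparing a finite difference of the determinant with its first-order approximation (the size, seed, and perturbation scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # shift the diagonal so X is safely invertible
dX = 1e-6 * rng.standard_normal((n, n))           # a small perturbation

finite_diff = np.linalg.det(X + dX) - np.linalg.det(X)
first_order = np.linalg.det(X) * np.trace(np.linalg.inv(X) @ dX)
assert np.allclose(finite_diff, first_order, rtol=1e-3)
```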
Trace Trick
- \(a = \operatorname{tr} (a)\) for \(a \in \mathbb{R}\).
- \(\operatorname{tr}(A^\top) = \operatorname{tr}(A)\).
- \(\operatorname{tr} (A \pm B) = \operatorname{tr}(A) \pm \operatorname{tr}(B)\).
- \(\operatorname{tr} (AB) = \operatorname{tr}(BA)\) where \(A\) and \(B^\top\) have the same size.
- \(\operatorname{tr}(A^\top (B \odot C)) = \operatorname{tr}((A \odot B)^\top C)\) or \(\langle A, B \odot C \rangle = \langle A \odot B, C \rangle\), where \(A\), \(B\), \(C\) all have the same size.
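The last identity is the one used most often below. A quick NumPy check (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
A, B, C = (rng.standard_normal((3, 5)) for _ in range(3))

lhs = np.trace(A.T @ (B * C))   # <A, B ⊙ C>
rhs = np.trace((A * B).T @ C)   # <A ⊙ B, C>
assert np.allclose(lhs, rhs)
```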
In the following examples, we use the matrix differentiation rules and the trace trick to write the differential as
\[df = \left \langle \nabla_X f, d {X} \right\rangle.\]
Examples
Example 1
\(f(X) = \mathbf{a}^\top X \mathbf{b}\).
Using the differentiation rules,
\[\begin{align} d f &= \mathbf{a}^\top d (X \mathbf{b}) = \mathbf{a}^\top (d X) \mathbf{b}. \end{align}\]Using the trace trick
\[df = \operatorname{tr} (\mathbf{a}^\top (d X) \mathbf{b}) = \operatorname{tr} (\mathbf{b} \mathbf{a}^\top d X) = \langle \mathbf{a} \mathbf{b}^\top, d X\rangle.\]Hence,
\[\nabla_X f = \mathbf{a} \mathbf{b}^\top.\]
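As a sanity check, the gradient can be verified with a directional (central) finite difference, matching the identity \(df = \langle \nabla_X f, dX \rangle\). A minimal NumPy sketch, with sizes of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 3
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

f = lambda X: a @ X @ b      # f(X) = a^T X b
grad = np.outer(a, b)        # claimed gradient a b^T

dX = 1e-6 * rng.standard_normal((m, n))
df_numeric = (f(X + dX) - f(X - dX)) / 2            # central-difference approximation of df
assert np.allclose(df_numeric, np.sum(grad * dX))   # df = <grad, dX>
```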
Example 2
\(f(X) = \mathbf{a}^\top \exp(X \mathbf{b})\).
Using the differentiation rules
\[d f = \mathbf{a}^\top d \exp (X \mathbf{b}) = \mathbf{a}^\top \left( \exp (X \mathbf{b}) \odot d( X \mathbf{b}) \right) = \mathbf{a}^\top \left( \exp (X \mathbf{b}) \odot (d X) \mathbf{b} \right).\]Using the trace trick
\[\begin{align} d f &= \operatorname{tr}\left(\mathbf{a}^\top \left( \exp (X \mathbf{b}) \odot (d X) \mathbf{b} \right)\right) = \operatorname{tr}\left((\mathbf{a} \odot \exp (X \mathbf{b}))^\top (d X) \mathbf{b}\right) = \operatorname{tr}\left(\mathbf{b} (\mathbf{a} \odot \exp (X \mathbf{b}))^\top dX\right)\\ &= \left\langle (\mathbf{a} \odot \exp (X \mathbf{b})) \mathbf{b}^\top, dX \right\rangle. \end{align}\]Hence, \(\nabla_X f = (\mathbf{a} \odot \exp (X \mathbf{b})) \mathbf{b}^\top.\)
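The same directional finite-difference test as in Example 1 confirms this gradient (again a small NumPy sketch with arbitrary sizes; `exp` is applied element-wise):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 3
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

f = lambda X: a @ np.exp(X @ b)          # f(X) = a^T exp(Xb)
grad = np.outer(a * np.exp(X @ b), b)    # claimed gradient (a ⊙ exp(Xb)) b^T

dX = 1e-6 * rng.standard_normal((m, n))
assert np.allclose((f(X + dX) - f(X - dX)) / 2, np.sum(grad * dX))
```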
Example 3
\(f(X) = \langle Y, MY \rangle\), \(Y = \sigma(WX)\).
Using the differentiation rules and trace trick,
\[\begin{align} d f &= \langle d Y, M Y \rangle + \langle Y, M d Y \rangle \\ &= \langle M Y, d Y \rangle + \langle M^\top Y, d Y \rangle \\ &= \langle (M + M^\top) Y, dY \rangle \\ &= \langle (M + M^\top) Y, \sigma'(WX) \odot (W d X) \rangle \\ &= \langle (M + M^\top) Y \odot \sigma'(WX), W d X \rangle \\ &= \left\langle W^\top \left((M + M^\top) Y \odot \sigma'(WX)\right), d X \right\rangle \\ &= \left\langle W^\top \left((M + M^\top) \sigma(WX) \odot \sigma'(WX)\right), d X \right\rangle \end{align}\]Hence,
\[\nabla_X f = W^\top \left( (M^\top + M) \sigma(WX) \odot \sigma'(WX)\right).\]
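A numerical check, taking \(\sigma = \tanh\) (so \(\sigma'(x) = 1 - \tanh^2(x)\)) as a concrete element-wise function; the shapes and the directional finite-difference test are our own choices:

```python
import numpy as np

rng = np.random.default_rng(5)
p, m, n = 3, 4, 5
W = rng.standard_normal((p, m))
M = rng.standard_normal((p, p))
X = rng.standard_normal((m, n))

def f(X):
    Y = np.tanh(W @ X)               # Y = sigma(WX) with sigma = tanh
    return np.trace(Y.T @ (M @ Y))   # <Y, MY>

Y = np.tanh(W @ X)
grad = W.T @ (((M + M.T) @ Y) * (1 - Y ** 2))   # claimed gradient

dX = 1e-6 * rng.standard_normal((m, n))
assert np.allclose((f(X + dX) - f(X - dX)) / 2, np.sum(grad * dX))
```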
Example 4
\(f(\mathbf{w}) = \langle X \mathbf{w} - \mathbf{y}, X \mathbf{w} - \mathbf{y} \rangle\).
Using the differentiation rules,
\[\begin{align} d f &= \langle d (X \mathbf{w} - \mathbf{y}), X \mathbf{w} - \mathbf{y} \rangle + \langle X \mathbf{w} - \mathbf{y}, d(X \mathbf{w} - \mathbf{y}) \rangle \\ &= 2 \langle X \mathbf{w} - \mathbf{y}, d(X \mathbf{w} - \mathbf{y}) \rangle \\ &= 2 \langle X \mathbf{w} - \mathbf{y}, X d \mathbf{w} \rangle = 2 \langle X^\top (X \mathbf{w} - \mathbf{y}), d \mathbf{w} \rangle. \end{align}\]Hence,
\[\nabla_{\mathbf{w}} f = 2 X^\top (X \mathbf{w} - \mathbf{y}).\]
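This is the familiar least-squares gradient. A short NumPy confirmation with random data (sizes of our own choosing; the central difference is exact here because \(f\) is quadratic):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 8, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

f = lambda w: np.sum((X @ w - y) ** 2)   # f(w) = <Xw - y, Xw - y>
grad = 2 * X.T @ (X @ w - y)             # claimed gradient

dw = 1e-6 * rng.standard_normal(d)
assert np.allclose((f(w + dw) - f(w - dw)) / 2, grad @ dw)
```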
Example 5
\(f(\Sigma) = \log \vert\Sigma\vert + \frac{1}{N} \sum_{i=1}^N\left\langle\mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle\).
Consider the first term
\[d \log \vert\Sigma\vert = \vert\Sigma\vert^{-1} d \vert\Sigma\vert = \operatorname{tr}(\Sigma^{-1} d \Sigma) = \langle \Sigma^{-1}, d \Sigma\rangle,\]where the last equality uses the symmetry of \(\Sigma\).
Consider the second term
\[\begin{align} d\frac{1}{N} \sum_{i=1}^N\left\langle\mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle &= \frac{1}{N}\sum_{i=1}^N \left\langle\mathbf{x}_i - \boldsymbol{\mu}, (d\Sigma^{-1}) (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle \\ &= -\frac{1}{N} \sum_{i=1}^N\left\langle\mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (d\Sigma) \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle \\ &= -\frac{1}{N} \sum_{i=1}^N\left\langle (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top\Sigma^{-1}, \Sigma^{-1} d\Sigma \right\rangle \\ &= -\frac{1}{N} \sum_{i=1}^N \left\langle \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top\Sigma^{-1}, d\Sigma \right\rangle, \end{align}\]where we used \(d \Sigma^{-1} = -\Sigma^{-1} (d\Sigma) \Sigma^{-1}\) and the symmetry of \(\Sigma\). Let \(S = \frac{1}{N}\sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top\); then
\[d f = \left\langle\Sigma^{-1} - \Sigma^{-1} S \Sigma^{-1}, d \Sigma \right\rangle.\]Thus,
\[\nabla_{\Sigma} f = (\Sigma^{-1} - \Sigma^{-1} S \Sigma^{-1})^\top = \Sigma^{-1} - \Sigma^{-1} S \Sigma^{-1},\]since \(\Sigma\) and \(S\) are symmetric.
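A numerical check of this gradient for a symmetric positive definite \(\Sigma\) (the random data, the dimensions, and the choice \(\boldsymbol{\mu} = \bar{\mathbf{x}}\) are our own; `slogdet` is used for \(\log\vert\Sigma\vert\)):

```python
import numpy as np

rng = np.random.default_rng(7)
d, N = 3, 50
data = rng.standard_normal((N, d))   # rows are the samples x_i
mu = data.mean(axis=0)               # here we take mu to be the sample mean
S = (data - mu).T @ (data - mu) / N  # S as defined above

A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)      # a symmetric positive definite Sigma

def f(Sigma):
    Sinv = np.linalg.inv(Sigma)
    diffs = data - mu
    quad = np.einsum('ij,jk,ik->i', diffs, Sinv, diffs)  # (x_i - mu)^T Sigma^{-1} (x_i - mu)
    return np.linalg.slogdet(Sigma)[1] + quad.mean()     # log|Sigma| + average quadratic form

Sinv = np.linalg.inv(Sigma)
grad = Sinv - Sinv @ S @ Sinv        # claimed gradient

dS = 1e-6 * rng.standard_normal((d, d))
assert np.allclose((f(Sigma + dS) - f(Sigma - dS)) / 2, np.sum(grad * dS))
```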
Example 6
\(f(\mathbf{w}) = - \langle \mathbf{y}, \log \operatorname{softmax}(X \mathbf{w}) \rangle\), the cross-entropy loss with prediction distribution \(\operatorname{softmax}(X \mathbf{w})\) and ground-truth distribution \(\mathbf{y}\).
\[\begin{align*} f(\mathbf{w}) &= - \langle \mathbf{y}, \log \operatorname{softmax}(X \mathbf{w}) \rangle \\ &= -\left\langle \mathbf{y}, \log \frac{\exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle} \right\rangle \\ &= -\left\langle \mathbf{y}, X \mathbf{w} - \mathbf{1}\log \langle\mathbf{1}, \exp (X \mathbf{w}) \rangle \right\rangle \\ &= -\left\langle \mathbf{y}, X \mathbf{w} \right\rangle + \log \langle\mathbf{1}, \exp (X \mathbf{w}) \rangle \left\langle \mathbf{y}, \mathbf{1}\right\rangle \\ &= -\left\langle \mathbf{y}, X \mathbf{w} \right\rangle + \log \langle\mathbf{1}, \exp (X \mathbf{w}) \rangle, \end{align*}\]note that \(\left\langle \mathbf{y}, \mathbf{1}\right\rangle = 1\).
\[\begin{align*} d f &= -\left\langle \mathbf{y}, X d \mathbf{w} \right\rangle + \frac{d \langle\mathbf{1}, \exp (X \mathbf{w}) \rangle}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle} \\ &= -\left\langle \mathbf{y}, X d \mathbf{w} \right\rangle + \left\langle \frac{\mathbf{1}}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle}, \exp (X \mathbf{w}) \odot d (X\mathbf{w}) \right\rangle \\ &= -\left\langle \mathbf{y}, X d \mathbf{w} \right\rangle + \left\langle \frac{\mathbf{1} \odot \exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle}, d (X \mathbf{w}) \right\rangle \\ &= -\left\langle \mathbf{y}, X d \mathbf{w} \right\rangle + \left\langle \frac{\exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle}, Xd \mathbf{w} \right\rangle \\ &= -\left\langle \mathbf{y} - \frac{\exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle}, X d \mathbf{w} \right\rangle \\ &= -\left\langle X^\top \left(\mathbf{y} - \frac{\exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle} \right), d \mathbf{w} \right\rangle. \end{align*}\]Hence,
\[\nabla_{\mathbf{w}} f = X^\top \frac{\exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle} - X^\top \mathbf{y}.\]
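Equivalently, \(\nabla_{\mathbf{w}} f = X^\top \left(\operatorname{softmax}(X \mathbf{w}) - \mathbf{y}\right)\). A small NumPy check with a one-hot \(\mathbf{y}\) and random \(X\) (sizes of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 6, 4
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = np.zeros(n)
y[2] = 1.0                                        # a one-hot ground-truth distribution

def f(w):
    z = X @ w
    return -y @ (z - np.log(np.sum(np.exp(z))))   # -<y, log softmax(Xw)>

p = np.exp(X @ w) / np.sum(np.exp(X @ w))         # softmax(Xw)
grad = X.T @ (p - y)                              # claimed gradient

dw = 1e-6 * rng.standard_normal(d)
assert np.allclose((f(w + dw) - f(w - dw)) / 2, grad @ dw)
```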
Example 7
Let \(L = f(Y)\), \(Y = WX\), where \(f\) maps a matrix to a scalar,
\[\begin{align} d L = d f(Y) &= \left\langle \nabla_Y f, d Y \right\rangle \\ &= \left\langle \nabla_Y f, d (W X) \right\rangle \\ &= \left\langle \nabla_Y f, W dX + (dW)X\right\rangle \qquad (\text{using the product rule of differentiation})\\ &= \left\langle \nabla_Y f, W dX \right\rangle + \left\langle \nabla_Y f, (dW)X \right\rangle. \end{align}\]
- If \(W\) is constant, then \(d L = \left\langle \nabla_Y f, W dX \right\rangle = \left\langle W^\top \nabla_Y f, dX \right\rangle\), thus \(\frac{\partial L}{\partial X} = W^\top \nabla_Y f\).
- If \(X\) is constant, then \(d L = \left\langle \nabla_Y f, (dW)X \right\rangle = \left\langle \nabla_Y f X^\top, dW \right\rangle\), thus \(\frac{\partial L}{\partial W} = \nabla_Y f X^\top\).
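Both formulas can be checked numerically once a concrete \(f\) is fixed. The sketch below uses \(f(Y) = \tfrac{1}{2} \Vert Y \Vert_F^2\), so that \(\nabla_Y f = Y\); this choice of \(f\), the shapes, and the directional finite-difference test are our own:

```python
import numpy as np

rng = np.random.default_rng(9)
p, m, n = 3, 4, 5
W = rng.standard_normal((p, m))
X = rng.standard_normal((m, n))

L = lambda W, X: 0.5 * np.sum((W @ X) ** 2)   # f(Y) = ||Y||_F^2 / 2, so grad_Y f = Y
Y = W @ X
dL_dX = W.T @ Y                               # claimed dL/dX = W^T grad_Y f
dL_dW = Y @ X.T                               # claimed dL/dW = grad_Y f X^T

dX = 1e-6 * rng.standard_normal((m, n))
dW = 1e-6 * rng.standard_normal((p, m))
assert np.allclose((L(W, X + dX) - L(W, X - dX)) / 2, np.sum(dL_dX * dX))
assert np.allclose((L(W + dW, X) - L(W - dW, X)) / 2, np.sum(dL_dW * dW))
```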
The non-scalar function
We wish to compute the derivative \(\frac{\partial F}{\partial X}\), where \(F \in \mathbb{R}^{p \times q}\) and \(X \in \mathbb{R}^{m \times n}\). First consider a vector-valued function \(\boldsymbol{f}(\mathbf{x}) \in \mathbb{R}^p\) with \(\mathbf{x} \in \mathbb{R}^{m}\); the derivative of \(\boldsymbol{f}\) w.r.t. \(\mathbf{x}\) can be defined as
\[\frac{\partial \boldsymbol{f}}{\partial \mathbf{x}}=\left[\begin{array}{cccc}{\frac{\partial f_{1}}{\partial x_{1}}} & {\frac{\partial f_{2}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{p}}{\partial x_{1}}} \\ {\frac{\partial f_{1}}{\partial x_{2}}} & {\frac{\partial f_{2}}{\partial x_{2}}} & {\cdots} & {\frac{\partial f_{p}}{\partial x_{2}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial f_{1}}{\partial x_{m}}} & {\frac{\partial f_{2}}{\partial x_{m}}} & {\cdots} & {\frac{\partial f_{p}}{\partial x_{m}}}\end{array}\right] \in \mathbb{R}^{m \times p}\]which is the transpose of the Jacobian matrix of \(\boldsymbol{f}\) w.r.t \(\mathbf{x}\). Then we have
\[d \boldsymbol{f} = \left( \frac{\partial \boldsymbol{f}}{\partial \mathbf{x}} \right)^{\top} d \mathbf{x} \in \mathbb{R}^p.\]Let us define the vectorization operator as \(\lambda(X) = [X_{11}, X_{12}, \ldots, X_{mn}]^\top \in \mathbb{R}^{mn}\). Then
\[\begin{align} \frac{\partial F}{\partial X} &= \frac{\partial \lambda(F)}{\partial \lambda(X)} \in \mathbb{R}^{mn \times pq}, \\ \end{align}\]and
\[\lambda (d F) = \left( \frac{\partial F}{\partial X} \right)^{\top} \lambda(dX).\]
The content of this post is largely borrowed from this blog.