# Matrix Calculus


## Inner product notation

We will use $$\langle \cdot, \cdot \rangle$$ to denote the inner product of vectors or the Frobenius inner product of matrices, and use $$\odot$$ to denote the Hadamard (element-wise) product.

For any $$X, Y, Z \in \mathbb{R}^{m \times n}$$ and $$a \in \mathbb{R}$$:

• $$\langle X, Y \rangle = \langle Y, X \rangle$$.
• $$\langle a X, Y \rangle = \langle X, a Y \rangle = a \langle X, Y \rangle$$.
• $$\langle X + Z, Y \rangle = \langle X, Y \rangle + \langle Z, Y \rangle$$.
• $$\langle X, Y \odot Z \rangle = \langle X \odot Y, Z \rangle$$.

Suppose $$A \in \mathbb{R}^{m \times \ell_1}, C \in \mathbb{R}^{\ell_1 \times n}, B \in \mathbb{R}^{m \times \ell_2}, D \in \mathbb{R}^{\ell_2 \times n}$$, then we have

\begin{align} \langle AC, BD \rangle &= \langle B^\top AC, D \rangle = \langle C, A^\top BD \rangle \\ &= \langle A C D^\top, B \rangle = \langle A, B D C^\top \rangle \end{align}

Proof. The first two equalities are immediate from the definition, and the last two use the fact that $$\operatorname{tr}(XY) = \operatorname{tr}(YX)$$ holds for any two matrices $$X$$, $$Y$$ such that $$X^\top$$ has the same size as $$Y$$.

In particular, when the matrices $$C, D$$ degenerate to vectors $$\mathbf{c} \in \mathbb{R}^{\ell_1}, \mathbf{d} \in \mathbb{R}^{\ell_2}$$, we have

$\langle A \mathbf{c}, B \mathbf{d} \rangle = \langle B^\top A \mathbf{c}, \mathbf{d} \rangle = \langle \mathbf{c}, A^\top B \mathbf{d} \rangle,$ $\langle A \mathbf{c}, B \mathbf{d} \rangle = \langle A \mathbf{c} \mathbf{d}^\top, B \rangle.$

Note that the second identity shows that the vector inner product and the Frobenius inner product are compatible.
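These identities are easy to verify numerically. A minimal NumPy sketch (the helper `frob`, the shapes, and the seed are my own choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def frob(X, Y):
    """Frobenius inner product <X, Y> = tr(X^T Y) = sum of elementwise products."""
    return float(np.sum(X * Y))

m, l1, l2, n = 3, 4, 5, 6
A = rng.standard_normal((m, l1))
C = rng.standard_normal((l1, n))
B = rng.standard_normal((m, l2))
D = rng.standard_normal((l2, n))

lhs = frob(A @ C, B @ D)
# <AC, BD> = <B^T A C, D> = <C, A^T B D> = <A C D^T, B> = <A, B D C^T>
vals = [
    frob(B.T @ A @ C, D),
    frob(C, A.T @ B @ D),
    frob(A @ C @ D.T, B),
    frob(A, B @ D @ C.T),
]
print(all(np.isclose(v, lhs) for v in vals))  # True
```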

## The scalar function

Let us denote $$f = f(\mathbf{x}) \in \mathbb{R}$$ or $$f = f({X}) \in \mathbb{R}$$.

First, let us consider the case where the input is a vector $$\mathbf{x} \in \mathbb{R}^n$$:

$df = \sum_{i=1}^n \frac{\partial f}{\partial x_i} dx_i = \left \langle \nabla_{\mathbf{x}}f, d \mathbf{x} \right \rangle.$

When the input is a matrix $${X} \in \mathbb{R}^{m \times n}$$:

$df = \sum_{i=1}^m \sum_{j=1}^n \frac{\partial f}{\partial X_{ij}} d X_{ij} = \left \langle \nabla_X f, d {X} \right\rangle.$

### The matrix differentiation rules

• $$d ({X} \pm {Y}) = d {X} \pm d{Y}$$, $$d({X} {Y}) = (d {X}) {Y} + {X} d{Y}$$.
• $$d ({X}^\top) = (d {X})^\top$$, transpose.
• $$d \operatorname{tr}({X}) = \operatorname{tr}(d {X})$$, trace.
• $$d {X} ^ {-1} = - {X}^{-1} (d {X}) {X}^{-1}$$, which can be obtained by differentiating $${X} {X}^{-1} = I$$.
• $$d \vert{X}\vert = \operatorname{tr} (\operatorname{adj}(X)\, d {X})$$; when $${X}$$ is invertible, $$d \vert{X}\vert = \vert{X}\vert\operatorname{tr} ({X}^{-1} d {X})$$.
• $$d({X} \odot {Y})=(d {X}) \odot {Y}+{X} \odot d {Y}$$, Hadamard product.
• $$d \sigma({X}) = \sigma ' ({X}) \odot d {X}$$, where $$\sigma(\cdot)$$ is an element-wise function such as $$\operatorname{sigmoid}$$, $$\sin$$, etc.

where $$\operatorname{adj}(X)$$ denotes the adjugate (classical adjoint) matrix of $${X}$$.
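The determinant and inverse rules can be sanity-checked with central finite differences; a small NumPy sketch (sizes, seed, and tolerances are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned, invertible
dX = rng.standard_normal((n, n))
eps = 1e-6

# d|X| = |X| tr(X^{-1} dX)  (X invertible)
num_det = (np.linalg.det(X + eps * dX) - np.linalg.det(X - eps * dX)) / (2 * eps)
ana_det = np.linalg.det(X) * np.trace(np.linalg.solve(X, dX))

# d X^{-1} = -X^{-1} (dX) X^{-1}
Xinv = np.linalg.inv(X)
num_inv = (np.linalg.inv(X + eps * dX) - np.linalg.inv(X - eps * dX)) / (2 * eps)
ana_inv = -Xinv @ dX @ Xinv

print(np.isclose(num_det, ana_det, rtol=1e-4),
      np.allclose(num_inv, ana_inv, atol=1e-4))
```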

### Trace Trick

• $$a = \operatorname{tr} (a)$$ for $$a \in \mathbb{R}$$.
• $$\operatorname{tr}(A^\top) = \operatorname{tr}(A)$$.
• $$\operatorname{tr} (A \pm B) = \operatorname{tr}(A) \pm \operatorname{tr}(B)$$.
• $$\operatorname{tr} (AB) = \operatorname{tr}(BA)$$ where $$A$$ and $$B^\top$$ have the same size.
• $$\operatorname{tr}(A^\top (B \odot C)) = \operatorname{tr}((A \odot B)^\top C)$$, i.e. $$\langle A, B \odot C \rangle = \langle A \odot B, C \rangle$$, where $$A, B, C$$ all have the same size.
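The Hadamard identity in the last bullet is worth a quick numerical check, since it carries most of the weight in the examples below (shapes and seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
A, B, C = (rng.standard_normal((3, 5)) for _ in range(3))

lhs = np.trace(A.T @ (B * C))   # tr(A^T (B ⊙ C))
rhs = np.trace((A * B).T @ C)   # tr((A ⊙ B)^T C)
print(np.isclose(lhs, rhs))  # True
```

Both sides equal $$\sum_{ij} A_{ij} B_{ij} C_{ij}$$, which is why the three matrices may be permuted freely across the Hadamard product.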

In the following examples, we use the matrix differentiation rules and the trace trick to write the differential in the form

$df = \left \langle \nabla_X f, d {X} \right\rangle.$

### Examples

#### Example 1

$$f(X) = \mathbf{a}^\top X \mathbf{b}$$.

Using differentiation rules

\begin{align} d f &= \mathbf{a}^\top d (X \mathbf{b}) = \mathbf{a}^\top (d X) \mathbf{b}. \end{align}

Using the trace trick

$df = \operatorname{tr} (\mathbf{a}^\top (d X) \mathbf{b}) = \operatorname{tr} (\mathbf{b} \mathbf{a}^\top d X) = \langle \mathbf{a} \mathbf{b}^\top, d X\rangle.$

Hence,

$\nabla_X f = \mathbf{a} \mathbf{b}^\top.$
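This gradient is easy to verify with a finite-difference check; a minimal NumPy sketch (dimensions, seed, and tolerance are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 5
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

f = lambda X: a @ X @ b          # f(X) = a^T X b
grad = np.outer(a, b)            # claimed gradient: a b^T

dX = rng.standard_normal((m, n))
eps = 1e-6
num = (f(X + eps * dX) - f(X - eps * dX)) / (2 * eps)
ana = np.sum(grad * dX)          # <grad, dX>
print(np.isclose(num, ana, rtol=1e-5))
```

Since $$f$$ is linear in $$X$$, the central difference agrees with the analytic directional derivative up to floating-point rounding.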

#### Example 2

$$f(X) = \mathbf{a}^\top \exp(X \mathbf{b})$$.

Using the differentiation rules

$d f = \mathbf{a}^\top d \exp (X \mathbf{b}) = \mathbf{a}^\top \left(\exp (X \mathbf{b}) \odot d( X \mathbf{b})\right) = \mathbf{a}^\top \left(\exp (X \mathbf{b}) \odot \left((dX) \mathbf{b}\right)\right).$

Using the trace trick

\begin{align} d f &= \operatorname{tr}\left(\mathbf{a}^\top \left(\exp (X \mathbf{b}) \odot \left((dX) \mathbf{b}\right)\right)\right) = \operatorname{tr}\left((\mathbf{a} \odot \exp (X \mathbf{b}))^\top (dX) \mathbf{b}\right) = \operatorname{tr}\left(\mathbf{b} (\mathbf{a} \odot \exp (X \mathbf{b}))^\top dX\right)\\ &= \left\langle (\mathbf{a} \odot \exp (X \mathbf{b})) \mathbf{b}^\top, dX \right\rangle. \end{align}

Hence, $$\nabla_X f = (\mathbf{a} \odot \exp (X \mathbf{b})) \mathbf{b}^\top.$$
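A finite-difference sketch of this gradient in NumPy (dimensions, seed, and tolerance are assumptions of mine; `np.exp` is element-wise, matching the $$\exp$$ in the text):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 5
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

f = lambda X: a @ np.exp(X @ b)              # f(X) = a^T exp(X b), exp elementwise
grad = np.outer(a * np.exp(X @ b), b)        # claimed: (a ⊙ exp(Xb)) b^T

dX = rng.standard_normal((m, n))
eps = 1e-6
num = (f(X + eps * dX) - f(X - eps * dX)) / (2 * eps)
ana = np.sum(grad * dX)                      # <grad, dX>
print(np.isclose(num, ana, rtol=1e-4))
```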

#### Example 3

$$f(X) = \langle Y, MY \rangle$$, $$Y = \sigma(WX)$$.

Using the differentiation rules and trace trick,

\begin{align} d f &= \langle d Y, M Y \rangle + \langle Y, M d Y \rangle \\ &= \langle M Y, d Y \rangle + \langle M^\top Y, d Y \rangle \\ &= \langle (M + M^\top) Y, dY \rangle \\ &= \langle (M + M^\top) Y, \sigma'(WX) \odot (W d X) \rangle \\ &= \langle (M + M^\top) Y \odot \sigma'(WX), W d X \rangle \\ &= \left\langle W^\top \left((M + M^\top) Y \odot \sigma'(WX)\right), d X \right\rangle \\ &= \left\langle W^\top \left((M + M^\top) \sigma(WX) \odot \sigma'(WX)\right), d X \right\rangle \end{align}

Hence,

$\nabla_X f = W^\top \left( (M^\top + M) \sigma(WX) \odot \sigma'(WX)\right).$
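To check this result concretely we can pick a specific element-wise function, say $$\sigma = \tanh$$ with $$\sigma'(z) = 1 - \tanh^2(z)$$ (my choice, as are the shapes and seed), and run a finite-difference test:

```python
import numpy as np

rng = np.random.default_rng(5)
k, m, n = 3, 4, 5
W = rng.standard_normal((k, m))
M = rng.standard_normal((k, k))
X = rng.standard_normal((m, n))

def f(X):
    Y = np.tanh(W @ X)                       # Y = σ(WX), σ = tanh
    return np.sum(Y * (M @ Y))               # <Y, MY>

Y = np.tanh(W @ X)
grad = W.T @ (((M + M.T) @ Y) * (1 - Y**2))  # σ'(z) = 1 - tanh(z)^2

dX = rng.standard_normal((m, n))
eps = 1e-6
num = (f(X + eps * dX) - f(X - eps * dX)) / (2 * eps)
ana = np.sum(grad * dX)
print(np.isclose(num, ana, rtol=1e-4))
```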

#### Example 4

$$f(\mathbf{w}) = \langle X \mathbf{w} - \mathbf{y}, X \mathbf{w} - \mathbf{y} \rangle$$.

Using the differentiation rules,

\begin{align} d f &= \langle d (X \mathbf{w} - \mathbf{y}), X \mathbf{w} - \mathbf{y} \rangle + \langle X \mathbf{w} - \mathbf{y}, d(X \mathbf{w} - \mathbf{y}) \rangle \\ &= 2 \langle X \mathbf{w} - \mathbf{y}, d(X \mathbf{w} - \mathbf{y}) \rangle \\ &= 2 \langle X \mathbf{w} - \mathbf{y}, X d \mathbf{w} \rangle = 2 \langle X^\top (X \mathbf{w} - \mathbf{y}), d \mathbf{w} \rangle. \end{align}

Hence,

$\nabla_{\mathbf{w}} f = 2 X^\top (X \mathbf{w} - \mathbf{y}).$
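This is the familiar least-squares gradient; a short NumPy check (sizes and seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 8, 3
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
w = rng.standard_normal(d)

f = lambda w: np.sum((X @ w - y) ** 2)   # <Xw - y, Xw - y>
grad = 2 * X.T @ (X @ w - y)

dw = rng.standard_normal(d)
eps = 1e-6
num = (f(w + eps * dw) - f(w - eps * dw)) / (2 * eps)
ana = grad @ dw
print(np.isclose(num, ana, rtol=1e-5))
```

Because $$f$$ is quadratic in $$\mathbf{w}$$, the central difference matches the analytic value up to rounding.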

#### Example 5

$$f(\Sigma) = \log \vert\Sigma\vert + \frac{1}{N} \sum_{i=1}^N\left\langle\mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle$$, where $$\Sigma$$ is a symmetric positive definite (covariance) matrix.

Consider the first term

$d \log \vert\Sigma\vert = \vert\Sigma\vert^{-1} d \vert\Sigma\vert = \operatorname{tr}(\Sigma^{-1} d \Sigma) = \langle \Sigma^{-1}, d \Sigma\rangle,$ where the last step uses the symmetry of $$\Sigma$$.

Consider the second term

\begin{align} d\frac{1}{N} \sum_{i=1}^N\left\langle\mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle &= \frac{1}{N}\sum_{i=1}^N \left\langle\mathbf{x}_i - \boldsymbol{\mu}, (d\Sigma^{-1}) (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle \\ &= -\frac{1}{N} \sum_{i=1}^N\left\langle\mathbf{x}_i - \boldsymbol{\mu}, \Sigma^{-1} (d\Sigma) \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right\rangle \\ &= -\frac{1}{N} \sum_{i=1}^N\left\langle (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top\Sigma^{-1}, \Sigma^{-1} d\Sigma \right\rangle \\ &= -\frac{1}{N} \sum_{i=1}^N \left\langle \Sigma^{-1} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top\Sigma^{-1}, d\Sigma \right\rangle, \end{align}

where the second equality uses $$d \Sigma^{-1} = -\Sigma^{-1} (d\Sigma) \Sigma^{-1}$$ and the last two use the symmetry of $$\Sigma$$.

Let $$S = \frac{1}{N}\sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top$$; then

$d f = \left\langle\Sigma^{-1} - \Sigma^{-1} S \Sigma^{-1}, d \Sigma \right\rangle.$

Thus,

$\nabla_{\Sigma} f = \Sigma^{-1} - \Sigma^{-1} S \Sigma^{-1},$ with no transpose needed since $$\Sigma^{-1}$$ and $$\Sigma^{-1} S \Sigma^{-1}$$ are symmetric.
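A numerical sketch of this gradient, restricting the perturbation direction to symmetric matrices so that $$\Sigma + \varepsilon\, d\Sigma$$ stays symmetric (the data, sizes, and seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(7)
d, N = 3, 50
Xs = rng.standard_normal((N, d))
mu = Xs.mean(axis=0)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)           # symmetric positive definite

S = (Xs - mu).T @ (Xs - mu) / N           # S = (1/N) Σ (x_i - μ)(x_i - μ)^T

def f(Sigma):
    Sinv = np.linalg.inv(Sigma)
    quad = np.mean([(x - mu) @ Sinv @ (x - mu) for x in Xs])
    return np.log(np.linalg.det(Sigma)) + quad

Sinv = np.linalg.inv(Sigma)
grad = Sinv - Sinv @ S @ Sinv             # claimed gradient

D = rng.standard_normal((d, d))
dS = D + D.T                              # symmetric perturbation direction
eps = 1e-6
num = (f(Sigma + eps * dS) - f(Sigma - eps * dS)) / (2 * eps)
ana = np.sum(grad * dS)
print(np.isclose(num, ana, rtol=1e-4))
```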

#### Example 6

$$f(\mathbf{w}) = - \langle \mathbf{y}, \log \operatorname{softmax}(X \mathbf{w}) \rangle$$, the cross entropy loss with prediction distribution $$\operatorname{softmax}(X \mathbf{w})$$ and ground truth distribution $$\mathbf{y}$$.

\begin{align*} f(\mathbf{w}) &= - \langle \mathbf{y}, \log \operatorname{softmax}(X \mathbf{w}) \rangle \\ &= -\left\langle \mathbf{y}, \log \frac{\exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle} \right\rangle \\ &= -\left\langle \mathbf{y}, X \mathbf{w} - \mathbf{1}\log \langle\mathbf{1}, \exp (X \mathbf{w}) \rangle \right\rangle \\ &= -\left\langle \mathbf{y}, X \mathbf{w} \right\rangle + \log \langle\mathbf{1}, \exp (X \mathbf{w}) \rangle \left\langle \mathbf{y}, \mathbf{1}\right\rangle \\ &= -\left\langle \mathbf{y}, X \mathbf{w} \right\rangle + \log \langle\mathbf{1}, \exp (X \mathbf{w}) \rangle, \end{align*}

note that $$\left\langle \mathbf{y}, \mathbf{1}\right\rangle = 1$$.

\begin{align*} d f &= -\left\langle \mathbf{y}, X d \mathbf{w} \right\rangle + \frac{d \langle\mathbf{1}, \exp (X \mathbf{w}) \rangle}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle} \\ &= -\left\langle \mathbf{y}, X d \mathbf{w} \right\rangle + \left\langle \frac{\mathbf{1}}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle}, \exp (X \mathbf{w}) \odot d (X\mathbf{w}) \right\rangle \\ &= -\left\langle \mathbf{y}, X d \mathbf{w} \right\rangle + \left\langle \frac{\mathbf{1} \odot \exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle}, d (X \mathbf{w}) \right\rangle \\ &= -\left\langle \mathbf{y}, X d \mathbf{w} \right\rangle + \left\langle \frac{\exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle}, Xd \mathbf{w} \right\rangle \\ &= -\left\langle X^\top \left(\mathbf{y} - \frac{\exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle} \right), d \mathbf{w} \right\rangle \end{align*}

Hence,

$\nabla_{\mathbf{w}} f = X^\top \frac{\exp (X \mathbf{w})}{\langle\mathbf{1}, \exp (X \mathbf{w}) \rangle} - X^\top \mathbf{y}.$
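A finite-difference check of the cross-entropy gradient (the data, a max-subtracted softmax for numerical stability, and the seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(8)
N, d = 5, 3
X = rng.standard_normal((N, d))
w = rng.standard_normal(d)
y = rng.random(N); y /= y.sum()          # ground-truth distribution, <y, 1> = 1

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

f = lambda w: -y @ np.log(softmax(X @ w))
grad = X.T @ softmax(X @ w) - X.T @ y    # claimed gradient

dw = rng.standard_normal(d)
eps = 1e-6
num = (f(w + eps * dw) - f(w - eps * dw)) / (2 * eps)
ana = grad @ dw
print(np.isclose(num, ana, rtol=1e-4))
```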

#### Example 7

Let $$L = f(Y), Y = WX$$ where $$f$$ maps a matrix to a scalar,

\begin{align} d L = d f(Y) &= \left\langle \nabla_Y f, d Y \right\rangle \\ &= \left\langle \nabla_Y f, d (W X) \right\rangle \\ &= \left\langle \nabla_Y f, W dX + (dW)X\right\rangle \qquad (\text{using product rule of differentiation})\\ &= \left\langle \nabla_Y f, W dX \right\rangle + \left\langle \nabla_Y f, (dW)X \right\rangle \end{align}
• $$W$$ is constant, then $$d L = \left\langle \nabla_Y f, W dX \right\rangle = \left\langle W^\top \nabla_Y f, dX \right\rangle$$ thus $$\frac{\partial L}{\partial X} = W^\top \nabla_Y f$$.
• $$X$$ is constant, then $$d L = \left\langle \nabla_Y f, (dW)X \right\rangle = \left\langle \nabla_Y f X^\top, dW \right\rangle$$ thus $$\frac{\partial L}{\partial W} = \nabla_Y f X^\top$$.
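Both rules can be checked at once by fixing a concrete scalar function, e.g. $$f(Y) = \frac{1}{2}\lVert Y \rVert_F^2$$ so that $$\nabla_Y f = Y$$ (my choice of $$f$$, shapes, and seed):

```python
import numpy as np

rng = np.random.default_rng(9)
p, m, n = 3, 4, 5
W = rng.standard_normal((p, m))
X = rng.standard_normal((m, n))

L = lambda W, X: 0.5 * np.sum((W @ X) ** 2)  # f(Y) = ||Y||_F^2 / 2, ∇_Y f = Y
G = W @ X                                     # ∇_Y f evaluated at Y = WX
grad_X = W.T @ G                              # claimed ∂L/∂X = W^T ∇_Y f
grad_W = G @ X.T                              # claimed ∂L/∂W = ∇_Y f X^T

eps = 1e-6
dX = rng.standard_normal((m, n))
dW = rng.standard_normal((p, m))
num_X = (L(W, X + eps * dX) - L(W, X - eps * dX)) / (2 * eps)
num_W = (L(W + eps * dW, X) - L(W - eps * dW, X)) / (2 * eps)
print(np.isclose(num_X, np.sum(grad_X * dX), rtol=1e-5),
      np.isclose(num_W, np.sum(grad_W * dW), rtol=1e-5))
```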

## The non-scalar function

We wish to compute $$\frac{\partial F}{\partial X}$$, where $$F \in \mathbb{R}^{p \times q}$$ and $$X \in \mathbb{R}^{m \times n}$$. First consider a vector-valued function $$\boldsymbol{f}(\mathbf{x}) \in \mathbb{R}^p$$ with $$\mathbf{x} \in \mathbb{R}^{m}$$; the derivative of $$\boldsymbol{f}$$ w.r.t. $$\mathbf{x}$$ can be defined as

$\frac{\partial \boldsymbol{f}}{\partial \mathbf{x}}=\left[\begin{array}{cccc}{\frac{\partial f_{1}}{\partial x_{1}}} & {\frac{\partial f_{2}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{p}}{\partial x_{1}}} \\ {\frac{\partial f_{1}}{\partial x_{2}}} & {\frac{\partial f_{2}}{\partial x_{2}}} & {\cdots} & {\frac{\partial f_{p}}{\partial x_{2}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial f_{1}}{\partial x_{m}}} & {\frac{\partial f_{2}}{\partial x_{m}}} & {\cdots} & {\frac{\partial f_{p}}{\partial x_{m}}}\end{array}\right] \in \mathbb{R}^{m \times p}$

which is the transpose of the Jacobian matrix of $$\boldsymbol{f}$$ w.r.t. $$\mathbf{x}$$. Then we have

$d \boldsymbol{f} = \left(\frac{\partial \boldsymbol{f}}{\partial \mathbf{x}}\right)^\top d \mathbf{x} \in \mathbb{R}^p.$

Let us define the vectorization operator as $$\lambda(X) = [X_{11}, X_{12}, \ldots, X_{mn}]^\top \in \mathbb{R}^{mn}$$. Then

$\frac{\partial F}{\partial X} = \frac{\partial \lambda(F)}{\partial \lambda(X)} \in \mathbb{R}^{mn \times pq},$

and

$\lambda (d F) = \left(\frac{\partial F}{\partial X}\right)^\top \lambda(dX).$
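A small numerical sketch of these definitions, with $$F(X) = XX$$ and all sizes my own choices: build $$\frac{\partial F}{\partial X}$$ row by row with finite differences (row $$i$$ holds $$\partial \lambda(F)/\partial \lambda(X)_i$$), then confirm that $$\left(\frac{\partial F}{\partial X}\right)^\top \lambda(dX)$$ reproduces $$\lambda(dF)$$ for a random direction. Note that NumPy's default row-major `ravel` matches the vectorization $$\lambda$$ defined above.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 3
X = rng.standard_normal((n, n))
F = lambda X: X @ X                            # F(X) = XX, so p = q = m = n

lam = lambda M: M.ravel()                      # row-major vectorization λ
eps = 1e-6

# build ∂F/∂X ∈ R^{n² × n²}: row i is the derivative of λ(F) w.r.t. λ(X)_i
J = np.zeros((n * n, n * n))
for i in range(n * n):
    E = np.zeros(n * n); E[i] = 1.0
    E = E.reshape(n, n)                        # i-th coordinate direction
    J[i] = lam((F(X + eps * E) - F(X - eps * E)) / (2 * eps))

dX = rng.standard_normal((n, n))
num = lam((F(X + eps * dX) - F(X - eps * dX)) / (2 * eps))  # λ(dF) numerically
ana = J.T @ lam(dX)                            # (∂F/∂X)^T λ(dX)
print(np.allclose(num, ana, atol=1e-4))
```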

The content of this post is largely borrowed from this blog.