Soft-DTW: a Differentiable Loss Function for Time Series@ICML17.

## Background

DTW (Dynamic Time Warping) computes the best possible alignment between two time series of respective length $$n$$ and $$m$$ by computing first the $$n \times m$$ pairwise distance matrix between these points, then finding the best alignment using a dynamic program. DTW is robust to shifts and dilatations across the time dimension.

Notations. Time series $$\mathbf { x } = \left( x _ { 1 } , \dots , x _ { n } \right) \in \mathbb { R } ^ { p \times n }$$, $$\mathbf { y } = \left( y _ { 1 } , \dots , y _ { m } \right) \in \mathbb { R } ^ { p \times m }$$. Differentiable discrepency function $$\delta : \mathbb { R } ^ { p } \times \mathbb { R } ^ { p } \rightarrow \mathbb { R } _ { + }$$, in most cases, will be the quadratic Euclidean distance between two vectors. The cost matrix $$\Delta ( \mathbf { x } , \mathbf { y } ) : = \left[ \delta \left( x _ { i } , y _ { j } \right) \right] _ { i j } \in \mathbb { R } ^ { n \times m }$$. $$\mathcal { A } _ { n , m } \subset \{ 0,1 \} ^ { n \times m }$$ binary alignment matrices, for any alignment matrix $$A \in \mathcal{A}_{n,m}$$ there is a path that connects the upper-left $$(1,1)$$ matrix entry to the lower-right $$(n, m)$$.

Example of alignment matrix $$A$$ and cost matrix $$\Delta ( \mathbf { x } , \mathbf { y } )$$.

Figure DTW Alignment

## DTW and soft-DTW loss functions

Given the cost matrix $$\Delta ( \mathbf { x } , \mathbf { y } ) : = \left[ \delta \left( x _ { i } , y _ { j } \right) \right] _ { i j } \in \mathbb { R } ^ { n \times m }$$, the Frobenius inner product $$\langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle$$ for $$A \in \mathcal { A } _ { n , m }$$ gives the score of $$A$$. Both DTW and GAK consider all possible alignment matrices,

\begin{aligned} \operatorname { DTW } ( \mathbf { x } , \mathbf { y } ) & : = \min _ { A \in \mathcal { A } _ { n , m } } \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle \\ k _ { \mathrm { GA } } ^ { \gamma } ( \mathbf { x } , \mathbf { y } ) & := \sum _ { A \in \mathcal { A } _ { n , m } } \exp\{ { - \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle / \gamma } \}. \end{aligned}

Consider the following generalized $$\min$$ operator, with a smoothing parameter $$\gamma \geq 0$$:

$\textstyle \min ^ { \gamma } \left\{ a _ { 1 } , \ldots , a _ { n } \right\} : = \left\{ \begin{array} { l l } { \min _ { i \leq n } a _ { i } , } & { \gamma = 0 }, \\ { - \gamma \log \sum _ { i = 1 } ^ { n } e ^ { - a _ { i } / \gamma } , } & { \gamma > 0 }. \end{array} \right.$

With the operator, we can define $$\gamma$$-soft-DTW:

$\textstyle \operatorname { dtw } _ { \gamma } ( \mathbf { x } , \mathbf { y } ) : = \min ^ { \gamma } \left\{ \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle , A \in \mathcal { A } _ { n , m } \right\}.$
• The original DTW is recovered by setting $$\gamma = 0$$.
• When $$\gamma > 0$$, we can recover $$\operatorname { dtw } _ { \gamma } = -\gamma \log k _ { \mathrm { GA } } ^ { \gamma }$$.

### Differentiation of soft-DTW

The case $$\gamma=0$$.

$\nabla _ { \mathbf { x } } \mathbf { d } \mathbf { t } \mathbf { w } _ { 0 } ( \mathbf { x } , \mathbf { y } ) = \left(\frac { \partial \Delta ( \mathbf { x }, \mathbf { y } ) } { \partial \mathbf { x } }\right) ^\top A ^ { \star }$

$${ \partial \Delta ( \mathbf { x }, \mathbf { y } ) }/ { \partial \mathbf { x }}$$ is the Jacobian of $$\Delta(\mathbf{x}, \mathbf{y})$$ w.r.t $$\mathbf{x}$$, a linear map from $$\mathbb{R}^{p \times n}$$ to $$\mathbb{R}^{n \times m}$$. When $$\delta(\cdot, \cdot)$$ is the squared Euclidean distance, the transpose of the Jacobian applied to a matrix $$B$$ is

$( \partial \Delta ( \mathbf { x } , \mathbf { y } ) / \partial \mathbf { x } ) ^ { \top } B = 2 \left( \left( \mathbf { 1 } _ { p } \mathbf { 1 } _ { m } ^ { T } B ^ { T } \right) \circ \mathbf { x } - \mathbf { y } B ^ { T } \right) \in \mathbb{R}^{p \times n}.$

Remark: $$( \partial \Delta ( \mathbf { x } , \mathbf { y } ) / \partial \mathbf { x } ) ^ { \top }$$ is then a linear map from $$\mathbb{R}^{n \times m}$$ to $$\mathbb{R}^{p \times n}$$. In the usual case, $$f: \mathbb{R}^n \rightarrow \mathbb{R}^m$$ is a (suppose) differentiable function that takes $$\mathbf{x} \in \mathbb{R}^n$$ and produces $$f(\mathbf{x}) \in \mathbb{R}^m$$. Then the Jacobian matrix $$\partial f /\partial \mathbf{x} \in \mathbb{R}^{m \times n}$$ defines a linear map $$\mathbb{R}^n \rightarrow \mathbb{R}^m$$, the best linear approximation of the function $$f$$ near the point $$\mathbf{x}$$. Hence, the Jacobian $$\partial f /\partial \mathbf{x}$$ can be considered as the generalization of derivative.

The case $$\gamma > 0$$.

$\nabla _ { \mathbf { x } } \mathbf { d } \mathbf { t } \mathbf { w } _ { \gamma } ( \mathbf { x } , \mathbf { y } ) = \left( \frac { \partial \Delta ( \mathbf { x } , \mathbf { y } ) } { \partial \mathbf { x } }\right)^\top \mathbb{E}_{p _ { \gamma }(A)} [A],$

where

$\mathbb { E } _ { p _ { \gamma }(A) } [ A ] : = \frac { \sum _ { A \in \mathcal { A } _ { n , m } } \exp\{ { - \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle / \gamma } \} A} { k _ { \mathrm { GA } } ^ { \gamma } ( \mathbf { x } , \mathbf { y } ) }$

is the expectation of alignment matrix $$A$$ under the Gibbs distribution $$p _ { \gamma }(A) \propto \exp \{ { - \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle / \gamma } \}$$.

### Algorithmic differentiation

A Bellman recursion to compute the expectation $$\mathbb { E } _ { p _ { \gamma }(A) } [ A ]$$ has a complexity $$O(n^2 m^2)$$. The paper proposes to apply the chain rule in reverse order of Bellman’s recursion in Algorithm 1 to obtain an $$O(nm)$$ algorithm.

As $$\operatorname { dtw } _ { \gamma } ( \mathbf { x } , \mathbf { y } )$$ equals to $$r_{n,m}$$ in the Bellman recusion, thus

$\nabla _ { \mathbf { x } } \operatorname { dtw } _ { \gamma } ( \mathbf { x } , \mathbf { y } ) = \frac{\partial r_{n,m}}{\partial \mathbf{x}}.$

The the derivative of loss $$r_{n,m}$$ w.r.t $$\mathbf{x}$$ can be computed by chain rule

$\frac{\partial r_{n,m}}{\partial \mathbf{x}} =\left(\frac{\partial \Delta (\mathbf { x } , \mathbf { y } )}{\partial \mathbf{x}} \right)^\top \frac{\partial r_{n,m}}{\partial \Delta(\mathbf { x } , \mathbf { y } )},$

by defining $$e_{ij} := \frac { \partial r _ { n , m } } { \partial r _ { i , j } }$$ we have

$\frac{\partial r_{n,m}}{\partial \delta_{ij}} = \frac { \partial r _ { n , m } } { \partial r _ { i , j } } \frac { \partial r _ { i , j } } { \partial \delta _ { i , j } } = e _ { i , j } \cdot 1 = e_{ij}$

hence

$\nabla _ { \mathbf { x } } \operatorname { dtw } _ { \gamma } ( \mathbf { x } , \mathbf { y } ) = \left( \frac { \partial \Delta ( \mathbf { x } , \mathbf { y } ) } { \partial \mathbf { x } } \right) ^ { \top } E,$

from which we can see $$E$$ is exactly the expectation $$\mathbb { E } _ { p _ { \gamma }(A) } [ A ]$$ by comparing with the gradient in the case $$\gamma > 0$$.

Now, we turn to compute $$e_{ij}$$ using the chain rule

$\underbrace { \frac { \partial r _ { n , m } } { \partial r _ { i , j } } } _ { e _ { i , j } } = \underbrace { \frac { \partial r _ { n , m } } { \partial r _ { i + 1 , j } } } _ { e _ { i + 1 , j } } \color{green}{\frac { \partial r _ { i + 1 , j } } { \partial r _ { i , j } }} + \underbrace { \frac { \partial r _ { n , m } } { \partial r _ { i , j + 1 } } } _ { e _ { i , j + 1 } } \color{blue}{\frac { \partial r _ { i , j + 1 } } { \partial r _ { i , j }}} + \underbrace { \frac { \partial r _ { n , m } } { \partial r _ { i + 1 , j + 1 } } } _ { e _ { i + 1 , j + 1 } } \color{red}{\frac { \partial r _ { i + 1 , j + 1 } } { \partial r _ { i , j } }}.$

The Bellman recursion tells us that

$r _ { i + 1 , j } = \delta _ { i + 1 , j } + \textstyle \min ^ { \gamma } \left\{ r _ { i , j - 1 } , r _ { i , j } , r _ { i + 1 , j - 1 } \right\},$

thus

$\color{green}{\frac { \partial r _ { i + 1 , j } } { \partial r _ { i , j } }} = e ^ { - r _ { i , j } / \gamma } / \left( e ^ { - r _ { i , j - 1 } / \gamma } + e ^ { - r _ { i , j } / \gamma } + e ^ { - r _ { i + 1 , j - 1 } / \gamma } \right).$

Taking logarithm on both sides yields

\begin{aligned} \gamma \log \color{green}{\frac { \partial r _ { i + 1 , j } } { \partial r _ { i , j } }} & = \textstyle \min ^ { \gamma } \left\{ r _ { i , j - 1 } , r _ { i , j } , r _ { i + 1 , j - 1 } \right\} - r _ { i , j } \\ & = r _ { i + 1 , j } - \delta _ { i + 1 , j } - r _ { i , j }. \end{aligned}

Similarly, we can obtain

\begin{aligned} \gamma \log \color{blue} {\frac { \partial r _ { i , j + 1 } } { \partial r _ { i , j } }} & = r _ { i , j + 1 } - r _ { i , j } - \delta _ { i , j + 1 }, \\ \gamma \log \color{red}{\frac { \partial r _ { i + 1 , j + 1 } } { \partial r _ { i , j } }} & = r _ { i + 1 , j + 1 } - r _ { i , j } - \delta _ { i + 1 , j + 1 }. \end{aligned}

Therefore, we obtain a backward recursion to compute the entire matrix $$E = [e_{i,j}]$$ from $$r_{n,m}$$.

## Complete algorithms

Figure Forward Soft-DTW

Figure Backward Soft-DTW