Soft-DTW: a Differentiable Loss Function for Time-Series (ICML 2017).

Background

DTW (Dynamic Time Warping) computes the best possible alignment between two time series of respective lengths \(n\) and \(m\) by first computing the \(n \times m\) matrix of pairwise distances between their points, and then finding the best alignment with a dynamic program. DTW is robust to shifts and dilations along the time dimension.
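
To make the dynamic program concrete, here is a minimal NumPy sketch of classic DTW (my own illustration, not the paper's code), assuming the cost matrix has already been computed:

```python
import numpy as np

def dtw(Delta):
    """Classic DTW by dynamic programming.

    Delta: (n, m) matrix of pairwise discrepancies delta(x_i, y_j).
    Returns the cost of the best alignment path from (1, 1) to (n, m).
    """
    n, m = Delta.shape
    # R[i, j] = best cost of aligning x_1..x_i with y_1..y_j (1-indexed, border = +inf)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = Delta[i - 1, j - 1] + min(R[i - 1, j - 1], R[i - 1, j], R[i, j - 1])
    return R[n, m]
```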

Notations. Time series \(\mathbf { x } = \left( x _ { 1 } , \dots , x _ { n } \right) \in \mathbb { R } ^ { p \times n }\), \(\mathbf { y } = \left( y _ { 1 } , \dots , y _ { m } \right) \in \mathbb { R } ^ { p \times m }\). A differentiable discrepancy function \(\delta : \mathbb { R } ^ { p } \times \mathbb { R } ^ { p } \rightarrow \mathbb { R } _ { + }\), which in most cases is the squared Euclidean distance between two vectors. The cost matrix \(\Delta ( \mathbf { x } , \mathbf { y } ) : = \left[ \delta \left( x _ { i } , y _ { j } \right) \right] _ { i j } \in \mathbb { R } ^ { n \times m }\). \(\mathcal { A } _ { n , m } \subset \{ 0,1 \} ^ { n \times m }\) is the set of binary alignment matrices: each \(A \in \mathcal{A}_{n,m}\) encodes a path that connects the upper-left entry \((1,1)\) to the lower-right entry \((n, m)\) using only \(\downarrow\), \(\rightarrow\), \(\searrow\) moves.
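
For concreteness, a short sketch (my own, assuming the squared Euclidean \(\delta\) and the column-wise layout above) that builds the cost matrix \(\Delta(\mathbf{x}, \mathbf{y})\):

```python
import numpy as np

def cost_matrix(x, y):
    """Cost matrix Delta(x, y) = [delta(x_i, y_j)]_{ij} with squared Euclidean delta.

    x: (p, n) array whose columns are x_1, ..., x_n.
    y: (p, m) array whose columns are y_1, ..., y_m.
    Returns an (n, m) array.
    """
    # ||x_i - y_j||^2 = ||x_i||^2 + ||y_j||^2 - 2 <x_i, y_j>
    sq_x = (x ** 2).sum(axis=0)[:, None]   # (n, 1)
    sq_y = (y ** 2).sum(axis=0)[None, :]   # (1, m)
    return sq_x + sq_y - 2.0 * x.T @ y
```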

Figure (DTW alignment): an example of an alignment matrix \(A\) and the cost matrix \(\Delta ( \mathbf { x } , \mathbf { y } )\).

DTW and soft-DTW loss functions

Given the cost matrix \(\Delta ( \mathbf { x } , \mathbf { y } ) : = \left[ \delta \left( x _ { i } , y _ { j } \right) \right] _ { i j } \in \mathbb { R } ^ { n \times m }\), the Frobenius inner product \(\langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle\) for \(A \in \mathcal { A } _ { n , m }\) gives the score (total cost) of the alignment \(A\). Both DTW and GAK (the Global Alignment Kernel) consider all possible alignment matrices:

\[\begin{aligned} \operatorname { DTW } ( \mathbf { x } , \mathbf { y } ) & : = \min _ { A \in \mathcal { A } _ { n , m } } \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle \\ k _ { \mathrm { GA } } ^ { \gamma } ( \mathbf { x } , \mathbf { y } ) & := \sum _ { A \in \mathcal { A } _ { n , m } } \exp\{ { - \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle / \gamma } \}. \end{aligned}\]

Consider the following generalized \(\min\) operator, with a smoothing parameter \(\gamma \geq 0\):

\[\textstyle \min ^ { \gamma } \left\{ a _ { 1 } , \ldots , a _ { n } \right\} : = \left\{ \begin{array} { l l } { \min _ { i \leq n } a _ { i } , } & { \gamma = 0 }, \\ { - \gamma \log \sum _ { i = 1 } ^ { n } e ^ { - a _ { i } / \gamma } , } & { \gamma > 0 }. \end{array} \right.\]
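
A minimal, numerically stabilized implementation of this operator (my own sketch; the log-sum-exp is shifted by its maximum to avoid overflow):

```python
import numpy as np

def softmin(values, gamma):
    """Generalized min operator min^gamma; gamma = 0 recovers the hard min."""
    values = np.asarray(values, dtype=float)
    if gamma == 0.0:
        return values.min()
    z = -values / gamma
    z_max = z.max()                       # shift for numerical stability
    return -gamma * (z_max + np.log(np.exp(z - z_max).sum()))
```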

With this operator, we can define the \(\gamma\)-soft-DTW loss:

\[\textstyle \operatorname { dtw } _ { \gamma } ( \mathbf { x } , \mathbf { y } ) : = \min ^ { \gamma } \left\{ \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle , A \in \mathcal { A } _ { n , m } \right\}.\]
  • The original DTW is recovered by setting \(\gamma = 0\).
  • When \(\gamma > 0\), we can recover \(\operatorname { dtw } _ { \gamma } = -\gamma \log k _ { \mathrm { GA } } ^ { \gamma }\).

Differentiation of soft-DTW

The case \(\gamma=0\).

\[\nabla _ { \mathbf { x } } \operatorname { dtw } _ { 0 } ( \mathbf { x } , \mathbf { y } ) = \left(\frac { \partial \Delta ( \mathbf { x }, \mathbf { y } ) } { \partial \mathbf { x } }\right) ^\top A ^ { \star }.\]

Here \(A^\star \in \operatorname{argmin}_{A \in \mathcal{A}_{n,m}} \langle A, \Delta(\mathbf{x}, \mathbf{y}) \rangle\) is the optimal alignment matrix, and \({ \partial \Delta ( \mathbf { x }, \mathbf { y } ) }/ { \partial \mathbf { x }}\) is the Jacobian of \(\Delta(\mathbf{x}, \mathbf{y})\) w.r.t. \(\mathbf{x}\), a linear map from \(\mathbb{R}^{p \times n}\) to \(\mathbb{R}^{n \times m}\). When \(\delta(\cdot, \cdot)\) is the squared Euclidean distance, the transpose of the Jacobian applied to a matrix \(B\) is

\[( \partial \Delta ( \mathbf { x } , \mathbf { y } ) / \partial \mathbf { x } ) ^ { \top } B = 2 \left( \left( \mathbf { 1 } _ { p } \mathbf { 1 } _ { m } ^ { \top } B ^ { \top } \right) \circ \mathbf { x } - \mathbf { y } B ^ { \top } \right) \in \mathbb{R}^{p \times n}.\]

Remark: \(( \partial \Delta ( \mathbf { x } , \mathbf { y } ) / \partial \mathbf { x } ) ^ { \top }\) is thus a linear map from \(\mathbb{R}^{n \times m}\) to \(\mathbb{R}^{p \times n}\). In the familiar case, a differentiable function \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\) takes \(\mathbf{x} \in \mathbb{R}^n\) and produces \(f(\mathbf{x}) \in \mathbb{R}^m\); its Jacobian matrix \(\partial f /\partial \mathbf{x} \in \mathbb{R}^{m \times n}\) defines a linear map \(\mathbb{R}^n \rightarrow \mathbb{R}^m\), the best linear approximation of \(f\) near the point \(\mathbf{x}\). In this sense, the Jacobian \(\partial f /\partial \mathbf{x}\) generalizes the derivative.
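
As a sanity check of the closed form above, a short sketch (my own, assuming the squared Euclidean \(\delta\) and the \((p, n)\)/\((n, m)\) shapes used here):

```python
import numpy as np

def jacobian_T_product(x, y, B):
    """Apply (d Delta(x, y) / d x)^T to a matrix B of shape (n, m).

    x: (p, n), y: (p, m), delta = squared Euclidean.
    Returns the (p, n) matrix 2 * ((1_p 1_m^T B^T) o x - y B^T).
    """
    row_sums = B.sum(axis=1)              # (n,), i.e. B 1_m
    return 2.0 * (x * row_sums[None, :] - y @ B.T)
```

With \(\gamma = 0\), calling this with \(B = A^\star\) gives the gradient above; with \(\gamma > 0\), calling it with \(B = \mathbb{E}_{p_\gamma(A)}[A]\) gives the gradient of the case treated next.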

The case \(\gamma > 0\).

\[\nabla _ { \mathbf { x } } \operatorname { dtw } _ { \gamma } ( \mathbf { x } , \mathbf { y } ) = \left( \frac { \partial \Delta ( \mathbf { x } , \mathbf { y } ) } { \partial \mathbf { x } }\right)^\top \mathbb{E}_{p _ { \gamma }(A)} [A],\]

where

\[\mathbb { E } _ { p _ { \gamma }(A) } [ A ] : = \frac { \sum _ { A \in \mathcal { A } _ { n , m } } \exp\{ { - \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle / \gamma } \} A} { k _ { \mathrm { GA } } ^ { \gamma } ( \mathbf { x } , \mathbf { y } ) }\]

is the expectation of alignment matrix \(A\) under the Gibbs distribution \(p _ { \gamma }(A) \propto \exp \{ { - \langle A , \Delta ( \mathbf { x } , \mathbf { y } ) \rangle / \gamma } \}\).
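
For small \(n, m\), this expectation can be computed by brute force, enumerating every alignment path; a sketch (my own, exponential in \(n + m\), useful only to check the \(O(nm)\) recursion derived below):

```python
import numpy as np

def alignment_matrices(n, m):
    """Enumerate all A in A_{n,m}: monotone paths from the top-left to the
    bottom-right cell built from down, right, and diagonal moves (0-indexed)."""
    def paths(i, j):
        if i == n - 1 and j == m - 1:
            yield [(i, j)]
            return
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            if i + di < n and j + dj < m:
                for tail in paths(i + di, j + dj):
                    yield [(i, j)] + tail
    for path in paths(0, 0):
        A = np.zeros((n, m))
        for i, j in path:
            A[i, j] = 1.0
        yield A

def expected_alignment_bruteforce(Delta, gamma):
    """E_{p_gamma(A)}[A] under p_gamma(A) proportional to exp(-<A, Delta>/gamma)."""
    mats = list(alignment_matrices(*Delta.shape))
    scores = np.array([-(A * Delta).sum() / gamma for A in mats])
    weights = np.exp(scores - scores.max())   # stabilized Gibbs weights
    weights /= weights.sum()
    return sum(w * A for w, A in zip(weights, mats))
```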

Algorithmic differentiation

Computing the expectation \(\mathbb { E } _ { p _ { \gamma }(A) } [ A ]\) by differentiating Bellman's recursion in the forward direction has complexity \(O(n^2 m^2)\). The paper instead applies the chain rule in the reverse order of Bellman's recursion (Algorithm 1) to obtain an \(O(nm)\) algorithm.

Since \(\operatorname { dtw } _ { \gamma } ( \mathbf { x } , \mathbf { y } )\) equals \(r_{n,m}\) in the Bellman recursion, we have

\[\nabla _ { \mathbf { x } } \operatorname { dtw } _ { \gamma } ( \mathbf { x } , \mathbf { y } ) = \frac{\partial r_{n,m}}{\partial \mathbf{x}}.\]

The derivative of the loss \(r_{n,m}\) w.r.t. \(\mathbf{x}\) can be computed by the chain rule:

\[\frac{\partial r_{n,m}}{\partial \mathbf{x}} =\left(\frac{\partial \Delta (\mathbf { x } , \mathbf { y } )}{\partial \mathbf{x}} \right)^\top \frac{\partial r_{n,m}}{\partial \Delta(\mathbf { x } , \mathbf { y } )}.\]

By defining \(e_{i,j} := \frac { \partial r _ { n , m } } { \partial r _ { i , j } }\), we have

\[\frac{\partial r_{n,m}}{\partial \delta_{i,j}} = \frac { \partial r _ { n , m } } { \partial r _ { i , j } } \frac { \partial r _ { i , j } } { \partial \delta _ { i , j } } = e _ { i , j } \cdot 1 = e_{i,j},\]

since \(\delta_{i,j}\) enters the recursion only through \(r_{i,j} = \delta_{i,j} + {\min}^{\gamma}\{\cdots\}\), so that \(\partial r_{i,j} / \partial \delta_{i,j} = 1\). Collecting the entries into the matrix \(E := [e_{i,j}] \in \mathbb{R}^{n \times m}\), we hence obtain

\[\nabla _ { \mathbf { x } } \operatorname { dtw } _ { \gamma } ( \mathbf { x } , \mathbf { y } ) = \left( \frac { \partial \Delta ( \mathbf { x } , \mathbf { y } ) } { \partial \mathbf { x } } \right) ^ { \top } E,\]

Comparing this with the gradient formula for the case \(\gamma > 0\), we see that \(E\) is exactly the expectation \(\mathbb { E } _ { p _ { \gamma }(A) } [ A ]\).

Now we turn to computing \(e_{i,j}\) using the chain rule:

\[\underbrace { \frac { \partial r _ { n , m } } { \partial r _ { i , j } } } _ { e _ { i , j } } = \underbrace { \frac { \partial r _ { n , m } } { \partial r _ { i + 1 , j } } } _ { e _ { i + 1 , j } } \color{green}{\frac { \partial r _ { i + 1 , j } } { \partial r _ { i , j } }} + \underbrace { \frac { \partial r _ { n , m } } { \partial r _ { i , j + 1 } } } _ { e _ { i , j + 1 } } \color{blue}{\frac { \partial r _ { i , j + 1 } } { \partial r _ { i , j }}} + \underbrace { \frac { \partial r _ { n , m } } { \partial r _ { i + 1 , j + 1 } } } _ { e _ { i + 1 , j + 1 } } \color{red}{\frac { \partial r _ { i + 1 , j + 1 } } { \partial r _ { i , j } }}.\]

The Bellman recursion tells us that

\[r _ { i + 1 , j } = \delta _ { i + 1 , j } + \textstyle \min ^ { \gamma } \left\{ r _ { i , j - 1 } , r _ { i , j } , r _ { i + 1 , j - 1 } \right\},\]

thus

\[\color{green}{\frac { \partial r _ { i + 1 , j } } { \partial r _ { i , j } }} = e ^ { - r _ { i , j } / \gamma } / \left( e ^ { - r _ { i , j - 1 } / \gamma } + e ^ { - r _ { i , j } / \gamma } + e ^ { - r _ { i + 1 , j - 1 } / \gamma } \right).\]

Taking the logarithm of both sides and multiplying by \(\gamma\) yields

\[\begin{aligned} \gamma \log \color{green}{\frac { \partial r _ { i + 1 , j } } { \partial r _ { i , j } }} & = \textstyle \min ^ { \gamma } \left\{ r _ { i , j - 1 } , r _ { i , j } , r _ { i + 1 , j - 1 } \right\} - r _ { i , j } \\ & = r _ { i + 1 , j } - \delta _ { i + 1 , j } - r _ { i , j }. \end{aligned}\]

Similarly, we can obtain

\[\begin{aligned} \gamma \log \color{blue} {\frac { \partial r _ { i , j + 1 } } { \partial r _ { i , j } }} & = r _ { i , j + 1 } - r _ { i , j } - \delta _ { i , j + 1 }, \\ \gamma \log \color{red}{\frac { \partial r _ { i + 1 , j + 1 } } { \partial r _ { i , j } }} & = r _ { i + 1 , j + 1 } - r _ { i , j } - \delta _ { i + 1 , j + 1 }. \end{aligned}\]

Exponentiating these expressions gives the three factors needed in the chain rule above, so we obtain a backward recursion that computes the entire matrix \(E = [e_{i,j}]\), starting from \(e_{n,m} = 1\) and moving from \((n,m)\) back to \((1,1)\).

Complete algorithms

Figure: forward recursion computing \(\operatorname{dtw}_\gamma(\mathbf{x}, \mathbf{y})\) (the paper's Algorithm 1).
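
A minimal NumPy sketch of this forward recursion (my own transcription: \(r_{0,0} = 0\), the remaining border cells are set to \(+\infty\), and the intermediate matrix \(R\) is returned because the backward pass needs it):

```python
import numpy as np

def softmin3(a, b, c, gamma):
    """min^gamma over three values (gamma = 0 recovers the hard min)."""
    if gamma == 0.0:
        return min(a, b, c)
    z = -np.array([a, b, c]) / gamma
    z_max = z.max()
    return -gamma * (z_max + np.log(np.exp(z - z_max).sum()))

def soft_dtw_forward(Delta, gamma):
    """Forward Bellman recursion: returns dtw_gamma(x, y) and the matrix of r_{i,j}."""
    n, m = Delta.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = Delta[i - 1, j - 1] + softmin3(
                R[i - 1, j - 1], R[i - 1, j], R[i, j - 1], gamma)
    return R[n, m], R
```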

Figure: backward recursion computing \(\nabla_{\mathbf{x}} \operatorname{dtw}_\gamma(\mathbf{x}, \mathbf{y})\) (the paper's Algorithm 2).
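
And a matching sketch of the backward recursion (again my own transcription: a virtual cell \((n+1, m+1)\) is padded with \(r = r_{n,m}\) and \(\delta = 0\), the rest of the extra border of \(r\) is set to \(-\infty\) so the corresponding exponentials vanish, and \(E\) is filled from the bottom-right corner):

```python
import numpy as np

def soft_dtw_backward(Delta, R, gamma):
    """Backward recursion: returns E = E_{p_gamma(A)}[A], the expected alignment matrix.

    Delta: (n, m) cost matrix; R: the (n+1, m+1) matrix from the forward pass.
    """
    n, m = Delta.shape
    D = np.zeros((n + 2, m + 2))
    D[1:n + 1, 1:m + 1] = Delta
    Rb = np.full((n + 2, m + 2), -np.inf)
    Rb[1:n + 1, 1:m + 1] = R[1:, 1:]
    Rb[n + 1, m + 1] = R[n, m]            # virtual terminal cell
    E = np.zeros((n + 2, m + 2))
    E[n + 1, m + 1] = 1.0                 # e_{n,m} is seeded from here
    for i in range(n, 0, -1):
        for j in range(m, 0, -1):
            a = np.exp((Rb[i + 1, j] - Rb[i, j] - D[i + 1, j]) / gamma)
            b = np.exp((Rb[i, j + 1] - Rb[i, j] - D[i, j + 1]) / gamma)
            c = np.exp((Rb[i + 1, j + 1] - Rb[i, j] - D[i + 1, j + 1]) / gamma)
            E[i, j] = a * E[i + 1, j] + b * E[i, j + 1] + c * E[i + 1, j + 1]
    return E[1:n + 1, 1:m + 1]
```

The gradient of \(\operatorname{dtw}_\gamma\) w.r.t. \(\mathbf{x}\) is then obtained by applying the Jacobian transpose to \(E\), e.g. with the `jacobian_T_product` sketch above: \(\nabla_{\mathbf{x}} \operatorname{dtw}_\gamma(\mathbf{x}, \mathbf{y}) = (\partial \Delta(\mathbf{x}, \mathbf{y}) / \partial \mathbf{x})^\top E\).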