Carl Dehlin
Aug 26, 2024

Automatic differentiation using dual numbers

A common task in computer vision is to detect and extract features from images, match them between views and then estimate some parameters. The estimation is formulated as an optimization problem, which in certain special cases can be solved using linear programming, but in the more general case a non-linear estimation problem arises, which is typically solved with gradient-based techniques. Calculating the expressions for the gradients by hand can be very cumbersome. Luckily, there are automatic methods to compute them from the field of automatic differentiation (autodiff / AD). This article is a tutorial on the topic, starting with a theoretical background and finishing off with a practical example in C++. The target audience is primarily developers working with optimization problems.

AD can be done in various ways, the primary distinction being "forward mode" and "reverse mode". Deep learning is made possible by a special case of reverse mode AD known as backpropagation (or backprop). In geometric computer vision, it is quite common to formulate estimation as a robust optimization problem (discussed in depth in one of our other articles here), where the objective is a sum of "squashed" residuals $r_i$ produced by a squashing function $\rho$

$$
  L(\theta) = \sum_{i=1}^M \rho(r_i) = \sum_{i=1}^M \rho(f(x_i,\theta) - y_i)
$$
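A common choice of squashing function, shown here purely for illustration, is the Huber function, which is quadratic for small residuals and grows only linearly for large ones, so that outliers are down-weighted:

$$\begin{eqnarray}
  \rho_\delta(r) &=& \tfrac{1}{2} r^2, \quad |r| \le \delta \newline
  \rho_\delta(r) &=& \delta \left( |r| - \tfrac{1}{2} \delta \right), \quad |r| > \delta
\end{eqnarray}$$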

The number of terms $M$ can be quite large (proportional to the number of features), while the number of parameters $N$ is typically much smaller (on the order of the number of views). Forward mode AD is very efficient for these types of problems, and it also turns out to be very easy to implement yourself without external libraries. The approach we introduce here has been used for more than 50 years. The core of AD is the chain rule, a theorem that states how differentiation works for composite functions. The rule reads as follows. Let $x$, $y$ and $z$ be scalars, then

$$
  \newcommand\grad[2]{\frac{\partial#1}{\partial#2}}
  \newcommand\gradinline[2]{\partial#1 / \partial#2}
  \newcommand\derivative[2]{\frac{d#1}{d#2}}
  \begin{eqnarray}
      y &=& g(x) \newline
      z &=& f(y) = f(g(x)) \newline
      \derivative{z}{x} &=& \derivative{z}{y}\derivative{y}{x} = f^\prime(g(x)) g^\prime(x)
  \end{eqnarray}
$$
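As a small worked example (ours, for illustration only), take $z = \sin(x^2)$, that is $g(x) = x^2$ and $f(y) = \sin(y)$:

$$
  \derivative{z}{x} = f^\prime(g(x)) \, g^\prime(x) = \cos(x^2) \cdot 2x
$$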

In other words, the derivative of a function with respect to (w.r.t.) a parameter can be computed as a multiplication between the derivative of the output $z$ evaluated at the local input $y$ and the derivative of the local input $y$ w.r.t. the main input $x$. This generalizes to vector-valued inputs and outputs, for which the chain rule reads

$$
  \begin{eqnarray}
      y &=& g(x) \newline
      z &=& f(y) = f(g(x)) \newline
      \grad{z_i}{x_j} &=& \sum_k \grad{z_i}{y_k} \grad{y_k}{x_j} \newline
      \grad{z}{x} &=& \grad{z}{y} \grad{y}{x}
  \end{eqnarray}
$$

where $\gradinline{z}{x}$, $\gradinline{z}{y}$ and $\gradinline{y}{x}$ are matrices known as Jacobians, consisting of the partial derivatives of the elements of an output vector w.r.t. the elements of an input vector. For the special case of a scalar output $z$ with vector-valued input $x$ and intermediate variable $y$, the vector-based chain rule still holds, but the Jacobian $\gradinline{z}{y}$ is a row vector, which when multiplied from the right with the matrix $\gradinline{y}{x}$ remains a row vector, that is, a gradient.
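For concreteness (the explicit dimensions below are our own annotation), with $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$ and a scalar $z$, the shapes compose as

$$
  \underbrace{\grad{z}{x}}_{1 \times n} = \underbrace{\grad{z}{y}}_{1 \times m} \underbrace{\grad{y}{x}}_{m \times n}
$$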

Let us focus on scalar outputs with vector inputs. To simplify notation, denote the current value as $z = z(y)$ and the accumulated gradient as $\nabla z = \gradinline{z}{x}$.
Here comes the trick: invent a new multidimensional number $\epsilon = (\epsilon_1, \dots, \epsilon_N)$ whose components have the property $\epsilon_i \epsilon_j = 0$. The value and gradient can then be simultaneously formulated as

$$
  z^* = z + \epsilon^T \nabla z
$$
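To actually obtain gradients w.r.t. the inputs, each input variable $x_i$ is seeded with the $i$-th standard basis vector $e_i$ as its gradient (the seeding step is implicit above; we spell it out here for clarity):

$$
  x_i^* = x_i + \epsilon^T e_i = x_i + \epsilon_i
$$

Evaluating a function on inputs seeded this way produces a result whose dual part is exactly the gradient w.r.t. $x$, which is what the initialization of x and y in the C++ example further down does.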

Looks strange? If you have learned about imaginary and complex numbers you have been exposed to a similar trick before. When humanity was faced with solving the impossible equation $x^2 = -1$, a new type of number was simply invented to solve it: the imaginary unit $i$ with the property $i^2 = -1$. The equation that was impossible to solve with real numbers became the very definition of the imaginary number itself. Real and imaginary numbers could be mixed, and the result became known as the complex numbers. The mathematical name for this type of construct is a field. Similarly, we invent a new non-zero number $\epsilon$ satisfying $\epsilon^2 = 0$, an equation no non-zero real or complex number solves, and obtain the dual numbers. Mixing them with real numbers gives what is known as first order jets. In general, a jet is a type of abstract polynomial that can have higher degree, which can be used to evaluate higher order differentials and do things like automatic curvature computation, but this is out of scope for this article. To see how jets can be used to implement forward AD, we set up a correspondence using the previous definitions. Addition and multiplication on jets are defined as

$$\begin{eqnarray}
  x + \epsilon^T \nabla x + y + \epsilon^T \nabla y &=& x + y + \epsilon^T (\nabla x + \nabla y) \newline
  (x + \epsilon^T \nabla x) (y + \epsilon^T \nabla y) &=& xy + x \epsilon^T \nabla y + y \epsilon^T \nabla x + (\epsilon^T \nabla x) (\epsilon^T \nabla y) \newline
  &=& xy + \epsilon^T (x \nabla y + y \nabla x) + \nabla x^T \underbrace{\epsilon \epsilon^T}_{0} \nabla y \newline
  &=& xy + \epsilon^T (x \nabla y + y \nabla x)
\end{eqnarray}$$

The real part is added and multiplied as usual, while the dual part combines linearly under addition and exhibits a sort of cross mixing under multiplication. Compare that to how gradients are composed using the chain rule

$$\begin{eqnarray}
  \nabla (x + y) &=& \nabla x + \nabla y \newline
  \nabla (x y) &=& x \nabla y + y \nabla x
\end{eqnarray}$$

which is exactly the same expression derived for the dual part of the jets! Hence, there is a one-to-one correspondence between jet arithmetic and simultaneously computing real values and their accumulated gradients. This means that by extending elementary functions operating on real or complex numbers to jets, we get AD as a side effect: since all higher order terms in $\epsilon$ vanish, any differentiable elementary function extends as $f(x + \epsilon^T \nabla x) = f(x) + \epsilon^T f^\prime(x) \nabla x$, which is how cos and sin are overloaded below. In C++ this is easy to implement with a simple structure, some operator/function overloading and templates. For the sake of producing a minimal example, here is a simplified, rewritten excerpt from our code base:


#include <cmath>
#include <ostream>

namespace rooniq
{

// Minimal fixed-size vector, used to hold the dual part (the gradient).
template <typename T, int N>
struct Vec {
    T data[N];
};

template <typename T, int N>
Vec<T,N>
operator + (const Vec<T,N>& lhs, const Vec<T,N>& rhs)
{
    Vec<T,N> result;
    for (int i = 0; i < N; ++i) {
        result.data[i] = lhs.data[i] + rhs.data[i];
    }
    return result;
}

template <typename T, int N>
Vec<T,N>
operator * (T a, Vec<T,N> v)
{
    for (int i = 0; i < N; ++i) {
        v.data[i] *= a;
    }
    return v;
}

template <typename T, int N>
std::ostream&
operator << (std::ostream& out, const Vec<T,N>& v)
{
    out << "Vec {";
    for (int i = 0; i < N - 1; ++i) {
        out << v.data[i] << ", ";
    }
    return out << v.data[N-1] << "}";
}

// First order jet: a real value together with its accumulated gradient.
template <typename T, int N>
struct Jet
{
    T        real;
    Vec<T,N> dual;
};

template <typename T, int N>
std::ostream&
operator << (std::ostream& out, const Jet<T,N>& v) {
    return out << "Jet { .real = "
               << v.real << ", .dual = "
               << v.dual << "}";
}

template <typename T>
using Jet2 = Jet<T,2>;

using Jet2f = Jet2<float>;
using Jet2d = Jet2<double>;

// Addition: real parts add, gradients add.
template <typename T, int N>
Jet<T,N>
operator + (const Jet<T,N>& lhs, const Jet<T,N>& rhs) {
    return {
        lhs.real + rhs.real,
        lhs.dual + rhs.dual
    };
}

// Multiplication: the dual part follows the product rule.
template <typename T, int N>
Jet<T,N>
operator * (const Jet<T,N>& lhs, const Jet<T,N>& rhs) {
    return {
        lhs.real * rhs.real,
        lhs.real * rhs.dual + rhs.real * lhs.dual
    };
}

using ::std::cos;
using ::std::sin;

// Elementary functions extend to jets via f(x + e^T dx) = f(x) + e^T f'(x) dx.
template <typename T, int N>
Jet<T,N>
cos(const Jet<T,N>& v) {
    return {
        cos(v.real),
        - sin(v.real) * v.dual
    };
}

template <typename T, int N>
Jet<T,N>
sin(const Jet<T,N>& v) {
    return {
        sin(v.real),
        cos(v.real) * v.dual
    };
}

} // namespace rooniq

#include <iostream>

int main()
{
    // Seed each input with a basis vector in the dual part, so the dual
    // part of the result becomes the gradient w.r.t. (x, y).
    rooniq::Jet2d x = {1, {1,0}};
    rooniq::Jet2d y = {2, {0,1}};
    rooniq::Jet2d z = cos(x + y + x * y);
    std::cout << "x = " << x << std::endl;
    std::cout << "y = " << y << std::endl;
    std::cout << "z = " << z << std::endl;
}

Running this code produces the following output in a terminal:

x = Jet { .real = 1, .dual = Vec {1, 0}}
y = Jet { .real = 2, .dual = Vec {0, 1}}
z = Jet { .real = 0.283662, .dual = Vec {2.87677, 1.91785}}

Is that correct? We can quickly do a pen and paper calculation to verify

$$\begin{eqnarray}
  z &=& \cos(x + y + xy) = \cos(1 + 2 + 2) = \cos(5) = 0.283662 \newline
  \grad{z}{x} &=& \cos^\prime(5)  \grad{x + y + xy}{x} = - \sin(5) (1 + y) = 0.95892 * 3 = 2.87677 \newline
  \grad{z}{y} &=& \cos^\prime(5)  \grad{x + y + xy}{y} = - \sin(5) (1 + x) = 0.95892 * 2 = 1.91784
\end{eqnarray}$$

so things look correct! By defining overloads with the same names as the standard library functions in our namespace, we make them participate in argument dependent lookup (ADL). Hence, any code that performs unqualified function calls automatically becomes eligible for AD. This is a big reason not to qualify function calls in your source code: if you consistently prefix all standard library calls with std::, you will need to remove those prefixes to get AD. Note that you automatically get vector-to-vector derivatives (i.e. Jacobians) as well, since a vector is just an array of numbers; there is no need for special AD machinery for that. Since we use templates, it's turtles all the way down, and you can define a "simultaneous $M$-dimensional vector value + Jacobian" with a simple typedef

template <typename T, int M, int N>
using VectorValueAndJacobian = Vec<Jet<T,N>, M>;
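As a small sketch of how this pattern can be used (the function f below is made up for illustration, builds on the rooniq definitions above, and assumes the typedef lives in namespace rooniq alongside Vec and Jet; it is not part of our code base), seed the inputs with basis vectors, evaluate a templated function on jets, and read the rows of the Jacobian off the dual parts of the outputs:

#include <iostream>

// Hypothetical residual function from R^2 to R^2. The unqualified calls to
// sin and cos resolve to the jet overloads via ADL when T is a Jet type.
template <typename T>
rooniq::Vec<T, 2> f(const T& x, const T& y)
{
    return {{ sin(x * y), x + cos(y) }};
}

int main()
{
    // Seed the dual parts with basis vectors.
    rooniq::Jet2d x = {1, {1, 0}};
    rooniq::Jet2d y = {2, {0, 1}};

    // Each output element holds its value and one row of the Jacobian.
    rooniq::VectorValueAndJacobian<double, 2, 2> r = f(x, y);
    std::cout << r.data[0] << std::endl; // value and gradient of sin(x*y)
    std::cout << r.data[1] << std::endl; // value and gradient of x + cos(y)
}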

By adopting this pattern you can quickly incorporate autodiff in your codebase.
The computed gradients can then be passed down to some optimization library for solving the optimization task at hand.

Let's discuss how our 3D software can speed up your operations.