# Theory

TaylorDiff.jl is an operator-overloading based forward-mode automatic differentiation (AD) package. "Forward-mode" means that the basic capability of this package is the following: given a function $f:\mathbb R^n\to\mathbb R^m$, a point $x\in\mathbb R^n$ at which to evaluate derivatives, and a direction $v\in\mathbb R^n$, we compute

``f(x),\partial f(x)\times v,\partial^2f(x)\times v\times v,\cdots,\partial^pf(x)\times v\times\cdots\times v``

i.e., the function value and the directional derivatives up to order $p$. This notation might be unfamiliar to Julia users who have experience with other AD packages, but $\partial f(x)$ is simply the Jacobian $J$, and $\partial f(x)\times v$ is simply the Jacobian-vector product (JVP).
In other words, this is a simple generalization of the Jacobian-vector product to the Hessian-vector-vector product, and to even higher orders.

The main advantage of doing this instead of $p$ nested first-order Jacobian-vector products is that nesting first-order AD scales exponentially w.r.t. $p$, while this method, also known as Taylor mode, scales (almost) linearly w.r.t. $p$. We will see the reason for this claim later.

In order to achieve this, assume that $f$ is a composite function $f_k\circ\cdots\circ f_2\circ f_1$, where each $f_i$ is a basic and simple function, called a "primitive".
We need to figure out how to propagate the derivatives through each step. In first-order AD, this is achieved by the "dual" pair $x_0+x_1\varepsilon$, where $\varepsilon^2=0$, and for each primitive we make a method overload

``f(x_0+x_1\varepsilon)=f(x_0)+\partial f(x_0) x_1\varepsilon``
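
To make this concrete, here is a minimal first-order dual number in plain Julia (a sketch for illustration only, not TaylorDiff's actual implementation):

```julia
# A "dual" pair x0 + x1*eps with eps^2 = 0; each primitive gets one overload.
struct Dual
    value::Float64    # x_0
    partial::Float64  # x_1, the coefficient of eps
end

Base.:+(a::Dual, b::Dual) = Dual(a.value + b.value, a.partial + b.partial)
Base.:*(a::Dual, b::Dual) = Dual(a.value * b.value,
                                 a.value * b.partial + a.partial * b.value)
# f(x_0 + x_1 eps) = f(x_0) + f'(x_0) x_1 eps, instantiated for exp:
Base.exp(a::Dual) = Dual(exp(a.value), exp(a.value) * a.partial)
```

For example, `exp(Dual(1.0, 1.0))` evaluates to `Dual(exp(1.0), exp(1.0))`, whose `partial` field is exactly the derivative of $\exp$ at $1$.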

Similarly, in higher-order AD, we need for each primitive a method overload for a truncated Taylor polynomial up to order $p$, and in this polynomial we will use $t$ instead of $\varepsilon$ to denote the sensitivity. "Truncated" means that $t^{p+1}=0$, similar to what we defined for dual numbers. So

``f(x_0+x_1t+x_2t^2+\cdots+x_pt^p)=?``

What is the math expression that we should put in place of the question mark?
That specific expression is called the "pushforward rule", and we will discuss how to derive it below.

## Arithmetic of polynomials

Before deriving pushforward rules, let's first introduce several basic properties of polynomials.

If $x(t)$ and $y(t)$ are both truncated Taylor polynomials, i.e.

```math
\begin{aligned}
x&=x_0+x_1t+\cdots+x_pt^p\\
y&=y_0+y_1t+\cdots+y_pt^p
\end{aligned}
```

Then it's obvious that the polynomial addition and subtraction should be

``(x\pm y)_k=x_k\pm y_k``

And with some derivation we can also get the polynomial multiplication rule

``(x\times y)_k=\sum_{i=0}^kx_iy_{k-i}``
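
To make the arithmetic concrete, here is a plain-Julia sketch of the addition and multiplication rules above, acting on raw coefficient vectors (`poly_add` and `poly_mul` are hypothetical names for illustration, not TaylorDiff's internal API):

```julia
# Coefficient c_k of a truncated polynomial lives at index k + 1 (1-based arrays).
poly_add(x, y) = x .+ y                          # (x +- y)_k = x_k +- y_k

# Multiplication is a convolution, truncated at order p since t^{p+1} = 0:
function poly_mul(x::Vector{Float64}, y::Vector{Float64})
    p = length(x) - 1
    return [sum(x[i+1] * y[k-i+1] for i in 0:k) for k in 0:p]
end
```

For example, `poly_mul([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])` returns `[4.0, 13.0, 28.0]`, the coefficients of $(1+2t+3t^2)(4+5t+6t^2)$ truncated at order 2.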

The polynomial division rule is less obvious, but if $x/y=z$, then equivalently $x=yz$, i.e.

``\left(\sum_{i=0}^py_it^i\right)\left(\sum_{i=0}^pz_it^i\right)=\sum_{i=0}^px_it^i``

if we relate the coefficient of $t^k$ on both sides, we get

``\sum_{i=0}^k z_iy_{k-i}=x_k``

so, equivalently,

``z_k=\frac1{y_0}\left(x_k-\sum_{i=0}^{k-1}z_iy_{k-i}\right)``

This is a recurrence relation, which means that we can first get $z_0=x_0/y_0$, then get $z_1$ using $z_0$, then $z_2$ using $z_0,z_1$, and so on.
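
In code, the recurrence looks like this (again a plain-Julia sketch with a hypothetical `poly_div` helper, not TaylorDiff's internals):

```julia
# Truncated polynomial division z = x / y using the recurrence
#   z_k = (x_k - sum_{i=0}^{k-1} z_i y_{k-i}) / y_0.
function poly_div(x::Vector{Float64}, y::Vector{Float64})
    p = length(x) - 1
    z = zeros(p + 1)
    z[1] = x[1] / y[1]                       # z_0 = x_0 / y_0
    for k in 1:p
        acc = sum(z[i+1] * y[k-i+1] for i in 0:k-1)
        z[k+1] = (x[k+1] - acc) / y[1]
    end
    return z
end
```

Note that the $k$-th coefficient costs $O(k)$ operations here, so a full division of order-$p$ polynomials costs $O(p^2)$.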

## Pushforward rule for elementary functions

Let's now consider how to derive the pushforward rule for elementary functions. We will use $\exp$ and $\log$ as two examples.

If $x(t)$ is a polynomial and we want to get $e(t)=\exp(x(t))$, we can actually get that by formulating an ordinary differential equation:

``e'(t)=\exp(x(t))x'(t);\quad e_0=\exp(x_0)``

If we expand both $e$ and $x$ in the equation, we will get

``\sum_{i=1}^pie_it^{i-1}=\left(\sum_{i=0}^{p-1} e_it^i\right)\left(\sum_{i=1}^pix_it^{i-1}\right)``

relating the coefficient of $t^{k-1}$ on both sides, we get

``ke_k=\sum_{i=0}^{k-1}e_i\times (k-i)x_{k-i}``

This is, again, a recurrence relation, so we can get $e_1,\cdots,e_p$ step by step.
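
The same recurrence as a plain-Julia sketch (the `poly_exp` name is hypothetical, not TaylorDiff's internal API):

```julia
# Taylor coefficients of e(t) = exp(x(t)) from the recurrence
#   e_k = (1/k) sum_{i=0}^{k-1} e_i (k-i) x_{k-i},   with e_0 = exp(x_0).
function poly_exp(x::Vector{Float64})
    p = length(x) - 1
    e = zeros(p + 1)
    e[1] = exp(x[1])
    for k in 1:p
        e[k+1] = sum(e[i+1] * (k - i) * x[k-i+1] for i in 0:k-1) / k
    end
    return e
end
```

For $x(t)=t$, `poly_exp([0.0, 1.0, 0.0, 0.0])` returns `[1.0, 1.0, 0.5, 0.1666...]`, the familiar coefficients of $e^t=1+t+t^2/2+t^3/6+\cdots$.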

If $x(t)$ is a polynomial and we want to get $l(t)=\log(x(t))$, we can actually get that by formulating an ordinary differential equation:

``l'(t)=\frac1{x(t)}x'(t);\quad l_0=\log(x_0)``

If we expand both $l$ and $x$ in the equation, the RHS is simply a polynomial division, and we get

``l_k=\frac1{x_0}\left(x_k-\frac1k\sum_{i=1}^{k-1}il_ix_{k-i}\right)``
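
And in code (again a hypothetical `poly_log` helper for illustration):

```julia
# Taylor coefficients of l(t) = log(x(t)) from the recurrence
#   l_k = (x_k - (1/k) sum_{i=1}^{k-1} i l_i x_{k-i}) / x_0,  with l_0 = log(x_0).
function poly_log(x::Vector{Float64})
    p = length(x) - 1
    l = zeros(p + 1)
    l[1] = log(x[1])
    for k in 1:p
        acc = sum((i * l[i+1] * x[k-i+1] for i in 1:k-1); init = 0.0)
        l[k+1] = (x[k+1] - acc / k) / x[1]
    end
    return l
end
```

For $x(t)=1+t$, `poly_log([1.0, 1.0, 0.0])` returns `[0.0, 1.0, -0.5]`, matching $\log(1+t)=t-t^2/2+\cdots$.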

---

## Generic pushforward rule

For a generic $f(x)$, if we don't bother deriving the specific recurrence rule for it, we can still automatically generate a pushforward rule in the following manner. Let's denote the derivative of $f$ w.r.t. $x$ by $d(x)$; then for $f(t)=f(x(t))$ we have

``f'(t)=d(x(t))x'(t);\quad f(0)=f(x_0)``

When we expand $f$ and $x$ up to order $p$ in this equation, we notice that only order $p-1$ is needed for $d(x(t))$. In other words, we turn the problem of finding the $p$-th order pushforward for $f$ into the problem of finding the $(p-1)$-th order pushforward for $d$, and we can recurse down to first order. The first-order derivative expressions are captured from ChainRules.jl, which makes this process fully automatic.
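
The recursion can be sketched in plain Julia as follows (`compose_taylor` is a hypothetical illustration that takes a precomputed list of derivative functions, whereas TaylorDiff generates the equivalent code at compile time):

```julia
# Taylor coefficients of f(x(t)) up to order `ord`, given
# derivs[k+1] = the k-th derivative of f as a plain function.
# Recursing on f'(t) = d(x(t)) x'(t) drops one order per level.
function compose_taylor(derivs, x::Vector{Float64}, ord::Int)
    f = zeros(ord + 1)
    f[1] = derivs[1](x[1])                         # f(0) = f(x_0)
    ord == 0 && return f
    d = compose_taylor(derivs[2:end], x, ord - 1)  # coefficients of d(x(t))
    for k in 1:ord
        f[k+1] = sum(d[i+1] * (k - i) * x[k-i+1] for i in 0:k-1) / k
    end
    return f
end

# Third-order coefficients of sin(x(t)) with x(t) = 0.5 + t:
compose_taylor([sin, cos, x -> -sin(x), x -> -cos(x)], [0.5, 1.0, 0.0, 0.0], 3)
```

Note how each recursion level drops the order by exactly one, so the depth of the derivative tower is $p$.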

This strategy is in principle equivalent to nesting first-order differentiation, which could potentially lead to exponential scaling; in practice, however, there is a huge difference. The generation of the pushforward rule happens at **compile time**, which gives the compiler a chance to eliminate redundant expressions and optimize the computation down to quadratic time. The compiler has stack limits, but this should work for orders up to at least 100.
