Commit f628889: "Use math blocks everywhere"
adrhilltansongchen authored and committed. 1 parent: ae7c793. 1 file changed: docs/src/theory.md (66 additions, 39 deletions).

# Theory
TaylorDiff.jl is an operator-overloading based forward-mode automatic differentiation (AD) package.
"Forward-mode" implies that the basic capability of this package is that, for a function $f:\mathbb R^n\to\mathbb R^m$, a point $x\in\mathbb R^n$ at which to evaluate the derivative, and a direction $v\in\mathbb R^n$, we compute

```math
f(x),\partial f(x)\times v,\partial^2f(x)\times v\times v,\cdots,\partial^pf(x)\times v\times\cdots\times v
```

i.e., the function value and the directional derivatives up to order $p$.
This notation might be unfamiliar to Julia users who have experience with other AD packages, but $\partial f(x)$ is simply the Jacobian $J$, and $\partial f(x)\times v$ is simply the Jacobian-vector product (JVP).
In other words, this is a simple generalization of the Jacobian-vector product to the Hessian-vector-vector product, and to even higher orders.

The main advantage of doing this instead of $p$ separate first-order Jacobian-vector products is that nesting first-order AD results in exponential scaling w.r.t. $p$, while this method, also known as Taylor mode, scales (almost) linearly w.r.t. $p$.
We will see the reason for this claim later.

To achieve this, we assume that $f$ is a composite function $f_k\circ\cdots\circ f_2\circ f_1$, where each $f_i$ is a basic and simple function, called a "primitive".
We need to figure out how to propagate the derivatives through each step.
In first-order AD, this is achieved by the "dual" pair $x_0+x_1\varepsilon$, where $\varepsilon^2=0$, and for each primitive we make a method overload

```math
f(x_0+x_1\varepsilon)=f(x_0)+\partial f(x_0) x_1\varepsilon
```

Similarly, in higher-order AD, we need for each primitive a method overload for a truncated Taylor polynomial up to order $p$, and in this polynomial we will use $t$ instead of $\varepsilon$ to denote the sensitivity.
"Truncated" means that $t^{p+1}=0$, similar to what we defined for dual numbers. So

```math
f(x_0+x_1t+x_2t^2+\cdots+x_pt^p)=?
```

What is the math expression that we should put in place of the question mark?
That specific expression is called the "pushforward rule", and we will talk about how to derive pushforward rules below.
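
The first-order dual-number mechanism described above can be sketched in a few lines. This is an illustrative Python sketch (TaylorDiff.jl is written in Julia; the `Dual` class and the `exp` overload here are hypothetical names, not the package's API):

```python
import math

class Dual:
    """First-order dual number x0 + x1*eps, with eps^2 = 0."""
    def __init__(self, x0, x1=0.0):
        self.x0 = x0  # primal value
        self.x1 = x1  # sensitivity (coefficient of eps)

def exp(d):
    # Method overload for the primitive exp:
    # f(x0 + x1*eps) = f(x0) + f'(x0)*x1*eps, and exp' = exp
    e = math.exp(d.x0)
    return Dual(e, e * d.x1)

# Seeding the sensitivity x1 = 1 makes d.x1 the derivative of exp at 1.0
d = exp(Dual(1.0, 1.0))
```

Seeding `x1 = 1` corresponds to choosing the direction $v$ in the JVP notation above.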

Before deriving pushforward rules, let's first introduce several basic properties of polynomials.

If $x(t)$ and $y(t)$ are both truncated Taylor polynomials, i.e.

```math
\begin{aligned}
x&=x_0+x_1t+\cdots+x_pt^p\\
y&=y_0+y_1t+\cdots+y_pt^p
\end{aligned}
```

Then it's obvious that the polynomial addition and subtraction should be

```math
(x\pm y)_k=x_k\pm y_k
```

And with some derivation we can also get the polynomial multiplication rule

```math
(x\times y)_k=\sum_{i=0}^kx_iy_{k-i}
```

The polynomial division rule is less obvious, but if $x/y=z$, then equivalently $x=yz$, i.e.

```math
\left(\sum_{i=0}^py_it^i\right)\left(\sum_{i=0}^pz_it^i\right)=\sum_{i=0}^px_it^i
```

If we relate the coefficient of $t^k$ on both sides, we get

```math
\sum_{i=0}^k z_iy_{k-i}=x_k
```

so, equivalently,

```math
z_k=\frac1{y_0}\left(x_k-\sum_{i=0}^{k-1}z_iy_{k-i}\right)
```

This is a recurrence relation, which means that we can first get $z_0=x_0/y_0$, then get $z_1$ using $z_0$, then get $z_2$ using $z_0$ and $z_1$, etc.
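
These coefficient rules translate directly into code. Below is a minimal Python sketch operating on coefficient lists `[x0, x1, ..., xp]` (illustrative only; the function names are hypothetical and this is not TaylorDiff.jl's implementation):

```python
def poly_mul(x, y):
    """Truncated product: (x*y)_k = sum_{i=0}^{k} x_i * y_{k-i}."""
    p = len(x) - 1
    return [sum(x[i] * y[k - i] for i in range(k + 1)) for k in range(p + 1)]

def poly_div(x, y):
    """Truncated quotient z = x/y via the recurrence
    z_k = (x_k - sum_{i=0}^{k-1} z_i * y_{k-i}) / y_0."""
    p = len(x) - 1
    z = []
    for k in range(p + 1):
        z.append((x[k] - sum(z[i] * y[k - i] for i in range(k))) / y[0])
    return z
```

Dividing and then multiplying back recovers the original coefficients up to order $p$, which is a convenient sanity check on the recurrence.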

## Pushforward rule for elementary functions

Let's now consider how to derive the pushforward rule for elementary functions.
We will use $\exp$ and $\log$ as two examples.

If $x(t)$ is a polynomial and we want to get $e(t)=\exp(x(t))$, we can actually get that by formulating an ordinary differential equation:

```math
e'(t)=\exp(x(t))x'(t);\quad e_0=\exp(x_0)
```

If we expand both $e$ and $x$ in the equation, we will get

```math
\sum_{i=1}^pie_it^{i-1}=\left(\sum_{i=0}^{p-1} e_it^i\right)\left(\sum_{i=1}^pix_it^{i-1}\right)
```

Relating the coefficient of $t^{k-1}$ on both sides, we get

```math
ke_k=\sum_{i=0}^{k-1}e_i\times (k-i)x_{k-i}
```

This is, again, a recurrence relation, so we can get $e_1,\cdots,e_p$ step-by-step.
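
As a concrete sketch of this recurrence (Python for illustration; `exp_taylor` is a hypothetical name, not TaylorDiff.jl's API):

```python
import math

def exp_taylor(x):
    """Coefficients of exp(x(t)) for a truncated polynomial x = [x0, ..., xp],
    using e_0 = exp(x_0) and k*e_k = sum_{i=0}^{k-1} e_i * (k-i) * x_{k-i}."""
    p = len(x) - 1
    e = [math.exp(x[0])]
    for k in range(1, p + 1):
        e.append(sum(e[i] * (k - i) * x[k - i] for i in range(k)) / k)
    return e
```

For $x(t)=t$ this reproduces the Taylor coefficients of $e^t$, namely $1/k!$.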

If $x(t)$ is a polynomial and we want to get $l(t)=\log(x(t))$, we can actually get that by formulating an ordinary differential equation:

```math
l'(t)=\frac1{x(t)}x'(t);\quad l_0=\log(x_0)
```

If we expand both $l$ and $x$ in the equation, the RHS is simply a polynomial division, and we get

```math
l_k=\frac1{x_0}\left(x_k-\frac1k\sum_{i=1}^{k-1}il_ix_{k-i}\right)
```
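
The same recipe in code (again an illustrative Python sketch with a hypothetical name):

```python
import math

def log_taylor(x):
    """Coefficients of log(x(t)) for x = [x0, ..., xp] with x0 > 0, using
    l_0 = log(x_0) and l_k = (x_k - (1/k) * sum_{i=1}^{k-1} i*l_i*x_{k-i}) / x_0."""
    p = len(x) - 1
    l = [math.log(x[0])]
    for k in range(1, p + 1):
        l.append((x[k] - sum(i * l[i] * x[k - i] for i in range(1, k)) / k) / x[0])
    return l
```

For $x(t)=1+t$ this gives the familiar $\log(1+t)=t-t^2/2+t^3/3-\cdots$.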

---

Now notice the difference between the rules for $\exp$ and $\log$: the derivative of the exponential is itself, so we can obtain it from a recurrence relation; the derivative of the logarithm is $1/x$, an algebraic expression in $x$, so it can be computed directly.
Similarly, we have $(\tan x)'=1+\tan^2x$, but $(\arctan x)'=(1+x^2)^{-1}$. We summarize (omitting the proof) that

- The derivative of every $\exp$-like function (such as $\sin$, $\cos$, $\tan$, $\sinh$, ...) is somehow recursive
- The derivative of every $\log$-like function (such as $\arcsin$, $\arccos$, $\arctan$, $\operatorname{arcsinh}$, ...) is algebraic

So all of the elementary functions have an easy pushforward rule that can be computed within $O(p^2)$ time.
Note that this is an elegant and straightforward corollary of the definition of "elementary function" in differential algebra.

## Generic pushforward rule

For a generic $f(x)$, if we don't bother deriving the specific recurrence rule for it, we can still automatically generate a pushforward rule in the following manner.
Let's denote the derivative of $f$ w.r.t. $x$ by $d(x)$; then for $f(t)=f(x(t))$ we have

```math
f'(t)=d(x(t))x'(t);\quad f(0)=f(x_0)
```

When we expand $f$ and $x$ up to order $p$ in this equation, we notice that only order $p-1$ is needed for $d(x(t))$.
In other words, we turn the problem of finding the $p$-th order pushforward for $f$ into the problem of finding the $(p-1)$-th order pushforward for $d$, and we can recurse down to first order.
The first-order derivative expressions are captured from ChainRules.jl, which makes this process fully automatic.
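
This recursion can be sketched at runtime as follows (TaylorDiff.jl performs the equivalent generation at compile time; the `Primitive` class and all names below are hypothetical, with the derivative chain hard-coded for $\sin$ instead of being captured from ChainRules.jl):

```python
import math

class Primitive:
    """A primitive paired with a rule producing its first derivative,
    standing in for what ChainRules.jl would provide."""
    def __init__(self, value, derivative_rule):
        self.value = value                      # x -> f(x)
        self.derivative_rule = derivative_rule  # () -> Primitive for f'

def pushforward(f, x):
    """Coefficients of f(x(t)) up to the order of x, using
    f'(t) = d(x(t)) * x'(t): d(x(t)) is only needed at order p-1,
    so we recurse on the derivative at one order lower."""
    p = len(x) - 1
    out = [f.value(x[0])]
    if p == 0:
        return out
    d = pushforward(f.derivative_rule(), x[:p])      # order p-1 suffices
    xprime = [(i + 1) * x[i + 1] for i in range(p)]  # coefficients of x'(t)
    for k in range(1, p + 1):
        # k * f_k = coefficient of t^(k-1) in d(x(t)) * x'(t)
        out.append(sum(d[i] * xprime[k - 1 - i] for i in range(k)) / k)
    return out

# The sin/cos derivative chain closes on itself after four steps
sin_p = Primitive(math.sin, lambda: cos_p)
cos_p = Primitive(math.cos, lambda: neg_sin_p)
neg_sin_p = Primitive(lambda v: -math.sin(v), lambda: neg_cos_p)
neg_cos_p = Primitive(lambda v: -math.cos(v), lambda: sin_p)
```

For $x(t)=t$ this recovers the Taylor coefficients of $\sin t$.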

This strategy is in principle equivalent to nesting first-order differentiation, which could potentially lead to exponential scaling; however, in practice there is a huge difference.
The generation of the pushforward rule happens at **compile time**, which gives the compiler a chance to eliminate redundant expressions and optimize the rule down to quadratic time.
The compiler has stack limits, but this should work at least up to order 100.

In the current implementation of TaylorDiff.jl, all $\log$-like functions' pushforward rules are generated by this strategy, since their derivatives are simple algebraic expressions; some $\exp$-like functions, like $\sinh$, are also generated; the several most-often-used $\exp$-like functions are hand-written with hand-derived recurrence relations.
