Gradient Computation With PyTorch
This is a minimal example of working with gradients in PyTorch:
\begin{alignat*}{3} f(x) &= \theta x &&&&\textsf{We want to predict } y=f(x) \\ E &= (f(x) - y)^2 &&&&\textsf{under this squared-error function.} \\ \frac{\mathrm{d}}{\mathrm{d}\mathrm{\theta}} E &= 2x (f(x) - y) \\ &= 2\cdot2 (2-3) &&= -4 \quad && \textsf{for } \theta=1, x=2, y=3 \\ \theta &\leftarrow \theta - \alpha \frac{\mathrm{d}}{\mathrm{d}\mathrm{\theta}} E &&&& \textsf{Then we'll take this gradient-descent step} \\ &= 1 - 0.1 \cdot -4 &&= 1.4 \quad && \textsf{for } \alpha=0.1 \textsf{ and our above values.} \end{alignat*}
We will now compute the gradient and update $\theta$ in four different ways, using increasingly higher-level PyTorch facilities. The gradient and the updated $\theta$ come out the same each time.
1. Without PyTorch
First we just literally implement the above math in bare-bones Python:
θ_0 = 1.0
θ = θ_0
x = 2.0
y = 3.0
α = 0.1
fx_y = θ * x - y
E = fx_y * fx_y
θgrad = 2 * x * fx_y
θ -= α * θgrad
print(f'{E=}')
print(f'{θgrad=}')
print(f'{θ=}')
E=1.0
θgrad=-4.0
θ=1.4
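As a sanity check on the hand-derived formula, the gradient can also be approximated numerically. The following is a minimal sketch using a central finite difference; the helper name `E_of` and the step size `h` are assumptions made for this illustration and are not part of the original example.

```python
# Central finite-difference check of dE/dθ at θ=1, x=2, y=3.
def E_of(theta, x=2.0, y=3.0):
    """Squared error of the linear model f(x) = theta * x."""
    return (theta * x - y) ** 2

theta0 = 1.0
h = 1e-6  # step size; an arbitrary small value chosen for this sketch

# Numerical approximation: (E(θ+h) - E(θ-h)) / 2h
numeric_grad = (E_of(theta0 + h) - E_of(theta0 - h)) / (2 * h)
# Hand-derived formula from above: 2x (f(x) - y)
analytic_grad = 2 * 2.0 * (theta0 * 2.0 - 3.0)

print(f'{numeric_grad=:.6f}')
print(f'{analytic_grad=}')
```

Because $E$ is quadratic in $\theta$, the central difference agrees with the analytic value up to floating-point rounding.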
2. Using torch.autograd.backward()
Now, instead of computing $\frac{\mathrm{d}}{\mathrm{d}\mathrm{\theta}} E$ by hand, we recruit PyTorch's autograd facility by calling backward() on the error function to compute all of its partial derivatives. To this end, every variable with respect to which we want a derivative must become a torch.Tensor with requires_grad=True:
import torch
θ = torch.tensor([θ_0], requires_grad=True) # initial gradient = None
fx_y = θ * x - y
E = fx_y * fx_y
E.backward()
with torch.no_grad(): # don't backpropagate through ...
    θ -= α * θ.grad # ... the parameter update
print(f'{E=}')
print(f'{θ.grad=}')
print(f'{θ=}')
E=tensor([1.], grad_fn=<MulBackward0>)
θ.grad=tensor([-4.])
θ=tensor([1.4000], requires_grad=True)
It is important to understand exactly what E.backward() does: for every tensor $p$ with requires_grad=True that was used in computing $E$, it computes $\frac{\mathrm{d}}{\mathrm{d}p} E$ and accumulates the numerical result into p.grad (setting it if it was None, adding to it otherwise).
In our example, the only such parameter is $\theta$.
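To illustrate the behavior of backward() with more than one parameter, here is a minimal sketch that adds a hypothetical bias term `b` to the model. The bias and its initial value 0.5 are assumptions for this illustration only and are not part of the original example. The sketch also shows that a second backward() call adds to the gradients already stored.

```python
import torch

# Hypothetical model f(x) = theta * x + b; the bias b is an assumption
# added for illustration and does not appear in the original example.
theta = torch.tensor([1.0], requires_grad=True)
b = torch.tensor([0.5], requires_grad=True)
x, y = 2.0, 3.0  # same data as above

E = (theta * x + b - y) ** 2
E.backward()
# residual = 1*2 + 0.5 - 3 = -0.5
# dE/dtheta = 2 * x * residual = -2,  dE/db = 2 * residual = -1
first_theta_grad = theta.grad.clone()
first_b_grad = b.grad.clone()
print(f'{first_theta_grad=}, {first_b_grad=}')

# Recompute E and call backward() again without resetting the gradients:
# the new gradients are added to the ones already stored in .grad.
E = (theta * x + b - y) ** 2
E.backward()
print(f'{theta.grad=}, {b.grad=}')
```

After the second call, both gradients are twice their first values. This accumulation is why gradients are normally reset between update steps.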
3. Additionally Using a torch.optim.Optimizer
Next, instead of computing the parameter update by hand, we recruit a PyTorch optimizer:
θ = torch.tensor([θ_0], requires_grad=True) # initial gradient = None
optimizer = torch.optim.SGD([θ], lr=α) # plain SGD with learning rate α
fx_y = θ * x - y
E = fx_y * fx_y
optimizer.zero_grad()
E.backward()
optimizer.step()
print(f'{E=}')
print(f'{θ.grad=}')
print(f'{θ=}')
E=tensor([1.], grad_fn=<MulBackward0>)
θ.grad=tensor([-4.])
θ=tensor([1.4000], requires_grad=True)
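The single step above generalizes to a training loop, which is where optimizer.zero_grad() starts to matter: without it, each backward() call would add to the gradient left over from the previous iteration. The following is a minimal sketch; the number of iterations (5) is an arbitrary choice for this illustration.

```python
import torch

theta = torch.tensor([1.0], requires_grad=True)
x, y = 2.0, 3.0  # same data as above
optimizer = torch.optim.SGD([theta], lr=0.1)

for step in range(5):  # 5 iterations; an arbitrary choice for this sketch
    optimizer.zero_grad()      # discard the gradient of the previous step
    E = (theta * x - y) ** 2   # forward pass with the current theta
    E.backward()               # compute dE/dtheta into theta.grad
    optimizer.step()           # theta <- theta - lr * theta.grad
    print(f'{step=} theta={theta.item():.5f} E={E.item():.5f}')
```

With these values each step maps $\theta$ to $0.2\,\theta + 1.2$, so $\theta$ approaches $y/x = 1.5$, the value at which the error is zero.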
4. Additionally Using torch.nn.MSELoss
Finally, instead of computing our mean squared error function $E$ by hand, we use its PyTorch implementation. For this to work, the data $x$ and $y$ we carry around must now also be torch.Tensors:
x = torch.tensor([x])
y = torch.tensor([y])
θ = torch.tensor([θ_0], requires_grad=True) # initial gradient = None
optimizer = torch.optim.SGD([θ], lr=α) # plain SGD with learning rate α
loss = torch.nn.MSELoss()(θ * x, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f'{loss=}')
print(f'{θ.grad=}')
print(f'{θ=}')
loss=tensor(1., grad_fn=<MseLossBackward0>)
θ.grad=tensor([-4.])
θ=tensor([1.4000], requires_grad=True)
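With a single data point, the mean of the squared errors equals the squared error itself, which is why the result matches the previous sections. With more data the default reduction of MSELoss averages over all elements, which scales the gradient accordingly. The following is a minimal sketch; the second data point (4, 5) is a made-up value for this illustration and is not part of the original example.

```python
import torch

theta = torch.tensor([1.0], requires_grad=True)
# Two data points; the second one (4, 5) is a hypothetical addition.
xs = torch.tensor([2.0, 4.0])
ys = torch.tensor([3.0, 5.0])

mse = torch.nn.MSELoss()                 # default: reduction='mean'
sse = torch.nn.MSELoss(reduction='sum')  # sum of squared errors instead

loss_mean = mse(theta * xs, ys)  # ((-1)^2 + (-1)^2) / 2 = 1
loss_sum = sse(theta * xs, ys)   # (-1)^2 + (-1)^2 = 2

# torch.autograd.grad returns the gradients directly instead of
# accumulating them into theta.grad.
(grad_mean,) = torch.autograd.grad(loss_mean, theta)
(grad_sum,) = torch.autograd.grad(loss_sum, theta)

print(f'{loss_mean=}, {grad_mean=}')
print(f'{loss_sum=}, {grad_sum=}')
```

Here the summed gradient is $2\cdot2\cdot(-1) + 2\cdot4\cdot(-1) = -12$ and the mean-reduced gradient is half of that, so the choice of reduction interacts with the learning rate.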