Gradient Computation With PyTorch

This is a minimal example of working with gradients in PyTorch:

\begin{alignat*}{3}
f(x) &= \theta x &&&&\textsf{We want to predict } y=f(x) \\
E &= (f(x) - y)^2 &&&&\textsf{under this squared-error function.} \\
\frac{\mathrm{d}}{\mathrm{d}\mathrm{\theta}} E &= 2x (f(x) - y) \\
&= 2\cdot2 (2-3) &&= -4 \quad && \textsf{for } \theta=1, x=2, y=3 \\
\theta &\leftarrow \theta - \alpha \frac{\mathrm{d}}{\mathrm{d}\mathrm{\theta}} E &&&& \textsf{Then we'll take this gradient-descent step} \\
&= 1 - 0.1 \cdot (-4) &&= 1.4 \quad && \textsf{for } \alpha=0.1 \textsf{ and our above values.}
\end{alignat*}

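We will now compute the gradient and update $\theta$ in four different ways, starting with plain Python and then using increasingly higher-level PyTorch facilities. The result is the same every time (the PyTorch versions compute in 32-bit floating point, so they agree with the plain-Python numbers up to rounding).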

1. Without PyTorch

First we just literally implement the above math in bare-bones Python:

In [1]:
θ_0 = 1.0

θ = θ_0
x = 2.0
y = 3.0
α = 0.1

fx_y = θ * x - y
E = fx_y * fx_y
θgrad = 2 * x * fx_y
θ -= α * θgrad

print(f'{E=}')
print(f'{θgrad=}')
print(f'{θ=}')
E=1.0
θgrad=-4.0
θ=1.4
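
As a sanity check on the hand-derived gradient, it can be compared against a numerical estimate. The following is a minimal sketch, not part of the original walkthrough: the helper `error()`, the names `theta0`, `x0`, `y0`, and the step size `h` are all assumptions chosen for illustration.

```python
# Compare the analytic gradient dE/dθ = 2x(θx - y) with a central
# finite difference. All names here are illustrative, not from the text.
def error(theta, x, y):
    """Squared error E = (θx - y)^2."""
    return (theta * x - y) ** 2

theta0, x0, y0 = 1.0, 2.0, 3.0   # same values as in the example above
h = 1e-6                         # assumed small step for the finite difference

analytic = 2 * x0 * (theta0 * x0 - y0)
numeric = (error(theta0 + h, x0, y0) - error(theta0 - h, x0, y0)) / (2 * h)

print(f'{analytic=}')
print(f'{numeric=:.6f}')
```

Because $E$ is quadratic in $\theta$, the central difference agrees with the analytic value up to floating-point rounding.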

2. Using torch.autograd.backward()

Now, instead of computing $\frac{\mathrm{d}}{\mathrm{d}\mathrm{\theta}} E$ by hand, we recruit PyTorch's autograd facility by calling backward() on the error $E$ to compute all of its partial derivatives. To this end, every variable with respect to which we want a derivative must be a torch.Tensor created with requires_grad=True:

In [2]:
import torch

θ = torch.tensor([θ_0], requires_grad=True) # initial gradient = None

fx_y = θ * x - y
E = fx_y * fx_y
E.backward()
with torch.no_grad():           # don't backpropagate through ...
    θ -= α * θ.grad             # ... the parameter update

print(f'{E=}')
print(f'{θ.grad=}')
print(f'{θ=}')
E=tensor([1.], grad_fn=<MulBackward0>)
θ.grad=tensor([-4.])
θ=tensor([1.4000], requires_grad=True)

It is important to understand what exactly E.backward() does: For all parameters $p$ with requires_grad=True that were used when $E$ was computed, it computes $\frac{\mathrm{d}}{\mathrm{d}p} E$ and accumulates the numerical result into p.grad. If p.grad is still None, the result is stored there; otherwise it is added to the value already present.

In our example, the only such parameter is $\theta$.
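
One consequence worth seeing in action is that calling backward() a second time adds to the stored gradient instead of overwriting it. The sketch below uses the same numbers as above; the helper `squared_error()` and the names `first`, `second`, and `third` are assumptions made for illustration.

```python
import torch

theta = torch.tensor([1.0], requires_grad=True)

def squared_error(theta):
    """E = (θ·2 - 3)^2 with the example's x=2, y=3 hard-coded."""
    fx_y = theta * 2.0 - 3.0
    return fx_y * fx_y

squared_error(theta).backward()
first = theta.grad.clone()        # gradient after one backward() call

squared_error(theta).backward()   # a second call adds to theta.grad
second = theta.grad.clone()

theta.grad.zero_()                # reset by hand before the next backward()
third = theta.grad.clone()

print(f'{first=}')
print(f'{second=}')
print(f'{third=}')
```

With these values the first call leaves $-4$ in theta.grad and the second call leaves $-8$, so the gradient has to be reset between update steps.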

3. Additionally Using a torch.optim.Optimizer

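Next, instead of computing the parameter update by hand, we recruit a PyTorch optimizer. Its zero_grad() method clears the gradients of the parameters it manages, and step() applies the update: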

In [3]:
θ = torch.tensor([θ_0], requires_grad=True) # initial gradient = None
optimizer = torch.optim.SGD([θ], lr=α)      # plain SGD with learning rate α

fx_y = θ * x - y
E = fx_y * fx_y
optimizer.zero_grad()
E.backward()
optimizer.step()

print(f'{E=}')
print(f'{θ.grad=}')
print(f'{θ=}')
E=tensor([1.], grad_fn=<MulBackward0>)
θ.grad=tensor([-4.])
θ=tensor([1.4000], requires_grad=True)
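
In practice the zero_grad() / backward() / step() triple is repeated many times. The following is a minimal sketch of such a loop for the same one-parameter problem; the iteration count of 20 and the names `x_val`, `y_val`, and `first_step` are assumptions chosen for illustration.

```python
import torch

x_val, y_val = 2.0, 3.0                      # same data as in the example
theta = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)

first_step = None
for step in range(20):                       # assumed number of iterations
    optimizer.zero_grad()                    # clear the previous gradient
    fx_y = theta * x_val - y_val
    E = fx_y * fx_y
    E.backward()                             # compute dE/dθ into theta.grad
    optimizer.step()                         # θ ← θ - lr · θ.grad
    if step == 0:
        first_step = theta.item()            # value after the first update

print(f'{first_step=:.4f}')
print(f'{theta.item()=:.4f}')
```

The first iteration reproduces the single step shown above. With these values each update maps $\theta$ to $0.2\,\theta + 1.2$, so the iterates approach $y/x = 1.5$, the value at which the error vanishes.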

4. Additionally Using torch.nn.MSELoss

Finally, instead of computing our mean squared error function $E$ by hand, we use its PyTorch implementation. For this to work, the data $x$ and $y$ must now also be torch.Tensors:

In [4]:
x = torch.tensor([x])
y = torch.tensor([y])
θ = torch.tensor([θ_0], requires_grad=True) # initial gradient = None
optimizer = torch.optim.SGD([θ], lr=α)      # plain SGD with learning rate α

loss = torch.nn.MSELoss()(θ * x, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f'{loss=}')
print(f'{θ.grad=}')
print(f'{θ=}')
loss=tensor(1., grad_fn=<MseLossBackward0>)
θ.grad=tensor([-4.])
θ=tensor([1.4000], requires_grad=True)
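
With a single data point, the mean in "mean squared error" has no visible effect, which is why the loss equals $E$ from the earlier sections. The sketch below illustrates the averaging on a small batch; the data in `xs` and `ys` are made up for this illustration, and `by_hand` and `grad_by_hand` are hypothetical names for the manual reference values.

```python
import torch

xs = torch.tensor([1.0, 2.0, 3.0])    # hypothetical inputs
ys = torch.tensor([1.5, 3.0, 4.5])    # hypothetical targets (1.5 * xs)
theta = torch.tensor([1.0], requires_grad=True)

# MSELoss with its default settings averages the squared differences.
loss = torch.nn.MSELoss()(theta * xs, ys)
by_hand = ((theta * xs - ys) ** 2).mean()

loss.backward()
# The gradient of the mean is the mean of the per-sample gradients 2x(θx - y).
grad_by_hand = (2 * xs * (theta.detach() * xs - ys)).mean()

print(f'{loss.item()=:.4f}')
print(f'{by_hand.item()=:.4f}')
print(f'{theta.grad.item()=:.4f}')
print(f'{grad_by_hand.item()=:.4f}')
```

Both pairs of values should agree, showing that the loss and its gradient are averages over the batch.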