This is a minimal example of working with gradients in PyTorch:
\begin{alignat}{2} f(x) &= \theta x &&\textsf{We want to predict } y=f(x) \\ E &= (f(x) - y)^2 &&\textsf{under this squared-error function.} \\ \frac{\mathrm{d}}{\mathrm{d}\theta} E &= 2x (f(x) - y) \\ &= 2\cdot2 (2-3) &= -4 \quad & \textsf{for } \theta=1, x=2, y=3 \\ \theta &\leftarrow \theta - \alpha \frac{\mathrm{d}}{\mathrm{d}\theta} E && \textsf{Then we'll take this gradient-descent step} \\ &= 1 - 0.1 \cdot (-4) &= 1.4 \quad & \textsf{for } \alpha=0.1 \textsf{ and our above values.} \end{alignat}
import torch
θ_0 = 1.0
x = torch.tensor([2.0]) # requires_grad=False
y = torch.tensor([3.0]) # requires_grad=False
α = torch.tensor([0.1]) # requires_grad=False
We will now compute the gradient and update $\theta$ in three different ways, using increasingly high-level PyTorch facilities. The result will always be exactly the same.
torch.Tensors
We compute everything by hand and use backward() on the error function to compute (all) its (partial) derivative(s).
θ = torch.tensor([θ_0], requires_grad=True) # initial gradient = None
fx_y = θ * x - y
E = fx_y * fx_y
E.backward()
print(f'{E=}')
print(f'{θ.grad=}')
with torch.no_grad(): # don't backpropagate through ...
θ -= α * θ.grad # ... the parameter update
print(f'{θ=}')
E=tensor([1.], grad_fn=<MulBackward0>)
θ.grad=tensor([-4.])
θ=tensor([1.4000], requires_grad=True)
It is important to understand what exactly E.backward() does: for all parameters $p$ with requires_grad=True that were used when $E$ was computed, it computes $\frac{\mathrm{d}}{\mathrm{d}p} E$ and stores the numerical result in p.grad. In our example, the only such parameter is $\theta$.
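One detail worth knowing before the next section: backward() does not overwrite p.grad, it accumulates into it, so a second call adds the new gradient to whatever is already stored. A minimal sketch of this behaviour, reusing the same θ, x, and y as above:

```python
import torch

θ = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([2.0])
y = torch.tensor([3.0])

# First backward pass: θ.grad is set to dE/dθ = 2x(θx - y) = -4.
E = (θ * x - y) ** 2
E.backward()
print(θ.grad)  # tensor([-4.])

# Second backward pass: the new gradient is *added*, giving -8.
E = (θ * x - y) ** 2
E.backward()
print(θ.grad)  # tensor([-8.])

# Reset the accumulator before the next pass.
θ.grad.zero_()
```

This is why the optimizer-based variants below call zero_grad() before every backward().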
torch.optim.Optimizer
The (only) role of the optimizer is to update the parameters $p$ (as defined above), here $\theta$.
θ = torch.tensor([θ_0], requires_grad=True) # initial gradient = None
optimizer = torch.optim.SGD([θ], α.item())
fx_y = θ * x - y
E = fx_y * fx_y
optimizer.zero_grad()
E.backward()
optimizer.step()
print(f'{E=}')
print(f'{θ.grad=}')
print(f'{θ=}')
E=tensor([1.], grad_fn=<MulBackward0>)
θ.grad=tensor([-4.])
θ=tensor([1.4000], requires_grad=True)
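For plain SGD, optimizer.step() performs exactly the hand-written update θ ← θ − α·θ.grad from the first variant. A small sketch comparing the two; here the gradient is filled in by hand, purely for illustration:

```python
import torch

α = 0.1
θ = torch.tensor([1.0], requires_grad=True)
θ.grad = torch.tensor([-4.0])  # pretend backward() just ran

# The hand-written update rule from the first variant.
manual = θ.detach() - α * θ.grad

# The optimizer's update on the same state.
torch.optim.SGD([θ], α).step()

print(θ)       # tensor([1.4000], requires_grad=True)
print(manual)  # tensor([1.4000])
```

Both arrive at 1 − 0.1·(−4) = 1.4; the optimizer merely packages this update (and bookkeeping like zero_grad()) behind a uniform interface.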
torch.nn.MSELoss
Here, this loss function computes the same thing as our error function $E$.
θ = torch.tensor([θ_0], requires_grad=True) # initial gradient = None
optimizer = torch.optim.SGD([θ], α.item())
loss = torch.nn.MSELoss()(θ * x, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f'{loss=}')
print(f'{θ.grad=}')
print(f'{θ=}')
loss=tensor(1., grad_fn=<MseLossBackward0>)
θ.grad=tensor([-4.])
θ=tensor([1.4000], requires_grad=True)
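One caveat when moving beyond this single-sample example: MSELoss averages over all elements by default (reduction='mean'), so with several samples it yields the mean of the squared errors, not their sum. A quick sketch:

```python
import torch

# With a single element, MSELoss equals our squared error E exactly.
loss = torch.nn.MSELoss()(torch.tensor([2.0]), torch.tensor([3.0]))
print(loss)  # tensor(1.)

# With several elements it averages: ((2-3)² + (0-2)²) / 2 = 2.5
pred = torch.tensor([2.0, 0.0])
target = torch.tensor([3.0, 2.0])
loss = torch.nn.MSELoss()(pred, target)
print(loss)  # tensor(2.5000)
```

Pass reduction='sum' to torch.nn.MSELoss if you want the summed squared error instead.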