Gradient Computation with PyTorch

This is a minimal example of working with gradients in PyTorch:

\begin{alignat}{2}
f(x) &= \theta x &&\textsf{We want to predict } y=f(x) \\
E &= (f(x) - y)^2 \quad &&\textsf{under this squared-error function.} \\
\frac{\mathrm{d}}{\mathrm{d}\theta} E &= 2x \, (f(x) - y) \\
&= 2\cdot2\,(2-3) = -4 \quad &&\textsf{for } \theta=1,\ x=2,\ y=3 \\
\theta &\leftarrow \theta - \alpha \frac{\mathrm{d}}{\mathrm{d}\theta} E &&\textsf{Then we'll take this gradient-descent step} \\
&= 1 - 0.1 \cdot (-4) = 1.4 \quad &&\textsf{for } \alpha=0.1 \textsf{ and our above values.}
\end{alignat}

We will now compute the gradient and update $\theta$ in three different ways, using increasingly higher-level PyTorch facilities. The result will always be exactly the same.

Using just torch.Tensors

We compute everything by hand and use backward() on the error function to compute (all) its (partial) derivative(s).

It is important to understand what exactly E.backward() does: for every parameter $p$ with requires_grad=True that was used when $E$ was computed, it computes $\frac{\mathrm{d}}{\mathrm{d}p} E$ and accumulates the numerical result into p.grad. (Gradients add up across calls, which is why p.grad must be reset to zero between steps.)

In our example, the only such parameter is $\theta$.
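A minimal sketch of this hand-rolled version, plugging in the values $\theta=1$, $x=2$, $y=3$, $\alpha=0.1$ from the derivation above:

```python
import torch

# Only theta is a parameter in the sense above; x and y are plain data.
theta = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0)
y = torch.tensor(3.0)
alpha = 0.1

E = (theta * x - y) ** 2   # forward pass: E = (f(x) - y)^2
E.backward()               # fills theta.grad with dE/dtheta = -4

with torch.no_grad():      # the update itself must not be tracked by autograd
    theta -= alpha * theta.grad
theta.grad.zero_()         # reset the accumulated gradient for the next step

print(theta.item())        # ~1.4, matching the hand calculation
```

The torch.no_grad() block is needed because theta is a leaf tensor that requires a gradient; updating it in place outside such a block would raise an error.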

Using a torch.optim.Optimizer

The (only) role of the optimizer is to update the parameters $p$ (as defined above) from the gradients stored in their .grad fields; here the only such parameter is $\theta$.
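A sketch of the same step with torch.optim.SGD, whose plain form (no momentum) performs exactly the update $\theta \leftarrow \theta - \alpha \cdot \theta\textsf{.grad}$:

```python
import torch

theta = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0)
y = torch.tensor(3.0)

# lr plays the role of alpha from the derivation above
optimizer = torch.optim.SGD([theta], lr=0.1)

E = (theta * x - y) ** 2
E.backward()           # as before: computes theta.grad = -4
optimizer.step()       # replaces our manual in-place subtraction
optimizer.zero_grad()  # replaces theta.grad.zero_()

print(theta.item())    # ~1.4 again
```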

Additionally using torch.nn.MSELoss

Here, this loss function computes the same thing as our error function $E$: for a single scalar prediction, the mean squared error reduces to $(f(x) - y)^2$.
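Swapping in torch.nn.MSELoss for the hand-written squared error, the sketch becomes:

```python
import torch

theta = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0)
y = torch.tensor(3.0)

optimizer = torch.optim.SGD([theta], lr=0.1)
loss_fn = torch.nn.MSELoss()  # for a single scalar this is exactly (f(x) - y)^2

E = loss_fn(theta * x, y)  # same value as (theta * x - y) ** 2
E.backward()
optimizer.step()
optimizer.zero_grad()

print(theta.item())        # ~1.4, identical to the two versions above
```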