PyTorch 04: Autograd

Introduction

Based on PyTorch Autograd Tutorial.

Gradient Descent and Back Propagation

When training neural networks, the most frequently used algorithm to minimize the training loss is stochastic gradient descent (SGD).

The gradient of the loss function with respect to the parameters points in the direction of steepest increase, evaluated at the current parameter values.

So to decrease the loss, we move in the opposite direction of the gradient, which is the direction of steepest decrease.

During training, we update the parameters, \(\theta\), by the negative of the gradient, scaled by some learning rate \(\eta\):

\[ \theta \leftarrow \theta - \eta \nabla_\theta J(\theta) \]
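As a concrete illustration, here is a minimal sketch of one such update on a toy loss, done by hand rather than with PyTorch's optimizer classes (torch.no_grad(), covered later in this section, keeps the update step itself out of gradient tracking):

import torch

eta = 0.1                                    # learning rate
theta = torch.randn(3, requires_grad=True)   # parameters

loss = (theta ** 2).sum()   # a toy loss J(theta)
loss.backward()             # populates theta.grad with the gradient

with torch.no_grad():
    theta -= eta * theta.grad   # step against the gradient
theta.grad.zero_()              # reset before the next iteration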

The gradient can be calculated using the chain rule of calculus, by which the gradient of a composition of functions is the product of the gradients of the composed functions.

Say we have a function \(h(x)\) that is a composition of two functions, \(h(x) = f(g(x))\); then the gradient of \(h\) with respect to \(x\) is given by the chain rule:

\[ \frac{\partial h}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x} \]
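For instance, with \(g(u) = \sin(u)\) and \(f(g) = g^2\), the chain rule gives \(\frac{\partial h}{\partial u} = 2\sin(u)\cos(u)\). A quick autograd check of this (a minimal sketch using a throwaway scalar u):

import torch

u = torch.tensor(1.3, requires_grad=True)
h = torch.sin(u) ** 2    # h(u) = f(g(u)) with g(u) = sin(u), f(g) = g^2
h.backward()

analytic = (2 * torch.sin(u) * torch.cos(u)).detach()  # chain-rule result
print(torch.allclose(u.grad, analytic))  # True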

A neural network is a composition of many functions, e.g. linear layer, activation function, linear layer, etc.

Back propagation applies the chain rule step by step through this composition, computing the gradient of the loss with respect to every parameter.

Automatic Differentiation with torch.autograd

To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd.

It supports automatic computation of gradients for any computational graph.

Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following manner:

import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)  # weights
b = torch.randn(3, requires_grad=True)     # bias
z = torch.matmul(x, w) + b                 # logits
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

Here’s the calculation of \(z\) in matrix form (the numbers come from one random initialization; with these shapes, torch.matmul(x, w) corresponds to \(w^\top x\)):

\(\displaystyle \mathrm{z} = \mathrm{w}^\top\mathrm{x}+\mathrm{b} = \begin{bmatrix} 0.232 & -0.259 & 1.843 & -0.676 & 0.802 \\ 1.119 & -0.327 & -0.567 & -0.519 & 2.312 \\ -0.172 & -0.373 & -1.251 & 0.049 & 0.433 \\ \end{bmatrix} \begin{bmatrix} 1.000 \\ 1.000 \\ 1.000 \\ 1.000 \\ 1.000 \\ \end{bmatrix} + \begin{bmatrix} -0.306 \\ -0.188 \\ -2.382 \\ \end{bmatrix}\)
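As a quick sanity check (a minimal sketch reusing the tensors defined above), both forms produce the same \(z\):

# The row-vector form used in the code agrees with the
# column-vector form w^T x + b shown in the equation.
print(torch.allclose(torch.matmul(x, w) + b, w.T @ x + b))  # True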

Tensors, Functions and Computational graph

The above code defines the following computational graph:

[Computational graph: x and w feed matmul; its output plus b gives z; z and y feed CE, which produces loss]
Figure 1: Computational graph for the simple neural network

In this network, w and b are parameters, which we need to optimize.

Thus, we need to be able to compute the gradients of the loss function with respect to those variables.

Tensors with requires_grad

In order to do that, we set the requires_grad property of those tensors.

Note

You can set the value of requires_grad when creating a tensor, or later by using x.requires_grad_(True) method.

By default, new tensors are created with requires_grad=False.
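For example (a minimal sketch with a throwaway tensor t):

t = torch.zeros(3)
print(t.requires_grad)   # False: the default
t.requires_grad_(True)   # switch tracking on in place
print(t.requires_grad)   # True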

print(f"x: {x.requires_grad}, y: {y.requires_grad}, w: {w.requires_grad}, b: {b.requires_grad}")
x: False, y: False, w: True, b: True

Functions Support Back Propagation

Every operation applied to tensors to build the computational graph is in fact an object of class Function (torch.autograd.Function).

The Function object knows

  • how to compute the function in the forward direction, and in fact implements a forward method, and
  • how to compute its derivative during the backward propagation step, and in fact implements a backward method.

A reference to the backward propagation function is stored in the grad_fn property of a tensor.

You can find more information about Function in the documentation.

print(f"Gradient function for loss = {loss.grad_fn}")
print(f"Gradient function for z = {z.grad_fn}")
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7f6767d033a0>
Gradient function for z = <AddBackward0 object at 0x7f6765c81180>

Computing Gradients

To optimize the parameters of the neural network, we need to compute the derivatives of our loss function with respect to those parameters.

We need \(\frac{\partial loss}{\partial w}\) and \(\frac{\partial loss}{\partial b}\) under some fixed values of x and y.

To compute those derivatives, we call loss.backward(), which steps back through the computational graph, computing the gradients of the leaf nodes, which in this case are the parameters w and b.

We can retrieve the values from w.grad and b.grad:

loss.backward()
print(w.grad)
print(b.grad)
tensor([[0.2790, 0.2872, 0.0081],
        [0.2790, 0.2872, 0.0081],
        [0.2790, 0.2872, 0.0081],
        [0.2790, 0.2872, 0.0081],
        [0.2790, 0.2872, 0.0081]])
tensor([0.2790, 0.2872, 0.0081])
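We can sanity-check these values by hand. For binary_cross_entropy_with_logits with the default mean reduction, \(\frac{\partial loss}{\partial z_j} = (\sigma(z_j) - y_j)/3\), and since \(z = w^\top x + b\), the chain rule gives \(\frac{\partial loss}{\partial w_{ij}} = x_i\,(\sigma(z_j) - y_j)/3\). A minimal sketch of the check, reusing the tensors above:

with torch.no_grad():
    dz = (torch.sigmoid(z) - y) / y.numel()            # dloss/dz
    print(torch.allclose(w.grad, torch.outer(x, dz)))  # True
    print(torch.allclose(b.grad, dz))                  # True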

There’s a handy package called torchviz that can also render the computational graph.

from torchviz import make_dot
from graphviz import Source
from IPython.display import display

# Visualize the computational graph
dot = make_dot(loss, params={"w": w, "b": b})

# Render and display the graph inline
graph = Source(dot.source)
display(graph)

Of course, the graph quickly becomes too complex to read for any realistically sized network.

Warning

We can only obtain the grad properties for the leaf nodes of the computational graph, which have the requires_grad property set to True. For all other nodes in our graph, gradients will not be available. We can only perform gradient calculations using backward once on a given graph, for performance reasons. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call.
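For example (a minimal sketch rebuilding the graph above), a second backward call only succeeds if the first one retained the graph:

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward(retain_graph=True)  # keep the graph for another pass
loss.backward()                   # second call now works; gradients
                                  # from both calls accumulate in .grad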

Disabling Gradient Tracking

By default, all tensors with requires_grad=True track their computational history and support gradient computation.

However, there are some cases when we do not need to do that, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network.

We can stop tracking computations by surrounding our computation code with a torch.no_grad() block:

z = torch.matmul(x, w) + b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w) + b
print(z.requires_grad)
True
False

Another way to achieve the same result is to use the detach() method on the tensor:

z = torch.matmul(x, w) + b
z_det = z.detach()
print(z_det.requires_grad)
False

There are reasons you might want to disable gradient tracking:

  • To mark some parameters in your neural network as frozen parameters (see the sketch below).
  • To speed up computations when you are only doing the forward pass, because computations on tensors that do not track gradients are more efficient.
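For instance, here is a minimal sketch (using a small hypothetical model) of freezing one layer so that backward leaves its parameters untouched:

import torch

# Hypothetical two-layer model; layer 0 will be frozen.
model = torch.nn.Sequential(
    torch.nn.Linear(5, 3),
    torch.nn.ReLU(),
    torch.nn.Linear(3, 1),
)
for p in model[0].parameters():
    p.requires_grad_(False)   # mark the first layer as frozen

out = model(torch.ones(5))
out.backward()                # fine: out has a single element
print(model[0].weight.grad)   # None: no gradient for frozen parameters
print(model[2].weight.grad)   # populated as usual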

Recap: Back Propagation and Computational Graphs

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects.

In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

  • run the requested operation to compute a resulting tensor
  • maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root.

autograd then:

  • computes the gradients from each .grad_fn,
  • accumulates them in the respective tensor’s .grad attribute (see the sketch after this list), and
  • using the chain rule, propagates all the way to the leaf tensors.
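Because gradients accumulate in .grad, training loops typically reset them between iterations. A minimal sketch:

import torch

p = torch.randn(3, requires_grad=True)
for step in range(2):
    loss = (2 * p).sum()
    loss.backward()
    print(p.grad)    # [2., 2., 2.] after the first call, [4., 4., 4.] after the second

p.grad.zero_()       # typical reset before the next training iteration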
Important

An important thing to note is that the graph is dynamic: it is recreated from scratch at every iteration.

  • after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model
  • you can change the shape, size and operations at every iteration if needed.
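For example (a minimal sketch with a hypothetical parameter w), the recorded operations can differ from one forward pass to the next:

import torch

w = torch.randn(3, requires_grad=True)   # hypothetical parameter

def forward(x):
    h = w * x
    # Ordinary Python control flow: autograd records only the
    # branch that actually runs on this particular call.
    if h.sum() > 0:
        h = h.relu()
    else:
        h = h.tanh()
    return h.sum()

loss = forward(torch.ones(3))
loss.backward()   # builds and consumes the graph for this call only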

Further Reading
