# Automatic differentiation and Deep Learning basics, with pytorch

Author: Mathurin Massias


Pytorch (`torch`) has become the standard library for deep learning (DL), overtaking Tensorflow (see e.g. https://paperswithcode.com/trends). A more recent alternative, JAX, is being adopted quickly. This lab focuses on pytorch.

### Working on GPU
Training neural networks is much faster on GPU. If you want to experiment with GPUs, you can upload this notebook to Google Colab, which provides free GPU resources, and run it there:
- Go to https://colab.research.google.com
- Upload this notebook 
- Open it
- Navigate to Edit→Notebook Settings
- Select T4 GPU from the Hardware Accelerator list

You can check GPU availability with:
```
import torch
print(torch.cuda.is_available())
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```
The last line is standard torch practice: it lets the same code run on GPU when one is available, and fall back to CPU otherwise.


By default, models and tensors are stored on CPU. To move them to GPU, use:
```
my_tensor = my_tensor.to(device)
model.to(device)
```
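
As a minimal, self-contained sketch of the two lines above (the layer and the batch are hypothetical stand-ins, not part of the lab): `nn.Module.to` moves a module's parameters in place, whereas `Tensor.to` returns a new tensor, which is why the tensor has to be reassigned.

```python
import torch
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(3, 2)       # hypothetical tiny model, created on CPU
inputs = torch.randn(4, 3)    # hypothetical batch, created on CPU

layer.to(device)              # modules are moved in place
inputs = inputs.to(device)    # tensors are not: `.to` returns a new tensor

outputs = layer(inputs)       # both operands now live on the same device
print(outputs.device)
```

Mixing a CPU tensor with a GPU model raises a runtime error, so inputs and model must always be on the same device.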

## Pytorch basics

In [None]:
import torch 
import numpy as np
import matplotlib.pyplot as plt

Torch works with tensors, which are n-dimensional arrays. They work a lot like the famous numpy.ndarray.

In [None]:
x = torch.zeros(3, 5)  # like np.zeros
print(x.shape)
print(x)

In [None]:
torch.manual_seed(2406)  # equivalent of np.random.seed
y = torch.randn(3, 5, 2)  # like np.random.randn
print(y)

In [None]:
x[0, ::2]  # slicing like a np.array

In [None]:
z = torch.ones_like(x)  # like np.ones_like 
print(x - 2 * z)  # pointwise operations are supported
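
Since tensors behave so much like numpy arrays, it can help to see how the two interact. The sketch below uses made-up arrays: `torch.from_numpy` and `Tensor.numpy` convert between the two without copying (for CPU tensors), and broadcasting follows the numpy rules.

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)   # no copy: `t` and `a` share the same memory
t[0, 0] = 10.0            # so this write is also visible in `a`
print(a[0, 0])

back = t.numpy()          # back to numpy, again without copying

col = torch.ones(2, 1)
summed = t + col          # broadcasting: (2, 3) + (2, 1) -> (2, 3)
print(summed.shape)
```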

The key functionality of pytorch is its use of backpropagation (reverse-mode automatic differentiation) to compute gradients of functions built from differentiable tensor operations.

In [None]:
# with `requires_grad`, we tell torch that together with these tensors, we'll need to store gradients
A = torch.randn(6, 5, requires_grad=True)
b = torch.arange(6, requires_grad=True, dtype=torch.float32) 

x = torch.randn(5, requires_grad=True)

fun = 0.5 * ((A @ x - b) ** 2).sum()

In [None]:
fun.backward()  # reverse-mode AD: computes the gradient of `fun` with respect to every leaf tensor created with `requires_grad=True` (here A, b and x)

In [None]:
x.grad

In [None]:
A.T @ (A @ x - b)  # matches x.grad

We also have the gradient of `fun` with respect to `A` and `b`:

In [None]:
b.grad  # equals b - A @ x

In [None]:
b - A @ x
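
Rather than comparing printed values by eye, the two checks above can be done programmatically. The following is one possible sketch using `torch.allclose`; the seed and the tolerances are arbitrary choices, and the gradient with respect to `A` is deliberately left out since it is the subject of Q1.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, for reproducibility only
A = torch.randn(6, 5, requires_grad=True)
b = torch.arange(6, requires_grad=True, dtype=torch.float32)
x = torch.randn(5, requires_grad=True)

fun = 0.5 * ((A @ x - b) ** 2).sum()
fun.backward()

# Closed-form gradients; no graph needs to be recorded for them.
with torch.no_grad():
    residual = A @ x - b
    grad_x = A.T @ residual
    grad_b = -residual

ok_x = torch.allclose(x.grad, grad_x, rtol=1e-4, atol=1e-4)
ok_b = torch.allclose(b.grad, grad_b, rtol=1e-4, atol=1e-4)
print(ok_x, ok_b)
```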

Q1) On paper, compute the gradient of `fun` with respect to $A$ (identified by an $n \times d$ matrix). Compute it with pytorch and check numerically that it is equal to the value you found.

The `.backward()` function is of primary importance, as it computes in one go the gradient of a neural network's loss with respect to all the weights of the network.
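
To illustrate the "in one go" part, here is a small sketch on a hypothetical two-layer network with random data (none of these names come from the lab). Before the call, no parameter has a gradient; after a single `loss.backward()`, every parameter has a `.grad` of its own shape.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical small network and random data, for illustration only.
net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)

grads_before = [p.grad for p in net.parameters()]  # all None at this point

loss = ((net(inputs) - targets) ** 2).mean()
loss.backward()  # a single call fills `.grad` for every parameter

all_filled = all(
    p.grad is not None and p.grad.shape == p.shape for p in net.parameters()
)
print(all_filled)
```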

Q2) Generate a random $100 \times 200$ matrix $A$ and vectors $x$ and $b$ of adequate size and content, in order to compute the gradient of the logistic loss at $x$ with automatic differentiation.

Compare the time it takes to compute the gradient with autodiff and with the mathematical formula.
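
Q2 leaves the loss implicit. One common convention, assumed here rather than taken from the lab, uses labels $b_i \in \{-1, 1\}$ and writes $a_i$ for the $i$-th row of $A$:

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-b_i \, a_i^\top x)\right), \qquad n = 100, \quad x \in \mathbb{R}^{200}.$$

Other conventions (labels in $\{0, 1\}$, a sum instead of a mean) differ only by a relabeling or a constant factor.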

Q3) Code gradient descent on the logistic regression problem, using automatic differentiation to compute the gradient at each iteration.

### Basic Neural Network 

Next, we'll define a very simple, fully connected neural network. Neural networks are usually defined as classes inheriting from `torch.nn.Module`.

In [None]:
from torch import nn 

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 100)
        self.fc2 = nn.Linear(100, 10)
        self.fc3 = nn.Linear(10, 1)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        """Apply network to input x."""
        out = self.relu(self.fc1(x))
        out = self.relu(self.fc2(out))
        out = self.fc3(out)
        return out
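
A module is applied by calling it like a function, not by calling `forward` directly: `nn.Module.__call__` dispatches to `forward` (and also runs any registered hooks). The sketch below shows this on a hypothetical one-layer `TinyNet`, with the input shaped as `(batch_size, dimension)`.

```python
import torch
from torch import nn


class TinyNet(nn.Module):  # hypothetical minimal module, for illustration
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 1)

    def forward(self, x):
        return self.fc(x)


net = TinyNet()
batch = torch.zeros(7, 1)   # 7 samples of dimension 1
out = net(batch)            # calling the module runs `forward`
print(out.shape)            # one output row per input row

same = torch.equal(out, net.forward(batch))
```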

Q4) Is a bias (constant term) included in pytorch's `Linear` layer? If the input of a layer is `x`, what is the output?
How are the layers' weights initialized? Why does it matter?

Q5) Plot the output of your network on the segment [-2, 2]. Try the straightforward approach first, and carefully read any error message that pops up.

In [None]:
mynet = MyNet()
x = torch.linspace(-2, 2, 100)[:, None]  # beware: shape must be (batch_size, dimension)
y = mynet(x)

### Fitting a sine 
First, let's generate some one-dimensional data.

In [None]:
X = torch.randn(10_000, 1)
y = torch.sin(np.pi * X[:, 0]) + 0.1 * torch.randn(X.shape[0])

plt.scatter(X[:100, 0], y[:100]);

We wrap our dataset in a utility called `DataLoader`, which lets us iterate over the data in batches.

In [None]:
from torch.utils.data import DataLoader, TensorDataset

train = DataLoader(TensorDataset(X, y), batch_size=64)
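
To see what iterating over a `DataLoader` yields, here is a short sketch on hypothetical random data of a smaller size. Each iteration returns one `(inputs, targets)` pair of at most `batch_size` samples; the last batch holds whatever remains.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 1000 samples of dimension 1, with scalar targets.
features = torch.randn(1000, 1)
labels = torch.randn(1000)
loader = DataLoader(TensorDataset(features, labels), batch_size=64)

n_batches = len(loader)                      # number of batches per epoch
first_x, first_y = next(iter(loader))        # first batch
batch_sizes = [len(batch_y) for _, batch_y in loader]

print(n_batches, first_x.shape, first_y.shape, batch_sizes[-1])
```

Note that the targets come out with shape `(batch_size,)` while the inputs have shape `(batch_size, 1)`. By default the data is served in order; passing `shuffle=True` reshuffles it at every epoch, which is the usual choice for training.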

Let's train our network with SGD. 

Fix the code below to make it work.

In [None]:
from torch.optim import SGD 
mse = nn.MSELoss()
mynet = MyNet()

optimizer = SGD(mynet.parameters(), lr=10, momentum=0.9)


for epoch in range(10):
    av_loss = []
    for input, target in train: 
        loss = mse(mynet(input), target[:, None])
        av_loss.append(loss.detach().numpy())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch}, av loss {np.mean(av_loss):.5f}")

Visualize the output of your network on the segment [-2, 2]. Comment.

Q6) Split the data into a training and a testing part.
Retrain your model, logging the training and testing losses across epochs. Plot them.

Q7) Run Adam (`torch.optim.Adam`) instead of SGD.