Welcome back! In the last tutorial we covered regularization and the bias-variance trade-off. We discovered that a model that is too complicated can essentially memorize the training data and fail to generalize, while a model that is too simple does not capture the underlying complexity of the data.
Important notes¶
- It is fine to consult with colleagues to solve the problems, in fact it is encouraged.
- Please turn off AI tools; we want you to internalize the concepts, not just quickly breeze through the problems. To turn off AI, click the gear in the top-right corner, go to AI assistance -> untick "Show AI powered inline completions", untick "Consented to use generative AI features", and tick "Hide Generative AI features".
Lesson 3: Feedforward Neural Networks from Scratch¶
This tutorial teaches you to build a neural network and implement backpropagation from scratch using NumPy—no deep learning framework required.
First, we will explain conceptually what is happening in a feedforward neural network (also known as a universal function approximator), then we will derive the update equations (backpropagation), and finally we will implement our first feedforward neural network in NumPy.
Neural networks are often referred to as universal function approximators. This means that a sufficiently large and complex neural network can approximate any continuous function to an arbitrary degree of accuracy. When machine learning papers mention "estimating a function" or "learning a function," they are typically referring to this capability of neural networks.
In our classification examples, the neural network learns a complex function that maps input features to a probability distribution over different classes. For example, it could learn to predict the outcome of a soccer match, determining which team will win, lose, or draw, based on various attributes like team form, player statistics, historical performance, and home-field advantage.
3.1 Conceptual representation of an FFN¶
In our first class we implemented softmax regression. We did this for two reasons:
- We wanted to show you how multi-class classification worked.
- We wanted you to get comfortable stacking multiple layers of weight vectors on top of each other, to form matrices.
We are now going to take point (2) further, and stack several matrix operations after each other. In simplest terms, a feedforward neural network looks like this:
Here we have to define several attributes.
- $K$ is the number of attributes your data has. In image analysis, this might be one feature per pixel; if you are classifying flowers, it would be the four measurements of your sepals and petals.
- $N$ is your batch dimension: the number of data points you "feed" into your neural network in a single training iteration.
- $Weights$ are your first and second weight matrices. You take the dot product between a weight matrix and your input data (or the features from the previous layer).
- $Features$ are the features that you get after multiplying the first weight matrix with the input data. We refer to the numbers we calculate as features when they are not part of the output layer.
The forward pass in a deep neural network is when a batch of data is passed from left to right through the network. In our case we:
- Take the dot product of the input data $(N \times K )$ with the weight matrix $(K \times F)$, to obtain the feature matrix $(N \times F)$. The dimension $F$ is also referred to as the number of neurons in the hidden layer.
- The feature matrix is multiplied by the second weight matrix $(F \times C)$, to obtain the output (logits) matrix $(N \times C)$.
- The output matrix is normalized by the softmax to obtain probability predictions for each class.
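To make the shapes concrete, here is a minimal NumPy sketch of these three steps (the sizes N=8, K=4, F=10, C=3 are arbitrary choices for illustration):

import numpy as np

N, K, F, C = 8, 4, 10, 3                       # batch size, attributes, hidden features, classes
X = np.random.randn(N, K)                      # input data
W1, W2 = np.random.randn(K, F), np.random.randn(F, C)  # the two weight matrices
features = X @ W1                              # (N, K) @ (K, F) -> (N, F)
logits = features @ W2                         # (N, F) @ (F, C) -> (N, C)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax: rows sum to 1
print(features.shape, logits.shape, probs.shape)  # (8, 10) (8, 3) (8, 3)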
The input data is typically referred to as the input layer, the dot product between the first weight matrix and the input data is referred to as the first hidden layer, and the dot product on the right is usually referred to as the output layer. If you add more layers in between, these will also be referred to as hidden layers.
What is not shown in this figure is what we do between layers: typically there is an activation function, which keeps the values of the feature matrices within a certain range. Also omitted from this representation is the bias term. In this case it would be vectors of size $(F \times 1)$ and $(C \times 1)$, added to the features. Because we will be calculating quite a few derivatives, the bias term has been omitted to keep them as uncluttered as possible.
The cool thing about this drawing is that you have already implemented the operation on the right. The only thing we need to do is: add the operation on the left and figure out how to adapt stochastic gradient descent (SGD) to work on this simple neural network.
Now that we conceptually know what a neural network does, let's dive into the math and derive the update equations. You will see that these make it easy to create a modular implementation of neural networks.
3.2 Deriving the Update Equations (Backpropagation)¶
Now that we have a conceptual understanding of a feedforward neural network, let's dive into the mathematics behind training it. We will use stochastic gradient descent (SGD) to update the weights of our network. To do this, we need to calculate the gradients of the loss function with respect to each weight matrix. At each operation we will also calculate the gradient with respect to the input of the layer and pass it backward through the network, so that it can be used to update the layers before it as well. This process is known as backpropagation.
We will consider a two-layer neural network with an input layer, one hidden layer, and an output layer as was shown in the figure above.
NOTE: As mentioned before, the bias term in these calculations has been omitted for simplicity.
Our network can be represented by the following equations:
Input to Hidden Layer: $$Z_{hidden} = X \cdot W_{hidden}$$
where $X$ is the input data ($N \times K$), $W_{hidden}$ is the weight matrix for the input layer ($K \times F$), and $Z_{hidden}$ is the pre-activation output of the hidden layer ($N \times F$).
Activation Function: $$A_{hidden} = \sigma(Z_{hidden})$$
where $\sigma$ is the sigmoid activation function (the same one we used in logistic regression!) and $A_{hidden}$ is the activated output of the hidden layer ($N \times F$).
Hidden to Output Layer: $$Z_{output} = A_{hidden} \cdot W_{output}$$
where $W_{output}$ is the weight matrix for the output layer ($F \times C$), and $Z_{output}$ is the pre-activation output of the output layer ($N \times C$).
Output Function: $$\hat{Y} = \text{softmax}(Z_{output})$$
where $\hat{Y}$ is the predicted probabilities for each class ($N \times C$).
We will use the cross-entropy loss function (over a batch of data, so we average over each data point), $\mathcal{L}_{CE}$. Recall from lesson 1 that it is defined as:
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C Y_{i,c} \log(\hat{Y}_{i,c})$$where $Y$ are the true labels ($N \times C$), in one-hot encoding.
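As a quick sanity check of this formula, here is the loss computed directly in NumPy (the two-example batch below is made up for illustration):

import numpy as np

Y = np.array([[1., 0., 0.], [0., 1., 0.]])             # one-hot labels, N=2, C=3
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # predicted probabilities
loss = -np.mean(np.sum(Y * np.log(Y_hat), axis=1))     # average over the batch
print(loss)  # -(log 0.7 + log 0.8) / 2, roughly 0.290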
Our goal is to find the gradients with respect to the output weights $\frac{\partial L}{\partial W_{output}}$ and with respect to the hidden weights $\frac{\partial L}{\partial W_{hidden}}$. We will use the chain rule to calculate these gradients, working backward from the loss function.
Gradient with Respect to $W_{output}$¶
To find $\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}}$, we apply the chain rule up until that point:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}} = \frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} \cdot \frac{\partial Z_{output}}{\partial W_{output}}$$We know that $$Z_{output} = A_{hidden} \cdot W_{output}$$
Therefore, $$\frac{\partial Z_{output}}{\partial W_{output}} = A_{hidden}^T$$
The derivative of the cross-entropy loss with respect to the pre-activation output of the softmax layer ($\frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}}$) is a result that we recall from previous derivations in lesson 1:
$$\frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} = \frac{1}{N} (\hat{Y} - Y)$$Therefore, the gradient with respect to $W_{output}$ is:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}} = \frac{1}{N} A_{hidden}^T \cdot (\hat{Y} - Y)$$
Because we are now working with batches, the cross entropy loss is averaged.
Gradient with Respect to $W_{hidden}$¶
To find $\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}}$, we apply the chain rule up to one level deeper in the network:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}} = \frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} \cdot \frac{\partial Z_{output}}{\partial A_{hidden}} \cdot \frac{\partial A_{hidden}}{\partial Z_{hidden}} \cdot \frac{\partial Z_{hidden}}{\partial W_{hidden}}$$Notice for later that the first part of this chain rule is the same as when we calculated the update for $\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}}$. So we don't need to repeat ourselves and re-derive $\frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}}$ or the softmax derivative.
Instead of taking the derivative with respect to the weight matrix in the output layer (where $W_{output}$ lives), we are now of course taking the derivative with respect to the weights in the hidden layer. So we need to calculate:
$$\frac{\partial Z_{output}}{\partial A_{hidden}} = W_{output}^T$$Next, we need to add the derivative of the sigmoid, which we derived earlier:
$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$Therefore,
$$\frac{\partial A_{hidden}}{\partial Z_{hidden}} = \sigma'(Z_{hidden}) = A_{hidden} \odot (1 - A_{hidden})$$Where $\odot$ denotes the element-wise (Hadamard) product: you simply multiply each element in one matrix with the element at the corresponding index in the other (in NumPy, this is just A * B).
From $Z_{hidden} = X \cdot W_{hidden}$, we have:
$$\frac{\partial Z_{hidden}}{\partial W_{hidden}} = X^T$$
Combining these terms, the gradient with respect to $W_{hidden}$ is:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}} = \frac{1}{N} X^T \cdot \left( (\hat{Y} - Y) \cdot W_{output}^T \odot (A_{hidden} \odot (1 - A_{hidden})) \right)$$
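If you want to convince yourself this formula is correct, a finite-difference check is a good habit whenever you derive gradients by hand. The sketch below (all names are illustrative; the sigmoid, softmax, and loss are exactly as defined above) perturbs a single entry of $W_{hidden}$ and compares the numerical gradient against the analytical one:

import numpy as np

def forward_loss(X, Y, W_h, W_o):
    A = 1. / (1. + np.exp(-(X @ W_h)))                        # sigmoid hidden activations
    Z = A @ W_o                                               # logits
    Y_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # softmax
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1)), A, Y_hat

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(8, 4)), np.eye(3)[rng.integers(0, 3, 8)]   # N=8, K=4, C=3, one-hot Y
W_h, W_o = 0.1 * rng.normal(size=(4, 10)), 0.1 * rng.normal(size=(10, 3))

loss, A, Y_hat = forward_loss(X, Y, W_h, W_o)
grad_analytical = X.T @ ((((Y_hat - Y) / 8) @ W_o.T) * A * (1 - A))  # the formula above, N=8

i, j, eps = 2, 5, 1e-6                                        # perturb one weight entry
W_h_eps = W_h.copy()
W_h_eps[i, j] += eps
grad_numerical = (forward_loss(X, Y, W_h_eps, W_o)[0] - loss) / eps
print(grad_analytical[i, j], grad_numerical)                  # these should agree closely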
This is quite overwhelming to do for each layer, and if you feel overwhelmed I don't blame you; I was too, the first time I had to derive this.
Fortunately, we can make our lives easier, because the neural network is essentially broken up into repeating modules. If we know the derivative of each component, we just need to fold it into the update as we move our gradients backward from the output to the input of the network. Hence the name: backpropagation.
Modular implementation¶
Look closely at the chain rules for $W_{output}$ and $W_{hidden}$: the first terms match, and the later ones differ. So we can pass the shared part back through the network and avoid re-calculating it. If the network had more hidden layers, you would see even more repeating parts in your chain-rule derivative.
Let's explicitly derive the modular updates. Let the derivative of the error and the softmax function at the output layer be:
$$\delta_{output} = \frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} = \frac{1}{N} (\hat{Y} - Y)$$
To update $W_{output}$, we simply multiply it with $A_{hidden}^T$ and use our SGD update rule!
Now, we want to pass the error back one layer in the network, so that the layer deeper in the network can also update its weights without having to recalculate the derivative of $\mathcal{L}_{CE}$. We let the layer that we are currently at pass the term $\delta_{output} \cdot W_{output}^T$, which we will refer to as $\delta_{hidden\_pre\_activation}$, back one level (to the sigmoid activation).
Then at the activation we multiply it with the derivative of the sigmoid. So what we get is:
$$\delta_{hidden} = \delta_{hidden\_pre\_activation} \odot (A_{hidden} \odot (1 - A_{hidden}))$$
Then, $\delta_{hidden}$ is passed back to the hidden layer. We know the derivative w.r.t. the weight matrix in the hidden layer is $X^{T}$. So, the gradient with respect to $W_{hidden}$ can be written modularly as:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}} = X^T \cdot \delta_{hidden}$$This modular form is much cleaner and easier to implement in code. It shows that the gradient for a weight matrix is the product of the transpose of the input to that layer and the error propagated back to the output of that layer's activation function.
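In code, this modularity means the whole backward pass is just a loop: each module receives a delta, uses it to update its own parameters (if it has any), and hands the transformed delta to the module before it. A minimal sketch (the backpropagate method name anticipates the classes we implement below):

def backward(modules, delta):
    # walk the modules from output to input, passing the delta back
    for module in reversed(modules):
        delta = module.backpropagate(delta)
    return delta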
These are the core update equations we will use to implement our feedforward neural network using NumPy. We will use these gradients to update the weights using SGD:
$W_{new} = W_{old} - \alpha \frac{\partial \mathcal{L}_{CE}}{\partial W}$
where, $\alpha$ is the learning rate.
Implementing a Feedforward Neural Network in NumPy¶
Now that we have had a look at the theory, we can start implementing the neural network. It may look intimidating, but it's much more straightforward than you think it is. Just a little tedious, but you'll be the better for it, I promise!
Do the exercises below. If in the code you see hints for later exercises, you may ignore them for now.
Exercise 3.1 : Forward Pass¶
In this exercise you are going to implement the forward pass of the FFN. Make sure to implement the indicated methods/attributes in the FFNLayer, Sigmoid, Softmax, and Cross Entropy classes.
HINT: If you want to test this, you can randomly initialize a matrix and see if the output dimensions match by just running the module!
Exercise 3.2: Backward Pass¶
Each class mentioned in the previous exercise has a backpropagate method (and, in the case of the layer, an update_weights method). Implement these.
Exercise 3.3: Training Loop¶
Implement the training loop, see hints in the exercise!
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
from typing import Tuple
%matplotlib inline
#### Feedforward Layer
class FFNLayer():
    def __init__(self, in_dims: int, out_dims: int, initialization: str = 'normal'):
        self.in_dims = in_dims
        self.out_dims = out_dims
        # EXERCISE 3.1 Initialize weights here, just sample a matrix from a normal distribution
        if initialization == 'normal':
            self.W = np.random.randn(in_dims, out_dims)
        elif initialization == 'kaiming':
            # EXERCISE 3.6. Implement Kaiming He weight initialization here!
            # fan_in variant: scale by sqrt(2 / in_dims)
            std_dev = np.sqrt(2.0 / in_dims)
            self.W = np.random.randn(in_dims, out_dims) * std_dev
        self._learning_rate = 1.0

    @property
    def learning_rate(self) -> float:
        return self._learning_rate

    @learning_rate.setter
    def learning_rate(self, value: float) -> None:
        self._learning_rate = value

    def __call__(self, X: np.ndarray) -> np.ndarray:
        # EXERCISE 3.1 Implement Forward pass here
        self.X = X  # cache the input: we need it for the weight update
        out = X @ self.W
        return out

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        # EXERCISE 3.2 Implement Backward Pass Here
        # HINT: Don't forget to calculate the delta_next here,
        # otherwise other layers won't be updated!
        # Delta next is the derivative wrt the input of this layer.
        # NOTE: compute it *before* the weight update, so it uses the
        # same weights that produced the forward pass.
        delta_next = delta @ self.W.T
        self.update_weights(delta)
        return delta_next

    def update_weights(self, delta: np.ndarray) -> None:
        # EXERCISE 3.2 Implement weight update here!
        update = self.X.T @ delta
        self.W = self.W - self._learning_rate * update
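Following the hint in Exercise 3.1, a quick way to test the layer is to push a random batch through it and check that the output shape is what you expect (the sizes here are arbitrary):

layer = FFNLayer(in_dims=4, out_dims=10)
out = layer(np.random.randn(8, 4))  # a batch of 8 examples with 4 attributes each
print(out.shape)                    # expected: (8, 10)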
### Activation functions
class Sigmoid():
    def __init__(self):
        self.A = None

    def __call__(self, X) -> np.ndarray:
        # EXERCISE 3.1 Implement forward pass here
        self.A = 1. / (1. + np.exp(-X))  # cache the activation for the backward pass
        return self.A

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        # EXERCISE 3.2 Implement backprop here
        return delta * self.A * (1 - self.A)

class SoftMax():
    def __init__(self):
        self.X = None

    def __call__(self, X: np.ndarray):
        # EXERCISE 3.1 Implement softmax forward pass here!
        return np.exp(X) / np.sum(np.exp(X), axis=1, keepdims=True)

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        # EXERCISE 3.2 Implement backprop here
        # HINT: For now we assume the only loss function that we are using is the cross entropy!
        # So no need to overcomplicate it: the combined softmax + cross-entropy
        # derivative is already computed in CrossEntropy.backpropagate.
        return delta

class CrossEntropy():
    def __init__(self, epsilon: float = 1e-8):
        self.epsilon = epsilon  # avoids log(0)
        self.Y_true = None

    def __call__(self, Y_True: np.ndarray, Y_pred: np.ndarray) -> np.ndarray:
        # EXERCISE 3.1 Implement Forward pass/ error calculation here
        self.Y_true = Y_True
        # NOTE: this is the loss summed over the batch; the 1/N factor from
        # the derivation is effectively absorbed into the learning rate.
        return -np.sum(Y_True * np.log(Y_pred + self.epsilon))

    def backpropagate(self, Y_pred: np.ndarray) -> np.ndarray:
        # EXERCISE 3.2: Implement derivative of loss here
        return Y_pred - self.Y_true
class ModuleList():
    def __init__(self):
        self.layers = []

    def set_learning_rate(self, lr: float = 0.01):
        for layer in self.layers:
            if hasattr(layer, 'learning_rate'):
                layer.learning_rate = lr

    def set_training(self):
        for layer in self.layers:
            if hasattr(layer, 'training'):
                layer.training = True

    def set_evaluation(self):
        for layer in self.layers:
            if hasattr(layer, 'training'):
                layer.training = False

    def add(self, layer):
        self.layers.append(layer)

    def __call__(self, X: np.ndarray) -> np.ndarray:
        # NOTE: This is how the forward pass will be called
        # for all of the modules.
        for layer in self.layers:
            X = layer(X)
        return X

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        for layer in reversed(self.layers):
            delta = layer.backpropagate(delta)
        return delta
class DataLoader():
    def __init__(self, batch_size=8, X: np.ndarray = None, y: np.ndarray = None, flatten: bool = False):
        '''
        Simple DataLoader, PyTorch style!
        Don't modify anything here!
        '''
        self.X = X
        if flatten:
            self.X = X.reshape(X.shape[0], -1)
        self.y = y
        self.shuffle_data()
        self.batch_size = batch_size
        self.start_index = 0
        self.end_index = self.batch_size

    def shuffle_data(self):
        indices = np.arange(self.X.shape[0])
        np.random.shuffle(indices)
        self.X = self.X[indices]
        self.y = self.y[indices]

    def __len__(self):
        return len(self.X)

    def __iter__(self):
        return self

    def __next__(self):
        # reset indices and shuffle data
        if self.end_index > self.X.shape[0]:
            self.start_index = 0
            self.end_index = self.batch_size
            self.shuffle_data()
            raise StopIteration
        next_X = self.X[self.start_index:self.end_index]
        next_y = self.y[self.start_index:self.end_index]
        # Ensure indices are updated
        self.start_index = self.end_index
        self.end_index += self.batch_size
        return next_X, next_y
class DataLoaderFactory():
    def __init__(self, batch_size: int = 8) -> None:
        '''
        Simple Dataloader, PyTorch style!
        Don't modify anything here!
        '''
        self.X, self.y = datasets.load_iris(return_X_y=True)
        # normalize the features
        self.X = (self.X - np.mean(self.X, axis=0)) / np.std(self.X, axis=0)
        # split X in X_train and X_validate as well as y
        self.X_train = self.X[:120]
        self.X_validate = self.X[120:]
        self.y_train_sparse = self.y[:120]
        self.y_validate_sparse = self.y[120:]
        # convert y_train and y_validate into one_hot matrix
        self.y_train = np.zeros((self.y_train_sparse.shape[0], 3))
        self.y_train[np.arange(self.y_train_sparse.shape[0]), self.y_train_sparse] = 1
        self.y_validate = np.zeros((self.y_validate_sparse.shape[0], 3))
        self.y_validate[np.arange(self.y_validate_sparse.shape[0]), self.y_validate_sparse] = 1
        self.batch_size = batch_size
        self.train_dataset = DataLoader(self.batch_size, self.X_train, self.y_train)
        self.validation_dataset = DataLoader(self.batch_size, self.X_validate, self.y_validate)
        self.len_train = len(self.train_dataset)
        self.len_validate = len(self.validation_dataset)

    def get_validation_dataset(self):
        return self.validation_dataset

    def get_train_dataset(self):
        return self.train_dataset
# Standard neural network: This will be our baseline
module_list = ModuleList()
module_list.add(FFNLayer(4, 10))
module_list.add(Sigmoid())
module_list.add(FFNLayer(10, 3))
module_list.add(SoftMax())
loss_fn = CrossEntropy()
def train_model(module_list, loss_fn, epochs: int = 200, learning_rate: float = 0.01, dataloader_factory: DataLoaderFactory = DataLoaderFactory, batch_size: int = 8):
    # set the learning rate
    module_list.set_learning_rate(learning_rate)
    # Get the dataloaders
    loader_factory = dataloader_factory(batch_size=batch_size)
    train_loader = loader_factory.get_train_dataset()
    validation_loader = loader_factory.get_validation_dataset()
    train_loss_history = []
    validation_loss_history = []
    for epoch in range(epochs):
        avg_loss = 0
        module_list.set_training()
        for data, y_target in train_loader:
            # Exercise 3.3: Implement the training update here!
            # forward pass, get the predicted y
            y_pred = module_list(data)
            # calculate the entropy loss! name it "loss"
            loss = loss_fn(y_target, y_pred)
            # calculate the delta, from the loss function!
            delta = loss_fn.backpropagate(y_pred)
            # backprop!
            module_list.backpropagate(delta)
            ### END YOUR code
            num_batches = len(train_loader) / train_loader.batch_size
            avg_loss += loss / num_batches
        # validate
        avg_val_loss = 0
        module_list.set_evaluation()
        for data, y_target in validation_loader:
            y_pred = module_list(data)
            loss = loss_fn(y_target, y_pred)
            num_batches = len(validation_loader) / validation_loader.batch_size
            avg_val_loss += loss / num_batches
        validation_loss_history.append(float(avg_val_loss))
        print(f'Epoch: {epoch} \t average loss: {avg_loss} \t validation loss: {avg_val_loss}')
        train_loss_history.append(float(avg_loss))
    return train_loss_history, validation_loss_history
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 200, learning_rate = 0.01)
Epoch: 0 	 average loss: 12.876451251592778 	 validation loss: 13.848139633140521
Epoch: 1 	 average loss: 8.428405148365744 	 validation loss: 11.699000985523965
Epoch: 2 	 average loss: 6.548130182924329 	 validation loss: 10.17205907226144
...
Epoch: 198 	 average loss: 0.6200292705850267 	 validation loss: 1.9747945283697077
Epoch: 199 	 average loss: 0.619541825245869 	 validation loss: 1.8940195339202466
def plot_train_validation_losses(train_loss_history, validation_loss_history):
    indices = np.arange(len(train_loss_history))
    plt.plot(indices, train_loss_history, label='train loss')
    plt.plot(indices, validation_loss_history, label='validation loss')
    plt.legend()
    plt.title('Train and Validation loss for your neural network')
    plt.show()
plot_train_validation_losses(train_loss_history, validation_loss_history)
3.3 Activation functions¶
In the previous exercise, we used the sigmoid activation function. We are going to have a look at several different activation functions that are specifically designed for deeper neural networks.
Many activation functions have been designed for increasingly deep neural networks. In the before times, sigmoid (and tanh) activation functions were used in deep neural networks; these had several problems, however:
Vanishing Gradients: This is the most significant problem. Sigmoid and Tanh saturate for large values, causing their gradients to become near zero. This severely hinders the learning of earlier layers in deep networks during backpropagation.
Slow Convergence: Due to vanishing gradients, models using Sigmoid or Tanh can converge much slower, especially on deep architectures.
Computational Cost: The exponential function in Sigmoid and Tanh is computationally more expensive than the alternatives.
Non-Zero Centered Output (Sigmoid): Sigmoid's output is always positive. This can lead to suboptimal gradient updates in subsequent layers.
Alternative activation functions that suffer less from these problems are ReLU, Leaky ReLU, and ELU. These are discussed next.
Rectified Linear Unit (ReLU) Activation Function¶
The Rectified Linear Unit (ReLU) is one of the most popular activation functions used in deep learning. It's computationally efficient and has helped address issues like the vanishing gradient problem that can occur with sigmoid and tanh functions.
The ReLU function is defined as:
$$ \text{ReLU}(z) = \max(0, z) $$This means that if the input $z$ is positive, the output is $z$. If the input $z$ is zero or negative, the output is 0. The derivative of the ReLU function is:
$$ \frac{d}{dz} \text{ReLU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases} $$This means the gradient is 1 for any positive input and 0 for any non-positive input. This simple derivative is one of the reasons ReLU is computationally efficient and helps combat the vanishing gradient problem.
Leaky Rectified Linear Unit (Leaky ReLU) Activation Function¶
The Leaky Rectified Linear Unit (Leaky ReLU) is another variation of the ReLU activation function designed to address the "dying ReLU" problem. While ReLU outputs zero for all negative inputs, which can cause neurons to become inactive and stop learning, Leaky ReLU allows a small, non-zero gradient when the input is negative.
The Leaky ReLU function is defined as:
$$ \text{Leaky ReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \le 0 \end{cases} $$where $\alpha$ is a small positive constant, typically around 0.01.
The derivative of the Leaky ReLU function is:
$$ \frac{d}{dz} \text{Leaky ReLU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \le 0 \end{cases} $$For positive inputs, the derivative is 1, just like ReLU. For non-positive inputs, the derivative is $\alpha$. This small non-zero gradient ensures that neurons can still learn even when their input is negative, preventing them from becoming permanently inactive.
Exponential Linear Unit (ELU) Activation Function¶
The Exponential Linear Unit (ELU) is another popular activation function that aims to combine the benefits of ReLU while addressing some of its drawbacks. It tends to produce negative outputs for negative inputs, which can help push the mean of activations closer to zero, potentially leading to faster learning.
The ELU function is defined as:
$$ \text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha (e^z - 1) & \text{if } z \le 0 \end{cases} $$where $\alpha$ is a hyperparameter, usually set to 1.
The derivative of the ELU function is:
$$ \frac{d}{dz} \text{ELU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha e^z & \text{if } z \le 0 \end{cases} $$For $z > 0$, the derivative is 1, similar to ReLU. For $z \le 0$, the derivative is $\alpha e^z$. This derivative is non-zero for negative inputs, which helps mitigate the "dying ReLU" problem (where neurons can become inactive and stop learning).
Run the cell below to see what these functions look like!
import numpy as np
import matplotlib.pyplot as plt
# Generate input values
z = np.linspace(-5, 5, 100)
# ReLU function
def relu(z):
    return np.maximum(0, z)

# ELU function
def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

# Leaky ReLU function
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

# Create subplots (2 rows, 2 columns)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Plot ReLU on the first subplot (top-left)
axes[0, 0].plot(z, relu(z), label='ReLU')
axes[0, 0].set_title('ReLU Activation Function')
axes[0, 0].set_xlabel('Input (z)')
axes[0, 0].set_ylabel('Output (ReLU(z))')
axes[0, 0].grid(True)
axes[0, 0].legend()

# Plot ELU with different alpha values on the second subplot (top-right)
alphas_elu = [0.5, 1.0, 2.0]
for alpha in alphas_elu:
    axes[0, 1].plot(z, elu(z, alpha), label=f'ELU (α={alpha})')
axes[0, 1].set_title('ELU Activation Function with Different Alpha Values')
axes[0, 1].set_xlabel('Input (z)')
axes[0, 1].set_ylabel('Output (ELU(z))')
axes[0, 1].grid(True)
axes[0, 1].legend()

# Plot Leaky ReLU with different alpha values on the third subplot (bottom-left)
alphas_leaky_relu = [0.01, 0.1, 0.2]
for alpha in alphas_leaky_relu:
    axes[1, 0].plot(z, leaky_relu(z, alpha), label=f'Leaky ReLU (α={alpha})')
axes[1, 0].set_title('Leaky ReLU Activation Function with Different Alpha Values')
axes[1, 0].set_xlabel('Input (z)')
axes[1, 0].set_ylabel('Output (Leaky ReLU(z))')
axes[1, 0].grid(True)
axes[1, 0].legend()

# Hide the fourth subplot as it's not used
fig.delaxes(axes[1, 1])
plt.tight_layout()  # Adjust layout to prevent overlap
plt.show()
Exercise 3.4 Implementing ReLU, leaky-ReLU, and ELU¶
In the below class definitions, implement the forward and backward functions.
HINT: You can already see how the forward function is implemented in the code cell above, keep it a secret, don't tell your fellow classmates.
class ReLU():
    def __init__(self):
        self.X = None

    def __call__(self, X: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        self.X = X
        return np.maximum(0, X)

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        # gradient is 1 where the input was positive, 0 elsewhere
        binarized_matrix = np.where(self.X > 0, 1, 0)
        return delta * binarized_matrix

class LeakyReLU():
    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
        self.X = None

    def __call__(self, X: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        self.X = X
        return np.where(self.X > 0, self.X, self.alpha * self.X)

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        # gradient is 1 where the input was positive, alpha elsewhere
        binarized_matrix = np.where(self.X > 0, 1, self.alpha)
        return delta * binarized_matrix

class ELU():
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.X = None

    def __call__(self, X: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        self.X = X
        return np.where(self.X > 0, self.X, self.alpha * (np.exp(self.X) - 1))

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        # gradient is 1 where the input was positive, alpha * exp(x) elsewhere
        binarized_matrix = np.where(self.X > 0, 1, self.alpha * np.exp(self.X))
        return delta * binarized_matrix
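As with the layer gradients earlier, you can sanity-check these backpropagate implementations against a finite difference. A small sketch for ReLU (the same pattern works for LeakyReLU and ELU; the test values are arbitrary):

act, eps = ReLU(), 1e-6
x = np.array([[-1.0, 0.5, 2.0]])
act(x)                                                  # forward pass caches the input
analytical = act.backpropagate(np.ones_like(x))         # propagate a delta of ones
numerical = (act(x + eps) - act(x - eps)) / (2 * eps)   # central difference
print(analytical, numerical)                            # both ~ [[0. 1. 1.]]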
Exercise 3.5: Run the network with the activation functions¶
Run the network with the activation functions that you have just implemented on the MNIST digit dataset. The MNIST dataset consists of small images of handwritten digits; in our case we flatten them and feed them into our neural network.
Here is example data from MNIST:
Source: https://www.geeksforgeeks.org/machine-learning/mnist-dataset/
Try running the network for 20 epochs on Sigmoid, ReLU, LeakyReLU, and ELU activation functions.
HINT: Right now, ReLU, LeakyReLU, and ELU should all return NaNs. It's a sign you are on the right track, but not a guarantee. We will see next why this is!
# Install the MNIST dataset
pip install mnist_datasets
Collecting mnist_datasets Downloading mnist_datasets-0.12-py3-none-any.whl.metadata (4.8 kB) Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from mnist_datasets) (2.0.2) Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from mnist_datasets) (4.67.1) Downloading mnist_datasets-0.12-py3-none-any.whl (6.7 kB) Installing collected packages: mnist_datasets Successfully installed mnist_datasets-0.12
from mnist_datasets import MNISTLoader
# class for the MNIST dataloader. DO NOT CHANGE THIS.
class MNISTDataLoaderFactory():
    def __init__(self, batch_size: int = 8) -> None:
        '''
        Simple Dataloader, PyTorch style!
        Don't modify anything here!
        '''
        train_X, train_y = MNISTLoader().load()
        validate_X, validate_y = MNISTLoader().load(train=False)
        # Calculate mean and standard deviation
        mean_train = np.mean(train_X, axis=0)
        std_train = np.std(train_X)
        mean_validate = np.mean(validate_X, axis=0)
        std_validate = np.std(validate_X)
        # normalize the features
        train_X = (train_X - mean_train) / std_train
        validate_X = (validate_X - mean_validate) / std_validate
        # convert y_train and y_validate into one_hot matrix
        train_y = np.array(train_y).astype(int)
        validate_y = np.array(validate_y).astype(int)
        train_y_one_hot = np.zeros((train_y.shape[0], 10))
        train_y_one_hot[np.arange(train_y.shape[0]), train_y] = 1.
        validate_y_one_hot = np.zeros((validate_y.shape[0], 10))
        validate_y_one_hot[np.arange(validate_y.shape[0]), validate_y] = 1.
        self.batch_size = batch_size
        self.train_dataset = DataLoader(self.batch_size, train_X, train_y_one_hot, flatten=True)
        self.validation_dataset = DataLoader(self.batch_size, validate_X, validate_y_one_hot, flatten=True)
        self.len_train = len(self.train_dataset)
        self.len_validate = len(self.validation_dataset)

    def get_validation_dataset(self):
        return self.validation_dataset

    def get_train_dataset(self):
        return self.train_dataset
def mnist_model(activation='sigmoid', initialization='normal') -> Tuple[ModuleList, CrossEntropy]:
    '''
    MNIST neural network model.
    For exercise 3.5, only change the activation function.
    For exercise 3.6, change the initialization after you have modified it in the FFNLayer class!
    '''
    if activation == 'sigmoid':
        activation_fn = Sigmoid
    elif activation == 'relu':
        activation_fn = ReLU
    elif activation == 'leaky_relu':
        activation_fn = LeakyReLU
    elif activation == 'elu':
        activation_fn = ELU
    else:
        raise ValueError(f'Unknown activation function: {activation}, pick one from: sigmoid, relu, leaky_relu, elu')
    module_list = ModuleList()
    module_list.add(FFNLayer(784, 20, initialization=initialization))
    module_list.add(activation_fn())
    module_list.add(FFNLayer(20, 20, initialization=initialization))
    module_list.add(activation_fn())
    module_list.add(FFNLayer(20, 10, initialization=initialization))
    module_list.add(SoftMax())
    loss_fn = CrossEntropy()
    return module_list, loss_fn
# module_list, loss_fn = mnist_model(activation = 'relu', initialization = 'kaiming')
# train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 10, learning_rate = 0.001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
So why the NaNs?¶
So what is going on: why are the losses turning into NaNs, in spite of us using "advanced activations" that are supposed to solve vanishing gradients? While ReLU, ELU, and LeakyReLU address vanishing gradients, they have a different challenge: their positive side is unbounded. With standard normal initialization, the weights can cause the outputs of these activation functions (the features) to grow very large as data passes through the layers. This phenomenon is sometimes referred to as exploding activations or exploding values.
When these large values reach the final layer and are fed into the Softmax function, the exponentiation step (np.exp(X)) can produce numbers too large to be represented by your computer's floating-point precision, leading to overflow. This overflow often produces inf (infinity), and subsequent calculations within the Softmax or the Cross-Entropy loss (like inf / inf) result in NaN (Not a Number) values, causing training to fail.
There are a variety of ways in which we can deal with this:
- Better Weight Initialization: Because the features are repeatedly multiplied by the weights, their magnitudes keep compounding. So we want to initialize the weights in a smarter way.
- Numerically Stable Softmax: We could implement a numerically more stable version of the softmax (see the sketch after this list).
- Gradient Clipping: We could clip the gradients during backpropagation, such that they are only allowed to update the weights by so much each iteration.
- Smaller Learning Rate: Another way of restricting the size of the weight updates each iteration.
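The second bullet is worth seeing concretely. Subtracting the row-wise maximum from the logits before exponentiating leaves the softmax output unchanged (softmax is invariant to adding a constant to every logit), but it guarantees the exponents are at most zero, so np.exp cannot overflow. A minimal sketch:

def stable_softmax(X):
    shifted = X - np.max(X, axis=1, keepdims=True)  # largest logit per row becomes 0
    exp = np.exp(shifted)                           # exponents are <= 0: no overflow
    return exp / np.sum(exp, axis=1, keepdims=True)

logits = np.array([[1000., 1001., 1002.]])          # the naive softmax overflows here
print(stable_softmax(logits))                       # ~[[0.090 0.245 0.665]]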
We will implement a technique that addresses the first of these: smarter weight initialization. This technique is described below!
Weight Initialization: Kaiming (He)¶
Introduced by Kaiming He et al. in 2015, Kaiming initialization was specifically designed for layers with ReLU and its variants (like Leaky ReLU). Unlike Sigmoid and Tanh, ReLU's mean is not zero, and it sets negative inputs to zero, which affects the variance of the activations.
Kaiming initialization accounts for the fact that ReLU "kills" half of the neurons (on average) by outputting zero for negative inputs. It aims to maintain the variance of the activations by scaling the weights appropriately.
The variance of the weights is typically set based on the number of input neurons (fan-in) or output neurons (fan-out), often using fan-in:
$$ \text{Var}(W) = \frac{2}{\text{fan\_in}} $$or using fan-out:
$$ \text{Var}(W) = \frac{2}{\text{fan\_out}} $$The weights are typically initialized by drawing from:
A normal distribution with mean 0 and standard deviation $\sigma = \sqrt{\frac{2}{\text{fan\_in}}}$ (or $\sqrt{\frac{2}{\text{fan\_out}}}$).
$$W \sim N\left(0, \sqrt{\frac{2}{\text{fan\_in}}}\right)$$A uniform distribution in the range $[-\text{limit}, \text{limit}]$, where $\text{limit} = \sqrt{\frac{6}{\text{fan\_in}}}$ (or $\sqrt{\frac{6}{\text{fan\_out}}}$).
$$ W \sim U\left[-\sqrt{\frac{6}{\text{fan\_in}}}, \sqrt{\frac{6}{\text{fan\_in}}}\right] $$
Kaiming initialization is generally the preferred method when using ReLU or Leaky ReLU activation functions, as it specifically addresses the change in variance caused by these activations.
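A quick experiment shows why this matters. Pushing a batch through a stack of ReLU layers, the activation magnitudes explode under plain standard-normal initialization but stay stable under Kaiming initialization (the width, depth, and seed below are arbitrary):

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 256))
for name, scale in [('normal', 1.0), ('kaiming', np.sqrt(2.0 / 256))]:
    h = X
    for _ in range(10):                          # 10 stacked ReLU layers
        W = rng.normal(size=(256, 256)) * scale  # fan_in = 256
        h = np.maximum(0, h @ W)
    print(name, np.std(h))  # 'normal' grows by orders of magnitude; 'kaiming' stays near 1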
Exercise 3.6: Implementing He initialization¶
We are going to implement He initialization in our FFNLayer class so we can actually use the activations that we have just implemented.
The only thing that you have to implement is the Kaiming (He) initialization in the FFNLayer class, using the fan_in method. See the comment in the class and don't forget to rerun the notebook cell!
Once you have implemented this, swap the initialization parameter in the mnist_model function below to 'kaiming'. Then change the learning rate to 0.0001 and run the model with ReLU, ELU, and LeakyReLU for 100 epochs.
What do you notice now?
module_list, loss_fn = mnist_model(activation = 'relu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 50, learning_rate = 0.0001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
module_list, loss_fn = mnist_model(activation = 'elu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 100, learning_rate = 0.0001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
What you may have noticed is that training a neural network with one of the ReLU-family activation functions in combination with Kaiming initialization produces the best results.
We have worked on a shallow neural network so far. Another technique that helps train bigger neural networks is to normalize the features between layers. We will discuss a method for doing this next.
3.4 Batch Normalization¶
Batch Normalization (BatchNorm) is a technique introduced to address challenges in training deep neural networks. Initially, it was hypothesized that BatchNorm primarily helps by reducing Internal Covariate Shift, the change in the distribution of network activations during training. However, more recent research, notably the paper "How Does Batch Normalization Help Optimization?" by Santurkar et al., has suggested that while BatchNorm might not eliminate Internal Covariate Shift, its main benefits stem from other factors that make the optimization process smoother and more efficient.
Instead of primarily reducing Internal Covariate Shift, Santurkar et al. argue that Batch Normalization helps by:
- Making the optimization landscape smoother: This allows gradients to be more predictable and less noisy, enabling the use of higher learning rates and faster convergence.
- Reducing the dependence of gradients on the scale of parameters: This makes the training less sensitive to the initial values of the weights.
Regardless of the exact mechanism, Batch Normalization has proven to be a highly effective technique for stabilizing and accelerating the training of deep neural networks.
For a mini-batch of size $m$, let $x_i$ be the input to a neuron in a layer for the $i$-th example in the batch. Batch Normalization performs the following transformation for each neuron's input:
Calculate the mini-batch mean: $$ \mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \quad (\text{BN.1}) $$ This is the average of the inputs for a single neuron across the entire mini-batch. In NumPy, you would calculate the mean along the batch dimension.
Calculate the mini-batch variance: $$\sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 \quad \quad (\text{BN.2})$$
This is the variance of the inputs for a single neuron across the mini-batch. In NumPy, you would calculate the variance along the batch dimension.
Normalize the input:
$$\hat{x_i} = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \quad \quad (\text{BN.3})$$Here, $\epsilon$ is a small constant added for numerical stability to prevent division by zero in case the mini-batch variance is zero. This step scales the inputs to have zero mean and unit variance.
Scale and Shift: $$ y_i = \gamma \hat{x}_i + \beta \quad \quad (\text{BN.4})$$ This is the final output of the Batch Normalization layer. $\gamma$ (gamma) and $\beta$ (beta) are learnable parameters of the Batch Normalization layer. $\gamma$ scales the normalized input, and $\beta$ shifts it. These parameters allow the network to learn to restore the original representation power if needed, by essentially learning to undo the normalization if that's optimal for the network.
During Training:
- The mini-batch mean $\mu_{\mathcal{B}}$ and variance $\sigma_{\mathcal{B}}^2$ are calculated for each mini-batch.
- The learnable parameters $\gamma$ and $\beta$ are updated using gradient descent.
- We also maintain a running average of the means and variances across all mini-batches to be used during inference.
During Inference (Testing):
- We use the accumulated global mean and variance (calculated as running averages during training) to normalize the inputs.
The formulas for updating the running average are typically:
$$ \text{running\_mean} = \text{momentum} \cdot \text{running\_mean} + (1 - \text{momentum}) \cdot \mu_{\mathcal{B}} \quad \quad (\text{BN.5})$$$$ \text{running\_variance} = \text{momentum} \cdot \text{running\_variance} + (1 - \text{momentum}) \cdot \sigma_{\mathcal{B}}^2 \quad \quad (\text{BN.6})$$where momentum is a hyperparameter (typically close to 1, e.g., 0.9 or 0.99).
Batch Normalization is a powerful technique that leads to:
- Faster Training: It allows for higher learning rates.
- Reduced Dependence on Initialization: The network becomes less sensitive to the initial values of the weights.
- Regularization Effect: The noise in the mini-batch statistics acts as a mild regularizer.
Implementing Batch Normalization involves both the forward pass (normalization and scaling/shifting) and the backward pass (calculating gradients with respect to the inputs, $\gamma$, and $\beta$).
Gradients for Gamma and Beta in Batch Normalization¶
In Batch Normalization, $\gamma$ (gamma) and $\beta$ (beta) are learnable parameters that are updated during training using gradient descent. The update formulas depend on the gradients of the loss function with respect to these parameters ($\frac{\partial L}{\partial \gamma}$ and $\frac{\partial L}{\partial \beta}$).*
Recall the output of the Batch Norm layer for a single example $i$ in a mini-batch: $y_i = \gamma \hat{x}_i + \beta$, where $\hat{x}_i$ is the normalized input.
Here's how to calculate the gradients:
Gradient with Respect to $\beta$:
The cross-entropy loss depends on $y_i$, and $y_i$ depends linearly on $\beta$. Using the chain rule, the gradient of the loss with respect to $\beta$ is the sum, across the mini-batch, of the gradients of the loss with respect to the outputs of the Batch Norm layer ($y_i$):

$$ \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial \beta} $$

Since $\frac{\partial y_i}{\partial \beta} = 1$, this simplifies to:

$$ \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} $$

In terms of the incoming delta ($\delta_{out}$) from the layer above (where $\delta_{out} = \frac{\partial L}{\partial y}$), the gradient is:

$$ \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} (\delta_{out})_i \quad \quad (\text{BN.7})$$
Gradient with Respect to $\gamma$: Using the chain rule for $\gamma$: $$ \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial \gamma} $$
Since $y_i = \gamma \hat{x}_i + \beta$, $\frac{\partial y_i}{\partial \gamma} = \hat{x}_i$.
So the gradient is the sum of the element-wise product of the incoming delta and the normalized input ($\hat{x}_i$) across the mini-batch:

$$ \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \hat{x}_i $$

In terms of the incoming delta ($\delta_{out}$) and the normalized input $\hat{x}_i$:
$$ \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} (\delta_{out})_i \cdot \hat{x}_i \quad \quad (\text{BN.8})$$
These gradients ($\frac{\partial L}{\partial \gamma}$ and $\frac{\partial L}{\partial \beta}$) are then used to update $\gamma$ and $\beta$ using the standard gradient descent rule:
$$ \gamma_{new} = \gamma_{old} - \alpha \frac{\partial L}{\partial \gamma} $$

$$ \beta_{new} = \beta_{old} - \alpha \frac{\partial L}{\partial \beta} $$

where $\alpha$ is the learning rate.
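As a sanity check on these formulas, here is a minimal NumPy sketch; the delta and X_hat arrays are random placeholders standing in for the quantities cached during a real forward and backward pass:

import numpy as np

m, features = 4, 3
rng = np.random.default_rng(0)
delta = rng.normal(size=(m, features))  # hypothetical incoming gradient dL/dy
X_hat = rng.normal(size=(m, features))  # hypothetical cached normalized input

dbeta = delta.sum(axis=0)             # (BN.7): sum of deltas over the batch
dgamma = (delta * X_hat).sum(axis=0)  # (BN.8): sum of delta * x_hat over the batch

# One gradient-descent step on the learnable parameters:
alpha = 0.01
gamma = np.ones(features) - alpha * dgamma
beta = np.zeros(features) - alpha * dbeta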
Exercise 3.7: Implementing Batch Normalization¶
For the next exercise we are going to implement batch normalization. You will implement the forward pass. In the backward pass you only need to implement the updates for Gamma and Beta, the delta (the bit that is propagated further down the network) is already implemented. The equations that you need are already indicated above with $\text{BN}.x$.
If you are curious about how these are derived, you can check out this blog:
*$\mathcal{L}_{CE}$ is referred to as $L$ in this derivation, because LaTeX was having issues.
class BatchNormalization():
def __init__(self, momentum: float = 0.9, epsilon: float = 1e-5):
        '''
        Batch Normalization layer.

        momentum: weighting of the running statistics in (BN.5) and (BN.6).
        epsilon: small constant for numerical stability in (BN.3).
        '''
self.momentum = momentum
self.epsilon = epsilon
self.running_mean = None
self.running_var = None
self.X = None
self.gamma = None
self.beta = None
self._learning_rate = 1.0
self.training = True
@property
def learning_rate(self) -> float:
return self._learning_rate
@learning_rate.setter
def learning_rate(self, value: float) -> None:
self._learning_rate = value
def __call__(self, X: np.ndarray) -> np.ndarray:
self.X = X
self.batch_size = X.shape[0]
        ## EXERCISE 3.7: Implement the forward pass for batch normalization
        # If gamma or beta do not exist yet, initialize them from the shape
        # of X: gamma as a vector of ones and beta as a vector of zeros.
if self.gamma is None or self.beta is None:
self.gamma = np.ones(X.shape[1])
self.beta = np.zeros(X.shape[1])
        # Mean and variance for each feature (while training).
if self.training:
# TODO: calculate the mean and variance
self.mu = np.mean(X, axis=0) # (BN.1)
self.var = np.var(X, axis=0) # (BN.2)
# update the running mean & running var
if self.running_mean is None or self.running_var is None:
                # TODO: if there are no running statistics yet,
                # initialize them from the current batch mean and variance
                self.running_mean = self.mu.copy()
                self.running_var = self.var.copy()
else:
                # TODO: Implement the running mean and running variance updates
self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * self.mu # (BN.5)
self.running_var = self.momentum * self.running_var + (1 - self.momentum) * self.var # (BN.6)
# If not training, use the running mean and variance.
else:
self.mu = self.running_mean
self.var = self.running_var
# Normalize the input (don't forget to add epsilon)
# TODO: Calculate the normalization of the input
self.X_hat = (X - self.mu) / np.sqrt(self.var + self.epsilon) # (BN.3)
# TODO: Scale and shift the input
out = self.gamma * self.X_hat + self.beta # (BN.4)
return out
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
# Now we need to implement the derivative with respect to the input X
# which is pretty tricky at this stage. So that is provided :)
# NO NEED TO IMPLEMENT THE DELTA_OUT UPDATE!
constant = self.gamma / (self.batch_size * np.sqrt(self.var + self.epsilon))
first_term = self.batch_size * delta
        second_term = np.sum(delta, axis=0, keepdims=True)  # sum of incoming deltas over the batch
        third_term = self.X_hat * np.sum(delta * self.X_hat, axis=0, keepdims=True)  # component along X_hat
delta_out = constant * (first_term - second_term - third_term)
        # EXERCISE 3.7: Implement the backprop for gamma and beta here!
        self.gamma = self.gamma - self._learning_rate * np.sum(delta * self.X_hat, axis=0)  # (BN.8)
        self.beta = self.beta - self._learning_rate * np.sum(delta, axis=0)  # (BN.7)
return delta_out
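Before training a full model, you can sanity-check the layer with a quick sketch like the one below (it assumes the BatchNormalization class above and that NumPy is imported as np):

# Feed one batch through the layer in training mode and inspect the output.
bn = BatchNormalization()
X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 10))

out = bn(X)                        # training-mode forward pass
print(out.mean(axis=0).round(3))   # ~0 per feature (beta starts at zeros)
print(out.std(axis=0).round(3))    # ~1 per feature (gamma starts at ones)

bn.training = False                # switch to inference mode
out_eval = bn(X)                   # now normalized with the running statistics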
Exercise 3.8: Testing Batch Normalization¶
Try out your implementation of batch normalization. Just run the function below as is. As you may notice, the batch size is 32 instead of 8. This is because batch normalization estimates the mean and variance of each feature from the mini-batch, so the batches need to be large enough for those estimates to be reliable.
I got the model to train to about:
- average loss: 0.37
- validation loss: 2.68
def mnist_model_bn(activation = 'sigmoid', initialization = 'normal') -> Tuple[ModuleList, CrossEntropy]:
'''
MNIST neural network model.
'''
if activation == 'sigmoid':
activation_fn = Sigmoid
elif activation == 'relu':
activation_fn = ReLU
elif activation == 'leaky_relu':
activation_fn = LeakyReLU
elif activation == 'elu':
activation_fn = ELU
else:
raise ValueError(f'Unknown activation function: {activation}, pick one from: sigmoid, relu, leaky_relu, elu')
module_list = ModuleList()
module_list.add(FFNLayer(784, 100, initialization = initialization))
module_list.add(BatchNormalization())
module_list.add(activation_fn())
module_list.add(FFNLayer(100, 100, initialization = initialization))
module_list.add(BatchNormalization())
module_list.add(activation_fn())
module_list.add(FFNLayer(100, 100, initialization = initialization))
module_list.add(BatchNormalization())
module_list.add(activation_fn())
module_list.add(FFNLayer(100, 10, initialization = initialization))
module_list.add(SoftMax())
loss_fn = CrossEntropy()
return module_list, loss_fn
module_list, loss_fn = mnist_model_bn(activation = 'elu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 100, learning_rate = 0.0005, dataloader_factory = MNISTDataLoaderFactory, batch_size = 32)
# plot_train_validation_losses(train_loss_history, validation_loss_history)
3.5 Other Normalization Techniques¶
While Batch Normalization is widely used and effective, its reliance on mini-batch statistics can be a limitation. Several other normalization techniques have been developed that address this by normalizing across different dimensions, making them less dependent on batch size:
- Layer Normalization: Normalizes across the features of each individual sample. This is particularly useful in recurrent neural networks and transformer models.
- Instance Normalization: Normalizes across the spatial dimensions (height and width) for each sample and each feature channel independently. This has shown success in style transfer tasks.
- Group Normalization: Divides the channels into groups and normalizes the features within each group for each sample. This provides a balance between Batch Normalization and Layer Normalization and works well across a wide range of batch sizes.
These techniques offer alternatives to Batch Normalization when its assumptions or requirements are not met, providing more stable training and better performance in specific scenarios. Though we will not go in depth on these techniques in this tutorial, it is useful to know about them.
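To make the difference in normalization axes concrete, here is a minimal sketch contrasting Batch Normalization and Layer Normalization on the same matrix. This only illustrates the statistics, not a full layer with learnable parameters:

import numpy as np

X = np.random.default_rng(0).normal(size=(8, 16))  # (batch, features)
eps = 1e-5

# Batch Normalization: statistics per feature, shared across the batch (axis=0).
bn_out = (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + eps)

# Layer Normalization: statistics per sample, independent of batch size (axis=1).
ln_out = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + eps)

Because Layer Normalization computes its statistics within each sample, it works even with a batch size of one, which is exactly where Batch Normalization breaks down.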
Concluding remarks¶
We have done a lot in this tutorial. We've understood feedforward neural networks, one of the earliest neural network architectures and a critical building block of many specialized models. We've also implemented activation functions, weight initialization, and batch normalization.
You may pat yourself on the back: many people who apply neural networks today have never implemented one from scratch in NumPy or seen how backpropagation works. Now you have!
As you may have noticed, our architecture for predicting MNIST classes was not optimized for images. In the next tutorial we will implement a convolutional neural network, which for many years was the go-to architecture for image-related tasks. Only in the last few years has a newer architecture emerged that can challenge it under certain conditions.