Welcome back! In the last tutorial we covered regularization and the bias-variance trade-off. We discovered that a model that is too complicated can essentially memorize the training data and fail to generalize, while a model that is too simple does not capture the underlying complexity of the data.
Important notes¶
- It is fine to consult with colleagues to solve the problems, in fact it is encouraged.
- Please turn off AI tools; we want you to internalize the concepts, not just quickly breeze through the problems. To turn off AI, click the gear in the top-right corner, go to AI assistance -> untick "Show AI powered inline completions", untick "Consented to use generative AI features", and tick "Hide Generative AI features".
Lesson 3: Feedforward Neural Networks from Scratch¶
This tutorial teaches you to build a neural network and implement backpropagation from scratch using NumPy—no deep learning framework required.
First, we will explain conceptually what is happening in a feedforward neural network (also known as a universal function approximator), then we will derive the update equations (backpropagation), and finally we will implement our first feedforward neural network in NumPy.
Neural networks are often referred to as universal function approximators. This means that a sufficiently large and complex neural network can approximate any continuous function to an arbitrary degree of accuracy. When machine learning papers mention "estimating a function" or "learning a function," they are typically referring to this capability of neural networks.
In our classification examples, the neural network learns a complex function that maps input features to a probability distribution over different classes. For example, it could learn to predict the outcome of a soccer match, determining which team will win, lose, or draw, based on various attributes like team form, player statistics, historical performance, and home-field advantage.
3.1 Conceptual representation of an FFN¶
In our first class we implemented softmax regression. We did this for two reasons:
- We wanted to show you how multi-class classification worked.
- We wanted you to get comfortable stacking multiple layers of weight vectors on top of each other, to form matrices.
We are now going to take point (2) further, and stack several matrix operations after each other. In simplest terms, a feedforward neural network looks like this:
Here we have to define several attributes.
- $K$ is the number of attributes your data has. In image analysis, this might be one feature per pixel; if you are classifying flowers, it would be the four measurements of your sepals and petals.
- $N$ is your batch dimension: the number of data points you "feed" into your neural network in a single training iteration.
- $Weights$ are your first and second weight matrices. You take the dot product between a weight matrix and your input data (or the features from the previous layer).
- $Features$ are the features that you get after multiplying the first weight matrix with the input data. We refer to the numbers we calculate as features when they are not part of the output layer.
The forward pass in a deep neural network is when a batch of data is passed from left to right through the network. In our case we:
- Take the dot product of the input data $(N \times K )$ with the weight matrix $(K \times F)$, to obtain the feature matrix $(N \times F)$. The dimension $F$ is also referred to as the number of neurons in the hidden layer.
- The feature matrix is multiplied by the second weight matrix $(F \times C)$, to obtain the output (logits) matrix $(N \times C)$.
- The output matrix is normalized by the softmax to obtain probability predictions for each class.
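To make the shapes concrete, here is a minimal NumPy sketch of these three steps (the sizes N=8, K=4, F=10, C=3 are arbitrary choices for illustration):

import numpy as np

N, K, F, C = 8, 4, 10, 3                       # batch size, attributes, hidden features, classes
X = np.random.randn(N, K)                      # input data
W1, W2 = np.random.randn(K, F), np.random.randn(F, C)  # the two weight matrices
features = X @ W1                              # (N, K) @ (K, F) -> (N, F)
logits = features @ W2                         # (N, F) @ (F, C) -> (N, C)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax: rows sum to 1
print(features.shape, logits.shape, probs.shape)  # (8, 10) (8, 3) (8, 3)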
The input data is typically referred to as the input layer, the dot product between the first weight matrix and the input data is referred to as the first hidden layer, and the dot product on the right is usually referred to as the output layer. If you add more layers in between, these will also be referred to as hidden layers.
What is not shown in this figure is what we do between layers: typically there is an activation function, which keeps the values of the feature matrices within a certain range. Also omitted from this representation is the bias term. In this case it would be vectors of size $(F \times 1)$ and $(C \times 1)$, added to the features. Because we will be calculating quite a few derivatives, the bias term has been omitted to keep them as uncluttered as possible.
The cool thing about this drawing is that you have already implemented the operation on the right. The only thing we need to do is: add the operation on the left and figure out how to adapt stochastic gradient descent (SGD) to work on this simple neural network.
Now that we conceptually know what a neural network does, let's dive into the math and derive the update equations. You will see that these make it easy to create a modular implementation of neural networks.
3.2 Deriving the Update Equations (Backpropagation)¶
Now that we have a conceptual understanding of a feedforward neural network, let's dive into the mathematics behind training it. We will use stochastic gradient descent (SGD) to update the weights of our network. To do this, we need to calculate the gradients of the loss function with respect to each weight matrix. At each operation we will also calculate the gradient with respect to the input of the layer and pass it backward through the network, so that it can be used to update the layers before it as well. This process is known as backpropagation.
We will consider a two-layer neural network with an input layer, one hidden layer, and an output layer as was shown in the figure above.
NOTE: As mentioned before, the bias term in these calculations has been omitted for simplicity.
Our network can be represented by the following equations:
Input to Hidden Layer: $$Z_{hidden} = X \cdot W_{hidden}$$
where $X$ is the input data ($N \times K$), $W_{hidden}$ is the weight matrix for the input layer ($K \times F$), and $Z_{hidden}$ is the pre-activation output of the hidden layer ($N \times F$).
Activation Function: $$A_{hidden} = \sigma(Z_{hidden})$$
where $\sigma$ is the sigmoid activation function (the same one we used in logistic regression!) and $A_{hidden}$ is the activated output of the hidden layer ($N \times F$).
Hidden to Output Layer: $$Z_{output} = A_{hidden} \cdot W_{output}$$
where $W_{output}$ is the weight matrix for the output layer ($F \times C$), and $Z_{output}$ is the pre-activation output of the output layer ($N \times C$).
Output Function: $$\hat{Y} = \text{softmax}(Z_{output})$$
where $\hat{Y}$ is the predicted probabilities for each class ($N \times C$).
We will use the cross-entropy loss function (over a batch of data, so we average over each data point), $\mathcal{L}_{CE}$. Recall from lesson 1 that it is defined as:
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C Y_{i,c} \log(\hat{Y}_{i,c})$$where $Y$ are the true labels ($N \times C$), in one-hot encoding.
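As a quick sanity check of this formula, here is the loss computed directly in NumPy (the two-example batch below is made up for illustration):

import numpy as np

Y = np.array([[1., 0., 0.], [0., 1., 0.]])             # one-hot labels, N=2, C=3
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # predicted probabilities
loss = -np.mean(np.sum(Y * np.log(Y_hat), axis=1))     # average over the batch
print(loss)  # -(log 0.7 + log 0.8) / 2, roughly 0.290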
Our goal is to find the gradients with respect to the output weights $\frac{\partial L}{\partial W_{output}}$ and with respect to the hidden weights $\frac{\partial L}{\partial W_{hidden}}$. We will use the chain rule to calculate these gradients, working backward from the loss function.
Gradient with Respect to $W_{output}$¶
To find $\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}}$, we apply the chain rule up until that point:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}} = \frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} \cdot \frac{\partial Z_{output}}{\partial W_{output}}$$We know that $$Z_{output} = A_{hidden} \cdot W_{output}$$
Therefore, $$\frac{\partial Z_{output}}{\partial W_{output}} = A_{hidden}^T$$
The derivative of the cross-entropy loss with respect to the pre-activation output of the softmax layer ($\frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}}$) is a result that we recall from previous derivations in lesson 1:
$$\frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} = \frac{1}{N} (\hat{Y} - Y)$$Therefore, the gradient with respect to $W_{output}$ is:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}} = \frac{1}{N} A_{hidden}^T \cdot (\hat{Y} - Y)$$
Because we are now working with batches, the cross entropy loss is averaged.
Gradient with Respect to $W_{hidden}$¶
To find $\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}}$, we apply the chain rule up to one level deeper in the network:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}} = \frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} \cdot \frac{\partial Z_{output}}{\partial A_{hidden}} \cdot \frac{\partial A_{hidden}}{\partial Z_{hidden}} \cdot \frac{\partial Z_{hidden}}{\partial W_{hidden}}$$Notice for later that the first part of this chain rule is the same as when we calculated the update for $\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}}$. So we don't need to repeat ourselves and re-derive $\frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}}$ or the softmax derivative.
Instead of taking the derivative with respect to the weight matrix in the output layer (where $W_{output}$ lives), we are now of course taking the derivative with respect to the weights in the hidden layer. So we need to calculate:
$$\frac{\partial Z_{output}}{\partial A_{hidden}} = W_{output}^T$$Next, we need to add the derivative of the sigmoid, which we derived earlier:
$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$Therefore,
$$\frac{\partial A_{hidden}}{\partial Z_{hidden}} = \sigma'(Z_{hidden}) = A_{hidden} \odot (1 - A_{hidden})$$Where $\odot$ denotes the element-wise (Hadamard) product: you simply multiply each element in one matrix with the element at the corresponding index in the other (in NumPy, this is just A * B).
From $Z_{hidden} = X \cdot W_{hidden}$, we have:
$$\frac{\partial Z_{hidden}}{\partial W_{hidden}} = X^T$$
Combining these terms, the gradient with respect to $W_{hidden}$ is:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}} = \frac{1}{N} X^T \cdot \left( (\hat{Y} - Y) \cdot W_{output}^T \odot (A_{hidden} \odot (1 - A_{hidden})) \right)$$
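If you want to convince yourself this formula is correct, a finite-difference check is a good habit whenever you derive gradients by hand. The sketch below (all names are illustrative; the sigmoid, softmax, and loss are exactly as defined above) perturbs a single entry of $W_{hidden}$ and compares the numerical gradient against the analytical one:

import numpy as np

def forward_loss(X, Y, W_h, W_o):
    A = 1. / (1. + np.exp(-(X @ W_h)))                        # sigmoid hidden activations
    Z = A @ W_o                                               # logits
    Y_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # softmax
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1)), A, Y_hat

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(8, 4)), np.eye(3)[rng.integers(0, 3, 8)]   # N=8, K=4, C=3, one-hot Y
W_h, W_o = 0.1 * rng.normal(size=(4, 10)), 0.1 * rng.normal(size=(10, 3))

loss, A, Y_hat = forward_loss(X, Y, W_h, W_o)
grad_analytical = X.T @ ((((Y_hat - Y) / 8) @ W_o.T) * A * (1 - A))  # the formula above, N=8

i, j, eps = 2, 5, 1e-6                                        # perturb one weight entry
W_h_eps = W_h.copy()
W_h_eps[i, j] += eps
grad_numerical = (forward_loss(X, Y, W_h_eps, W_o)[0] - loss) / eps
print(grad_analytical[i, j], grad_numerical)                  # these should agree closely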
This is quite overwhelming to do for each layer, and if you feel overwhelmed I don't blame you; I was too, the first time I had to derive this.
Fortunately, we can make our lives easier, because the neural network is essentially broken up into repeating modules. If we know the derivative of each component, we just need to fold it into the update as we move our gradients backward from the output to the input of the network. Hence the name: backpropagation.
Modular implementation¶
Look closely at the chain rules for $W_{output}$ and $W_{hidden}$: the first terms match, and the later ones differ. So we can pass the shared part back through the network and avoid re-calculating it. If the network had more hidden layers, you would see even more repeating parts in your chain-rule derivative.
Let's explicitly derive the modular updates. Let the derivative of the error and the softmax function at the output layer be:
$$\delta_{output} = \frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} = \frac{1}{N} (\hat{Y} - Y)$$
To update $W_{output}$, we simply multiply it with $A_{hidden}^T$ and use our SGD update rule!
Now, we want to pass the error back one layer in the network, so that the layer deeper in the network can also update its weights without having to recalculate the derivative of $\mathcal{L}_{CE}$. We let the layer that we are currently at pass the term $\delta_{output} \cdot W_{output}^T$, which we will refer to as $\delta_{hidden\_pre\_activation}$, back one level (to the sigmoid activation).
Then at the activation we multiply it with the derivative of the sigmoid. So what we get is:
$$\delta_{hidden} = \delta_{hidden\_pre\_activation} \odot (A_{hidden} \odot (1 - A_{hidden}))$$
Then, $\delta_{hidden}$ is passed back to the hidden layer. We know the derivative w.r.t. the weight matrix in the hidden layer is $X^{T}$. So, the gradient with respect to $W_{hidden}$ can be written modularly as:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}} = X^T \cdot \delta_{hidden}$$This modular form is much cleaner and easier to implement in code. It shows that the gradient for a weight matrix is the product of the transpose of the input to that layer and the error propagated back to the output of that layer's activation function.
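In code, this modularity means the whole backward pass is just a loop: each module receives a delta, uses it to update its own parameters (if it has any), and hands the transformed delta to the module before it. A minimal sketch (the backpropagate method name anticipates the classes we implement below):

def backward(modules, delta):
    # walk the modules from output to input, passing the delta back
    for module in reversed(modules):
        delta = module.backpropagate(delta)
    return delta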
These are the core update equations we will use to implement our feedforward neural network using NumPy. We will use these gradients to update the weights using SGD:
$W_{new} = W_{old} - \alpha \frac{\partial \mathcal{L}_{CE}}{\partial W}$
where, $\alpha$ is the learning rate.
Implementing a Feedforward Neural Network in NumPy¶
Now that we have had a look at the theory, we can start implementing the neural network. It may look intimidating, but it's much more straightforward than you think it is. Just a little tedious, but you'll be the better for it, I promise!
Do the exercises below. If in the code you see hints for later exercises, you may ignore them for now.
Exercise 3.1 : Forward Pass¶
In this exercise you are going to implement the forward pass of the FFN. Make sure to implement the indicated methods/attributes in the FFNLayer, Sigmoid, Softmax, and Cross Entropy classes.
HINT: If you want to test this, you can randomly initialize a matrix and see if the output dimensions match by just running the module!
Exercise 3.2: Backward Pass¶
Each class mentioned in the previous exercise has a backpropagate method (and, in the case of the layer, an update_weights method). Implement these.
Exercise 3.3: Training Loop¶
Implement the training loop, see hints in the exercise!
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
from typing import Tuple
%matplotlib inline
#### Feedforward Layer
class FFNLayer():
    def __init__(self, in_dims: int, out_dims: int, initialization: str = 'normal'):
        self.in_dims = in_dims
        self.out_dims = out_dims
        # EXERCISE 3.1 Initialize weights here, just sample a matrix from a normal distribution
        if initialization == 'normal':
            self.W = np.random.randn(in_dims, out_dims)
        elif initialization == 'kaiming':
            # EXERCISE 3.6. Implement Kaiming He weight initialization here!
            # fan_in variant: scale by sqrt(2 / in_dims)
            std_dev = np.sqrt(2.0 / in_dims)
            self.W = np.random.randn(in_dims, out_dims) * std_dev
        self._learning_rate = 1.0

    @property
    def learning_rate(self) -> float:
        return self._learning_rate

    @learning_rate.setter
    def learning_rate(self, value: float) -> None:
        self._learning_rate = value

    def __call__(self, X: np.ndarray) -> np.ndarray:
        # EXERCISE 3.1 Implement Forward pass here
        self.X = X  # cache the input: we need it for the weight update
        out = X @ self.W
        return out

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        # EXERCISE 3.2 Implement Backward Pass Here
        # HINT: Don't forget to calculate the delta_next here,
        # otherwise other layers won't be updated!
        # Delta next is the derivative wrt the input of this layer.
        # NOTE: compute it *before* the weight update, so it uses the
        # same weights that produced the forward pass.
        delta_next = delta @ self.W.T
        self.update_weights(delta)
        return delta_next

    def update_weights(self, delta: np.ndarray) -> None:
        # EXERCISE 3.2 Implement weight update here!
        update = self.X.T @ delta
        self.W = self.W - self._learning_rate * update
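Following the hint in Exercise 3.1, a quick way to test the layer is to push a random batch through it and check that the output shape is what you expect (the sizes here are arbitrary):

layer = FFNLayer(in_dims=4, out_dims=10)
out = layer(np.random.randn(8, 4))  # a batch of 8 examples with 4 attributes each
print(out.shape)                    # expected: (8, 10)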
### Activation functions
class Sigmoid():
    def __init__(self):
        self.A = None

    def __call__(self, X) -> np.ndarray:
        # EXERCISE 3.1 Implement forward pass here
        self.A = 1. / (1. + np.exp(-X))  # cache the activation for the backward pass
        return self.A

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        # EXERCISE 3.2 Implement backprop here
        return delta * self.A * (1 - self.A)

class SoftMax():
    def __init__(self):
        self.X = None

    def __call__(self, X: np.ndarray):
        # EXERCISE 3.1 Implement softmax forward pass here!
        return np.exp(X) / np.sum(np.exp(X), axis=1, keepdims=True)

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        # EXERCISE 3.2 Implement backprop here
        # HINT: For now we assume the only loss function that we are using is the cross entropy!
        # So no need to overcomplicate it: the combined softmax + cross-entropy
        # derivative is already computed in CrossEntropy.backpropagate.
        return delta

class CrossEntropy():
    def __init__(self, epsilon: float = 1e-8):
        self.epsilon = epsilon  # avoids log(0)
        self.Y_true = None

    def __call__(self, Y_True: np.ndarray, Y_pred: np.ndarray) -> np.ndarray:
        # EXERCISE 3.1 Implement Forward pass/ error calculation here
        self.Y_true = Y_True
        # NOTE: this is the loss summed over the batch; the 1/N factor from
        # the derivation is effectively absorbed into the learning rate.
        return -np.sum(Y_True * np.log(Y_pred + self.epsilon))

    def backpropagate(self, Y_pred: np.ndarray) -> np.ndarray:
        # EXERCISE 3.2: Implement derivative of loss here
        return Y_pred - self.Y_true
class ModuleList():
    def __init__(self):
        self.layers = []

    def set_learning_rate(self, lr: float = 0.01):
        for layer in self.layers:
            if hasattr(layer, 'learning_rate'):
                layer.learning_rate = lr

    def set_training(self):
        for layer in self.layers:
            if hasattr(layer, 'training'):
                layer.training = True

    def set_evaluation(self):
        for layer in self.layers:
            if hasattr(layer, 'training'):
                layer.training = False

    def add(self, layer):
        self.layers.append(layer)

    def __call__(self, X: np.ndarray) -> np.ndarray:
        # NOTE: This is how the forward pass will be called
        # for all of the modules.
        for layer in self.layers:
            X = layer(X)
        return X

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        for layer in reversed(self.layers):
            delta = layer.backpropagate(delta)
        return delta
class DataLoader():
    def __init__(self, batch_size=8, X: np.ndarray = None, y: np.ndarray = None, flatten: bool = False):
        '''
        Simple DataLoader, PyTorch style!
        Don't modify anything here!
        '''
        self.X = X
        if flatten:
            self.X = X.reshape(X.shape[0], -1)
        self.y = y
        self.shuffle_data()
        self.batch_size = batch_size
        self.start_index = 0
        self.end_index = self.batch_size

    def shuffle_data(self):
        indices = np.arange(self.X.shape[0])
        np.random.shuffle(indices)
        self.X = self.X[indices]
        self.y = self.y[indices]

    def __len__(self):
        return len(self.X)

    def __iter__(self):
        return self

    def __next__(self):
        # reset indices and shuffle data
        if self.end_index > self.X.shape[0]:
            self.start_index = 0
            self.end_index = self.batch_size
            self.shuffle_data()
            raise StopIteration
        next_X = self.X[self.start_index:self.end_index]
        next_y = self.y[self.start_index:self.end_index]
        # Ensure indices are updated
        self.start_index = self.end_index
        self.end_index += self.batch_size
        return next_X, next_y
class DataLoaderFactory():
    def __init__(self, batch_size: int = 8) -> None:
        '''
        Simple Dataloader, PyTorch style!
        Don't modify anything here!
        '''
        self.X, self.y = datasets.load_iris(return_X_y=True)
        # normalize the features
        self.X = (self.X - np.mean(self.X, axis=0)) / np.std(self.X, axis=0)
        # split X in X_train and X_validate as well as y
        self.X_train = self.X[:120]
        self.X_validate = self.X[120:]
        self.y_train_sparse = self.y[:120]
        self.y_validate_sparse = self.y[120:]
        # convert y_train and y_validate into one_hot matrix
        self.y_train = np.zeros((self.y_train_sparse.shape[0], 3))
        self.y_train[np.arange(self.y_train_sparse.shape[0]), self.y_train_sparse] = 1
        self.y_validate = np.zeros((self.y_validate_sparse.shape[0], 3))
        self.y_validate[np.arange(self.y_validate_sparse.shape[0]), self.y_validate_sparse] = 1
        self.batch_size = batch_size
        self.train_dataset = DataLoader(self.batch_size, self.X_train, self.y_train)
        self.validation_dataset = DataLoader(self.batch_size, self.X_validate, self.y_validate)
        self.len_train = len(self.train_dataset)
        self.len_validate = len(self.validation_dataset)

    def get_validation_dataset(self):
        return self.validation_dataset

    def get_train_dataset(self):
        return self.train_dataset
# Standard neural network: This will be our baseline
module_list = ModuleList()
module_list.add(FFNLayer(4, 10))
module_list.add(Sigmoid())
module_list.add(FFNLayer(10, 3))
module_list.add(SoftMax())
loss_fn = CrossEntropy()
def train_model(module_list, loss_fn, epochs: int = 200, learning_rate: float = 0.01, dataloader_factory: DataLoaderFactory = DataLoaderFactory, batch_size: int = 8):
    # set the learning rate
    module_list.set_learning_rate(learning_rate)
    # Get the dataloaders
    loader_factory = dataloader_factory(batch_size=batch_size)
    train_loader = loader_factory.get_train_dataset()
    validation_loader = loader_factory.get_validation_dataset()
    train_loss_history = []
    validation_loss_history = []
    for epoch in range(epochs):
        avg_loss = 0
        module_list.set_training()
        for data, y_target in train_loader:
            # Exercise 3.3: Implement the training update here!
            # forward pass, get the predicted y
            y_pred = module_list(data)
            # calculate the entropy loss! name it "loss"
            loss = loss_fn(y_target, y_pred)
            # calculate the delta, from the loss function!
            delta = loss_fn.backpropagate(y_pred)
            # backprop!
            module_list.backpropagate(delta)
            ### END YOUR code
            num_batches = len(train_loader) / train_loader.batch_size
            avg_loss += loss / num_batches
        # validate
        avg_val_loss = 0
        module_list.set_evaluation()
        for data, y_target in validation_loader:
            y_pred = module_list(data)
            loss = loss_fn(y_target, y_pred)
            num_batches = len(validation_loader) / validation_loader.batch_size
            avg_val_loss += loss / num_batches
        validation_loss_history.append(float(avg_val_loss))
        print(f'Epoch: {epoch} \t average loss: {avg_loss} \t validation loss: {avg_val_loss}')
        train_loss_history.append(float(avg_loss))
    return train_loss_history, validation_loss_history
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 200, learning_rate = 0.01)
Epoch: 0 	 average loss: 12.876451251592778 	 validation loss: 13.848139633140521
Epoch: 1 	 average loss: 8.428405148365744 	 validation loss: 11.699000985523965
Epoch: 2 	 average loss: 6.548130182924329 	 validation loss: 10.17205907226144
...
Epoch: 198 	 average loss: 0.6200292705850267 	 validation loss: 1.9747945283697077
Epoch: 199 	 average loss: 0.619541825245869 	 validation loss: 1.8940195339202466
def plot_train_validation_losses(train_loss_history, validation_loss_history):
    indices = np.arange(len(train_loss_history))
    plt.plot(indices, train_loss_history, label='train loss')
    plt.plot(indices, validation_loss_history, label='validation loss')
    plt.legend()
    plt.title('Train and Validation loss for your neural network')
    plt.show()
plot_train_validation_losses(train_loss_history, validation_loss_history)
3.3 Activation functions¶
In the previous exercise, we used the sigmoid activation function. We are going to have a look at several different activation functions that are specifically designed for deeper neural networks.
Many activation functions have been designed for increasingly deep neural networks. In the before times, sigmoid (and tanh) activation functions were used in deep neural networks; these had several problems, however:
Vanishing Gradients: This is the most significant problem. Sigmoid and Tanh saturate for large values, causing their gradients to become near zero. This severely hinders the learning of earlier layers in deep networks during backpropagation.
Slow Convergence: Due to vanishing gradients, models using Sigmoid or Tanh can converge much slower, especially on deep architectures.
Computational Cost: The exponential function in Sigmoid and Tanh is computationally more expensive than the alternatives.
Non-Zero Centered Output (Sigmoid): Sigmoid's output is always positive. This can lead to suboptimal gradient updates in subsequent layers.
Alternative activation functions that suffer less from these problems are ReLU, Leaky ReLU, and ELU. These are discussed next.
Rectified Linear Unit (ReLU) Activation Function¶
The Rectified Linear Unit (ReLU) is one of the most popular activation functions used in deep learning. It's computationally efficient and has helped address issues like the vanishing gradient problem that can occur with sigmoid and tanh functions.
The ReLU function is defined as:
$$ \text{ReLU}(z) = \max(0, z) $$This means that if the input $z$ is positive, the output is $z$. If the input $z$ is zero or negative, the output is 0. The derivative of the ReLU function is:
$$ \frac{d}{dz} \text{ReLU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases} $$This means the gradient is 1 for any positive input and 0 for any non-positive input. This simple derivative is one of the reasons ReLU is computationally efficient and helps combat the vanishing gradient problem.
Leaky Rectified Linear Unit (Leaky ReLU) Activation Function¶
The Leaky Rectified Linear Unit (Leaky ReLU) is another variation of the ReLU activation function designed to address the "dying ReLU" problem. While ReLU outputs zero for all negative inputs, which can cause neurons to become inactive and stop learning, Leaky ReLU allows a small, non-zero gradient when the input is negative.
The Leaky ReLU function is defined as:
$$ \text{Leaky ReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \le 0 \end{cases} $$where $\alpha$ is a small positive constant, typically around 0.01.
The derivative of the Leaky ReLU function is:
$$ \frac{d}{dz} \text{Leaky ReLU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \le 0 \end{cases} $$For positive inputs, the derivative is 1, just like ReLU. For non-positive inputs, the derivative is $\alpha$. This small non-zero gradient ensures that neurons can still learn even when their input is negative, preventing them from becoming permanently inactive.
Exponential Linear Unit (ELU) Activation Function¶
The Exponential Linear Unit (ELU) is another popular activation function that aims to combine the benefits of ReLU while addressing some of its drawbacks. It tends to produce negative outputs for negative inputs, which can help push the mean of activations closer to zero, potentially leading to faster learning.
The ELU function is defined as:
$$ \text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha (e^z - 1) & \text{if } z \le 0 \end{cases} $$where $\alpha$ is a hyperparameter, usually set to 1.
The derivative of the ELU function is:
$$ \frac{d}{dz} \text{ELU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha e^z & \text{if } z \le 0 \end{cases} $$For $z > 0$, the derivative is 1, similar to ReLU. For $z \le 0$, the derivative is $\alpha e^z$. This derivative is non-zero for negative inputs, which helps mitigate the "dying ReLU" problem (where neurons can become inactive and stop learning).
Run the cell below to see what these functions look like!
import numpy as np
import matplotlib.pyplot as plt
# Generate input values
z = np.linspace(-5, 5, 100)
# ReLU function
def relu(z):
    return np.maximum(0, z)

# ELU function
def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

# Leaky ReLU function
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

# Create subplots (2 rows, 2 columns)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Plot ReLU on the first subplot (top-left)
axes[0, 0].plot(z, relu(z), label='ReLU')
axes[0, 0].set_title('ReLU Activation Function')
axes[0, 0].set_xlabel('Input (z)')
axes[0, 0].set_ylabel('Output (ReLU(z))')
axes[0, 0].grid(True)
axes[0, 0].legend()

# Plot ELU with different alpha values on the second subplot (top-right)
alphas_elu = [0.5, 1.0, 2.0]
for alpha in alphas_elu:
    axes[0, 1].plot(z, elu(z, alpha), label=f'ELU (α={alpha})')
axes[0, 1].set_title('ELU Activation Function with Different Alpha Values')
axes[0, 1].set_xlabel('Input (z)')
axes[0, 1].set_ylabel('Output (ELU(z))')
axes[0, 1].grid(True)
axes[0, 1].legend()

# Plot Leaky ReLU with different alpha values on the third subplot (bottom-left)
alphas_leaky_relu = [0.01, 0.1, 0.2]
for alpha in alphas_leaky_relu:
    axes[1, 0].plot(z, leaky_relu(z, alpha), label=f'Leaky ReLU (α={alpha})')
axes[1, 0].set_title('Leaky ReLU Activation Function with Different Alpha Values')
axes[1, 0].set_xlabel('Input (z)')
axes[1, 0].set_ylabel('Output (Leaky ReLU(z))')
axes[1, 0].grid(True)
axes[1, 0].legend()

# Hide the fourth subplot as it's not used
fig.delaxes(axes[1, 1])
plt.tight_layout()  # Adjust layout to prevent overlap
plt.show()
Exercise 3.4 Implementing ReLU, leaky-ReLU, and ELU¶
In the below class definitions, implement the forward and backward functions.
HINT: You can already see how the forward function is implemented in the code cell above, keep it a secret, don't tell your fellow classmates.
class ReLU():
    def __init__(self):
        self.X = None

    def __call__(self, X: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        self.X = X
        return np.maximum(0, X)

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        # gradient is 1 where the input was positive, 0 elsewhere
        binarized_matrix = np.where(self.X > 0, 1, 0)
        return delta * binarized_matrix

class LeakyReLU():
    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
        self.X = None

    def __call__(self, X: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        self.X = X
        return np.where(self.X > 0, self.X, self.alpha * self.X)

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        # gradient is 1 where the input was positive, alpha elsewhere
        binarized_matrix = np.where(self.X > 0, 1, self.alpha)
        return delta * binarized_matrix

class ELU():
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.X = None

    def __call__(self, X: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        self.X = X
        return np.where(self.X > 0, self.X, self.alpha * (np.exp(self.X) - 1))

    def backpropagate(self, delta: np.ndarray) -> np.ndarray:
        ## EXERCISE 3.4
        # gradient is 1 where the input was positive, alpha * exp(x) elsewhere
        binarized_matrix = np.where(self.X > 0, 1, self.alpha * np.exp(self.X))
        return delta * binarized_matrix
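As with the layer gradients earlier, you can sanity-check these backpropagate implementations against a finite difference. A small sketch for ReLU (the same pattern works for LeakyReLU and ELU; the test values are arbitrary):

act, eps = ReLU(), 1e-6
x = np.array([[-1.0, 0.5, 2.0]])
act(x)                                                  # forward pass caches the input
analytical = act.backpropagate(np.ones_like(x))         # propagate a delta of ones
numerical = (act(x + eps) - act(x - eps)) / (2 * eps)   # central difference
print(analytical, numerical)                            # both ~ [[0. 1. 1.]]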
Exercise 3.5: Run the network with the activation functions¶
Run the network with the activation functions that you have just implemented on the MNIST digit dataset. The MNIST dataset consists of small images of handwritten digits; in our case we flatten them and feed them into our neural network.
Here is example data from MNIST:
Source: https://www.geeksforgeeks.org/machine-learning/mnist-dataset/
Try running the network for 20 epochs on Sigmoid, ReLU, LeakyReLU, and ELU activation functions.
HINT: Right now, ReLU, LeakyReLU, and ELU should all return NaNs. It's a sign you are on the right track, but not a guarantee. We will see next why this is!
# Install the MNIST dataset
pip install mnist_datasets
Collecting mnist_datasets Downloading mnist_datasets-0.12-py3-none-any.whl.metadata (4.8 kB) Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from mnist_datasets) (2.0.2) Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from mnist_datasets) (4.67.1) Downloading mnist_datasets-0.12-py3-none-any.whl (6.7 kB) Installing collected packages: mnist_datasets Successfully installed mnist_datasets-0.12
from mnist_datasets import MNISTLoader
# class for the MNIST dataloader. DO NOT CHANGE THIS.
class MNISTDataLoaderFactory():
    def __init__(self, batch_size: int = 8) -> None:
        '''
        Simple Dataloader, PyTorch style!
        Don't modify anything here!
        '''
        train_X, train_y = MNISTLoader().load()
        validate_X, validate_y = MNISTLoader().load(train=False)
        # Calculate mean and standard deviation
        mean_train = np.mean(train_X, axis=0)
        std_train = np.std(train_X)
        mean_validate = np.mean(validate_X, axis=0)
        std_validate = np.std(validate_X)
        # normalize the features
        train_X = (train_X - mean_train) / std_train
        validate_X = (validate_X - mean_validate) / std_validate
        # convert y_train and y_validate into one_hot matrix
        train_y = np.array(train_y).astype(int)
        validate_y = np.array(validate_y).astype(int)
        train_y_one_hot = np.zeros((train_y.shape[0], 10))
        train_y_one_hot[np.arange(train_y.shape[0]), train_y] = 1.
        validate_y_one_hot = np.zeros((validate_y.shape[0], 10))
        validate_y_one_hot[np.arange(validate_y.shape[0]), validate_y] = 1.
        self.batch_size = batch_size
        self.train_dataset = DataLoader(self.batch_size, train_X, train_y_one_hot, flatten=True)
        self.validation_dataset = DataLoader(self.batch_size, validate_X, validate_y_one_hot, flatten=True)
        self.len_train = len(self.train_dataset)
        self.len_validate = len(self.validation_dataset)

    def get_validation_dataset(self):
        return self.validation_dataset

    def get_train_dataset(self):
        return self.train_dataset
def mnist_model(activation='sigmoid', initialization='normal') -> Tuple[ModuleList, CrossEntropy]:
    '''
    MNIST neural network model.
    For exercise 3.5, only change the activation function.
    For exercise 3.6, change the initialization after you have modified it in the FFNLayer class!
    '''
    if activation == 'sigmoid':
        activation_fn = Sigmoid
    elif activation == 'relu':
        activation_fn = ReLU
    elif activation == 'leaky_relu':
        activation_fn = LeakyReLU
    elif activation == 'elu':
        activation_fn = ELU
    else:
        raise ValueError(f'Unknown activation function: {activation}, pick one from: sigmoid, relu, leaky_relu, elu')
    module_list = ModuleList()
    module_list.add(FFNLayer(784, 20, initialization=initialization))
    module_list.add(activation_fn())
    module_list.add(FFNLayer(20, 20, initialization=initialization))
    module_list.add(activation_fn())
    module_list.add(FFNLayer(20, 10, initialization=initialization))
    module_list.add(SoftMax())
    loss_fn = CrossEntropy()
    return module_list, loss_fn
# module_list, loss_fn = mnist_model(activation = 'relu', initialization = 'kaiming')
# train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 10, learning_rate = 0.001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
So why the NaNs?¶
So what is going on: why are the losses turning into NaNs, in spite of us using "advanced activations" that are supposed to solve vanishing gradients? While ReLU, ELU, and LeakyReLU address vanishing gradients, they have a different challenge: their positive side is unbounded. With standard normal initialization, the weights can cause the outputs of these activation functions (the features) to grow very large as data passes through the layers. This phenomenon is sometimes referred to as exploding activations or exploding values.
When these large values reach the final layer and are fed into the Softmax function, the exponentiation step (np.exp(X)) can produce numbers too large to be represented by your computer's floating-point precision, leading to overflow. This overflow often produces inf (infinity), and subsequent calculations within the Softmax or the Cross-Entropy loss (like inf / inf) result in NaN (Not a Number) values, causing training to fail.
There are a variety of ways in which we can deal with this:
- Better Weight Initialization: Because the features are repeatedly multiplied by the weights, their magnitudes keep compounding. So we want to initialize the weights in a smarter way.
- Numerically Stable Softmax: We could implement a numerically more stable version of the softmax (see the sketch after this list).
- Gradient Clipping: We could clip the gradients during backpropagation, such that they are only allowed to update the weights by so much each iteration.
- Smaller Learning Rate: Another way of restricting the size of the weight updates each iteration.
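The second bullet is worth seeing concretely. Subtracting the row-wise maximum from the logits before exponentiating leaves the softmax output unchanged (softmax is invariant to adding a constant to every logit), but it guarantees the exponents are at most zero, so np.exp cannot overflow. A minimal sketch:

def stable_softmax(X):
    shifted = X - np.max(X, axis=1, keepdims=True)  # largest logit per row becomes 0
    exp = np.exp(shifted)                           # exponents are <= 0: no overflow
    return exp / np.sum(exp, axis=1, keepdims=True)

logits = np.array([[1000., 1001., 1002.]])          # the naive softmax overflows here
print(stable_softmax(logits))                       # ~[[0.090 0.245 0.665]]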
We will implement a technique that addresses the first of these: smarter weight initialization. This technique is described below!
Weight Initialization: Kaiming (He)¶
Introduced by Kaiming He et al. in 2015, Kaiming initialization was specifically designed for layers with ReLU and its variants (like Leaky ReLU). Unlike Sigmoid and Tanh, ReLU's mean is not zero, and it sets negative inputs to zero, which affects the variance of the activations.
Kaiming initialization accounts for the fact that ReLU "kills" half of the neurons (on average) by outputting zero for negative inputs. It aims to maintain the variance of the activations by scaling the weights appropriately.
The variance of the weights is typically set based on the number of input neurons (fan-in) or output neurons (fan-out), often using fan-in:
$$ \text{Var}(W) = \frac{2}{\text{fan\_in}} $$or using fan-out:
$$ \text{Var}(W) = \frac{2}{\text{fan\_out}} $$The weights are typically initialized by drawing from:
A normal distribution with mean 0 and standard deviation $\sigma = \sqrt{\frac{2}{\text{fan\_in}}}$ (or $\sqrt{\frac{2}{\text{fan\_out}}}$).
$$W \sim N\left(0, \sqrt{\frac{2}{\text{fan\_in}}}\right)$$A uniform distribution in the range $[-\text{limit}, \text{limit}]$, where $\text{limit} = \sqrt{\frac{6}{\text{fan\_in}}}$ (or $\sqrt{\frac{6}{\text{fan\_out}}}$).
$$ W \sim U\left[-\sqrt{\frac{6}{\text{fan\_in}}}, \sqrt{\frac{6}{\text{fan\_in}}}\right] $$
Kaiming initialization is generally the preferred method when using ReLU or Leaky ReLU activation functions, as it specifically addresses the change in variance caused by these activations.
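A quick experiment shows why this matters. Pushing a batch through a stack of ReLU layers, the activation magnitudes explode under plain standard-normal initialization but stay stable under Kaiming initialization (the width, depth, and seed below are arbitrary):

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 256))
for name, scale in [('normal', 1.0), ('kaiming', np.sqrt(2.0 / 256))]:
    h = X
    for _ in range(10):                          # 10 stacked ReLU layers
        W = rng.normal(size=(256, 256)) * scale  # fan_in = 256
        h = np.maximum(0, h @ W)
    print(name, np.std(h))  # 'normal' grows by orders of magnitude; 'kaiming' stays near 1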
Exercise 3.6: Implementing He initialization¶
We are going to implement He initialization in our FFNLayer class so we can actually use the activations that we have just implemented.
The only thing that you have to implement is the Kaiming (He) initialization in the FFNLayer class, using the fan_in method. See the comment in the class and don't forget to rerun the notebook cell!
Once you have implemented this, swap the initialization parameter in the mnist_model function below to 'kaiming'. Then change the learning rate to 0.0001 and run the model with ReLU, ELU, and LeakyReLU for 100 epochs.
What do you notice now?
module_list, loss_fn = mnist_model(activation = 'relu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 50, learning_rate = 0.0001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
module_list, loss_fn = mnist_model(activation = 'elu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 100, learning_rate = 0.0001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
What you may have noticed is that training a neural network with one of the ReLU-family activation functions in combination with Kaiming initialization produces the best results.
We have worked on a shallow neural network so far. Another technique that helps train bigger neural networks is to normalize the features between layers. We will discuss a method for doing this next.
3.4 Batch Normalization¶
Batch Normalization (BatchNorm) is a technique introduced to address challenges in training deep neural networks. Initially, it was hypothesized that BatchNorm primarily helps by reducing Internal Covariate Shift, the change in the distribution of network activations during training. However, more recent research, notably the paper "How Does Batch Normalization Help Optimization?" by Santurkar et al., has suggested that while BatchNorm might not eliminate Internal Covariate Shift, its main benefits stem from other factors that make the optimization process smoother and more efficient.
Instead of primarily reducing Internal Covariate Shift, Santurkar et al. argue that Batch Normalization helps by:
- Making the optimization landscape smoother: This allows gradients to be more predictable and less noisy, enabling the use of higher learning rates and faster convergence.
- Reducing the dependence of gradients on the scale of parameters: This makes the training less sensitive to the initial values of the weights.
Regardless of the exact mechanism, Batch Normalization has proven to be a highly effective technique for stabilizing and accelerating the training of deep neural networks.
For a mini-batch of size $m$, let $x_i$ be the input to a neuron in a layer for the $i$-th example in the batch. Batch Normalization performs the following transformation for each neuron's input:
Calculate the mini-batch mean: $$ \mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \quad (\text{BN.1}) $$ This is the average of the inputs for a single neuron across the entire mini-batch. In NumPy, you would calculate the mean along the batch dimension.
Calculate the mini-batch variance: $$\sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 \quad \quad (\text{BN.2})$$
This is the variance of the inputs for a single neuron across the mini-batch. In NumPy, you would calculate the variance along the batch dimension.
Normalize the input:
$$\hat{x_i} = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \quad \quad (\text{BN.3})$$Here, $\epsilon$ is a small constant added for numerical stability to prevent division by zero in case the mini-batch variance is zero. This step scales the inputs to have zero mean and unit variance.
Scale and Shift: $$ y_i = \gamma \hat{x}_i + \beta \quad \quad (\text{BN.4})$$ This is the final output of the Batch Normalization layer. $\gamma$ (gamma) and $\beta$ (beta) are learnable parameters of the Batch Normalization layer. $\gamma$ scales the normalized input, and $\beta$ shifts it. These parameters allow the network to learn to restore the original representation power if needed, by essentially learning to undo the normalization if that's optimal for the network.
During Training:
- The mini-batch mean $\mu_{\mathcal{B}}$ and variance $\sigma_{\mathcal{B}}^2$ are calculated for each mini-batch.
- The learnable parameters $\gamma$ and $\beta$ are updated using gradient descent.
- We also maintain a running average of the means and variances across all mini-batches to be used during inference.
During Inference (Testing):
- We use the accumulated global mean and variance (calculated as running averages during training) to normalize the inputs.
The formulas for updating the running average are typically:
$$ \text{running\_mean} = \text{momentum} \cdot \text{running\_mean} + (1 - \text{momentum}) \cdot \mu_{\mathcal{B}} \quad \quad (\text{BN.5})$$$$ \text{running\_variance} = \text{momentum} \cdot \text{running\_variance} + (1 - \text{momentum}) \cdot \sigma_{\mathcal{B}}^2 \quad \quad (\text{BN.6})$$where momentum is a hyperparameter (typically close to 1, e.g., 0.9 or 0.99).
Batch Normalization is a powerful technique that leads to:
- Faster Training: It allows for higher learning rates.
- Reduced Dependence on Initialization: The network becomes less sensitive to the initial values of the weights.
- Regularization Effect: The noise in the mini-batch statistics acts as a mild regularizer.
Implementing Batch Normalization involves both the forward pass (normalization and scaling/shifting) and the backward pass (calculating gradients with respect to the inputs, $\gamma$, and $\beta$).
Gradients for Gamma and Beta in Batch Normalization¶
In Batch Normalization, $\gamma$ (gamma) and $\beta$ (beta) are learnable parameters that are updated during training using gradient descent. The update formulas depend on the gradients of the loss function with respect to these parameters ($\frac{\partial L}{\partial \gamma}$ and $\frac{\partial L}{\partial \beta}$).*
Recall the output of the Batch Norm layer for a single example $i$ in a mini-batch: $y_i = \gamma \hat{x}_i + \beta$, where $\hat{x}_i$ is the normalized input.
Here's how to calculate the gradients:
Gradient with Respect to $\beta$:
The cross-entropy loss depends on $y_i$, and $y_i$ depends linearly on $\beta$. Using the chain rule, the gradient of the loss with respect to $\beta$ is the sum, across the mini-batch, of the gradients of the loss with respect to the outputs of the Batch Norm layer ($y_i$):

$$ \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial \beta} $$

Since $\frac{\partial y_i}{\partial \beta} = 1$, this simplifies to:

$$ \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} $$

In terms of the incoming delta ($\delta_{out}$) from the layer above (where $\delta_{out} = \frac{\partial L}{\partial y}$), the gradient is:

$$ \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} (\delta_{out})_i \quad \quad (\text{BN.7})$$
Gradient with Respect to $\gamma$: Using the chain rule for $\gamma$: $$ \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial \gamma} $$
Since $y_i = \gamma \hat{x}_i + \beta$, $\frac{\partial y_i}{\partial \gamma} = \hat{x}_i$.
So the gradient is the sum of the element-wise product of the incoming delta and the normalized input ($\hat{x}_i$) across the mini-batch:

$$ \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \hat{x}_i $$

In terms of the incoming delta ($\delta_{out}$) and the normalized input $\hat{x}_i$:
$$ \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} (\delta_{out})_i \cdot \hat{x}_i \quad \quad (\text{BN.8})$$
These gradients ($\frac{\partial L}{\partial \gamma}$ and $\frac{\partial L}{\partial \beta}$) are then used to update $\gamma$ and $\beta$ using the standard gradient descent rule:
$$ \gamma_{new} = \gamma_{old} - \alpha \frac{\partial L}{\partial \gamma} $$

$$ \beta_{new} = \beta_{old} - \alpha \frac{\partial L}{\partial \beta} $$

where $\alpha$ is the learning rate.
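As a sanity check on these formulas, here is a minimal NumPy sketch; the delta and X_hat arrays are random placeholders standing in for the quantities cached during a real forward and backward pass:

import numpy as np

m, features = 4, 3
rng = np.random.default_rng(0)
delta = rng.normal(size=(m, features))  # hypothetical incoming gradient dL/dy
X_hat = rng.normal(size=(m, features))  # hypothetical cached normalized input

dbeta = delta.sum(axis=0)             # (BN.7): sum of deltas over the batch
dgamma = (delta * X_hat).sum(axis=0)  # (BN.8): sum of delta * x_hat over the batch

# One gradient-descent step on the learnable parameters:
alpha = 0.01
gamma = np.ones(features) - alpha * dgamma
beta = np.zeros(features) - alpha * dbeta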
Exercise 3.7: Implementing Batch Normalization¶
For the next exercise we are going to implement batch normalization. You will implement the forward pass. In the backward pass you only need to implement the updates for Gamma and Beta, the delta (the bit that is propagated further down the network) is already implemented. The equations that you need are already indicated above with $\text{BN}.x$.
If you are curious about how these are derived, you can check out this blog:
*$\mathcal{L}_{CE}$ is referred to as $L$ in this derivation, because LaTeX was having issues.
class BatchNormalization():
def __init__(self, momentum: float = 0.9, epsilon: float = 1e-5):
        '''
        Batch Normalization layer.

        momentum: weighting of the running statistics in (BN.5) and (BN.6).
        epsilon: small constant for numerical stability in (BN.3).
        '''
self.momentum = momentum
self.epsilon = epsilon
self.running_mean = None
self.running_var = None
self.X = None
self.gamma = None
self.beta = None
self._learning_rate = 1.0
self.training = True
@property
def learning_rate(self) -> float:
return self._learning_rate
@learning_rate.setter
def learning_rate(self, value: float) -> None:
self._learning_rate = value
def __call__(self, X: np.ndarray) -> np.ndarray:
self.X = X
self.batch_size = X.shape[0]
        ## EXERCISE 3.7: Implement the forward pass for batch normalization
        # If gamma or beta do not exist yet, initialize them from the shape
        # of X: gamma as a vector of ones and beta as a vector of zeros.
if self.gamma is None or self.beta is None:
self.gamma = np.ones(X.shape[1])
self.beta = np.zeros(X.shape[1])
        # Mean and variance for each feature (while training).
if self.training:
# TODO: calculate the mean and variance
self.mu = np.mean(X, axis=0) # (BN.1)
self.var = np.var(X, axis=0) # (BN.2)
# update the running mean & running var
if self.running_mean is None or self.running_var is None:
                # TODO: if there are no running statistics yet,
                # initialize them from the current batch mean and variance
                self.running_mean = self.mu.copy()
                self.running_var = self.var.copy()
else:
                # TODO: Implement the running mean and running variance updates
self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * self.mu # (BN.5)
self.running_var = self.momentum * self.running_var + (1 - self.momentum) * self.var # (BN.6)
# If not training, use the running mean and variance.
else:
self.mu = self.running_mean
self.var = self.running_var
# Normalize the input (don't forget to add epsilon)
# TODO: Calculate the normalization of the input
self.X_hat = (X - self.mu) / np.sqrt(self.var + self.epsilon) # (BN.3)
# TODO: Scale and shift the input
out = self.gamma * self.X_hat + self.beta # (BN.4)
return out
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
# Now we need to implement the derivative with respect to the input X
# which is pretty tricky at this stage. So that is provided :)
# NO NEED TO IMPLEMENT THE DELTA_OUT UPDATE!
constant = self.gamma / (self.batch_size * np.sqrt(self.var + self.epsilon))
first_term = self.batch_size * delta
        second_term = np.sum(delta, axis=0, keepdims=True)  # sum of incoming deltas over the batch
        third_term = self.X_hat * np.sum(delta * self.X_hat, axis=0, keepdims=True)  # component along X_hat
delta_out = constant * (first_term - second_term - third_term)
        # EXERCISE 3.7: Implement the backprop for gamma and beta here!
        self.gamma = self.gamma - self._learning_rate * np.sum(delta * self.X_hat, axis=0)  # (BN.8)
        self.beta = self.beta - self._learning_rate * np.sum(delta, axis=0)  # (BN.7)
return delta_out
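Before training a full model, you can sanity-check the layer with a quick sketch like the one below (it assumes the BatchNormalization class above and that NumPy is imported as np):

# Feed one batch through the layer in training mode and inspect the output.
bn = BatchNormalization()
X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 10))

out = bn(X)                        # training-mode forward pass
print(out.mean(axis=0).round(3))   # ~0 per feature (beta starts at zeros)
print(out.std(axis=0).round(3))    # ~1 per feature (gamma starts at ones)

bn.training = False                # switch to inference mode
out_eval = bn(X)                   # now normalized with the running statistics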
Exercise 3.8: Testing Batch Normalization¶
Try out your implementation of batch normalization. Just run the function below as is. As you may notice, the batch size is 32 instead of 8. This is because batch normalization estimates the mean and variance of each feature from the mini-batch, so the batches need to be large enough for those estimates to be reliable.
I got the model to train to about:
- average loss: 0.37
- validation loss: 2.68
def mnist_model_bn(activation = 'sigmoid', initialization = 'normal') -> Tuple[ModuleList, CrossEntropy]:
'''
MNIST neural network model.
'''
if activation == 'sigmoid':
activation_fn = Sigmoid
elif activation == 'relu':
activation_fn = ReLU
elif activation == 'leaky_relu':
activation_fn = LeakyReLU
elif activation == 'elu':
activation_fn = ELU
else:
raise ValueError(f'Unknown activation function: {activation}, pick one from: sigmoid, relu, leaky_relu, elu')
module_list = ModuleList()
module_list.add(FFNLayer(784, 100, initialization = initialization))
module_list.add(BatchNormalization())
module_list.add(activation_fn())
module_list.add(FFNLayer(100, 100, initialization = initialization))
module_list.add(BatchNormalization())
module_list.add(activation_fn())
module_list.add(FFNLayer(100, 100, initialization = initialization))
module_list.add(BatchNormalization())
module_list.add(activation_fn())
module_list.add(FFNLayer(100, 10, initialization = initialization))
module_list.add(SoftMax())
loss_fn = CrossEntropy()
return module_list, loss_fn
module_list, loss_fn = mnist_model_bn(activation = 'elu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 100, learning_rate = 0.0005, dataloader_factory = MNISTDataLoaderFactory, batch_size = 32)
# plot_train_validation_losses(train_loss_history, validation_loss_history)
3.5 Other Normalization Techniques¶
While Batch Normalization is widely used and effective, its reliance on mini-batch statistics can be a limitation. Several other normalization techniques have been developed that address this by normalizing across different dimensions, making them less dependent on batch size:
- Layer Normalization: Normalizes across the features of each individual sample. This is particularly useful in recurrent neural networks and transformer models.
- Instance Normalization: Normalizes across the spatial dimensions (height and width) for each sample and each feature channel independently. This has shown success in style transfer tasks.
- Group Normalization: Divides the channels into groups and normalizes the features within each group for each sample. This provides a balance between Batch Normalization and Layer Normalization and works well across a wide range of batch sizes.
These techniques offer alternatives to Batch Normalization when its assumptions or requirements are not met, providing more stable training and better performance in specific scenarios. Though we will not go in depth on these techniques in this tutorial, it is useful to know about them.
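To make the difference in normalization axes concrete, here is a minimal sketch contrasting Batch Normalization and Layer Normalization on the same matrix. This only illustrates the statistics, not a full layer with learnable parameters:

import numpy as np

X = np.random.default_rng(0).normal(size=(8, 16))  # (batch, features)
eps = 1e-5

# Batch Normalization: statistics per feature, shared across the batch (axis=0).
bn_out = (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + eps)

# Layer Normalization: statistics per sample, independent of batch size (axis=1).
ln_out = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + eps)

Because Layer Normalization computes its statistics within each sample, it works even with a batch size of one, which is exactly where Batch Normalization breaks down.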
Concluding remarks¶
We have done a lot in this tutorial. We've understood feedforward neural networks, one of the earliest neural network architectures and a critical building block of many specialized models. We've also implemented activation functions, weight initialization, and batch normalization.
You may pat yourself on the back: many people who apply neural networks today have never implemented one from scratch in NumPy or seen how backpropagation works. Now you have!
As you may have noticed, our architecture for predicting MNIST classes was not optimized for images. In the next tutorial we will implement a convolutional neural network, which for many years was the go-to architecture for image-related tasks. Only in the last few years has a newer architecture emerged that can challenge it under certain conditions.