Welcome back! In the last tutorial we covered regularization and the bias-variance trade-off. We discovered that an overly complex model can essentially memorize the training data and fail to generalize, while a model that is too simple does not learn the underlying complexity of the data.
Important notes¶
- It is fine to consult with colleagues to solve the problems, in fact it is encouraged.
- Please turn off AI tools; we want you to internalize the concepts, not just quickly breeze through the problems. To turn off AI, click the gear in the top right corner, go to AI assistance -> untick "Show AI powered inline completions", untick "Consented to use generative AI features", and tick "Hide Generative AI features".
Lesson 3: Feedforward Neural Networks from Scratch¶
This tutorial teaches you to build a neural network and implement backpropagation from scratch using NumPy—no deep learning framework required.
First, we will explain conceptually what is happening in a feedforward neural network (also known as a universal function approximator), then we will derive the update equations (backpropagation), and finally we will implement our first feedforward neural network in NumPy.
Neural networks are often referred to as universal function approximators. This means that a sufficiently large and complex neural network can approximate any continuous function to an arbitrary degree of accuracy. When machine learning papers mention "estimating a function" or "learning a function," they are typically referring to this capability of neural networks.
In our classification examples, the neural network learns a complex function that maps input features to a probability distribution over different classes. For example, it could learn to predict the outcome of a soccer match, determining which team will win, lose, or draw, based on various attributes like team form, player statistics, historical performance, and home-field advantage.
3.1 Conceptual representation of an FFN¶
In our first class we implemented softmax regression. We did this for two reasons:
- We wanted to show you how multi-class classification worked.
- We wanted you to get comfortable stacking multiple layers of weight vectors on top of each other, to form matrices.
We are now going to take point (2) further, and stack several matrix operations after each other. In simplest terms, a feedforward neural network looks like this:
Here we have to define several attributes.
- $K$ is the number of attributes that your data has. In image analysis this might be one feature per pixel; if you are classifying flowers, it would be the four measurements of your sepal and petal.
- $N$ is your batch dimension: the number of data points you "feed" into your neural network in one training iteration.
- $Weights$ are your first and second weight matrices. You take the dot product between your input data (or the features from the previous layer) and a weight matrix.
- $Features$ are the values that you get after multiplying the input data with the first weight matrix. We refer to the numbers we calculate as features when they are not part of the output layer.
The forward pass in a deep neural network is when a batch of data is passed from left to right through the network. In our case we:
- Take the dot product of the input data $(N \times K)$ with the first weight matrix $(K \times F)$ to obtain the feature matrix $(N \times F)$. The dimension $F$ is also referred to as the number of neurons in the hidden layer.
- Multiply the feature matrix by the second weight matrix $(F \times C)$ to obtain the output (logits) matrix $(N \times C)$.
- Normalize the output matrix with the softmax to obtain probability predictions for each class.
The input data is typically referred to as the input layer, the dot product between the first weight matrix and the input data as the first hidden layer, and the dot product on the right as the output layer. If you add more layers in between, these are also referred to as hidden layers.
What is not shown in this figure is what happens between layers: typically there is an activation function, which keeps the values of the feature matrices within a certain range. Also omitted from this representation is the bias term; these would be vectors of $(F \times 1)$ and $(C \times 1)$, added to each feature. Because we will be calculating quite a few derivatives, the bias term has been left out to keep them as uncluttered as possible.
The cool thing about this drawing is that you have already implemented the operation on the right. The only thing we need to do is: add the operation on the left and figure out how to adapt stochastic gradient descent (SGD) to work on this simple neural network.
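As a quick sanity check of the shapes involved, here is a minimal sketch of these two matrix products in NumPy (the sizes chosen for $N$, $K$, $F$, and $C$ are arbitrary, just for illustration):
import numpy as np
N, K, F, C = 8, 4, 10, 3             # batch size, input features, hidden neurons, classes
X = np.random.randn(N, K)            # a batch of input data
W_hidden = np.random.randn(K, F)     # first weight matrix
W_output = np.random.randn(F, C)     # second weight matrix
features = X @ W_hidden              # (N, K) @ (K, F) -> (N, F)
logits = features @ W_output         # (N, F) @ (F, C) -> (N, C)
print(features.shape, logits.shape)  # (8, 10) (8, 3)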
Now that we know conceptually what a neural network does, let's dive into the math and derive the update equations. You will see that these make it easy to create a modular implementation of neural networks.
3.2 Deriving the Update Equations (Backpropagation)¶
Now that we have a conceptual understanding of a feedforward neural network, let's dive into the mathematics behind training it. We will use stochastic gradient descent (SGD) to update the weights of our network. To do this, we need to calculate the gradients of the loss function with respect to each weight matrix. At each operation we will also calculate the gradient with respect to the input of the layer and pass this down the network such that this can be used to update the network at that layer as well. This process is known as backpropagation.
We will consider a two-layer neural network with an input layer, one hidden layer, and an output layer as was shown in the figure above.
NOTE: As mentioned before, the bias term in these calculations has been omitted for simplicity.
Our network can be represented by the following equations:
Input to Hidden Layer: $$Z_{hidden} = X \cdot W_{hidden}$$
where $X$ is the input data ($N \times K$), $W_{hidden}$ is the weight matrix for the input layer ($K \times F$), and $Z_{hidden}$ is the pre-activation output of the hidden layer ($N \times F$).
Activation Function: $$A_{hidden} = \sigma(Z_{hidden})$$
where $\sigma$ is the sigmoid activation function (the same one we used in logistic regression!) and $A_{hidden}$ is the activated output of the hidden layer ($N \times F$).
Hidden to Output Layer: $$Z_{output} = A_{hidden} \cdot W_{output}$$
where $W_{output}$ is the weight matrix for the output layer ($F \times C$), and $Z_{output}$ is the pre-activation output of the output layer ($N \times C$).
Output Function: $$\hat{Y} = \text{softmax}(Z_{output})$$
where $\hat{Y}$ is the predicted probabilities for each class ($N \times C$).
We will use the cross-entropy loss function (over a batch of data, so we average over each data point), $\mathcal{L}_{CE}$. Recall from lesson 1 that it is defined as:
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C Y_{i,c} \log(\hat{Y}_{i,c})$$where $Y$ are the true labels ($N \times C$), in one-hot encoding.
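As a sketch, for a made-up batch of two data points this loss translates directly into NumPy:
Y = np.array([[1., 0., 0.], [0., 1., 0.]])            # one-hot labels, N=2, C=3
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # predicted probabilities
print(-np.mean(np.sum(Y * np.log(Y_hat), axis=1)))    # average cross-entropy, ~0.29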
Our goal is to find the gradients with respect to the output weights $\frac{\partial L}{\partial W_{output}}$ and with respect to the hidden weights $\frac{\partial L}{\partial W_{hidden}}$. We will use the chain rule to calculate these gradients, working backward from the loss function.
Gradient with Respect to $W_{output}$¶
To find $\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}}$, we apply the chain rule up until that point:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}} = \frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} \cdot \frac{\partial Z_{output}}{\partial W_{output}}$$We know that $$Z_{output} = A_{hidden} \cdot W_{output}$$
Therefore, $$\frac{\partial Z_{output}}{\partial W_{output}} = A_{hidden}^T$$
The derivative of the cross-entropy loss with respect to the pre-activation output of the softmax layer ($\frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}}$) is a result that we recall from previous derivations in lesson 1:
$$\frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} = \frac{1}{N} (\hat{Y} - Y)$$Therefore, the gradient with respect to $W_{output}$ is:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}} = \frac{1}{N} A_{hidden}^T \cdot (\hat{Y} - Y)$$
Because we are now working with batches, the cross entropy loss is averaged.
Gradient with Respect to $W_{hidden}$¶
To find $\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}}$, we apply the chain rule up to one level deeper in the network:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}} = \frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} \cdot \frac{\partial Z_{output}}{\partial A_{hidden}} \cdot \frac{\partial A_{hidden}}{\partial Z_{hidden}} \cdot \frac{\partial Z_{hidden}}{\partial W_{hidden}}$$Notice for later that the first factor of this chain rule is the same as when we calculated $\frac{\partial \mathcal{L}_{CE}}{\partial W_{output}}$. So we don't need to repeat ourselves and recalculate $\frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}}$ or the softmax derivative.
Instead of taking the derivative with respect to the weight matrix in the output layer (where $W_{output}$ lives), we are now of course taking the derivative with respect to the weights in the hidden layer. So we need to calculate:
$$\frac{\partial Z_{output}}{\partial A_{hidden}} = W_{output}^T$$Next, we need to add the derivative of the sigmoid, which we derived earlier:
$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$Therefore,
$$\frac{\partial A_{hidden}}{\partial Z_{hidden}} = \sigma'(Z_{hidden}) = A_{hidden} \odot (1 - A_{hidden})$$Where $\odot$ denotes the element-wise (Hadamard) product: you simply multiply each element in one matrix with the element at the corresponding index in the other.
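In NumPy, the Hadamard product is simply the * operator between two arrays of the same shape:
A = np.array([[1., 2.], [3., 4.]])
B = np.array([[10., 20.], [30., 40.]])
print(A * B)  # element-wise product [[10., 40.], [90., 160.]], not a dot product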
From $Z_{hidden} = X \cdot W_{hidden}$, we have:
$$\frac{\partial Z_{hidden}}{\partial W_{hidden}} = X^T$$
Combining these terms, the gradient with respect to $W_{hidden}$ is:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}} = \frac{1}{N} X^T \cdot \left( (\hat{Y} - Y) \cdot W_{output}^T \odot (A_{hidden} \odot (1 - A_{hidden})) \right)$$
Doing this by hand for every layer is quite overwhelming, and if you are overwhelmed I don't blame you; I was too the first time I had to derive this.
Fortunately, we can make our lives easier, because the neural network is essentially broken up into repeating modules. If we know the derivative of each component, we just need to fold it into the gradient as we move backward through the network. Hence the name: backpropagation.
Modular implementation¶
Look closely at the chain rules for $W_{output}$ and $W_{hidden}$: the first factors match, and the later ones differ. So we can pass this shared part back through the network and avoid recalculating it. If the network had more hidden layers, you would see even more repeating parts in your chain-rule derivative.
Let's explicitly derive the modular updates. Let the combined derivative of the loss and the softmax at the output layer be:
$$\delta_{output} = \frac{\partial \mathcal{L}_{CE}}{\partial Z_{output}} = \frac{1}{N} (\hat{Y} - Y)$$
To update $W_{output}$, we simply multiply it with $A_{hidden}^T$ and use our SGD update rule!
Now, we want to pass the error one layer back in the network, so that layers deeper in the network can also update their weights without having to recalculate the derivative of $\mathcal{L}_{CE}$. The layer that we are currently at passes the term $\delta_{output} \cdot W_{output}^T$, which we will refer to as $\delta_{hidden\_pre\_activation}$, back one level (to the sigmoid activation).
Then at the activation we multiply it element-wise with the derivative of the sigmoid. So what we get is:
$$\delta_{hidden} = \delta_{hidden\_pre\_activation} \odot (A_{hidden} \odot (1 - A_{hidden}))$$
Then, $\delta_{hidden}$ is passed back to the hidden layer. We know from above that the derivative of $Z_{hidden}$ with respect to $W_{hidden}$ is $X^T$. So, the gradient with respect to $W_{hidden}$ can be written modularly as:
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_{hidden}} = X^T \cdot \delta_{hidden}$$This modular form is much cleaner and easier to implement in code. It shows that the gradient for a weight matrix is the product of the transpose of the input to that layer and the error propagated back to the output of that layer's activation function.
These are the core update equations we will use to implement our feedforward neural network using NumPy. We will use these gradients to update the weights using SGD:
$W_{new} = W_{old} - \alpha \frac{\partial \mathcal{L}_{CE}}{\partial W}$
where, $\alpha$ is the learning rate.
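A good way to convince yourself these update equations are correct is a finite-difference gradient check: nudge a single weight, measure the change in the loss, and compare against the analytic gradient. Below is a minimal, self-contained sketch of this check for $W_{output}$, with its own throwaway sigmoid/softmax helpers and arbitrary dimensions (it is independent of the exercise code that follows):
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _softmax(z):
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def _loss(X, Y, W_h, W_o):
    A = _sigmoid(X @ W_h)
    return -np.mean(np.sum(Y * np.log(_softmax(A @ W_o)), axis=1))

rng = np.random.default_rng(0)
N, K, F, C = 5, 4, 6, 3
X = rng.normal(size=(N, K))
Y = np.eye(C)[rng.integers(0, C, size=N)]   # random one-hot labels
W_h = rng.normal(size=(K, F))
W_o = rng.normal(size=(F, C))

# Analytic gradient: (1/N) * A_hidden^T (Y_hat - Y)
A = _sigmoid(X @ W_h)
grad_analytic = A.T @ (_softmax(A @ W_o) - Y) / N

# Numerical estimate for a single entry of W_output
eps = 1e-6
W_plus, W_minus = W_o.copy(), W_o.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
grad_numeric = (_loss(X, Y, W_h, W_plus) - _loss(X, Y, W_h, W_minus)) / (2 * eps)
print(grad_analytic[0, 0], grad_numeric)    # the two numbers should agree closely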
Implementing a Feedforward Neural Network in NumPy¶
Now that we have had a look at the theory, we can start implementing the neural network. It may look intimidating, but it's much more straightforward than you think it is. Just a little tedious, but you'll be the better for it, I promise!
Do the exercises below. If you see hints in the code for later exercises, you may ignore these for now.
Exercise 3.1 : Forward Pass¶
In this exercise you are going to implement the forward pass of the FFN. Make sure to implement the indicated methods/attributes in the FFNLayer, Sigmoid, Softmax, and Cross Entropy classes.
HINT: If you want to test this, you can randomly initialize a matrix and check that the output dimensions match by simply running the module!
Exercise 3.2: Backward Pass¶
Each class mentioned in the previous exercise has a backpropagate method (and, in the case of the layer, an update_weights method). Implement these.
Exercise 3.3: Training Loop¶
Implement the training loop, see hints in the exercise!
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
from typing import Tuple
%matplotlib inline
#### Feedforward Layer
class FFNLayer():
def __init__(self, in_dims:int, out_dims: int, initialization: str = 'normal'):
self.in_dims = in_dims
self.out_dims = out_dims
# EXERCISE 3.1 Initialize weights here, just sample a matrix from a normal distribution
if initialization == 'normal':
self.W = ...
elif initialization == 'kaiming':
# EXERCISE 3.6. Implement Kaiming He weight initialization here!
std_dev = ...
self.W = ...
self._learning_rate = 1.0
@property
def learning_rate(self) -> float:
return self._learning_rate
@learning_rate.setter
def learning_rate(self, value: float) -> None:
self._learning_rate = value
def __call__(self, X: np.ndarray) -> np.ndarray:
# EXERCISE 3.1 Implement Forward pass here
self.X = ...
out = ...
return out
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
# EXERCISE 3.2 Implement Backward Pass Here
self.update_weights(delta)
# HINT: Don't forget to calculate the delta_next here,
# otherwise other layers won't be updated!
# Delta next is the derivative wrt the input of this network
delta_next = ...
return delta_next
def update_weights(self, delta: np.ndarray) -> None:
# EXERCISE 3.2 Implement weight update here!
update = ...
self.W = ...
### Activation functions
class Sigmoid():
def __init__(self):
self.A = None
def __call__(self, X) -> np.ndarray:
# EXERCISE 3.1 Implement forward pass here
self.A = ...
return self.A
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
# EXERCISE 3.2 Implement backprop here
return ...
class SoftMax():
def __init__(self):
self.X = None
def __call__(self, X: np.ndarray):
# Exercise 3.1 Implement softmax forward pass here!
return ...
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
# EXERCISE 3.2 Implement backprop here
# HINT: For now we assume the only loss function that we are using is the cross entropy!
# So no need to overcomplicate it!
return ...
class CrossEntropy():
def __init__(self, epsilon: float = 1e-8):
self.epsilon = epsilon
self.Y_true = None
def __call__(self, Y_True: np.ndarray, Y_pred: np.ndarray) -> np.ndarray:
# EXERCISE 3.1 Implement Forward pass/ error calculation here
self.Y_true = Y_True
return ...
def backpropagate(self, Y_pred: np.ndarray) -> np.ndarray:
# EXERCISE 3.2: Implement derivative of loss here
return ...
class ModuleList():
def __init__(self):
"""
A pytorch style module list,
Don't modify anything here!
"""
self.layers = []
def set_learning_rate(self, lr: float = 0.01):
for layer in self.layers:
if hasattr(layer, 'learning_rate'):
layer.learning_rate = lr
def set_training(self):
for layer in self.layers:
if hasattr(layer, 'training'):
layer.training = True
def set_evaluation(self):
for layer in self.layers:
if hasattr(layer, 'training'):
layer.training = False
def add(self, layer):
self.layers.append(layer)
def __call__(self, X: np.ndarray) -> np.ndarray:
# NOTE: This is how the forward pass will be called
# for all of the modules.
for layer in self.layers:
X = layer(X)
return X
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
for layer in reversed(self.layers):
delta = layer.backpropagate(delta)
return delta
class DataLoader():
def __init__(self, batch_size = 8, X: np.ndarray = None, y: np.ndarray = None, flatten: bool = False):
'''
Simple DataLoader, PyTorch style!
Don't modify anything here!
'''
self.X = X
if flatten:
self.X = X.reshape(X.shape[0], -1)
self.y = y
self.shuffle_data()
self.batch_size = batch_size
self.start_index = 0
self.end_index = self.batch_size
def shuffle_data(self):
indices = np.arange(self.X.shape[0])
np.random.shuffle(indices)
self.X = self.X[indices]
self.y = self.y[indices]
def __len__(self):
return len(self.X)
def __iter__(self):
return self
def __next__(self):
# reset indices and shuffle data
if self.end_index > self.X.shape[0]:
self.start_index = 0
self.end_index = self.batch_size
self.shuffle_data()
raise StopIteration
next_X = self.X[self.start_index:self.end_index]
next_y = self.y[self.start_index:self.end_index]
# Ensure indices are updated
self.start_index = self.end_index
self.end_index += self.batch_size
return next_X, next_y
class DataLoaderFactory():
def __init__(self, batch_size: int = 8) -> None:
'''
Simple Dataloader, PyTorch style!
Don't modify anything here!
'''
self.X, self.y = datasets.load_iris(return_X_y=True)
# normalize the features
self.X = (self.X - np.mean(self.X, axis=0)) / np.std(self.X, axis=0)
# split X in X_train and X_validate as well as y
self.X_train = self.X[:120]
self.X_validate = self.X[120:]
self.y_train_sparse = self.y[:120]
self.y_validate_sparse = self.y[120:]
# convert y_train and y_validate into one_hot matrix
self.y_train = np.zeros((self.y_train_sparse.shape[0], 3))
self.y_train[np.arange(self.y_train_sparse.shape[0]), self.y_train_sparse] = 1
self.y_validate = np.zeros((self.y_validate_sparse.shape[0], 3))
self.y_validate[np.arange(self.y_validate_sparse.shape[0]), self.y_validate_sparse] = 1
self.batch_size = batch_size
self.train_dataset = DataLoader(self.batch_size, self.X_train, self.y_train)
self.validation_dataset = DataLoader(self.batch_size, self.X_validate, self.y_validate)
self.len_train = len(self.train_dataset)
self.len_validate = len(self.validation_dataset)
def get_validation_dataset(self):
return self.validation_dataset
def get_train_dataset(self):
return self.train_dataset
# Standard neural network: This will be our baseline
module_list = ModuleList()
module_list.add(FFNLayer(4, 10))
module_list.add(Sigmoid())
module_list.add(FFNLayer(10, 3))
module_list.add(SoftMax())
loss_fn = CrossEntropy()
def train_model(module_list, loss_fn, epochs: int = 200, learning_rate: float = 0.01, dataloader_factory: DataLoaderFactory = DataLoaderFactory, batch_size: int = 8):
# set the learning rate
module_list.set_learning_rate(learning_rate)
# Get the dataloaders
loader_factory = dataloader_factory(batch_size = batch_size )
train_loader = loader_factory.get_train_dataset()
validation_loader = loader_factory.get_validation_dataset()
train_loss_history = []
validation_loss_history = []
for epoch in range(epochs):
avg_loss = 0
module_list.set_training()
for data, y_target in train_loader:
# Exercise 3.3: Implement the training update here!
# TODO: forward pass, get the predicted y
y_pred = ...
# TODO: calculate the cross-entropy loss! name it "loss"
loss = ...
# TODO: calculate the delta from the loss function!
delta = ...
# TODO: backprop, use the module list we have implemented before!
### END YOUR code
num_batches = len(train_loader) / train_loader.batch_size
avg_loss += loss / num_batches
# validate
avg_val_loss = 0
module_list.set_evaluation()
for data, y_target in validation_loader:
y_pred = module_list(data)
loss = loss_fn(y_target, y_pred)
num_batches = len(validation_loader) / validation_loader.batch_size
avg_val_loss += loss / num_batches
validation_loss_history.append(float(avg_val_loss))
print(f'Epoch: {epoch} \t average loss: {avg_loss} \t validation loss: {avg_val_loss}')
train_loss_history.append(float(avg_loss))
return train_loss_history, validation_loss_history
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 200, learning_rate = 0.01)
def plot_train_validation_losses(train_loss_history, validation_loss_history):
indices = np.arange(len(train_loss_history))
plt.plot(indices, train_loss_history, label='train loss')
plt.plot(indices, validation_loss_history, label='validation loss')
plt.legend()
plt.title('Train and Validation loss for your neural network')
plt.show()
plot_train_validation_losses(train_loss_history, validation_loss_history)
3.3 Activation functions¶
In the previous exercise, we used the sigmoid activation function. We are going to have a look at several different activation functions that are specifically designed for deeper neural networks.
Many activation functions have been designed for increasingly deep neural networks. In the early days, sigmoid (and tanh) activation functions were used in deep neural networks; these had several problems, however:
- Vanishing Gradients: This is the most significant problem. Sigmoid and Tanh saturate for large values, causing their gradients to become near zero. This severely hinders the learning of earlier layers in deep networks during backpropagation.
- Slow Convergence: Due to vanishing gradients, models using Sigmoid or Tanh can converge much more slowly, especially in deep architectures.
- Computational Cost: The exponential function in Sigmoid and Tanh is computationally more expensive than the alternatives.
- Non-Zero-Centered Output (Sigmoid): Sigmoid's output is always positive. This can lead to suboptimal gradient updates in subsequent layers.
Alternative activation functions that suffer less from these problems are ReLU, Leaky ReLU, and ELU. These are discussed next.
Rectified Linear Unit (ReLU) Activation Function¶
The Rectified Linear Unit (ReLU) is one of the most popular activation functions used in deep learning. It's computationally efficient and has helped address issues like the vanishing gradient problem that can occur with sigmoid and tanh functions.
The ReLU function is defined as:
$$ \text{ReLU}(z) = \max(0, z) $$This means that if the input $z$ is positive, the output is $z$. If the input $z$ is zero or negative, the output is 0. The derivative of the ReLU function is:
$$ \frac{d}{dz} \text{ReLU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases} $$This means the gradient is 1 for any positive input and 0 for any non-positive input. This simple derivative is one of the reasons ReLU is computationally efficient and helps combat the vanishing gradient problem.
Leaky Rectified Linear Unit (Leaky ReLU) Activation Function¶
The Leaky Rectified Linear Unit (Leaky ReLU) is another variation of the ReLU activation function designed to address the "dying ReLU" problem. While ReLU outputs zero for all negative inputs, which can cause neurons to become inactive and stop learning, Leaky ReLU allows a small, non-zero gradient when the input is negative.
The Leaky ReLU function is defined as:
$$ \text{Leaky ReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \le 0 \end{cases} $$where $\alpha$ is a small positive constant, typically around 0.01.
The derivative of the Leaky ReLU function is:
$$ \frac{d}{dz} \text{Leaky ReLU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \le 0 \end{cases} $$For positive inputs, the derivative is 1, just like ReLU. For non-positive inputs, the derivative is $\alpha$. This small non-zero gradient ensures that neurons can still learn even when their input is negative, preventing them from becoming permanently inactive.
Exponential Linear Unit (ELU) Activation Function¶
The Exponential Linear Unit (ELU) is another popular activation function that aims to combine the benefits of ReLU while addressing some of its drawbacks. It tends to produce negative outputs for negative inputs, which can help push the mean of activations closer to zero, potentially leading to faster learning.
The ELU function is defined as:
$$ \text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha (e^z - 1) & \text{if } z \le 0 \end{cases} $$where $\alpha$ is a hyperparameter, usually set to 1.
The derivative of the ELU function is:
$$ \frac{d}{dz} \text{ELU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha e^z & \text{if } z \le 0 \end{cases} $$For $z > 0$, the derivative is 1, similar to ReLU. For $z \le 0$, the derivative is $\alpha e^z$. This derivative is non-zero for negative inputs, which helps mitigate the "dying ReLU" problem (where neurons can become inactive and stop learning).
Run the cell below to see what these functions look like!
import numpy as np
import matplotlib.pyplot as plt
# Generate input values
z = np.linspace(-5, 5, 100)
# ReLU function
def relu(z):
return np.maximum(0, z)
# ELU function
def elu(z, alpha=1.0):
return np.where(z > 0, z, alpha * (np.exp(z) - 1))
# Leaky ReLU function
def leaky_relu(z, alpha=0.01):
return np.where(z > 0, z, alpha * z)
# Create subplots (2 rows, 2 columns)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Plot ReLU on the first subplot (top-left)
axes[0, 0].plot(z, relu(z), label='ReLU')
axes[0, 0].set_title('ReLU Activation Function')
axes[0, 0].set_xlabel('Input (z)')
axes[0, 0].set_ylabel('Output (ReLU(z))')
axes[0, 0].grid(True)
axes[0, 0].legend()
# Plot ELU with different alpha values on the second subplot (top-right)
alphas_elu = [0.5, 1.0, 2.0]
for alpha in alphas_elu:
axes[0, 1].plot(z, elu(z, alpha), label=f'ELU (α={alpha})')
axes[0, 1].set_title('ELU Activation Function with Different Alpha Values')
axes[0, 1].set_xlabel('Input (z)')
axes[0, 1].set_ylabel('Output (ELU(z))')
axes[0, 1].grid(True)
axes[0, 1].legend()
# Plot Leaky ReLU with different alpha values on the third subplot (bottom-left)
alphas_leaky_relu = [0.01, 0.1, 0.2]
for alpha in alphas_leaky_relu:
axes[1, 0].plot(z, leaky_relu(z, alpha), label=f'Leaky ReLU (α={alpha})')
axes[1, 0].set_title('Leaky ReLU Activation Function with Different Alpha Values')
axes[1, 0].set_xlabel('Input (z)')
axes[1, 0].set_ylabel('Output (Leaky ReLU(z))')
axes[1, 0].grid(True)
axes[1, 0].legend()
# Hide the fourth subplot as it's not used
fig.delaxes(axes[1, 1])
plt.tight_layout() # Adjust layout to prevent overlap
plt.show()
Exercise 3.4 Implementing ReLU, leaky-ReLU, and ELU¶
In the below class definitions, implement the forward and backward functions.
HINT: You can already see how the forward functions are implemented in the code cell above. Keep it a secret; don't tell your fellow classmates.
class ReLU():
def __init__(self):
self.X = None
def __call__(self, X: np.ndarray) -> np.ndarray:
## EXERCISE 3.4
self.X = X
return ...
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
## EXERCISE 3.4
# how do you implement relu here?
binarized_matrix = ...
X_out = ...
# think how to update the gradient!
return X_out
class LeakyReLU():
def __init__(self, alpha: float = 0.01):
self.alpha = alpha
self.X = None
def __call__(self, X: np.ndarray) -> np.ndarray:
## EXERCISE 3.4
self.X = X
X_out = ...
return X_out
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
## EXERCISE 3.4
binarized_matrix = ...
X_out = ...
return X_out
class ELU():
def __init__(self, alpha: float = 1.0):
self.alpha = alpha
self.X = None
def __call__(self, X: np.ndarray) -> np.ndarray:
## EXERCISE 3.4
self.X = X
X_out = ...
return X_out
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
## EXERCISE 3.4
binarized_matrix = ...
X_out = ...
return X_out
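Before running these on MNIST, you can sanity-check your derivatives with a finite-difference estimate. The sketch below uses standalone ELU helpers that mirror the formulas above (it does not touch the classes you just filled in):
def _elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def _elu_grad(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(z))

z = np.linspace(-3., 3., 13)
eps = 1e-6
numeric = (_elu(z + eps) - _elu(z - eps)) / (2 * eps)  # central difference
print(np.max(np.abs(numeric - _elu_grad(z))))          # should be tiny, around 1e-10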
Exercise 3.5: Run the network with the activation functions¶
Run the network, with the activation functions that you have just implemented, on the MNIST digit dataset. MNIST consists of small images of handwritten digits; in our case we flatten each image into a vector and feed it into our neural network.
Here is example data from MNIST:
Source: https://www.geeksforgeeks.org/machine-learning/mnist-dataset/
Try running the network for 20 epochs on Sigmoid, ReLU, LeakyReLU, and ELU activation functions.
HINT: Right now, ReLU, LeakyReLU, and ELU should all return NaNs. It's a sign you are on the right track, but not a guarantee. We will see next why this happens!
# Install the MNIST dataset
pip install mnist_datasets
Collecting mnist_datasets Downloading mnist_datasets-0.12-py3-none-any.whl.metadata (4.8 kB) Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from mnist_datasets) (2.0.2) Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from mnist_datasets) (4.67.1) Downloading mnist_datasets-0.12-py3-none-any.whl (6.7 kB) Installing collected packages: mnist_datasets Successfully installed mnist_datasets-0.12
from mnist_datasets import MNISTLoader
# class for the MNIST dataloader. DO NOT CHANGE THIS.
class MNISTDataLoaderFactory():
def __init__(self, batch_size: int = 8) -> None:
'''
Simple Dataloader, PyTorch style!
Don't modify anything here!
'''
train_X, train_y = MNISTLoader().load()
validate_X, validate_y = MNISTLoader().load(train = False )
# Calculate mean and standard deviation
mean_train = np.mean(train_X, axis = 0)
std_train = np.std(train_X)
mean_validate = np.mean(validate_X, axis=0)
std_validate = np.std(validate_X)
# normalize the features
train_X = (train_X - mean_train) / std_train
validate_X = (validate_X - mean_validate) / std_validate
# convert y_train and y_validate into one_hot matrix
train_y = np.array(train_y).astype(int)
validate_y = np.array(validate_y).astype(int)
train_y_one_hot = np.zeros((train_y.shape[0], 10))
train_y_one_hot[np.arange(train_y.shape[0]), train_y] = 1.
validate_y_one_hot = np.zeros((validate_y.shape[0], 10))
validate_y_one_hot[np.arange(validate_y.shape[0]), validate_y] = 1.
self.batch_size = batch_size
self.train_dataset = DataLoader(self.batch_size, train_X, train_y_one_hot, flatten = True)
self.validation_dataset = DataLoader(self.batch_size, validate_X, validate_y_one_hot, flatten = True)
self.len_train = len(self.train_dataset)
self.len_validate = len(self.validation_dataset)
def get_validation_dataset(self):
return self.validation_dataset
def get_train_dataset(self):
return self.train_dataset
def mnist_model(activation = 'sigmoid', initialization = 'normal') -> Tuple[ModuleList, CrossEntropy]:
'''
MNIST neural network model.
For exercise 3.5, only change the activation function.
For exercise 3.6, change the initialization after you have modified it in the FFNLayer class!
'''
if activation == 'sigmoid':
activation_fn = Sigmoid
elif activation == 'relu':
activation_fn = ReLU
elif activation == 'leaky_relu':
activation_fn = LeakyReLU
elif activation == 'elu':
activation_fn = ELU
else:
raise ValueError(f'Unknown activation function: {activation}, pick one from: sigmoid, relu, leaky_relu, elu')
module_list = ModuleList()
module_list.add(FFNLayer(784, 20, initialization = initialization))
module_list.add(activation_fn())
module_list.add(FFNLayer(20, 20, initialization = initialization))
module_list.add(activation_fn())
module_list.add(FFNLayer(20, 10, initialization = initialization))
module_list.add(SoftMax())
loss_fn = CrossEntropy()
return module_list, loss_fn
# module_list, loss_fn = mnist_model(activation = 'relu', initialization = 'kaiming')
# train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 10, learning_rate = 0.001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
So why the NaNs?¶
So what is going on; why are the losses turning into NaNs despite us using "advanced" activations that are supposed to solve vanishing gradients? While ReLU, ELU, and Leaky ReLU address vanishing gradients, they bring a different challenge: their positive side is unbounded. With standard normal initialization, the weights can cause the outputs of these activation functions (the features) to grow very large as data passes through the layers. This phenomenon is sometimes referred to as exploding activations or exploding values. When these large values reach the final layer and are fed into the softmax function, the exponentiation step (np.exp(X)) can produce numbers too large for your computer's floating-point representation, leading to overflow. This overflow produces inf (infinity), and subsequent calculations within the softmax or the cross-entropy loss (like inf / inf) result in NaN (Not a Number) values, causing training to fail.
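You can reproduce this failure mode in isolation (NumPy emits overflow warnings rather than raising):
big_logits = np.array([[1000.0, 0.0]])
exp = np.exp(big_logits)                     # overflow: [[inf, 1.]]
print(exp / exp.sum(axis=1, keepdims=True))  # [[nan, 0.]] because inf / inf is nan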
There are a variety of ways in which we can deal with it:
- Better Weight Initialization: Because each layer multiplies its inputs by the weights, the values keep compounding. So we want to initialize the weights in a smarter way.
- Numerically Stable Softmax: We could implement a numerically more stable version of the softmax (see the sketch after this list).
- Gradient Clipping: We could clip the gradients during backpropagation, such that they are only allowed to update the weights by so much each iteration.
- Smaller Learning Rate: Another way of restricting the size of the weight updates each iteration.
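For reference, here is a minimal sketch of option (2), the numerically stable softmax: subtracting the row-wise maximum before exponentiating leaves the output mathematically unchanged but keeps every exponent non-positive.
def stable_softmax(X):
    shifted = X - np.max(X, axis=1, keepdims=True)  # largest exponent is now 0
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

print(stable_softmax(np.array([[1000.0, 0.0]])))    # [[1., 0.]] instead of NaNs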
We will implement a technique that addresses the first point: better weight initialization. This technique is described below!
Weight Initialization: Kaiming (He)¶
Introduced by Kaiming He et al. in 2015, Kaiming initialization was specifically designed for layers with ReLU and its variants (like Leaky ReLU). ReLU sets negative inputs to zero, so its output is not zero-centered, which changes the variance of the activations in a way that earlier initialization schemes (designed around Sigmoid and Tanh) do not account for.
Kaiming initialization accounts for the fact that ReLU "kills" half of the neurons (on average) by outputting zero for negative inputs. It aims to maintain the variance of the activations by scaling the weights appropriately.
The variance of the weights is typically set based on the number of input neurons (fan-in) or output neurons (fan-out), often using fan-in:
$$ \text{Var}(W) = \frac{2}{\text{fan_in}} $$or using fan-out:
$$ \text{Var}(W) = \frac{2}{\text{fan_out}} $$The weights are typically initialized by drawing from:
- A normal distribution with mean 0 and standard deviation $\sigma = \sqrt{\frac{2}{\text{fan\_in}}}$ (or $\sqrt{\frac{2}{\text{fan\_out}}}$): $$W \sim N\left(0, \frac{2}{\text{fan\_in}}\right)$$
- A uniform distribution in the range $[-\text{limit}, \text{limit}]$, where $\text{limit} = \sqrt{\frac{6}{\text{fan\_in}}}$ (or $\sqrt{\frac{6}{\text{fan\_out}}}$): $$ W \sim U\left[-\sqrt{\frac{6}{\text{fan\_in}}}, \sqrt{\frac{6}{\text{fan\_in}}}\right] $$
Kaiming initialization is generally the preferred method when using ReLU or Leaky ReLU activation functions, as it specifically addresses the change in variance caused by these activations.
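The effect is easy to demonstrate: push a random batch through a stack of ReLU layers and watch the scale of the activations. In this sketch the layer sizes are arbitrary, and 'kaiming' uses the fan-in standard deviation from above:
rng = np.random.default_rng(0)
X0 = rng.normal(size=(64, 100))
for name, scale in [('normal', 1.0), ('kaiming', np.sqrt(2.0 / 100))]:
    h = X0
    for _ in range(10):                  # ten 100 -> 100 ReLU layers
        W = rng.normal(size=(100, 100)) * scale
        h = np.maximum(0, h @ W)
    print(name, np.std(h))               # explodes for 'normal', stays near 1 for 'kaiming'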
Exercise 3.6: Implementing He initialization¶
We are going to implement He initialization in our FFNLayer class so we can actually use the activations that we have just implemented.
The only thing that you have to implement is the Kaiming (He) initialization in the `FFNLayer` class, using the fan-in method. See the comment in the class, and don't forget to rerun the notebook cell!
Once you have implemented this, swap the initialization parameter in the mnist_model function below to 'kaiming'. Then change the learning rate to 0.0001 and run the model with ReLU, ELU, and LeakyReLU for the number of epochs shown in the cells below.
What do you notice now?
module_list, loss_fn = mnist_model(activation = 'relu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 50, learning_rate = 0.0001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
module_list, loss_fn = mnist_model(activation = 'elu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 100, learning_rate = 0.0001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
module_list, loss_fn = mnist_model(activation = 'leaky_relu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 100, learning_rate = 0.0001, dataloader_factory = MNISTDataLoaderFactory)
plot_train_validation_losses(train_loss_history, validation_loss_history)
What you may have noticed is that training a neural network with one of the ReLU-family activation functions in combination with Kaiming initialization produces the best results.
We have worked on a shallow neural network so far. Another technique that helps when training bigger neural networks is to normalize the features between layers. We will discuss one method of doing this next.
3.4 Batch Normalization¶
Batch Normalization (BatchNorm) is a technique introduced to address challenges in training deep neural networks. Initially, it was hypothesized that BatchNorm primarily helps by reducing Internal Covariate Shift, the change in the distribution of network activations during training. However, more recent research, notably the paper "How Does Batch Normalization Help Optimization?" by Santurkar et al., has suggested that while BatchNorm might not eliminate Internal Covariate Shift, its main benefits stem from other factors that make the optimization process smoother and more efficient.
Instead of primarily reducing Internal Covariate Shift, Santurkar et al. argue that Batch Normalization helps by:
- Making the optimization landscape smoother: This allows gradients to be more predictable and less noisy, enabling the use of higher learning rates and faster convergence.
- Reducing the dependence of gradients on the scale of parameters: This makes the training less sensitive to the initial values of the weights.
Regardless of the exact mechanism, Batch Normalization has proven to be a highly effective technique for stabilizing and accelerating the training of deep neural networks.
For a mini-batch of size $m$, let $x_i$ be the input to a neuron in a layer for the $i$-th example in the batch. Batch Normalization performs the following transformation for each neuron's input:
Calculate the mini-batch mean: $$ \mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \quad (\text{BN.1}) $$ This is the average of the inputs for a single neuron across the entire mini-batch. In NumPy, you would calculate the mean along the batch dimension.
Calculate the mini-batch variance: $$\sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 \quad \quad (\text{BN.2})$$
This is the variance of the inputs for a single neuron across the mini-batch. In NumPy, you would calculate the variance along the batch dimension.
Normalize the input:
$$\hat{x_i} = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \quad \quad (\text{BN.3})$$Here, $\epsilon$ is a small constant added for numerical stability to prevent division by zero in case the mini-batch variance is zero. This step scales the inputs to have zero mean and unit variance.
Scale and Shift: $$ y_i = \gamma \hat{x}_i + \beta \quad \quad (\text{BN.4})$$ This is the final output of the Batch Normalization layer. $\gamma$ (gamma) and $\beta$ (beta) are learnable parameters of the Batch Normalization layer. $\gamma$ scales the normalized input, and $\beta$ shifts it. These parameters allow the network to learn to restore the original representation power if needed, by essentially learning to undo the normalization if that's optimal for the network.
During Training:
- The mini-batch mean $\mu_{\mathcal{B}}$ and variance $\sigma_{\mathcal{B}}^2$ are calculated for each mini-batch.
- The learnable parameters $\gamma$ and $\beta$ are updated using gradient descent.
- We also maintain a running average of the means and variances across all mini-batches to be used during inference.
During Inference (Testing):
- We use the accumulated global mean and variance (calculated as running averages during training) to normalize the inputs.
The formulas for updating the running average are typically:
$$ \text{running_mean} = \text{momentum} \cdot \text{running_mean} + (1 - \text{momentum}) \cdot \mu_{\mathcal{B}} \quad \quad (\text{BN.5})$$$$ \text{running_variance} = \text{momentum} \cdot \text{running_variance} + (1 - \text{momentum}) \cdot \sigma_{\mathcal{B}}^2 \quad \quad (\text{BN.6})$$where momentum is a hyperparameter (typically close to 1, e.g., 0.9 or 0.99).
Batch Normalization is a powerful technique that leads to:
- Faster Training: It allows for higher learning rates.
- Reduced Dependence on Initialization: The network becomes less sensitive to the initial values of the weights.
- Regularization Effect: It adds a slight regularization effect.
Implementing Batch Normalization involves both the forward pass (normalization and scaling/shifting) and the backward pass (calculating gradients with respect to the inputs, $\gamma$, and $\beta$).
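As a standalone illustration (separate from the exercise below), equations BN.1, BN.2, BN.3, and BN.5 translate almost one-to-one into NumPy for a random batch:
rng = np.random.default_rng(0)
Xb = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # batch of 32 samples, 4 features
mu = Xb.mean(axis=0)                               # (BN.1): one mean per feature
var = Xb.var(axis=0)                               # (BN.2): one variance per feature
X_hat = (Xb - mu) / np.sqrt(var + 1e-5)            # (BN.3): standardized batch
print(X_hat.mean(axis=0).round(6), X_hat.std(axis=0).round(3))  # ~0 and ~1 per feature
momentum, running_mean = 0.9, np.zeros(4)
running_mean = momentum * running_mean + (1 - momentum) * mu    # (BN.5)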
Gradients for Gamma and Beta in Batch Normalization¶
In Batch Normalization, $\gamma$ (gamma) and $\beta$ (beta) are learnable parameters that are updated during training using gradient descent. The update formulas depend on the gradients of the loss function with respect to these parameters ($\frac{\partial L}{\partial \gamma}$ and $\frac{\partial L }{\partial \beta}$)*.
Recall the output of the Batch Norm layer for a single example $i$ in a mini-batch: $y_i = \gamma \hat{x}_i + \beta$, where $\hat{x}_i$ is the normalized input.
Here's how to calculate the gradients:
Gradient with Respect to $\beta$:
The cross entropy loss depends on $y_i$, and $y_i$ depends linearly on $\beta$. Using the chain rule, the gradient of the loss with respect to $\beta$ is the sum of the gradients of the loss with respect to the outputs of the Batch Norm layer ($y_i$) across the mini-batch:
$$ \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial \beta} $$Since $\frac{\partial y_i}{\partial \beta} = 1$, this simplifies to: $$ \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} $$ In terms of the incoming delta ($\delta_{out}$) from the layer above (where $\delta_{out} = \frac{\partial L}{\partial y}$), the gradient is: $$ \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} (\delta_{out})_i \quad \quad (\text{BN.7})$$
Gradient with Respect to $\gamma$: Using the chain rule for $\gamma$: $$ \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial \gamma} $$
Since $y_i = \gamma \hat{x}_i + \beta$, $\frac{\partial y_i}{\partial \gamma} = \hat{x}_i$.
So the gradient is the sum of the element-wise product of the incoming delta and the normalized input ($\hat{x}_i$ ) across the mini-batch:
$$ \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \cdot \hat{x}_i $$In terms of the incoming delta ($\delta_{out}$) and the normalized input $\hat{x}_i$:
$$ \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} (\delta_{out})_i \cdot \hat{x}_i \quad \quad (\text{BN.8})$$
These gradients ($\frac{\partial L}{\partial \gamma}$ and $\frac{\partial L}{\partial \beta}$) are then used to update $\gamma$ and $\beta$ using the standard gradient descent rule:
$$ \gamma_{new} = \gamma_{old} - \alpha \frac{\partial L}{\partial \gamma} $$$$ \beta_{new} = \beta_{old} - \alpha \frac{\partial L}{\partial \beta} $$where $\alpha$ is the learning rate.
Exercise 3.7 Implementing Batch Normalization¶
For the next exercise we are going to implement batch normalization. You will implement the forward pass. In the backward pass you only need to implement the updates for gamma and beta; the delta (the part that is propagated further down the network) is already implemented. The equations that you need are indicated above as $\text{BN}.x$.
If you are curious about how these are derived you can check out this blog:
*In this derivation, $\mathcal{L}_{CE}$ is referred to as $L$, because LaTeX was having issues.
class BatchNormalization():
def __init__(self, momentum: float = 0.9, epsilon: float = 1e-5):
'''
Batch Normalization
'''
self.momentum = momentum
self.epsilon = epsilon
self.running_mean = None
self.running_var = None
self.X = None
self.gamma = None
self.beta = None
self._learning_rate = 1.0
self.training = True
@property
def learning_rate(self) -> float:
return self._learning_rate
@learning_rate.setter
def learning_rate(self, value: float) -> None:
self._learning_rate = value
def __call__(self, X: np.ndarray) -> np.ndarray:
self.X = X
self.batch_size = X.shape[0]
## EXERCISE 3.7 Implement the forward pass for batch normalization
# If there is no gamma or beta yet, initialize them using the shape
# of X: gamma as a vector of ones and beta as a vector of zeros
if self.gamma is None or self.beta is None:
self.gamma = ...
self.beta = ...
# variance and mean for each feature. (While training)
if self.training:
# TODO: calculate the mean and variance
self.mu = ... # (BN.1)
self.var = ... # (BN.2)
# update the running mean & running var
if self.running_mean is None or self.running_var is None:
# TODO: if there is no running mean/var yet,
# initialize them using the current mean and variance
self.running_mean = ...
self.running_var = ...
else:
# TODO: Implement the running mean and running variance
self.running_mean = ... # (BN.5)
self.running_var = ... # (BN.6)
# If not training, use the running mean and variance.
else:
self.mu = self.running_mean
self.var = self.running_var
# Normalize the input (don't forget to add epsilon)
# TODO: Calculate the normalization of the input
self.X_hat = ... # (BN.3)
# TODO: Scale and shift the input
out = ... # (BN.4)
return out
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
# Now we need to implement the derivative with respect to the input X
# which is pretty tricky at this stage. So that is provided :)
# NO NEED TO IMPLEMENT THE DELTA_OUT UPDATE!
constant = self.gamma / (self.batch_size * np.sqrt(self.var + self.epsilon))
first_term = self.batch_size * delta
second_term = np.sum(delta, axis = 0, keepdims=True) # sum over the batch dimension
third_term = self.X_hat * np.sum(delta * self.X_hat, axis = 0, keepdims = True)
delta_out = constant * (first_term - second_term - third_term)
# EXERCISE 3.7: Implement the backprop for gamma and beta here!
# NOTE: (BN.8) gives the gradient for gamma and (BN.7) the gradient for beta;
# apply the SGD update rule with the learning rate to each.
self.gamma = ... # use (BN.8)
self.beta = ... # use (BN.7)
return delta_out
Exercise 3.8: Testing Batch Normalization¶
Try out your implementation of batch normalization. Just run the cells below as they are. As you may notice, the batch size is 32 instead of 8. This is because batch normalization estimates its statistics over batches, so you need large enough batches to estimate the mean and variance reliably.
I got the model to train to about:
- average loss: 0.37
- validation loss: 2.68
def mnist_model_bn(activation = 'sigmoid', initialization = 'normal') -> Tuple[ModuleList, CrossEntropy]:
'''
MNIST neural network model.
'''
if activation == 'sigmoid':
activation_fn = Sigmoid
elif activation == 'relu':
activation_fn = ReLU
elif activation == 'leaky_relu':
activation_fn = LeakyReLU
elif activation == 'elu':
activation_fn = ELU
else:
raise ValueError(f'Unknown activation function: {activation}, pick one from: sigmoid, relu, leaky_relu, elu')
module_list = ModuleList()
module_list.add(FFNLayer(784, 100, initialization = initialization))
module_list.add(BatchNormalization())
module_list.add(activation_fn())
module_list.add(FFNLayer(100, 100, initialization = initialization))
module_list.add(BatchNormalization())
module_list.add(activation_fn())
module_list.add(FFNLayer(100, 100, initialization = initialization))
module_list.add(BatchNormalization())
module_list.add(activation_fn())
module_list.add(FFNLayer(100, 10, initialization = initialization))
module_list.add(SoftMax())
loss_fn = CrossEntropy()
return module_list, loss_fn
module_list, loss_fn = mnist_model_bn(activation = 'elu', initialization = 'kaiming')
train_loss_history, validation_loss_history = train_model(module_list, loss_fn, epochs = 100, learning_rate = 0.0005, dataloader_factory = MNISTDataLoaderFactory, batch_size = 32)
# plot_train_validation_losses(train_loss_history, validation_loss_history)
3.5 Other Normalization Techniques¶
While Batch Normalization is widely used and effective, its reliance on mini-batch statistics can be a limitation. Several other normalization techniques have been developed that address this by normalizing across different dimensions, making them less dependent on batch size:
- Layer Normalization: Normalizes across the features of each individual sample. This is particularly useful in recurrent neural network models.
- Instance Normalization: Normalizes across the spatial dimensions (height and width) for each sample and each feature channel independently. This has shown success in style transfer tasks.
- Group Normalization: Divides the channels into groups and normalizes the features within each group for each sample. This provides a balance between Batch Normalization and Layer Normalization and works well across a wide range of batch sizes.
These techniques offer alternatives to Batch Normalization when its assumptions or requirements are not met, providing more stable training and better performance in specific scenarios. Though we will not go in depth on these techniques in this tutorial, it is useful to know about them.
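The main difference between these variants is simply which axis the statistics are computed over. As a minimal sketch, a bare-bones layer normalization differs from the batch normalization above only in the axis argument:
Xn = np.random.randn(32, 100)  # 32 samples, 100 features
# Batch norm statistics: per feature, across the batch (axis=0)
bn = (Xn - Xn.mean(axis=0)) / np.sqrt(Xn.var(axis=0) + 1e-5)
# Layer norm statistics: per sample, across the features (axis=1)
ln = (Xn - Xn.mean(axis=1, keepdims=True)) / np.sqrt(Xn.var(axis=1, keepdims=True) + 1e-5)
print(bn.shape, ln.shape)      # same shapes; only the normalization axis differs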
Concluding remarks¶
We have done a lot in this tutorial. We've come to understand feedforward neural networks, one of the earliest neural network architectures and a critical component in many specialized architectures. We've also implemented activation functions, weight initialization, and batch normalization.
You may pat yourself on the back, many people that apply neural networks today have never implemented one from scratch in NumPy or seen how backpropagation works. Now you have!
As you may have noticed, our architecture for predicting MNIST classes was not optimized for images. In the next tutorial we will implement a convolutional neural network, which for many years was the go-to architecture for image-related tasks. Only in the last few years has a new architecture emerged that, under certain conditions, can challenge it.