Last lesson we implemented a feedforward neural network and a variety of techniques that we can use to make it work better. We also noticed that it did not work too well on image data. Hence, in this lesson we will explore a layer that is specialized in handling image data.
Important notes¶
- It is fine to consult with colleagues to solve the problems, in fact it is encouraged.
- Please turn off AI tools; we want you to internalize the concepts rather than quickly breeze through the problems. To turn off AI, click on the gear in the top right corner, go to AI assistance -> untick "Show AI powered inline completions", untick "Consented to use generative AI features", and tick "Hide Generative AI features".
Lesson 4: Convolutional Neural Networks (CNNs)¶
Convolutional Neural Networks (CNNs) are a type of neural network primarily used for computer vision tasks. Unsurprisingly, the main operation behind CNNs is the convolution operation. You may have heard of it if you have ever taken a signal processing class. It involves sliding a smaller matrix (a kernel) over a larger matrix (such as an image) and performing an element-wise multiplication and summation to obtain a new matrix of features. Below is an example of this operation in 2D:
GIF taken from: https://medium.com/nerd-for-tech/all-about-convolutions-331471b9e5e5
If you want to play around with the convolution operation, you can try out this visualization: https://kowshik24.github.io/convolution-visualizer/
Example: 1D convolution¶
To get familiar with the convolution operation we are going to discuss several applications. First, we have a simple example of a 1D convolution. The black line indicates a series of 0's with a bump of 1's, followed by 0's. This is our "data". Then we have a variety of kernels:
- A kernel of all 1's.
- A Gaussian kernel.
- A Canny Edge detection kernel, which detects abrupt changes in values.
The calculated value of the convolution (sliding the kernel and taking the element-wise product with the underlying part of the data, then summing) is shown in a different color. Run the cells below and move the slider from left to right to see what happens.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from matplotlib.widgets import Button
from ipywidgets import interactive
import ipywidgets as widgets
%matplotlib inline
# Create the data vector
data_vector = np.concatenate([np.zeros(20), np.ones(20), np.zeros(20)])
# Define the kernels
kernel_size = 5
# Kernel of ones
kernel_ones = np.ones(kernel_size)
# Gaussian kernel
gaussian_kernel = np.exp(-(np.arange(kernel_size) - 2)**2 / (2 * 2**2))
gaussian_kernel /= gaussian_kernel.sum() # Normalize the kernel
# 1D Canny edge detection kernel (simplified)
canny_kernel = np.array([-1, -1, 0, 1, 1])
kernels = {
'Ones': kernel_ones,
'Gaussian': gaussian_kernel,
'Canny Edge': canny_kernel
}
# Function to perform convolution
def convolve_1d(data, kernel):
output = np.convolve(data, kernel, mode='valid')
return output
def plot_convolution(index = 0):
ones = kernels['Ones']
gaussian = kernels['Gaussian']
canny = kernels['Canny Edge']
indices = [x for x in range(len(data_vector))]
# convolve the ones vector with
convolved_data_ones = convolve_1d(data_vector, ones)
convolved_data_gauss = convolve_1d(data_vector, gaussian)
convolved_data_canny = convolve_1d(data_vector, canny)
# three subplots next to each other
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# plot the data
axes[0].plot(indices, data_vector, color = 'black')
conv_ones_indices = [x for x in range(len(convolved_data_ones[:index - 2]))]
axes[0].plot(indices, data_vector, color = 'black')
axes[0].plot(conv_ones_indices, convolved_data_ones[:index - 2], color = 'orange')
axes[0].set_title('All 1s Kernel')
axes[0].set_xlabel('index')
axes[0].set_ylabel('value')
axes[1].plot(indices, data_vector, color = 'black')
axes[1].plot(conv_ones_indices, convolved_data_gauss[:index - 2], color = 'red')
axes[1].set_title('Gaussian')
axes[1].set_xlabel('index')
axes[1].set_ylabel('value')
axes[2].plot(indices, data_vector, color = 'black')
axes[2].plot(conv_ones_indices, convolved_data_canny[:index - 2], color = 'green')
axes[2].set_title('Canny Edge')
axes[2].set_xlabel('index')
axes[2].set_ylabel('value')
plt.show()
plt.close()
interactive_plot = interactive(plot_convolution, index=(2, len(data_vector) - 2))
output = interactive_plot.children[-1]
interactive_plot
interactive(children=(IntSlider(value=2, description='index', max=58, min=2), Output()), _dom_classes=('widget…
Example: 2D Convolution on an Image¶
Before deep learning, convolutions were widely used to process images, for example to introduce a blur or to detect edges. Run the example below to see what happens when you do this!
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter
from skimage.feature import canny
from skimage import data
# Load the astronaut image from PIL
try:
# img = Image.open(Image.get_image_resource('stinkbug.png'))
img = Image.fromarray(data.astronaut())
except:
# If stinkbug.png is not available, use a placeholder or another image
print("Stinkbug image not found. Using a placeholder.")
# Create a dummy image for demonstration
img = Image.fromarray(np.random.randint(0, 256, (200, 300, 3), dtype=np.uint8))
# Convert the image to grayscale for Canny edge detection
img_gray = img.convert('L')
img_gray_array = np.array(img_gray)
# Apply Gaussian blur
img_blurred = gaussian_filter(img, sigma=5)
# Apply Canny edge detection
# sigma is the standard deviation of the Gaussian filter used in Canny
edges = canny(img_gray_array, sigma=3)
# Create a figure with three columns
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Display the original image
axes[0].imshow(img)
axes[0].set_title('Original Image')
axes[0].axis('off') # Hide axes ticks
# Display the image with Gaussian blur
axes[1].imshow(img_blurred)
axes[1].set_title('Gaussian Blur')
axes[1].axis('off')
# Display the image with Canny edge detection
axes[2].imshow(edges, cmap='gray') # Use grayscale colormap for edges
axes[2].set_title('Canny Edge Detection')
axes[2].axis('off')
plt.tight_layout()
plt.show()
4.1 Combining Neural Networks and Convolutions¶
So how do we combine convolutions and neural networks? Remember that a convolution in image processing is basically sliding a 2D kernel over a data matrix. In CNNs, the kernels simply have learnable parameters. And just like in feedforward neural networks (which we covered last time), we put a whole bunch of these kernels together to create a layer, and then we stack multiple layers.
For an example see the image below (ignore the average pooling and flattening for now):
The Math behind CNNs¶
We denote an input image as $X$, which has height $H$, width $W$, and $C$ channels. During training you would use a batch of $N$ images, so the total size of a batch will be $N \times C \times H \times W$. This is called the channels-first format, which is used in the popular deep learning framework PyTorch. For now we will be working with a single image $X$, so we can ignore $N$.
The convolutional kernel (also referred to as a filter) is denoted $K$. It has dimensions $C \times k_h \times k_w \times O$, where $k_h$ and $k_w$ are the kernel height and width, $C$ is the number of channels in the image, and $O$ is the number of output channels (or filters). It is important to note that a convolutional layer does not have just one filter; it can have many. Hence, $O$ can be greater than 1.
In addition to the kernel, two other parameters have to be set: the stride ($s$) and the padding ($p$). The stride is the step size by which the filter is moved; with a stride of 1 the center of the filter moves by 1 pixel at a time, with a stride of 2 by 2 pixels, and so on. Padding can be seen as an extra border added to the image; we assume that it is applied consistently to the height and width of the image or feature map. Padding can be added if the output needs to have a certain size.
Output Dimensions of a Convolutional Layer¶
To calculate the output dimensions of the feature map ($F$) after applying a convolution to an image, you need the following formulae:
For the output height $H'$: $$ H' = \left\lfloor \frac{H - k_h + 2p}{s} \right\rfloor + 1 $$
For the output width $W'$: $$ W' = \left\lfloor \frac{W - k_w + 2p}{s} \right\rfloor + 1 $$
where:
- $H$ is the input height
- $W$ is the input width
- $k_h$ is the kernel height
- $k_w$ is the kernel width
- $p$ is the padding (assuming the same padding is applied to all sides)
- $s$ is the stride
- $\lfloor \cdot \rfloor$ denotes the floor operation (rounding down to the nearest integer)
The output tensor $F$ will have dimensions $N \times O \times H' \times W'$, where $H'$ and $W'$ depend on the stride ($s$) and padding ($p$) used in the convolution, and $O$ is the number of convolutional kernels that you have. Assume below that the formula is 1-indexed.
$$F_{i, j, o} = (X \ast K_o)(i, j) = \sum_{c = 1}^C\sum_{x = 1}^{k_h} \sum_{y = 1}^{k_w} K_{c, x, y, o} \cdot X_{c, i + x, j + y} + B_o $$ In this formula, the value of the output feature map $F_o$ is calculated at index $(i, j)$ using the kernel $K_o$ and a bias $B_o$ that is unique to each output feature map. The operation has to calculate this value for each of the $O$ output feature maps at each of the indices.
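To make the formula concrete, here is a small, deliberately naive sketch (not part of the exercises) that evaluates these sums with explicit loops, using 0-based indexing, stride 1, and no padding. The array names and sizes are chosen purely for illustration.

```python
import numpy as np

# Toy sizes: C input channels, O output filters, a small image and kernel.
C, O, H, W, k_h, k_w = 3, 2, 8, 8, 3, 3
X = np.random.randn(C, H, W)            # input image
K = np.random.randn(C, k_h, k_w, O)     # one (C x k_h x k_w) kernel per output channel
B = np.random.randn(O)                  # one bias per output feature map

# Output dimensions from the formulae above, with p = 0 and s = 1.
H_out = (H - k_h) // 1 + 1
W_out = (W - k_w) // 1 + 1
F = np.zeros((O, H_out, W_out))

# The sums over channels and kernel positions are done by np.sum over a patch.
for o in range(O):
    for i in range(H_out):
        for j in range(W_out):
            F[o, i, j] = np.sum(K[:, :, :, o] * X[:, i:i + k_h, j:j + k_w]) + B[o]

print(F.shape)  # (2, 6, 6)
```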
Doing this with nested loops is very computationally expensive. Therefore, the convolution operation is often implemented using optimized matrix multiplication routines. This is what we will talk about next.
Implementing Convolutions as Matrix Multiplications¶
In the 1990s, a highly optimized matrix multiplication routine called GEneral Matrix Multiplication (GEMM) was developed as part of the BLAS library. This routine was highly optimized for CPUs, but not for GPUs, so NVIDIA later developed libraries such as cuDNN to bring similarly optimized primitives to its hardware.
In order to transform our relatively slow sliding-window convolution into a faster alternative, we need to change it into a matrix multiplication. But how do we do this? Let's take a single convolutional kernel; as before, this kernel has dimensions $k_h \times k_w \times c$. The first thing we want to do is flatten it into a vector. If we had multiple kernels, we would stack these vectors into a matrix. For each "patch" in our image we do the same thing. What will be the dimensions of the flattened kernel and of the patch matrix? The flattened kernel has dimensions $k_h \cdot k_w \cdot c \times 1$, i.e. you multiply the kernel height, kernel width, and number of channels with each other; the 1 is the number of kernels ($O$) that we have. The patch matrix has shape number of patches by length of a flattened kernel, i.e. $H'\cdot W'\times k_h \cdot k_w \cdot c$.
It really is easier to visualize this, check out the video below!
import ipywidgets as widgets
from IPython.display import display
import os
def load_video(filename, remote_url=None):
"""Load video from local or remote source"""
try:
# Check if file exists locally
if os.path.exists(filename):
return widgets.Video.from_file(filename)
# Download from remote if URL provided
if remote_url:
import urllib.request
print(f"Downloading {filename}...")
urllib.request.urlretrieve(remote_url, filename)
print("✓ Download complete")
return widgets.Video.from_file(filename)
raise FileNotFoundError(f"{filename} not found")
except Exception as e:
print(f"⚠ Could not load video: {e}")
return None
# Use it
video = load_video('GEMM_trick.mp4',
remote_url='https://raw.githubusercontent.com/RiaanZoetmulder/deep_learning_course/main/lecture_4/GEMM_trick.mp4')
if video:
display(video)
Downloading GEMM_trick.mp4... ✓ Download complete
So how do we implement this?
There is a neat implementation that uses for-loops and gets very low level. If you are interested, you can read about it here. This is a bit too low level for our purposes, so we are going to use a little-used function in numpy called numpy.lib.stride_tricks.as_strided. Have a look at the documentation. We have to specify the shape of the new array that we want to make, using the output shape formulae that we defined before. Then we need to specify how many bytes a step in each direction should be; we can use the np.ndarray.strides attribute for this. We apply the numpy.lib.stride_tricks.as_strided function and flatten the result.
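Before tackling image patches, here is a tiny sketch (with names and sizes chosen just for illustration) that uses as_strided to build sliding windows over a 1D array, so you can see how the shape and strides arguments work together:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(8)                              # [0 1 2 3 4 5 6 7]
window, stride = 3, 2
n_windows = (len(x) - window) // stride + 1   # same formula as the output width

# shape: one row per window, one column per element inside the window.
# strides: moving to the next window jumps `stride` elements, moving inside a
# window jumps 1 element; both are expressed in bytes via x.strides[0].
windows = as_strided(
    x,
    shape=(n_windows, window),
    strides=(x.strides[0] * stride, x.strides[0]),
    writeable=False,
)
print(windows)
# [[0 1 2]
#  [2 3 4]
#  [4 5 6]]
```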
The im2col function implemented below shows how this is done! Have a close look at it, we will be implementing parts of the forward pass for a convolutional neural network later!
import numpy as np
from numpy.lib.stride_tricks import as_strided
from scipy.ndimage import gaussian_filter
from PIL import Image
from skimage import data
import matplotlib
import matplotlib.pyplot as plt
from google.colab import files
def generate_2d_gaussian_kernel(kernel_channels, kernel_height, kernel_width, sigma):
'''
Gaussian 2D kernel helper function, produces a 2D Gaussian kernel with the specified parameters.
NOTE: You may ignore this for now.
'''
# Create an empty kernel with the desired shape
gaussian_kernel_3d = np.zeros((kernel_channels, kernel_height, kernel_width))
# Generate a 2D Gaussian kernel
gaussian_kernel_2d = np.fromfunction(
lambda x, y: np.exp(-((x - (kernel_height-1)/2)**2 + (y - (kernel_width-1)/2)**2) / (2 * sigma**2)),
(kernel_height, kernel_width)
)
# Normalize the 2D kernel
gaussian_kernel_2d /= np.sum(gaussian_kernel_2d)
# Stack the 2D kernel to create a 3D kernel with the specified number of channels
for c in range(kernel_channels):
gaussian_kernel_3d[c, :, :] = gaussian_kernel_2d
return gaussian_kernel_3d
def im2col(image, kernel_size, stride = 1):
'''
image 2 column function.
Operates on a single image (so no batches)
NOTE: Have a look at this function.
Parameters
----------
image: np.ndarray
image array
kernel_size: Tuple
size of the kernel
stride : int
stride that we are using to move our convolution over the image
'''
C, H, W = image.shape
k_c, k_h, k_w = kernel_size
assert k_c == C, 'Kernel channels and image channels mismatch.'
# calculate the out dimensions [using the above formulae for the shape after the convolution]
out_h = (H - k_h) // stride + 1
out_w = (W - k_w) // stride + 1
# define the output shape, before flattening: [H', W', K_c, K_h, K_w]
out_shape = (
out_h, # 28
out_w, # 28
k_c, # 3
k_h, # 5
k_w # 5
)
# get the number of bytes per stride
x_stride_bytes = image.strides[1] # Stride over height dimension
y_stride_bytes = image.strides[2] # Stride over width dimension
channel_stride_bytes = image.strides[0] # Stride over channel dimension
# define the strides
strides = (
x_stride_bytes * stride, # Stride for output height
y_stride_bytes * stride, # Stride for output width
channel_stride_bytes, # Stride for kernel channel
x_stride_bytes, # Stride for kernel height
y_stride_bytes # Stride for kernel width
)
# apply the as stride function
patches = as_strided(image, shape = out_shape, strides = strides, writeable=False)
# return the flattened patches array
return patches.reshape(-1, k_h* k_w*k_c).T, out_h, out_w
# downsample the astronaut image to 3 x 32 x 32
astronaut_img = Image.fromarray(data.astronaut())
astronaut_img = astronaut_img.resize((32, 32))
astronaut_img = np.array(astronaut_img)
image = astronaut_img.transpose(2, 0, 1)
# Define kernel channels before the if/elif block
kernel_channels = 3
kernel_height = 5
kernel_width = 5
padding = 0
stride = 1
sigma = 1.0
# Standard deviation for the Gaussian filter
kernel = generate_2d_gaussian_kernel(kernel_channels, kernel_height, kernel_width, sigma)
kernel_shape = (kernel_channels, kernel_height, kernel_width)
# this will result in a 75 x 784 matrix. Each patch flattened out has 75 elements, and we have
# 784 positions as we convolve our kernel over the image.
reshaped_image_patches, out_h, out_w = im2col(image, kernel_size = kernel_shape, stride = 1)
flattened_kernel = kernel.reshape(-1, 5*5*3)
# Multiply the flattened_kernel with the reshaped_image_patches.
# The result is a 1 x 784 row vector: one (Gaussian-weighted) value per patch.
feature_map = np.dot(flattened_kernel, reshaped_image_patches)
# transform the image back!
# Next we are going to transform it back to a 1 x H' x W' tensor.
feature_map = feature_map.reshape(1, out_h, out_w)
# Squeeze the feature map to remove the dimension of size 1
plt.imshow(feature_map.squeeze(), cmap='gray') # Use grayscale colormap
plt.show()
Exercise 4.1: Implementing the forward pass of a convolutional neural network.¶
We've just seen how a convolution can be implemented via matrix multiplication for a single image, using a single filter. Of course, we want to generalize this to a batch of images and a larger number of filters. This is what we will be doing next.
When you are done implementing your Conv2D layer, test it using the code at the bottom of the cell. Do a single forward pass and check the shape of the output.
import numpy as np
from numpy.lib.stride_tricks import as_strided
from typing import Tuple, Union
class Conv2D():
"""
A custom implementation of a 2D Convolutional Layer.
This class implements the forward and backward passes of a convolutional
layer using the im2col (image to column) approach, which transforms the
convolution operation into a matrix multiplication for efficiency.
It also includes methods for weight initialization and updating during training.
"""
def __init__(self, in_channels: int, out_channels: int, kernel_size: Union[int, Tuple[int, int]], stride: Union[int, Tuple[int, int]] = 1, padding: Union[int, Tuple[int, int]] = 0):
"""
Initializes the Conv2D layer.
Parameters
----------
in_channels : int
Number of input channels (e.g., 3 for RGB images).
out_channels : int
Number of output channels (number of filters/kernels).
kernel_size : int or tuple
Size of the convolutional kernel (height, width). If an int,
the kernel will be square (size, size).
stride : int or tuple, optional
Stride of the convolution (height, width). If an int, the stride
will be the same for both dimensions. Defaults to 1.
padding : int or tuple, optional
Padding to apply to the input image (height, width). If an int,
the same padding is applied to all sides. Defaults to 0.
"""
self.in_channels = in_channels
self.out_channels = out_channels
# Ensure kernel_size is a tuple (height, width)
if isinstance(kernel_size, int):
self.kernel_size: Tuple[int, int] = (kernel_size, kernel_size)
else:
self.kernel_size: Tuple[int, int] = kernel_size
if isinstance(stride, int):
self.stride: Tuple[int, int] = (stride, stride)
else:
self.stride: Tuple[int, int] = stride
if isinstance(padding, int):
self.padding: Tuple[int, int] = (padding, padding)
else:
self.padding: Tuple[int, int] = padding
self.im2col_matrix: np.ndarray = None
self.learning_rate: float = 0.1 # Example learning rate
self.gradient_clip_threshold: float = 1.0 # Threshold for gradient clipping
# Initialize weights (kernels) and biases
# Kernel shape: (out_channels, in_channels, kernel_height, kernel_width)
std_dev = np.sqrt(2. / (self.in_channels * self.kernel_size[0] * self.kernel_size[1]))
self.convolutional_kernel: np.ndarray = np.random.randn(out_channels, in_channels, self.kernel_size[0], self.kernel_size[1]) * std_dev
# Bias shape: (out_channels,)
self.bias: np.ndarray = np.random.randn(out_channels) * 0.01 # Added scaling for better initialization
self.x_shape: Tuple[int, int, int, int] = None # Store the shape of the input X
def __call__(self, X: np.ndarray) -> np.ndarray:
"""
EXERCISES HERE !!!
Performs the forward pass of the convolutional layer.
Applies padding, reshapes the input using im2col, performs matrix
multiplication with the flattened kernels, reshapes the output
to the feature map shape, and adds the bias.
Parameters
----------
X : np.ndarray
Input tensor to the convolutional layer. Shape (Batch, Channels, Height, Width).
Returns
-------
np.ndarray
Output feature map after convolution. Shape (Batch, out_channels, H_out, W_out).
"""
# X shape: (Batch, Channels, Height, Width)
self.x_shape = X.shape
B, C, H, W = self.x_shape
k_h, k_w = self.kernel_size
p_h, p_w = self.padding
s_h, s_w = self.stride
# EXERCISE 4.1 a) Apply padding
if p_h > 0 or p_w > 0:
# Use np.pad to add padding to the last two dimensions (height and width)
X_padded = None # TODO: Implement this
else:
X_padded = X
# Get padded dimensions
Hp, Wp = X_padded.shape[2], X_padded.shape[3]
# Calculate output dimensions
# EXERCISE 4.1 b) Calculate the output dimensions (think about how the above formula changes when you already padded your image)
out_h = None # TODO: Implement this
out_w = None # TODO: Implement this
# EXERCISE 4.1 c) Calculate the total number of patches (this is the same size as the output feature map for a single convolution)
num_patches = None # TODO: Implement this
# Calculate the size of each patch
patch_size = C * k_h * k_w # This C is the input channels to this layer
# Define the shape for as_strided (Batch, output_height, output_width, in_channels, kernel_height, kernel_width)
# This is the shape of the output before flattening for matrix multiplication.
as_strided_shape = (
B,
out_h,
out_w,
C,
k_h,
k_w
)
# EXERCISE 4.1 d) Implement the strides (in bytes in each direction)
# These strides define how many bytes to move in the underlying data
# to get to the next element along each dimension of the as_strided_shape.
self.B_stride, self.C_stride, self.H_stride, self.W_stride = X_padded.strides
strides = (
# TODO: Implement this
None, # Stride for batch dimension
None, # Stride for output height (move stride steps down in padded image)
None, # Stride for output width (move stride steps right in padded image)
None, # Stride for input channel dimension within a patch
# DO NOT CHANGE: These are the strides within a single kernel window.
self.H_stride, # Stride for kernel height dimension within a patch
self.W_stride # Stride for kernel width dimension within a patch
)
# Use as_strided to get the patches view
# The view is (Batch, out_h, out_w, in_channels, k_h, k_w)
patches_view = as_strided(X_padded, shape=as_strided_shape, strides=strides, writeable=False)
# Reshape the patches view for matrix multiplication (im2col)
# Reshape to (Batch * out_h * out_w, C * k_h * k_w)
# Each row is a flattened patch, ready for multiplication with the flattened kernels.
self.im2col_matrix = im2col_matrix = patches_view.reshape(num_patches, patch_size)
# Reshape to (out_channels, in_channels * k_h * k_w)
# This is the flattened kernel matrix, ready for multiplication.
flattened_kernel = self.convolutional_kernel.reshape(self.out_channels, patch_size)
# EXERCISE 4.1 e) Perform matrix multiplication
# The result is the flattened output feature map (out_channels, Batch * out_h * out_w)
output_linear = None # TODO: Implement this
# Reshape the output back to the feature map shape
# Reshape from (out_channels, Batch * out_h * W_out) to (out_channels, Batch, out_h, W_out),
# then transpose to (Batch, out_channels, out_h, W_out) to match standard tensor layout.
output_feature_map = output_linear.reshape(self.out_channels, B, out_h, out_w).transpose(1, 0, 2, 3)
# Add bias
# Bias shape is (out_channels,). Reshape to (1, out_channels, 1, 1) for broadcasting
# across the batch, height, and width dimensions of the output feature map.
output_feature_map += self.bias.reshape(1, self.out_channels, 1, 1)
return output_feature_map
def update_weights(self):
"""
Updates the convolutional kernel and bias weights using calculated gradients
and the learning rate, with gradient clipping.
"""
# Apply gradient clipping to weights
weight_grad_norm = np.linalg.norm(self.gradients_W_tensor)
if weight_grad_norm > self.gradient_clip_threshold:
self.gradients_W_tensor = self.gradients_W_tensor * (self.gradient_clip_threshold / weight_grad_norm)
# Apply gradient clipping to biases
bias_grad_norm = np.linalg.norm(self.gradients_b)
if bias_grad_norm > self.gradient_clip_threshold:
self.gradients_b = self.gradients_b * (self.gradient_clip_threshold / bias_grad_norm)
self.convolutional_kernel -= self.gradients_W_tensor * self.learning_rate
self.bias -= self.gradients_b * self.learning_rate
def backpropagate(self, delta: np.ndarray) -> np.ndarray:
"""
NO IMPLEMENTATION HERE
Performs the backward pass of the convolutional layer.
Calculates gradients for the weights and biases, updates them, and computes
the gradient with respect to the input to pass back to the previous layer.
Parameters
----------
delta : np.ndarray
Gradient of the loss with respect to the output of this layer.
Shape (Batch, out_channels, H_out, W_out).
Returns
-------
np.ndarray
Gradient of the loss with respect to the input of this layer.
Shape (Batch, in_channels, H, W).
"""
B, O, H_out, W_out = delta.shape
num_patches, patch_size = self.im2col_matrix.shape
# GRADIENTS: for the weight matrix
delta_reshaped = delta.reshape(O, B*H_out*W_out) # flatten the batch, height and width dimension of the tensor
# Calculate gradients for weights by multiplying reshaped delta with im2col_matrix
gradients_W_matrix = np.dot(delta_reshaped, self.im2col_matrix) # out shape: (O, C * k_h * k_w)
self.gradients_W_tensor = gradients_W_matrix.reshape(self.out_channels, self.in_channels, self.kernel_size[0], self.kernel_size[1])
# GRADIENTS: for the bias
# Calculate gradients for bias by summing delta along batch, height, and width dimensions
self.gradients_b = np.sum(delta, axis=(0, 2, 3))
# BACKPROP: derivative wrt to the input.
# STEP 1: reshape & transpose filterbank: [O, C, k_h, k_w] to [C*k_h*k_w, O]
# Reshape the kernel to (C * k_h * k_w, out_channels) for multiplication with delta
reshaped_w_matrix = self.convolutional_kernel.reshape(self.out_channels, patch_size).T
# STEP 2: Multiplication: result should be: [C*k_h*k_w, B*H_out*W_out]
# Reshape delta to (Batch * H_out * W_out, out_channels) for dot product with transposed kernel
delta_reshaped_for_input_grad = delta.reshape(O, -1) # (O, B*H_out*W_out)
dX_column_format = np.dot(reshaped_w_matrix, delta_reshaped_for_input_grad)
# STEP 4: col2im operation
# Transform the input gradients from im2col format back to the original image shape
# Pass the stored original input shape (self.x_shape) to col2im_strided_vectorized
dX = col2im_strided(dX_column_format, self.x_shape, self.kernel_size, self.stride, self.padding)
# GRADIENTS: Update
self.update_weights()
return dX
def col2im_strided(col: np.ndarray,
input_shape: Tuple[int, int, int, int],
kernel_size: Tuple[int, int],
stride: Tuple[int, int] = (1, 1),
padding: Tuple[int, int]=(0, 0)) -> np.ndarray:
"""
NO IMPLEMENTATION HERE
Inverse of im2col that matches the layout produced by the col2im operation
in the backpropagation step. A possible upgrade could be vectorization.
im2col_batch → col shape: (C*k_h*k_w, B*OH*OW)
This function reconstructs (B, C, H, W).
Parameters
----------
col : np.ndarray
Shape (C*k_h*k_w, B*OH*OW)
input_shape : tuple
(B, C, H, W) -- original unpadded shape
kernel_size : tuple
(k_h, k_w)
stride : tuple
(s_h, s_w)
padding : tuple
(p_h, p_w)
Returns
-------
out : np.ndarray
Reconstructed tensor with shape (B, C, H, W)
"""
B, C, H, W = input_shape
k_h, k_w = kernel_size
s_h, s_w = stride
p_h, p_w = padding
Hp = H + 2 * p_h
Wp = W + 2 * p_w
OH = (Hp - k_h) // s_h + 1
OW = (Wp - k_w) // s_w + 1
# 1) Undo the im2col reshape/order:
# col.T: (B*OH*OW, C*k_h*k_w)
# reshape: (B, OH, OW, C, k_h, k_w)
# transpose: (B, C, k_h, k_w, OH, OW)
patches = col.T.reshape(B, OH, OW, C, k_h, k_w).transpose(0, 3, 4, 5, 1, 2)
# 2) Scatter-add into padded canvas [This part could be vectorized]
out_padded = np.zeros((B, C, Hp, Wp), dtype=col.dtype)
for i in range(k_h):
i_slice = slice(i, i + s_h * OH, s_h) # length OH
for j in range(k_w):
j_slice = slice(j, j + s_w * OW, s_w) # length OW
# patches[:, :, i, j, :, :] has shape (B, C, OH, OW)
out_padded[:, :, i_slice, j_slice] += patches[:, :, i, j, :, :]
# 3) Remove padding to return (B, C, H, W)
if p_h > 0 or p_w > 0:
out = out_padded[:, :, p_h:Hp - p_h, p_w:Wp - p_w]
else:
out = out_padded
return out
# PARAMETERS
padding = 2
stride = 2
kernel_size = 5
out_channels = 3
in_channels = 1
batch_size = 8
image = np.random.randn(batch_size, in_channels, 28, 28)
conv_layer = Conv2D(in_channels, out_channels, kernel_size, stride, padding)
out = conv_layer(image)
print(f'Shape of the input image is: {str(image.shape)} -> after applying Conv2D -> Shape of out is: {str(out.shape)}.')
print(f'Shape of the output image is: {"\n\tCorrect, continue to the next exercise!" if str(out.shape) == str((batch_size, out_channels, 14, 14)) else "incorrect"}')
Shape of the input image is: (8, 1, 28, 28) -> after applying Conv2D -> Shape of out is: (8, 3, 14, 14). Shape of the output image is: Correct, continue to the next exercise!
# Solution 4.2.3
batch_size = 12
in_channels = 2
image = np.random.randn(batch_size, in_channels, 64, 64)
### For exercise 4.2.3, it is handier if you make a quick implementation using your
### convolution layer!
print(f'Shape of the input image is: {str(image.shape)} -> after applying Conv2D -> Shape of out is: {str(out.shape)}.')
Shape of the input image is: (12, 2, 64, 64) -> after applying Conv2D -> Shape of out is: (12, 8, 19, 19).
Exercise 4.2: Questions¶
Here are some questions for you to answer:
If the input of a convolutional neural network has 8 channels, is the convolutional filter applied over all channels?
Is the number of filters in your convolutional layer equal to the number of output channels in the feature map that you calculate?
You are going to calculate the dimensions of the feature maps. Here is the necessary information:
- You have two convolutional layers, `Conv1` and `Conv2`. You input a batch of images into `Conv1` and extract a feature map. This feature map is then put into `Conv2`, from which you extract the `output`. Your task is to calculate the dimensions of the output.
- `Conv1` has a kernel size of 5, padding of 4, and a stride of 4. The output channels are 4.
- `Conv2` has a kernel size of 3, padding of 2, and a stride of 1. The output channels are 8.
- `input` has a batch size of 12, 2 channels, and a height and width of 64.
What are the `output` batch size, channels, height, and width?
Exercise 4.3: Run your Convolutional Neural Network¶
Below you will find the components needed to construct a convolutional neural network (make sure you run the cell directly below to install the dataset).
You will construct the following architecture in `create_conv_model()`:
- A `Conv2D` layer with kernel size: 5, out_channels: 8, padding: 2, stride: 2.
- A `Conv2D` layer with kernel size: 5, out_channels: 16, padding: 2, stride: 2.
- A `Conv2D` layer with kernel size: 5, out_channels: 32, padding: 0, stride: 2.
- A `Flatten` layer (to make sure we can actually use FFN layers next).
- A fully connected/feedforward layer (`FFNLayer`) with in_dims: 128, out_dims: 64, and initialization = 'kaiming'.
- A fully connected/feedforward layer (`FFNLayer`) with in_dims: 64, out_dims: 10, and initialization = 'kaiming'.
- A `SoftMax` activation.
Make sure to add a LeakyReLU with an alpha of 0.01 after each layer that is not the output layer!
When this is complete, run your code!
NOTE: Feel free to modify the number of epochs, this code is fairly slow. If you run it for 3 epochs and the loss is decreasing, you are good!
pip install mnist_datasets
Collecting mnist_datasets Downloading mnist_datasets-0.14-py3-none-any.whl.metadata (5.3 kB) Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from mnist_datasets) (2.0.2) Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from mnist_datasets) (4.67.1) Downloading mnist_datasets-0.14-py3-none-any.whl (7.0 kB) Installing collected packages: mnist_datasets Successfully installed mnist_datasets-0.14
import sys
import os
# Check if we're in Colab
try:
import google.colab
IN_COLAB = True
except ImportError:
IN_COLAB = False
if IN_COLAB:
# Download the components file from GitHub
!wget https://raw.githubusercontent.com/RiaanZoetmulder/deep_learning_course/main/lecture_4/cnn_components.py
print("Downloaded cnn_components.py from GitHub")
else:
# Assume local environment has the file
print("Running locally")
# Now you can import the classes from your file
from mnist_datasets import MNISTLoader
import numpy
try:
from cnn_components import FFNLayer, LeakyReLU, SoftMax, CrossEntropy, Flatten, DataLoader, MNISTDataLoaderFactory, ModuleList
print("Classes imported successfully!")
except ImportError as e:
print('Cannot import cnn_components.py, ensure that they are in the SAME folder as the jupyter notebook!')
--2026-01-05 09:56:59-- https://raw.githubusercontent.com/RiaanZoetmulder/deep_learning_course/main/lecture_4/cnn_components.py Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 14523 (14K) [text/plain] Saving to: ‘cnn_components.py’ cnn_components.py 0%[ ] 0 --.-KB/s cnn_components.py 100%[===================>] 14.18K --.-KB/s in 0.001s 2026-01-05 09:56:59 (9.83 MB/s) - ‘cnn_components.py’ saved [14523/14523] Downloaded cnn_components.py from GitHub Classes imported successfully!
def create_conv_model():
"""
IMPORTANT
---------
Implement your model here
"""
module_list = ModuleList()
# EXERCISE 4.3: Add the necessary components here
# module_list.add('YOUR OP HERE')
loss_fn = CrossEntropy()
return module_list, loss_fn
def train_model(module_list, loss_fn, epochs: int = 200, learning_rate: float = 0.01, dataloader_factory: MNISTDataLoaderFactory = MNISTDataLoaderFactory, batch_size: int = 8):
"""
Trains the convolutional neural network model.
Parameters
----------
module_list : ModuleList
The list of layers in the neural network.
loss_fn : object
The loss function to use for training (e.g., CrossEntropy).
epochs : int, optional
The number of training epochs. Defaults to 200.
learning_rate : float, optional
The learning rate for updating the model weights. Defaults to 0.01.
dataloader_factory : MNISTDataLoaderFactory, optional
The factory for creating data loaders. Defaults to MNISTDataLoaderFactory.
batch_size : int, optional
The size of the mini-batches for training. Defaults to 8.
Returns
-------
tuple
A tuple containing two lists: train_loss_history and validation_loss_history.
Each list contains the average loss per epoch for the training and
validation sets, respectively.
"""
# set the learning rate
module_list.set_learning_rate(learning_rate)
# Get the dataloaders
loader_factory = dataloader_factory(batch_size = batch_size, flatten = False)
train_loader = loader_factory.get_train_dataset()
validation_loader = loader_factory.get_validation_dataset()
train_loss_history = []
validation_loss_history = []
for epoch in range(epochs):
avg_loss = 0
module_list.set_training()
for data, y_target in train_loader:
# forward pass, get the predicted y
y_pred = module_list(data)
# calculate the entropy loss! name it "loss"
loss = loss_fn(y_target, y_pred)
# calculate the delta, from the loss function!
delta = loss_fn.backpropagate(y_pred)
# backprop!
module_list.backpropagate(delta)
num_batches = len(train_loader) / train_loader.batch_size
avg_loss += loss / num_batches
# validate
avg_val_loss = 0
module_list.set_evaluation()
for data, y_target in validation_loader:
y_pred = module_list(data)
loss = loss_fn(y_target, y_pred)
num_batches = len(validation_loader) / validation_loader.batch_size
avg_val_loss += loss / num_batches
validation_loss_history.append(float(avg_val_loss))
print(f'Epoch: {epoch} \t average train loss: {avg_loss} \t average validation loss: {avg_val_loss}')
train_loss_history.append(float(avg_loss))
return train_loss_history, validation_loss_history
def test_model(module_list, dataloader_factory: MNISTDataLoaderFactory = MNISTDataLoaderFactory, batch_size: int = 64):
"""
Test the accuracy of your model
"""
loader_factory = MNISTDataLoaderFactory(batch_size = batch_size, flatten = False)
validation_loader = loader_factory.get_validation_dataset()
accuracy_batches = []
for data, y_target in validation_loader:
y_pred = module_list(data)
# calculate accuracy
y_pred_class = np.argmax(y_pred, axis=1)
y_target_class = np.argmax(y_target, axis=1)
accuracy = np.mean(y_pred_class == y_target_class)
accuracy_batches.append(accuracy)
# print accuracy
print(f'Accuracy: {100.*np.mean(accuracy_batches)}')
# Run these models to train and test.
conv_model, loss_fn = create_conv_model()
train_loss_history, validation_loss_history = train_model(conv_model, loss_fn, epochs=10, learning_rate=0.001, dataloader_factory=MNISTDataLoaderFactory, batch_size=8)
test_model(conv_model, MNISTDataLoaderFactory, batch_size = 64)
Downloading MNIST files: 0%| | 0/4 [00:00<?, ?it/s]
Downloading train-images-idx3-ubyte.gz...
Downloading MNIST files: 50%|█████ | 2/4 [00:00<00:00, 2.41it/s]
Downloading train-labels-idx1-ubyte.gz... Downloading t10k-images-idx3-ubyte.gz...
Downloading MNIST files: 100%|██████████| 4/4 [00:01<00:00, 2.42it/s]
Downloading t10k-labels-idx1-ubyte.gz...
Downloading MNIST files: 100%|██████████| 4/4 [00:00<00:00, 23464.64it/s]
Epoch: 0 average train loss: 15.900976147676468 average validation loss: 11.498041380722466 Epoch: 1 average train loss: 9.196905169328195 average validation loss: 7.125798458129948 Epoch: 2 average train loss: 6.752099250472739 average validation loss: 5.7294880674145885 Epoch: 3 average train loss: 5.8012304524470535 average validation loss: 5.17205212414666 Epoch: 4 average train loss: 5.329610489246053 average validation loss: 4.722597126960939 Epoch: 5 average train loss: 4.9449339515697215 average validation loss: 4.467636974936229 Epoch: 6 average train loss: 4.709580608340185 average validation loss: 4.259066900517537 Epoch: 7 average train loss: 4.5035826473387806 average validation loss: 4.053243441465476 Epoch: 8 average train loss: 4.308376490855851 average validation loss: 4.010741479163866 Epoch: 9 average train loss: 4.2027612740165505 average validation loss: 3.8237715208034992
Downloading MNIST files: 100%|██████████| 4/4 [00:00<00:00, 17050.02it/s] Downloading MNIST files: 100%|██████████| 4/4 [00:00<00:00, 29852.70it/s]
Accuracy: 85.52684294871796
As you may have noticed, the neural network trains very slowly. This is why we do not (currently) run it for hundreds of epochs. Once we start moving things to the GPU, there will be ample time to do this. Next, we will discuss the properties of CNNs.
4.2 Properties of Convolutional Neural Networks¶
Translation Equivariance¶
Convolutional layers have several key properties that make them particularly effective for processing grid-like data such as images. One of the most important properties is Equivariance to Translation, which directly stems from the mathematical definition of the convolution operation.
Translation equivariance can be thought of as follows: if you shift all the pixels in an image by a certain amount, then all the features in the feature map produced by the convolutions are shifted by that same amount. This means that no matter where you put the object of interest in the image, downstream in the CNN the features will still contain the requisite information, just at a different location. A property this is often confused with is translation invariance, which would mean that no matter how you translate the pixels in your image, the output signal remains the same. It is important to remember that invariance and equivariance are not the same. For a more formal derivation of this property, see the appendix.
While convolutional layers are naturally equivariant to translation, they are generally not equivariant to other types of transformations such as rotation, scaling, or perspective changes. If an object in an image is rotated, scaled, or viewed from a different angle, the features detected by a standard convolutional kernel will likely change significantly, and the output feature map will not simply be a rotated, scaled, or perspectively transformed version of the original. Handling these other transformations often requires data augmentation (training the network on transformed versions of the input) or more complex network architectures designed to be invariant or equivariant to these specific transformations. A lot of work has gone into trying to make convolutional neural networks equivariant to other transformations.
Feature Reuse¶
A key aspect of convolutional neural networks (and deep neural networks in general) is the concept of feature reuse. In CNNs, early layers learn to detect simple features like edges, corners, and textures. These basic features are then passed as input to subsequent layers, which combine them to detect more complex patterns and objects. This hierarchical structure means that the features learned by earlier layers can be reused by multiple higher-level features. For example, an edge detector learned in a shallow layer can be used as a component for recognizing various shapes (like squares, triangles, or circles) in later layers. This reuse of learned features across different parts of the network makes CNNs highly efficient and allows them to learn complex representations with a relatively smaller number of parameters compared to networks that don't employ this principle.
Above: Convolutional Kernels plotted after training on a large scale image dataset. Taken from: ImageNet Classification with Deep Convolutional Neural Networks
Sparse Interactions¶
Convolutional Neural Networks leverage the concept of sparse interactions, which is a key difference from traditional fully connected networks. Instead of every output unit interacting with every input unit, each unit in a convolutional layer interacts with only a small, localized region of the input, defined by the kernel size. This means that the connections between layers are sparse. This sparsity significantly reduces the number of parameters in the model, making it more computationally efficient and less prone to overfitting, especially when dealing with high-dimensional input data like images. The benefits of sparse interactions include faster training times (if properly parallelized), reduced memory requirements, and improved generalization to unseen data due to the inherent regularization effect.
Receptive Fields¶
The receptive field of a unit in a convolutional neural network is the region in the input space that affects that unit's output. For units in the first convolutional layer, the receptive field is simply the size of the kernel. However, as you move deeper into the network, the receptive field of units in subsequent layers grows. This is because each unit in a deeper layer receives input from a larger area of the preceding layer's feature map, which in turn corresponds to an even larger area in the original input image. The increasing size of receptive fields as the network depth increases allows CNNs to capture hierarchical features, from simple edges and textures in early layers to more complex objects and patterns in deeper layers. Understanding the receptive field helps in designing CNN architectures and interpreting what different layers are learning.
Explaining the Receptive Field Visualization Conceptually¶
Imagine you're looking at a picture, but you can only see a small part of it through a tiny window. In a convolutional neural network, each "neuron" or processing unit in the first layer acts like that tiny window, looking at just a small patch of the original image (this small patch is determined by the Kernel Size). This is its receptive field.
Now, as the information from the image goes through more layers of the network (that's what the Number of Layers slider represents), the "window" for the neurons in those deeper layers gets bigger. Think of it like combining several tiny windows from the previous layer to make a larger window in the current layer.
The Stride tells you how big of a step the "window" takes as it slides across the image. A bigger stride means bigger steps, and this also affects how much of the original image a deeper layer "sees".
The Padding adds extra "blank space" around the edges of the image. This can help the network process the edges of the image more effectively, and it also influences the size of the receptive field.
The visualization you see shows a 256x256 white box, representing your image. The green box overlaid on it shows you the receptive field for a unit in the last layer of your simulated network, looking at the very center of the image.
By moving the sliders for Kernel Size, Stride, Padding, and Number of Layers, you're changing the "rules" of how the network processes the image. Watch how the green box changes size!
- Bigger Kernel Size: The initial window is larger, so the receptive field grows faster.
- Bigger Stride: The steps are larger, which also makes the receptive field grow faster.
- More Layers: Each new layer combines information from a wider area of the previous layer, making the receptive field significantly larger.
- Padding: Padding can influence the exact size and positioning of the receptive field, especially near the edges, although in this visualization, we're focusing on the center.
In short, the green box shows you how much of the original picture a single point in the final output of the network is "aware" of. Deeper layers have larger receptive fields because they have processed information from a wider area of the initial input.
The visualization is of course a simplification of a real scenario. There are many more operations in a neural network that we could account for (such as average pooling, max pooling, etc.). Moreover, we are assuming the stride and kernel size are constant throughout the network, which is unrealistic. However, the visualization helps build intuition, which is also useful when debugging a poorly performing CNN architecture. For a more formal explanation of the math in this visualization, see the appendix.
pip install opencv-python
Requirement already satisfied: opencv-python in /usr/local/lib/python3.12/dist-packages (4.12.0.88) Requirement already satisfied: numpy<2.3.0,>=2 in /usr/local/lib/python3.12/dist-packages (from opencv-python) (2.0.2)
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interactive
from PIL import Image
import cv2
from skimage import data
astronaut_img_array = data.astronaut()
astronaut_img = Image.fromarray(astronaut_img_array)
astronaut_img_resized = astronaut_img.resize((256, 256))
img_array = np.array(astronaut_img_resized)
print("Astronaut image loaded and resized to 256x256.")
def calculate_receptive_field(kernel_size, stride, padding, num_layers):
"""Calculates the receptive field size after a number of convolutional layers."""
rf = kernel_size
for i in range(1, num_layers):
# For this simplified visualization where kernel and stride are the same for all layers:
rf = rf + (kernel_size - 1) * (stride**(i))
return rf
def visualize_receptive_field(kernel_size, stride, padding, num_layers):
"""Visualizes the receptive field at the center of a 256x256 image."""
img_size = 256
# img = Image.new('RGB', (img_size, img_size), color = 'white')
# img_array = np.array(img)
# Calculate receptive field
rf_size = calculate_receptive_field(kernel_size, stride, padding, num_layers)
# Ensure receptive field does not exceed image size
rf_size = min(rf_size, img_size)
# Calculate the coordinates for the center of the image
center_x, center_y = img_size // 2, img_size // 2
# Calculate the top-left corner of the receptive field square
# Ensure the receptive field is centered as much as possible, handling boundary cases
start_x = max(0, center_x - rf_size // 2)
start_y = max(0, center_y - rf_size // 2)
# Calculate the bottom-right corner, ensuring it stays within bounds
end_x = min(img_size, start_x + rf_size)
end_y = min(img_size, start_y + rf_size)
# Create an overlay for the receptive field
overlay = np.zeros_like(img_array, dtype=np.uint8)
overlay[start_y:end_y, start_x:end_x, :] = [0, 255, 0] # Green color
# Blend the image and the overlay
alpha = 0.5
blended_img = cv2.addWeighted(img_array, 1 - alpha, overlay, alpha, 0)
plt.figure(figsize=(6, 6))
plt.imshow(blended_img)
plt.title(f'Receptive Field Visualization\nKernel: {kernel_size}x{kernel_size}, Stride: {stride}, Padding: {padding}, Layers: {num_layers}\nRF Size: {rf_size}x{rf_size}')
plt.axis('off')
plt.show()
# Create interactive widgets
kernel_size_slider = widgets.IntSlider(min=1, max=8, step=1, value=3, description='Kernel Size:')
stride_slider = widgets.IntSlider(min=1, max=5, step=1, value=1, description='Stride:')
padding_slider = widgets.IntSlider(min=0, max=5, step=1, value=0, description='Padding:')
num_layers_slider = widgets.IntSlider(min=1, max=5, step=1, value=1, description='Number of Layers:')
interactive_plot = interactive(visualize_receptive_field,
kernel_size=kernel_size_slider,
stride=stride_slider,
padding=padding_slider,
num_layers=num_layers_slider)
output = interactive_plot.children[-1]
interactive_plot
Astronaut image loaded and resized to 256x256.
interactive(children=(IntSlider(value=3, description='Kernel Size:', max=8, min=1), IntSlider(value=1, descrip…
4.3 Different types of layers in CNNs¶
We have discussed the basic convolutional layer and given you intuition as to how it works, how it learns, and what its properties are. Since the introduction of convolutional neural networks, they have been combined and tweaked in a variety of creative ways for different purposes. In this section we will discuss those purposes and show a few examples.
1x1 Convolution¶
A 1x1 convolution operates on each pixel across all channels independently. While it doesn't consider spatial neighborhoods like larger kernels, it's useful for several reasons:
- Channel Dimensionality Reduction/Expansion: It's used to reduce or increase the number of channels in a feature map. This can help manage the computational cost and the number of parameters in deeper layers.
- Equivalent to a Fully Connected Layer per Pixel (with Shared Weights): You can think of a 1x1 convolution as applying a small, fully connected network independently to each pixel location across all its channels. Each pixel has a vector of features across the channels, and the 1x1 convolution performs a matrix multiplication on this feature vector, transforming it into a new feature vector with a different number of elements (the output channels). The key distinction from a standard fully connected layer is that the same set of weights (the 1x1xC filters) are applied to every pixel location.
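As a quick illustration of the second point, here is a small numpy sketch (independent of the exercise code; all names and sizes are illustrative) showing that a 1x1 convolution is the same per-pixel matrix multiplication you would get from a fully connected layer with shared weights:

```python
import numpy as np

B, C_in, C_out, H, W = 2, 8, 4, 16, 16
X = np.random.randn(B, C_in, H, W)
K = np.random.randn(C_out, C_in)             # a 1x1 kernel is just a C_out x C_in matrix
bias = np.random.randn(C_out)

# Convolution view: apply the same weight matrix at every spatial location.
out_conv = np.einsum('oc,bchw->bohw', K, X) + bias.reshape(1, C_out, 1, 1)

# Fully connected view: treat every pixel as a C_in-dimensional feature vector.
X_flat = X.transpose(0, 2, 3, 1).reshape(-1, C_in)   # (B*H*W, C_in)
out_fc = (X_flat @ K.T + bias).reshape(B, H, W, C_out).transpose(0, 3, 1, 2)

print(np.allclose(out_conv, out_fc))                 # True: the two views are identical
```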
import ipywidgets as widgets
from IPython.display import display
import os
def load_video(filename, remote_url=None):
"""Load video from local or remote source"""
try:
# Check if file exists locally
if os.path.exists(filename):
return widgets.Video.from_file(filename)
# Download from remote if URL provided
if remote_url:
import urllib.request
print(f"Downloading {filename}...")
urllib.request.urlretrieve(remote_url, filename)
print("✓ Download complete")
return widgets.Video.from_file(filename)
raise FileNotFoundError(f"{filename} not found")
except Exception as e:
print(f"⚠ Could not load video: {e}")
return None
# Use it
video = load_video('pointwise_convolution.mp4',
remote_url='https://raw.githubusercontent.com/RiaanZoetmulder/deep_learning_course/main/lecture_4/pointwise_convolution.mp4')
if video:
display(video)
Downloading pointwise_convolution.mp4... ✓ Download complete
Transposed Convolutions¶
Transposed convolutions (also known as deconvolutions or fractional-strided convolutions) are not true inverses of convolutions, but they are used to upsample feature maps, effectively increasing their spatial dimensions. The core idea is to learn to reverse the process of a standard convolution. Instead of sliding a kernel over the input and producing a smaller output, a transposed convolution effectively inserts zeros (or other values) into the input and then applies a convolution-like operation to expand the spatial size.
Imagine taking each pixel of your input feature map and distributing its value to a larger area in the output, weighted by the kernel. This is conceptually similar to how a single pixel in a standard convolution contributes to multiple pixels in the output feature map when the kernel slides over it. By controlling the padding and stride, you can achieve different upsampling factors. Transposed convolutions are commonly used in tasks like image segmentation (where you need to generate an output mask the same size as the input image) and generative models (where you need to create larger images from smaller latent representations).
We will need transposed convolutions once we get to more exciting applications of neural networks, such as building encoder-decoder architectures.
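As a rough 1D sketch of the idea (a toy example, not how frameworks actually implement it), a transposed convolution with stride 2 can be mimicked by inserting zeros between the input values and then running an ordinary convolution:

```python
import numpy as np

x = np.array([1., 2., 3.])     # input feature map
k = np.array([1., 1., 1.])     # kernel of size 3
stride = 2

# Insert (stride - 1) zeros between the input values.
upsampled = np.zeros(len(x) + (len(x) - 1) * (stride - 1))
upsampled[::stride] = x        # [1, 0, 2, 0, 3]

# A 'full' convolution then spreads each input value over the output, weighted
# by the kernel. Output length: (len(x) - 1) * stride + len(k) = 7.
y = np.convolve(upsampled, k, mode='full')
print(y)                       # [1. 1. 3. 2. 5. 3. 3.]
```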
Max-Pooling and Average Pooling¶
While not strictly convolutional layers in the sense of having learnable kernels, pooling layers are almost always used in conjunction with convolutional layers. They perform downsampling operations on the feature maps, which means they reduce the spatial dimensions (height and width). This reduction helps in several ways:
- Reducing Computation and Parameters: By making the feature maps smaller, subsequent layers have fewer inputs to process, which speeds up computation and reduces the number of parameters in the network.
- Providing Translation Invariance: Pooling makes the network more robust to small shifts or translations of features in the input image. For example, in Max Pooling, if a strong feature is detected within a pooling window, its exact position within that window doesn't matter as much, as the maximum value will be selected regardless.
- Extracting Dominant Features:
- Max Pooling: Selects the maximum value within the pooling window. This is effective at capturing the most important or salient feature within that region.
- Average Pooling: Calculates the average value within the pooling window. This provides a smoother representation of the region and is often used in the later stages of a network or in specific architectures.
The calculation of the output size is the same as in regular convolutions. In essence, pooling layers help to summarize the information in the feature maps, making the network more efficient and less sensitive to minor variations in the input.
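Here is a minimal sketch of 2x2 max and average pooling with stride 2 on a toy single-channel feature map (assuming the height and width are even); note that the output size matches the usual formula, $(4 - 2)/2 + 1 = 2$:

```python
import numpy as np

fmap = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 feature map

# Group into non-overlapping 2x2 windows: axes are (row block, row in block,
# column block, column in block), then reduce over the within-window axes.
windows = fmap.reshape(2, 2, 2, 2)
max_pooled = windows.max(axis=(1, 3))
avg_pooled = windows.mean(axis=(1, 3))

print(max_pooled)   # [[ 5.  7.]  [13. 15.]]
print(avg_pooled)   # [[ 2.5  4.5]  [10.5 12.5]]
```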
import ipywidgets as widgets
from IPython.display import display
import os
def load_video(filename, remote_url=None):
"""Load video from local or remote source"""
try:
# Check if file exists locally
if os.path.exists(filename):
return widgets.Video.from_file(filename)
# Download from remote if URL provided
if remote_url:
import urllib.request
print(f"Downloading {filename}...")
urllib.request.urlretrieve(remote_url, filename)
print("✓ Download complete")
return widgets.Video.from_file(filename)
raise FileNotFoundError(f"{filename} not found")
except Exception as e:
print(f"⚠ Could not load video: {e}")
return None
# Use it
video = load_video('pooling.mp4',
remote_url='https://raw.githubusercontent.com/RiaanZoetmulder/deep_learning_course/main/lecture_4/pooling.mp4')
if video:
display(video)
Global Average Pooling¶
Global Average Pooling (GAP) is a pooling operation that drastically reduces the spatial dimensions of a feature map. Instead of pooling over small local regions with a fixed-size kernel, GAP takes the average of all the elements within each feature map. This results in a single value for each channel. If a convolutional layer outputs a feature map of size $C \times H \times W$, a Global Average Pooling layer would output a tensor of size $C \times 1 \times 1$. GAP is often used as an alternative to fully connected layers at the end of a CNN, as it reduces the number of parameters and can help prevent overfitting. It also provides a more interpretable connection between the feature maps and the final classification layer, as each output channel can be seen as a confidence score for a particular category.
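As a minimal sketch (toy tensor, names illustrative), GAP on a channels-first batch is just a mean over the spatial axes:

```python
import numpy as np

fmaps = np.random.randn(8, 32, 7, 7)   # (Batch, Channels, H, W)
gap = fmaps.mean(axis=(2, 3))          # average over H and W
print(gap.shape)                       # (8, 32): one value per channel, per image
```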
Other operations¶
Of course there are all sorts of other operations that you can apply in convolutional neural networks. Think of different attention blocks (Squeeze-and-Excite, Gather-Excite, etc.), various forms of normalization (batch normalization, group normalization, layer normalization), and connections (residual connections, skip connections). These will all be introduced as we need them for different purposes.
4.4 Conclusion¶
In this tutorial you have been introduced to one of the backbone architectures of computer vision: the convolutional neural network. We introduced the convolution operation, how to parameterize it in a neural network, how to build a functioning CNN, and we described some of its useful properties. You can be quite proud of what you have accomplished today; few courses go this deep into these marvels of modern AI.
In the next (technically oriented) lesson, we will leave our numpy-fu behind and swap it out for PyTorch. This will allow us to abstract away low-level details and focus on the next step in our journey: CNN architectures.
4.5 Appendix¶
Translation Equivariance¶
The convolution operation between an input signal (or image) $X$ and a kernel $K$ is defined mathematically as:
For a 1D signal: $$ (X \ast K)(t) = \sum_{\tau} X(\tau) K(t - \tau) $$
Equivariance to Translation:
Let's consider a translated input signal $X'(t) = X(t - \delta)$, where $\delta$ is the amount of translation. When we convolve this translated input with the same kernel $K$, the output is:
$$ (X' \ast K)(t) = \sum_{\tau} X'(\tau) K(t - \tau) $$Substituting $X'(\tau) = X(\tau - \delta)$: $$ (X' \ast K)(t) = \sum_{\tau} X(\tau - \delta) K(t - \tau) $$
Let $\tau' = \tau - \delta$. Then $\tau = \tau' + \delta$. Substituting this into the equation: $$ (X' \ast K)(t) = \sum_{\tau'} X(\tau') K(t - (\tau' + \delta)) = \sum_{\tau'} X(\tau') K((t - \delta) - \tau') $$
Notice that the last expression is exactly the convolution of the original input $X$ with the kernel $K$, evaluated at the translated location $(t - \delta)$.
$$ (X' \ast K)(t) = (X \ast K)(t - \delta) $$This equation shows that translating the input signal by $\delta$ results in the output being translated by the same amount $\delta$. This property is known as translation equivariance.
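You can also verify this numerically with a quick sketch (using numpy's 'full' convolution so nothing is cut off, and a signal that is zero near the edges so the shift does not interact with the boundaries):

```python
import numpy as np

x = np.zeros(30)
x[10:15] = [1., 2., 3., 2., 1.]      # a bump somewhere in the middle
k = np.random.randn(5)
delta = 4

x_shifted = np.zeros_like(x)
x_shifted[delta:] = x[:-delta]       # X'(t) = X(t - delta): bump shifted to the right

y = np.convolve(x, k, mode='full')
y_shifted = np.convolve(x_shifted, k, mode='full')

# Convolving the shifted input gives the shifted output of the original input.
print(np.allclose(y_shifted[delta:], y[:-delta]))   # True
```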
Calculating the Receptive Field¶
The receptive field ($R_L$) of a unit in layer $L$ of a CNN can be calculated iteratively based on the receptive field of the previous layer ($R_{L-1}$), the kernel size ($k_L$), and the stride ($s_L$) of the current layer.
The formula for calculating the receptive field of a layer $L$ is:
$$ R_L = R_{L-1} + (k_L - 1) \times \prod_{i=1}^{L-1} s_i $$Where:
- $R_L$ is the receptive field size of layer $L$.
- $R_{L-1}$ is the receptive field size of the previous layer ($L-1$).
- $k_L$ is the kernel size of layer $L$.
- $s_i$ is the stride of layer $i$.
- $\prod_{i=1}^{L-1} s_i$ is the product of the strides of all layers from 1 to $L-1$.
For the first convolutional layer (L=1), the receptive field is simply the kernel size:
$$ R_1 = k_1 $$Let's walk through an example with two convolutional layers:
- Layer 1: Kernel size $k_1=3$, Stride $s_1=1$.
- Layer 2: Kernel size $k_2=5$, Stride $s_2=2$.
Calculating the receptive field of Layer 1: $R_1 = k_1 = 3$
Calculating the receptive field of Layer 2: $R_2 = R_1 + (k_2 - 1) \times s_1$ $R_2 = 3 + (5 - 1) \times 1$ $R_2 = 3 + 4 \times 1$ $R_2 = 7$
So, a unit in the second convolutional layer "sees" a $7 \times 7$ region of the original input image.
You can continue this process for each subsequent convolutional layer to determine the receptive field at any point in the network. Pooling layers also affect the receptive field. If a pooling layer with size $p_L$ and stride $s_L$ is introduced at layer $L$, the formula for the receptive field of the next layer would need to incorporate the pooling stride.
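The iterative formula can also be written as a small helper, sketched here under the assumption that every layer is a plain convolution described by a (kernel size, stride) pair:

```python
def receptive_field(layers):
    """Receptive field of the last layer, given (kernel_size, stride) per layer."""
    rf = 1
    jump = 1               # product of the strides of all previous layers
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# The example from the text: k1 = 3, s1 = 1 and k2 = 5, s2 = 2.
print(receptive_field([(3, 1), (5, 2)]))   # 7
```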