Welcome to the Philips deep learning course! Have you ever wondered how computers can "see" and understand images? This course is designed to introduce you to the fascinating world of deep learning, with a specific focus on its powerful applications in computer vision. We will build our skills from the ground up, equipping you to ask insightful questions and contribute effectively when discussing new product ideas involving deep learning.
Lesson 1: Linear and Logistic Regression¶
In our first lesson, we'll start with foundational concepts by exploring linear and logistic regression. You might ask, why begin with such seemingly simple models when our goal is deep learning?
There are several compelling reasons:
- Building Blocks: Neural networks can be understood as powerful compositions of these basic models. Linear and logistic regression introduce fundamental concepts crucial for building neural networks, such as loss functions (measuring how well our model performs), weight matrices (the parameters our models learn), output functions (transforming the model's output), and optimization (finding the best parameters).
- Illustrating Core Ideas: These simpler models provide clear and intuitive examples to demonstrate more abstract but critical concepts like overfitting (when a model performs too well on training data but poorly on new data), underfitting (when a model is too simple to capture the data's patterns), and the bias-variance tradeoff (the inherent conflict between a model's simplicity and its ability to fit the data). Understanding these concepts with linear and logistic regression makes it easier to grasp them when we move to more complex neural network architectures.
Important notes¶
- It is fine to consult with colleagues to solve the problems, in fact it is encouraged.
- Please turn off AI tools; we want you to internalize the concepts, not just quickly breeze through the problems. To turn off AI, click on the gear in the top right corner, go to AI assistance, untick "Show AI powered inline completions", untick "Consented to use generative AI features", and tick "Hide generative AI features".
1. Linear Regression¶
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It is important to have at least seen linear regression once before diving into neural networks. This is because it familiarizes you with important concepts: Weights, Bias Terms (not to be confused with the more colloquial use of the term bias), Target Values, and some basic linear algebra operations that you will need later.
Goal of Linear Regression:
The primary goal of linear regression is to find the "best-fit" line (or hyperplane in higher dimensions) that minimizes the difference between the observed values and the values predicted by the linear model. This difference is typically measured using the Mean Squared Error (MSE). By minimizing the MSE, we aim to find the parameters of the linear equation that best represent the underlying relationship in the data.
Matrices in Linear Regression:
To represent the data and the model parameters, we typically use matrices:
y(Target values): This is a column vector containing the dependent variable values (the thing we want to predict). It has dimensions $n \times 1$, where $n$ is the number of observations.
X(Data matrix): This is a matrix containing the independent variable values (the stuff from which we are predicting). To include a bias term in the model, we add a column of ones to the left of the independent variable data. It has dimensions $n \times (d+1)$, where $n$ is the number of observations and $d$ is the number of independent variables.
w(Coefficients/Weights): This is a column vector containing the coefficients (including the bias term) that we want to learn from the data. It has dimensions $(d+1) \times 1$. This is the part of the regression that will be changed during our fitting of the line.
Finding the solution for the Mean Squared Error:
The linear regression model can be expressed as:
$$ \hat{\mathbf{y}} = \mathbf{X} \mathbf{w} $$where $\hat{\mathbf{y}}$ are the predicted values. Basically, you multiply your data matrix $\mathbf{X}$ with your weight vector $\mathbf{w}$ to get the predictions $\hat{\mathbf{y}}$.
The Mean Squared Error (MSE) is defined as:
$$ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$It measures the difference between your predictions and the real target values. Each difference is then squared to make sure it is positive. In matrix form, the sum of squared errors (SSE) is:
$$ SSE = (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) $$To find the analytical solution for the weights $\mathbf{w}$ that minimize the SSE (and thus the MSE), we take the derivative of the SSE with respect to $\mathbf{w}$, set it to zero, and solve for $\mathbf{w}$:
$$ \frac{\partial SSE}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) = \mathbf{0} $$Expanding and differentiating, we get:
$$ -2 \mathbf{X}^T (\mathbf{y} - \mathbf{X} \mathbf{w}) = \mathbf{0} $$$$ -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{0} $$$$ \mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y} $$Assuming $(\mathbf{X}^T \mathbf{X})$ is invertible, the analytical solution for the weights $\mathbf{w}$ is:
$$ \mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$You can see that the Moore-Penrose Pseudoinverse emerges in this equation.
Interesting Fact: This equation is known as the Normal Equation, and it provides a closed-form solution for the optimal weights in linear regression.
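To make the Normal Equation concrete, here is a minimal sketch (with synthetic data and variable names of our own choosing, not part of the exercises) that solves for $\mathbf{w}$ directly and checks the result against NumPy's built-in least-squares solver:
import numpy as np
rng = np.random.default_rng(0)
# synthetic data: 100 points, a bias column of ones plus one feature
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])
w_true = np.array([2.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)
# Normal Equation: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y
# numerically safer alternative built into NumPy
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_normal)  # both should be close to [2.0, -3.0]
print(w_lstsq)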
import ipywidgets as widgets
from IPython.display import display
import os
def load_video(filename, remote_url=None):
"""Load video from local or remote source"""
try:
# Check if file exists locally
if os.path.exists(filename):
return widgets.Video.from_file(filename)
# Download from remote if URL provided
if remote_url:
import urllib.request
print(f"Downloading {filename}...")
urllib.request.urlretrieve(remote_url, filename)
print("✓ Download complete")
return widgets.Video.from_file(filename)
raise FileNotFoundError(f"{filename} not found")
except Exception as e:
print(f"⚠ Could not load video: {e}")
return None
# Use it
video = load_video('linear_regression.mp4',
remote_url='https://raw.githubusercontent.com/RiaanZoetmulder/deep_learning_course/main/lecture_1/linear_regression.mp4')
if video:
display(video)
Exercise 1.1: Implementing linear regression¶
As our first exercise, we are going to implement finding the optimal weight matrix ($\mathbf{w}$). In order to do this we will first have to generate some toy data $\mathbf{X}$ and $\mathbf{y}$.
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# GENERATE TOY DATA
number_of_datapoints = 20
# generate the matrix of data (X)
X = np.random.randn(number_of_datapoints, 2)
X[:, 0] = 1.
# fake weights that we are going to estimate
w_real = np.array([0.5, 1])
# we define y ourselves, so we know what the correct answer is!
# EXTRA info: We do not add an error term, so y is perfectly predictable from X
y = w_real[0]* X[:, 0] + w_real[1]*X[:, 1]
# TODO: Implement this! See the Normal Equation for the form.
w_predicted = np.linalg.inv(X.T @ X) @ X.T @ y
def plot_data(X, y, w_real, predicted = True):
'''
    Plots the generated data and the real or predicted line.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        The input data. n_samples is the number of data points and n_features in our case is 2!
    y : array-like, shape (n_samples,)
        The target values.
    w_real : array-like, shape (2,)
        The bias and slope of the line to draw (real or predicted).
    predicted : bool
        Whether w_real holds the predicted weights (True) or the real ones (False).
'''
n_samples, n_features = X.shape
assert n_features == 2, "X should have 2 features!"
# Plot the generated data
plt.scatter(X[:, 1], y, label='Generated Data')
# Plot the fake line
x_fake = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
y_fake = w_real[0] + w_real[1] * x_fake
label = 'Real Line' if not predicted else 'Predicted Line'
color = 'green' if not predicted else 'red'
title = 'Generated Data and Real Line' if not predicted else 'Generated Data and Predicted Line'
plt.plot(x_fake, y_fake, color=color, label=label)
plt.xlabel('X')
plt.ylabel('y')
plt.title(title)
plt.legend()
plt.grid(True)
plt.show()
plot_data(X, y, w_real, predicted=False)
# TODO: uncomment this to check your solution, the green and red lines should
# be the same!
# plot_data(X, y, w_predicted, predicted = True)
2. Logistic Regression¶
The Problem¶
While linear regression is used for predicting continuous values (like a house price), logistic regression (specifically binary logistic regression) is used for classification problems. The goal is to predict which of two distinct categories or classes (typically labeled 0 and 1, like "spam" or "not spam," or "malignant" or "benign") a given data point belongs to.
Unlike linear regression, where we found a direct formula to calculate the best weights (this direct formula is what we call an analytical solution), there is no such simple formula for finding the optimal weights in logistic regression. This is primarily because of the sigmoid function (or logistic function) which we use at the end of our model. Logistic regression primarily familiarizes us with output/activation functions (in this case the sigmoid), optimization (in this case Stochastic Gradient Descent), and predicting probabilities.
The logistic regression model works by first calculating a linear combination of the inputs and weights. A linear combination is simply multiplying each input feature by its corresponding weight and summing them up, just like we did in the linear regression equation ($z = w_0 \cdot 1 + w_1 \cdot x_1 + w_2 \cdot x_2 + \dots$). We can represent this in matrix form as:
$$ z = \mathbf{X} \mathbf{w} $$After calculating this linear combination ($z$), we apply the sigmoid function to it. The sigmoid function is a mathematical function that takes any real-valued number and squashes it into a value between 0 and 1. The equation for the sigmoid function is:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$So, the predicted probability $\hat{y}$ that the data point belongs to class 1 is given by:
$$ \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} $$The output of the sigmoid function ($\hat{y}$) can be interpreted as the probability of the data point belonging to class 1. To make a final class prediction, we typically use a threshold (a cutoff value, often 0.5). If the calculated probability ($\hat{y}$) is greater than or equal to this threshold, we classify the data point as class 1; otherwise, we classify it as class 0.
When we start talking about neural networks later, the values $z$ are often referred to as logits and the values $\hat{y}$ as probabilities.
So what does this look like?
import numpy as np
import matplotlib.pyplot as plt
# generate toy data
X = np.random.randn(20, 2)
X[:, 0] = 1.
# calculate the classes
weights_real = np.array([0.5, 1])
y_continuous = weights_real[0]* X[:, 0] + weights_real[1]*X[:, 1]
y_classes = (y_continuous > 0.0).astype(int)
# this is your sigmoid activation function!
def sigmoid(logits):
return 1 / (1 + np.exp(-logits))
# Plotting the sigmoid function
plt.figure(figsize = (15,10))
z_sigmoid = np.linspace(-7.5, 7.5, 100) # Stretched out x-axis
sigmoid_z = sigmoid(z_sigmoid)
plt.plot(z_sigmoid, sigmoid_z, label='Sigmoid Function', color='orange')
# Calculate predicted probabilities for the data points
y_hat = sigmoid(y_continuous)
# Plot the predicted probabilities on the sigmoid curve, colored by their actual class
plt.scatter(y_continuous, y_hat, c=y_classes, cmap='viridis', label=r'Predicted Probability ($\hat{y}$)')
# Add the horizontal cutoff line
plt.axhline(0.5, color='grey', linestyle='--', label='Classification Cutoff (0.5)')
plt.xlabel('z (Linear Combination)')
plt.ylabel('Probability')
plt.title('Sigmoid Function, Data Points, and Predicted Probabilities')
plt.legend()
plt.grid(True)
plt.show()
As you can see, the sigmoid function squishes the values of Y to be within the 0 to 1 range.
The Binary Cross Entropy Loss¶
In linear regression we used a loss function called "mean squared error". However, now we are working with probabilities instead of continuous values.
There are several reasons for not using the mean squared error, most of which are beyond the scope of this course. However, a very important practical reason is that you want to make sure that overconfidence is penalized. If our logistic regression model makes a very confident but incorrect prediction, the mean squared error does not penalize this behaviour strongly enough.
Instead, for classification problems like logistic regression, we use a different measure called "cross-entropy", which does penalize the model for overconfident predictions.
For binary classification (where there are only two classes, like 0 or 1), the formula for cross-entropy loss for a single data point is:
$$ \text{Loss} = -(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})) $$
Here we have:
- $y$ -> The actual class or "ground truth". In our case it is 0 or 1
- $\hat{y}$ -> The predicted class probability (between 0 and 1).
If the actual class $y$ is 1, the formula becomes $-\log(\hat{y})$, which is small if $\hat{y}$ is close to 1 and large if $\hat{y}$ is close to 0. If the actual class $y$ is 0, the formula becomes $-\log(1 - \hat{y})$, which is small if $\hat{y}$ is close to 0 and large if $\hat{y}$ is close to 1.
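To see this penalty in numbers, here is a small sketch (toy numbers of our own) comparing the squared error and the binary cross entropy for an increasingly confident, increasingly wrong prediction when the true class is 1:
import numpy as np
y = 1  # true class
for y_hat in [0.5, 0.1, 0.01, 0.001]:
    squared_error = (y - y_hat) ** 2  # bounded: never exceeds 1
    bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # grows without bound
    print(f"y_hat={y_hat:6.3f}  squared error={squared_error:.3f}  cross entropy={bce:.3f}")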
(Stochastic) Gradient Descent¶
In logistic regression, because we use that curved "S"-shaped function (the sigmoid) at the end, we don't have such a simple, analytical solution to find the best weights. Hence, we can't just calculate the answer in one step.
So, instead of a direct calculation, we have to use a different approach. Imagine you're trying to find the lowest point in a valley while blindfolded. You can't just walk directly there. What you'd do is start somewhere, feel around a bit to see which direction goes downhill, take a small step in that direction, and repeat the process. You keep taking small steps downhill until you can't go down any further.
That's similar to what we do in logistic regression to find the best weights. We start with some initial guess for the weights. Then, we repeatedly adjust these weights in small steps, trying to make our model better at predicting the classes and reducing the "loss" (which, as we discussed, is measured by cross-entropy). This process of starting with a guess and gradually improving it over many steps is called an iterative optimization method. We keep iterating, or repeating the steps, until we find the set of weights that minimizes our error.
Now, let's go back to our valley analogy. How do you know which direction is downhill? You'd feel the slope or the steepness of the ground. In machine learning, the equivalent of feeling the slope is calculating something called the gradient.
The gradient tells us the direction in which the loss function is increasing most steeply. Since we want to find the lowest point (where the loss is minimized), we need to move in the opposite direction of the steepest increase. Think of it as following the path that goes straight downhill. This process of moving in the direction opposite to the gradient to decrease the loss is the core idea behind gradient descent.
So, with gradient descent, we start with our initial guess for the weights, calculate the gradient of the loss function at that point, and then adjust the weights by taking a step in the opposite direction of the gradient.
How big of a step should we take? That's controlled by a crucial parameter called the learning rate (in deep learning it is typically denoted as $\lambda$). The learning rate is a small positive number that determines how much we update the weights based on the gradient.
- A small learning rate means we take tiny steps. This makes the process slow, like cautiously tiptoeing downhill. However, it increases the chances of finding the absolute lowest point without overshooting it.
- A large learning rate means we take big steps, like leaping down the hillside. This can make the process faster, but there's a risk of overshooting the lowest point or even bouncing back and forth around it without ever settling down.
The goal is to choose a learning rate that is just right – allowing us to reach the minimum loss efficiently without being too aggressive. We repeat this process of calculating the gradient and updating the weights using the learning rate over many iterations until we reach a point where the loss is as low as possible.
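As a toy illustration of the learning rate's effect (our own example, not part of the exercises), here is gradient descent on the one-dimensional function $f(w) = w^2$, whose gradient is $2w$; swap in the large learning rate to watch the iterates bounce back and forth instead of settling:
def gradient_descent(w, learning_rate, steps=10):
    # minimize f(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w = w - learning_rate * (2 * w)
        print(f"w = {w:.4f}")
    return w

gradient_descent(w=5.0, learning_rate=0.1)    # shrinks smoothly towards the minimum at 0
# gradient_descent(w=5.0, learning_rate=1.1)  # overshoots: |w| grows every step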
Let's look at two ways to perform gradient descent: standard Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
In standard Gradient Descent (GD), to figure out which direction is downhill (to calculate the gradient), we use all of our data points. We calculate the error for every single data point, sum up how these errors would change if we slightly adjusted the weights, and then take a step based on that total sum. While this gives us a very accurate picture of the true "downhill" direction based on all the data, it can be very computationally expensive if we have a really large dataset. Imagine having millions or billions of data points – calculating the gradient using all of them for every single step would take a very long time!
This is where Stochastic Gradient Descent (SGD) comes in. The word "stochastic" basically means "involving a random variable." In SGD, instead of using the entire dataset to calculate the gradient, we use just one single data point (or sometimes a very small group of data points, called a "mini-batch") at each step. We calculate the gradient based on this one (or few) data points and immediately update the weights. This reduces the computational cost, but the steps are noisier. As a result of this noise our path downward will be more zigzaggy than with regular GD.
Update Equations for Stochastic Gradient Descent¶
In SGD, we update the weights iteratively based on the gradient calculated from a single data point (or a small batch, in which case it is called mini-batch gradient descent). The formula for updating the weights at each step is:
$$ w_{new} = w_{old} - \lambda \times \nabla L(y_i, \hat{y}_i, w) $$Let's break down what each part of this formula means:
- $w_{new}$: These are the new, updated weights that we calculate in the current step. These will be used in the next iteration.
- $w_{old}$: These are the current weights (the weights from the previous iteration) before we update them. We start with some initial weights and update them repeatedly using this formula.
- $\lambda$ is our learning rate. The small positive number that controls how big of a step we take in the direction of the gradient.
- $\nabla L(y_i, \hat{y}_i, w)$: This is the gradient of the loss function for a single data point $(x_i, y_i)$ with respect to the weights $w$. The loss function $L$ measures how wrong our prediction $\hat{y}_i$ is compared to the actual value $y_i$ for that specific data point, given the current weights $w$. The gradient $\nabla L$ tells us the direction of the steepest increase in the loss for this single data point. By subtracting the learning rate times the gradient, we move the weights in the direction that decreases the loss for that data point.
The one thing we are missing is the definition of the gradient $\nabla L(y_i, \hat{y}_i, w)$. We will give a brief derivation of it below. If you are interested in the full derivation, scroll down to the appendix.
Remember, we have our binary cross entropy that we seek to minimize at each time step $i$:
$$ \text{Loss} = -(y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i))$$
We want to minimize it by making the predicted values $\hat{y}$ very close to the true values $y$, by changing the weights $\mathbf{w}$. As you may remember from high school, we can take the derivative to find a step in the right direction. But we cannot do this directly, because there is also a sigmoid function in the way, so we have to use the chain rule. Another way of writing $\nabla L(y_i, \hat{y}_i, w)$ is $\frac{\partial L}{\partial w}$.
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial z_i}\frac{\partial z_i}{\partial w} \quad\quad (1)$$That means we have to take three derivatives:
- The derivative of the binary cross entropy loss w.r.t. the probabilities: $\frac{\partial L}{\partial \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)}$
- The derivative of the sigmoid with respect to the logits $z$: $\frac{\partial \hat{y}_i}{\partial z_i} = \hat{y}_i (1 - \hat{y}_i)$
- The derivative of $z$ with respect to the weights $w$:
$$ \frac{\partial z_i}{\partial \mathbf{w}} = \mathbf{x_i}^T $$
When we put all of these together using formula $(1)$. We get the following expression:
$$ \frac{\partial L}{\partial w} = \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)}(\hat{y}_i(1 - \hat{y}_i)) \mathbf{x_i}^T \quad \quad (2) $$
This simplifies to:
$$ \frac{\partial L}{\partial w} = (\hat{y}_i - y_i)\mathbf{x_i}^T \quad \quad (3) $$
Plug that back into our update equation, and we get:
$$ w_{new} = w_{old} - \lambda (\hat{y}_i - y_i)\mathbf{x_i}^T$$
This lies at the heart of our optimization process. We sample a datapoint $i$, multiply it with the weight matrix to get our prediction $\hat{y}_i$, and then update the weights. Do this often enough and you should see the model becoming more and more accurate.
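Here is what a single update looks like numerically, a minimal sketch with made-up numbers (the real implementation follows in the exercise below):
import numpy as np
lam = 0.1                                  # learning rate
w_old = np.array([0.2, -0.4, 0.6])         # current weights (bias + 2 features)
x_i = np.array([1.0, 0.5, -1.2])           # one datapoint, bias term first
y_i = 1.0                                  # its true class
y_hat = 1 / (1 + np.exp(-x_i @ w_old))     # forward pass through the sigmoid
w_new = w_old - lam * (y_hat - y_i) * x_i  # one SGD step
print(w_old, "->", w_new)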
Exercise 1.2 Exploring the Iris dataset¶
We are going to work with a "real" dataset about flowers, how nice! This dataset contains data about three different species of flowers: their sepal width and length, and their petal width and length. The first thing you are going to do is answer a few questions about this dataset after loading it.
from sklearn import datasets
import numpy as np
import pandas as pd
from random import shuffle
import matplotlib
import matplotlib.pyplot as plt
# this loads the iris dataset for scikit-learn
iris = datasets.load_iris()
X, y = iris.data, iris.target
#### WRITE CODE HERE TO ANSWER QUESTIONS #####
Questions
- How many datapoints are there in this dataset?
- Excluding the bias, how many weights will we have in our weight matrix (i.e. what will the dimension be)?
Now that you have answered the questions, we are going to start implementing logistic regression. But before we do that we have to remove one of the species from our iris dataset. This is because there are three species of flower, and our current form of logistic regression can only handle 2 classes at the same time. So let's start with that.
def clean_data(X, y):
include_index = y != 2
X_new = X[include_index]
y_new = y[include_index]
# add a bias term here
X_new = np.hstack((np.ones((X_new.shape[0], 1)), X_new))
return X_new, y_new
X_new, y_new = clean_data(X, y)
Good, now that we have a cleaned dataset (with a bias term appended to it), we can start setting up our model. If you do it correctly, you should see the cross entropy loss decreasing.
# helper functions
def binary_cross_entropy(y, y_hat):
'''
TODO
Implement the binary cross entropy function
see notes above for the formula.
'''
### YOUR CODE HERE
return - (y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
def calc_gradient(y, y_hat, x_i):
'''
TODO:
implement the gradient calculation
Hint: don't forget to transpose the x_i vector
'''
### YOUR CODE HERE
return (y_hat - y) * x_i.T
def sigmoid(logits):
return 1 / (1 + np.exp(-logits))
def forward_pass(x_i, w):
'''
TODO:
implement the forward pass (i.e. the calculation of the probability of
a vector x belonging to class 0 or 1)
'''
### YOUR CODE HERE
z = np.dot(x_i, w)
y_hat = sigmoid(z)
return y_hat
def get_random_sample(shape):
sample = [x for x in range(shape)]
shuffle(sample)
return sample
def train_logistic_regression(w, X_new, y_new, epochs, learning_rate, report_losses = True):
num_datapoints = X_new.shape[0]
loss_history = []
for epoch in range(epochs):
loss_avg = 0
for index in get_random_sample(num_datapoints):
x_i = X_new[index, :]
y_i = y_new[index]
# TODO: implement the forward pass, gradient calculation and
# weight update. Be careful about the shape of the weight matrix
# during the weight update! It must remain the same!
### YOUR CODE HERE
y_hat = forward_pass(x_i, w)
gradient = calc_gradient(y_i, y_hat, x_i)
w = np.subtract(w.T, (learning_rate * gradient)).T
### END YOUR CODE
# calculate the binary cross entropy loss and save
loss = binary_cross_entropy(y_i, y_hat)
loss_avg += loss[0]/num_datapoints
loss_history.append(loss_avg)
if report_losses:
print(f"Epoch: {epoch}, Loss: {loss_avg}")
return w, loss_history
w = np.random.randn(5, 1)
num_datapoints = X_new.shape[0]
epochs = 20
learning_rate = 0.01
w, loss_history = train_logistic_regression(w, X_new, y_new, epochs, learning_rate)
Epoch: 0, Loss: 1.0426130755047525
Epoch: 1, Loss: 0.2656436109569505
Epoch: 2, Loss: 0.1987845691102973
Epoch: 3, Loss: 0.15322818780888842
Epoch: 4, Loss: 0.12680498612021063
Epoch: 5, Loss: 0.10690100769893235
Epoch: 6, Loss: 0.09320840203366443
Epoch: 7, Loss: 0.08234762002322855
Epoch: 8, Loss: 0.07385506716205814
Epoch: 9, Loss: 0.06688461261284244
Epoch: 10, Loss: 0.060978804918462505
Epoch: 11, Loss: 0.05630440244836959
Epoch: 12, Loss: 0.05213250660086208
Epoch: 13, Loss: 0.04859928193132791
Epoch: 14, Loss: 0.045566900507241104
Epoch: 15, Loss: 0.04279064061896112
Epoch: 16, Loss: 0.040556989230835463
Epoch: 17, Loss: 0.03811486455646388
Epoch: 18, Loss: 0.03643224293287226
Epoch: 19, Loss: 0.034693924256946375
# Plot the loss history
plt.figure(figsize=(10, 6))
plt.plot(range(epochs), loss_history)
plt.xlabel('Epoch')
plt.ylabel('Average Binary Cross-Entropy Loss')
plt.title('Training Loss over Epochs (SGD)')
plt.grid(True)
plt.show()
Exercise 1.3: Learning Rate experiment¶
Now that we have a working model, we are going to run a few experiments. I would like to see you try different values for the learning rate and report back on the accuracy (defined below, just call the function with the arguments). I have also provided a list below with different values for the learning rates that you can try. Report back on the accuracies and answer the questions once you are done.
def calculate_accuracy(w, X_new, y_new):
num_datapoints = X_new.shape[0]
correct_predictions = 0
for index in range(num_datapoints):
x_i = X_new[index, :]
y_i = y_new[index]
y_i_pred = forward_pass(x_i, w)
if y_i_pred > 0.5:
y_i_pred = 1
else:
y_i_pred = 0
if y_i_pred == y_i:
correct_predictions += 1
accuracy = correct_predictions / num_datapoints
print('Accuracy: {}\n'.format(accuracy*100.))
# Play around with the learning rates
learning_rates = [0.1, 0.01, 0.001, 0.0001]
for lr in learning_rates:
w = np.random.randn(5, 1)
w, loss_history = train_logistic_regression(w, X_new, y_new, epochs, lr, report_losses = False)
print(f'Learning Rate: {lr}')
calculate_accuracy(w, X_new, y_new)
Learning Rate: 0.1
Accuracy: 100.0

Learning Rate: 0.01
Accuracy: 100.0

Learning Rate: 0.001
Accuracy: 87.0

Learning Rate: 0.0001
Accuracy: 48.0
What do you notice about the relationship between the learning rate and the accuracy?
Bonus: If you run the experiment more than one time, you may find different results. Do you have any idea why this is?
Bonus: Sometimes the experiments result in 100% accuracy, do you think that if we get data that the model has not seen yet it will also get 100% accuracy, why (not)?
3. Softmax Regression¶
In the last section, we played around with logistic regression and the binary cross entropy loss. These are specific cases of softmax regression and the cross entropy loss, which are used when there are more than two classes.
The softmax output function is given (for a datapoint $i$ and a class $c$ out of a total of $K$ classes) by:
$$ \sigma_{softmax} (z_{i, c}) = \frac{e^{z_{i, c}}}{\sum^K_{j = 1} e^{z_{i, j}}}$$The (multi-class) cross entropy loss is given (for a class $c$ and a single datapoint $i$) by:
$$ \mathcal{L}_{CE} = - \sum^K_{c = 1} y_c \log(\hat{y}_c)$$Where $y$ is the target label, and $\hat{y}$ is the predicted label. One thing in softmax regression that is different from logistic regression is that instead of having a single vector with weights (and a bias), you now have an $M \times K$ matrix, where $K$ is the number of classes and $M$ is the number of weights + 1 (for the bias). Intuitively, what happens during softmax regression is that you calculate the probability of each class, then take the class with the maximum probability as your prediction. This exercise teaches you how to stack weight vectors next to each other and use those instead of single regression vectors, which is something that we do in deep learning all the time.
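As a quick numeric illustration (with logits we chose arbitrarily), here is the softmax turning three logits into a probability distribution, with the argmax as the predicted class:
import numpy as np
z = np.array([2.0, 1.0, 0.1])            # logits for K = 3 classes
probs = np.exp(z) / np.sum(np.exp(z))    # softmax
print(probs)             # approximately [0.659 0.242 0.099], sums to 1
print(np.argmax(probs))  # 0, so class 0 is the prediction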
Example of prediction in softmax regression¶
Let's illustrate a forward pass (that is, making predictions) in softmax regression with 3 classes using matrix operations, similar to what we would do with the Iris dataset (which has 4 features + a bias term).
Assume we have a data matrix $\mathbf{X}$ with $n$ data points and $d=4$ features, plus a column of ones for the bias term, so its dimensions are $n \times 5$.
$$ \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} \\ 1 & x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n,1} & x_{n,2} & x_{n,3} & x_{n,4} \end{bmatrix} $$
We have a weight matrix $\mathbf{W}$ with dimensions $(d+1) \times K$, where $d+1=5$ (features + bias) and $K=3$ (number of classes, i.e. the different kinds of flowers in our dataset). Each column of $\mathbf{W}$ corresponds to the weight vector for a specific class.
$$ \mathbf{W} = \begin{bmatrix} w_{0,1} & w_{0,2} & w_{0,3} \\ w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \\ w_{4,1} & w_{4,2} & w_{4,3} \end{bmatrix} $$
The first step of the forward pass is to calculate the linear combinations (logits) for each data point and each class by performing matrix multiplication:
$$ \mathbf{Z} = \mathbf{X} \mathbf{W} $$The dimensions of $\mathbf{Z}$ will be $n \times K$ (i.e., $n \times 3$). Each element $z_{i,c}$ in $\mathbf{Z}$ is the linear combination for data point $i$ and class $c$.
$$ \mathbf{Z} = \begin{bmatrix} z_{1,1} & z_{1,2} & z_{1,3} \\ z_{2,1} & z_{2,2} & z_{2,3} \\ \vdots & \vdots & \vdots \\ z_{n,1} & z_{n,2} & z_{n,3} \end{bmatrix} $$
The second step is to apply the softmax function to each row of the logits matrix $\mathbf{Z}$. The softmax function is applied independently to each data point's vector of logits. For each data point $i$, the predicted probability for class $c$, $\hat{y}_{i,c}$, is calculated as:
$$ \hat{y}_{i,c} = \frac{e^{z_{i,c}}}{\sum_{j=1}^{K} e^{z_{i,j}}} $$
Applying the softmax function to every row of $\mathbf{Z}$ results in the predicted probability matrix $\hat{\mathbf{Y}}$, which has the same dimensions as $\mathbf{Z}$ ($n \times 3$). Each element $\hat{y}_{i,c}$ represents the predicted probability that data point $i$ belongs to class $c$. The values in each row of $\hat{\mathbf{Y}}$ will sum up to 1.
$$ \hat{\mathbf{Y}} = \sigma_{softmax} (\mathbf{Z}) = \begin{bmatrix} \hat{y}_{1,1} & \hat{y}_{1,2} & \hat{y}_{1,3} \\ \hat{y}_{2,1} & \hat{y}_{2,2} & \hat{y}_{2,3} \\ \vdots & \vdots & \vdots \\ \hat{y}_{n,1} & \hat{y}_{n,2} & \hat{y}_{n,3} \end{bmatrix} $$
The matrix $\hat{\mathbf{Y}}$ contains the predicted probability distribution over the three classes for each of the $n$ data points.
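A minimal NumPy sketch of this two-step forward pass (random placeholder data, dimensions matching the Iris example):
import numpy as np
n, d, K = 6, 4, 3                                          # datapoints, features, classes
X = np.hstack([np.ones((n, 1)), np.random.randn(n, d)])    # n x (d+1), bias column first
W = np.random.randn(d + 1, K)                              # (d+1) x K weight matrix
Z = X @ W                                                  # logits, n x K
Y_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)   # row-wise softmax, n x K
print(Y_hat.sum(axis=1))                                   # each row sums to 1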
Softmax Update Equations¶
In order to jump straight into implementation, here are the equations you will need to implement softmax regression. First, the gradient:
$$\nabla \mathcal{L}_{CE}(\mathbf{x}_i, \mathbf{y}_i, \mathbf{W}) =\mathbf{x}_i(\mathbf{\hat{y}}_{i} - \mathbf{y}_{i})^T $$
Because the gradient is taken with respect to each class, we would like to make sure that updates are easy to do. This is the vectorized representation of the gradient update for a randomly sampled datapoint $i$.
In our Iris dataset example, the vector $\mathbf{x}_i$ would be a $5 \times 1$ vector, and $(\mathbf{\hat{y}}_{i} - \mathbf{y}_{i})^T$ would be a $1\times 3$ vector. If the outer product is taken between these two vectors, you should get a $5\times3$ matrix. This is the same size as our weight matrix. The final update equation becomes:
$$\mathbf{w}_{new} = \mathbf{w}_{old} - \lambda \mathbf{x}_i(\mathbf{\hat{y}}_{i} - \mathbf{y}_{i})^T $$
If you are interested in the derivations behind the softmax update, please check the appendix.
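To convince yourself of the shapes before you start, here is a tiny check (with dummy vectors of our own) that the outer product indeed matches the weight matrix:
import numpy as np
x_i = np.random.randn(5, 1)     # 5 x 1 input vector (4 features + bias)
diff = np.random.randn(1, 3)    # 1 x 3, stands in for (y_hat - y)^T
update = x_i @ diff             # outer product: 5 x 3, same shape as W
print(update.shape)             # (5, 3)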
Exercise 1.4 Implement softmax regression¶
You are going to implement (parts of) softmax regression. In the cells below I have indicated what you should implement yourself. Look at the equations above to see what you have to do!
Hint: Be mindful of how numpy handles outer products, the transposes in the gradient calculation are slightly different.
import numpy as np
from sklearn import datasets
import matplotlib
import matplotlib.pyplot as plt
X, y = datasets.load_iris(return_X_y=True)
num_datapoints = X.shape[0]
# add the bias term, the matrix size is now: [150, 5]
X = np.hstack((np.ones((X.shape[0], 1)), X))
# convert y into a one hot matrix. Instead of labels like 0, 1, 2, we would get [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_one_hot = np.zeros((y.shape[0], 3))
y_one_hot[np.arange(y.shape[0]), y] = 1
# initialize the weight matrix
w = np.random.randn(5, 3)
epochs = 100
learning_rate = 0.01
def softmax(x):
return np.exp(x) / np.sum(np.exp(x), axis=0)
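# NOTE: np.exp overflows for large logits; a mathematically equivalent,
# numerically safer variant subtracts the maximum logit first:
# np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)), axis=0)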
def cross_entropy_loss(y, y_hat):
return -np.sum(y * np.log(y_hat + 1e-8))
def forward_pass(x_i, w):
#### START: Implement the forward pass
z = np.dot(x_i, w)
y_hat = softmax(z)
return y_hat
def calc_gradient(y, y_hat, x_i):
# START: implement calculate gradient
# HINT: expand the dimensions of y_hat, such that they are (3, 1) instead of (3,)
y_hat = np.expand_dims(y_hat, axis=1)
y = np.expand_dims(y, axis=1)
x_i = np.expand_dims(x_i, axis=1)
return np.dot(x_i, (y_hat - y).T)
def randomize_dataset(X, y):
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
return X[indices], y[indices]
loss_history = []
for x in range(epochs):
loss_avg = 0
sampled_X, sampled_y = randomize_dataset(X, y_one_hot)
for i in range(num_datapoints):
x_i = sampled_X[i, :]
y_i = sampled_y[i, :]
#### YOUR CODE HERE
# forward pass, calculate y_hat
y_hat = forward_pass(x_i, w)
# calculate the gradient
gradient = calc_gradient(y_i, y_hat, x_i)
# update the weights
w = w - learning_rate * gradient
# save the loss average
loss_avg += cross_entropy_loss(y_i, y_hat) / num_datapoints
#### END
loss_history.append(loss_avg)
print(f"Epoch: {x}, Loss: {loss_avg}")
Epoch: 0, Loss: 0.8960875206027263
Epoch: 1, Loss: 0.35902582220663726
Epoch: 2, Loss: 0.3317707975787206
Epoch: 3, Loss: 0.3094047728735891
Epoch: 4, Loss: 0.30665249339150313
Epoch: 5, Loss: 0.2954354210684385
Epoch: 6, Loss: 0.28266655951140324
Epoch: 7, Loss: 0.2887721416270834
Epoch: 8, Loss: 0.2822760689948172
Epoch: 9, Loss: 0.254536159631553
Epoch: 10, Loss: 0.24966177470364848
Epoch: 11, Loss: 0.2562815446638368
Epoch: 12, Loss: 0.2485189698966169
Epoch: 13, Loss: 0.23564335200175465
Epoch: 14, Loss: 0.23464689980218856
Epoch: 15, Loss: 0.22997993393875032
Epoch: 16, Loss: 0.2161016704687553
Epoch: 17, Loss: 0.21707194266094237
Epoch: 18, Loss: 0.21278922593217336
Epoch: 19, Loss: 0.2132590391287725
Epoch: 20, Loss: 0.20910304814253816
Epoch: 21, Loss: 0.20835880254714267
Epoch: 22, Loss: 0.19659052951816522
Epoch: 23, Loss: 0.19832301974402972
Epoch: 24, Loss: 0.18935410291050003
Epoch: 25, Loss: 0.1891801416426302
Epoch: 26, Loss: 0.191574762873479
Epoch: 27, Loss: 0.1827721239501104
Epoch: 28, Loss: 0.18576914115790727
Epoch: 29, Loss: 0.1716505402283855
Epoch: 30, Loss: 0.18702863696119662
Epoch: 31, Loss: 0.17128723087898798
Epoch: 32, Loss: 0.16598351007790957
Epoch: 33, Loss: 0.16647300144219396
Epoch: 34, Loss: 0.15255511973584004
Epoch: 35, Loss: 0.16530774641773405
Epoch: 36, Loss: 0.165746388158841
Epoch: 37, Loss: 0.16696286982431918
Epoch: 38, Loss: 0.1629706591756492
Epoch: 39, Loss: 0.16000312123600277
Epoch: 40, Loss: 0.16850231143831135
Epoch: 41, Loss: 0.1528135561943311
Epoch: 42, Loss: 0.16429007663082593
Epoch: 43, Loss: 0.1621990195857935
Epoch: 44, Loss: 0.15202925142231768
Epoch: 45, Loss: 0.15221210120154705
Epoch: 46, Loss: 0.1499201323168957
Epoch: 47, Loss: 0.15395510017094347
Epoch: 48, Loss: 0.1555996017849912
Epoch: 49, Loss: 0.14222351123529814
Epoch: 50, Loss: 0.14858716115334386
Epoch: 51, Loss: 0.14300757889340565
Epoch: 52, Loss: 0.13993092220836506
Epoch: 53, Loss: 0.1455818147131279
Epoch: 54, Loss: 0.14589184174011086
Epoch: 55, Loss: 0.13968313909627086
Epoch: 56, Loss: 0.1371213501528684
Epoch: 57, Loss: 0.1423458219598678
Epoch: 58, Loss: 0.13614046988351494
Epoch: 59, Loss: 0.13335173960314375
Epoch: 60, Loss: 0.1404467735062873
Epoch: 61, Loss: 0.12866070147501346
Epoch: 62, Loss: 0.13300962721772172
Epoch: 63, Loss: 0.1375486175888615
Epoch: 64, Loss: 0.12484745927999519
Epoch: 65, Loss: 0.12463685889841811
Epoch: 66, Loss: 0.12908745910476965
Epoch: 67, Loss: 0.12900329773569227
Epoch: 68, Loss: 0.13356325994031876
Epoch: 69, Loss: 0.12456749894757185
Epoch: 70, Loss: 0.13316035504111984
Epoch: 71, Loss: 0.1302904786873909
Epoch: 72, Loss: 0.12818915158878966
Epoch: 73, Loss: 0.12720678848247002
Epoch: 74, Loss: 0.12246814606043514
Epoch: 75, Loss: 0.12353601567297605
Epoch: 76, Loss: 0.1313971460054467
Epoch: 77, Loss: 0.11985102125994035
Epoch: 78, Loss: 0.12079491605269521
Epoch: 79, Loss: 0.1239282701634861
Epoch: 80, Loss: 0.11923648740700485
Epoch: 81, Loss: 0.12466664936999285
Epoch: 82, Loss: 0.11772045119311558
Epoch: 83, Loss: 0.12200979361306101
Epoch: 84, Loss: 0.1197272512012538
Epoch: 85, Loss: 0.11878144406767223
Epoch: 86, Loss: 0.11747508100561291
Epoch: 87, Loss: 0.11732568985268159
Epoch: 88, Loss: 0.12099885615126164
Epoch: 89, Loss: 0.11741025491445983
Epoch: 90, Loss: 0.1081759982738595
Epoch: 91, Loss: 0.12155218390903245
Epoch: 92, Loss: 0.11831285905315385
Epoch: 93, Loss: 0.11951658349657798
Epoch: 94, Loss: 0.11775344038378864
Epoch: 95, Loss: 0.1178820358282577
Epoch: 96, Loss: 0.12218248288746701
Epoch: 97, Loss: 0.11297256421212183
Epoch: 98, Loss: 0.11963031962488943
Epoch: 99, Loss: 0.11346475954030406
plt.plot(loss_history)
plt.show()
If your loss function decreases to about 0.1, you are good!
End of Tutorial¶
Now that you have completed this tutorial, you have been introduced to many important topics: Weight Matrices, Loss functions (Mean Squared Error and Binary Cross Entropy), optimization techniques (Stochastic Gradient Descent), learning rates, and output functions (the sigmoid activation function and the softmax activation function). These will set you up nicely to learn more about deep learning.
Appendix: Derivation of binary logistic regression updates.¶
Derivative of the sigmoid function¶
The sigmoid function is defined as:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$To find the derivative of the sigmoid function with respect to $z$, which is $\frac{\partial \sigma(z)}{\partial z}$ or $\frac{\partial \hat{y}}{\partial z}$, we can use the quotient rule or rewrite the expression. Let's use the latter approach:
$$ \sigma(z) = (1 + e^{-z})^{-1} $$Now, applying the chain rule for differentiation:
$$ \frac{\partial \sigma(z)}{\partial z} = -1 \cdot (1 + e^{-z})^{-2} \cdot \frac{\partial}{\partial z}(1 + e^{-z}) $$$$ \frac{\partial \sigma(z)}{\partial z} = - (1 + e^{-z})^{-2} \cdot (-e^{-z}) $$$$ \frac{\partial \sigma(z)}{\partial z} = \frac{e^{-z}}{(1 + e^{-z})^2} $$We can rewrite this expression in terms of $\sigma(z)$:
$$ \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1 + e^{-z} - 1}{(1 + e^{-z})^2} = \frac{1 + e^{-z}}{(1 + e^{-z})^2} - \frac{1}{(1 + e^{-z})^2} $$$$ = \frac{1}{1 + e^{-z}} - \left(\frac{1}{1 + e^{-z}}\right)^2 $$$$ = \sigma(z) - (\sigma(z))^2 $$$$ = \sigma(z)(1 - \sigma(z)) $$So, the derivative of the sigmoid function with respect to $z$ is:
$$ \frac{\partial \hat{y}}{\partial z} = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y}) $$
Derivative of the binary cross entropy¶
The Binary Cross-Entropy loss for a single data point $(x_i, y_i)$ is given by:
$$ L(y_i, \hat{y}_i) = -(y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)) $$
To find the derivative of the loss with respect to the predicted probability $\hat{y}_i$, which is $\frac{\partial L}{\partial \hat{y}_i}$, we differentiate the loss function with respect to $\hat{y}_i$:
$$ \frac{\partial L}{\partial \hat{y}_i} = - \frac{\partial}{\partial \hat{y}_i} (y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)) $$
Using the fact that $\frac{\partial}{\partial x} \log(x) = \frac{1}{x}$:
$$ \frac{\partial L}{\partial \hat{y}_i} = - \left( y_i \frac{\partial}{\partial \hat{y}_i} \log(\hat{y}_i) + (1 - y_i) \frac{\partial}{\partial \hat{y}_i} \log(1 - \hat{y}_i) \right) $$For the second term:
$$ \frac{\partial}{\partial \hat{y}_i} \log(1 - \hat{y}_i) = \frac{1}{1 - \hat{y}_i} \cdot (-1) = - \frac{1}{1 - \hat{y}_i} $$Substituting these back into the derivative of the loss:
$$ \frac{\partial L}{\partial \hat{y}_i} = - \left( y_i \frac{1}{\hat{y}_i} + (1 - y_i) \left( - \frac{1}{1 - \hat{y}_i} \right) \right) $$$$ \frac{\partial L}{\partial \hat{y}_i} = - \left( \frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i} \right) $$To simplify, we find a common denominator:
$$ \frac{\partial L}{\partial \hat{y}_i} = - \left( \frac{y_i (1 - \hat{y}_i) - \hat{y}_i (1 - y_i)}{\hat{y}_i (1 - \hat{y}_i)} \right) $$$$ \frac{\partial L}{\partial \hat{y}_i} = - \left( \frac{y_i - y_i \hat{y}_i - \hat{y}_i + y_i \hat{y}_i}{\hat{y}_i (1 - \hat{y}_i)} \right) $$$$ \frac{\partial L}{\partial \hat{y}_i} = - \left( \frac{y_i - \hat{y}_i}{\hat{y}_i (1 - \hat{y}_i)} \right) $$Finally, distributing the negative sign:
$$ \frac{\partial L}{\partial \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)} $$So, the derivative of the Binary Cross-Entropy loss with respect to the predicted probability is:
$$ \frac{\partial L}{\partial \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)} $$
Derivative of the softmax function¶
In softmax regression, instead of outputting a single probability like in binary logistic regression, we output a probability distribution over multiple classes. The softmax function takes a vector of real numbers (the linear combinations, or "logits", for each class) and converts them into a probability distribution, where the probabilities for all classes sum up to 1.
For a data point $i$ and a class $c$, the output of the softmax function, denoted as $\hat{y}_{i,c}$, is calculated as follows:
$$ \hat{y}_{i,c} = \frac{e^{z_{i,c}}}{\sum_{j=1}^{K} e^{z_{i,j}}} $$Where:
- $z_{i,c}$ is the linear combination of inputs and weights for data point $i$ and class $c$. This is calculated as $z_{i,c} = \mathbf{x}_i \mathbf{w}_c$, where $\mathbf{x}_i$ is the input vector for data point $i$ and $\mathbf{w}_c$ is the weight vector for class $c$.
- $K$ is the total number of classes.
- The denominator is the sum of the exponentiated linear combinations for all classes for data point $i$. This ensures that the output probabilities for all classes sum up to 1.
We will consider two cases:
Case 1: When $j = c$ (The output class is the same as the linear combination class)
We need to find $\frac{\partial \hat{y}_{i,c}}{\partial z_{i,c}}$. Using the quotient rule $\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}$, where $u = e^{z_{i,c}}$ and $v = \sum_{k=1}^{K} e^{z_{i,k}}$:
$\frac{\partial u}{\partial z_{i,c}} = e^{z_{i,c}}$
$\frac{\partial v}{\partial z_{i,c}} = \frac{\partial}{\partial z_{i,c}} \left( e^{z_{i,1}} + e^{z_{i,2}} + \dots + e^{z_{i,c}} + \dots + e^{z_{i,K}} \right) = e^{z_{i,c}}$ (since only the term with $z_{i,c}$ is dependent on $z_{i,c}$)
Now, applying the quotient rule:
$$ \frac{\partial \hat{y}_{i,c}}{\partial z_{i,c}} = \frac{(e^{z_{i,c}})(\sum_{k=1}^{K} e^{z_{i,k}}) - (e^{z_{i,c}})(e^{z_{i,c}})}{(\sum_{k=1}^{K} e^{z_{i,k}})^2} $$We can factor out $e^{z_{i,c}}$ from the numerator:
$$ \frac{\partial \hat{y}_{i,c}}{\partial z_{i,c}} = \frac{e^{z_{i,c}} \left( \sum_{k=1}^{K} e^{z_{i,k}} - e^{z_{i,c}} \right)}{(\sum_{k=1}^{K} e^{z_{i,k}})^2} $$Now, let's rewrite this in terms of $\hat{y}_{i,c}$:
$$ \frac{\partial \hat{y}_{i,c}}{\partial z_{i,c}} = \frac{e^{z_{i,c}}}{\sum_{k=1}^{K} e^{z_{i,k}}} \cdot \frac{\sum_{k=1}^{K} e^{z_{i,k}} - e^{z_{i,c}}}{\sum_{k=1}^{K} e^{z_{i,k}}} = \hat{y}_{i,c} \left( 1 - \frac{e^{z_{i,c}}}{\sum_{k=1}^{K} e^{z_{i,k}}} \right) $$$$ \frac{\partial \hat{y}_{i,c}}{\partial z_{i,c}} = \hat{y}_{i,c} (1 - \hat{y}_{i,c}) $$This result is very similar to the derivative of the sigmoid function!
Case 2: When $j \neq c$ (The output class is different from the linear combination class)
We need to find $\frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}}$ where $j \neq c$. Using the quotient rule again, with $u = e^{z_{i,j}}$ and $v = \sum_{k=1}^{K} e^{z_{i,k}}$:
$\frac{\partial u}{\partial z_{i,c}} = 0$ (since $j \neq c$, $e^{z_{i,j}}$ does not depend on $z_{i,c}$)
$\frac{\partial v}{\partial z_{i,c}} = e^{z_{i,c}}$
Applying the quotient rule:
$$ \frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}} = \frac{(0)(\sum_{k=1}^{K} e^{z_{i,k}}) - (e^{z_{i,j}})(e^{z_{i,c}})}{(\sum_{k=1}^{K} e^{z_{i,k}})^2} $$$$ \frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}} = - \frac{e^{z_{i,j}} e^{z_{i,c}}}{(\sum_{k=1}^{K} e^{z_{i,k}})^2} $$We can rewrite this in terms of softmax outputs:
$$ \frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}} = - \frac{e^{z_{i,j}}}{\sum_{k=1}^{K} e^{z_{i,k}}} \cdot \frac{e^{z_{i,c}}}{\sum_{k=1}^{K} e^{z_{i,k}}} = - \hat{y}_{i,j} \hat{y}_{i,c} $$So, the derivative of the softmax output for class $j$ with respect to the linear combination for class $c$ (when $j \neq c$) is:
$$ \frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}} = - \hat{y}_{i,j} \hat{y}_{i,c} $$In summary, the derivative of the softmax output $\hat{y}_{i,j}$ with respect to the linear combination $z_{i,c}$ is:
$$ \frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}} = \begin{cases} \hat{y}_{i,c}(1 - \hat{y}_{i,c}) & \text{if } j = c \\ -\hat{y}_{i,j}\hat{y}_{i,c} & \text{if } j \neq c \end{cases} $$
Derivative of the Cross entropy function¶
For multi-class classification problems, we use the cross-entropy loss function. For a single data point $i$ and its true class label $y_i$ (represented as a one-hot encoded vector), and the predicted probability distribution $\hat{\mathbf{y}}_i$ (output of the softmax function), the cross-entropy loss is defined as:
$$ \mathcal{L}_{CE} = - \sum_{c=1}^{K} y_{i,c} \log(\hat{y}_{i,c}) $$Where:
- $\mathcal{L}_{CE}$ is the cross-entropy loss for data point $i$.
- $K$ is the total number of classes.
- $y_{i,c}$ is the $c$-th element of the one-hot encoded true label vector for data point $i$. This will be 1 for the true class and 0 for all other classes.
- $\hat{y}_{i,c}$ is the predicted probability that data point $i$ belongs to class $c$ (output of the softmax function for class $c$).
Since $y_{i,c}$ is 1 only for the true class (let's say the true class is $t$), the sum simplifies to just the term for the true class:
$$ \mathcal{L}_{CE} = - \log(\hat{y}_{i,t}) $$where $\hat{y}_{i,t}$ is the predicted probability for the true class $t$. The goal of minimizing this loss function is to maximize the predicted probability of the true class.
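For example, if the model assigns probability $\hat{y}_{i,t} = 0.9$ to the true class, the loss is $-\log(0.9) \approx 0.105$; if it assigns only $\hat{y}_{i,t} = 0.1$, the loss jumps to $-\log(0.1) \approx 2.303$.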
We want to find the partial derivative of $\mathcal{L}_{CE}$ with respect to the predicted probability for a specific class $c$, which is $\frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,c}}$.
$$ \frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,c}} = \frac{\partial}{\partial \hat{y}_{i,c}} \left( - \sum_{j=1}^{K} y_{i,j} \log(\hat{y}_{i,j}) \right) $$We can move the negative sign outside the derivative and the derivative inside the sum:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,c}} = - \sum_{j=1}^{K} \frac{\partial}{\partial \hat{y}_{i,c}} (y_{i,j} \log(\hat{y}_{i,j})) $$Now, we differentiate each term in the sum with respect to $\hat{y}_{i,c}$. Remember that $y_{i,j}$ is a constant with respect to $\hat{y}_{i,c}$.
The derivative $\frac{\partial}{\partial \hat{y}_{i,c}} (y_{i,j} \log(\hat{y}_{i,j}))$ is non-zero only when $j = c$. In this case, the derivative is:
$$ \frac{\partial}{\partial \hat{y}_{i,c}} (y_{i,c} \log(\hat{y}_{i,c})) = y_{i,c} \frac{\partial}{\partial \hat{y}_{i,c}} (\log(\hat{y}_{i,c})) = y_{i,c} \cdot \frac{1}{\hat{y}_{i,c}} $$For all other cases where $j \neq c$, the term $y_{i,j} \log(\hat{y}_{i,j})$ does not contain $\hat{y}_{i,c}$, so its derivative with respect to $\hat{y}_{i,c}$ is 0.
Therefore, the sum simplifies to just the term where $j=c$:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,c}} = - \left( y_{i,c} \cdot \frac{1}{\hat{y}_{i,c}} \right) $$So, the partial derivative of the cross-entropy loss with respect to the predicted probability for class $c$ is:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,c}} = - \frac{y_{i,c}}{\hat{y}_{i,c}} $$This derivative tells us how the cross-entropy loss changes as the predicted probability for a specific class $c$ changes.
Derivation of the Gradient of the softmax function with Cross entropy Loss¶
We are using the chain rule to find the gradient of the loss $\mathcal{L}_{CE}$ with respect to the weight vector for class $c$, $\mathbf{w}_c$:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial \mathbf{w}_c} = \sum_{j=1}^{K} \frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,j}} \times \frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}} \times \frac{\partial z_{i,c}}{\partial \mathbf{w}_c} $$We have the following individual derivatives:
$\frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,j}} = - \frac{y_{i,j}}{\hat{y}_{i,j}}$ (Derivative of Cross-Entropy Loss w.r.t. Softmax Output)
$\frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}} = \begin{cases} \hat{y}_{i,c}(1 - \hat{y}_{i,c}) & \text{if } j = c \\ -\hat{y}_{i,j}\hat{y}_{i,c} & \text{if } j \neq c \end{cases}$ (Derivative of Softmax Output w.r.t. Linear Combination)
$\frac{\partial z_{i,c}}{\partial \mathbf{w}_c} = \mathbf{x}_i^T$ (Derivative of Linear Combination w.r.t. Weights for class $c$)
Now, let's substitute these into the chain rule formula. The sum over $j$ needs to consider the two cases for $\frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}}$:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial \mathbf{w}_c} = \left( \frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,c}} \times \frac{\partial \hat{y}_{i,c}}{\partial z_{i,c}} \times \frac{\partial z_{i,c}}{\partial \mathbf{w}_c} \right) + \sum_{j \neq c}^{K} \left( \frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,j}} \times \frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}} \times \frac{\partial z_{i,c}}{\partial \mathbf{w}_c} \right) $$Let's substitute the derivatives for each part:
For the $j = c$ term: $$ \left( - \frac{y_{i,c}}{\hat{y}_{i,c}} \right) \times \left( \hat{y}_{i,c}(1 - \hat{y}_{i,c}) \right) \times \left( \mathbf{x}_i^T \right) = - y_{i,c}(1 - \hat{y}_{i,c}) \mathbf{x}_i^T $$
For the $j \neq c$ terms: $$ \sum_{j \neq c}^{K} \left( - \frac{y_{i,j}}{\hat{y}_{i,j}} \right) \times \left( - \hat{y}_{i,j}\hat{y}_{i,c} \right) \times \left( \mathbf{x}_i^T \right) = \sum_{j \neq c}^{K} y_{i,j} \hat{y}_{i,c} \mathbf{x}_i^T $$
Now, sum the terms:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial \mathbf{w}_c} = - y_{i,c}(1 - \hat{y}_{i,c}) \mathbf{x}_i^T + \sum_{j \neq c}^{K} y_{i,j} \hat{y}_{i,c} \mathbf{x}_i^T $$Factor out $\mathbf{x}_i^T$:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial \mathbf{w}_c} = \left( - y_{i,c}(1 - \hat{y}_{i,c}) + \sum_{j \neq c}^{K} y_{i,j} \hat{y}_{i,c} \right) \mathbf{x}_i^T $$$$ \frac{\partial \mathcal{L}_{CE}}{\partial \mathbf{w}_c} = \left( - y_{i,c} + y_{i,c}\hat{y}_{i,c} + \sum_{j \neq c}^{K} y_{i,j} \hat{y}_{i,c} \right) \mathbf{x}_i^T $$Since $y_{i,j}$ is 0 for all classes except the true class (let's say the true class is $t$), the sum $\sum_{j \neq c}^{K} y_{i,j}$ is 0 if $c$ is the true class, and 1 if $c$ is not the true class (because only the $y_{i,t}$ term will be 1). Let's simplify the approach.
Let's go back to the form:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial z_{i,c}} = \sum_{j=1}^{K} \frac{\partial \mathcal{L}_{CE}}{\partial \hat{y}_{i,j}} \frac{\partial \hat{y}_{i,j}}{\partial z_{i,c}} $$Substitute the derivatives:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial z_{i,c}} = \sum_{j=1}^{K} \left( - \frac{y_{i,j}}{\hat{y}_{i,j}} \right) \left (\hat{y}_{i,j} (\delta_{jc} - \hat{y}_{i,c}) \right) $$Where $\delta_{jc}$ is the Kronecker delta, which is 1 if $j=c$ and 0 otherwise.
$$ \frac{\partial \mathcal{L}_{CE}}{\partial z_{i,c}} = - \sum_{j=1}^{K} y_{i,j} (\delta_{jc} - \hat{y}_{i,c}) $$$$ \frac{\partial \mathcal{L}_{CE}}{\partial z_{i,c}} = - \sum_{j=1}^{K} (y_{i,j}\delta_{jc} - y_{i,j}\hat{y}_{i,c}) $$$$ \frac{\partial \mathcal{L}_{CE}}{\partial z_{i,c}} = - \left( \sum_{j=1}^{K} y_{i,j}\delta_{jc} - \sum_{j=1}^{K} y_{i,j}\hat{y}_{i,c} \right) $$The term $\sum_{j=1}^{K} y_{i,j}\delta_{jc}$ is equal to $y_{i,c}$ because $\delta_{jc}$ is only non-zero when $j=c$.
The term $\sum_{j=1}^{K} y_{i,j}\hat{y}_{i,c} = \hat{y}_{i,c} \sum_{j=1}^{K} y_{i,j}$. Since $y_{i,j}$ is a one-hot encoded vector for the true class, $\sum_{j=1}^{K} y_{i,j} = 1$. So, $\sum_{j=1}^{K} y_{i,j}\hat{y}_{i,c} = \hat{y}_{i,c}$.
Substituting these back:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial z_{i,c}} = - (y_{i,c} - \hat{y}_{i,c}) = \hat{y}_{i,c} - y_{i,c} $$Now, to get the gradient with respect to the weights $\mathbf{w}_c$, we multiply by $\frac{\partial z_{i,c}}{\partial \mathbf{w}_c}$:
$$ \frac{\partial \mathcal{L}_{CE}}{\partial \mathbf{w}_c} = \frac{\partial \mathcal{L}_{CE}}{\partial z_{i,c}} \times \frac{\partial z_{i,c}}{\partial \mathbf{w}_c} = (\hat{y}_{i,c} - y_{i,c}) \mathbf{x}_i^T $$This is the gradient of the cross-entropy loss with respect to the weight vector for class $c$.
So, the gradient for the weights of class $c$ is:
$$ \nabla_{\mathbf{w}_c} \mathcal{L}_{CE} = (\hat{y}_{i,c} - y_{i,c}) \mathbf{x}_i^T $$
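As a final sanity check (a small script of our own, not part of the exercises), we can verify this analytic gradient against a finite-difference approximation of the cross entropy loss:
import numpy as np
rng = np.random.default_rng(0)
x_i = rng.normal(size=5)                   # one datapoint (4 features + bias)
y_i = np.array([0.0, 1.0, 0.0])            # one-hot true label, K = 3
W = rng.normal(size=(5, 3))

def loss(W):
    z = x_i @ W
    y_hat = np.exp(z) / np.sum(np.exp(z))
    return -np.sum(y_i * np.log(y_hat))

# analytic gradient: outer product x_i (y_hat - y_i)^T
z = x_i @ W
y_hat = np.exp(z) / np.sum(np.exp(z))
grad_analytic = np.outer(x_i, y_hat - y_i)

# central finite differences, one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[r, c] += eps
        W_minus[r, c] -= eps
        grad_numeric[r, c] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # should be tiny, ~1e-9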