Welcome back! In the last lesson we discussed various forms of regression, saw how they work, and implemented them. One question that came up was: after "training" on a dataset in the way that we did, will the model generalize to other datasets? The answer is, generally, no, it will not. In this tutorial we will discuss various methods to improve generalization to unseen data.
Important notes¶
- It is fine to consult with colleagues to solve the problems; in fact, it is encouraged.
- Please turn off AI tools; we want you to internalize the concepts rather than just breeze through the problems. To turn them off, click the gear in the top right corner, go to AI assistance, untick "Show AI powered inline completions", untick "Consented to use generative AI features", and tick "Hide Generative AI features".
Lesson 2: Model Complexity and Regularization¶
In machine learning, model complexity refers to the intricacy of the function that a model can learn from the data. A complex model (like a large neural network) can capture more nuanced relationships and patterns, while a simple model (such as linear regression) is more restricted in what it can represent.
The complexity of a model is often related to:
- Number of parameters: Models with more parameters (like a deep neural network with many layers and neurons) generally have higher complexity.
- Type of model: Some models, by their nature, are more complex than others.
- Number of features: Using a larger number of features can also increase model complexity, as the model needs to learn relationships across more dimensions.
The level of model complexity has a significant impact on how well the model generalizes to new, unseen data, and is closely tied to the concepts of bias and variance. We will first explain these concepts and then tie them into overfitting and underfitting.
We will start with an example of overfitting.
2.1 An example of overfitting¶
In the last lesson we discussed and implemented a variety of regression methods: linear, logistic, and softmax regression. We applied them to a variety of datasets, but we did not see how well they generalized to similar data outside of the dataset we trained on. The primary goal of training a model is for it to perform well on data that it has not seen yet.
Let's repeat our experiment from last time and see what happens if we train on part of the data and test on the held-out remainder. This time, of course, we are going to use a library implementation of the algorithm.
Exercise 2.1: Observe what happens¶
Run the code below, as is. What do you observe?
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
import numpy as np
# load the breast cancer dataset from sklearn as X, y
X, y = datasets.load_breast_cancer(return_X_y=True)
# shuffle X and y
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y = y[indices]
# use the first 200 datapoints for training and the remaining ones for testing
x_train, y_train = X[:200], y[:200]
x_test, y_test = X[200:], y[200:]
regression_model = LogisticRegression(
    solver='lbfgs',
    penalty=None,
    max_iter=10000
)
regression_model.fit(x_train, y_train)
train_score = regression_model.score(x_train, y_train)
print(f'The accuracy on the training dataset is: {train_score*100.} %')
test_score = regression_model.score(x_test, y_test)
print(f'The accuracy on the test dataset is: {test_score*100.} %')
The accuracy on the training dataset is: 96.5 %
The accuracy on the test dataset is: 95.39295392953929 %
What you may notice (remember the shuffle is random, so results vary a bit between runs) is that accuracy is higher, sometimes much higher, on the training set than on the test set. This phenomenon is called overfitting. In layman's terms, overfitting means that your model has enough complexity to essentially memorize the training dataset rather than learn the underlying pattern.
Exercise 2.2: Adding regularization, the easy way!¶
Add regularization to the model to prevent overfitting. You can change the penalty parameter of the LogisticRegression class from None to 'l2'.
Question: What happens to the accuracy on the training set and on the test set?
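For reference, the change amounts to a single argument in the cell above (shown as a sketch; every other argument stays the same as in Exercise 2.1):

```python
regression_model = LogisticRegression(
    solver='lbfgs',
    penalty='l2',      # switched from None to 'l2' to add L2 regularization
    max_iter=10000
)
```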
2.2 Overfitting, Underfitting, and the bias-variance tradeoff¶
As you saw in the previous exercise, a model can perform very well on the data it was trained on, but poorly on new, unseen data. This is overfitting. Think of it like memorizing the answers to a test instead of understanding the concepts – you'll do great on that specific test, but struggle with a slightly different one. In the context of regression, an overfitted model has learned the noise and specific quirks of the training data, rather than the underlying pattern.
On the other hand, underfitting occurs when your model is too simple to capture the underlying patterns in the data. It's like trying to fit a straight line to data sampled from a parabola – the model doesn't even perform well on the training data, let alone new data.
The bias-variance tradeoff is a core concept for understanding model performance. Bias refers to the error introduced by approximating a real-world problem with a simplified model. A high-bias model (like a simple linear regression on complex data) makes strong assumptions and might underfit; this results in a systematic deviation of your predictions in one direction or the other. Variance refers to the sensitivity of the model to small changes in the training data: if you re-train your model on different versions of a dataset, its predictions at test time will vary from one version to the next. A high-variance model (like a very complex model with many parameters trained on limited data) can capture noise and might overfit. See the figure below for intuition.
The reason why it is called the bias-variance tradeoff is because there is an "ideal" model complexity, where increasing the complexity would result in more variance, and reducing the model complexity would lead to increasing bias. One of the cool things is that we can actually mathematically prove that our error decomposes into bias and variance.
I will not go into the proof here, because it involves a greater level of comfort with probability theory than would be warranted for a beginner course. But I will state the resulting formula here and point out the relevant terms. You do not need to obsess over it, but making sure that you understand the trade-off is important.
$$ \mathbb{E}[\underbrace{(y - \hat{f}(x))^2}_{\text{Squared Error}}]= \underbrace{(f(x) -\mathbb{E}[\hat{f}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) -\mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}} $$
- $f(x)$ is the function that we are trying to estimate by developing a model. It is observed with noise added to it, which is why we end up with the $\sigma^2$ term.
- $\hat{f}(x)$ is the model that we are using to estimate the function $f(x)$.
- $\mathbb{E}[\cdot]$ is called the expectation. In math terms it is the sum of each possible value multiplied by its probability (or, in the continuous case, an integral over those values). Put more practically, it is the average you would obtain over many repeated draws; a tiny worked example follows this list.
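As a tiny worked example of the expectation of a discrete variable, take a fair six-sided die:
$$ \mathbb{E}[X] = \sum_i p_i x_i = \tfrac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5 $$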
From this equation you should be able to see that bias is about how far your average prediction is from the true function, while variance is about how much your predictions fluctuate around their own average. Try to connect this to the picture above.
If you don't feel comfortable with the explanation, don't worry too much about it. Just remember the picture above and the next picture!
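If you would like to see the decomposition in action, below is a small optional simulation sketch (not part of the exercises; the underlying function, noise level, and polynomial degrees are arbitrary choices). It repeatedly fits a simple and a complex model to freshly sampled training sets and estimates the bias and variance of their predictions at a single test point:

```python
import numpy as np

# Optional sketch: estimate bias^2 and variance at one test point by refitting
# a simple (degree 1) and a complex (degree 9) polynomial on many freshly
# sampled training sets drawn from the same noisy sine function.
rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)

x0 = 2.5                     # test point at which we inspect the predictions
sigma = 0.3                  # noise standard deviation
n_repeats, n_train = 500, 15

preds = {1: [], 9: []}       # polynomial degree -> predictions at x0
for _ in range(n_repeats):
    x = rng.uniform(0, 5, n_train)
    y = true_f(x) + rng.normal(0, sigma, n_train)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
        preds[degree].append(np.polyval(coeffs, x0))

for degree, p in preds.items():
    p = np.array(p)
    bias_sq = (p.mean() - true_f(x0)) ** 2           # squared bias at x0
    variance = p.var()                               # variance of the predictions
    print(f'degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}')
```

You should see that the simple model has the larger bias while the complex model has the larger variance, which is exactly the tradeoff described above.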
2.3 Demonstration of Overfitting & Underfitting¶
A common and intuitive way to demonstrate overfitting is by fitting a high-degree polynomial to a limited number of data points. As a reminder, a polynomial is simply an expression consisting of variables and coefficients that involves only the operations of addition, subtraction, multiplication, and non-negative integer exponents of variables. In other words, it's a sum of terms like $ax^k$, where $a$ is a number and $k$ is a non-negative whole number. For instance, $2x^2 - 5x + 1$ is a polynomial.
Consider a dataset with $N$ data points $(x_i, y_i)$ for $i=1, \dots, N$. If we attempt to fit a polynomial of degree $N-1$ to this dataset, the model can be written as:
$$ P(x) = a_{N-1}x^{N-1} + a_{N-2}x^{N-2} + \dots + a_1x + a_0 $$
To find the coefficients $a_0, a_1, \dots, a_{N-1}$, we set up a system of $N$ linear equations based on the data points:
$$ \begin{pmatrix} 1 & x_1 & x_1^2 & \dots & x_1^{N-1} \\ 1 & x_2 & x_2^2 & \dots & x_2^{N-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \dots & x_N^{N-1} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{N-1} \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} $$
This system of equations has a unique solution for the coefficients $a_i$ (provided the $x_i$ values are distinct). This means that there exists a unique polynomial of degree at most $N-1$ that passes exactly through all $N$ data points.
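As an optional sketch of this construction (the data values below are made up purely for illustration), we can build the matrix above with numpy and solve the linear system directly:

```python
import numpy as np

# Made-up data: N points, interpolated by a polynomial of degree N-1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])
N = len(x)

# Columns are 1, x, x^2, ..., x^(N-1), matching the matrix in the text
A = np.vander(x, N, increasing=True)
a = np.linalg.solve(A, y)                        # coefficients a_0, ..., a_{N-1}

# The interpolating polynomial reproduces every training target exactly
print(np.allclose(np.polyval(a[::-1], x), y))    # True
```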
When this polynomial is used as a model, it will have zero error on the training data (the original $N$ points). However, this does not guarantee good performance on new, unseen data. The polynomial has essentially "memorized" the training data, including any noise or random fluctuations. On data points that are not part of the original training set, the fitted polynomial can deviate significantly from the true underlying relationship, leading to high test error. This discrepancy between perfect performance on the training data and poor performance on test data is the hallmark of overfitting.
In contrast, a model with a lower degree polynomial might not fit the training data perfectly, but it would likely generalize better to new data by capturing the overall trend rather than the specific details of the training points.
Exercise 2.3: Visually showing overfitting¶
Play around with the code below; in particular, vary the NUMBER_OF_DATAPOINTS parameter (try to keep it between 5 and 10). The fitted curve should pass through every training datapoint, since the polynomial degree is set to the number of datapoints minus one. Once you are done with that, flip the plot_test_data boolean to check whether test datapoints, sampled with the same procedure, also lie on the fitted red line.
Exercise 2.4: Visually showing underfitting¶
To demonstrate underfitting, increase the NUMBER_OF_DATAPOINTS to about 100. Then set the POLYNOMIAL_DEGREE to 0 or 1. You should see a plot where we are trying to estimate a noisy sine wave with a line, which is something that you probably shouldn't do.
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
import numpy as np
## VARY THE NUMBER OF DATAPOINTS between 5 and 10
NUMBER_OF_DATAPOINTS = 5
POLYNOMIAL_DEGREE = NUMBER_OF_DATAPOINTS - 1
## EXERCISE: Sample some new data and see if it fits the line. Simply flip the boolean
plot_test_data = True
# Generate toy data
# np.random.seed(0)
X = np.sort(5 * np.random.rand(NUMBER_OF_DATAPOINTS, 1), axis=0)
y = 10*np.sin(X).ravel() + 2* np.random.normal(0, 1, X.shape[0])
# Fit a high-degree polynomial model (likely to overfit)
poly_model = make_pipeline(PolynomialFeatures(POLYNOMIAL_DEGREE), LinearRegression())
poly_model.fit(X, y)
# Generate data for plotting the fitted curve
X_plot = np.linspace(0, 5, 100).reshape(-1, 1)
y_plot = poly_model.predict(X_plot)
# Plot the data and the overfitted regression line
plt.figure(figsize=(8, 6))
plt.scatter(X, y, label='Train Data')
plt.plot(X_plot, y_plot, color='red', label=f'Polynomial (degree {POLYNOMIAL_DEGREE})')
if plot_test_data:
    X_test = np.sort(5 * np.random.rand(20, 1), axis=0)
    y_test = 10*np.sin(X_test).ravel() + 2*np.random.normal(0, 1, X_test.shape[0])
    plt.scatter(X_test, y_test, label='Test Data', color='green', marker='*')
plt.title('Overfitting Example')
plt.xlabel('X')
plt.ylabel('y')
plt.ylim(-20, 20)
plt.legend()
plt.grid(True)
plt.show()
2.4 Regularization¶
So how do we fix overfitting? As we saw in the earlier exercise, you can add a penalty term to your logistic regression. In this section we will explain how regularization works in the context of linear regression. This matters because neural networks are complicated models that can easily overfit on data, and some of the techniques used to regularize linear and logistic regression are exactly the same as those used in simple neural networks.
Background: Maximum Likelihood Estimation (MLE) vs. Maximum A Posteriori (MAP)¶
Up until now, the models we've discussed implicitly use a principle called Maximum Likelihood Estimation (MLE). In MLE, the goal is to find the parameters of a model that maximize the probability of observing the given training data. In other words, we choose the model parameters that make the observed data most "likely".
Let's consider a simple example: a coin toss. Suppose we want to estimate the probability of getting heads ($\theta$) for a potentially biased coin. We flip the coin 5 times and observe the sequence H, H, T, H, T.
Using MLE, we want to find the value of $\theta$ that maximizes the probability of this specific sequence. The probability of this sequence is given by $\theta^3 (1-\theta)^2$ (since we got 3 heads and 2 tails). To find the $\theta$ that maximizes this, we recognize that the maximum occurs at the observed frequency of heads. In this case, the MLE estimate for $\theta$ is $3/5 = 0.6$.
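For completeness, the maximization can be made explicit by taking the logarithm and setting the derivative with respect to $\theta$ to zero:
$$ \frac{d}{d\theta} \log\left[\theta^3 (1-\theta)^2\right] = \frac{3}{\theta} - \frac{2}{1-\theta} = 0 \quad\Longrightarrow\quad 3(1-\theta) = 2\theta \quad\Longrightarrow\quad \theta = \frac{3}{5} $$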
Now, imagine we only flipped the coin once and got heads (H). The MLE estimate for $\theta$ would be $1/1 = 1.0$. This suggests the coin will always land on heads, which feels like an overconfident conclusion based on a single observation, especially if we suspect the coin is likely fair. This is where MLE can lead to overfitting – it perfectly fits the observed (limited) data but might not generalize well to future coin flips.
Maximum A Posteriori (MAP) Estimation addresses this by incorporating prior knowledge about the parameters. Instead of just maximizing the likelihood of the data, MAP maximizes the posterior probability of the parameters given the data. This posterior probability is calculated using Bayes' Rule:
$$ P(\theta|\text{Data}) = \frac{P(\text{Data}|\theta) \times P(\theta)}{P(\text{Data})} $$
Let's break down the terms in Bayes' Rule:
- $P(\theta|\text{Data})$: This is the posterior probability. It's the probability of the parameter $\theta$ given the observed data. This is what MAP aims to maximize.
- $P(\text{Data}|\theta)$: This is the likelihood. It's the probability of observing the data given a specific value of the parameter $\theta$. MLE focuses solely on maximizing this term.
- $P(\theta)$: This is the prior probability. It represents our beliefs about the probability distribution of the parameter $\theta$ before we observe any data.
- $P(\text{Data})$: This is the evidence (or marginal likelihood). It's the overall probability of observing the data, averaged over all possible values of $\theta$. For MAP estimation, this term is a normalizing constant and doesn't depend on $\theta$, so we can ignore it when maximizing with respect to $\theta$. This is why we often see the proportional form: $P(\theta|\text{Data}) \propto P(\text{Data}|\theta) \times P(\theta)$, where $\propto$ is the "proportional to" sign. It indicates that the posterior increases whenever the product of likelihood and prior increases, and vice versa.
Going back to our single coin flip resulting in heads. If we use a prior that favors a fair coin (e.g., a Beta distribution that gives higher probability to values of $\theta$ closer to 0.5), the MAP estimate will be a compromise between the observed data (which suggests $\theta=1.0$ based on the likelihood) and our prior belief (which suggests $\theta \approx 0.5$). The resulting MAP estimate for $\theta$ will be closer to 0.5 than the MLE estimate of 1.0, providing a more regularized and less overfitted estimate.
In essence, the prior in MAP acts as a form of regularization, pulling the parameter estimates towards values that are considered more likely a priori, which can help prevent overfitting, especially when the amount of data is small.
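To make this concrete, here is a small optional sketch that compares the MLE and MAP estimates for the single-flip example on a grid of $\theta$ values. The Beta(5, 5) prior is just one arbitrary choice that favours a fair coin:

```python
import numpy as np
from scipy.stats import beta

# Single coin flip that came up heads
heads, tails = 1, 0

# Evaluate likelihood, prior, and (unnormalized) posterior on a grid of theta values
theta = np.linspace(1e-3, 1 - 1e-3, 1001)
likelihood = theta**heads * (1 - theta)**tails   # P(Data | theta)
prior = beta.pdf(theta, 5, 5)                    # Beta(5, 5) prior favouring theta near 0.5
posterior = likelihood * prior                   # proportional to P(theta | Data)

print('MLE estimate:', theta[np.argmax(likelihood)])   # close to 1.0
print('MAP estimate:', theta[np.argmax(posterior)])    # pulled towards 0.5 (about 0.56)
```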
Background: Deriving Regularized Linear Regression¶
Regularization in linear regression can be understood from a Maximum A Posteriori (MAP) perspective. Let's consider the standard linear regression model:
$$ y = \mathbf{w}^T \mathbf{x} + \epsilon $$
where $\mathbf{w}$ is the vector of weights, $\mathbf{x}$ is the vector of features, and $\epsilon$ is the error term, typically assumed to be normally distributed with mean 0 and variance $\sigma^2$, i.e., $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
From this assumption, the likelihood of observing a single data point $(x_i, y_i)$ given the weights $\mathbf{w}$ (the quantity maximized in MLE) can be expressed as sampling from a Gaussian distribution with mean $\mathbf{w}^T \mathbf{x}_i$ and variance $\sigma^2$:
$$ y_i \sim \mathcal{N}(\mathbf{w}^T \mathbf{x}_i, \sigma^2) $$
To get the MAP estimate and derive the analytical solution to regularized linear regression, we put a prior on the weights, such that:
$$ P(\mathbf{w}|\mathbf{Y}, \mathbf{X}) \propto P(\mathbf{Y}|\mathbf{X}, \mathbf{w}) P(\mathbf{w}) $$
The mathematical details of the derivation are in the appendix.
The analytical solution for the weights $\mathbf{w}$ in L2 regularized linear regression (Ridge Regression) is given by:
$$ \mathbf{w} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} \quad \quad \quad (1) $$
Where:
- $\mathbf{w}$ is the vector of learned weights.
- $\mathbf{X}$ is the design matrix (each row is a data point's features, with a column of ones for the intercept if included). It is just the data matrix we had before, possibly with some feature transformation applied to it.
- $\mathbf{I}$ is the identity matrix of the same dimension as $\mathbf{X}^T \mathbf{X}$.
- $\lambda$ is the regularization hyperparameter ($\lambda \ge 0$).
- $\mathbf{y}$ is the vector of target values.
Exercise 2.5: Implement Regularized linear regression¶
Below we have copied most of the solution to the linear regression problem from the first notebook. The exercise is simple: implement equation $(1)$ to solve the regularized linear regression problem.
Next, plot the two lines of w_MLE (linear regression) and w_MAP (regularized linear regression).
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# GENERATE TOY DATA
number_of_datapoints = 20
regularization_strength = 1.0
# generate the matrix of data (X)
X = np.random.randn(number_of_datapoints, 2)
X[:, 0] = 1.
# fake weights that we are going to estimate
w_real = np.array([0.5, 1])
# we define y ourselves, so we know what the correct answer is!
# EXTRA info: we add Gaussian noise here, so y is not perfectly predictable from X
y = w_real[0]*X[:, 0] + w_real[1]*X[:, 1] + np.random.randn(number_of_datapoints, 1).squeeze()
# TODO: Implement this! See the regularized linear regression equation above
w_MAP = None
# Don't change this!
w_MLE = np.linalg.inv(X.T @ X ) @ X.T @ y
def plot_data(X, y, w_MLE, w_MAP):
    '''
    Plots the generated data together with the MLE and MAP regression lines.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        The input data. n_samples is the number of data points and n_features in our case is 2!
    y : array-like, shape (n_samples,)
        The target values.
    w_MLE : array-like, shape (n_features,)
        The MLE weights.
    w_MAP : array-like, shape (n_features,)
        The MAP weights.
    '''
    n_samples, n_features = X.shape
    assert n_features == 2, "X should have 2 features!"
    # Plot the generated data
    plt.scatter(X[:, 1], y, label='Generated Data')
    # Plot the fitted lines
    x_fake = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
    y_MLE = w_MLE[0] + w_MLE[1] * x_fake
    y_MAP = w_MAP[0] + w_MAP[1] * x_fake
    plt.plot(x_fake, y_MLE, color='red', label='MLE estimate')
    plt.plot(x_fake, y_MAP, color='green', label='MAP estimate')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title('MLE vs MAP estimate of linear regression')
    plt.legend()
    plt.grid(True)
    plt.show()
plot_data(X, y, w_MLE, w_MAP)
You should see the MLE and MAP estimates deviate from one another. If so, good job!
2.5 Regularization: L1, L2, and Beyond¶
Why do they have different effects?¶
The difference in how L1 and L2 regularization affect the weights can be visualized by looking at the "constraint" regions they impose in the weight space. The optimization problem for regularized linear regression can be viewed as minimizing the sum of squared errors subject to a constraint on the norm of the weight vector.
For L2 regularization (Ridge), the constraint is $\|\mathbf{w}\|^2_2 \le C$, where $C$ is a constant related to the regularization strength $\lambda$. In 2D (with weights $w_1$ and $w_2$), this constraint $w_1^2 + w_2^2 \le C$ defines a circle centered at the origin.
For L1 regularization (Lasso), the constraint is $\|\mathbf{w}\|_1 \le C$. In 2D, this constraint $|w_1| + |w_2| \le C$ defines a diamond shape (a square rotated by 45 degrees) centered at the origin.
The standard linear regression objective function (sum of squared errors) has elliptical contour lines in the weight space, centered at the MLE solution (the weights that minimize the error without any regularization). The goal of regularized regression is to find the point within the constraint region (the circle for L2, the diamond for L1) that minimizes the sum of squared errors. This optimal point is where the contour lines of the loss function first intersect the boundary of the constraint region.
Due to the rounded shape of the L2 constraint (circle), the intersection point with the elliptical contours of the loss function is less likely to occur exactly on an axis (where one of the weights is zero). The weights are shrunk towards the origin, but typically all weights remain non-zero.
In contrast, the L1 constraint (diamond) has sharp corners located on the axes. When the elliptical contours of the loss function expand from the center (the MLE solution), they are more likely to intersect the L1 constraint boundary at one of these corners. An intersection at a corner means that the weight corresponding to that axis is zero. This geometric property is why L1 regularization tends to produce sparse solutions, where many weights are exactly zero, effectively performing feature selection.
Visualizing the contours of the loss function and the constraint regions helps to understand why L2 regularization shrinks weights and L1 regularization promotes sparsity by driving some weights to zero.
To help you, the geometry of both the L1 and L2 constraints has been plotted below.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_style("darkgrid")
# Create a figure and axes
fig, ax = plt.subplots(figsize=(6, 6))
# L2 Norm (Circle)
theta = np.linspace(0, 2*np.pi, 100)
x_l2 = np.cos(theta)
y_l2 = np.sin(theta)
ax.plot(x_l2, y_l2, label='L2 Norm (Unit Circle)')
# L1 Norm (Diamond/Square in 2D)
x_l1 = np.array([1, 0, -1, 0, 1])
y_l1 = np.array([0, 1, 0, -1, 0])
ax.plot(x_l1, y_l1, label='L1 Norm (Unit "Diamond")')
# Add origin
ax.plot(0, 0, 'ko')
# Add axes lines through the origin
ax.axhline(0, color='grey', lw=0.5)
ax.axvline(0, color='grey', lw=0.5)
# Set limits and aspect ratio
ax.set_xlim([-1.5, 1.5])
ax.set_ylim([-1.5, 1.5])
ax.set_xlabel('w1')
ax.set_ylabel('w2')
ax.set_title('Geometry of L1 and L2 Norms in 2D')
ax.set_aspect('equal', adjustable='box')
# Set major ticks and grid
major_ticks = [-1, 0, 1]
ax.set_xticks(major_ticks)
ax.set_yticks(major_ticks)
ax.grid(True, which='major')
# Remove minor ticks and grid
ax.minorticks_off()
ax.legend()
plt.show()
Other Forms of Regularization¶
While L1 and L2 are the most common, other forms of regularization exist for linear regression, such as:
- Elastic Net: This combines both L1 and L2 penalties: $\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|^2_2$. It aims to get the benefits of both Lasso (feature selection) and Ridge (grouping effects for correlated features); see the short sketch below.
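Below is an optional scikit-learn sketch comparing the coefficients found with the L2, L1, and Elastic Net penalties on synthetic data; the synthetic weights and the alpha values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data where only the first two features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_w + rng.normal(scale=0.5, size=100)

# Compare the learned coefficients: Lasso drives many of them to exactly zero,
# Ridge only shrinks them, and Elastic Net sits in between
for name, model in [('Ridge', Ridge(alpha=1.0)),
                    ('Lasso', Lasso(alpha=0.1)),
                    ('ElasticNet', ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model.fit(X, y)
    print(f'{name:>10}: {np.round(model.coef_, 2)}')
```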
These techniques can also be used for neural networks. However, additional techniques specific to neural networks exist as well, for example Dropout and Batch Normalization. These will be explained in the neural-network-specific tutorials.
In summary, regularization is a crucial technique to combat overfitting by adding a penalty to the model's complexity, typically based on the norm of the weight vector. The choice of norm (L1, L2, or a combination) influences how the weights are penalized and can have different effects on the model, such as promoting sparsity (Lasso) or shrinking all weights (Ridge).
This concludes this brief tutorial on regularization in linear regression. Next tutorial, we are going to do something more exciting. Implement our very first neural network!
Appendix¶
The Gaussian Distribution¶
The Gaussian distribution, also known as the normal distribution or bell curve (the special case with mean 0 and standard deviation 1 is called the standard normal distribution), describes how a continuous variable is distributed around a central value.
Think of it as describing the likelihood of observing different values for something that tends to cluster around an average, with values further from the average becoming less likely. Many natural phenomena, like people's heights, test scores, or measurement errors, often follow a pattern that can be approximated by a Gaussian distribution.
The shape of the Gaussian distribution is determined by two parameters:
- Mean ($\mu$): This is the center of the distribution, representing the average value. The peak of the bell curve is located at the mean.
- Variance ($\sigma^2$): This measures the spread or dispersion of the distribution. A larger variance means the data is more spread out, and the bell curve will be wider and flatter. The standard deviation ($\sigma$) is the square root of the variance and is also used to describe the spread.
The formula for the Probability Density Function (PDF) of a one-dimensional Gaussian distribution is:
$$ f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$
Where:
- $x$ is the value of the variable.
- $\mu$ is the mean of the distribution.
- $\sigma^2$ is the variance of the distribution.
- $\exp$ is the exponential function.
This formula tells you the relative likelihood of observing a specific value $x$, given the mean and variance of the distribution. The higher the value of $f(x)$, the more likely it is to observe that value.
In statistical texts, you will often see this written in a more compact notation. For example, $X \sim \mathcal{N}(\mu, \sigma^2)$ means that the variable $X$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. You might also see it written to show the probability of observing a specific value $x$ given the parameters, like $P(x | \mu, \sigma^2) = \mathcal{N}(x | \mu, \sigma^2)$. In the context of regression, this is often used to describe the distribution of the target variable given the input features and model parameters, for instance, $y \sim \mathcal{N}(\mathbf{x}^T \mathbf{w}, \sigma^2)$, indicating that $y$ is normally distributed with a mean determined by the linear combination of features and weights, and a variance $\sigma^2$.
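As an optional sanity check of the PDF formula above (scipy is used only for comparison; the particular values of $x$, $\mu$, and $\sigma^2$ are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma2):
    """Evaluate the Gaussian PDF formula from the text."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Arbitrary example values
x, mu, sigma2 = 1.2, 0.0, 4.0
print(gaussian_pdf(x, mu, sigma2))                  # direct use of the formula
print(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))   # same value from scipy
```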
Deriving Regularized Linear Regression¶
Recall the standard linear regression model:
$$ y = \mathbf{w}^T \mathbf{x} + \epsilon $$
where $\mathbf{w}$ is the vector of weights, $\mathbf{x}$ is the vector of features, and $\epsilon$ is the error term, typically assumed to be normally distributed with mean 0 and variance $\sigma^2$, i.e., $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
The Maximum Likelihood derivation¶
From this assumption, the likelihood of observing a single data point $(x_i, y_i)$ given the weights $\mathbf{w}$ can be expressed as sampling from a Gaussian distribution where the mean is $\mathbf{w}^T \mathbf{x}_i$ and the variance is $\sigma^2$:
$$ y_i \sim \mathcal{N}(\mathbf{w}^T \mathbf{x}_i, \sigma^2) $$
Assuming the data points are independent, the likelihood of the entire dataset $\mathbf{Y} = \{y_1, \dots, y_N\}$ given the features $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and weights $\mathbf{w}$ is the product of the individual likelihoods:
$$ P(\mathbf{Y}|\mathbf{X}, \mathbf{w}) = \prod_{i=1}^N P(y_i|\mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\right) $$
In Maximum Likelihood Estimation (MLE), we aim to maximize this likelihood with respect to $\mathbf{w}$, which is equivalent to maximizing the log-likelihood:
$$ \log P(\mathbf{Y}|\mathbf{X}, \mathbf{w}) = \sum_{i=1}^N \left[ -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} \right] $$
To maximize this, we can minimize the negative log-likelihood. Dropping the constant terms and the scaling factor $1/(2\sigma^2)$ (since they don't depend on $\mathbf{w}$), we are left with minimizing the sum of squared errors:
$$ \text{Minimize: } \sum_{i=1}^N (y_i - \mathbf{w}^T \mathbf{x}_i)^2 $$
This is the standard objective function for ordinary least squares linear regression, which is the result of MLE under the assumption of Gaussian noise.
The Maximum A Posteriori derivation¶
Let's add a prior distribution on the weights $\mathbf{w}$ for Maximum A Posteriori (MAP) estimation. A common choice for the prior is a zero-mean Gaussian distribution:
$$ P(\mathbf{w}) = \prod_{j=1}^D \frac{1}{\sqrt{2\pi\alpha^2}} \exp\left(-\frac{w_j^2}{2\alpha^2}\right) = \left(\frac{1}{2\pi\alpha^2}\right)^{D/2} \exp\left(-\frac{\|\mathbf{w}\|^2}{2\alpha^2}\right) $$
where $D$ is the number of features and $\alpha^2$ is the variance of the prior.
In MAP estimation, we want to maximize the posterior probability $P(\mathbf{w}|\mathbf{Y}, \mathbf{X})$, which is proportional to the likelihood times the prior:
$$ P(\mathbf{w}|\mathbf{Y}, \mathbf{X}) \propto P(\mathbf{Y}|\mathbf{X}, \mathbf{w}) P(\mathbf{w}) $$
Maximizing the posterior is equivalent to maximizing the log-posterior:
$$ \log P(\mathbf{w}|\mathbf{Y}, \mathbf{X}) \propto \log P(\mathbf{Y}|\mathbf{X}, \mathbf{w}) + \log P(\mathbf{w}) $$
Substituting the log-likelihood and log-prior:
$$ \log P(\mathbf{w}|\mathbf{Y}, \mathbf{X}) \propto \sum_{i=1}^N \left[ -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} \right] + \left[ -\frac{D}{2}\log(2\pi\alpha^2) - \frac{\|\mathbf{w}\|^2}{2\alpha^2} \right] $$
To maximize this, we can minimize the negative log-posterior. Again, dropping constant terms that don't depend on $\mathbf{w}$:
$$ \text{Minimize: } \sum_{i=1}^N \frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} + \frac{\|\mathbf{w}\|^2}{2\alpha^2} $$
Multiplying by $2\sigma^2$ (which is a positive constant and doesn't change the location of the minimum):
$$ \text{Minimize: } \sum_{i=1}^N (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \frac{\sigma^2}{\alpha^2} \|\mathbf{w}\|^2 $$
Let $\lambda = \frac{\sigma^2}{\alpha^2}$. This gives the objective function for L2 Regularized Linear Regression (Ridge Regression):
$$ \text{Minimize: } \sum_{i=1}^N (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|^2 $$
The term $\lambda \|\mathbf{w}\|^2 = \lambda \sum_{j=1}^D w_j^2$ is the L2 regularization term. It penalizes large weights, encouraging the model to use smaller weights. The hyperparameter $\lambda$ controls the strength of the regularization. A larger $\lambda$ means stronger regularization, shrinking the weights more towards zero.
This derivation shows that L2 regularization in linear regression is equivalent to performing MAP estimation with a Gaussian prior on the weights. The prior expresses our belief that the weights are likely to be close to zero, which helps prevent overfitting by regularizing the model complexity.
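To get from this objective to the closed-form solution quoted in the main text, write it in matrix form and set its gradient with respect to $\mathbf{w}$ to zero:
$$ J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2, \qquad \nabla_{\mathbf{w}} J = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\lambda\mathbf{w} = \mathbf{0} \quad\Longrightarrow\quad (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\,\mathbf{w} = \mathbf{X}^T\mathbf{y} $$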
The final vectorized representation of the MAP estimate of linear regression is:
$$ \mathbf{w} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} $$