In the last class we discussed implementing convolutional neural networks for classification problems. However, our discussion of how to properly evaluate classification problems has been limited. Evaluating classification problems properly matters to us for two reasons. First, it helps us understand fundamental concepts which will also be used when discussing the evaluation of other problems, such as segmentation and detection. Second, because we are in an industry where great costs can be associated with small mistakes, we need to be rigorous and comprehensive in our evaluation.

Important notes

It is fine to consult with colleagues to solve the problems; in fact, it is encouraged.
Please turn off AI tools. We want you to memorize concepts and not just quickly breeze through the problems. To turn off AI, click on the gear in the top right corner and go to AI assistance. Then untick "Show AI powered inline completions", untick "consented to use generative AI features", and tick "Hide Generative AI features".

Additional Note

This notebook is designed to be a cheat sheet. If you ever have to discuss classification problems and their evaluation with a machine learning developer or researcher, you can use this notebook to quickly look up how evaluation for your specific classification problem works!

Lesson 7: Evaluating Classification Problems

Name: Deep Learning & Medical Image Analysis
Author: Riaan Zoetmulder

In this class we will start with the simplest classification problem: "Binary Classification". In the case of binary classification, there are only two classes; 0 and 1. We will start with basic metrics, their definition, and their interpretation. We will also discuss limitations of these metrics.

Then, we will continue to discuss evaluation in the case of class imbalance. What if, instead of your dataset being divided 50/50 into positive and negative classes, the split is 1/99?

After discussing class imbalance we will briefly discuss metrics where a different cost is associated with getting things right or wrong.

Next, we will generalize the concepts we have learned from binary classification to multi-class classification. We will then discuss a less often used case where labels are no longer mutually exclusive. This is called multi-label classification.

Finally, we will discuss another concept: model calibration. Model calibration basically tells you how reliable your predicted probabilities are. For example, if you look at the weather forecast and it says there is an 80% chance of rain, is there an actual 80% chance of rain? In the case of weather forecasts for the next day this is actually pretty close.

At the end, we will talk through some case studies!

7.1 Binary Classification With no Class Imbalance

If you recall the previous tutorials, so far we've only looked at loss curves and occasionally the accuracy. These are fairly basic evaluation metrics, but there are many more.

In general, it is best practice to always include a variant of the metrics discussed below in your analysis, irrespective of whether you are looking at a binary or a multi-class problem.

Confusion Matrix Metrics

Here are some key metrics used to evaluate the performance of a binary classification model:

True Positives (TP): The number of instances that are actually positive and are correctly predicted as positive.
False Positives (FP): The number of instances that are actually negative but are incorrectly predicted as positive. This is also known as a Type I error.
False Negatives (FN): The number of instances that are actually positive but are incorrectly predicted as negative. This is also known as a Type II error.
True Negatives (TN): The number of instances that are actually negative and are correctly predicted as negative.

These are used to calculate our classification performance metrics.

Here is an image that may make this easier to remember:

Classification Performance Metrics

Accuracy: The proportion of total instances that were correctly predicted (both positive and negative). This is the first evaluation metric, and one that we have already used. It works well enough in the case of balanced classes. However, it can be misleading in the case of highly imbalanced classes. For example, if you have a 99/1 ratio of negative to positive examples and your model predicts negative for everything, you would have an accuracy of 99%. This may look impressive, but it really does not tell you much in isolation.
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
Precision (Positive Predictive Value): The proportion of positive predictions that were actually correctly predicted. It measures the accuracy of the positive predictions.
$$ \text{Precision} = \frac{TP}{TP + FP} $$
Recall (Sensitivity, True Positive Rate): The proportion of actual positive instances that were correctly identified. It measures the model's ability to find all positive instances, relative to all actual positive instances (the correctly identified positives plus the missed positives).
$$ \text{Recall} = \frac{TP}{TP + FN} $$
Specificity (True Negative Rate): The proportion of actual negative instances that were correctly identified as negative. It measures the model's ability to correctly identify negative instances. We will run into this again when we discuss class imbalance; it is not implemented in the code below.
$$ \text{Specificity} = \frac{TN}{TN + FP} $$
F1 Score (Dice Coefficient): The harmonic mean of Precision and Recall. It provides a single score that balances both metrics. It's particularly useful when you need to balance precision and recall, especially with uneven class distribution. In image analysis, it's often referred to as the Dice Coefficient (as such this is also one of the main metrics in the segmentation tutorial) and calculated as:
$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN} $$
Jaccard Index (Intersection over Union - IoU): The ratio of the intersection of the predicted and actual positive sets to their union. Commonly used in image segmentation and object detection. For binary classification, it can be related to the F1 score.
$$ \text{Jaccard Index (IoU)} = \frac{TP}{TP + FP + FN} $$

Note: $\text{F1} = \frac{2 \times \text{IoU}}{1 + \text{IoU}}$ and $\text{IoU} = \frac{\text{F1}}{2 - \text{F1}}$
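As a quick sanity check on the note above, the following minimal sketch computes both scores from the same counts and converts between them. The counts and the helper names (`f1_from_counts`, `iou_from_counts`) are illustrative assumptions and not part of the lesson code.

```python
# Illustrative counts (assumed for this sketch, not taken from a real model)
tp_demo, fp_demo, fn_demo = 3, 3, 2

def f1_from_counts(tp, fp, fn):
    """F1 score / Dice coefficient from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

def iou_from_counts(tp, fp, fn):
    """Jaccard index / IoU from confusion-matrix counts."""
    return tp / (tp + fp + fn)

f1_demo = f1_from_counts(tp_demo, fp_demo, fn_demo)
iou_demo = iou_from_counts(tp_demo, fp_demo, fn_demo)

# Convert each score into the other using the identities from the note
f1_via_iou = 2 * iou_demo / (1 + iou_demo)
iou_via_f1 = f1_demo / (2 - f1_demo)

print(f"F1  = {f1_demo:.4f}, F1 from IoU = {f1_via_iou:.4f}")
print(f"IoU = {iou_demo:.4f}, IoU from F1 = {iou_via_f1:.4f}")
```

Because the conversion is monotonic, ranking models by F1 or by IoU gives the same order; only the numeric values differ.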

There are more metrics, though these are the ones that you will run into most often and are the most important. If you are interested in more metrics, these can be found on this page. You may run into an application where another niche metric is important, so remember that this page exists!

Below we have an example implementation of all of the scores on a simple problem.

In [ ]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

In [ ]:

y_target  = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1])
y_pred    = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1])

def true_positives(y_target, y_pred):
    return np.sum((y_target == 1) & (y_pred == 1), dtype = np.float32)

def false_positives(y_target, y_pred):
    return np.sum((y_target == 0) & (y_pred == 1), dtype = np.float32)

def true_negatives(y_target, y_pred):
    return np.sum((y_target == 0) & (y_pred == 0), dtype = np.float32)

def false_negatives(y_target, y_pred):
    return np.sum((y_target == 1) & (y_pred == 0), dtype = np.float32)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def dice_coefficient(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

def jaccard_index(tp, fp, fn):
    return tp / (tp + fp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# calculate the tp, fp, tn, fn
tp = true_positives(y_target, y_pred)
fp = false_positives(y_target, y_pred)
tn = true_negatives(y_target, y_pred)
fn = false_negatives(y_target, y_pred)

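# calculate precision and recall
# (stored under new names so the precision() and recall() functions are not
# overwritten, which would break the cell when it is run a second time)
precision_value = precision(tp, fp)
recall_value = recall(tp, fn)

# calculate the F1 score (Dice coefficient) and the Jaccard index
dice = dice_coefficient(tp, fp, fn)
jaccard = jaccard_index(tp, fp, fn)

accuracy_value = accuracy(tp, tn, fp, fn)

print("Accuracy: ", accuracy_value)
print("Precision: ", precision_value)
print("Recall: ", recall_value)
print("Dice Coefficient: ", dice)
print("Jaccard Index: ", jaccard)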

Accuracy:  0.44444445
Precision:  0.5
Recall:  0.6
Dice Coefficient:  0.54545456
Jaccard Index:  0.375
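The hand-written functions above can be cross-checked against scikit-learn. The sketch below is one possible way to do that, and it also adds the specificity, which is not implemented above. Treating specificity as the recall of the negative class (`pos_label=0`) is a choice made for this sketch; the variable names are illustrative.

```python
import numpy as np
from sklearn import metrics

# Same toy labels and predictions as in the cell above
y_true_check = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1])
y_pred_check = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1])

sk_accuracy = metrics.accuracy_score(y_true_check, y_pred_check)
sk_precision = metrics.precision_score(y_true_check, y_pred_check)
sk_recall = metrics.recall_score(y_true_check, y_pred_check)
sk_f1 = metrics.f1_score(y_true_check, y_pred_check)
sk_jaccard = metrics.jaccard_score(y_true_check, y_pred_check)

# Specificity is the recall of the negative class: TN / (TN + FP)
sk_specificity = metrics.recall_score(y_true_check, y_pred_check, pos_label=0)

print("Accuracy:   ", sk_accuracy)
print("Precision:  ", sk_precision)
print("Recall:     ", sk_recall)
print("F1 / Dice:  ", sk_f1)
print("Jaccard:    ", sk_jaccard)
print("Specificity:", sk_specificity)
```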

The confusion matrix

After calculating the confusion matrix metrics you can plot what is called a confusion matrix. A confusion matrix is a table, usually shown as a plot, in which you can easily see how well your model is doing. The more items lie on the main diagonal from the top left to the bottom right, the better your model is doing. Any off-diagonal items are misclassified.

In [ ]:

# @title
## Confusion Matrix
# implement the confusion matrix for binary classification
cm = confusion_matrix(y_target, y_pred)

# Create a pandas DataFrame for better visualization
cm_df = pd.DataFrame(cm, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])

# Visualize the confusion matrix using seaborn heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
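The four counts can also be read directly from the matrix returned by scikit-learn. For binary labels 0 and 1 the rows are the true classes and the columns the predicted classes, both in sorted label order, so the layout is `[[TN, FP], [FN, TP]]`. A minimal sketch, using the same toy data and illustrative variable names:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_cm = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1])
y_pred_cm = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1])

cm_demo = confusion_matrix(y_true_cm, y_pred_cm)

# Flatten row by row: first row is (TN, FP), second row is (FN, TP)
tn_cm, fp_cm, fn_cm, tp_cm = cm_demo.ravel()

print(cm_demo)
print(f"TN={tn_cm}, FP={fp_cm}, FN={fn_cm}, TP={tp_cm}")
```

Note that the positive class ends up in the bottom-right cell, which differs from some textbook drawings that put TP in the top-left.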

Area under the receiver operating characteristic curve (AUC-ROC)

History of AUC-ROC

The Area Under the Receiver Operating Characteristic (AUC-ROC) curve is a performance metric used for classification problems. Its history dates back to the 1940s, originating in electrical engineering for analyzing radar signals during World War II.

Initially, ROC curves were used to assess the ability of radar receivers to distinguish between signal and noise. The x-axis represented the false positive rate, and the y-axis represented the true positive rate. The curve illustrated the trade-off between these two rates as the detection threshold was varied.

In the 1970s and 1980s, ROC analysis found its way into psychology and medical diagnostics. It became a standard tool for evaluating the performance of diagnostic tests, where the "signal" was the presence of a disease and the "noise" was its absence.

The term "Area Under the Curve" (AUC) emerged as a single scalar value to summarize the overall performance of a classifier across all possible thresholds. An AUC of 1.0 represents a perfect classifier, while an AUC of 0.5 indicates a classifier that performs no better than random chance.

In machine learning, AUC-ROC is widely used to evaluate the performance of binary classifiers, especially when dealing with imbalanced datasets. It provides a comprehensive measure of a model's ability to discriminate between positive and negative classes.

How AUC-ROC Works

The Area Under the Receiver Operating Characteristic (AUC-ROC) curve is a way to evaluate the performance of a binary classifier. To understand how it works, let's break down the components:

ROC Curve: The ROC curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) at various predicted probability threshold settings.
- True Positive Rate (TPR): Also known as Sensitivity or Recall, it's the ratio of correctly predicted positive instances to the total number of actual positive instances. $TPR = TP / (TP + FN)$.
- False Positive Rate (FPR): It's the ratio of incorrectly predicted positive instances to the total number of actual negative instances. $FPR = FP / (FP + TN)$.
Decision Thresholds: For a classifier that outputs a probability or score, a decision threshold is used to decide whether an instance is classified as positive or negative. By varying this threshold from 1 to 0 (which corresponds to moving from left to right on the x-axis), we can generate different pairs of TPR and FPR values. What an ROC curve shows you is basically the tradeoff between the TPR and FPR, depending on the probability level above which you decide to predict the positive class.
Plotting the Curve: As you decrease the threshold, you will typically increase both the TPR and FPR. Plotting these pairs of (FPR, TPR) for different thresholds gives you the ROC curve.
- A perfect classifier would have a ROC curve that goes straight up from (0,0) to (0,1) and then across to (1,1). This means it achieves a 100% TPR with a 0% FPR.
- A completely random classifier would have a ROC curve that follows the diagonal line from (0,0) to (1,1).
Area Under the Curve (AUC): The AUC is the area under the entire ROC curve. It provides a single number that summarizes the classifier's performance across all possible thresholds.
- An AUC of 1 indicates a perfect classifier.
- An AUC of 0.5 indicates a classifier that performs no better than random chance.
- An AUC greater than 0.5 suggests that the classifier has some ability to distinguish between positive and negative classes.
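The threshold sweep described above can be sketched by hand on a tiny example. The six labels and scores below are made up for illustration, and `tpr_fpr_at_threshold` is a hypothetical helper, not a library function.

```python
import numpy as np

# Made-up labels and predicted probabilities (illustrative only)
y_true_thr = np.array([0, 0, 1, 0, 1, 1])
scores_thr = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.90])

def tpr_fpr_at_threshold(y_true, scores, threshold):
    """Predict positive when score >= threshold, then compute (TPR, FPR)."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return float(tp / (tp + fn)), float(fp / (fp + tn))

# Lower the threshold step by step and collect the (FPR, TPR) points
roc_points = []
for threshold in [1.0, 0.75, 0.5, 0.25, 0.0]:
    tpr, fpr = tpr_fpr_at_threshold(y_true_thr, scores_thr, threshold)
    roc_points.append((fpr, tpr))
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each printed row is one point of the ROC curve; as the threshold drops, both rates can only stay the same or increase, which is why the curve runs from (0,0) to (1,1).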

In essence, the AUC-ROC tells you the probability that a randomly chosen positive instance will be ranked higher (assigned a higher score/probability) by the classifier than a randomly chosen negative instance.
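This ranking interpretation can be checked numerically. The sketch below compares every positive score with every negative score and counts how often the positive one is higher (ties count as half), then compares the result with `roc_auc_score`. The synthetic Gaussian scores and the seed are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# 200 negatives around 0 and 200 positives around 1 (illustrative data)
y_true_rank = np.array([0] * 200 + [1] * 200)
scores_rank = np.concatenate([rng.normal(0, 1, 200), rng.normal(1, 1, 200)])

pos_scores = scores_rank[y_true_rank == 1]
neg_scores = scores_rank[y_true_rank == 0]

# Score difference for every (positive, negative) pair
pair_diff = pos_scores[:, None] - neg_scores[None, :]

# Fraction of pairs where the positive is ranked higher; ties count as 0.5
pairwise_auc = (np.sum(pair_diff > 0) + 0.5 * np.sum(pair_diff == 0)) / pair_diff.size

sklearn_auc = roc_auc_score(y_true_rank, scores_rank)
print(f"Pairwise ranking probability: {pairwise_auc:.4f}")
print(f"roc_auc_score:                {sklearn_auc:.4f}")
```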

When (not) to use the AUC

So when should you use the AUC-ROC and the ROC? Here is when:

Use it when you have roughly balanced classes (i.e. in case of a binary classification problem you would have about 50% positive and 50% negative examples). Be careful otherwise: with highly imbalanced classes the AUC-ROC can still look great, even when the model performs poorly on the minority class.
If all you care about is the relative rank order of probabilities, then you should use the AUC-ROC. The AUC will not tell you anything about the reliability of probabilities (i.e. whether a prediction of 0.8 means that the model is correct 80% of the time). For the latter you will have to look at model calibration, which is discussed later.
When decision thresholds are important to your problem, do not blindly use the AUC. The AUC averages performance over all decision thresholds. If you find that you have to operate in a restricted FPR range, for example because telling somebody they have a scary disease when in reality they don't is very costly, you may want to restrict your AUC calculation to a narrower range. This is called a partial AUC (pAUC).
When the costs of false positives and false negatives are very different you should not use the AUC blindly. In this case you are better off using cost-sensitive evaluation, which is discussed in section 7.3.
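For the partial AUC mentioned above, scikit-learn's `roc_auc_score` accepts a `max_fpr` argument. With `max_fpr` set, it returns a standardized partial AUC over the range [0, max_fpr], rescaled so that 0.5 still means chance level and 1.0 means perfect. The sketch below uses synthetic scores, and the cut-off of 0.1 is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)  # fixed seed for reproducibility
n_per_class = 500

# Synthetic scores: negatives around 0, positives around 2 (illustrative)
y_true_pauc = np.array([0] * n_per_class + [1] * n_per_class)
scores_pauc = np.concatenate([
    rng.normal(0, 1, n_per_class),
    rng.normal(2, 1, n_per_class),
])

# AUC over the full FPR range [0, 1]
full_auc = roc_auc_score(y_true_pauc, scores_pauc)

# Standardized partial AUC restricted to FPR <= 0.1
partial_auc = roc_auc_score(y_true_pauc, scores_pauc, max_fpr=0.1)

print(f"Full AUC:                  {full_auc:.4f}")
print(f"Partial AUC (FPR <= 0.1):  {partial_auc:.4f}")
```

The partial value is usually lower than the full AUC because it only rewards the model for the part of the curve where false positives are rare.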

Let's see an implementation of the ROC Curve from which the AUC is calculated.

In [ ]:

# @title
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import norm

def generate_data_with_auc_from_so(n_samples=1000, prevalence=0.5, auc_target=0.5):
    """Generates synthetic data (y_true, y_scores) with a target AUC based on Stack Overflow answer provided here:

    https://stats.stackexchange.com/questions/562000/how-to-simulate-a-calibrated-prediction-model-given-prevalence-and-auc/565607#565607

    """
    n_positive = int(n_samples * prevalence)
    n_negative = n_samples - n_positive

    y_true = np.array([0] * n_negative + [1] * n_positive)

    if auc_target == 1.0:
        # For AUC of 1.0, ensure perfect separation
        scores_negative = np.random.normal(0, 0.1, n_negative)
        scores_positive = np.random.normal(10, 0.1, n_positive)

    elif auc_target == 0.5:
         # For AUC of 0.5, scores should be indistinguishable
         scores_negative = np.random.normal(0, 1, n_negative)
         scores_positive = np.random.normal(0, 1, n_positive)

    else:
        # For other AUCs, calculate the required mean difference
        delta_mean = norm.ppf(auc_target) * np.sqrt(2)
        scores_negative = np.random.normal(0, 1, n_negative)
        scores_positive = np.random.normal(delta_mean, 1, n_positive)

    y_scores = np.concatenate((scores_negative, scores_positive))

    # Shuffle the data to mix positive and negative instances
    indices = np.arange(n_samples)
    np.random.shuffle(indices)
    y_true = y_true[indices]
    y_scores = y_scores[indices]

    # Verify the actual AUC
    actual_auc = roc_auc_score(y_true, y_scores)
    print(f"Generated data with target AUC of {auc_target:.2f}. Actual AUC: {actual_auc:.4f}")

    return y_true, y_scores

# Generate data for different AUC values: This is sampled so not exact.
y_true_random, y_scores_random = generate_data_with_auc_from_so(auc_target=0.5)
y_true_0_75, y_scores_0_75 = generate_data_with_auc_from_so(auc_target=0.75)
y_true_0_9, y_scores_0_9 = generate_data_with_auc_from_so(auc_target=0.9)
y_true_1_0, y_scores_1_0 = generate_data_with_auc_from_so(auc_target=1.0)

Generated data with target AUC of 0.50. Actual AUC: 0.5061
Generated data with target AUC of 0.75. Actual AUC: 0.7488
Generated data with target AUC of 0.90. Actual AUC: 0.9052
Generated data with target AUC of 1.00. Actual AUC: 1.0000

In [ ]:

# @title
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))

# Plot ROC for Random (AUC ~ 0.5)
fpr_random, tpr_random, _ = roc_curve(y_true_random, y_scores_random)
roc_auc_random = auc(fpr_random, tpr_random)
plt.plot(fpr_random, tpr_random, label=f'AUC = {roc_auc_random:.2f} (Random)')

# Plot ROC for AUC ~ 0.75
fpr_0_75, tpr_0_75, _ = roc_curve(y_true_0_75, y_scores_0_75)
roc_auc_0_75 = auc(fpr_0_75, tpr_0_75)
plt.plot(fpr_0_75, tpr_0_75, label=f'AUC = {roc_auc_0_75:.2f} (0.75 Target)')

# Plot ROC for AUC ~ 0.9
fpr_0_9, tpr_0_9, _ = roc_curve(y_true_0_9, y_scores_0_9)
roc_auc_0_9 = auc(fpr_0_9, tpr_0_9)
plt.plot(fpr_0_9, tpr_0_9, label=f'AUC = {roc_auc_0_9:.2f} (0.9 Target)')

# Plot ROC for AUC ~ 1.0
fpr_1_0, tpr_1_0, _ = roc_curve(y_true_1_0, y_scores_1_0)
roc_auc_1_0 = auc(fpr_1_0, tpr_1_0)
plt.plot(fpr_1_0, tpr_1_0, label=f'AUC = {roc_auc_1_0:.2f} (1.0 Target)')

# Plot the diagonal line (random classifier)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.50)')

plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

7.2 Binary Classification With Class Imbalance

In the last section we looked at how to evaluate a binary classification problem when you have balanced data. In this section we will look at methods to evaluate binary classification in the case of unbalanced data. This is very important in medical applications, because usually healthy people outnumber sick people. So your dataset is never truly balanced. This will also carry over in the tutorial on image segmentation. Pixels in the background of an image almost always outnumber pixels that belong to the object that we are trying to segment.

The first thing that we want to do is to make sure that we do not rely on the accuracy, for the reason we mentioned before. In addition, we always want to report and critically look at the precision, recall, and F1-score. These metrics work because they are normalised relative to the total number of actual positives or the total number of predicted positives. So just predicting one class will result in a bad score on these metrics.
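To make this concrete, here is a minimal sketch of the 99/1 situation with a model that predicts negative for everything. The data is constructed by hand for illustration. `zero_division=0` tells scikit-learn to report 0 instead of warning when there are no predicted positives, in which case precision is undefined.

```python
import numpy as np
from sklearn import metrics

# 99 negatives and 1 positive (illustrative imbalanced dataset)
y_true_imb = np.array([0] * 99 + [1] * 1)

# A "model" that always predicts the majority (negative) class
y_pred_all_neg = np.zeros_like(y_true_imb)

acc_imb = metrics.accuracy_score(y_true_imb, y_pred_all_neg)
prec_imb = metrics.precision_score(y_true_imb, y_pred_all_neg, zero_division=0)
rec_imb = metrics.recall_score(y_true_imb, y_pred_all_neg, zero_division=0)
f1_imb = metrics.f1_score(y_true_imb, y_pred_all_neg, zero_division=0)

print(f"Accuracy:  {acc_imb:.2f}")   # looks impressive
print(f"Precision: {prec_imb:.2f}")  # no positive predictions at all
print(f"Recall:    {rec_imb:.2f}")   # the single positive case was missed
print(f"F1 score:  {f1_imb:.2f}")
```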

Another thing that we can do, is plot the precision-recall curve and calculate the area under its curve. This is what we will discuss next.

Precision-Recall Curve

The Precision-Recall curve is another important evaluation metric for binary classification, especially useful when dealing with imbalanced datasets.

What it plots: The Precision-Recall curve plots Precision (on the y-axis) against Recall (on the x-axis) at various probability thresholds.
- Precision: $\frac{TP}{TP + FP}$ - The ability of the trained classifier to correctly predict the positive class.
- Recall: $\frac{TP}{TP + FN}$ - The ability of the classifier to find all the positive samples in the dataset.
How it works: Similar to the ROC curve, you vary the decision threshold for classifying an instance as positive. For each threshold, you calculate the precision and recall and plot the point on the graph. As with the ROC curve, the threshold decreases from 1 to 0 as we move from left to right along the x-axis, because lowering the threshold can only increase the recall.
Interpretation:
- A curve that stays closer to the top-right corner indicates better performance, meaning the model achieves high precision and high recall simultaneously.
- A random classifier would have a Precision-Recall curve that is a horizontal line at a precision equal to the prevalence of the positive class in the dataset.
- The Area Under the Precision-Recall Curve (AUPRC or AP) is a single metric that summarizes the performance across all thresholds. A higher AUPRC indicates better performance. The AUPRC falls between 0 and 1.
- In the example below you will see that as the recall approaches 1, the precision approaches the prevalence (the fraction of samples in the dataset that belong to the positive class). This is because at that point the model predicts everything as belonging to the positive class. The number of true positives then equals the number of actual positives in the data, and the number of false positives equals the number of negatives in the data. The precision therefore becomes the number of positive examples divided by the size of the whole dataset, which equals the prevalence.
When to use it: The Precision-Recall curve is particularly informative when the positive class is rare (imbalanced data). In such cases, the ROC curve can be misleading because the large number of true negatives keeps the false positive rate low, which can mask poor performance on the positive class. The Precision-Recall curve, focusing only on the positive class, provides a more realistic picture of the model's ability to identify positive instances without generating too many false positives.
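The claim that precision falls to the prevalence once everything is predicted as positive can be verified in a few lines. The 20/80 split below is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn import metrics

# 20 positives and 80 negatives, so the prevalence is 0.2 (illustrative)
y_true_prev = np.array([1] * 20 + [0] * 80)
prevalence_demo = y_true_prev.mean()

# Threshold at 0: every instance is predicted as positive
y_pred_all_pos = np.ones_like(y_true_prev)

recall_all_pos = metrics.recall_score(y_true_prev, y_pred_all_pos)
precision_all_pos = metrics.precision_score(y_true_prev, y_pred_all_pos)

print(f"Prevalence: {prevalence_demo:.2f}")
print(f"Recall:     {recall_all_pos:.2f}")     # all positives are found
print(f"Precision:  {precision_all_pos:.2f}")  # TP / (TP + FP) = 20 / 100
```

This is also why the baseline of a Precision-Recall plot depends on the dataset: a precision of 0.2 is chance level here, but it would be a strong result if only 1% of the data were positive.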

In summary, while the ROC curve assesses the classifier's ability to distinguish between classes generally, the Precision-Recall curve is a better indicator of performance when the positive class is the minority class and correctly identifying positive instances is crucial.

Below you can see a plot of the Precision-Recall curve. Note that we do not simulate data that is unbalanced in this case. This would be quite computationally expensive.

In [ ]:

# @title
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score
from scipy.stats import norm
import matplotlib.pyplot as plt

def generate_data_with_auprc(n_samples=1000, prevalence=0.5, auprc_target=0.5):
    """Generates synthetic data (y_true, y_scores) with a target AUPRC."""

    n_positive = int(n_samples * prevalence)
    n_negative = n_samples - n_positive
    y_true = np.array([0] * n_negative + [1] * n_positive)

    if auprc_target == 1.0:
        # For AUPRC of 1.0, ensure perfect separation
        scores_negative = np.random.normal(-10, 1, n_negative)
        scores_positive = np.random.normal(10, 1, n_positive)

    elif auprc_target == prevalence:
        # For AUPRC equal to prevalence, scores should be indistinguishable
        scores_negative = np.random.normal(0, 1, n_negative)
        scores_positive = np.random.normal(0, 1, n_positive)
    else:
        # For other AUPRCs, this is more complex than AUC and might require iteration or approximation.
        # A simplified approach: Adjust mean difference based on a target AUPRC relative to prevalence
        # This is not a precise method for controlling AUPRC but provides varying levels of performance.
        # A more rigorous approach would involve iteratively adjusting parameters or using a different distribution.

        # Simple approximation: relate AUC to AUPRC for a rough estimate
        # This is not always accurate, especially for extreme imbalances
        # auc_approx = 0.5 + (auprc_target - prevalence) * 0.5 / (1 - prevalence) if auprc_target > prevalence else 0.5 - (prevalence - auprc_target) * 0.5 / prevalence

        # Let's use a simpler approach by adjusting the mean difference based on the distance from random (prevalence) and perfect (1.0)
        if auprc_target > prevalence:
            # Scale the mean difference based on how far the target AUPRC is from prevalence towards 1.0
            scale = (auprc_target - prevalence) / (1.0 - prevalence)
            delta_mean = norm.ppf(prevalence + scale * (1.0 - prevalence)) * np.sqrt(2) # Rough approximation
        else:
             # Scale the mean difference based on how far the target AUPRC is from 0 towards prevalence
            scale = auprc_target / prevalence
            delta_mean = norm.ppf(scale * prevalence) * np.sqrt(2) # Rough approximation


        scores_negative = np.random.normal(-delta_mean/2, 1, n_negative)
        scores_positive = np.random.normal(delta_mean/2, 1, n_positive)

    y_scores = np.concatenate((scores_negative, scores_positive))

    # Shuffle the data to mix positive and negative instances
    indices = np.arange(n_samples)
    np.random.shuffle(indices)
    y_true = y_true[indices]
    y_scores = y_scores[indices]

    # Verify the actual AUPRC
    actual_auprc = average_precision_score(y_true, y_scores)
    print(f"Generated data with target AUPRC of {auprc_target:.2f}. Actual AUPRC: {actual_auprc:.4f}")

    return y_true, y_scores

# Generate data for different AUPRC values (with prevalence 0.5 for simplicity)
# AUPRC of 0 is not possible with a random classifier; the minimum is the prevalence.
# Let's aim for AUPRC near prevalence for the "0" target.
y_true_random_pr, y_scores_random_pr = generate_data_with_auprc(auprc_target=0.5, prevalence = 0.5) # AUPRC near prevalence
y_true_0_25_pr, y_scores_0_25_pr = generate_data_with_auprc(auprc_target=0.25, prevalence = 0.5) # This will be below prevalence, might not be achievable with this method
y_true_0_5_pr, y_scores_0_5_pr = generate_data_with_auprc(auprc_target=0.5, prevalence = 0.5)
y_true_0_75_pr, y_scores_0_75_pr = generate_data_with_auprc(auprc_target=0.75, prevalence = 0.5)
y_true_1_0_pr, y_scores_1_0_pr = generate_data_with_auprc(auprc_target=1.0, prevalence = 0.5)

Generated data with target AUPRC of 0.50. Actual AUPRC: 0.5114
Generated data with target AUPRC of 0.25. Actual AUPRC: 0.3540
Generated data with target AUPRC of 0.50. Actual AUPRC: 0.5159
Generated data with target AUPRC of 0.75. Actual AUPRC: 0.7457
Generated data with target AUPRC of 1.00. Actual AUPRC: 1.0000

In [ ]:

# @title
from sklearn.metrics import precision_recall_curve, auc
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))

# Plot PR for Random (AUPRC ~ prevalence)
precision_random, recall_random, _ = precision_recall_curve(y_true_random_pr, y_scores_random_pr)
auprc_random = auc(recall_random, precision_random)
plt.plot(recall_random, precision_random, label=f'AUPRC = {auprc_random:.2f} (Random)')

# Plot PR for AUPRC ~ 0.25
precision_0_25, recall_0_25, _ = precision_recall_curve(y_true_0_25_pr, y_scores_0_25_pr)
auprc_0_25 = auc(recall_0_25, precision_0_25)
plt.plot(recall_0_25, precision_0_25, label=f'AUPRC = {auprc_0_25:.2f} (0.25 Target)')

# Plot PR for AUPRC ~ 0.5
precision_0_5, recall_0_5, _ = precision_recall_curve(y_true_0_5_pr, y_scores_0_5_pr)
auprc_0_5 = auc(recall_0_5, precision_0_5)
plt.plot(recall_0_5, precision_0_5, label=f'AUPRC = {auprc_0_5:.2f} (0.5 Target)')

# Plot PR for AUPRC ~ 0.75
precision_0_75, recall_0_75, _ = precision_recall_curve(y_true_0_75_pr, y_scores_0_75_pr)
auprc_0_75 = auc(recall_0_75, precision_0_75)
plt.plot(recall_0_75, precision_0_75, label=f'AUPRC = {auprc_0_75:.2f} (0.75 Target)')

# Plot PR for AUPRC ~ 1.0
precision_1_0, recall_1_0, _ = precision_recall_curve(y_true_1_0_pr, y_scores_1_0_pr)
auprc_1_0 = auc(recall_1_0, precision_1_0)
plt.plot(recall_1_0, precision_1_0, label=f'AUPRC = {auprc_1_0:.2f} (1.0 Target)')


plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.grid(True)
plt.show()

Less common Metrics

So far we have discussed common metrics that can be used to evaluate unbalanced data. Were you to ever publish an academic paper on a task that involves binary classification, then you would definitely include the above metrics and plots. There are two more metrics that can be used that are a little less common, but nonetheless informative. These are Matthews Correlation Coefficient (MCC), and Balanced Accuracy.

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is a single-value evaluation metric for binary classification that takes into account all four values in the confusion matrix: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). It is considered a more balanced measure than Accuracy, Precision, or Recall, especially when dealing with imbalanced datasets.

The MCC is essentially a correlation coefficient between the observed and predicted binary classifications. It ranges from -1 to +1, where:

+1 represents a perfect prediction.
0 represents a prediction no better than random chance.
-1 represents a perfect inverse prediction (always predicting the opposite class).

The formula for MCC is:

$$ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$

One of the key advantages of MCC is that it produces a high score only if the classifier performs well in all four aspects of the confusion matrix (TP, TN, FP, FN) proportionally to the size of the positive and negative classes. This makes it a reliable metric for imbalanced datasets where a high accuracy might be misleading.
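A minimal sketch of the formula, checked against scikit-learn's `matthews_corrcoef`. The labels are the toy example from section 7.1, and `mcc_from_counts` is a hypothetical helper written for this sketch; it does not handle the case where the denominator is zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true_mcc = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1])
y_pred_mcc = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1])

def mcc_from_counts(tp, tn, fp, fn):
    """MCC from the four confusion-matrix counts (assumes a non-zero denominator)."""
    numerator = tp * tn - fp * fn
    denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return float(numerator / denominator)

# For labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP
tn_m, fp_m, fn_m, tp_m = (int(v) for v in confusion_matrix(y_true_mcc, y_pred_mcc).ravel())

mcc_manual = mcc_from_counts(tp_m, tn_m, fp_m, fn_m)
mcc_sklearn = matthews_corrcoef(y_true_mcc, y_pred_mcc)

print(f"MCC (manual):  {mcc_manual:.4f}")
print(f"MCC (sklearn): {mcc_sklearn:.4f}")
```

For this example the MCC is slightly negative, meaning the predictions are marginally worse than chance, even though the recall of 0.6 on its own looks acceptable.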

Why the MCC Might Not Be Reported as Frequently:

Despite its strengths, particularly with imbalanced data, MCC is not always the go-to metric in every scenario. Some reasons for this include:

Interpretability: While the range of -1 to +1 is clear, the specific value of MCC can sometimes be less intuitive to interpret compared to metrics like Precision or Recall, which have direct interpretations related to false positives and false negatives. Stakeholders who are not deeply technical may find it harder to grasp what an MCC of, say, 0.6 means in practical terms for their problem compared to a precision of 0.8.
Focus on Specific types of Errors: In some applications, there's a much stronger emphasis on minimizing a particular type of error (e.g., minimizing false positives in a medical diagnosis setting). In such cases, metrics that directly reflect the cost of that specific error (like Precision or a cost-sensitive metric) might be preferred or used in conjunction with MCC.
Comparisons: When comparing results across different studies or models, it's often easier to use metrics that are universally reported in that domain. If MCC is not commonly used in a specific field, it can be harder to benchmark performance.

However, in fields where a balanced assessment of performance across all aspects of the confusion matrix is critical, especially with imbalanced data, MCC is increasingly recognized and used. Just be careful with non-machine learning stakeholders that may not be as familiar with it.

Balanced Accuracy

Balanced Accuracy is an evaluation metric that addresses the issue of misleading accuracy when dealing with imbalanced datasets. It is defined as the average of recall (sensitivity) and specificity.

$$ \text{Balanced Accuracy} = \frac{\text{Recall} + \text{Specificity}}{2} $$

where:

Recall (Sensitivity): $\frac{TP}{TP + FN}$ (True Positive Rate)
Specificity: $\frac{TN}{TN + FP}$ (True Negative Rate)

It is useful to know that specificity is directly related to the false positive rate: $\text{Specificity} = 1 - FPR$.

Balanced Accuracy is useful because it gives equal weight to the performance on both the positive and negative classes, regardless of their proportion in the dataset. A model that simply predicts the majority class in an imbalanced dataset will have a high standard accuracy but a balanced accuracy of 0.5 (random chance).
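The majority-class case can be sketched as follows, using a hand-made 95/5 dataset (an assumption for illustration). The manual computation follows the formula above and is compared with scikit-learn's `balanced_accuracy_score`.

```python
import numpy as np
from sklearn import metrics

# 95 negatives and 5 positives (illustrative imbalanced dataset)
y_true_bal = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority (negative) class
y_pred_majority = np.zeros_like(y_true_bal)

# Recall is measured on the positive class, specificity on the negative class
recall_bal = metrics.recall_score(y_true_bal, y_pred_majority, zero_division=0)
specificity_bal = metrics.recall_score(y_true_bal, y_pred_majority, pos_label=0)

balanced_acc_manual = (recall_bal + specificity_bal) / 2
balanced_acc_sklearn = metrics.balanced_accuracy_score(y_true_bal, y_pred_majority)
plain_acc = metrics.accuracy_score(y_true_bal, y_pred_majority)

print(f"Accuracy:                    {plain_acc:.2f}")
print(f"Balanced accuracy (manual):  {balanced_acc_manual:.2f}")
print(f"Balanced accuracy (sklearn): {balanced_acc_sklearn:.2f}")
```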

Why Balanced Accuracy Might Not Be Reported as Frequently:

While a valuable metric for imbalanced data, Balanced Accuracy might not be reported as frequently as some other metrics for several reasons:

Relationship to ROC AUC: Balanced Accuracy is closely related to the Area Under the ROC Curve (AUC-ROC). At a single decision threshold the classifier corresponds to one point (FPR, TPR) on the ROC curve, and the area under the line from (0,0) through that point to (1,1) equals (TPR + 1 - FPR) / 2, which is exactly the Balanced Accuracy. Since AUC-ROC is a very common and well-understood metric, researchers often prefer to report AUC-ROC as it provides a summary across all possible thresholds, whereas Balanced Accuracy is calculated at a single, chosen threshold. Remember that the ROC curve is basically a plot of the TPR (recall) against the FPR (1 - specificity).
Focus on Specific types of Errors: Similar to the MCC, in applications where the costs of false positives and false negatives are significantly different, metrics like Precision and Recall (or cost-sensitive metrics) might be more directly informative about the type of errors that are most critical to minimize. Balanced Accuracy provides an overall balance but doesn't highlight performance on one specific type of error over another.
Interpretability Compared to Precision/Recall: While more informative than standard accuracy for imbalanced data, Balanced Accuracy can still be less intuitive to interpret in terms of concrete false positive or false negative rates compared to looking at Precision and Recall separately.
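The relationship to AUC-ROC mentioned above can be checked numerically. The sketch below uses synthetic scores (the data, the noise level, and the 0.3 threshold are all assumptions for illustration). It shows that the AUC computed from hard 0/1 predictions at one threshold matches the Balanced Accuracy at that threshold, while the AUC computed from the continuous scores is a different, threshold-free number.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic binary labels and noisy scores that correlate with the label
y_true = rng.integers(0, 2, size=500)
y_score = 0.6 * y_true + rng.normal(0.0, 0.5, size=500)

# Hard predictions at one (arbitrary) threshold
y_pred = (y_score >= 0.3).astype(int)

bal_acc = balanced_accuracy_score(y_true, y_pred)
auc_hard = roc_auc_score(y_true, y_pred)     # ROC built from a single operating point
auc_scores = roc_auc_score(y_true, y_score)  # ROC built from all thresholds

print(f"Balanced accuracy at threshold 0.3: {bal_acc:.4f}")
print(f"AUC from hard predictions:          {auc_hard:.4f}")
print(f"AUC from continuous scores:         {auc_scores:.4f}")
```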

7.3 Cost Sensitive Evaluation

In many real-world applications, the costs associated with different types of classification errors are not equal. For example, in medical diagnosis, a False Negative (failing to detect a disease when it's present) might be significantly more costly than a False Positive (incorrectly diagnosing a healthy person with the disease). Cost-Sensitive Evaluation and the use of a Cost Matrix are methods that take these differential costs into account.

A Cost Matrix is a table that defines the cost associated with each possible outcome of a binary classification:

	Predicted Positive	Predicted Negative
Actual Positive	Cost of True Positive (often 0)	Cost of False Negative (FN)
Actual Negative	Cost of False Positive (FP)	Cost of True Negative (often 0)

The costs in the matrix are typically defined based on the specific problem and its consequences. Using this matrix, you can calculate the Total Cost of the classification model:

$$ \text{Total Cost} = (TP \times \text{Cost}_{TP}) + (FP \times \text{Cost}_{FP}) + (TN \times \text{Cost}_{TN}) + (FN \times \text{Cost}_{FN}) $$

Where $\text{Cost}_{TP}$ is the cost of a True Positive, $\text{Cost}_{FP}$ is the cost of a False Positive, and so on. Often, the costs of True Positives and True Negatives are considered to be 0, but they can be assigned values if there are benefits or costs associated with correct classifications.

Evaluation then focuses on minimizing the Total Cost. Metrics can be derived from this total cost. For instance, instead of maximizing accuracy, you would aim to minimize the average cost per instance.

Using cost-sensitive evaluation is crucial when the impact of different errors varies significantly, providing a more relevant measure of a model's performance in a practical context.
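As a small worked example of the Total Cost formula, the sketch below uses hypothetical confusion-matrix counts and hypothetical costs (the function name `total_cost` and all numbers are assumptions for illustration only).

```python
def total_cost(tp, fp, tn, fn, cost_tp=0.0, cost_fp=1.0, cost_tn=0.0, cost_fn=1.0):
    """Total cost of a classifier given its confusion-matrix counts and per-outcome costs."""
    return tp * cost_tp + fp * cost_fp + tn * cost_tn + fn * cost_fn


# Hypothetical counts: 40 TP, 10 FP, 45 TN, 5 FN
tp, fp, tn, fn = 40, 10, 45, 5

# Hypothetical costs: a missed case (FN) is ten times as costly as a false alarm (FP)
cost = total_cost(tp, fp, tn, fn, cost_fp=1.0, cost_fn=10.0)
avg_cost = cost / (tp + fp + tn + fn)

print(f"Total cost: {cost}")                     # 10 * 1 + 5 * 10 = 60.0
print(f"Average cost per instance: {avg_cost}")  # 60 / 100 = 0.6
```

Although this hypothetical model makes twice as many false positives as false negatives, the false negatives account for most of the cost (50 out of 60).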

We will now replicate aspects of a tutorial on scikit-learn to show you how this is done. In this tutorial the threshold probability is tuned such that the cost is minimized.

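Our cost matrix looks like this. It follows the layout of scikit-learn's confusion matrix (negative class first), which is also how the code indexes it:

	Predicted Negative	Predicted Positive
Actual Negative	0	3 (FP)
Actual Positive	5 (FN)	0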

We will use this to calculate our costs!

In [ ]:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.metrics import confusion_matrix

In [ ]:

# @title
X, y = load_breast_cancer(return_X_y=True)

# split into train, test
# NOTE: we skip a validation split here; the hyperparameters are fixed because
# tuning them is not needed for this demonstration.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# define your cost matrix for cost sensitive evaluation
# No reflection on reality, these are just toy numbers!!
# Layout follows sklearn's confusion_matrix: [[TN, FP], [FN, TP]],
# so a false positive costs 3 and a false negative costs 5.
cost_matrix = np.array([[0, 3], [5, 0]])

# train a logistic regression model
model = LogisticRegression(
    max_iter=10000,
    penalty='elasticnet',
    C=0.5,
    solver= 'saga',
    l1_ratio = 0.5,
    class_weight = 'balanced',
    random_state=42
)
model.fit(X_train, y_train)

Out[ ]:

LogisticRegression(C=0.5, class_weight='balanced', l1_ratio=0.5, max_iter=10000,
                   penalty='elasticnet', random_state=42, solver='saga')


So how well does the classifier work? Below we plot the ROC curve, the Precision-Recall curve, and the cost curve. On the training set, the classifier seems to work well.

For our cost function we get a roughly U-shaped curve with a clear minimum. This is because we assign a cost to both false positives and false negatives: a very low threshold produces many false positives, a very high threshold produces many false negatives, and the cheapest trade-off lies somewhere in between.

In [ ]:

# @title
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Get predicted probabilities
y_prob = model.predict_proba(X_train)[:, 1]

# Calculate ROC curve and AUC
fpr, tpr, thresholds_roc = roc_curve(y_train, y_prob)
roc_auc = auc(fpr, tpr)

# Calculate Precision-Recall curve and AUPRC
precision, recall, thresholds_pr = precision_recall_curve(y_train, y_prob)
auprc = average_precision_score(y_train, y_prob)

# Calculate total cost for different thresholds
# Evaluate the cost on an evenly spaced grid of thresholds instead of at every predicted probability
thresholds_cost = np.linspace(0, 1, 100)
# thresholds_cost = np.unique(y_prob)

total_costs = []
for threshold in thresholds_cost:
    y_pred_threshold = (y_prob >= threshold).astype(int)
    # Calculate confusion matrix for the current threshold
    tp_c = np.sum((y_train == 1) & (y_pred_threshold == 1))
    fp_c = np.sum((y_train == 0) & (y_pred_threshold == 1))
    tn_c = np.sum((y_train == 0) & (y_pred_threshold == 0))
    fn_c = np.sum((y_train == 1) & (y_pred_threshold == 0))
    # Calculate total cost using the cost matrix (sklearn layout: [[TN, FP], [FN, TP]])
    cost = (tn_c * cost_matrix[0, 0] + fp_c * cost_matrix[0, 1] +
            fn_c * cost_matrix[1, 0] + tp_c * cost_matrix[1, 1])
    total_costs.append(cost)

# Plotting
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Plot ROC curve
axes[0].plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC = {roc_auc:.2f}')
axes[0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0].set_xlim([0.0, 1.0])
axes[0].set_ylim([0.0, 1.05])
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('Receiver Operating Characteristic (ROC) Curve')
axes[0].legend(loc="lower right")
axes[0].grid(True)

# Plot Precision-Recall curve
axes[1].step(recall, precision, color='b', where='post', label=f'AUPRC = {auprc:.2f}')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlim([0.0, 1.0])
axes[1].set_title('Precision-Recall Curve')
axes[1].legend(loc="lower left")
axes[1].grid(True)

# Plot Total Cost vs. Threshold
axes[2].plot(thresholds_cost, total_costs, linestyle='-')
axes[2].set_xlabel('Probability Threshold')
axes[2].set_ylabel('Total Cost')
axes[2].set_title('Total Cost vs. Probability Threshold')
axes[2].grid(True)

plt.tight_layout()
plt.show()

Next, we will select the threshold that minimizes the total cost on the training set and then check how it performs on the test set. We will keep it simple this time: calculate the cost at each threshold on the grid and select the one with the minimum cost.

In [ ]:

# @title
def calculate_optimal_threshold(model, X_train, y_train) -> float:
    """Return the probability threshold that minimizes the total cost on the training set.

    Uses the global `cost_matrix` in sklearn layout [[TN, FP], [FN, TP]].
    """
    # Define a range of probability thresholds to evaluate
    thresholds_cost = np.linspace(0, 1, 100)

    # Get predicted probabilities for the training set
    y_prob_train = model.predict_proba(X_train)[:, 1]

    # Calculate total cost for different thresholds and store them
    total_costs_train = []
    for threshold in thresholds_cost:
        y_pred_threshold = (y_prob_train >= threshold).astype(int)
        # Calculate confusion matrix components for the current threshold
        tp_c = np.sum((y_train == 1) & (y_pred_threshold == 1))
        fp_c = np.sum((y_train == 0) & (y_pred_threshold == 1))
        tn_c = np.sum((y_train == 0) & (y_pred_threshold == 0))
        fn_c = np.sum((y_train == 1) & (y_pred_threshold == 0))
        # Calculate total cost using the cost matrix
        cost = (tn_c * cost_matrix[0, 0] + fp_c * cost_matrix[0, 1] +
                fn_c * cost_matrix[1, 0] + tp_c * cost_matrix[1, 1])
        total_costs_train.append(cost)

    # Find the threshold that corresponds to the minimum total cost
    min_cost_index = np.argmin(total_costs_train)
    return float(thresholds_cost[min_cost_index])

def evaluate_test_set_performance(model, X_test, y_test, cost_matrix, optimal_threshold):
    """
    Evaluates a classification model on a test set using both a default (0.5)
    and a specified optimal probability threshold, and calculates the total cost
    and confusion matrices based on a given cost matrix.

    Args:
        model: The trained classification model (e.g., LogisticRegression).
        X_test: The test data features.
        y_test: The true labels for the test data.
        cost_matrix: A numpy array representing the cost matrix [[TN_cost, FP_cost], [FN_cost, TP_cost]].
        optimal_threshold: The optimal probability threshold to use for evaluation.

    Returns:
        A dictionary containing the total costs and confusion matrices for both
        the default and optimal thresholds on the test set.
    """
    # Get predicted probabilities for the test set
    y_prob_test = model.predict_proba(X_test)[:, 1]

    # --- Evaluate with Default Threshold (0.5) ---
    y_pred_test_default = (y_prob_test >= 0.5).astype(int)

    # Calculate confusion matrix for default threshold
    cm_test_default = confusion_matrix(y_test, y_pred_test_default)

    # Calculate confusion matrix components for the test set with default threshold
    tn_test_default, fp_test_default, fn_test_default, tp_test_default = cm_test_default.ravel()

    # Calculate the total cost on the test set using the default threshold
    total_cost_test_default = (tp_test_default * cost_matrix[1, 1] +
                               fp_test_default * cost_matrix[0, 1] +
                               fn_test_default * cost_matrix[1, 0] +
                               tn_test_default * cost_matrix[0, 0])

    # --- Evaluate with Optimal Threshold ---
    y_pred_test_optimal = (y_prob_test >= optimal_threshold).astype(int)

    # Calculate confusion matrix for optimal threshold
    cm_test_optimal = confusion_matrix(y_test, y_pred_test_optimal)

    # Calculate confusion matrix components for the test set with the optimal threshold
    tn_test_optimal, fp_test_optimal, fn_test_optimal, tp_test_optimal = cm_test_optimal.ravel()


    # Calculate the total cost on the test set using the optimal threshold
    total_cost_test_optimal = (tp_test_optimal * cost_matrix[1, 1] +
                               fp_test_optimal * cost_matrix[0, 1] +
                               fn_test_optimal * cost_matrix[1, 0] +
                               tn_test_optimal * cost_matrix[0, 0])

    results = {
        "total_cost_default": total_cost_test_default,
        "cm_default": cm_test_default,
        "total_cost_optimal": total_cost_test_optimal,
        "cm_optimal": cm_test_optimal
    }

    return results

# Example of how to call the function (assuming model, X_test, y_test, cost_matrix, and optimal_threshold are defined)
optimal_threshold = calculate_optimal_threshold(model, X_train, y_train)
evaluation_results = evaluate_test_set_performance(model, X_test, y_test, cost_matrix, optimal_threshold)

In [ ]:

# @title
# pandas and seaborn are needed for the tables and heatmaps below
import pandas as pd
import seaborn as sns

# Get the confusion matrices from the evaluation results
cm_default = evaluation_results['cm_default']
cm_optimal = evaluation_results['cm_optimal']

# Create pandas DataFrames for better visualization
cm_df_default = pd.DataFrame(cm_default, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])
cm_df_optimal = pd.DataFrame(cm_optimal, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])

# Plotting
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot Confusion Matrix for Default Threshold
sns.heatmap(cm_df_default, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('Confusion Matrix (Default Threshold 0.5)')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('True Label')

# Plot Confusion Matrix for Optimal Threshold
sns.heatmap(cm_df_optimal, annot=True, fmt='d', cmap='Blues', ax=axes[1])
axes[1].set_title(f'Confusion Matrix (Optimal Threshold {optimal_threshold:.4f})')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('True Label')

plt.tight_layout()
plt.show()

In [ ]:

# @title
print(f'Unoptimized Cost: {evaluation_results["total_cost_default"]}')
print(f'Optimized Cost: {evaluation_results["total_cost_optimal"]}')

Unoptimized Cost: 19
Optimized Cost: 18

Summary of Results

Based on the analysis using manual threshold optimization, we compared the performance of the logistic regression model on the test set using a default probability threshold of 0.5 and an optimal threshold determined by evaluating costs across a range of thresholds on the training set.

Default Threshold ($0.5$): The total cost on the test set with the default threshold was $19$. With the costs used in the code ($3$ per false positive, $5$ per false negative), this corresponds to $3$ false positives and $2$ false negatives ($3 \times 3 + 2 \times 5 = 19$).
Optimal Threshold ($0.27$): Applying the threshold of $0.27$ (found by minimizing the cost on the training set) to the test set resulted in a total cost of $18$. This corresponds to $6$ false positives and $0$ false negatives ($6 \times 3 = 18$), as expected when lowering the threshold: fewer positives are missed at the price of more false alarms.

In this specific case, tuning the threshold through manual optimization resulted in a slightly lower total cost on the test set ($18$) compared to the default threshold ($19$). Nothing phenomenal, but still an improvement.

7.4 Multi-class Classification Problems

So far, we’ve focused on binary classification problems, where there are only two possible outcomes (e.g., presence or absence of a disease). However, many real-world scenarios involve classifying instances into more than two mutually exclusive classes. This is known as multi-class classification.

Evaluating multi-class models requires extending the concepts we’ve learned for binary classification. Some metrics, such as accuracy, can be applied directly. Others, like precision, recall, and F1-score, require modifications to handle multiple classes. In this section, we will explore:

How to adapt binary metrics for multi-class settings.
New metrics designed specifically for multi-class evaluation.
Best practices for interpreting these metrics in real-world applications.

Strategies for Multi-Class Evaluation: One-vs-Rest and One-vs-One

When moving from binary to multi-class classification, many evaluation metrics (such as precision, recall, and F1-score) are originally defined for two classes. To apply these metrics in a multi-class setting, we use strategies that break the problem into multiple binary evaluations. The two most common approaches are One-vs-Rest (OvR) and One-vs-One (OvO). Finally, we will discuss the multi-class confusion matrix, which is a generalization of the common binary confusion matrix.

One-vs-Rest (OvR)

When we have more than two classes, like classifying different grades of cancer (Grade 1, Grade 2, Grade 3), we can use the One-vs-Rest (OvR) strategy to evaluate our model. Think of it as breaking down the multi-class problem into several simpler binary problems.

Concept:

For each class, we essentially ask a "yes or no" question: "Is this instance of this specific class, or is it not?"

Here's how it works:

Pick one class: Let's say we pick "Cancer Grade 1".
Create a binary problem: We treat all instances of "Cancer Grade 1" as the positive class. All other instances (Cancer Grade 2, Cancer Grade 3, and any other classes) are grouped together as the negative class ("Not Cancer Grade 1").
Evaluate like binary: We then calculate standard binary classification metrics (like Precision, Recall, and F1-score) for this specific "Cancer Grade 1 vs. Rest" problem. Precision here tells us, "When the model predicted 'Cancer Grade 1', how often was it actually Grade 1?". Recall tells us, "Of all the actual 'Cancer Grade 1' cases, how many did the model correctly identify?".
Repeat for all classes: We repeat this process for every other class. So, we'll have a "Cancer Grade 2 vs. Rest" evaluation and a "Cancer Grade 3 vs. Rest" evaluation.

The One-vs-Rest strategy is a straightforward way to extend binary metrics to multi-class problems and provides detailed performance metrics for each individual class, which is very useful in medical applications where identifying a specific disease subtype might be critical. However, a potential drawback is that when some classes have very few instances compared to others (highly imbalanced), the "Rest" category can be dominated by the majority classes, potentially making the evaluation for the minority class less informative. This is a common challenge in medical datasets.
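The One-vs-Rest steps above can be sketched in a few lines. The labels below are made up for illustration (classes 0, 1, and 2 stand in for Grade 1, 2, and 3). For each class we binarize the true and predicted labels and compute ordinary binary precision and recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted grades (0 = Grade 1, 1 = Grade 2, 2 = Grade 3)
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

ovr_precision = {}
ovr_recall = {}
for cls in np.unique(y_true):
    # "This class" becomes the positive class, everything else the negative class
    true_binary = (y_true == cls).astype(int)
    pred_binary = (y_pred == cls).astype(int)
    ovr_precision[int(cls)] = precision_score(true_binary, pred_binary, zero_division=0)
    ovr_recall[int(cls)] = recall_score(true_binary, pred_binary, zero_division=0)

for cls in ovr_precision:
    print(f"Grade {cls + 1} vs. Rest: precision={ovr_precision[cls]:.3f}, "
          f"recall={ovr_recall[cls]:.3f}")
```

In practice, `precision_score(y_true, y_pred, average=None)` returns the same per-class values in one call; the explicit loop is only meant to show what happens under the hood.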

One-vs-One (OvO)

Another strategy for extending binary evaluation metrics to multi-class problems is the One-vs-One (OvO) approach. Instead of comparing each class against all others, OvO compares every possible pair of classes.

Concept:

Imagine you have classes A, B, and C. With OvO, you would set up separate evaluations for:

Class A vs. Class B
Class A vs. Class C
Class B vs. Class C

For each pair, you only consider the instances belonging to those two classes and build a binary classification problem. You then calculate binary metrics (like Precision, Recall, F1-score) for each of these pairwise comparisons.

If you have $k$ classes, the number of pairwise comparisons you need to perform is $\frac{k(k-1)}{2}$. For example, with 3 classes (A, B, C), you have $\frac{3(3-1)}{2} = 3$ pairs. With 4 classes (A, B, C, D), you have $\frac{4(4-1)}{2} = 6$ pairs.

A significant drawback of the One-vs-One strategy is that it can become computationally expensive for a large number of classes $k$, because the number of binary evaluations grows quadratically with $k$. Interpreting overall performance from many pairwise metrics is also more complex than interpreting the per-class metrics from OvR. On the other hand, the reporting can be very detailed.
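A minimal sketch of the pairwise idea, using made-up labels. For each pair we keep only the instances whose true label is one of the two classes and treat the first class of the pair as positive. One simplifying assumption is made here: a prediction of a third class counts as "not the positive class". Other conventions are possible.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import recall_score

# Hypothetical true and predicted labels for classes A, B, and C
y_true = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"])
y_pred = np.array(["A", "A", "B", "B", "B", "C", "C", "C", "C", "A"])

classes = sorted(set(y_true.tolist()))
pairs = list(combinations(classes, 2))
k = len(classes)
print(f"{k} classes -> {len(pairs)} pairs (k(k-1)/2 = {k * (k - 1) // 2})")

pairwise_recall = {}
for a, b in pairs:
    # Keep only instances that truly belong to one of the two classes
    mask = np.isin(y_true, [a, b])
    # Class `a` is the positive class in this pairwise problem
    true_binary = (y_true[mask] == a).astype(int)
    pred_binary = (y_pred[mask] == a).astype(int)
    pairwise_recall[(a, b)] = recall_score(true_binary, pred_binary, zero_division=0)

for (a, b), rec in pairwise_recall.items():
    print(f"{a} vs. {b}: recall for {a} = {rec:.3f}")
```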

Why These Matter

By breaking down the multi-class problem into a series of binary problems, we can leverage the well-understood concepts of True Positives, False Positives, True Negatives, and False Negatives to assess performance for each class or each pair of classes. This provides a more granular view of where the model is succeeding or failing beyond just overall accuracy.

Building upon these strategies, we can then calculate multi-class generalizations of key binary metrics. We can compute precision, recall, and F1-score for each class using the One-vs-Rest approach, and then combine these per-class scores using micro-averaging, macro-averaging, or weighted averaging to get overall performance summaries that account for class imbalance in different ways. Similarly, we can extend the concepts of ROC curves and Precision-Recall curves to the multi-class setting, often by plotting curves for each class (OvR) or averaging curves across pairs (OvO), providing visual tools to understand the trade-offs between different error types across varying decision thresholds.

Beyond these adaptations of binary metrics, other evaluation metrics are particularly useful in multi-class and related multi-label scenarios common in healthcare. The multi-class confusion matrix is a direct extension of the binary confusion matrix, showing the counts of actual vs. predicted classes in a table format, allowing for easy identification of which classes are being confused with each other.

Multi-Class Confusion Matrix

The OvO and OvR strategies both evaluate a multi-class problem by turning it into multiple binary problems. There is another useful plot that you will see frequently when assessing multi-class models: the multi-class confusion matrix. It counts how many times each class was correctly predicted (on the main diagonal, these are your TP) and how many times it was incorrectly predicted as another class (the off-diagonal items). If you exclude the main diagonal, each row contains the False Negatives for that class and each column contains the False Positives for that class. The better the performance, the more items lie on the main diagonal.

Below you will see an example of a confusion matrix on generated data.

In [ ]:

# @title
import numpy as np
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Generate fake data for a multi-class confusion matrix with 5 classes
n_samples = 1000
n_classes = 5

# Generate random true labels
y_true_multi = np.random.randint(0, n_classes, n_samples)

# Generate predicted labels by copying the true labels and then replacing a
# fraction of them with a random class. About 1/n_classes of the replaced labels
# land on the true class by chance, so the real error rate is somewhat lower.
misclassification_rate = 0.50  # Adjust this rate to control diagonal dominance
y_pred_multi = np.copy(y_true_multi)
random_indices = np.random.choice(n_samples, size=int(n_samples * misclassification_rate), replace=False)
y_pred_multi[random_indices] = np.random.randint(0, n_classes, size=len(random_indices))

# Ensure predicted labels stay within the class range [0, n_classes-1]
y_pred_multi = np.clip(y_pred_multi, 0, n_classes - 1)

print(f"Generated {n_samples} samples with {n_classes} classes.")

Generated 1000 samples with 5 classes.

In [ ]:

# @title
# Calculate the confusion matrix
cm_multi = confusion_matrix(y_true_multi, y_pred_multi)

# Create a pandas DataFrame for better visualization
cm_df_multi = pd.DataFrame(cm_multi, index=[f'Actual Class {i}' for i in range(n_classes)],
                           columns=[f'Predicted Class {i}' for i in range(n_classes)])

# Visualize the confusion matrix using seaborn heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm_df_multi, annot=True, fmt='d', cmap='Blues')
plt.title('Multi-Class Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
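The row and column reading of the confusion matrix described above can be sketched as follows, on a small made-up example (the labels are an assumption for illustration). True Positives sit on the diagonal, False Negatives are the rest of each row, and False Positives are the rest of each column.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a 3-class problem
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

tp = np.diag(cm)            # correct predictions per class
fn = cm.sum(axis=1) - tp    # rest of the row: class i predicted as something else
fp = cm.sum(axis=0) - tp    # rest of the column: other classes predicted as class i
tn = cm.sum() - (tp + fp + fn)

print(cm)
for i in range(cm.shape[0]):
    print(f"Class {i}: TP={tp[i]}, FP={fp[i]}, FN={fn[i]}, TN={tn[i]}")
```

These per-class counts are exactly the ones a One-vs-Rest evaluation would produce for each class.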

Averaging Strategies for Multi-Class Metrics

When performing One-vs-Rest or One-vs-One evaluation, we get performance metrics (like Precision, Recall, F1-score) for each individual class or each pair of classes. While per-class metrics are informative and should always be shown, we often need a single overall score to summarize the model's performance across all classes. This is where averaging strategies come in.

Micro-averaging

Micro-averaging is an averaging strategy that aggregates the contributions of all classes to compute the overall average metric. Instead of calculating the metric (like precision or recall) for each class separately and then averaging, micro-averaging is calculated globally. This is done by summing up the individual True Positives (TP), False Positives (FP), and False Negatives (FN) from the confusion matrices of all the per-class binary problems (as in the One-vs-Rest approach).

For example, to calculate the micro-averaged precision, you would sum up the TPs from all classes and divide by the sum of TPs and FPs from all classes:

$$ \text{Micro-averaged Precision} = \frac{\sum_{i=1}^{k} TP_i}{\sum_{i=1}^{k} TP_i + \sum_{i=1}^{k} FP_i} $$

where $TP_i$ and $FP_i$ are the True Positives and False Positives for class $i$, and $k$ is the number of classes.

A key characteristic of micro-averaging is that it is heavily influenced by the performance on larger classes. Classes with more instances will contribute more to the sums of TP, FP, and FN, and thus have a greater impact on the final micro-averaged score.

Crucially, for single-label multi-class problems, the micro-averaged Precision, Recall, and F1-score are all equal to the overall accuracy of the classifier. The sum of TPs across all binary OvR problems equals the total number of correctly classified instances. Every misclassified instance is counted exactly once as a False Positive (for the class that was predicted) and exactly once as a False Negative (for the class it actually belongs to), so Total FP = Total FN = the number of misclassified instances. It follows that Total TP + Total FP = Total TP + Total FN = Total Instances, and therefore:

$$ \text{Micro-averaged F1} = \frac{2 \times \text{Total TP}}{2 \times \text{Total TP} + \text{Total FP} + \text{Total FN}} = \frac{\text{Total Correct Predictions}}{\text{Total Instances}} = \text{Overall Accuracy} $$

This equivalence means that micro-averaging might not be the most informative metric when dealing with imbalanced datasets, as it mirrors the potential misleading nature of overall accuracy in such cases. It essentially treats every instance prediction equally, regardless of the class it belongs to.
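The equivalence between micro-averaged metrics and accuracy can be verified on a small made-up example (the labels are an assumption for illustration). The sketch sums TP, FP, and FN over all classes by hand and compares the result with scikit-learn's `average="micro"` option.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels for a 3-class problem
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

# Sum the per-class (One-vs-Rest) counts over all classes
cm = confusion_matrix(y_true, y_pred)
total_tp = np.diag(cm).sum()
total_fp = (cm.sum(axis=0) - np.diag(cm)).sum()
total_fn = (cm.sum(axis=1) - np.diag(cm)).sum()

micro_precision_manual = total_tp / (total_tp + total_fp)

acc = accuracy_score(y_true, y_pred)
micro_p = precision_score(y_true, y_pred, average="micro")
micro_r = recall_score(y_true, y_pred, average="micro")
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"Total TP={total_tp}, Total FP={total_fp}, Total FN={total_fn}")
print(f"Accuracy={acc:.2f}, micro P={micro_p:.2f}, micro R={micro_r:.2f}, micro F1={micro_f1:.2f}")
```

Each of the three misclassified instances shows up once as a False Positive and once as a False Negative, which is why all four numbers coincide at 0.70.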

Macro-averaging

Macro-averaging is an averaging strategy that calculates the metric independently for each class and then takes the simple average of these per-class metrics. Unlike micro-averaging, macro-averaging gives equal weight to each class, regardless of the number of instances it contains.

Here's how it works:

Calculate Metric per Class: Compute the desired metric for each class individually, typically using the One-vs-Rest approach.
Simple Average: Sum up the metric values obtained for each class and divide by the total number of classes.

For example, the macro-averaged F1-score is calculated as:

$$ \text{Macro-averaged F1} = \frac{1}{k} \sum_{i=1}^{k} F1_i $$

where $F1_i$ is the F1-score for class $i$, and $k$ is the number of classes.

Key Characteristics:

Equal Class Contribution: Each class has the same impact on the final macro-averaged score, even if one class has significantly more instances than another.
Sensitivity to Minority Classes: Macro-averaging is particularly useful when you want to ensure that the model performs well on minority classes. A poor performance on a small class will significantly lower the macro-averaged score, even if the model performs excellently on large classes.
Not Equivalent to Accuracy: Unlike micro-averaging, macro-averaging is generally not equivalent to overall accuracy, especially in the presence of class imbalance.

Macro-averaging is a valuable metric when your goal is to give equal importance to the correct classification of instances from all classes, highlighting issues with performance on rare classes.
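To illustrate how macro-averaging exposes poor performance on a rare class, here is a minimal sketch with made-up, imbalanced labels (95 majority and 5 minority instances are an assumption for illustration). The model finds only one of the five minority cases.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced problem: 95 instances of class 0, 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)
# The model finds only one of the five minority-class instances
y_pred = np.array([0] * 95 + [1] + [0] * 4)

per_class_f1 = f1_score(y_true, y_pred, average=None)
macro_f1 = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)

print(f"Per-class F1: {np.round(per_class_f1, 3)}")
print(f"Macro F1 (simple mean of per-class F1): {macro_f1:.3f}")
print(f"Accuracy: {acc:.3f}")
```

Accuracy stays high because the majority class dominates, while the macro-averaged F1 is pulled down by the weak F1-score of the minority class.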

Weighted-averaging

Weighted-averaging is a strategy that calculates the metrics independently for each class, similar to macro-averaging, but then takes a weighted average where the weight for each class is proportional to the number of instances in that class (its support). This provides a balance between macro-averaging (equal weight to all classes) and micro-averaging (which implicitly weights by instance count).

Here's how it works:

Calculate Metric per Class: Compute the desired metric (e.g., Precision, Recall, F1-score) for each class individually.
Calculate Class Support: Determine the number of instances for each class in the dataset (or the relevant subset being evaluated).
Weighted Average: Sum up the metric values obtained for each class, multiplied by their respective support, and then divide by the total number of instances.

For example, the weighted-averaged F1-score is calculated as:

$$ \text{Weighted-averaged F1} = \frac{\sum_{i=1}^{k} F1_i \times \text{Support}_i}{\sum_{i=1}^{k} \text{Support}_i} $$

where $F1_i$ is the F1-score for class $i$, $\text{Support}_i$ is the number of instances in class $i$, and $k$ is the number of classes.

Key Characteristics:

Reflects Class Distribution: The weighted average score reflects the model's performance on the dataset as a whole, taking into account the frequency of each class.
Compromise Metric: It sits between micro-averaging and macro-averaging. It is not as sensitive to poor performance on very small minority classes as macro-averaging is, but it doesn't completely ignore the performance on those classes like micro-averaging can appear to do (since micro average is equivalent to accuracy).
Often Used for Overall Summary: Weighted-averaging is a common choice for providing a single summary metric for multi-class classification, as it gives a sense of the model's overall performance while still accounting for potential imbalance by weighting by class frequency.

Weighted-averaging is particularly useful when you want an overall metric that reflects the model's performance in proportion to how often each class appears in the data.
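The three weighted-averaging steps above can be sketched by hand and compared with scikit-learn's `average="weighted"` option. The labels are made up for illustration (95 majority and 5 minority instances are an assumption).

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical imbalanced problem: 95 instances of class 0, 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)
# The model finds only one of the five minority-class instances
y_pred = np.array([0] * 95 + [1] + [0] * 4)

# Step 1: metric per class
per_class_f1 = f1_score(y_true, y_pred, average=None)
# Step 2: support (number of true instances) per class
support = np.bincount(y_true)
# Step 3: support-weighted average
weighted_f1_manual = np.sum(per_class_f1 * support) / np.sum(support)

weighted_f1 = f1_score(y_true, y_pred, average="weighted")
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"Support per class: {support}")
print(f"Weighted F1 (manual):  {weighted_f1_manual:.3f}")
print(f"Weighted F1 (sklearn): {weighted_f1:.3f}")
print(f"Macro F1:              {macro_f1:.3f}")
```

Because the minority class only carries 5% of the weight, the weighted score ends up much closer to the majority-class F1 than the macro score does.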

Example: Generalizing Binary Classification Metrics (Part 1)

Remember the classification metrics that we used for binary classification? We are going to apply those to a multi-class classification problem by using a One-vs-Rest strategy. Then we will discuss various strategies of combining them into a single metric. We will use the Forest cover type dataset from scikit-learn. This is a more difficult dataset to work with which is also imbalanced.

Should you wish to run the code, loading the data will take a minute and training the machine learning model will take a few minutes.

In [ ]:

# @title
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef, balanced_accuracy_score, confusion_matrix
import numpy as np

In [ ]:

# @title
X, y = fetch_covtype(return_X_y=True)
y -= 1

# Tabulate the number of occurrences for each cover type
cover_type_counts = pd.Series(y).value_counts().sort_index()

# Create a DataFrame for better visualization
cover_type_df = cover_type_counts.to_frame(name='Count').T

# Apply styling for better readability (e.g., format with commas)
styled_cover_type_df = cover_type_df.style.format("{:,}")

# Display the counts in a table
print("Number of instances per cover type: ")
display(styled_cover_type_df)
print('\nYou will see that class 2, 3, 4, 5, and 6 are not found as much as classes 0 and 1.')

Number of instances per cover type:

	0	1	2	3	4	5	6
Count	211,840	283,301	35,754	2,747	9,493	17,367	20,510

You will see that class 2, 3, 4, 5, and 6 are not found as much as classes 0 and 1.

In [ ]:

# @title
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Data split into training and testing sets.")
print(f"Training set shape: {X_train.shape}, {y_train.shape}")
print(f"Testing set shape: {X_test.shape}, {y_test.shape}")

Data split into training and testing sets.
Training set shape: (464809, 54), (464809,)
Testing set shape: (116203, 54), (116203,)

In [ ]:

# @title

# Initialize and train a RandomForestClassifier model
# RandomForest is often faster than SVM or Logistic Regression on large tabular datasets
# You might need to adjust parameters like n_estimators or max_depth for performance/speed
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42, verbose=1, n_jobs=-1) # n_jobs=-1 uses all available cores

# Train the model
print("Starting RandomForest model training...")
random_forest_model.fit(X_train, y_train)

print("RandomForest model trained successfully.")

Starting RandomForest model training...

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.3min

RandomForest model trained successfully.

[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  2.6min finished

In [ ]:

# @title
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef, balanced_accuracy_score, confusion_matrix
import pandas as pd
import numpy as np



def calculate_per_class_ovr_metrics(y_true, y_pred):
    """
    Performs One-vs-Rest evaluation and calculates metrics for each class.

    Args:
        y_true (np.ndarray): True labels.
        y_pred (np.ndarray): Predicted labels.

    Returns:
        tuple: A tuple containing:
               - per_class_df (pd.DataFrame): DataFrame with per-class metrics.
               - per_class_tps (list): List of per-class True Positives.
               - per_class_fps (list): List of per-class False Positives.
               - per_class_tns (list): List of per-class True Negatives.
               - per_class_fns (list): List of per-class False Negatives.
               - class_supports (list): List of class supports (actual counts).
    """
    classes = np.unique(y_true)
    class_labels = []
    accuracies = []
    precisions = []
    recalls = []
    specificities = []
    mccs = []
    f1_scores = []
    balanced_accuracies = []
    class_supports = []

    per_class_tps = []
    per_class_fps = []
    per_class_tns = []
    per_class_fns = []

    for class_id in classes:
        y_true_binary = (y_true == class_id).astype(int)
        y_pred_binary = (y_pred == class_id).astype(int)

        # Passing labels=[0, 1] guarantees a 2x2 matrix, even when the class
        # is absent from both the true and the predicted labels
        cm_binary = confusion_matrix(y_true_binary, y_pred_binary, labels=[0, 1])
        tn, fp, fn, tp = cm_binary.ravel()

        per_class_tps.append(tp)
        per_class_fps.append(fp)
        per_class_tns.append(tn)
        per_class_fns.append(fn)

        accuracy = accuracy_score(y_true_binary, y_pred_binary)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0
        mcc = matthews_corrcoef(y_true_binary, y_pred_binary)
        f1 = f1_score(y_true_binary, y_pred_binary)
        balanced_acc = balanced_accuracy_score(y_true_binary, y_pred_binary)

        support = np.sum(y_true == class_id)

        class_labels.append(f'Class {class_id}')
        accuracies.append(accuracy)
        precisions.append(precision)
        recalls.append(recall)
        specificities.append(specificity)
        mccs.append(mcc)
        f1_scores.append(f1)
        balanced_accuracies.append(balanced_acc)
        class_supports.append(support)

    per_class_df = pd.DataFrame({
        'Class': class_labels,
        'Accuracy': accuracies,
        'Precision': precisions,
        'Recall': recalls,
        'Specificity': specificities,
        'MCC': mccs,
        'F1 Score': f1_scores,
        'Balanced Accuracy': balanced_accuracies
    })
    per_class_df = per_class_df.set_index('Class')

    return per_class_df, per_class_tps, per_class_fps, per_class_tns, per_class_fns, class_supports

def calculate_multi_class_averaged_metrics(y_true, y_pred, per_class_df, per_class_tps, per_class_fps, per_class_tns, per_class_fns, class_supports):
    """
    Calculates micro, macro, and weighted averaged multi-class metrics.

    Args:
        y_true (np.ndarray): True labels (needed for overall metrics).
        y_pred (np.ndarray): Predicted labels (needed for overall metrics).
        per_class_df (pd.DataFrame): DataFrame with per-class metrics.
        per_class_tps (list): List of per-class True Positives.
        per_class_fps (list): List of per-class False Positives.
        per_class_tns (list): List of per-class True Negatives.
        per_class_fns (list): List of per-class False Negatives.
        class_supports (list): List of class supports (actual counts).

    Returns:
        pd.DataFrame: DataFrame with averaged metrics (Micro, Macro, Weighted).
    """
    # --- Calculate Micro Averages ---
    # For single-label multi-class data, micro-averaged precision, recall, and F1
    # all equal the overall accuracy. scikit-learn's multi-class MCC and balanced
    # accuracy are not micro averages in the strict sense (multi-class balanced
    # accuracy is the mean of the per-class recalls); they are reported in this
    # row as the overall multi-class scores.
    micro_accuracy = accuracy_score(y_true, y_pred)
    micro_precision = precision_score(y_true, y_pred, average='micro', zero_division=0)
    micro_recall = recall_score(y_true, y_pred, average='micro', zero_division=0)
    micro_f1 = f1_score(y_true, y_pred, average='micro', zero_division=0)
    micro_mcc = matthews_corrcoef(y_true, y_pred)
    micro_balanced_accuracy = balanced_accuracy_score(y_true, y_pred)

    # Micro specificity has no standard multi-class definition. As one possible
    # interpretation, pool the One-vs-Rest true negatives and false positives
    # over all classes and compute TN / (TN + FP) on the pooled counts.
    sum_tn_ovr = np.sum(per_class_tns)
    sum_fp_ovr = np.sum(per_class_fps)
    micro_specificity_ovr_sum = sum_tn_ovr / (sum_tn_ovr + sum_fp_ovr) if (sum_tn_ovr + sum_fp_ovr) > 0 else 0.0

    micro_metrics = {
        'Accuracy': micro_accuracy,
        'Precision': micro_precision,
        'Recall': micro_recall,
        'Specificity': micro_specificity_ovr_sum, # Using the sum of OvR TN/FP
        'MCC': micro_mcc,
        'F1 Score': micro_f1,
        'Balanced Accuracy': micro_balanced_accuracy
    }


    # --- Calculate Macro Averages ---
    accuracies = per_class_df['Accuracy'].tolist()
    precisions = per_class_df['Precision'].tolist()
    recalls = per_class_df['Recall'].tolist()
    specificities = per_class_df['Specificity'].tolist()
    mccs = per_class_df['MCC'].tolist()
    f1_scores = per_class_df['F1 Score'].tolist()
    balanced_accuracies = per_class_df['Balanced Accuracy'].tolist()

    macro_avg_metrics = {
        'Accuracy': np.mean(accuracies),
        'Precision': np.mean(precisions),
        'Recall': np.mean(recalls),
        'Specificity': np.mean(specificities),
        'MCC': np.mean(mccs),
        'F1 Score': np.mean(f1_scores),
        'Balanced Accuracy': np.mean(balanced_accuracies)
    }

    # --- Calculate Weighted Averages ---
    total_support = np.sum(class_supports)
    if total_support == 0:
        weighted_avg_metrics = {metric: 0.0 for metric in macro_avg_metrics.keys()}
    else:
        weighted_avg_metrics = {
            'Accuracy': np.sum(np.array(accuracies) * np.array(class_supports)) / total_support,
            'Precision': np.sum(np.array(precisions) * np.array(class_supports)) / total_support,
            'Recall': np.sum(np.array(recalls) * np.array(class_supports)) / total_support,
            'Specificity': np.sum(np.array(specificities) * np.array(class_supports)) / total_support,
            'MCC': np.sum(np.array(mccs) * np.array(class_supports)) / total_support,
            'F1 Score': np.sum(np.array(f1_scores) * np.array(class_supports)) / total_support,
            'Balanced Accuracy': np.sum(np.array(balanced_accuracies) * np.array(class_supports)) / total_support
        }

    # Create DataFrame for averaged metrics
    averaged_metrics_data = {
        'Micro Average': list(micro_metrics.values()),
        'Macro Average': list(macro_avg_metrics.values()),
        'Weighted Average': list(weighted_avg_metrics.values())
    }

    averaged_metrics_df = pd.DataFrame(averaged_metrics_data, index=list(micro_metrics.keys())).T

    return averaged_metrics_df


# --- Main Execution Block ---

# Assuming y_test and y_pred are available from previous cells
# Get predictions for the test set (if not already done)
try:
    y_pred = random_forest_model.predict(X_test)
except NameError:
    print("Error: random_forest_model or X_test not found. Please ensure model is trained and data is split.")
    y_pred = np.zeros_like(y_test) # Fallback

# Calculate per-class metrics
per_class_df, per_class_tps, per_class_fps, per_class_tns, per_class_fns, class_supports = calculate_per_class_ovr_metrics(y_test, y_pred)

# Calculate averaged metrics
averaged_results_df = calculate_multi_class_averaged_metrics(y_test, y_pred, per_class_df, per_class_tps, per_class_fps, per_class_tns, per_class_fns, class_supports)


# --- Display Per-Class Results ---
print("One-vs-Rest Evaluation Metrics per Class:")

# Define styling function with lighter colors
def highlight_classes(row):
    styles = [''] * len(row)
    # Assuming row.name is in 'Class X' format
    class_id_str = row.name.split(' ')[-1]
    if class_id_str in ['3', '4']:
        styles = ['background-color: #FFCCCC'] * len(row) # Lighter red
    elif class_id_str in ['0', '1']:
        styles = ['background-color: #CCFFCC'] * len(row) # Lighter green
    return styles

# Apply styling and formatting
styled_per_class_df = per_class_df.style \
    .apply(highlight_classes, axis=1) \
    .format(lambda x: f'{x*100:.1f}') # Multiply by 100 and format to 1 decimal place

# Display the styled DataFrame
display(styled_per_class_df)

# --- Display Averaged Results ---
print("\nAveraged Metrics:")

# Apply formatting (multiply by 100 and round to 1 decimal place)
styled_averaged_results_df = averaged_results_df.style.format(lambda x: f'{x*100:.1f}')

# Display the styled DataFrame
display(styled_averaged_results_df)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    1.6s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    3.5s finished

One-vs-Rest Evaluation Metrics per Class:

	Accuracy	Precision	Recall	Specificity	MCC	F1 Score	Balanced Accuracy
Class
Class 0	96.6	96.3	94.1	97.9	92.5	95.2	96.0
Class 1	96.1	94.8	97.3	95.0	92.2	96.0	96.1
Class 2	99.4	93.9	96.0	99.6	94.6	95.0	97.8
Class 3	99.9	91.6	85.6	100.0	88.5	88.5	92.8
Class 4	99.6	95.4	77.0	99.9	85.5	85.2	88.5
Class 5	99.5	93.0	89.0	99.8	90.7	91.0	94.4
Class 6	99.7	97.3	94.6	99.9	95.8	95.9	97.3

Averaged Metrics:

	Accuracy	Precision	Recall	Specificity	MCC	F1 Score	Balanced Accuracy
Micro Average	95.3	95.3	95.3	99.2	92.5	95.3	90.5
Macro Average	98.7	94.6	90.5	98.9	91.4	92.4	94.7
Weighted Average	96.8	95.3	95.3	96.8	92.4	95.3	96.0

In [ ]:

# @title
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the confusion matrix for the test set
cm_forest = confusion_matrix(y_test, y_pred)

# Create a pandas DataFrame for better visualization
# The class labels for the Forest Cover Type dataset are 1 to 7, but scikit-learn
# might output them as 0 to 6 depending on how the data was loaded/processed.
# We will get the unique class labels from the true test data.
class_labels = np.unique(y_test)
cm_df_forest = pd.DataFrame(cm_forest,
                            index=[f'Actual Class {i}' for i in class_labels],
                            columns=[f'Predicted Class {i}' for i in class_labels])

# Visualize the confusion matrix using seaborn heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(cm_df_forest, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for Forest Cover Type Classification')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Conclusion

One-vs-Rest Table

Based on the One-vs-Rest evaluation table, we can observe the Random Forest classifier's performance on this imbalanced dataset. The model demonstrates strong performance on the majority classes (Classes 0 and 1, highlighted in green), with high accuracy, precision, recall, and F1-scores.

However, the high accuracy reported for the minority classes (Classes 3 and 4, highlighted in red) can be misleading due to their low prevalence. While accuracy appears high (above 99%), this is largely influenced by correctly identifying the majority of instances as not belonging to these rare classes (True Negatives).

Looking at other metrics provides a more realistic view of performance on these minority classes. Recall (Sensitivity) for Class 3 (85.6%) and Class 4 (77.0%) is notably lower than for the majority classes (94.1% and 97.3%), indicating that the model misses a significant portion of actual instances from these groups (more False Negatives). Precision tells a more mixed story: it is lower for Class 3 (91.6%), but for Class 4 (95.4%) it is on par with the majority classes (96.3% and 94.8%), so when the model does predict Class 4 it is usually right. Metrics like F1-score, Balanced Accuracy, and MCC, which provide a more balanced assessment than accuracy, also show considerably lower values for the minority classes (e.g., F1-scores of 88.5% and 85.2%, versus 95.2% and 96.0% for the majority classes), confirming that performance on these rarer classes is not as robust as accuracy alone would suggest. This highlights why it's crucial to examine multiple metrics when dealing with imbalanced data in medical applications, where correctly identifying rare conditions is often critical.

NOTE: the high per-class OvR accuracies for minority classes primarily reflect the model's ability to correctly identify instances that do not belong to that specific rare class (TN), rather than its ability to correctly identify instances that do belong to it (TP). The overall accuracy provides a more realistic picture of the model's performance across the entire multi-class problem, taking into account the model's performance on all classes and the prevalence of each class. Recall the accuracy formula, $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$: for a rare class the TN term dominates both the numerator and the denominator, so the score stays high almost regardless of TP.
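To make this note concrete, the sketch below takes the class counts from the table at the start of this example and computes the OvR accuracy of a hypothetical classifier that never predicts the class in question. The variable names are illustrative, and the counts are those of the full dataset rather than of the test split.

```python
import numpy as np

# Instances per cover type (classes 0..6), copied from the class-count table
counts = np.array([211_840, 283_301, 35_754, 2_747, 9_493, 17_367, 20_510])
total = counts.sum()

# Hypothetical "never predict this class" baseline: TP = 0 and FP = 0, so
# accuracy = TN / (TN + FN) = 1 - prevalence, while recall is always 0
prevalence = counts / total
baseline_accuracy = 1.0 - prevalence

for class_id in range(len(counts)):
    print(f"Class {class_id}: prevalence {prevalence[class_id]:6.2%}, "
          f"baseline OvR accuracy {baseline_accuracy[class_id]:6.2%}, recall 0%")
```

For the rarest classes this do-nothing baseline already scores above 99% OvR accuracy, which is why the per-class accuracies in the table say little about how well those classes are detected.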

Averaged Metrics

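The averaged metrics table provides single summary scores for the model's performance across all classes. In a single-label multi-class problem, the micro-averaged Precision, Recall, and F1 are all equal to the overall accuracy (95.3%), reflecting the model's performance on the dataset as a whole, where every instance is equally important. This average is heavily influenced by the performance on the larger, majority classes. The Specificity, MCC, and Balanced Accuracy entries in the Micro row are computed differently (from pooled One-vs-Rest counts or scikit-learn's multi-class functions), which is why they do not equal 95.3%.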
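The equality between micro-averaged Precision, Recall, F1, and overall accuracy can be checked by hand. The following is a minimal NumPy sketch on hypothetical toy labels (not the Forest Cover data): every misclassified instance counts once as a False Positive for the predicted class and once as a False Negative for the true class, so the pooled FP and FN totals are both equal to the number of errors.

```python
import numpy as np

# Hypothetical toy labels for a 3-class problem
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 1])

classes = np.unique(y_true)
# One-vs-Rest counts per class
tp = np.array([np.sum((y_true == c) & (y_pred == c)) for c in classes])
fp = np.array([np.sum((y_true != c) & (y_pred == c)) for c in classes])
fn = np.array([np.sum((y_true == c) & (y_pred != c)) for c in classes])

accuracy = np.mean(y_true == y_pred)

# Micro averaging pools the counts over all classes before computing the metric
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(f"Accuracy:        {accuracy:.3f}")
print(f"Micro precision: {micro_precision:.3f}")
print(f"Micro recall:    {micro_recall:.3f}")
print(f"Micro F1:        {micro_f1:.3f}")
```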

The Macro Average provides a different perspective by giving equal weight to each class, including the smaller minority classes. Comparing the Macro average values to the Micro average (e.g., Macro F1 is 92.4% vs. Micro F1/Accuracy 95.3%) highlights that the model performs less well on the minority classes compared to the majority classes, as poor performance on small classes has a larger impact on the Macro average. This makes the Macro average a more informative metric than Micro average (or overall accuracy) when evaluating performance on imbalanced datasets, as it clearly shows issues with minority classes.

Finally, the Weighted Average provides a balance between the Micro and Macro averages. It considers the performance on each class but weights its contribution by the class's frequency in the dataset. This gives a sense of the model's overall performance while still accounting for the class distribution. The Weighted Average F1 (95.3%) in this case is very close to the Micro Average/Overall Accuracy, which again reflects the dominance of the majority classes in the dataset. Comparing these different averages helps provide a more complete understanding of the model's performance beyond just a single number, especially in the context of imbalanced data.
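As a worked check, the macro and weighted averages can be approximately reproduced from the per-class table. The sketch below does this for Recall; it assumes that the full-dataset class counts are proportional to the test-set supports (reasonable here because the split was stratified), and the per-class values are the rounded percentages from the table, so small rounding differences are possible.

```python
import numpy as np

# Per-class Recall (%) copied from the One-vs-Rest table, classes 0..6
recall = np.array([94.1, 97.3, 96.0, 85.6, 77.0, 89.0, 94.6])

# Full-dataset class counts, used as a stand-in for the test-set supports
# (assumption: the stratified split keeps the class proportions)
support = np.array([211_840, 283_301, 35_754, 2_747, 9_493, 17_367, 20_510])

macro_recall = recall.mean()                            # every class counts equally
weighted_recall = np.average(recall, weights=support)   # classes count by frequency

print(f"Macro recall:    {macro_recall:.1f}")
print(f"Weighted recall: {weighted_recall:.1f}")
```

The two numbers should land close to the 90.5 and 95.3 reported in the averaged metrics table: the macro value is pulled down by Classes 3 and 4, while the weighted value is dominated by Classes 0 and 1.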

Confusion Matrix

The confusion matrix shows that the model is effective at classifying the majority classes, but that it makes mistakes when classifying minority classes.

Multi-Class ROC and Precision Recall Curves

We've already seen and discussed the binary cases of the ROC (and the AUC) and the Precision/Recall curve. Once again we can choose the One-vs-Rest (OvR) strategy to create the plots and use an averaging strategy to calculate the AUC for both types of curves. It is also possible to use the OvO strategy, but we've chosen not to show this case in the notebook for the sake of brevity.

Multi-Class ROC Curve and AUC

In the multi-class setting, we can generate an ROC curve for each class using the OvR approach. For each class, we treat it as the positive class and all other classes as the negative class. We then calculate the True Positive Rate (TPR) and False Positive Rate (FPR) at various probability thresholds for this binary problem and plot the ROC curve.

The Area Under the ROC Curve (AUC) can then be calculated for each individual class's OvR curve. To get a single overall metric, we can use the averaging strategies discussed earlier.

The multi-class ROC curve and its AUC are useful for understanding the model's overall ability to discriminate between classes across various thresholds.

Multi-Class Precision-Recall Curve and AUPRC

Similar to the ROC curve, we can generate a Precision-Recall curve for each class using the OvR approach. For each class, we plot Precision against Recall at various probability thresholds.

The Area Under the Precision-Recall Curve (AUPRC or AP) is particularly informative in the multi-class setting, especially when dealing with imbalanced datasets. We can calculate the AUPRC for each individual class's OvR curve. Again, these can be averaged using one of the averaging strategies mentioned earlier.

The multi-class Precision-Recall curve and its AUPRC are preferred over ROC when dealing with imbalanced datasets, as they focus on the model's ability to correctly identify positive instances without generating too many false positives, which is crucial for minority classes.
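One way to see why the Precision-Recall view is more sensitive under imbalance is to look at what an uninformative classifier scores. The sketch below is a NumPy-only illustration on synthetic data (all names and the 1% prevalence are assumptions, not part of the Forest Cover example): random scores give an ROC AUC near 0.5 regardless of prevalence, whereas the average precision drops to roughly the prevalence of the positive class.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000
prevalence = 0.01  # hypothetical rare positive class (about 1% of instances)

y_true = (rng.random(n_samples) < prevalence).astype(int)
y_score = rng.random(n_samples)  # uninformative scores, unrelated to y_true

# Rank instances from highest to lowest score
order = np.argsort(-y_score)
y_sorted = y_true[order]
tp_cum = np.cumsum(y_sorted)            # positives found among the top-k
rank = np.arange(1, n_samples + 1)      # k = 1..N
precision_at_k = tp_cum / rank

# Average precision: mean of precision@k over the ranks that hold a positive
average_precision = precision_at_k[y_sorted == 1].mean()

# ROC AUC: fraction of (positive, negative) pairs in which the positive ranks higher
n_pos = int(y_sorted.sum())
n_neg = n_samples - n_pos
roc_auc = tp_cum[y_sorted == 0].sum() / (n_pos * n_neg)

print(f"Observed prevalence: {y_true.mean():.4f}")
print(f"ROC AUC:             {roc_auc:.4f}")
print(f"Average precision:   {average_precision:.4f}")
```

A minority-class AUPRC should therefore be judged against that class's prevalence rather than against 0.5, which leaves much more room for differences between classes to show up.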

Have a look at the code below, which demonstrates how to plot these curves and calculate the averaged AUC and AUPRC values for the Forest Cover Type dataset. You will see that the Precision-Recall curves illustrate the harder-to-predict minority classes more clearly, whereas the ROC curves barely differentiate them. Note that all scores are very high (it's a toy dataset!).

In [ ]:

# @title
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
import numpy as np

# Assuming y_test and y_prob are available from previous cells
# If not, you would need to get predictions and probabilities first
try:
    y_prob = random_forest_model.predict_proba(X_test)
    classes = np.unique(y_test)
    n_classes = len(classes)

    # Binarize the true labels for OvR
    y_test_bin = label_binarize(y_test, classes=classes)

    # --- Plot Multi-Class ROC Curve ---
    plt.figure(figsize=(10, 8))

    # Plot ROC for each class (OvR)
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_prob[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
        plt.plot(fpr[i], tpr[i], lw=2,
                 label=f'ROC curve of class {classes[i]} (area = {roc_auc[i]:.2f})')

    # Plot the diagonal line (random classifier)
    plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Random (area = 0.50)')

    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Multi-Class Receiver Operating Characteristic (ROC) Curve (OvR)')
    plt.legend(loc="lower right")
    plt.grid(True)
    plt.show()

    # --- Calculate Averaged ROC AUC ---
    # Micro-average ROC curve and ROC area
    fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_prob.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

    # Macro-average ROC AUC
    # First aggregate all false positive rates
    all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
    # Then interpolate all ROC curves at these points
    mean_tpr = np.zeros_like(all_fpr)
    for i in range(n_classes):
        mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
    # Finally average it and compute AUC
    mean_tpr /= n_classes
    fpr["macro"] = all_fpr
    tpr["macro"] = mean_tpr
    roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

    # Weighted-average ROC AUC (using scikit-learn function which weights by support)
    # Note: scikit-learn's roc_auc_score with average='weighted' is easier
    from sklearn.metrics import roc_auc_score
    roc_auc["weighted"] = roc_auc_score(y_test, y_prob, multi_class='ovr', average='weighted')


    print("\nAveraged ROC AUC Scores:")
    print(f"Micro Average AUC: {roc_auc['micro']:.4f}")
    print(f"Macro Average AUC: {roc_auc['macro']:.4f}")
    print(f"Weighted Average AUC: {roc_auc['weighted']:.4f}")


    # --- Plot Multi-Class Precision-Recall Curve ---
    plt.figure(figsize=(10, 8))

    # Plot PR for each class (OvR)
    precision = dict()
    recall = dict()
    average_precision = dict()
    for i in range(n_classes):
        precision[i], recall[i], _ = precision_recall_curve(y_test_bin[:, i], y_prob[:, i])
        average_precision[i] = average_precision_score(y_test_bin[:, i], y_prob[:, i])
        plt.plot(recall[i], precision[i], lw=2,
                 label=f'PR curve of class {classes[i]} (area = {average_precision[i]:.2f})')

    # No baseline is drawn: for OvR, the chance-level precision of each curve
    # equals the prevalence of that class, so there is no single baseline
    # shared by all classes.

    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Multi-Class Precision-Recall Curve (OvR)')
    plt.legend(loc="lower left")
    plt.grid(True)
    plt.show()

    # --- Calculate Averaged AUPRC ---
    # Micro-average AUPRC (using scikit-learn direct function)
    average_precision["micro"] = average_precision_score(y_test_bin, y_prob, average='micro')

    # Macro-average AUPRC
    # Simple average of the per-class AUPRCs (integer keys only, so 'micro' is excluded)
    average_precision["macro"] = np.mean([average_precision[i] for i in range(n_classes)])

    # Weighted-average AUPRC (using scikit-learn function)
    # Pass the binarized labels, as for the micro average above
    average_precision["weighted"] = average_precision_score(y_test_bin, y_prob, average='weighted')


    print("\nAveraged Precision-Recall AUC (AUPRC) Scores:")
    print(f"Micro Average AUPRC: {average_precision['micro']:.4f}")
    print(f"Macro Average AUPRC: {average_precision['macro']:.4f}")
    print(f"Weighted Average AUPRC: {average_precision['weighted']:.4f}")


except NameError:
    print("Error: random_forest_model, X_test, or y_test not found. Please ensure model is trained and data is split.")
except Exception as e:
    print(f"An error occurred: {e}")

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    2.7s
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    7.3s finished

Averaged ROC AUC Scores:
Micro Average AUC: 0.9985
Macro Average AUC: 0.9978
Weighted Average AUC: 0.9951

Averaged Precision-Recall AUC (AUPRC) Scores:
Micro Average AUPRC: 0.9910
Macro Average AUPRC: 0.9767
Weighted Average AUPRC: 0.9908

7.5 Multi-Label Classification Problems

We've learned about binary classification, where an attribute is either present or absent, and about multi-class classification, where exactly one of several classes applies. These problems share the assumption that classes are mutually exclusive. But not all problems are like this. For example, a chest X-ray shows multiple organs. If you want to indicate which organs are present, the previous problem definitions no longer fit, because several things can be present or absent at the same time. A problem of this type is called a multi-label classification problem. As we will see, there is some carry-over from the problems we have already discussed, but some additional metrics will have to be introduced.

Challenges with multi-label data

Multi-label classification presents several unique challenges compared to binary or multi-class problems, primarily due to the possibility of multiple labels being associated with a single instance and the relationships between these labels.

Label Dependence: A significant challenge is that labels are often not independent of each other. For example, in medical image analysis, a device that is visible on an image is in most cases associated with the presence of an illness. Think of the tube that is visible on an X-ray image of an intubated patient: a patient who is intubated may well have pneumonia, but not every patient with pneumonia is intubated. Many traditional multi-label models implicitly assume label independence, which can limit their ability to capture these relationships and lead to suboptimal performance, as they might miss important patterns or predict improbable combinations of labels.
Data Imbalance: Similar to multi-class problems, multi-label datasets frequently suffer from data imbalance, but it's at the label level. Some labels appear much more frequently across the dataset than others. Training models to perform well on rare labels is difficult because there are limited examples to learn from. This can be particularly problematic in domains like healthcare, where rare diseases or conditions are often the most critical to identify correctly.
Evaluation Complexity: Evaluating the performance of multi-label models is more complex than evaluating binary or multi-class classifiers. A prediction for a single instance can be partially correct (some labels predicted correctly, others missed or falsely included). Traditional metrics and confusion matrices don't easily capture this nuance.
Threshold Selection: Unlike binary or multi-class problems where a single decision boundary or threshold is often sufficient for prediction, multi-label classification typically involves predicting probability scores for each label independently. Converting these scores into binary predictions requires setting a threshold for each label. Choosing the right threshold(s) is crucial as it directly impacts the trade-off between false positives and false negatives for each label and can significantly affect overall performance, especially when dealing with varying label frequencies or asymmetric costs of errors. Finding optimal thresholds, potentially different for each label, adds another layer of complexity to model deployment and evaluation.
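The thresholding step can be sketched in a few lines. The example below uses hypothetical probability scores for four instances and three labels, and compares a single global threshold with illustrative per-label thresholds; the threshold values are assumptions chosen for the example, not tuned values.

```python
import numpy as np

# Hypothetical predicted probabilities: rows are instances, columns are labels
y_prob = np.array([
    [0.90, 0.20, 0.40],
    [0.55, 0.10, 0.35],
    [0.30, 0.60, 0.05],
    [0.45, 0.80, 0.31],
])

# Option 1: one global threshold for every label
global_threshold = 0.5
y_pred_global = (y_prob >= global_threshold).astype(int)

# Option 2: one threshold per label, here lowered for the third label
# (assumed to be rare, where a missed case is assumed to be costly)
per_label_thresholds = np.array([0.5, 0.5, 0.3])
y_pred_per_label = (y_prob >= per_label_thresholds).astype(int)

print("Global threshold:\n", y_pred_global)
print("Per-label thresholds:\n", y_pred_per_label)
```

With the global threshold the third label is never predicted, while the lower per-label threshold recovers it for three of the four instances, at the risk of more false positives. In practice the thresholds would be chosen on a validation set.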

Power Set Transformation

One intuitive way to approach multi-label classification is by transforming it into a multi-class classification problem. This can be done using the Power Set Transformation.

The power set of a set $S$ is the set of all subsets of $S$, including the empty set and the set $S$ itself. For example, if we have a set of labels $L = \{L_1, L_2, L_3\}$, the power set $\mathcal{P}(L)$ would be:

$$ \mathcal{P}(L) = \{\emptyset, \{L_1\}, \{L_2\}, \{L_3\}, \{L_1, L_2\}, \{L_1, L_3\}, \{L_2, L_3\}, \{L_1, L_2, L_3\}\} $$

In the context of multi-label classification, the power set transformation treats each unique combination of labels observed in the dataset as a single, distinct class. If the original labels are $\{L_1, L_2, \dots, L_k\}$, the power set transformation creates a new set of classes corresponding to every possible subset of these labels that appears in the training data.

For an instance with true labels $\{L_1, L_3\}$, instead of predicting '1' for $L_1$ and $L_3$ and '0' for $L_2$, the model would be trained to predict the single class corresponding to the label combination $\{L_1, L_3\}$. The problem is converted into a standard multi-class classification task where the classes are these unique label combinations.
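A minimal sketch of the transformation, assuming the labels are stored as a binary indicator matrix (the matrix `Y` below is hypothetical): each distinct row becomes one class, and every instance receives the id of its label combination.

```python
import numpy as np

# Hypothetical label matrix: rows are instances, columns are L1, L2, L3
Y = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
])

# Each distinct row (label combination) becomes one class; the inverse
# index gives the class id assigned to every instance
combinations, class_ids = np.unique(Y, axis=0, return_inverse=True)
class_ids = class_ids.ravel()  # keep the ids 1-D across NumPy versions

print("Observed label combinations (one class each):\n", combinations)
print("Multi-class target:", class_ids)

# Predicted class ids can be mapped back to label sets by indexing
Y_reconstructed = combinations[class_ids]
```

Only 4 of the $2^3 = 8$ possible combinations occur in this toy matrix, so the resulting multi-class problem has 4 classes.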

Impracticality of Power Set Transformation

While conceptually simple, the power set transformation is frequently impractical for real-world multi-label problems due to a major limitation: the number of possible unique label combinations (classes) can be extremely large.

Exponential Growth: If there are $k$ possible labels, the total number of subsets in the power set is $2^k$. While the number of unique combinations observed in the training data might be less than $2^k$, it can still grow exponentially with the number of labels. For example, with 20 labels, $2^{20}$ is over a million potential combinations.
Sparse Data: Because the number of possible combinations is so large, many combinations appear only a handful of times in the training data, or not at all. This leads to a multi-class problem with a vast number of classes, most of which have very few or zero training examples. Training a classifier to predict classes it has never seen is impossible.
Increased Model Complexity: Training a multi-class classifier with a huge number of output classes requires a much more complex model and significantly more training data to learn to distinguish between all these combinations.
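The growth can be made tangible with a few lines of arithmetic. The sketch below assumes a hypothetical training set of 100,000 instances; since each instance contributes at most one distinct label combination, this bounds how many combinations can be observed at all.

```python
n_train = 100_000  # hypothetical training-set size

summary = {}
for k in (3, 10, 20, 30):
    n_possible = 2 ** k
    # A dataset can contain at most one distinct combination per instance
    max_observed = min(n_possible, n_train)
    unseen_fraction = 1 - max_observed / n_possible
    summary[k] = (n_possible, unseen_fraction)
    print(f"{k:2d} labels: {n_possible:>13,} possible combinations, "
          f"at least {unseen_fraction:.1%} never seen in training")
```

Under this assumption, with 20 labels more than 90% of the possible combinations cannot appear in the training data, no matter how the labels are distributed.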

Due to these scalability and data sparsity issues, the power set transformation is typically only feasible for multi-label problems with a small number of labels. If it is feasible for your problem, you can simply use multi-class evaluation. The rest of this section, however, focuses on the case where you cannot use the power set transformation.

Instance-Based Metrics

Instance-based metrics evaluate the performance of a multi-label classifier on a per-instance basis. An instance here refers to all the labels associated with a datapoint. For each instance in the dataset, these metrics compare the predicted set of labels directly to the true set of labels. A score is calculated for each individual instance, and the final metric is typically the average of these scores across all instances in the dataset.

An instance-based approach provides insight into how well the model performs for each individual data point, considering the specific combination of labels associated with it. Common instance-based metrics include Exact Match Ratio (Subset Accuracy), Hamming Loss, and the Jaccard Index (averaged over instances). They offer a perspective on how often the model gets the entire set of labels right for a single instance or the average overlap between the predicted and true label sets per instance.

Exact Match Ratio (Subset Accuracy)

The Exact Match Ratio, also known as Subset Accuracy, is one of the strictest multi-label evaluation metrics. It is the proportion of instances for which the set of predicted labels exactly matches the set of true labels. For each instance, if the set of predicted labels is identical to the set of true labels, the instance is counted as a "full match". The Exact Match Ratio is the total count of full matches divided by the total number of instances:

$$ \text{Exact Match Ratio} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(Y_i = \hat{Y}_i) $$

where $N$ is the number of instances, $Y_i$ is the set of true labels for instance $i$, $\hat{Y}_i$ is the set of predicted labels for instance $i$, and $\mathbb{I}(\cdot)$ is the indicator function (which is 1 if the condition is true, and 0 otherwise). This metric provides a simple, intuitive understanding of how often the model gets the entire set of labels right for an instance. It is a very strict metric because even a single missing or incorrectly added label for an instance results in a score of 0 for that instance. A higher Exact Match Ratio indicates better performance (ranging from 0 to 1).
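As a small illustration of the formula, the sketch below computes the Exact Match Ratio with NumPy on hypothetical indicator matrices (1 means the label is present). scikit-learn's `accuracy_score` should return the same value when given multi-label indicator arrays, since it computes subset accuracy in that case.

```python
import numpy as np

# Hypothetical true and predicted labels: 4 instances, 3 labels
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 1],   # exact match
                   [0, 1, 1],   # one extra label  -> no match
                   [1, 1, 0],   # exact match
                   [0, 0, 0]])  # one missed label -> no match

# Indicator I(Y_i == Y_hat_i): True only if every label of the instance agrees
is_exact_match = np.all(Y_true == Y_pred, axis=1)
exact_match_ratio = is_exact_match.mean()

print(f"Exact Match Ratio: {exact_match_ratio:.2f}")
```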

Hamming Loss

The Hamming Loss measures the fraction of labels that are incorrectly predicted across all instances and all labels. Equivalently, it is the average number of incorrect labels per instance divided by the number of possible labels.

Definition: It is the average size of the symmetric difference between the true and predicted label sets, normalized by the number of possible labels. This means it counts both false positives (predicted label is not a true label) and false negatives (true label is not a predicted label).

Calculation: For each instance, the Hamming Loss is the number of labels incorrectly predicted (predicted as present but actually absent, or predicted as absent but actually present), divided by the total number of possible labels. The overall Hamming Loss is the average of these per-instance losses.

$$ \text{Hamming Loss} = \frac{1}{N \times L} \sum_{i=1}^{N} \sum_{j=1}^{L} \mathbb{I}(y_{ij} \neq \hat{y}_{ij}) $$

where $N$ is the number of instances, $L$ is the total number of possible labels, $y_{ij}$ is the true label (1 if present, 0 if absent) for instance $i$ and label $j$, and $\hat{y}_{ij}$ is the predicted label.

Interpretation: A lower Hamming Loss indicates better performance. It ranges from 0 (perfect prediction for all labels on all instances) to 1 (every label is incorrectly predicted for every instance).
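
As a minimal sketch of the formula, the snippet below counts mismatching label cells by hand. The toy data and the helper name `hamming_loss_manual` are assumptions made for illustration only.

```python
# Toy data (hypothetical): 4 instances, 3 possible labels, binary indicator format
y_true = [[1, 0, 1],
          [0, 1, 0],
          [1, 1, 0],
          [0, 0, 1]]
y_pred = [[1, 0, 1],
          [0, 1, 1],   # one false positive
          [1, 1, 0],
          [0, 0, 0]]   # one false negative

def hamming_loss_manual(y_true, y_pred):
    """Number of mismatching (instance, label) cells divided by N * L."""
    n_instances = len(y_true)
    n_labels = len(y_true[0])
    wrong = sum(t != p
                for t_row, p_row in zip(y_true, y_pred)
                for t, p in zip(t_row, p_row))
    return wrong / (n_instances * n_labels)

hl = hamming_loss_manual(y_true, y_pred)
print(f"Hamming Loss: {hl:.4f}")  # 2 wrong cells out of 12
```

On this toy data only half of the rows match exactly, yet just 2 of the 12 individual label decisions are wrong, so the Hamming Loss is far more forgiving than the Exact Match Ratio.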

Jaccard Index (Jaccard Similarity, IoU)

We mentioned the Jaccard Index before under binary classification. It can also be adapted to multi-label classification, where it measures the "overlap" between the set of true labels and the set of predicted labels.

In binary classification, you would calculate it over the entire dataset. In multi-label classification you calculate it for each instance (datapoint) and then average the per-instance scores to summarize it over your entire dataset (scikit-learn calls this average='samples').

$$ \text{Jaccard Index for instance } i = \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|} $$

where $Y_i$ is the set of true labels for instance $i$, and $\hat{Y}_i$ is the set of predicted labels for instance $i$.

Interpretation: A higher Jaccard Index indicates better performance, meaning the predicted set of labels overlaps more with the true set of labels, on average across the instances in the dataset. It ranges from 0 (no overlap) to 1 (perfect match).
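
A small sketch of the per-instance calculation, using Python sets. The toy data and helper names are hypothetical, and returning `empty_value` when both label sets are empty is an assumed convention (the definition itself is undefined in that case).

```python
# Toy data (hypothetical): 4 instances, 3 possible labels, binary indicator format
y_true = [[1, 0, 1],
          [0, 1, 0],
          [1, 1, 0],
          [0, 0, 1]]
y_pred = [[1, 0, 1],
          [0, 1, 1],
          [1, 1, 0],
          [0, 0, 0]]

def to_label_set(row):
    """Convert a binary indicator row into the set of label indices that are present."""
    return {j for j, v in enumerate(row) if v == 1}

def jaccard_per_instance(true_row, pred_row, empty_value=0.0):
    """|intersection| / |union| of the true and predicted label sets of one instance."""
    t, p = to_label_set(true_row), to_label_set(pred_row)
    union = t | p
    if not union:  # both sets empty: assumed convention, not part of the definition
        return empty_value
    return len(t & p) / len(union)

scores = [jaccard_per_instance(t, p) for t, p in zip(y_true, y_pred)]
mean_jaccard = sum(scores) / len(scores)
print("Per-instance Jaccard:", scores)
print(f"Average over instances: {mean_jaccard:.3f}")
```

Unlike the Exact Match Ratio, the second instance still earns partial credit here because one of its two predicted labels is correct.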

Label-Based Metrics

Label-based metrics evaluate the performance of a multi-label classifier on a per-label basis. Instead of looking at the entire set of labels for a single instance, this approach treats the multi-label problem as a collection of independent binary classification problems, one for each label. For each label, we calculate standard binary classification metrics (like Precision, Recall, and F1-score) by considering all instances in the dataset and whether that specific label is present or absent, and whether the model correctly predicted its presence or absence.

We can then use the previously discussed averaging strategies to get a single score for our model's performance. Label-based metrics are particularly useful for understanding how well the model performs on individual labels, especially in the presence of label imbalance.

Why is IoU not treated under label-based metrics?

Although you could calculate a Jaccard Index for each label across all instances (similar to label-based Precision or Recall), the Jaccard Index (IoU) is conventionally categorized as an instance-based metric in multi-label classification. This is because its most common and intuitive definition is applied at the instance level, measuring the overlap between the set of predicted labels and the set of true labels for each individual data point. The overall IoU is then typically calculated by averaging these per-instance scores. This contrasts with standard label-based metrics like Precision, Recall, and F1, which are primarily calculated by aggregating True Positives, False Positives, and False Negatives per label across all instances and then averaging these per-label scores.
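
To make the distinction concrete, the sketch below computes the Jaccard Index both ways on the same made-up data: once per instance (rows) and once per label (columns). The data and the helper `jaccard` are assumptions for illustration, and an empty union is scored as 0 by assumption.

```python
# Toy data (hypothetical): rows are instances, columns are labels
y_true = [[1, 0, 1],
          [0, 1, 0],
          [1, 1, 0],
          [0, 0, 1]]
y_pred = [[1, 0, 1],
          [0, 1, 1],
          [1, 1, 0],
          [0, 0, 0]]

def jaccard(true_vec, pred_vec):
    """Jaccard Index between two binary vectors (0.0 if both are all zeros)."""
    inter = sum(1 for t, p in zip(true_vec, pred_vec) if t == 1 and p == 1)
    union = sum(1 for t, p in zip(true_vec, pred_vec) if t == 1 or p == 1)
    return inter / union if union else 0.0

# Instance-based: compare each row, then average over instances
per_instance = [jaccard(t, p) for t, p in zip(y_true, y_pred)]
instance_avg = sum(per_instance) / len(per_instance)

# Label-based: compare each column (zip(*rows) transposes), then average over labels
per_label = [jaccard(t, p) for t, p in zip(zip(*y_true), zip(*y_pred))]
label_avg = sum(per_label) / len(per_label)

print(f"Instance-averaged Jaccard: {instance_avg:.3f}")
print(f"Label-averaged Jaccard:    {label_avg:.3f}")
```

The two averages differ on the same predictions, so it matters which variant is being reported.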

Precision, Recall, and F1-score (Label-Based)

Precision, Recall, and F1-score are adapted for multi-label classification by calculating True Positives (TPs), False Positives (FPs), and False Negatives (FNs) on a per-label basis. For each label, we treat the problem as a binary classification task (presence or absence of that specific label) and calculate the binary metrics.

Let's take Precision as an example. To calculate the precision for a specific label, say Label $j$:

We consider all instances in the dataset.
For each instance, we check whether Label $j$ is among the true labels and whether the model predicted Label $j$.
We count the total number of times Label $j$ was truly present and predicted by the model across all instances. This is the True Positives ($TP_j$) for Label $j$.
We count the total number of times Label $j$ was not truly present but was predicted by the model across all instances. This is the False Positives ($FP_j$) for Label $j$.
The Precision for Label $j$ is then calculated using the standard binary formula:
$$ \text{Precision}_j = \frac{TP_j}{TP_j + FP_j} $$
This tells us, out of all the times the model predicted Label $j$, how often it was actually correct.

The rest of the label-based metrics, such as Recall, Specificity, or Balanced Accuracy, can be calculated analogously for each label by determining the per-label TP, FP, TN, and FN counts across all instances and applying the corresponding binary metric formula.
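
The steps above can be sketched as follows. The toy data and the helper name `per_label_precision` are hypothetical, and returning 0.0 when a label is never predicted is an assumed convention.

```python
# Toy data (hypothetical): rows are instances, columns are labels
y_true = [[1, 0, 1],
          [0, 1, 0],
          [1, 1, 0],
          [0, 0, 1]]
y_pred = [[1, 0, 1],
          [0, 1, 1],
          [1, 0, 0],
          [1, 0, 1]]

def per_label_precision(y_true, y_pred):
    """Return (precisions, tps, fps), each with one entry per label."""
    precisions, tps, fps = [], [], []
    for true_col, pred_col in zip(zip(*y_true), zip(*y_pred)):
        tp = sum(1 for t, p in zip(true_col, pred_col) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(true_col, pred_col) if t == 0 and p == 1)
        tps.append(tp)
        fps.append(fp)
        # Assumed convention: precision is 0.0 when the label is never predicted
        precisions.append(tp / (tp + fp) if (tp + fp) else 0.0)
    return precisions, tps, fps

precisions, tps, fps = per_label_precision(y_true, y_pred)
macro_precision = sum(precisions) / len(precisions)      # average of per-label scores
micro_precision = sum(tps) / (sum(tps) + sum(fps))       # pool counts, then divide

for j, prec in enumerate(precisions):
    print(f"Label {j}: TP={tps[j]}, FP={fps[j]}, Precision={prec:.3f}")
print(f"Macro precision: {macro_precision:.3f}")
print(f"Micro precision: {micro_precision:.3f}")
```

Macro-averaging treats every label equally, while micro-averaging pools the counts first, so labels that are predicted more often have more influence on the micro score.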

Multi-Label Confusion Matrix, ROC, and Precision-Recall Curves

While a single standard multi-class confusion matrix doesn't directly apply to multi-label classification (as instances can have multiple true and predicted labels), we can adapt the concept to gain insight into performance per label. Similarly, ROC and Precision-Recall curves, along with their AUCs, can be generated on a per-label basis.

Confusion Matrix per Label: For each label, we can construct a binary confusion matrix by treating the problem as predicting the presence or absence of that specific label across all instances. This matrix will show the True Positives, False Positives, False Negatives, and True Negatives for that label. Displaying these per-label confusion matrices provides a detailed view of where the model is succeeding or failing for each individual label.
ROC Curve and AUC per Label: Similar to the multi-class One-vs-Rest approach, we can generate a binary ROC curve for each label. This plots the True Positive Rate (Recall) against the False Positive Rate at various probability thresholds for predicting that specific label. The Area Under the ROC Curve (AUC) can be calculated for each label's curve. Averaging these per-label AUCs (macro or weighted), or pooling the decisions for all labels into a single curve (micro), provides overall performance summaries.
Precision-Recall Curve and AUPRC per Label: Also using a per-label binary approach, we can plot the Precision against the Recall at various probability thresholds for each label. The Area Under the Precision-Recall Curve (AUPRC or AP) for each label is particularly informative for imbalanced multi-label datasets, highlighting the trade-off between false positives and false negatives when identifying instances of that specific label. Averaging these per-label AUPRCs is a common way to summarize performance.

These per-label visualizations and metrics are important for understanding the model's performance on individual labels, especially in datasets with label imbalance, and they complement the instance-based metrics by providing a different perspective on evaluation.
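
As a rough sketch of the per-label confusion matrices described above, the snippet below builds one 2x2 matrix per label by hand. The toy data and the helper `per_label_confusion` are assumptions for illustration; the layout [[TN, FP], [FN, TP]] (rows: true, columns: predicted) is a choice made here.

```python
# Toy data (hypothetical): rows are instances, columns are labels
y_true = [[1, 0, 1],
          [0, 1, 0],
          [1, 1, 0],
          [0, 0, 1]]
y_pred = [[1, 0, 1],
          [0, 1, 1],
          [1, 0, 0],
          [1, 0, 1]]

def per_label_confusion(y_true, y_pred):
    """Return one [[TN, FP], [FN, TP]] matrix per label."""
    matrices = []
    for true_col, pred_col in zip(zip(*y_true), zip(*y_pred)):
        tn = fp = fn = tp = 0
        for t, p in zip(true_col, pred_col):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
            else:
                tn += 1
        matrices.append([[tn, fp], [fn, tp]])
    return matrices

matrices = per_label_confusion(y_true, y_pred)
for j, m in enumerate(matrices):
    print(f"Label {j}: TN={m[0][0]} FP={m[0][1]} FN={m[1][0]} TP={m[1][1]}")
```

Every per-label metric in this section (Precision, Recall, Specificity, and so on) can be read off these four counts.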

Example: Simulated Multi-Label Classification

The following example shows how multi-label classification and its evaluation work in practice. We do the following:

Simulate data
Show why the powerset transformation will not always lead to good outcomes.
Train classifiers for multi-label classification.
Evaluate them on instance-based (example-based) and label-based metrics. We will forgo the ROC and Precision-Recall plots and the confusion matrices.

In [ ]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from itertools import chain, combinations # Import tools for combinations
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score

In [ ]:

# Simulate a multi-label classification problem
n_samples = 3000
n_features = 20
n_classes = 5
n_labels = 2 # Average number of labels per instance

X, y = make_multilabel_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes,
                                      n_labels=n_labels, allow_unlabeled=True, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


print("\nExample of multi-label data (features and true labels for first 5 samples):")
print("X (features):\n", X[:5])
print("\ny (true labels - binary indicator format):\n", y[:5])

Example of multi-label data (features and true labels for first 5 samples):
X (features):
 [[3. 0. 2. 2. 2. 8. 3. 2. 5. 2. 0. 1. 0. 1. 0. 0. 6. 5. 0. 2.]
 [3. 5. 2. 3. 3. 3. 1. 1. 4. 1. 1. 4. 2. 8. 4. 3. 2. 6. 2. 0.]
 [3. 2. 3. 1. 0. 5. 3. 2. 2. 7. 2. 2. 2. 4. 1. 0. 6. 6. 3. 3.]
 [1. 0. 1. 6. 6. 1. 4. 1. 2. 7. 8. 2. 5. 5. 3. 2. 0. 0. 3. 0.]
 [3. 6. 2. 1. 1. 2. 5. 1. 4. 5. 2. 1. 5. 1. 5. 7. 3. 5. 4. 1.]]

y (true labels - binary indicator format):
 [[0 0 0 1 0]
 [1 1 1 0 0]
 [0 0 1 1 0]
 [1 0 0 0 0]
 [1 0 1 0 0]]

We mentioned the power set transformation before. The plot below illustrates the difficulty with it. Run the cell above and the cell below. You will see that some of the label combinations have very few datapoints.

This makes it hard to train and evaluate a machine learning model on these classes. Of course, this is partly an artefact of the simulation process, but it proves the point.

In [ ]:

# Assume y_train is available from the previous cell
n_classes = y_train.shape[1] # Get the number of classes from the data

# 1. Generate all 2^n_classes possible label combinations
# Helper function to generate all subsets of labels (0 to n_classes-1)
def all_subsets(labels):
    return chain.from_iterable(combinations(labels, r) for r in range(len(labels) + 1))

all_combinations = list(all_subsets(range(n_classes)))

# 2. Get observed label subset counts from training data
def get_label_subset_representation(labels_row):
    return tuple(np.where(labels_row == 1)[0])

y_train_subsets = [get_label_subset_representation(row) for row in y_train]
observed_subset_counts = pd.Series(y_train_subsets).value_counts()

# 3. Create a dictionary with all possible combinations and their counts (0 for unseen)
full_counts_dict = {combo: observed_subset_counts.get(combo, 0) for combo in all_combinations}

# Create a Series from the dictionary, explicitly managing the index
# Sort the combinations first to ensure consistent plotting order
sorted_combinations = sorted(full_counts_dict.keys())
full_counts = pd.Series([full_counts_dict[combo] for combo in sorted_combinations], index=sorted_combinations)

print(f"Total possible label combinations (2^{n_classes}): {len(all_combinations)}")
print(f"Observed unique label subsets in training data: {len(observed_subset_counts)}")
print("\nCounts for all possible label subsets (0 for unseen):")

# 4. Plot the counts of all possible label subsets
plt.figure(figsize=(15, 6)) # Increase figure size for more bars
full_counts.plot(kind='bar')
plt.title(f'Counts of All {len(all_combinations)} Possible Label Subsets')
plt.xlabel('Label Subset (Tuple of Label Indices)')
plt.ylabel('Frequency')
plt.xticks(rotation=90) # Use 90 rotation to prevent overlap
plt.tight_layout() # Adjust layout
plt.show()

Total possible label combinations (2^5): 32
Observed unique label subsets in training data: 32

Counts for all possible label subsets (0 for unseen):

Next, we are going to train a separate logistic regression classifier for each of the labels. More sophisticated multi-label approaches exist, but for our illustration this will suffice. Then we will show instance-based and label-based evaluation.

In [ ]:

from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
# You could use other base estimators as well, e.g., DecisionTreeClassifier, RandomForestClassifier

# Define a base binary classifier
base_classifier = LogisticRegression(max_iter=1000) # Increased max_iter for convergence

# Wrap the base classifier with MultiOutputClassifier to make it multi-label
multi_label_classifier = MultiOutputClassifier(base_classifier, n_jobs=-1) # n_jobs=-1 to use all cores
print("Multi-label classifier instantiated:")
print(multi_label_classifier)

# Train the multi-label classifier on the training data
# Assumes X_train and y_train are available from a previous cell
print("\nTraining multi-label classifier...")
multi_label_classifier.fit(X_train, y_train)
print("Multi-label classifier trained successfully.")

Multi-label classifier instantiated:
MultiOutputClassifier(estimator=LogisticRegression(max_iter=1000), n_jobs=-1)

Training multi-label classifier...
Multi-label classifier trained successfully.

In [ ]:

def example_based_evaluation(y_true, y_pred):
    """
    Calculates and prints instance-based (example-based) evaluation metrics
    for multi-label classification.

    Args:
        y_true (np.ndarray): True labels (binary indicator format).
        y_pred (np.ndarray): Predicted labels (binary indicator format).
    """
    print("--- Instance-Based (Example-Based) Metrics ---")

    # Exact Match Ratio (Subset Accuracy)
    exact_match_ratio = accuracy_score(y_true, y_pred)
    print(f"Exact Match Ratio: {exact_match_ratio:<.4f}   (Higher is better, Range: 0 to 1)")

    # Hamming Loss
    hamming_loss_score = hamming_loss(y_true, y_pred)
    print(f"Hamming Loss:      {hamming_loss_score:<.4f}   (Lower is better, Range: 0 to 1)")

    # Jaccard Index (averaged over instances)
    jaccard_samples = jaccard_score(y_true, y_pred, average='samples', zero_division=0)
    print(f"Jaccard Index:     {jaccard_samples:<.4f}   (Higher is better, Range: 0 to 1)")

    print("-" * 30)

# Assuming multi_label_classifier, X_test, and y_test are available
# Make predictions on the test set
y_pred_test = multi_label_classifier.predict(X_test)

# Evaluate using the function
example_based_evaluation(y_test, y_pred_test)

--- Instance-Based (Example-Based) Metrics ---
Exact Match Ratio: 0.4256   (Higher is better, Range: 0 to 1)
Hamming Loss:      0.1807   (Lower is better, Range: 0 to 1)
Jaccard Index:     0.6142   (Higher is better, Range: 0 to 1)
------------------------------

In [ ]:

from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np
import pandas as pd # Import pandas to display results nicely

# Assuming y_test and y_pred_test are available from the previous cell

print("--- Label-Based Metrics (Per Label) ---")

# Calculate per-label metrics
# Set average=None to get scores for each label
precision_per_label = precision_score(y_test, y_pred_test, average=None, zero_division=0)
recall_per_label = recall_score(y_test, y_pred_test, average=None, zero_division=0)
f1_per_label = f1_score(y_test, y_pred_test, average=None, zero_division=0)

# Get the number of labels
n_labels = y_test.shape[1]
label_names = [f'Label {i}' for i in range(n_labels)]

# Create a DataFrame to display per-label results
per_label_df = pd.DataFrame({
    'Precision': precision_per_label,
    'Recall': recall_per_label,
    'F1 Score': f1_per_label
}, index=label_names)

# Display the per-label DataFrame
display(per_label_df)
print("-" * 30)


print("--- Label-Based Metrics (Averaged) ---")

# Micro-averaged metrics
micro_precision = precision_score(y_test, y_pred_test, average='micro', zero_division=0)
micro_recall = recall_score(y_test, y_pred_test, average='micro', zero_division=0)
micro_f1 = f1_score(y_test, y_pred_test, average='micro', zero_division=0)

print(f"Micro-averaged Precision: {micro_precision:<.4f}")
print(f"Micro-averaged Recall:    {micro_recall:<.4f}")
print(f"Micro-averaged F1 Score:  {micro_f1:<.4f}")
print("-" * 30)


# Macro-averaged metrics
macro_precision = precision_score(y_test, y_pred_test, average='macro', zero_division=0)
macro_recall = recall_score(y_test, y_pred_test, average='macro', zero_division=0)
macro_f1 = f1_score(y_test, y_pred_test, average='macro', zero_division=0)

print(f"Macro-averaged Precision: {macro_precision:<.4f}")
print(f"Macro-averaged Recall:    {macro_recall:<.4f}")
print(f"Macro-averaged F1 Score:  {macro_f1:<.4f}")
print("-" * 30)


# Weighted-averaged metrics
weighted_precision = precision_score(y_test, y_pred_test, average='weighted', zero_division=0)
weighted_recall = recall_score(y_test, y_pred_test, average='weighted', zero_division=0)
weighted_f1 = f1_score(y_test, y_pred_test, average='weighted', zero_division=0)

print(f"Weighted-averaged Precision: {weighted_precision:<.4f} ")
print(f"Weighted-averaged Recall:    {weighted_recall:<.4f} ")
print(f"Weighted-averaged F1 Score:  {weighted_f1:<.4f} ")
print("-" * 30)

--- Label-Based Metrics (Per Label) ---

         Precision    Recall  F1 Score
Label 0   0.772277  0.600000  0.675325
Label 1   0.802198  0.757261  0.779082
Label 2   0.775342  0.735065  0.754667
Label 3   0.790761  0.778075  0.784367
Label 4   0.728395  0.395973  0.513043

------------------------------
--- Label-Based Metrics (Averaged) ---
Micro-averaged Precision: 0.7845
Micro-averaged Recall:    0.6994
Micro-averaged F1 Score:  0.7395
------------------------------
Macro-averaged Precision: 0.7738
Macro-averaged Recall:    0.6533
Macro-averaged F1 Score:  0.7013
------------------------------
Weighted-averaged Precision: 0.7820 
Weighted-averaged Recall:    0.6994 
Weighted-averaged F1 Score:  0.7342 
------------------------------

Conclusion based on Multi-Label Evaluation Metrics

In conclusion, the model demonstrates reasonable performance on this synthetic multi-label problem. The instance-based metrics show a moderate level of exact matching (Exact Match Ratio of 0.43) and overlap (Jaccard Index of 0.61), with a Hamming Loss of 0.18. The label-based metrics provide a more detailed view: precision is similar across labels (0.73 to 0.80), but recall varies widely, from 0.78 for Label 3 down to 0.40 for Label 4. This is why the macro-averaged recall (0.65), which weights every label equally, is lower than the micro- and weighted-averaged recall (0.70). Overall performance is decent, but there is room for improvement on specific labels, in particular Label 4 and Label 0, which, judging by the gap between the macro and weighted averages, are likely the less frequent ones.

7.6 Model Calibration

Model calibration refers to how well the predicted probabilities from a classification model align with the true probabilities of an event occurring. In other words, for a well-calibrated model, among all instances for which it predicts a probability of 0.8 for a certain class, the true proportion of positive instances is indeed close to 80%. Calibration is particularly important in applications where the predicted probabilities are used directly for decision-making or risk assessment, not just for ranking instances.

Discrimination vs Calibration

The metrics that we have looked at so far all judge a model's discriminative ability. Discrimination refers to a model's ability to differentiate between positive and negative instances. A model can have excellent discrimination (e.g., a high AUC) but still be poorly calibrated, meaning its predicted probabilities are not reliable estimates of the true likelihoods. For example, a model might consistently assign higher probabilities to positive cases than to negative cases (good discrimination) but predict probabilities of 0.9 when the true probability is consistently closer to 0.6 (poor calibration). Conversely, a model could be well-calibrated but have poor discrimination if its predicted probabilities for positive and negative instances are similar. In many applications, both high discrimination and good calibration are desired.
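
The difference can be illustrated with a small sketch on made-up data: applying a strictly increasing distortion to calibrated probabilities leaves the ranking (and hence the AUC) unchanged but breaks calibration. The helpers `auc_pairwise` and `max_calibration_gap` are hypothetical names written for this illustration, not library functions.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def max_calibration_gap(y_true, probs):
    """Largest |predicted probability - observed fraction of positives| over groups
    of instances that share the same predicted probability."""
    groups = {}
    for y, p in zip(y_true, probs):
        groups.setdefault(p, []).append(y)
    return max(abs(p - sum(ys) / len(ys)) for p, ys in groups.items())

# Toy data (hypothetical): 10 instances scored 0.2 (2 positives), 10 scored 0.8 (8 positives)
y_true = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
p_calibrated = [0.2] * 10 + [0.8] * 10
# A strictly increasing distortion keeps the ranking but pushes the probabilities upward
p_distorted = [p ** 0.5 for p in p_calibrated]

auc_cal = auc_pairwise(y_true, p_calibrated)
auc_dist = auc_pairwise(y_true, p_distorted)
gap_cal = max_calibration_gap(y_true, p_calibrated)
gap_dist = max_calibration_gap(y_true, p_distorted)

print(f"AUC  calibrated: {auc_cal:.2f} | distorted: {auc_dist:.2f}")
print(f"Max calibration gap  calibrated: {gap_cal:.3f} | distorted: {gap_dist:.3f}")
```

Both sets of scores discriminate equally well, but only the first can be read as a probability.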

Examples of model calibration

One example of model calibration that we have all run into is the weather forecast. Most weather forecasts report not just the expected weather for the coming days, but also the probability of that weather occurring. So when a well-calibrated forecaster says "an 80% chance of rain tomorrow", it means that out of all the days for which they predicted an 80% chance of rain in the past, it did indeed rain on about 80% of them.

Another example comes from medical diagnosis. A predicted probability of 90% for a disease might lead to immediate treatment, while a 10% probability might suggest further testing. If the model is poorly calibrated, a predicted 90% probability might only correspond to a true probability of 50%, leading to unnecessary or delayed interventions.

Calibration Metrics and Visualizations

Reliability Curve (Calibration Plot)

The Reliability Curve, also known as a Calibration Plot or Reliability Diagram, is a visualization used to assess how well the predicted probabilities of a classification model match the actual probabilities. It helps determine if the model's predictions are well-calibrated.

How it is created:

Binning: The predicted probabilities from the model are divided into a fixed number of bins (e.g., 10 bins for probabilities ranging from 0.0 to 1.0).
Average Predicted Probability: For each bin, the average of the predicted probabilities of all instances falling into that bin is calculated.
Fraction of Positives: For each bin, the actual fraction of positive instances among all instances in that bin is calculated.
Plotting: A scatter plot is created where the x-axis represents the average predicted probability in each bin, and the y-axis represents the fraction of positives in that bin.

Interpretation:

A perfectly calibrated model would have points that fall exactly on the diagonal line (y = x). This means that if the model predicts a probability of, say, 0.7, the actual proportion of positive cases in that group is also 0.7.
If the points lie above the diagonal line, the model is under-confident; it predicts probabilities that are too low.
If the points lie below the diagonal line, the model is over-confident; it predicts probabilities that are too high.

Reliability curves are useful for visually comparing the calibration of different models or for understanding where a single model is poorly calibrated across the range of predicted probabilities.
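
The binning procedure described above can be sketched in plain Python. This is a simplified illustration with equal-width bins on made-up data; `reliability_bins` is a hypothetical helper, and library implementations may handle bin edges differently.

```python
def reliability_bins(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, fraction of positives, count)
    for every non-empty equal-width bin on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if not members:  # empty bins are dropped from the curve
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        frac_pos = sum(y for y, _ in members) / len(members)
        curve.append((mean_pred, frac_pos, len(members)))
    return curve

# Toy data (hypothetical): predicted probabilities and true outcomes
y_prob = [0.10, 0.15, 0.30, 0.35, 0.55, 0.70, 0.75, 0.90, 0.95, 0.85]
y_true = [0,    0,    0,    1,    1,    1,    0,    1,    1,    1]

curve = reliability_bins(y_true, y_prob, n_bins=5)
for mean_pred, frac_pos, count in curve:
    if frac_pos > mean_pred:
        position = "above diagonal (under-confident)"
    elif frac_pos < mean_pred:
        position = "below diagonal (over-confident)"
    else:
        position = "on the diagonal"
    print(f"x={mean_pred:.3f}  y={frac_pos:.3f}  n={count}  {position}")
```

Each printed row corresponds to one point of the reliability curve: x is the average predicted probability in the bin and y is the fraction of positives.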

References

[1] https://scikit-learn.org/stable/auto_examples/calibration/plot_compare_calibration.html

[2] https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py

A reliability curve is plotted below for different (calibrated) classifiers. You can ignore the fact that they are calibrated for now; just know that without this step your model may not behave as nicely as it does in this example. You also don't need to know the different types of classifiers; just know that there are non-neural-network classifiers as well!

You may notice in this curve that the points are located at slightly different positions in the plot. This is because the calibration plot function creates different bins and places the point at the average of that bin. Moreover, if there are no datapoints in a bin, the plot drops the bin from the plot.

Note that these examples are made for binary classification. Multi-class classification can again be handled using a One-vs-Rest approach.

In [ ]:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Generate synthetic data
X, y = make_classification(n_samples=10000, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, n_clusters_per_class=2,
                           flip_y=0.1, class_sep=0.5, random_state=42)

# Split into training, calibration, and test sets
X_train_cal, X_test, y_train_cal, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train, X_cal, y_train, y_cal = train_test_split(X_train_cal, y_train_cal, test_size=0.5, random_state=42)

# Define base classifiers
lr = LogisticRegression(max_iter=10000, random_state=42) # Added random_state for reproducibility
nb = GaussianNB()
rf = RandomForestClassifier(random_state=42) # Instantiate RandomForestClassifier

# Train base classifiers on the training set
print("Training Logistic Regression (base)...")
lr.fit(X_train, y_train)
print("Training Naive Bayes (base)...")
nb.fit(X_train, y_train)
print("Training Random Forest (base)...")
rf.fit(X_train, y_train)


# Calibrate classifiers using Platt scaling (method='sigmoid') on the held-out calibration set.
# cv='prefit' tells CalibratedClassifierCV to reuse the already trained estimator.
# Note: this option is deprecated as of scikit-learn 1.6 (see the warning in the output below),
# which suggests CalibratedClassifierCV(FrozenEstimator(estimator)) as the replacement.
print("Calibrating Logistic Regression with Platt Scaling...")
lr_sigmoid = CalibratedClassifierCV(lr, method='sigmoid', cv='prefit')
lr_sigmoid.fit(X_cal, y_cal)

print("Calibrating Naive Bayes with Platt Scaling...")
nb_sigmoid = CalibratedClassifierCV(nb, method='sigmoid', cv='prefit')
nb_sigmoid.fit(X_cal, y_cal)

# The random forest is calibrated the same way, reusing the already trained 'rf' estimator.
print("Calibrating Random Forest with Platt Scaling...")
rf_sigmoid = CalibratedClassifierCV(rf, method='sigmoid', cv='prefit')
rf_sigmoid.fit(X_cal, y_cal)

# Get predicted probabilities on the test set for CALIBRATED models
y_prob_lr_sigmoid = lr_sigmoid.predict_proba(X_test)[:, 1]
y_prob_nb_sigmoid = nb_sigmoid.predict_proba(X_test)[:, 1]
y_prob_rf_sigmoid = rf_sigmoid.predict_proba(X_test)[:, 1]

# Calculate calibration curves for CALIBRATED models
fraction_of_positives_lr_sigmoid, mean_predicted_value_lr_sigmoid = calibration_curve(y_test, y_prob_lr_sigmoid, n_bins=10)
fraction_of_positives_nb_sigmoid, mean_predicted_value_nb_sigmoid = calibration_curve(y_test, y_prob_nb_sigmoid, n_bins=10)
fraction_of_positives_rf_sigmoid, mean_predicted_value_rf_sigmoid = calibration_curve(y_test, y_prob_rf_sigmoid, n_bins=10)

Training Logistic Regression (base)...
Training Naive Bayes (base)...
Training Random Forest (base)...
Calibrating Logistic Regression with Platt Scaling...
Calibrating Naive Bayes with Platt Scaling...
Calibrating Random Forest with Platt Scaling...

/usr/local/lib/python3.12/dist-packages/sklearn/calibration.py:333: UserWarning: The `cv='prefit'` option is deprecated in 1.6 and will be removed in 1.8. You can use CalibratedClassifierCV(FrozenEstimator(estimator)) instead.
  warnings.warn(

In [ ]:

# Plot calibration curves for CALIBRATED models
plt.figure(figsize=(10, 8))
plt.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated") # Diagonal line for perfect calibration

plt.plot(mean_predicted_value_lr_sigmoid, fraction_of_positives_lr_sigmoid, "o-", label="Logistic Regression (Calibrated)")

plt.plot(mean_predicted_value_nb_sigmoid, fraction_of_positives_nb_sigmoid, "o-", label="Naive Bayes (Calibrated)")

plt.plot(mean_predicted_value_rf_sigmoid, fraction_of_positives_rf_sigmoid, "o-", label="Random Forest (Calibrated)")

plt.xlabel("Mean predicted value")
plt.ylabel("Fraction of positives")
plt.title("Calibration plots (Reliability Curves) - Calibrated Models")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

Brier Score

The Brier Score is a single scalar metric used to evaluate the accuracy of probabilistic predictions. It measures the mean squared difference between the predicted probability and the actual outcome (which is 0 or 1 for binary classification).

The formula for the Brier Score for a set of $N$ predictions is:

$$ \text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2 $$

where:

$p_i$ is the predicted probability of the positive class for instance $i$.
$o_i$ is the actual outcome for instance $i$ (1 if the instance is positive, 0 if negative).

Interpretation:

The Brier Score ranges from 0 to 1.
A lower Brier Score indicates better calibration and prediction accuracy.
A score of 0 is achieved only by a model that assigns a probability of 1 to every positive outcome and a probability of 0 to every negative outcome.
A score of 0.25 is achieved by a model that always predicts a probability of 0.5, regardless of the instance. A score higher than 0.25 is therefore worse than this uninformative baseline.

The Brier Score can be decomposed into three components: reliability (calibration), resolution (the ability to distinguish between instances with different outcomes), and uncertainty (the inherent variability in the outcomes). While the overall Brier Score provides a single measure of performance, its decomposition can offer deeper insights into why a model has a particular score. The decomposition of the Brier score is beyond the scope of this tutorial.
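
As a minimal sketch of the formula, the snippet below computes the Brier Score by hand for three made-up sets of predictions. The data and the helper name `brier_score` are assumptions for illustration.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and actual outcome."""
    return sum((p - o) ** 2 for o, p in zip(y_true, y_prob)) / len(y_true)

# Toy data (hypothetical): actual outcomes and three sets of predictions
outcomes = [1, 0, 1, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.7, 0.2]   # high probability for the true outcome
always_half = [0.5] * len(outcomes)            # uninformative baseline
confident_wrong = [0.1, 0.9, 0.2, 0.3, 0.8]   # high probability for the wrong outcome

bs_right = brier_score(outcomes, confident_right)
bs_half = brier_score(outcomes, always_half)
bs_wrong = brier_score(outcomes, confident_wrong)

print(f"Confident and right: {bs_right:.3f}")
print(f"Always 0.5:          {bs_half:.3f}")
print(f"Confident and wrong: {bs_wrong:.3f}")
```

The constant 0.5 prediction lands exactly on 0.25, and being confidently wrong is penalized much more heavily than being uncertain.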

In [ ]:

from sklearn.metrics import brier_score_loss

# Assuming y_test, y_prob_lr_sigmoid, y_prob_nb_sigmoid, and y_prob_rf_sigmoid
# are available from the previous cell

# Calculate Brier Score for Logistic Regression (Platt Calibrated)
brier_lr_sigmoid = brier_score_loss(y_test, y_prob_lr_sigmoid)

# Calculate Brier Score for Naive Bayes (Platt Calibrated)
brier_nb_sigmoid = brier_score_loss(y_test, y_prob_nb_sigmoid)

# Calculate Brier Score for Random Forest (Platt Calibrated)
brier_rf_sigmoid = brier_score_loss(y_test, y_prob_rf_sigmoid)

print(f"Brier Score for Logistic Regression (Platt Calibrated): {brier_lr_sigmoid:.4f}")
print(f"Brier Score for Naive Bayes (Platt Calibrated): {brier_nb_sigmoid:.4f}")
print(f"Brier Score for Random Forest (Platt Calibrated): {brier_rf_sigmoid:.4f}")

Brier Score for Logistic Regression (Platt Calibrated): 0.1938
Brier Score for Naive Bayes (Platt Calibrated): 0.1983
Brier Score for Random Forest (Platt Calibrated): 0.1656

Expected Calibration Error (ECE)

The Expected Calibration Error (ECE) is a metric that quantifies the calibration of a model by taking a weighted average of the difference between the average predicted probability and the fraction of positives within several bins of predicted probabilities. It provides a single numerical summary of the miscalibration observed in a Reliability Curve.

Similar to the Reliability Curve, the ECE calculation involves:

Binning: Dividing predicted probabilities into $M$ bins.
Calculating per-bin differences: For each bin $m$, calculating the absolute difference between the average predicted probability in that bin ($\bar{p}_m$) and the fraction of positives in that bin ($\bar{o}_m$).
Weighted Averaging: Summing these absolute differences, weighted by the proportion of instances that fall into each bin.

The formula for ECE is typically given as:

$$ \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} |\bar{p}_m - \bar{o}_m| $$

where:

$M$ is the number of bins.
$|B_m|$ is the number of instances in bin $m$.
$N$ is the total number of instances.
$\bar{p}_m = \frac{1}{|B_m|} \sum_{i \in B_m} p_i$ is the average predicted probability in bin $m$.
$\bar{o}_m = \frac{1}{|B_m|} \sum_{i \in B_m} o_i$ is the fraction of positives in bin $m$.

Interpretation:

A lower ECE indicates better calibration.
An ECE of 0 represents perfect calibration.

Different weighting schemes can be used for ECE, such as weighting by the square root of the bin size, but the bin size weighting ($\frac{|B_m|}{N}$) is the most common and is used to emphasize deviations in bins with more instances. ECE provides a concise summary of calibration quality and is useful for comparing models, although it can be sensitive to the choice of the number of bins ($M$).
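
As a small worked example with made-up numbers, suppose $N = 10$ predictions fall into $M = 2$ non-empty bins. Bin 1 contains 4 instances with an average predicted probability of 0.20 and a fraction of positives of 0.25. Bin 2 contains 6 instances with an average predicted probability of 0.80 and a fraction of positives of 0.50. Then:

$$ \text{ECE} = \frac{4}{10} |0.20 - 0.25| + \frac{6}{10} |0.80 - 0.50| = 0.4 \times 0.05 + 0.6 \times 0.30 = 0.02 + 0.18 = 0.20 $$

The second bin dominates the result because it is both larger and further from the diagonal.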

In [ ]:

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """
    Calculates the Expected Calibration Error (ECE).

    Args:
        y_true (np.ndarray): True labels (0 or 1).
        y_prob (np.ndarray): Predicted probabilities of the positive class.
        n_bins (int): Number of bins to use for calibration.

    Returns:
        float: The Expected Calibration Error.
    """
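    # Convert inputs to NumPy arrays so boolean-mask indexing also works for lists
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)

    # Ensure y_true and y_prob have the same length
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length.")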

    # Create bins based on predicted probabilities
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers = bin_boundaries[:-1]
    bin_uppers = bin_boundaries[1:]

    ece = 0.0
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
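        # Identify instances within the current bin (bin_lower, bin_upper].
        # The first bin also includes its lower edge, so a predicted probability
        # of exactly 0 is counted instead of falling outside every bin.
        lower_ok = (y_prob >= bin_lower) if bin_lower == 0 else (y_prob > bin_lower)
        in_bin = lower_ok & (y_prob <= bin_upper)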
        num_in_bin = np.sum(in_bin)

        if num_in_bin > 0:
            # Calculate average predicted probability and fraction of positives in the bin
            avg_predicted_prob = np.mean(y_prob[in_bin])
            fraction_of_positives = np.mean(y_true[in_bin]) # Index y_true using the boolean mask

            # Calculate the difference and weight by the proportion of instances in the bin
            ece += np.abs(avg_predicted_prob - fraction_of_positives) * (num_in_bin / len(y_true))

    return ece

# Assuming y_test, y_prob_lr_sigmoid, y_prob_nb_sigmoid, and y_prob_rf_sigmoid
# are available from the previous cells.

# Calculate ECE for selected models
try:
    ece_lr_sigmoid = expected_calibration_error(y_test, y_prob_lr_sigmoid)
    ece_nb_sigmoid = expected_calibration_error(y_test, y_prob_nb_sigmoid)
    ece_rf_sigmoid = expected_calibration_error(y_test, y_prob_rf_sigmoid)

    print(f"ECE for Logistic Regression (Platt Calibrated): {ece_lr_sigmoid:.4f}")
    print(f"ECE for Naive Bayes (Platt Calibrated): {ece_nb_sigmoid:.4f}")
    print(f"ECE for Random Forest (Platt Calibrated): {ece_rf_sigmoid:.4f}")

except NameError as e:
    print(f"Error: {e}. Please ensure the previous cell has been run to define the probability variables.")
except ValueError as e:
    print(f"Error: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

ECE for Logistic Regression (Platt Calibrated): 0.0318
ECE for Naive Bayes (Platt Calibrated): 0.0298
ECE for Random Forest (Platt Calibrated): 0.0202
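As noted above, the ECE can be sensitive to the number of bins. The following is a minimal sketch of that effect, not part of the original analysis: it generates synthetic predictions that are perfectly calibrated by construction (each label is drawn with the predicted probability) and computes the ECE for several bin counts. The helper `ece_equal_width`, the random seed, and the sample size are assumptions made for this illustration. On finite samples the ECE is generally not 0 even for a calibrated model, and it tends to grow as bins get narrower and contain fewer instances.

```python
import numpy as np

def ece_equal_width(y_true, y_prob, n_bins):
    """ECE with equal-width bins (hypothetical helper for this sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges gives bin indices 0 .. n_bins - 1,
    # with probabilities of exactly 0 and 1 landing in the first and last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1], right=True)
    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_ids == m
        if in_bin.any():
            gap = abs(y_prob[in_bin].mean() - y_true[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() equals |B_m| / N
    return ece

# Synthetic, perfectly calibrated predictions (assumption for illustration)
rng = np.random.default_rng(0)
p_synth = rng.uniform(0.0, 1.0, size=2000)
y_synth = rng.binomial(1, p_synth)

ece_by_bins = {m: ece_equal_width(y_synth, p_synth, m) for m in (5, 10, 20, 50, 100)}
for m, value in ece_by_bins.items():
    print(f"n_bins={m:3d}: ECE={value:.4f}")
```

When comparing models, it therefore helps to report the number of bins alongside the ECE and to use the same binning for every model.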

Case Studies

The goal of the case studies is to make you think critically about evaluating classification problems in machine learning.

It's not about getting the questions right or wrong, but about coming up with a reasonable approach.

You will come up with a solution that you think is right, then we will discuss these solutions in class.

Cardiac Monitoring Using a Smart Watch

You have been given a model that has been developed on single-lead EKG data from smart watches worn by people between the ages of 20 and 40. The model was developed to detect a rare ($\frac{1}{100000}$) heart disease. In addition, you have been given a test set (taken from the same age cohort) of the same kind of data. The test set consists of data from 100 people.

You may assume there is no data leakage, so there is no overlap between patients in the training data and testing data.

1. What would be your concern when evaluating this model with the given test set?

2. Now assume that you have 100000 people in your test set. Would your answer change?

The company that gave you the algorithm now wants you to evaluate a different EKG algorithm. This one detects a more common problem that about 5% of the people in the age cohort have. This time they have given you a test set of 100000 people.

3. Say you find an accuracy of 95%. What would you think about this? Would you add any other metrics, and why would you add them? Explain each metric that you include.

Assume no metrics have been reported. In addition, you are told that when the model reports heart disease, the patient has to undergo a battery of invasive tests to confirm the diagnosis, at a total cost of 50000. If the model misses the heart disease, it costs 100000 per lost year of life. You may assume that patients in this cohort have another 40 years to live if the disease is detected early, compared with 10 years if it is not detected.

4. Ignoring the costs, what metrics would you report and why?

5. Construct the cost matrix.

6. How would you set the probability threshold? (assume you have access to a separate dataset for this)

X-Ray Organ Detection

You have been given a model that can classify which organs are in view in an X-ray image. It will be developed iteratively, and new classes will be added as development progresses. Right now it can detect three organs: the heart, the liver, and the spleen.

It is important to note that any combination of organs can be in view at the same time.

7. What type of classification problem is this?

8. Could you convert the labels here into a powerset and classify individual combinations? Hint: $2^K$ is the total number of possible combinations, where $K$ is the number of labels.

9. If you did convert it to a powerset, how would you evaluate this problem? What metrics would you include?
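To make the hint in question 8 concrete, here is a small sketch that enumerates the powerset of the three organ labels and checks that its size equals $2^K$. The lowercase label strings are placeholders chosen for this example; it only counts label combinations and says nothing about how to evaluate them.

```python
from itertools import combinations

# Placeholder label names for the three organs mentioned above
organ_labels = ["heart", "liver", "spleen"]

# Powerset: every subset of the labels, from the empty set to all labels together
label_powerset = [
    combo
    for size in range(len(organ_labels) + 1)
    for combo in combinations(organ_labels, size)
]

for combo in label_powerset:
    print(combo if combo else "(no organ in view)")

# The number of combinations is 2^K for K labels
print(f"K = {len(organ_labels)} labels -> {len(label_powerset)} combinations")
```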

In the next iteration of the model's development, more organs can be detected. Now the model can detect a total of 10 organs.

10. What is your opinion now on converting it to a powerset?

11. You can also convert it to a powerset using only the combinations that occur in the training set. What would be the pros and cons of this approach?

12. Discuss how you would evaluate this problem.

Author: Riaan Zoetmulder