In the last few lectures we've mostly worked on getting familiar with basic deep learning architectures, such as the feedforward and convolutional neural network. In this tutorial we take a step back from technical subjects and look at the basics of managing a deep learning project. We discuss how to manage and prepare the data and how to tune a model.
Important notes
- It is fine to consult with colleagues to solve the problems, in fact it is encouraged.
- Please turn off AI tools; we want you to memorize concepts rather than quickly breeze through the problems. To turn off AI, click on the gear in the top-right corner, go to AI assistance, untick "Show AI powered inline completions", untick "Consented to use generative AI features", and tick "Hide generative AI features".
5. Deep Learning Project Basics
5.1 Managing your data
Say you have been given a dataset to develop a deep learning model on. Your first instinct might be to just go ahead and train a model on it. But in practice you often spend much more time cleaning, understanding, and managing your data than developing machine learning models. Doing so will allow you to make more reasonable assumptions about what your deep learning model will and won't generalise to. So what should you look out for, specifically when working with medical (image) data?
Investigate your data
Medical data can have a lot of nuances and complexities. The first thing you want to do is investigate the data and know where it came from. This is to prevent (as much as possible) your deep learning model, which is trained on the data, from suddenly failing when it is applied in the real world. Below, I have categorized topics you should investigate before doing anything else.
What population was this data collected from?
Oftentimes you will find that medical image data is collected in a registry and comes from multiple hospitals in different geographical locations. In an ideal situation you are working with a medical specialist when exploring the data. However, this is not always possible. Here are questions that you could ask about the patient population from which the data was acquired:
- What is the age range?
- In what location were the hospitals that the patients went to?
- How many men/women?
- What types of illnesses did the patients have, or what treatments were they receiving?
- In case of a single illness: What ancillary outcome metrics are available for each patient?
- For stroke you have: Door-to-needle time, picture-to-puncture time.
- Scores on checklists are often also useful to report, for stroke for example: mTICI, ASPECTS score, Modified Rankin Scale.
- What pre-existing conditions or comorbidities did these patients have?
- Are there variables used to differentiate within the disease itself?
- Examples are the exact location of the occlusion in a stroke (e.g., the Middle Cerebral Artery, Basilar Artery, or Carotid Artery), or the type of stroke (ischaemic or haemorrhagic).
What are the technical aspects of the data?
By technical aspects I mean any settings of the technology that was used during the acquisition of the image data. Especially if different people are involved in the acquisition, these settings may vary from image to image.
Here are sample questions that you could ask to discover sources of variation:
- What were the machine settings during acquisition?
- Different technicians may use different parameters.
- Different machines have different hardware options.
- What were the hardware configurations of the machine?
- Size of the detector in x-ray, strength of the magnet in MRI.
- What processing has been used on your images?
- Generally the less processed, the better. This is because processing algorithms create more variation and tend to change more often than hardware.
- Has any post-processing been used before storing the data?
- An example from research: I was working on a dataset in which all the slices in my CT scans were spaced 5 mm apart. I thought this was a default setting, but it turned out that the radiologists down-sampled the data before storing it to save space. We explicitly had to ask them to stop downsampling to get the full data.
- In case of a multi-vendor setup: What machine types from what vendor were used to acquire the data?
Manually going through the data
Oftentimes it is necessary for somebody to have gone through the data manually. You will find that even in a registry, your data will be messy and each patient folder will have multiple images in it that aren't always clearly labelled. In addition, images will often have artefacts that make them less usable, or not usable at all, for a specific purpose. These images should also be excluded. Finally, some images might be of a completely different type than you would assume. The researcher should always keep track of which images are included and excluded and for what reason.
What should be reported?
Ideally, you should be capable of making tables and figures as shown below. These tables and figures are taken from the papers Brain segmentation in patients with perinatal arterial ischemic stroke and Automated Final Lesion Segmentation in Posterior Circulation Acute Ischemic Stroke Using Deep Learning (a shameless plug by the author of this tutorial). If these details are not known, ask the owner of the data about them. Sometimes it is not possible to obtain the aforementioned information due to privacy laws.
Patient Characteristics Table
Technical parameters of the images
Inclusion/Exclusion Tables
Investigate your annotations
If you are working in a supervised learning setup, your network learns from annotations that you have made on the data. Examples of supervised learning are classification, segmentation, and object detection. Inspecting your labels is informative because it may reveal underlying problems, and you want to know whether you have to correct for any such problems in your data.
Another aspect that you want to get right in medical image analysis is the process by which you obtained your labels. In computer vision it is common to use labels obtained via crowd-sourcing platforms such as Mechanical Turk. However, in medical image analysis you usually need much more expertise, so reporting who annotated the data and with what expertise is relevant. In this section we discuss several types of problems that you may encounter.
Label Imbalance
Label imbalance is a common and significant challenge encountered in many supervised machine learning tasks. It refers to the unequal distribution of samples across different classes within a dataset. In medical contexts, this often manifests when dealing with rare diseases or conditions, where the number of positive cases is substantially lower than the number of healthy or negative cases. Fortunately, it is usually easy to detect. Simply tabulate the prevalence of the labels in your dataset. This is of course assuming all your labels of interest exist in your dataset.
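As a quick sketch (using hypothetical labels, not from any real dataset), tabulating prevalence can be as simple as:

```python
from collections import Counter

# Hypothetical label list; in practice these come from your dataset.
labels = ["healthy"] * 950 + ["tumor"] * 50

counts = Counter(labels)
for label, count in counts.items():
    # Report both the absolute count and the relative prevalence.
    print(f"{label}: {count} ({count / len(labels):.1%})")
```

A table like this is often the first thing to put in your report before choosing any training strategy.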
This imbalance poses a significant problem because standard machine learning algorithms are often designed to maximize overall accuracy. When faced with imbalanced data, a model might achieve high overall accuracy by simply predicting the majority class for most or all samples, while performing poorly on the minority class. This is particularly problematic in medical applications where correctly identifying the minority class (e.g., detecting a rare tumor) is often the primary goal and misclassifications can have serious consequences. Furthermore, misleading evaluation metrics can obscure the true performance of the model on the critical minority class.
Label imbalance impacts various types of medical image analysis tasks. In classification, it directly affects the training process, potentially leading to a biased model that favors the majority class. This necessitates careful consideration of training losses and the use of evaluation metrics beyond simple accuracy. These metrics are discussed in tutorial 7.
A segmentation task involves indicating what pixels belong to a certain object. You can see it as pixel wise classification. In segmentation tasks, label imbalance occurs at the pixel/voxel level. The number of pixels belonging to the target structure (e.g., a lesion or organ) is often a tiny fraction of the total pixels in an image. This imbalance makes it challenging for models to correctly identify and segment the minority class pixels, potentially leading to incomplete or inaccurate segmentations.
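A minimal sketch of quantifying pixel-level imbalance, using a hypothetical 256x256 mask containing a small 10x10 lesion:

```python
import numpy as np

# Hypothetical binary segmentation mask: background = 0, lesion = 1.
mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:110, 100:110] = 1  # a small 10x10 lesion

# The mean of a binary mask is the fraction of foreground pixels.
foreground_fraction = mask.mean()
print(f"Lesion covers {foreground_fraction:.2%} of the image")
```

Even this modest example shows the lesion occupying well under 1% of the pixels, which is typical of medical segmentation tasks.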
Detection involves drawing a bounding box around an object if it exists in an image. Similar to segmentation, in detection tasks, the number of objects of interest (e.g., individual cells, lesions) is typically much smaller than the vast number of background regions in an image. This imbalance can overwhelm detection models, resulting in a high rate of false positives (incorrectly identifying background as objects) or missed detections of the crucial minority class objects.
Addressing label imbalance is crucial for building effective medical AI models. Strategies for handling this issue include resampling techniques like oversampling the minority class or undersampling the majority class, employing algorithmic approaches such as using class weights during training to give more importance to the minority class, and utilizing appropriate loss functions that are less sensitive to imbalance (specific loss functions will be covered in the next tutorial). By understanding and actively mitigating label imbalance, we can develop more robust and reliable models that perform well on all classes, especially those of critical importance in medical diagnosis and treatment.
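As one illustration of the class-weight strategy mentioned above, inverse-frequency weights (the same formula scikit-learn uses for its "balanced" mode) can be computed as follows; the labels are made up:

```python
import numpy as np

labels = np.array([0] * 950 + [1] * 50)  # hypothetical imbalanced labels

# Inverse-frequency class weights: rarer classes get larger weights.
classes, counts = np.unique(labels, return_counts=True)
weights = len(labels) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.tolist())))
```

These weights can then be passed to a weighted loss function so that errors on the minority class contribute proportionally more to the training signal.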
Size/volume of segmentation or detection labels
Evaluating the size or volume of segmentation and detection labels is a critical aspect of managing medical image data, particularly because small objects pose unique challenges for deep learning models. Understanding the size distribution of your target labels is crucial for interpreting model performance and identifying potential issues. Small objects have a limited number of pixels or voxels, making it difficult for models to learn robust features from them. This often leads to models easily missing these small objects or confusing them with noise or background.
For segmentation tasks, it's important to calculate the volume (in 3D, scans such as MRI or CT) or area (in 2D) of each segmented region. Analyzing the distribution of these volumes or areas across your dataset can reveal patterns and potential problems. One thing that can occur is that the overall volume in your segmentation annotation (typically referred to as a segmentation mask) seems large, but when you inspect it you notice that the segmentation mask is composed of many smaller objects. In this case a particularly useful technique is a connected components analysis. This will allow you to individually analyse the volume of the objects in your segmentation mask. For example, in Multiple Sclerosis (MS) lesion segmentation, a patient might have many small, scattered lesions. A connected components analysis can help you identify and measure the size of each individual lesion, giving a more detailed understanding than just the total lesion volume.
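A small sketch of a connected components analysis on a hypothetical mask, using `scipy.ndimage` (assuming SciPy is available in your environment):

```python
import numpy as np
from scipy import ndimage

# Hypothetical mask with three separate "lesions" of different sizes.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[5:10, 5:10] = 1     # 25-pixel lesion
mask[30:32, 30:32] = 1   # 4-pixel lesion
mask[50:51, 50:53] = 1   # 3-pixel lesion

# Label each connected component, then measure its size individually.
labeled, n_components = ndimage.label(mask)
sizes = ndimage.sum(mask, labeled, index=range(1, n_components + 1))
print(f"{n_components} components with sizes {sizes}")
```

This per-lesion breakdown is exactly what total volume hides: three lesions of very different sizes would look like one number otherwise.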
Similarly, for detection tasks, analyzing the size (e.g., bounding box dimensions) of the detected objects is essential. Understanding the range and distribution of object sizes in your dataset helps in interpreting why a model might struggle with certain objects. Knowing the size characteristics of your labels not only explains model behavior but also informs the selection of appropriate model architectures and training strategies. For instance, techniques like using specific loss functions that emphasize small objects, data augmentation tailored for small structures, multi-scale processing, or optimizing anchor box sizes in object detection are all guided by the size distribution of your target labels.
Ultimately, the size and volume of segmentation and detection labels significantly inform how we interpret our model results. It is essential for evaluating the clinical relevance of your model's performance; misclassifying a large, aggressive tumor has different clinical implications than misclassifying a tiny, benign finding. It's also important to remember that clinical importance might not always directly correlate with size. In the aforementioned example of MS lesion detection, the count of smaller lesions is often also very relevant. Furthermore, understanding size helps in selecting appropriate evaluation metrics and potentially employing size-specific evaluation if necessary. Visualizing the segmented or detected objects along with their size information can provide valuable insights into both the data and the model's performance.
Agreement
Ideally, multiple raters should have annotated your data. This is because there is significant inter-rater variation, even in medical problems. Knowing what the inter-rater variation is informs us whether a problem is worth pursuing with a deep learning model and sets reasonable expectations for the eventual model's performance. If human experts disagree frequently, expecting a machine learning model to achieve perfect agreement is unrealistic.
When working with different raters it is important to mention how they were trained. This may be done by mentioning their years of experience as a radiologist or training they specifically received for the annotation task. Moreover, describe the process by which the annotators annotated the data. Did they do so with or without working together? Did an inexperienced annotator first annotate the images and were they subsequently corrected by another, more experienced, annotator?
To understand the level of agreement between different human annotators (raters), you need to establish inter-rater reliability. The methods for doing this vary depending on the type of task:
For classification problems, where raters assign categories or scores (such as assessing qualitative aspects of image quality or grading the severity of a condition), you can use metrics that quantify the level of agreement beyond what would be expected by chance. Common metrics include Cohen's kappa (for two raters) or Fleiss' kappa (for more than two raters), which account for chance agreement. A concordance score (like Kendall's W) can also be used, particularly for ranked data. The higher the score on these metrics, the more the raters agree with each other. Additionally, a linear model can sometimes be employed to identify and potentially account for systematic biases between raters. It's also worth considering whether agreement needs to be exact or merely acceptable within a certain margin, depending on the clinical context.
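For illustration, Cohen's kappa for two raters can be computed directly from its definition (observed agreement corrected for chance agreement); the rater labels below are made up:

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical labels."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    classes = np.union1d(r1, r2)
    p_o = np.mean(r1 == r2)  # observed agreement
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in classes)
    return (p_o - p_e) / (1 - p_e)

rater_a = [1, 1, 0, 1, 0, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 0, 1, 1]
print(cohens_kappa(rater_a, rater_b))
```

Note how the raters agree on 6 of 8 cases (75%), yet kappa is only 0.5, because half of that agreement is what you would expect by chance alone.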
For segmentation problems, where raters delineate structures or abnormalities within an image, agreement is assessed based on the overlap and distance between the annotated regions. While specific metrics like the Dice coefficient, Jaccard index (measures of overlap), and Hausdorff distance (a measure of distance) are commonly used, a detailed explanation of these will come later in tutorials 7 and 9. Conceptually, you are evaluating how well the boundaries and areas of the segmented regions align between different raters. To assess agreement between multiple raters, a common approach is to create a consensus segmentation mask from all but one rater (e.g., using a majority vote for each pixel, or an algorithmic approach called STAPLE). You then calculate the agreement between the left-out rater's annotation and this consensus mask. This process is repeated for each rater. Visualizing the areas where raters agree and disagree on the images themselves can provide valuable qualitative insights. Volumetric agreement, perhaps visualized with Bland-Altman plots comparing the volumes of segmented regions by different raters, can also be informative, especially if volume is used to measure a clinical endpoint.
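As a conceptual preview (the full treatment of these metrics comes in tutorials 7 and 9), here is the Dice coefficient between two hypothetical binary masks:

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    intersection = np.logical_and(a, b).sum()
    return 2 * intersection / (a.sum() + b.sum())

# Two 4x4 squares annotated by different raters, offset by one pixel.
m1 = np.zeros((10, 10), dtype=bool)
m1[2:6, 2:6] = True
m2 = np.zeros((10, 10), dtype=bool)
m2[3:7, 3:7] = True
print(dice(m1, m2))
```

Even a one-pixel shift between two otherwise identical small annotations drops the Dice score noticeably, which is why small structures tend to show lower inter-rater agreement.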
For detection tasks, where raters draw bounding boxes around objects of interest, assessing agreement involves considering both the location and size of these boxes. A key concept here is Intersection over Union (IoU), which measures the overlap (in a single number) between two bounding boxes relative to their combined area. While the calculation of IoU will be covered later, it's used to determine if two bounding boxes from different raters correspond to the same object and how well they align. Based on matching bounding boxes using IoU (typically above a threshold), you can then quantify agreement by counting the objects that both raters identified.
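A minimal IoU implementation for axis-aligned boxes in (x1, y1, x2, y2) format; the boxes below are hypothetical:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes don't overlap at all.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))
```

Two raters' boxes would then be matched as "the same object" when their IoU exceeds a chosen threshold (0.5 is a common convention).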
Finally, it is important to remember that agreement only measures the degree to which annotators agree on labels, not whether the labels themselves are informative.
5.2 Data Splitting and Validation
Train, Validation, and Test Sets
The fundamental first step in preparing your data for a machine learning project is splitting it into distinct subsets: the training set, the validation set, and the test set. The primary purpose of this split is to ensure that you can train your model effectively, tune its hyperparameters without bias, and finally evaluate its performance on data it has never seen before, mimicking how it would perform in the real world.
Let's define each set:
- The Training Set is the largest portion of your data and is used to train the machine learning model. During training, the model learns patterns and relationships within this data to make predictions or decisions.
- The Validation Set is used for hyperparameter tuning and model selection. As you develop your model, you'll try different architectures, regularization techniques, learning rates, and other hyperparameters. The validation set allows you to evaluate the performance of different model configurations during the training process and select the one that performs best without touching the final test set.
- The Test Set is held out from the entire model development process until the very end. It is used for a final, unbiased evaluation of the selected and tuned model. The test set provides an estimate of how well your model is likely to generalize to truly unseen data. It is crucial that the test set is used only once for this final evaluation. Repeatedly evaluating on the test set can lead to overfitting to the test data.
Typical Split Ratios for these sets are common practice, although they can vary depending on the size and nature of your dataset. Common ratios include 70% for training, 15% for validation, and 15% for testing (70/15/15), or sometimes 80/10/10 for larger datasets. The choice of ratio is influenced by factors such as the total size of your dataset (smaller datasets might require relatively larger validation/test sets to ensure they are representative, this is quite common in medical image analysis) and the complexity of the task.
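A minimal sketch of a 70/15/15 split over shuffled indices (dummy data, fixed seed for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(42)
indices = rng.permutation(100)  # shuffled indices of a 100-sample dataset

# Split the shuffled indices at positions 70 and 85 -> 70/15/15.
train, val, test = np.split(indices, [70, 85])
print(len(train), len(val), len(test))
```

Shuffling before splitting matters: data is often stored in a meaningful order (by hospital, by date), and splitting without shuffling would make the subsets systematically different. As the next section explains, for medical data the shuffle-and-split must additionally respect patient boundaries.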
Data Leakage
Data leakage occurs when information from your validation or test set inadvertently "leaks" into the training process. This leads to overly optimistic performance estimates during development because your model is essentially getting a sneak peek at the data it's supposed to be evaluated on independently. This "cheating" results in a model that performs well on your specific validation/test sets but generalizes poorly to truly unseen data in the real world.
Data leakage is particularly critical and often more challenging to avoid in medical AI due to the nature of medical data. A major source of leakage in this domain is patient-specific data. Medical datasets frequently contain multiple images, scans, or related clinical information belonging to the same patient.
To effectively prevent data leakage in medical AI, it is paramount to ensure that data from the same patient is never present in more than one data split (training, validation, or test set). This applies to all types of data associated with a patient, including different scans, images acquired at different time points, or any related clinical data. The correct approach is to split your data based on patient ID rather than just individual images or data points. This guarantees that all data points belonging to a particular patient reside exclusively within one of the three sets. If you don't have patient IDs, an imperfect solution is to try to select data from different hospitals in different geographical regions for the different sets.
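A sketch of patient-level splitting, with hypothetical file names and patient IDs:

```python
# Hypothetical mapping from image file to patient ID.
images = {
    "img_001.png": "patient_A", "img_002.png": "patient_A",
    "img_003.png": "patient_B", "img_004.png": "patient_C",
    "img_005.png": "patient_C", "img_006.png": "patient_D",
}

# Split on unique patient IDs first, then collect each patient's images.
patients = sorted(set(images.values()))
train_patients, test_patients = set(patients[:3]), set(patients[3:])
train_imgs = [f for f, p in images.items() if p in train_patients]
test_imgs = [f for f, p in images.items() if p in test_patients]

# No patient contributes images to both splits.
assert not train_patients & test_patients
print(train_imgs, test_imgs)
```

Had we split on image file names directly, patient_A's two images could have landed in different sets, which is exactly the leak being prevented here.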
While patient-level leakage is a primary concern in medical imaging, other sources of leakage can also exist. These might include applying pre-processing steps (like normalization or feature extraction) to the entire dataset before splitting it, or using information derived from the whole dataset (like overall data statistics) during the training of individual models on subsets. Another common way that data leakage can occur is if you want to subsample scans (by taking slices or volumes of interest) and then split up the subsampled data without taking into account the patient from which it was sampled.
The consequences of data leakage are significant and can be detrimental in medical applications. Models trained on leaked data will likely show inflated performance metrics during your internal evaluation. However, when these models are deployed in a real-world clinical setting on genuinely unseen patients, their performance will often be much worse than expected. This can lead to unreliable diagnostic or prognostic tools, potentially impacting patient care negatively. Therefore, rigorously preventing data leakage is a cornerstone of developing trustworthy medical AI models.
Data leakage is also one of the reasons why many AI papers from about a decade ago will show inflated evaluation metrics. So make sure to always look out for a section on how the data was split up.
Cross-Validation
Cross-Validation is a widely used technique in machine learning to get a more robust and reliable estimate of your model's performance compared to using a single, fixed train/validation split. It's particularly valuable when you have a limited amount of data, as it allows you to make better use of your entire dataset for both training and evaluation.
The primary motivation for using cross-validation is to reduce the dependence of your evaluation results on a single, potentially unrepresentative, random split of your data. With a single split, the performance metric you get can be highly sensitive to which specific data points ended up in the training set and which in the validation set. Cross-validation mitigates this by performing the evaluation multiple times on different subsets of the data and averaging the results. This provides a more stable and reliable estimate of how well your model is likely to generalize to unseen data, giving you greater confidence in your model selection and performance claims. It also allows the entire dataset to be used for training and validation across the different iterations, making more efficient use of your data.
K-Fold Cross-Validation
K-Fold Cross-Validation is a widely used technique to get a more robust estimate of model performance compared to a single train/validation split and to make better use of limited data. On its own it is typically used for performance estimation rather than for hyperparameter tuning. Here is how it works:
- The dataset is divided into K equally sized "folds" or subsets.
- The cross-validation process is then repeated K times (K "folds").
- In each fold, one of the K subsets is held out as the validation set, and the remaining K-1 subsets are used as the training set.
- A model is trained on the training set and evaluated on the held-out validation set.
- This process is repeated K times, with a different fold used as the validation set in each iteration.
- The final performance metrics are calculated by averaging the performance across all K folds.
K-fold cross-validation utilizes the entire dataset for both training and validation. This reduces the dependence of the results on a single random train/validation split and hence provides a more reliable estimate of how the model will generalize to unseen data.
# @title
import matplotlib.pyplot as plt
import numpy as np

# Define the number of folds and samples
n_folds = 5
n_samples = 20

# Generate dummy data for illustration (indices)
X = np.arange(n_samples)

# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 4))

# Loop through each fold
for i in range(n_folds):
    # Determine the indices for the validation set
    fold_size = n_samples // n_folds
    val_start = i * fold_size
    val_end = (i + 1) * fold_size
    val_indices = np.arange(val_start, val_end)

    # Determine the indices for the training set
    train_indices = np.setdiff1d(np.arange(n_samples), val_indices)

    # Plot the training data for this fold
    # Use train_indices for x-positions, a constant height, and 'bottom' for the vertical position
    ax.bar(train_indices, 0.8, bottom=i, color='skyblue', label='Training set' if i == 0 else "")
    # Plot the validation data for this fold
    ax.bar(val_indices, 0.8, bottom=i, color='salmon', label='Validation set' if i == 0 else "")

# Set labels and title
ax.set_yticks(np.arange(n_folds) + 0.4)  # Position ticks in the middle of bars
ax.set_yticklabels([f'Fold {i + 1}' for i in range(n_folds)])
ax.set_xlabel("Data Samples")
ax.set_title("K-Fold Cross-Validation (K=5)")

# Move the legend outside the plot
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
ax.set_ylim(n_folds, -0.1)  # Invert y-axis to have Fold 1 at the top
ax.set_xlim(-0.5, n_samples - 0.5)  # Adjust x-axis limits

plt.tight_layout()
plt.show()
Stratified K-Fold Cross-Validation
Stratified K-Fold Cross-Validation is a variation of K-Fold that is particularly useful when dealing with imbalanced datasets, where the distribution of classes is not uniform.
It follows the same principle as standard K-Fold Cross-Validation, but with an important modification: when dividing the dataset into K folds, Stratified K-Fold ensures that each fold has approximately the same proportion of samples from each target class as the original dataset.
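A small illustration using scikit-learn's `StratifiedKFold` on a hypothetical 90/10 imbalanced label set; each validation fold keeps the original class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)  # hypothetical imbalanced labels
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold keeps the original 90/10 class ratio.
    print(f"Fold {fold}: positives in val = {y[val_idx].sum()} / {len(val_idx)}")
```

With plain K-Fold on such data, a fold can end up with very few (or zero) positives, making its validation score nearly meaningless; stratification guarantees every fold sees both classes in the right proportions.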
Group K-Fold Cross-Validation
Group K-Fold Cross-Validation is a variation of K-Fold that is essential when your data has inherent groupings, and you need to ensure that all data points from the same group are kept together in the same fold. This is particularly critical in medical AI, when you need to group data on the basis of Patient ID. Here is how this works:
- Instead of splitting individual data points randomly, the splitting is done based on predefined groups.
- All data points belonging to the same group (e.g., the same patient) are assigned to the same fold.
- The cross-validation proceeds as in standard K-Fold, but the folds now consist of entire groups.
This prevents data leakage, which, as was mentioned before, is a potential problem in medical AI. By keeping all data from a single patient in one fold, you prevent information from that patient from leaking between the training and validation/test sets, ensuring a realistic evaluation of the model's ability to generalize to new patients.
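A sketch using scikit-learn's `GroupKFold`, with hypothetical patient IDs as the groups (three scans per patient):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Three scans per patient; groups are hypothetical patient IDs.
patient_ids = np.repeat([f"patient_{i}" for i in range(6)], 3)
X = np.arange(len(patient_ids)).reshape(-1, 1)

gkf = GroupKFold(n_splits=3)
for train_idx, val_idx in gkf.split(X, groups=patient_ids):
    # No patient appears in both the training and validation folds.
    assert not set(patient_ids[train_idx]) & set(patient_ids[val_idx])
print("No patient-level leakage across folds")
```

The in-loop assertion is the key property: every patient's scans travel together, so the validation folds only ever contain patients the model has never trained on.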
Nested Cross-Validation
Nested Cross-Validation is a more sophisticated approach used for robust hyperparameter tuning and model selection, providing a more reliable estimate of the model's generalization performance. It involves two layers of cross-validation. Here is how it works:
- Outer Loop: The dataset is split into K outer folds. This outer loop is used to estimate the model's generalization performance.
- Inner Loop: For each split of the outer loop (using K-1 folds for training and 1 for the outer validation/test), an inner cross-validation loop is performed only on the training data of the outer loop. This inner loop is used for hyperparameter tuning and model selection (e.g., using Grid Search or Random Search which are discussed later).
- Once the best hyperparameters are found using the inner loop, the model is trained on the entire training data of the outer loop (with the best hyperparameters) and evaluated on the held-out outer validation/test fold.
- This process is repeated for each outer fold, and the final performance is the average across the outer folds.
Nested cross-validation provides a more reliable estimate of the model's performance on unseen data by separating the hyperparameter tuning process from the final performance evaluation. It helps guard against overfitting and often results in a more realistic, slightly more conservative estimate of performance.
Nested cross-validation is recommended when you are performing significant hyperparameter tuning or model selection and want to get an unbiased estimate of the performance of your entire modeling pipeline (including the tuning process). The downside is that it is computationally expensive.
For a visual representation, see below!
# @title
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import KFold

# Define the number of outer and inner folds
n_outer_folds = 3
n_inner_folds = 3
n_samples = 30  # Chosen to be divisible by the outer and inner fold counts

# Generate dummy data for illustration (indices)
X = np.arange(n_samples)

# Create a figure and axis, adjusting height for the extra rows
fig, ax = plt.subplots(figsize=(12, 9))

outer_cv = KFold(n_splits=n_outer_folds)

# Keep track of the current row for plotting
current_row = 0

# Loop through the outer folds
for i, (outer_train_indices, outer_test_indices) in enumerate(outer_cv.split(X)):
    # Plot the outer test set on its own row
    ax.bar(outer_test_indices, 0.8, bottom=current_row, color='limegreen', label='Outer Test set' if current_row == 0 else "")
    # Add label for the outer fold row
    ax.text(-1, current_row + 0.4, 'Outer Fold {}\n (Test Set)'.format(i + 1), va='center', ha='right', fontsize=10, fontweight='bold')
    current_row += 1  # Move to the next row for inner splits

    # Define inner cross-validation on the outer training data
    inner_cv = KFold(n_splits=n_inner_folds)

    # Loop through the inner folds (on the outer training data)
    for j, (inner_train_indices, inner_val_indices) in enumerate(inner_cv.split(X[outer_train_indices])):
        # Map inner indices back to original indices
        original_inner_train_indices = outer_train_indices[inner_train_indices]
        original_inner_val_indices = outer_train_indices[inner_val_indices]

        # Plot the inner training set on a new row
        ax.bar(original_inner_train_indices, 0.8, bottom=current_row, color='skyblue', label='Inner Training set' if current_row == 1 else "")
        # Plot the inner validation set on the same row, next to the training set
        ax.bar(original_inner_val_indices, 0.8, bottom=current_row, color='salmon', label='Inner Validation set' if current_row == 1 else "")
        # Add label for the inner fold row
        ax.text(-1, current_row + 0.4, f'Inner Split {j + 1}', va='center', ha='right', fontsize=10)
        current_row += 1  # Move to the next row

# Set y-axis limits and ticks to match the rows
ax.set_ylim(current_row, -0.1)  # Invert y-axis
ax.set_yticks([])  # Remove y-ticks as we are using text labels
ax.set_xlabel("Data Samples")
ax.set_title("Nested Cross-Validation (Outer K=3, Inner K=3)")

# Move the legend outside the plot
ax.legend(loc='upper left', bbox_to_anchor=(1.05, 1), borderaxespad=0.)
ax.set_xlim(-0.5, n_samples - 0.5)  # Adjust x-axis limits

plt.tight_layout(rect=[0, 0, 0.85, 1])  # Make space for the legend and labels
plt.show()
Other methods
You have been introduced to a variety of important cross-validation techniques. There are more methods available; you can find many of them explained on this page. Moreover, you can combine various cross-validation techniques. For example, you can combine nested, stratified, and group cross-validation to take advantage of each of their strengths.
5.3 Hyperparameter Tuning
Hyperparameter tuning is the process by which optimal settings (the hyperparameters) for the model are found. Hyperparameters are different from the model's parameters, which are the weights and biases that we learn during training: they are settings that are external to the model. These settings significantly influence the training process and the model's performance on unseen testing data. Examples of hyperparameters include the learning rate, the number of layers in a neural network, the number of neurons in each layer, the type of optimizer used, the regularization strength, and the batch size.
We will be discussing various hyperparameter tuning strategies next, going from simple to more complex. For this section, simply understanding that the different methods exist is enough; we will not go into technical detail.
Manual Search
Manual search is a hyperparameter tuning approach where a human expert, typically a researcher or engineer with significant experience, intuitively selects and adjusts hyperparameter values. The process involves training the model with initial hyperparameter settings, evaluating its performance on a validation set, and then manually modifying the hyperparameters based on the observed results and their understanding of the model's behavior and the data's characteristics.
This method is primarily used by experienced practitioners who have developed an intuition for how different hyperparameters affect model training and performance in their specific domain. They use it as an initial step in the hyperparameter tuning of MVPs (minimum viable products). Moreover, it is used as a sanity check to determine whether a model is implemented correctly: if the loss doesn't decrease despite your having tried various hyperparameters, you may have a bug in your code or a mistake in your data/annotations!
One of the main advantages of manual search is that it can leverage valuable human expertise and domain-specific knowledge. An experienced individual can often make intelligent adjustments based on qualitative observations during training, potentially leading to faster progress in some cases, especially when dealing with novel architectures or complex datasets where automated methods might struggle initially. It's also flexible and adaptive, allowing for real-time adjustments.
Despite the advantages, manual search is highly dependent on the individual's expertise and can be time-consuming and inefficient, particularly when the number of hyperparameters is large or the search space is vast. It's also less systematic than automated methods, making it difficult to reproduce the exact tuning process. Furthermore, it can be prone to human bias and may not always find the globally optimal set of hyperparameters.
Grid Search
Grid Search is a straightforward and systematic approach to hyperparameter tuning that involves exhaustively searching through a predefined set of hyperparameter values to find the best combination.
To perform a Grid Search, you first define a discrete set of possible values for each hyperparameter you want to tune. For example, you might specify a list of learning rates like [0.1, 0.01, 0.001], a set of batch sizes like [32, 64, 128], and a few options for the number of layers. Grid Search then evaluates the model's performance for every possible combination of these specified values. If you have 3 learning rates, 3 batch sizes, and 2 options for the number of layers, Grid Search will train and evaluate the model for 3 * 3 * 2 = 18 different combinations. This evaluation is done using a validation set or through cross-validation to get a more reliable estimate of performance. The combination of hyperparameters that yields the best performance (e.g., highest accuracy, lowest loss) is then selected as the optimal set.
Practitioners often use Grid Search when the number of hyperparameters to tune and their possible values are relatively small (typically 2-4) and they have sufficient computational resources to explore the entire grid. It's also used as a systematic way to explore a limited search space, especially when starting with a new model or dataset.
The main advantages are its simplicity to understand and implement. It's also guaranteed to find the best combination within the defined search space and value ranges, assuming the model training and evaluation are deterministic. Furthermore, it is a systematic and reproducible method.
The primary disadvantage is that it can be computationally very expensive, especially as the number of hyperparameters or the number of values per hyperparameter increases. The size of the search space grows exponentially with the number of hyperparameters, making it impractical for tuning many hyperparameters simultaneously. It also does not account for the interaction between hyperparameters effectively and can be inefficient if the optimal values lie between the defined grid points.
learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]
num_layers = [2, 3, 4]

# Build every combination with nested loops (verbose, but easy to follow)
hyperparameters = []
for lr in learning_rates:
    for bs in batch_sizes:
        for nl in num_layers:
            hyperparameters.append((lr, bs, nl))
print(hyperparameters)
[(0.1, 32, 2), (0.1, 32, 3), (0.1, 32, 4), (0.1, 64, 2), (0.1, 64, 3), (0.1, 64, 4), (0.1, 128, 2), (0.1, 128, 3), (0.1, 128, 4), (0.01, 32, 2), (0.01, 32, 3), (0.01, 32, 4), (0.01, 64, 2), (0.01, 64, 3), (0.01, 64, 4), (0.01, 128, 2), (0.01, 128, 3), (0.01, 128, 4), (0.001, 32, 2), (0.001, 32, 3), (0.001, 32, 4), (0.001, 64, 2), (0.001, 64, 3), (0.001, 64, 4), (0.001, 128, 2), (0.001, 128, 3), (0.001, 128, 4)]
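The nested loops above can be written more compactly with `itertools.product` from the Python standard library, which generates the same combinations in the same order:

```python
from itertools import product

learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]
num_layers = [2, 3, 4]

# product() yields one tuple per combination, the last list varying fastest,
# exactly like the triple nested loop
hyperparameters = list(product(learning_rates, batch_sizes, num_layers))
print(len(hyperparameters))  # 3 * 3 * 3 = 27 combinations
```

In a real grid search you would iterate over `hyperparameters`, train and evaluate a model for each tuple, and keep the best-scoring combination.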
Random Search
Random Search is a hyperparameter tuning approach that addresses some of the limitations of Grid Search. It works by randomly sampling hyperparameter combinations from a predefined search space.
Instead of defining a discrete set of values for each hyperparameter and evaluating every combination, Random Search defines a range or distribution (e.g., uniform or logarithmic) for each hyperparameter. It then randomly samples a fixed number of combinations from these ranges or distributions. For each sampled combination, the model is trained and evaluated on a validation set. The combination that yields the best performance is chosen as the optimal set found by the random search.
The main advantage of Random Search is its efficiency in high-dimensional hyperparameter spaces. As you progress further into your deep learning projects, you will find that you need to tune more hyperparameters (and they influence each other). As the number of hyperparameters increases, the search space grows exponentially, making Grid Search impractical. Random Search, by randomly sampling, is more likely to find good hyperparameter values in fewer iterations compared to Grid Search, especially if only a few hyperparameters significantly impact performance. It can also explore a wider range of values for each hyperparameter compared to the fixed points in a Grid Search. Finally, it is relatively simple to implement.
A disadvantage is that there is no guarantee of finding the globally optimal combination, unlike Grid Search, which is exhaustive within its defined grid. The performance of Random Search depends on the number of random samples chosen: too few samples might miss the optimal region, while a large number can still be computationally expensive.
Random Search is often a good strategy to start with, as it can quickly give you an idea of promising regions in the hyperparameter space before potentially focusing on those areas with other methods.
import random

num_samples = 10
hyperparameters = []
for sample in range(num_samples):
    # Sample each hyperparameter within its range
    lr = random.uniform(0.001, 0.1)
    bs = random.randint(32, 128)
    nl = random.randint(2, 5)
    hyperparameters.append((lr, bs, nl))
print(hyperparameters)
[(0.06284570122711361, 103, 4), (0.08844301291509257, 77, 2), (0.00919691841743489, 34, 2), (0.03875831054790029, 36, 4), (0.0886067089611427, 40, 2), (0.032533903088875694, 61, 2), (0.06734711051620564, 81, 5), (0.04260955168951026, 76, 5), (0.0189416303570268, 119, 5), (0.031578173119699404, 32, 4)]
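One common refinement, sketched below under the same ranges as above: learning rates are usually sampled log-uniformly (by sampling the exponent) so that the decade 0.001–0.01 is explored as thoroughly as 0.01–0.1, and batch sizes are usually restricted to powers of two. The seed and the specific choices here are illustrative, not prescriptive.

```python
import random

random.seed(42)  # fixed seed only for reproducibility of this example
num_samples = 10
hyperparameters = []
for _ in range(num_samples):
    # Sampling the exponent uniformly makes lr log-uniform over [0.001, 0.1]
    lr = 10 ** random.uniform(-3, -1)
    bs = random.choice([32, 64, 128])  # powers of two, as is conventional
    nl = random.randint(2, 5)
    hyperparameters.append((lr, bs, nl))
print(hyperparameters)
```

With plain `random.uniform(0.001, 0.1)`, roughly 90% of the samples land above 0.01; log-uniform sampling spreads them evenly across the orders of magnitude.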
Bayesian Optimization
Bayesian Optimization is a more advanced and efficient technique for hyperparameter tuning compared to Grid Search or Random Search. Instead of blindly searching the hyperparameter space, it uses a probabilistic model to guide the search and find the best hyperparameters more quickly. Think of it as a smart search that learns (in contrast to grid and random search, which do not learn) from past experiments to decide where to look next, minimizing the number of costly model training runs.
Bayesian Optimization builds a probabilistic model of the relationship between different hyperparameter combinations and the resulting model performance (e.g., accuracy on a validation set). A common choice for this model is a Gaussian Process, which not only predicts the expected performance for a given set of hyperparameters but also provides a measure of uncertainty around that prediction.
Based on this probabilistic model, an "acquisition function" is used to determine the next set of hyperparameters to evaluate. This function strikes a balance between two strategies:
- Exploration: Trying hyperparameter combinations in areas of the search space where the model is uncertain (high uncertainty).
- Exploitation: Trying hyperparameter combinations that the model predicts will result in high performance (high expected value).
By balancing exploration and exploitation, the acquisition function proposes the next most promising hyperparameters to try. The model is then trained with these new hyperparameters, and the results are used to update the probabilistic model. This iterative process continues, with the probabilistic model becoming more accurate and the acquisition function becoming better at guiding the search towards the optimal hyperparameters, until a stopping criterion is met.
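To make this loop concrete, here is a deliberately toy, self-contained sketch. A real implementation would use a Gaussian Process surrogate as described above; this sketch fakes the surrogate with inverse-distance weighting, and the objective function, the search range, and the exploration weight of 2.0 are all made-up illustrative values.

```python
import random

def objective(lr_exponent):
    # Stand-in for an expensive training run: pretend the best
    # learning-rate exponent is -2 (i.e. lr = 0.01)
    return -(lr_exponent + 2) ** 2

def surrogate(x, observed):
    # Crude stand-in for a Gaussian Process: predicted mean is an
    # inverse-distance-weighted average of past scores, and the
    # uncertainty is the distance to the nearest evaluated point
    dists = [abs(x - xi) for xi, _ in observed]
    weights = [1.0 / (d + 1e-9) for d in dists]
    mean = sum(w * y for w, (_, y) in zip(weights, observed)) / sum(weights)
    uncertainty = min(dists)
    return mean, uncertainty

def acquisition(x):
    # UCB-style acquisition: expected value plus an exploration bonus
    mean, uncertainty = surrogate(x, observed)
    return mean + 2.0 * uncertainty

random.seed(0)
observed = [(x, objective(x)) for x in (-4.0, -1.0)]  # two initial trials
for _ in range(15):
    candidates = [random.uniform(-4, -1) for _ in range(200)]
    x_next = max(candidates, key=acquisition)       # most promising proposal
    observed.append((x_next, objective(x_next)))    # "train" and record result

best_x, best_score = max(observed, key=lambda t: t[1])
print(f"best lr is roughly 10^{best_x:.2f}")
```

Even this crude version shows the essential behaviour: early proposals land far from evaluated points (exploration), and as the record of trials grows, proposals concentrate around the region with the best observed scores (exploitation).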
Bayesian Optimization is typically used by data scientists and machine learning engineers who are working on complex models or large datasets where training is computationally expensive and efficient hyperparameter tuning is critical. It's a technique employed in research and industry for more sophisticated optimization tasks.
Bayesian optimization is a large subject, and going into detail is beyond the scope of this tutorial. If you ever need to use it, there are plenty of libraries available, such as Hyperopt and Optuna.
5.4 Case Study Questions
To practise the materials in this notebook, you will be analysing a case study. Give the case study a serious attempt, debate it with your classmates, and write down what you think is the best course of action. It's not about getting it perfect, but about actively using the information in this tutorial. The case studies are entirely fictional and, though they contain medical-sounding information, they have not been fact-checked. We will discuss the case study questions in class.
Chest Ultrasound Study
You are reading a paper about deep learning based disease classification on chest ultrasound. The paper is about chest ultrasounds taken in an emergency room, which are used to assess the status of the lungs and heart. For instance: is there fluid in the lungs, is the heart inflamed, are the lungs collapsed, etc. At each examination, a total of 6 different ultrasound videos are made at different locations on the torso of the same patient.
The following paragraph is written in the study about the data:
In this retrospective study, we included image data from 1320 patients that were admitted to the emergency room of the Academic Hospital Harderwijk between 2020 and 2022. Patient characteristics are shown in table 1. The image data was acquired on a MedCorp BAT-54A portable ultrasound machine. The data was divided into train, validation, and test sets.
Table 1: Characteristics of patients admitted to the Academic Hospital Harderwijk
| Characteristic | Category | Count | Male | Female |
|---|---|---|---|---|
| Age Cohort | 0-18 | 61 | 36 | 25 |
| | 19-45 | 197 | 96 | 101 |
| | 46-65 | 419 | 194 | 225 |
| | 66+ | 643 | 329 | 314 |
| Reason for Admission | Covid-19 | 991 | 490 | 501 |
| | Heart Failure | 66 | 31 | 35 |
| | Pericarditis | 67 | 30 | 37 |
| | Pneumothorax | 131 | 70 | 61 |
| | Pulmonary embolism | 65 | 34 | 31 |
| Comorbidities | Diabetes | 428 | 218 | 210 |
| | Hypertension | 417 | 214 | 203 |
| | Congestive Heart Failure | 385 | 192 | 193 |
| | Asthma | 403 | 200 | 203 |
| | Coronary artery disease | 392 | 195 | 197 |
Question 1: Closely examine the table and the paragraph above, and think about the patient population. If a deep learning model were trained on data from this patient population, what do you think its limitations would be?
Question 2: On the basis of the information so far, do you think data leakage could be a problem? Why?
You continue reading the paper and read the following paragraph about annotation.
Image labels were created for 6 mutually exclusive classes: B-Lines, Pleural Effusion, Pneumothorax, Pericardial Effusion, Low Ventricular Function, and No Finding. A total of 7920 ultrasound videos were assessed by one of 7 raters. The prevalence of each label in the dataset is shown in table 2.
Table 2: Label Prevalence

| Label | Total Count | Male | Female |
|---|---|---|---|
| B-Lines | 2397 | 1152 | 1245 |
| Pleural Effusion | 1214 | 589 | 625 |
| Pneumothorax | 139 | 67 | 72 |
| Pericardial Effusion | 880 | 442 | 438 |
| Low Ventricular Function | 939 | 464 | 475 |
| No Finding | 2351 | 1181 | 1170 |
Question 3: Go through the above paragraph. Do you agree with the way in which they describe the raters and the annotation process? What would you do differently?
Question 4: If I were to tell you that all raters were first-year bachelor students in medicine with no ultrasound experience, and that the inter-rater agreement was high, how would you look at the viability of the project?
Question 5: If I were to tell you that all raters were emergency department specialists, cardiologists, and pulmonologists with 10+ years of experience, and that they had low inter-rater agreement, how would you look at the viability of this project?
Question 6: Have a look at table 2. Can you spot a problem?
Question 7: Given all the information above, try to design your own deep learning project. You don't have to worry about the specifics of the algorithm. Make sure not only to design your project well, but also to indicate where information is missing to make a good design choice. Here are some things to think about:
- Does the population need more analysis?
- Could we improve the labels?
- How do I split up the data? Be specific.
- Tuning the algorithm.
5.5 Conclusion
This tutorial has, for a short while, shifted our focus from the intricacies of deep learning architectures to the foundational elements of managing a deep learning project, particularly within the context of medical image analysis. We've emphasized that building a successful model goes far beyond just writing code and training networks. Understanding your data – its origin, population characteristics, technical nuances, and annotation complexities – is paramount. We explored crucial aspects like identifying potential data leakage, the importance of proper data splitting using techniques like Train/Validation/Test sets, and the benefits of cross-validation methods for robust evaluation (and tuning), especially in data-scarce medical domains. Finally, we touched upon the essential process of hyperparameter tuning, introducing various strategies. By mastering these project management fundamentals, you lay a solid groundwork for developing reliable, generalizable, and clinically relevant deep learning models. Moreover, if you have to evaluate an algorithm that is given to you, some of the most important questions that you will have to ask are based on the information in this tutorial.
Next tutorial, we will re-introduce the problem of classification. Then we will dive into different architectures and techniques developed on image classification models that are also used to tune other types of algorithms. Finally, we will introduce the concept of uncertainty estimation using neural networks.