Image Segmentation of Microscope Slides

The HuBMAP challenge on Kaggle (Howard et al., 2023) aimed to segment regions of microvasculature (blood vessels) from other tissue in microscope slides of a healthy human kidney. Each pixel from a stained slide of the kidneys should be labelled as part of a vein or not. This page explores how to apply a convolutional neural network to the HuBMAP challenge. Otsu’s segmentation is used as an example of a simpler, but ineffective, approach, which justifies the machine learning approach.

Image from a PAS-stained slide of the kidneys

Slide and Label Data

Whole Slide Images (WSIs) from Periodic acid-Schiff (PAS) stained tissue slides were obtained from five healthy adults. These slides were split into 7034 tiles, each a 512 \times 512 pixel, 8-bit RGB TIFF file. A CSV file records the source WSI and the location of each tile within it. Of the 7034 tiles, 1633 were labelled by expert pathologists. Three structures were labelled: blood_vessel, glomerulus, and unsure. The labels are given per tile as a list of polygons for each structure. The model should predict a blood_vessel label for an unlabelled tile.
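
The metadata can be inspected with a few lines of Python. This is a minimal sketch; the file names tile_meta.csv and polygons.jsonl are assumptions about the competition's layout and may differ.

```python
import json

import pandas as pd

# File names are assumed; adjust to the actual competition layout.
tiles = pd.read_csv("tile_meta.csv")   # source WSI and tile location per tile
with open("polygons.jsonl") as f:      # one JSON object of label polygons per labelled tile
    labels = [json.loads(line) for line in f]

print(f"{len(tiles)} tiles, {len(labels)} labelled")
```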

Background

Semantic segmentation is the “task of clustering parts of images together which belong to the same object class” (Thoma, 2016).

There exist several classical segmentation techniques, including thresholding, Otsu’s method, active contours and mean shift. In recent years, deep convolutional neural networks (CNNs) have performed well in image segmentation (Ronneberger et al., 2015), including segmentation of microscope slides (Persson, 2022).

There exist many CNNs designed for image segmentation, such as U-Net (Ronneberger et al., 2015), DeepLab (Chen et al., 2017), and ERFNet (Romera et al., 2018). In some cases, an already trained network may be adapted by retraining through transfer learning, which has been used successfully in biological image analysis (Zhang et al., 2020).

Implementation

All source code can be found here; metrics were recorded with WandB.

Processing

In order to apply and evaluate our algorithms, we need to convert the label data from polygons into mask files matching the training images. The polygons of each blood_vessel label are converted to a mask image, as sketched below. Any tiles which have no labels at all, including glomerulus and unsure, are ignored in training. If a tile has a label, but no blood_vessel label, a blank mask is generated.
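
A minimal sketch of this conversion, assuming each annotation is a dict with a type and a list of polygon coordinates, as in the competition's label format:

```python
import cv2
import numpy as np

def polygons_to_mask(annotations, size=512):
    """Rasterize the blood_vessel polygons of one tile into a binary mask."""
    mask = np.zeros((size, size), dtype=np.uint8)
    for annotation in annotations:
        if annotation["type"] != "blood_vessel":
            continue  # glomerulus and unsure polygons do not contribute
        for polygon in annotation["coordinates"]:
            points = np.asarray(polygon, dtype=np.int32)  # (N, 2) vertex array
            cv2.fillPoly(mask, [points], 1)
    return mask  # blank if the tile has labels but no blood_vessel polygons
```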

Fig 1. Mask generated for a training example.

Metrics

We need useful metrics to evaluate our algorithm’s success. Müller et al. (2022) suggested a guideline for evaluation metrics in medical image segmentation tasks. These metrics are formally defined below; feel free to skip this section if you are already familiar with DSC, IoU, sensitivity and specificity.

Let X, Y \in \{0,1\}^{w \times h} be matrices representing the predicted and ground-truth labels respectively. A 1 in the matrix means the corresponding pixel is part of a vein. The true-positive, false-positive, false-negative and true-negative counts are given by

TP = \sum_{i,j} (X \circ Y)_{ij}, \qquad FP = \sum_{i,j} (X \circ (1 - Y))_{ij},

FN = \sum_{i,j} ((1 - X) \circ Y)_{ij}, \qquad TN = \sum_{i,j} ((1 - X) \circ (1 - Y))_{ij},

where \circ is the element-wise product and 1 is the all-ones matrix.

Müller et al. (2022) suggest the dice similarity coefficient (DSC) and intersection over union (IoU, also known as the Jaccard index) because they are unbiased metrics (pixel accuracy is an example of a biased metric). A biased metric can make a model that classifies unbalanced classes look more accurate than it would if the classes were balanced.

Additionally, sensitivity and specificity are used to demonstrate functionality but not performance.
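
For reference, the four metrics are defined in terms of these counts as

\mathrm{DSC} = \frac{2 TP}{2 TP + FP + FN}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN},

\mathrm{sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{specificity} = \frac{TN}{TN + FP}.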

Experiments

Otsu’s Thresholding

Before jumping to more complex deep learning techniques, a simpler model should be considered. Otsu’s segmentation is one of the simplest forms of image segmentation. The algorithm finds the optimal binary threshold to separate an image into the two classes with the least variance. The algorithm minimizes the weighted within-class variance

\sigma_w^2(t) = w_0(t) \sigma_0^2(t) + w_1(t) \sigma_1^2(t),

where w_0 and w_1 are the probabilities of a pixel belonging to classes 0 and 1, which are separated by the threshold t, and \sigma_0^2 and \sigma_1^2 are the variances of these respective classes (Otsu, 1979).

Otsu’s segmentation is a reasonable baseline for testing whether machine learning is necessary to solve this problem. Looking at the training data, we can see that a thresholding algorithm won’t work in all cases, since some contextual information is required, and table 1 shows Otsu’s method is not much better than random noise. We need to use machine learning, and table 1 will be a useful baseline for our neural network.
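
As an illustration of the baseline, OpenCV implements Otsu’s method directly; a minimal sketch (the tile path is a placeholder):

```python
import cv2

# Otsu's method operates on a grayscale histogram, so reduce the tile to one channel.
tile = cv2.imread("tile.tif")  # placeholder path
gray = cv2.cvtColor(tile, cv2.COLOR_BGR2GRAY)

# OpenCV searches every threshold t and returns the one minimizing within-class variance.
threshold, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(f"Otsu threshold: {threshold}")
```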

Table 1. Comparison of different algorithms. The metrics are calculated by averaging across the entire training set. The full-segmentation and no-segmentation scores demonstrate the class proportions.

U-Net

Fig 2. The U-Net architecture: a down-sampling encoder extracts information into many feature channels, which are then up-sampled to return to the original resolution. Arrows represent operations, blue boxes are multi-channel feature maps, and white boxes are copied feature maps.

U-Net is a fully convolutional neural network introduced by Ronneberger et al. (2015), where it was successful in segmenting neurons from electron microscope data, and it is now widely used in medical image segmentation. U-Net’s architecture is shown in figure 2; it contains no fully connected layers, so it can accept images of any resolution as input.

The model and training harness are modified from Pytorch-UNet. For the following experiments a standard 90-10 random train-test split is used. U-Net’s logits are passed through a sigmoid to convert them to probabilities.
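
A minimal sketch of the split and the probability conversion, assuming a PyTorch dataset of tiles and masks and a trained model:

```python
import torch
from torch.utils.data import DataLoader, random_split

# `dataset` and `model` are assumed to already exist.
n_val = len(dataset) // 10  # 90-10 random train-test split
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
train_loader = DataLoader(train_set, batch_size=1, shuffle=True)
val_loader = DataLoader(val_set, batch_size=1)

# U-Net outputs logits; a sigmoid maps them to per-pixel vein probabilities.
model.eval()
with torch.no_grad():
    image, mask = next(iter(val_loader))
    probability = torch.sigmoid(model(image))  # values in [0, 1]
    prediction = probability > 0.5             # binary mask
```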

Table 2. Settings and hyper-parameters initially used to train the U-Net; the optimizer and batch size match the original U-Net paper.

Normalization and Image Transformation

The model is initially trained without normalizing the tiles. Figure 3 compares the DSC during training for normalized and raw tiles. The tiles are normalized with Albumentations, using a mean and standard deviation pre-calculated from the entire training set.1
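
In Albumentations the normalization step looks roughly like this; the mean and standard deviation below are placeholders for the pre-calculated dataset statistics:

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Placeholder statistics; the real values are computed per RGB channel
# over the entire training set.
transform = A.Compose([
    A.Normalize(mean=(0.70, 0.54, 0.68), std=(0.15, 0.20, 0.16)),
    ToTensorV2(),  # HWC uint8 numpy array -> CHW float tensor
])

tensor = transform(image=tile)["image"]  # tile: 512x512x3 numpy array
```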

Applying normalization allowed the epoch loss to decrease more quickly. This could be due to overfitting, but the validation shows an \approx 50\% greater peak DSC over no normalization. Applying random morphological and color transforms didn’t seem to have any benefit and slowed training.

Fig 3. Results from training with different image pre-processing techniques. Epoch loss is the average loss for each training epoch; all other metrics are recalculated on the validation split at the end of an epoch. transform applies random morphological and color transforms using Albumentations; morphtransform applies only the morphological transforms.

Batch Size

The U-Net paper (Ronneberger et al., 2015) suggests a batch size of 1 to maximise GPU usage, with a correspondingly high momentum. Kandel & Castelli (2020) investigated different batch sizes for CNN image classifiers and state that low batch sizes can lead the network to “bounce back and forth without achieving acceptable performance.” The DSC in figure 3 shows this characteristic ping-ponging, suggesting the batch size is too low. On the other hand, higher batch sizes can mean training takes too long to converge; the optimum batch size depends on many factors, including the data-set (Kandel & Castelli, 2020), so experimentation is required.

Fig 4. Validation metrics for different batch sizes. The horizontal scale is relative time since training started, since we are concerned with performance over equivalent training time.

Figure 4 shows that as batch size increases, IoU and DSC take longer to improve but show less variance, as predicted by Kandel & Castelli. Higher batch sizes seem to reach a greater peak DSC, but IoU peaks higher for low batch sizes; this may be because more epochs are required to converge (Kandel & Castelli, 2020). A batch size of 16 was chosen as optimal.

Loss Function

The loss function measures how far a predicted label is from the ground truth. Binary cross entropy (PyTorch’s BCEWithLogitsLoss) is a widely used loss function in semantic segmentation because of its stability (Jadon, 2020). Other common semantic segmentation loss functions include focal loss and combo loss (Jadon, 2020). With the help of bigironsphere’s Kaggle notebook, the loss functions which are not built-in can be implemented in PyTorch. See Jadon (2020) for the mathematical definitions of these loss functions; my focal loss implementation sets \alpha_t = 0.95 = 1 - 0.05, matching the inverse class frequency, and \gamma = 2, while my combo loss weighs BCE and dice equally with \alpha = 0.5. Before applying focal and combo loss, U-Net’s logit output is passed through a sigmoid.
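
A simplified sketch of these two losses in PyTorch (my implementation follows bigironsphere’s notebook; here the sigmoid is folded into binary_cross_entropy_with_logits for numerical stability):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.95, gamma=2.0):
    # Per-pixel BCE, down-weighted for easy examples by (1 - p_t)^gamma.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # model's probability for the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def combo_loss(logits, targets, alpha=0.5, eps=1e-6):
    # Weighted sum of BCE and soft dice loss; alpha = 0.5 weighs them equally.
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum()
    dice = (2 * intersection + eps) / (probs.sum() + targets.sum() + eps)
    return alpha * bce + (1 - alpha) * (1 - dice)
```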

Fig 5. Validation metrics for different loss functions. All runs are performed with a batch size of 16 over 40 epochs; training over more epochs could show more useful results.

Figure 5 compares these metrics over equivalent training setups. Combo loss consistently outperformed BCE and focal loss, showing less variability and an \approx 10\% greater peak metric. Focal loss performed very poorly; this may be due to improper tuning of the \alpha and \gamma hyper-parameters, as when \gamma = 1, focal loss should be equivalent to BCE (Jadon, 2020).

Results

After tuning these hyper-parameters, a final model was trained over 160 epochs with a batch size of 16 using the combo loss function. Figure 6 shows the model converging towards a DSC of \approx 0.6. Results and an example of the model’s predictions are shown in table 3.

Fig 6. Validation metrics calculated while training the final model. Loss is shown per batch, rather than per epoch.

Table 3. Metrics over the validation set from the models at the 40th, 80th and 160th epochs, with an example. Refer to table 1 for examples of these metrics on simpler segmentations.

Conclusion

This page examined different techniques and parameters for applying semantic segmentation to the HuBMAP - Hacking the Human Vasculature competition. We found that normalization of input images allowed the loss to decrease more quickly. Applying random morphological or color transformations did not seem effective on this data-set. For the standard U-Net architecture, combo loss outperformed the simpler binary cross entropy and focal losses. Greater batch sizes decreased loss variance at the cost of training time and should be tuned for each data-set.

References

Jadon (2020). A survey of loss functions for semantic segmentation. IEEE. https://doi.org/10.1109/cibcb48159.2020.9277638
Persson (2022). Sample image segmentation of microscope slides. Uppsala University, Division of Visual Information and Interaction.
Kandel & Castelli (2020). The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset. ICT Express, 6(4), 312–315. https://doi.org/10.1016/j.icte.2020.04.010
Chen, Papandreou, Kokkinos, Murphy & Yuille (2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. https://arxiv.org/abs/1606.00915
Romera, Álvarez, Bergasa & Arroyo (2018). ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1), 263–272. https://doi.org/10.1109/TITS.2017.2750080
Howard, HCL-Jevster, Gustilo, Borner, Holbrook & Jain (2023). HuBMAP - hacking the human vasculature. Kaggle. https://kaggle.com/competitions/hubmap-hacking-the-human-vasculature
Müller, Soto-Rey & Kramer (2022). Towards a guideline for evaluation metrics in medical image segmentation. https://arxiv.org/abs/2202.05273
Otsu (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1), 62–66. https://doi.org/10.1109/TSMC.1979.4310076
Thoma (2016). A survey of semantic segmentation. CoRR, abs/1602.06541. http://arxiv.org/abs/1602.06541
Ronneberger, Fischer & Brox (2015). U-Net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597. http://arxiv.org/abs/1505.04597
Zhang, Li, Zeng, Sun, Kumar, Ye & Ji (2020). Deep model based transfer and multi-task learning for biological image analysis. IEEE Transactions on Big Data, 6(2), 322–333. https://doi.org/10.1109/TBDATA.2016.2573280

  1. This technically causes data leakage from the training set to the validation set. ↩︎

Last modified 02/09/2023