The HuBMAP challenge on Kaggle ( Citation: 2023 Howard, A., HCL-Jevster, Gustilo, K., Borner, K., Holbrook, R. & Jain, Y. (2023). HuBMAP - hacking the human vasculature. Kaggle. Retrieved from https://kaggle.com/competitions/hubmap-hacking-the-human-vasculature ) aimed to segment regions of microvasculature (blood vessels) from the surrounding tissue in microscope slides of healthy human kidneys. Each pixel of a stained slide should be labelled as part of a blood vessel or not. This page explores how to apply a convolutional neural network to the HuBMAP challenge. Otsu's segmentation is used as an example of a simpler but ineffective approach, which justifies the machine learning approach.
Image from a PAS-stained slide of a kidney
Slide and Label Data
Whole Slide Images (WSIs) from Periodic acid-Schiff (PAS) stained tissue slides were obtained from five healthy adults. These slides were split into 512×512 pixel, 8-bit RGB TIFF tiles. A CSV file records the source WSI and the location of each tile within it. A subset of the tiles was labelled by expert pathologists. Three different structures were labelled: `blood_vessel`, `glomerulus`, and `unsure`. The labels are given per tile as a list of polygons for each label. The model should predict a `blood_vessel` label for an unlabelled tile.
Background
Semantic segmentation is the task of clustering parts of images together which belong to the same object class ( Citation: Thoma, 2016 Thoma, M. (2016). A survey of semantic segmentation. CoRR, abs/1602.06541. Retrieved from http://arxiv.org/abs/1602.06541 ).
There exist several classical segmentation techniques, including thresholding, Otsu's method, active contours and mean shift. In recent years, deep convolutional neural networks (CNNs) have performed well in image segmentation ( Citation: 2015 Ronneberger, O., Fischer, P. & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597. Retrieved from http://arxiv.org/abs/1505.04597 ) including segmentation of microscope slides ( Citation: 2022 Persson, M. (2022). Sample image segmentation of microscope slides. Uppsala University, Division of Visual Information and Interaction. ).
There exist many CNNs designed for image segmentation, such as U-Net ( Citation: 2015 Ronneberger, O., Fischer, P. & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597. Retrieved from http://arxiv.org/abs/1505.04597 ) , DeepLab ( Citation: 2017 Chen, L., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. (2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. Retrieved from https://arxiv.org/abs/1606.00915 ) , and ErfNet ( Citation: 2018 Romera, E., Álvarez, J., Bergasa, L. & Arroyo, R. (2018). ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1). 263–272. https://doi.org/10.1109/TITS.2017.2750080 ) . In some cases, an already trained network may be used and adapted by retraining through transfer learning. Transfer learning has been used successfully in biological image analysis ( Citation: 2020 Zhang, W., Li, R., Zeng, T., Sun, Q., Kumar, S., Ye, J. & Ji, S. (2020). Deep model based transfer and multi-task learning for biological image analysis. IEEE Transactions on Big Data, 6(2). 322–333. https://doi.org/10.1109/TBDATA.2016.2573280 ) .
Implementation
All source code can be found here; metrics were recorded with WandB.
Processing
In order to apply and evaluate our algorithms, we need to convert the label data from polygons into mask files that match the training images. The polygons of each `blood_vessel` label are converted to a mask image. Any tile with no labels at all (not even `glomerulus` or `unsure`) is ignored in training. If a tile has a label but no `blood_vessel` label, a blank mask is generated.
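As an illustration, here is a minimal sketch of the polygon-to-mask conversion using OpenCV; the label format (a list of `[x, y]` coordinate lists per tile) and the helper's name are assumptions, not the exact structures from the competition files.

```python
import numpy as np
import cv2

def polygons_to_mask(polygons, size=(512, 512)):
    """Rasterize a list of polygons (each a list of [x, y] points) into a binary mask."""
    mask = np.zeros(size, dtype=np.uint8)
    for polygon in polygons:
        points = np.array(polygon, dtype=np.int32)
        cv2.fillPoly(mask, [points], 255)  # fill the polygon's interior
    return mask

# A tile with labels but no blood_vessel polygons yields a blank mask.
blank = polygons_to_mask([])
```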
Fig 1. Mask generated for a training example.
Metrics
We need useful metrics to evaluate our algorithm's success. Müller ( Citation: 2022 Müller, D., Soto-Rey, I. & Kramer, F. (2022). Towards a guideline for evaluation metrics in medical image segmentation. Retrieved from https://arxiv.org/abs/2202.05273 ) suggested a guideline for evaluation metrics for medical image segmentation tasks. These metrics are formally defined below; feel free to skip this section if you are already familiar with DSC, IoU, sensitivity and specificity.
Let $P, G \in \{0, 1\}^{n \times m}$ be matrices representing the predicted and ground-truth labels respectively. A $1$ in the matrix means the corresponding pixel is part of a blood vessel. The true-positive, false-positive, false-negative and true-negative counts are given by

$$TP = \sum P \odot G, \quad FP = \sum P \odot (1 - G), \quad FN = \sum (1 - P) \odot G, \quad TN = \sum (1 - P) \odot (1 - G),$$

where $\odot$ is the element-wise product.
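A direct translation of these counts into numpy might look like the following (a minimal sketch; the function name and array names are illustrative):

```python
import numpy as np

def confusion_counts(pred, truth):
    """Compute TP, FP, FN, TN from binary masks of equal shape."""
    tp = np.sum(pred * truth)
    fp = np.sum(pred * (1 - truth))
    fn = np.sum((1 - pred) * truth)
    tn = np.sum((1 - pred) * (1 - truth))
    return tp, fp, fn, tn
```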
Müller ( Citation: 2022 Müller, D., Soto-Rey, I. & Kramer, F. (2022). Towards a guideline for evaluation metrics in medical image segmentation. Retrieved from https://arxiv.org/abs/2202.05273 ) suggests the dice similarity coefficient (DSC) and intersection over union (IoU or Jaccard index) because they are unbiased metrics (pixel accuracy is an example of a biased metric). A biased metric can make a model classifying unbalanced classes look more accurate than it would appear on balanced classes.
Additionally, sensitivity and specificity are used to demonstrate functionality but not performance.
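In terms of these counts, the standard definitions are:

$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}.$$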
Experiments
Otsu’s Thresholding
Before jumping to more complex deep learning techniques, a simpler model should be considered. Otsu's segmentation is one of the simplest forms of image segmentation. The algorithm finds the optimal binary threshold separating an image into the two classes with the least intra-class variance. The algorithm minimizes

$$\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t),$$

where $\omega_0(t)$ and $\omega_1(t)$ are the probabilities of a pixel belonging to the classes $C_0$ and $C_1$, which are separated by the threshold $t$, and $\sigma_0^2(t)$ and $\sigma_1^2(t)$ are the variances of these respective classes ( Citation: 1979 Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1). 62–66. https://doi.org/10.1109/TSMC.1979.4310076 ).
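As a sketch of how this baseline might be run on a tile (using scikit-image; the file name and the choice of foreground class are assumptions):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu
from skimage.io import imread

tile = imread("tile.tif")            # hypothetical 512x512 RGB tile
gray = rgb2gray(tile)                # Otsu operates on a single channel
t = threshold_otsu(gray)             # threshold minimizing intra-class variance
mask = (gray < t).astype(np.uint8)   # treat darker (stained) regions as foreground
```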
Otsu's segmentation is a reasonable baseline for testing whether machine learning is necessary to solve this problem. Looking at the training data, a thresholding algorithm won't work in all cases because some contextual information is required, and table 1 shows Otsu's method is not much better than random noise. We need to use machine learning, and table 1 will be a useful baseline for our neural network.
Table 1. Comparison of different algorithms. The metrics are calculated by averaging across the entire training set. The full-segmentation and no-segmentation scores demonstrate the class proportions.
U-Net
Fig 2. The U-Net architecture: a down-sampling encoder extracts information into many feature channels, which are then up-sampled to return to the original resolution. Arrows represent operations, blue boxes are multi-channel feature maps and white boxes are copied feature maps.
U-Net is a fully convolutional neural network introduced by Ronneberger et al. ( Citation: 2015 Ronneberger, O., Fischer, P. & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597. Retrieved from http://arxiv.org/abs/1505.04597 ), where it was used successfully to segment neurons in electron microscope data. U-Net is widely used in medical image segmentation. Its architecture is shown in figure 2; because it contains no fully connected layers, it can accept images of any resolution as input.
The model and training harness are modified from Pytorch-UNet. For the following experiments a standard 90-10 random train-test split is used. U-Net's logits are passed through a sigmoid to convert them to probabilities.
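Inference then reduces to thresholding the sigmoid output. A minimal sketch, where `model` and `images` stand in for the trained U-Net and a batch of tiles, and the 0.5 cut-off is an assumption:

```python
import torch

with torch.no_grad():
    logits = model(images)         # raw U-Net output, shape (N, 1, H, W)
    probs = torch.sigmoid(logits)  # per-pixel probabilities in [0, 1]
    masks = (probs > 0.5).float()  # binarize into a predicted segmentation
```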
Table 2. Settings and hyper-parameters initially used to train the U-Net; the optimizer and batch size match the original U-Net paper.
Normalization and Image Transformation
The model is initially trained without normalizing the tiles. Figure 3 compares the DSC during training for normalized and raw tiles. The tiles are normalized with Albumentations using a mean and standard deviation pre-calculated from the entire training set.[1]
Applying normalization allowed the epoch loss to decrease more quickly. This could be due to overfitting, but the validation shows a greater peak DSC than without normalization. Applying random morphological and color transforms didn't seem to have any benefit and slowed training.
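A minimal sketch of such a normalization pipeline with Albumentations (the mean and standard deviation values here are placeholders, not the statistics computed from this data-set):

```python
import albumentations as A

# Placeholder statistics; in practice these are pre-calculated over the training set.
MEAN = (0.7, 0.6, 0.7)
STD = (0.15, 0.20, 0.15)

normalize = A.Compose([
    A.Normalize(mean=MEAN, std=STD),  # (pixel / 255 - mean) / std, per channel
])

normalized_tile = normalize(image=tile)["image"]  # tile: HxWx3 uint8 array
```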
Fig 3. Result from training with different image pre-processing techniques. Epoch loss is the average loss for each training epoch; all other metrics are recalculated on the validation split at the end of an epoch. `transform` applies random morphological and color transforms using Albumentations; `morphtransform` only applies the morphological transforms.
Batch Size
The U-Net paper ( Citation: 2015 Ronneberger, O., Fischer, P. & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597. Retrieved from http://arxiv.org/abs/1505.04597 ) suggests a batch size of 1 to maximise GPU usage, with a correspondingly high momentum. Kandel et al. ( Citation: 2020 Kandel, I. & Castelli, M. (2020). The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset. ICT Express, 6(4). 312–315. https://doi.org/10.1016/j.icte.2020.04.010 ) investigated different batch sizes when applied to CNN image classifiers and state that low batch sizes can lead the network to “bounce back and forth without achieving acceptable performance.” The DSC in figure 3 shows this characteristic ping-ponging, suggesting the batch size is too low. On the other hand, higher batch sizes can mean training takes too long to converge. The optimum batch size depends on many factors including the data-set ( Citation: 2020 Kandel, I. & Castelli, M. (2020). The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset. ICT Express, 6(4). 312–315. https://doi.org/10.1016/j.icte.2020.04.010 ), so experimentation is required.
Fig 4. Validation metrics for different batch sizes. The horizontal scale is relative time since training started, since we are concerned with performance over equivalent training time.
Figure 4 shows that as batch size increases, IoU and DSC take longer to improve but show less variance, as predicted by Kandel et al. Higher batch sizes seem to reach a greater peak DSC, but IoU peaks higher for low batch sizes; this may be because more epochs are required to converge ( Citation: 2020 Kandel, I. & Castelli, M. (2020). The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset. ICT Express, 6(4). 312–315. https://doi.org/10.1016/j.icte.2020.04.010 ). The batch size balancing peak DSC against training time was chosen as optimal.
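Sweeping the batch size is a one-line change in the PyTorch data loader. A sketch, where `dataset`, the candidate sizes and the `train` loop are assumptions:

```python
from torch.utils.data import DataLoader

# Larger batches smooth the gradient estimate but take longer
# per unit of training time to reach a given DSC.
for batch_size in (1, 4, 16, 64):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    train(model, loader)  # hypothetical training loop
```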
Loss Function
The loss function measures how far a predicted label is from the ground truth. Binary Cross Entropy (`BCEWithLogitsLoss`) is a widely used loss function in semantic segmentation ( Citation: 2020 Jadon, S. (2020). A survey of loss functions for semantic segmentation. IEEE. https://doi.org/10.1109/cibcb48159.2020.9277638 ) because of its stability. Other common semantic segmentation loss functions include focal loss and combo loss ( Citation: 2020 Jadon, S. (2020). A survey of loss functions for semantic segmentation. IEEE. https://doi.org/10.1109/cibcb48159.2020.9277638 ). With the help of bigironsphere's Kaggle notebook, the loss functions which are not built-in can be implemented in PyTorch. See Jadon ( Citation: 2020 Jadon, S. (2020). A survey of loss functions for semantic segmentation. IEEE. https://doi.org/10.1109/cibcb48159.2020.9277638 ) for the mathematical definitions of these loss functions; my focal implementation sets $\alpha$ to the inverse class frequency and uses a fixed $\gamma$, and my combo loss weighs BCE and dice equally with $\alpha = 0.5$. Before applying focal and combo loss, U-Net's logit output is passed through a sigmoid.
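A minimal sketch of an equally weighted combo loss follows; this illustrates the idea rather than reproducing bigironsphere's exact implementation, and the smoothing constant is an assumption:

```python
import torch
import torch.nn as nn

class ComboLoss(nn.Module):
    """Weighted sum of BCE and soft-dice loss; alpha = 0.5 weighs them equally."""
    def __init__(self, alpha=0.5, smooth=1.0):
        super().__init__()
        self.alpha = alpha
        self.smooth = smooth
        self.bce = nn.BCELoss()

    def forward(self, probs, target):
        # probs: sigmoid of U-Net's logits; target: binary mask, same shape.
        bce = self.bce(probs, target)
        intersection = (probs * target).sum()
        dice = (2 * intersection + self.smooth) / (probs.sum() + target.sum() + self.smooth)
        return self.alpha * bce + (1 - self.alpha) * (1 - dice)
```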
Fig 5. Validation metrics for different loss functions. All runs are performed with the same batch size and number of epochs; training over more epochs could show more useful results.
Figure 5 compares these metrics over equivalent training setups. Combo loss consistently outperformed BCE and focal loss, showing less variability and a greater peak metric. Focal loss performed very poorly; this may be due to improper tuning of the $\alpha$ and $\gamma$ hyper-parameters, as when $\gamma = 0$, focal loss should be equivalent to BCE ( Citation: 2020 Jadon, S. (2020). A survey of loss functions for semantic segmentation. IEEE. https://doi.org/10.1109/cibcb48159.2020.9277638 ).
Results
After tuning these hyper-parameters, a final model was trained over 160 epochs with the chosen batch size and the combo loss function. Figure 6 shows the model's DSC converging during training. Results and an example of the model's predictions are shown in table 3.
Fig 6. Validation metrics calculated while training the final model. Loss is shown per batch, rather than per epoch.
Table 3. Metrics over the validation set for the models at the 40th, 80th and 160th epochs, with an example prediction. Refer to table 1 for examples of these metrics on simpler segmentations.
Conclusion
This page examined different techniques and parameters for applying semantic segmentation to the HuBMAP - Hacking the Human Vasculature competition. We found that normalization of input images allowed the loss to decrease more quickly, while applying random morphological or color transformations did not seem effective on this data-set. For the standard U-Net architecture, combo loss outperformed the simpler binary cross entropy loss and focal loss. Greater batch sizes decreased loss variance at the cost of training time and should be tuned for each data-set.
References
- Jadon (2020)
- Jadon, S. (2020). A survey of loss functions for semantic segmentation. IEEE. https://doi.org/10.1109/cibcb48159.2020.9277638
- Persson (2022)
- Persson, M. (2022). Sample image segmentation of microscope slides. Uppsala University, Division of Visual Information and Interaction.
- Kandel & Castelli (2020)
- Kandel, I. & Castelli, M. (2020). The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset. ICT Express, 6(4). 312–315. https://doi.org/10.1016/j.icte.2020.04.010
- Chen, Papandreou, Kokkinos, Murphy & Yuille (2017)
- Chen, L., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. (2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. Retrieved from https://arxiv.org/abs/1606.00915
- Romera, Álvarez, Bergasa & Arroyo (2018)
- Romera, E., Álvarez, J., Bergasa, L. & Arroyo, R. (2018). ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1). 263–272. https://doi.org/10.1109/TITS.2017.2750080
- Howard, HCL-Jevster, Gustilo, Borner, Holbrook & Jain (2023)
- Howard, A., HCL-Jevster, Gustilo, K., Borner, K., Holbrook, R. & Jain, Y. (2023). HuBMAP - hacking the human vasculature. Kaggle. Retrieved from https://kaggle.com/competitions/hubmap-hacking-the-human-vasculature
- Müller, Soto-Rey & Kramer (2022)
- Müller, D., Soto-Rey, I. & Kramer, F. (2022). Towards a guideline for evaluation metrics in medical image segmentation. Retrieved from https://arxiv.org/abs/2202.05273
- Otsu (1979)
- Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1). 62–66. https://doi.org/10.1109/TSMC.1979.4310076
- Thoma (2016)
- Thoma, M. (2016). A survey of semantic segmentation. CoRR, abs/1602.06541. Retrieved from http://arxiv.org/abs/1602.06541
- Ronneberger, Fischer & Brox (2015)
- Ronneberger, O., Fischer, P. & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597. Retrieved from http://arxiv.org/abs/1505.04597
- Zhang, Li, Zeng, Sun, Kumar, Ye & Ji (2020)
- Zhang, W., Li, R., Zeng, T., Sun, Q., Kumar, S., Ye, J. & Ji, S. (2020). Deep model based transfer and multi-task learning for biological image analysis. IEEE Transactions on Big Data, 6(2). 322–333. https://doi.org/10.1109/TBDATA.2016.2573280
[1] This technically causes data leakage from the training set to the validation set. ↩︎