2. RELATED WORK
This pipeline was based on MIScnn [37], which is an in-house developed open-source framework to set up complete medical image segmentation pipelines with convolutional neural networks and deep learning models on top of Tensorflow/Keras [38]. MIScnn supports extensive preprocessing, data augmentation, state-of-the-art deep learning models and diverse evaluation techniques.

3.1 Dataset of COVID-19 Chest CTs

In this study, we used the public dataset from Ma et al., which consists of 20 annotated COVID-19 chest CT volumes [16,36]. At the time of this paper, this dataset is the only publicly available set of 3D volumes with annotated COVID-19 infection segmentation [16]. The CT scans were collected from the Coronacases Initiative and Radiopaedia and were licensed under CC BY-NC-SA. Each CT volume was first labeled by junior annotators, then refined by two radiologists with 5 years of experience, and the annotations were afterwards verified by senior radiologists with more than 10 years of experience [16]. Although the sample size is rather small, this annotation process led to an excellent high-quality dataset. The volumes have a resolution of 512x512 (Coronacases Initiative) or 630x630 (Radiopaedia) with about 176 slices by mean (200 by median). The CT images were labeled into four classes: background, lung left, lung right and COVID-19 infection.

In our pipeline, we performed a 5-fold cross-validation on the dataset. This resulted in five fitting and inference runs with each time 16 samples as training dataset and 4 samples as testing dataset.

3.2 Preprocessing

In order to simplify the pattern finding and fitting process for the model, we applied several preprocessing methods to the dataset.

We exploited the Hounsfield unit (HU) scale by clipping the pixel intensity values of the images to -1,250 as minimum and +250 as maximum, because we were interested in infected regions (+50 to +100 HU) and lung regions (-1,000 to -700 HU). It was only possible to apply the clipping approach to the Coronacases Initiative CTs, because the Radiopaedia volumes were already normalized to a grayscale range between 0 and 255.

Varying signal intensity ranges of images can drastically influence the fitting process and the resulting performance of segmentation models [39]. To achieve signal intensity range consistency, it is recommended to scale and standardize imaging data. Therefore, we likewise normalized the Coronacases Initiative CT volumes to the grayscale range. Afterwards, all samples were standardized via z-score.

Medical imaging volumes commonly have inhomogeneous voxel spacings. The interpretation of diverse voxel spacings is a challenging task for deep neural networks. Therefore, it is possible to drastically reduce complexity by resampling the volumes in an imaging dataset to a homogeneous voxel spacing, also called the target spacing. Resampling voxel spacings also directly resizes the volume shape and determines the contextual information which the neural network model is able to capture. As a result, the target spacing has a huge impact on the final model performance. We decided to resample all CT volumes to a target spacing of 1.58x1.58x2.70, resulting in a median volume shape of 267x254x104.
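The intensity preprocessing and the spacing arithmetic described above can be sketched in NumPy as follows; the function names, the explicit min-max step to [0, 255] and the shape-rounding rule are illustrative assumptions, not the exact MIScnn implementation.

```python
import numpy as np

def preprocess_ct(volume_hu, hu_min=-1250.0, hu_max=250.0):
    """Clip a CT volume (in Hounsfield units) to the lung/infection window,
    normalize to the grayscale range [0, 255], then standardize via z-score."""
    clipped = np.clip(volume_hu, hu_min, hu_max)
    grayscale = (clipped - hu_min) / (hu_max - hu_min) * 255.0
    return (grayscale - grayscale.mean()) / grayscale.std()

def resampled_shape(shape, spacing, target_spacing=(1.58, 1.58, 2.70)):
    """Voxel-grid shape after resampling to the target spacing: the physical
    extent per axis (shape * spacing) divided by the new spacing."""
    return tuple(int(round(n * old / new))
                 for n, old, new in zip(shape, spacing, target_spacing))
```

Under this rule, for example, a hypothetical 512x512x100 volume with a 0.79x0.79x2.70 mm spacing would be resampled onto a 256x256x100 grid.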
Preprint: Automated Chest CT Image Segmentation of COVID-19 Lung Infection based on 3D U-Net 4
Figure 3: The architecture of the standard 3D U-Net. The network takes a 3D patch (cuboid) and outputs the segmentation of lungs and
infected regions by COVID-19. Skip connections were implemented with concatenation layers. Conv: Convolutional layer; ReLU:
Rectified linear unit layer; BN: Batch normalization.
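A standard 3D U-Net of the kind shown in figure 3 can be sketched in Keras as follows; the depth, filter counts and input patch size below are illustrative placeholders, not the exact configuration used in this pipeline.

```python
# Sketch of a standard 3D U-Net: Conv-BN-ReLU blocks, max-pooling encoder,
# transposed-convolution decoder, and concatenation skip connections.
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two Conv3D -> BatchNorm -> ReLU stages, as in figure 3
    for _ in range(2):
        x = layers.Conv3D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def build_unet3d(input_shape=(160, 160, 80, 1), n_classes=4, base_filters=8):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    # Encoder: downsample while doubling the filter count
    for depth in range(3):
        x = conv_block(x, base_filters * 2**depth)
        skips.append(x)
        x = layers.MaxPooling3D(2)(x)
    x = conv_block(x, base_filters * 2**3)  # bottleneck
    # Decoder: upsample and concatenate the matching skip connection
    for depth in reversed(range(3)):
        x = layers.Conv3DTranspose(base_filters * 2**depth, 2,
                                   strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[depth]])
        x = conv_block(x, base_filters * 2**depth)
    # Softmax over the four classes (background, lung left/right, infection)
    outputs = layers.Conv3D(n_classes, 1, activation="softmax")(x)
    return Model(inputs, outputs)
```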
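The loss formulation introduced in the following paragraphs (equations 1-3) can be sketched in NumPy; the epsilon smoothing term and the voxel-wise averaging of the cross-entropy are illustrative assumptions.

```python
import numpy as np

def tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-7):
    """Multi-class Tversky loss (eq. 2): N minus the summed per-class
    Tversky indices, computed from soft TP/FN/FP counts."""
    n_classes = y_true.shape[-1]
    axes = tuple(range(y_true.ndim - 1))  # sum over all observations/voxels
    tp = np.sum(y_true * y_pred, axis=axes)
    fn = np.sum(y_true * (1.0 - y_pred), axis=axes)
    fp = np.sum((1.0 - y_true) * y_pred, axis=axes)
    return n_classes - np.sum(tp / (tp + alpha * fn + beta * fp + eps))

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy (eq. 3): sum over classes of the binary
    cross-entropy, averaged over observations."""
    return float(np.mean(-np.sum(y_true * np.log(y_pred + eps), axis=-1)))

def total_loss(y_true, y_pred):
    # eq. (1): sum of the Tversky and cross-entropy terms
    return tversky_loss(y_true, y_pred) + categorical_cross_entropy(y_true, y_pred)
```

A perfect one-hot prediction drives both terms towards zero, while less confident predictions increase the combined loss.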
[43] and the categorical cross-entropy as loss function for model fitting (1).

L_total = L_Tversky + L_CCE                                    (1)

L_Tversky = N - Σ_{c=1}^{N} TP_c / (TP_c + α·FN_c + β·FP_c)    (2)

L_CCE = - Σ_{c=1}^{N} y_{o,c} · log(p_{o,c})                   (3)

We implemented a multi-class adaptation of the Tversky index (2), which is an asymmetric similarity index to measure the overlap of the segmented region with the ground truth. It allows for flexibility in balancing the false positive (FP) and false negative (FN) rates. The cross-entropy (3) is a commonly used loss function in machine learning and calculates the total entropy between the predicted and true distributions. The multi-class adaptation for multiple categories (categorical cross-entropy) is represented through the sum of the binary cross-entropy for each class c, where y_{o,c} is the binary indicator of whether class label c is the correct classification for observation o. The variable p_{o,c} is the predicted probability that observation o is of class c.

For model fitting, Adam optimization was used with the initial weight decay of 1e-3 [44]. We utilized a dynamic learning rate which was reduced by a factor of 0.1 in case the training loss did not decrease for 15 epochs. The minimal learning rate was set to 1e-5. In order to further reduce the risk of overfitting, we exploited the early stopping technique, in which the training process stopped after 100 epochs without a fitting loss decrease. The neural network model was trained for a maximum of 1,000 epochs. Instead of the common definition of an epoch as a single iteration over the dataset, we defined an epoch as an iteration over 150 training batches. This allowed for an improved fitting process on randomly generated batches, in which the dataset acts as a variant database. According to our available GPU VRAM, we selected a batch size of 2.

3.6 Evaluation Metrics

During the fitting process, we computed the segmentation performance for each epoch on randomly cropped and data-augmented patches from the validation dataset. This allowed for an evaluation of overfitting on the training data.

After the training, we used three evaluation metrics that are widely popular in the medical image analysis community to measure the inference performance, i.e. the segmentation overlap between prediction and ground truth. The Dice similarity coefficient, defined in (4), is the most widespread metric in computer vision, whereas the sensitivity (5) and specificity (6) are among the most popular metrics in medical fields. All metrics are based on the confusion matrix, where TP, FP, TN and FN represent the true positive, false positive, true negative and false negative rates, respectively.

DSC = 2·TP / (2·TP + FP + FN)                                  (4)

Sensitivity = TP / (TP + FN)                                   (5)

Specificity = TN / (TN + FP)                                   (6)

We calculated the evaluation metrics for each fold in the cross-validation, and thus for all samples in our dataset. The two lung classes (lung left and lung right) were averaged by mean into a single class (lungs) during the evaluation.

3.7 Code Reproducibility

In order to ensure full reproducibility and to create a base for further research, the complete code of this project, including extensive documentation, is available in the following Git repository:
https://github.com/frankkramer-lab/covid19.MIScnn

4. RESULTS & DISCUSSION

The sequential training of the complete cross-validation on a single GPU took around 130 hours. None of the folds required the entire 1,000 epochs for training; instead, they were early-stopped after an average of 312 epochs.

Through validation monitoring during the training, no overfitting was observed. The training and validation loss curves revealed no significant distinction from each other, as can be seen in figure 4. During the fitting, the performance settled at a loss of around 0.383, which corresponds to a generalized DSC (average of all class-wise DSCs) of around 0.919. Because of this robust training process without any signs of overfitting, we concluded that fitting on randomly generated patches via extensive data augmentation and random cropping from a variant database is highly efficient for limited imaging data.

After the training, the inference revealed a strong segmentation performance for lungs and COVID-19 infected regions, which is illustrated in figure 6. Overall, the cross-validation models achieved a DSC of around 0.956 for lung and 0.761 for COVID-19 infection segmentation. Furthermore, the models achieved a sensitivity and specificity of 0.956 and 0.998 for lungs, as well as 0.730 and 0.999 for infection, respectively. More details on the inference performance are listed in table 1. From a medical perspective, detection of COVID-19 infection is a challenging task, which is one of the reasons for the weaker segmentation accuracy in contrast to the lung segmentation; another reason is the variety of GGO and pulmonary consolidation morphology. Nevertheless, our medical image
Figure 4: Loss course during the training process for training (red) and validation (cyan) data. The lines were computed via Gaussian Process Regression and represent the average loss across all folds. The gray areas around the lines represent the confidence intervals.
Figure 5: Box plots showing the result distribution from the 5-fold cross-
validation on Lung and COVID-19 infection segmentation.
segmentation pipeline allowed fitting a model which is able to segment COVID-19 infection with state-of-the-art accuracy that is comparable to models trained on large datasets.

For further evaluation, we compared our pipeline to other available COVID-19 segmentation approaches based on CT scans. The authors (Ma et al.) who also provided the dataset used in our analysis implemented a 3D U-Net approach as a baseline for benchmarking [16]. They were able to achieve a DSC of 0.70355 and 0.6078 for lungs and COVID-19 infection, respectively. With our model, we were able to outperform this baseline. It is important to mention that we trained with a cross-validation distribution of 80% training and 20% testing, whereas they used the inverted distribution (20% training and 80% testing). Another approach from Yan et al. developed a novel neural network architecture (COVID-SegNet) specifically designed for COVID-19 infection segmentation with limited data [29]. The authors tested their architecture on a limited dataset consisting of ten COVID-19 cases from Brainlab Co. Ltd (Germany) and were able to achieve a DSC of 0.987 and 0.726 for lungs and infection, respectively. Hence, COVID-SegNet and our approach achieved similar results. This raises the question of whether it is possible to further increase our performance by switching from the standard U-Net of our pipeline to an architecture specifically designed for COVID-19 infection segmentation like COVID-SegNet. Further approaches with the aim of utilizing specifically designed architectures were Inf-Net (Fan et al.) and MiniSeg (Qiu et al.) [34,35]. Both were trained on 2D CT scans and achieved DSCs of 0.764 and 0.773 for COVID-19 infection segmentation, respectively. Although diverse datasets were used for training, which makes the results incomparable, it is highly impressive that they achieved performance similar to approaches based on 3D imaging data. The 3D transformation of these architectures and their integration into our pipeline would be an interesting experiment to evaluate improvement possibilities.

Table 1: Achieved results showing the median Dice similarity coefficient (DSC), the sensitivity (Sens) and specificity (Spec) on lung and COVID-19 infection segmentation for each CV fold and the global average (AVG).

        Lungs                     COVID-19
Fold    DSC     Sens.   Spec.     DSC     Sens.   Spec.
1       0.907   0.913   0.995     0.556   0.447   0.999
2       0.977   0.979   0.998     0.801   0.875   0.999
3       0.952   0.945   0.999     0.829   0.796   0.999
4       0.979   0.975   0.999     0.853   0.836   0.999
5       0.967   0.967   0.999     0.765   0.697   0.999
AVG     0.956   0.956   0.998     0.761   0.730   0.999

However, it is important to note that the majority of current segmentation approaches in research are not suited for clinical usage. The bias of current models is that they are trained only on COVID-19 related images. Therefore, it is not certain how well the models can differentiate between COVID-19 lesions and other pneumonia, or entirely unrelated medical conditions like cancer. Furthermore, identical to COVID-19 classification, the models reveal huge differences depending on which dataset they were trained on. Segmentation models purely based on COVID-19 scans are often not able to segment accurately in the presence of other medical conditions [16]. Additionally, there is a high potential for false positive segmentation of pneumonia lesions that are not caused by COVID-19. This demonstrates that these models could be biased and are not suitable for COVID-19 screening. Nevertheless, current infection segmentation models are
Figure 6: Visual comparison of the segmentation between ground truth and our model on four slices from different CT scans.
already highly accurate for confirmed COVID-19 imaging. This offers the opportunity for quantitative assessment and disease monitoring as applications in clinical studies.

Although our model and those of others, which are based on limited data, are capable of accurate segmentation, it is essential to discuss their robustness. Currently, there are no large annotated imaging datasets available for COVID-19 segmentation [16]. Existing small datasets may have incomplete and inaccurate labels, which results in challenging handicaps for models. More imaging data with more variance (different COVID-19 states, other pneumonia, etc.) needs to be collected, annotated and published for researchers. Similar to Ma et al., community-accepted benchmark datasets have to be established in order to fully ensure robustness as well as comparability of models [16,36].

limited dataset sizes which act as a variant database. Instead of novel and complex neural network architectures, we utilized the standard 3D U-Net. We demonstrated that our medical image segmentation pipeline is able to successfully train accurate and robust models without overfitting on limited data. Furthermore, we were able to outperform current state-of-the-art semantic segmentation approaches for lungs and COVID-19 infected regions. Our work has great potential to be applied as a clinical decision support system for COVID-19 quantitative assessment and disease monitoring in a clinical environment. Nevertheless, further research is needed on COVID-19 semantic segmentation in clinical studies for evaluating clinical performance and robustness.

ACKNOWLEDGMENTS
[26] Hou H, Lv W, Tao Q, Hospital T, Company JT, Ai T, et al. Artificial Intelligence Distinguishes COVID-19 from Community Acquired Pneumonia on Chest CT. 2020.

[27] Shi F, Xia L, Shan F, Wu D, Wei Y, Yuan H, et al. Large-Scale Screening of COVID-19 from Community Acquired Pneumonia using Infection Size-Aware Classification. 2020.

[28] Gozes O, Frid-Adar M, Greenspan H, Browning PD, Bernheim A, Siegel E. Rapid AI Development Cycle for the Coronavirus (COVID-19) Pandemic: Initial Results for Automated Detection & Patient Monitoring using Deep Learning CT Image Analysis. 2020.

Artif Intell Lect Notes Bioinformatics) 2016;9901 LNCS:424–32. doi:10.1007/978-3-319-46723-8_49.

[42] Zhang Z, Liu Q, Wang Y. Road Extraction by Deep Residual U-Net. IEEE Geosci Remote Sens Lett 2018. doi:10.1109/LGRS.2018.2802944.

[43] Salehi SSM, Erdogmus D, Gholipour A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. 2017.

[44] Kingma DP, Lei Ba J. Adam: A Method for Stochastic Optimization. 2014.