
Deep Neural Networks for Improving Computer-Aided

Diagnosis, Segmentation and Text/Image Parsing in Radiology

Le Lu, Ph.D.

Joint work with Holger R. Roth, Hoo-chang Shin, Ari Seff, Xiaosong Wang,
Mingchen Gao, Isabella Nogues, Ronald M. Summers
Radiology and Imaging Sciences, National Institutes of Health Clinical Center

le.lu@nih.gov
Application Focus: Cancer Imaging

Cancer Type        Estimated New Cases   Estimated Deaths
Lung (Bronchus)    224,390               158,080
Colorectal         134,490               49,190
Pancreatic         53,070                41,780
Breast (F / M)     246,660 / 2,600       40,450 / 440
Prostate           180,890               26,120

American Cancer Society: Cancer Facts and Figures 2016. Atlanta, GA: American Cancer
Society, 2016. Last accessed February 1, 2016.
http://www.cancer.gov/types/common-cancers
Overview: Three Key Problems (I)

Computer-aided Detection (CADe) and Diagnosis (CADx)


Lung, Colon pre-cancer detection; bone and vessel imaging (13 conference papers
in CVPR/ECCV/ICCV/MICCAI/WACV/CIKM, 12 patents, 6 years of industrial R&D)

Lymph node, colon polyp, bone lesion detection using Deep CNN + Random View
Aggregation (http://arxiv.org/abs/1505.03046, TMI 2016a; MICCAI 2014a)

Empirical analysis on Lymph node detection and interstitial lung disease (ILD)
classification using CNN (http://arxiv.org/abs/1602.03409, TMI 2016b)

Non-deep models for CADe using compositional representation (MICCAI 2014b)
and mid-level cues (MICCAI 2015b); deep regression based multi-label ILD
prediction (MICCAI 2016 in submission); missing label issue in ILD (ISBI 2016)

Clinical Impact: producing various high-performance second- or first-reader
CAD use cases, and applications as effective imaging-based prescreening tools
on a cloud-based platform for large populations
Overview: Three Key Problems (II)

Semantic Segmentation in Medical Image Analysis


DeepOrgan for pancreas segmentation (MICCAI 2015a) via scanning superpixels
using multi-scale deep features (Zoom-out) and probability map embedding
http://arxiv.org/abs/1506.06448

Deep segmentation of the pancreas and lymph node clusters with HED (holistically-
nested neural networks, Xie & Tu, 2015) as building blocks to learn unary
(segmentation mask) and pairwise (labeling segmentation boundaries) CRF terms, plus
spatial aggregation or structured optimization (the focus of MICCAI 2016
submissions, since this is a much-needed task with small datasets;
(de-)compositional representation is still the key)
CRF: conditional random fields

Clinical Impact: semantic segmentation can help compute clinically
more accurate and desirable imaging biomarkers!
Overview: Three Key Problems (III)

Interleaved or Joint Text/Image Deep Mining on a Large-Scale Radiology
Image Database: large datasets, no labels (~216K 2D key images/slices extracted from
>60K unique patients)
Interleaved Text/Image Deep Mining on a Large-Scale Radiology Image Database (CVPR
2015, a proof of concept study)
Interleaved Text/Image Deep Mining on a Large-Scale Radiology Image Database for
Automated Image Interpretation (its extension, JMLR 2016, to appear)
http://arxiv.org/abs/1505.00670
Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image
Annotation, (CVPR 2016) http://arxiv.org/abs/1603.08486
Unsupervised Category Discovery via Looped Deep Pseudo-Task Optimization Using a
Large Scale Radiology Image Database, (ECCV 2016 in submission)
http://arxiv.org/abs/1603.07965

Clinical Impact: eventually to build an automated, programmable
mechanism to parse and learn from hospital-scale PACS-RIS databases to
derive semantics and knowledge.
This has to be deep learning based, since effective image features are very hard
to hand-craft across different diseases, imaging protocols and modalities.
(I) Automated Lymph Node Detection
Difficult due to large variations in appearance, location and pose.
Plus low contrast against surrounding tissues.

[Figure: a mediastinal lymph node and an abdominal lymph node in CT.]


Previous Work


Previous work mostly uses direct 3D image feature information from the CT volume.
The state-of-the-art approaches [4,5] employ a large set of boosted 3D Haar
features to build a holistic detector in a scanning-window manner.
The curse of dimensionality leads to relatively poor performance [Lu, Barbu, et al.,
2008].

*Can we represent the challenging object detection task(s) as
2D or 2.5D problems, to achieve better FROC performance?*
Heterogeneous Cascade CADe

*Ingredients* (MICCAI 2014~2015, TMI 2016):

Candidate generation (CG): avoid exhaustive scanning-window search; instead use
systems or modules which can generate object hypotheses with extremely high
recall, at the expense of high false-positive rates (e.g., heuristic
importance sampling), as candidate proposals.

Hundreds of thousands of potential object windows are reduced to ~40-50
windows or 3D VOIs: a heterogeneous cascade for object detection via
classification (which addresses the unbalanced/hard negative sampling issue).

Propose, implement and evaluate 2.5D approaches using local
composites of 2D views for classification, versus one-shot 3D yes/no
classification (a compositional or de-compositional model).
Lymph Node Candidate Generation
Region        Reference                 # LNs / # Patients   False Positives   FPs/Patient
Mediastinum   [J. Liu et al. 2014]      388 / 90             3208              36
Abdomen       [K. Cherry et al. 2014]   595 / 86             3484              41

Deep Detection Proposal Generation as future work


Shallow Models: 2D View Aggregation Using a Two-
Level Hierarchy of Linear Classifiers [Seff et al. MICCAI 2014]

VOI candidates generated via a random forest classifier using voxel-


level features (not the primary focus of this work), for high sensitivity
but also high false positive rates.
2.5D: 3 sequences of orthogonal 2D slices then extracted from each
candidate VOI (9 x 3 = 27 views).

[Figure: 2D slice gallery (axial, coronal, and sagittal views) for a LN candidate VOI of 45 x 45 x 45 voxels.]
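
A minimal NumPy sketch (not the authors' released code; slice spacing and axis conventions are assumptions) of how the 9 x 3 = 27 views can be sampled from a cubic candidate VOI:

```python
import numpy as np

def extract_25d_views(voi, n_slices=9):
    """Sample n_slices 2D slices along each of the three orthogonal axes."""
    views = []
    for axis in range(3):  # axial, coronal, sagittal (convention assumed)
        size = voi.shape[axis]
        # evenly spaced slice indices around the VOI center (spacing assumed)
        idxs = np.linspace(size // 4, 3 * size // 4, n_slices).astype(int)
        for i in idxs:
            views.append(np.take(voi, i, axis=axis))
    return views

voi = np.random.rand(45, 45, 45).astype(np.float32)  # stand-in candidate VOI
views = extract_25d_views(voi)
assert len(views) == 27 and views[0].shape == (45, 45)
```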


HOG (Histogram of Oriented Gradients) + LibLinear for
Processing 2D Views

[Figure: HOG feature extraction on an abdominal LN axial slice; SVM training; the resulting feature weights after training.]

Note that a single unified, compact HOG model is trained regardless of axial, coronal, or
sagittal views, unifying all view orientations.
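
A hedged sketch of this pipeline using scikit-image and scikit-learn (the original work used LibLinear directly; HOG parameters other than the 9x9-pixel cell size named on a later slide are assumptions):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(views_2d):
    """One HOG descriptor per 2D view; a single pooled model covers all views."""
    return np.array([
        hog(v, orientations=9, pixels_per_cell=(9, 9), cells_per_block=(2, 2))
        for v in views_2d
    ])

# stand-in data: 45x45 views with binary LN / non-LN labels
X_views = [np.random.rand(45, 45) for _ in range(100)]
y = np.random.randint(0, 2, size=100)

clf = LinearSVC(C=1.0)  # LibLinear-backed linear SVM
clf.fit(hog_features(X_views), y)
scores = clf.decision_function(hog_features(X_views[:5]))
```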
Lymph Node Detection FROC Performance

Enriching the HOG descriptor with other image feature channels, e.g., mid-level semantic
contours/gradients, can further lift the sensitivity by 8~10%!
About 1/3 of the FPs are found to be smaller lymph nodes (short axis < 10 mm).
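
For reference, a minimal sketch of how an FROC operating point (sensitivity vs. false positives per patient) can be computed from per-candidate scores; the data layout is an assumption, not the authors' evaluation code:

```python
import numpy as np

def froc(scores, lesion_ids, patient_ids, thresholds):
    """scores: per-candidate confidences; lesion_ids: id of the true LN a
    candidate hits, or -1 for a false-positive candidate."""
    n_lesions = len({l for l in lesion_ids if l >= 0})
    n_patients = len(set(patient_ids))
    points = []
    for t in thresholds:
        hit = {l for l, s in zip(lesion_ids, scores) if s >= t and l >= 0}
        fps = sum(1 for l, s in zip(lesion_ids, scores) if s >= t and l < 0)
        points.append((fps / n_patients, len(hit) / n_lesions))
    return points  # (FPs per patient, sensitivity) pairs

# toy example: three candidates from one patient, two hitting lesion 0
print(froc([0.9, 0.4, 0.7], [0, -1, 0], ["pt1", "pt1", "pt1"], [0.5]))
```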
Make Shallow to Go Deeper via Mid-level Cues?
[Seff et al. MICCAI 2015]

We explore a learned transformation scheme for producing enhanced


semantic input for HOG, based on LN-selective visual responses.
Mid-level semantic boundary cues learned from segmentation.
All LNs in both target regions are manually segmented by radiologists.

Target region   # Patients   # LNs
Mediastinal     90           389
Abdominal       86           595
Sketch Tokens (CVPR13)
Extract all patches (radius = 7 voxels) centered on a boundary pixel
Cluster into sketch token classes using k-means with k = 150
A random forest is trained for sketch-token classification of input CT
patches (see the code sketch below)

[Figure: learned sketch tokens for mediastinal LNs, abdominal LNs, and colon polyps.]
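
A schematic sketch of the sketch-token pipeline with the slide's stated parameters (radius-7 patches, k = 150); the patch data here are random stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

patch_size = 15  # radius = 7 voxels -> 15x15 patches
# stand-ins: N binary boundary-patch masks and aligned raw CT intensity patches
boundary_patches = (np.random.rand(1000, patch_size * patch_size) > 0.5).astype(float)
ct_patches = np.random.rand(1000, patch_size * patch_size)

kmeans = KMeans(n_clusters=150, n_init=10).fit(boundary_patches)
token_labels = kmeans.labels_  # one sketch-token class per boundary patch

rf = RandomForestClassifier(n_estimators=100).fit(ct_patches, token_labels)
token_posteriors = rf.predict_proba(ct_patches)  # per-token class probabilities
```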


Feature Map Construction

An enhanced, 3-channel feature map: [figure]


Single Template Results
Top-performing feature sets (Sum_Max_I and Sum_Max) exhibit 15%-23%
greater recall than the baseline HOG at low FP rates (e.g., 3 FP/scan).
Our system outperforms the state-of-the-art deep CNN system (Roth et
al., 2014) in the mediastinum, e.g., 78% vs. 70% at 3 FP/scan.

Six-fold cross-validation FROC curves are shown for the two target regions.
Classification

A linear SVM is trained using the new feature set; a HOG cell size of 9x9
pixels gives optimal performance.
Separate models are trained for specific LN size ranges to form a mixture-of-
templates approach (see a later slide).

Visualization of linear SVM weights for the abdominal LN detection models


Mixture Model Results
The wide distribution of LN sizes invites the application of size-specific
models trained separately (see the code sketch below).
LNs > 20 mm are especially clinically relevant.

Single template and mixture model performance for abdominal models
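
A hypothetical sketch of the mixture-of-templates idea: one linear SVM per LN size range, scoring a candidate by the maximum template response (the size-bin edges and stand-in features are assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 576))          # stand-in HOG descriptors
y = rng.integers(0, 2, size=300)         # candidate labels (1 = true LN)
ln_sizes = rng.uniform(3, 30, size=300)  # matched LN short-axis size (mm)

size_bins = [(0, 10), (10, 20), (20, np.inf)]  # bin edges are assumptions
templates = []
for lo, hi in size_bins:
    keep = (y == 0) | ((ln_sizes >= lo) & (ln_sizes < hi))
    templates.append(LinearSVC().fit(X[keep], y[keep]))

def mixture_score(x):
    """Candidate score = maximum response over the size-specific templates."""
    return max(t.decision_function(x.reshape(1, -1))[0] for t in templates)
```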


Deep models: Random Sets of Convolutional Neural
Network Predictions [Roth et al. MICCAI 2014, TMI 2016]
A not-so-deep, CIFAR-10-style convolutional neural network [H. Roth et al. MICCAI 2014].
[Figure: trained filters.]

CUDA-ConvNet: open-source GPU-accelerated code by [A. Krizhevsky et al.
2012], plus the DropConnect modification by [L. Wan et al. 2013].
Deep models: Random Sets of Convolutional Neural
Network Predictions [Roth et al., MICCAI 2014]
Application to appearance modeling and detection of lymph nodes

Random translations, rotations and scales
Convolutional Neural Network Architecture
Results (~100% sensitivity but ~40 FPs/patient at candidate
generation step; then 3-fold Cross-Validation with data augmentation)

Pseudo-probability by simple averaging of N [0,1] classifications


Mediastinum: 71% @ 3 FPs (was 55%)   Abdomen: 83% @ 3 FPs (was 30%)
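
A minimal sketch of the random view aggregation step: classify N randomly translated/rotated views of a candidate and average the [0, 1] outputs into a pseudo-probability (`cnn_predict` is a stand-in for the trained ConvNet; scale jitter is omitted here for brevity, though the paper also randomizes scale):

```python
import numpy as np
from scipy.ndimage import rotate, shift

def pseudo_probability(img2d, cnn_predict, n_views=50, seed=0):
    """Average classifier outputs over randomly transformed 2D views."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_views):
        v = rotate(img2d, angle=rng.uniform(0, 360), reshape=False, mode='nearest')
        v = shift(v, shift=rng.uniform(-2, 2, size=2), mode='nearest')
        probs.append(cnn_predict(v))  # each output assumed to lie in [0, 1]
    return float(np.mean(probs))

# usage with a dummy per-view classifier
p = pseudo_probability(np.random.rand(45, 45), cnn_predict=lambda v: float(v.mean()))
```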
Results (~100% sensitivity but ~40 FPs/patient at candidate
generation step)

Training mediastinum and abdomen Jointly!


Mediastinum: 82% @ 3 FPs (was 55%)   Abdomen: 80% @ 3 FPs (was 30%)
Previous Work (CAD 1.0 or 2.0)
The previous state-of-the-art work is (Feulner et al., MedIA, 2013), which shows 52.9% sensitivity at 3.1 FP/vol. on 54 chest CT
scans, or 60.9% recall at 6.1 FP/vol.
From (Feulner et al., MedIA, 2013): "In order to compare the automatic detection results with the performance of a human, we did
an experiment on the intra-human observer variability. Ten of the CT volumes were annotated a second time by the same person
a few months later. The first segmentations served as ground truth, and the second ones were considered as detections.
TPR and FP were measured in the same way as for the automatic detection. The TPR was 54.8% with 0.8 false positives per volume
on average. While 0.8 FP is very low, a TPR of 54.8% shows that finding lymph nodes in CT is quite challenging also for humans."

Method                     Body Region   # CT Vol.   Size (mm)   TP Criterion   TPR (%)   FP/Vol.
Kitasaka et al. (2007)     Abdomen       5           >5.0        Overlap        57.0      58
Feuerstein et al. (2009)   Mediastinum   5           >1.5        Overlap        82.1      113
Dornheim (2008)            Neck          1           >8.0        Unknown        100       9
Barbu et al. (2010)        Axillary      101         >10.0       In box         82.3      1.0
Feulner et al. (2013)      Mediastinum   54          >10.0       In box         52.9      3.1
Intra-obs. Var.            Mediastinum   10          >10.0       In box         54.8      0.8

Table reproduced from Table 3, Feulner et al., Lymph node detection and segmentation in chest CT data
using discriminative learning and a spatial prior, Medical image analysis, 17(2): 254-270 (2013). Note that
Barbu et al. (2010) is not directly comparable to other papers since Axillary lymph nodes are easier to detect.
Generalizable? Colon CADe Results using a deeper CNN on
1186 patients (or 2372 CTC volumes) [Roth et al., TMI 2016]

[SVM baseline] Summers et al., Computed tomographic virtual colonoscopy computer-aided
polyp detection in a screening population, Gastroenterology, vol. 129, no. 6, pp. 1832-1844, 2005.
1,186 patients with prone and supine CTC images (394/792 patients; 79/173 polyps train/test split)
Deep Convolutional Neural Networks for Computer-Aided Detection:
CNN Architectures, Dataset Characteristics and Transfer Learning
[Shin et al., TMI 2016, in press; http://arxiv.org/abs/1602.03409]

For a more comprehensive evaluation, we exploit three important,
but previously under-studied, factors of employing deep
convolutional neural networks for CADe problems, and provide some
insights and implementation tips for the MICCAI community.

Particularly, we present:
Evaluation of different CNN architectures, ranging from 5 thousand to
160 million parameters, with various depths of CNN layers;
Impacts on performance given datasets of different scales and spatial
image contexts;
When transfer learning from pre-trained ImageNet CNN models via
fine-tuning can be helpful, and why.
Problem 1: Lymph node detection in CT using three
orthogonal views + random sampling + multi-scale
Problem 2.a: Slice-based ILD classification in CT, with thick
slices, no lung segmentation
Problem 2.b: 32x32 patch-based ILD classification in CT;
all previous work uses this protocol, manual ROIs required
Observations & Directions

We summarize our findings as follows.


1. Deep CNN architectures of 8, even 22, layers [3], [18] can be useful even
for CADe problems where the available training datasets are limited.
Previously, CNN models used in medical image analysis applications were
often 2~5 orders of magnitude smaller.

2. The tradeoff of better learning models versus more training data [29]
should be considered carefully when seeking an optimal solution to any CADe
problem (e.g., mediastinal and abdominal LN detection).

3. Datasets can be the bottleneck to further advancing the field of CADe.
Building progressively growing (in scale), well-annotated datasets is at
least as important as developing new algorithms.
As an analogy in computer vision, the scene recognition problem has made
tremendous progress thanks to the steady and continuous development of the
Scene-15, MIT Indoor-67, SUN-397 and Places datasets [36].
4. Transfer learning from large-scale annotated natural image datasets
(ImageNet) to CADe problems is validated to be consistently beneficial in
our experiments. This sheds some light on cross-dataset CNN learning in the
medical imaging domain, e.g., the union of the ILD [20] and LTRC [38]
datasets, as suggested in this paper.

5. Last, applying off-the-shelf deep CNN image features to CADe problems
can be improved upon by exploring/coupling the performance-
complementary properties of hand-crafted features [9], [8], [11]; by CNNs
trained from scratch (Roth et al., MICCAI 2014, TMI 2016); or, more
desirably, by CNNs fine-tuned on the target medical image dataset
(evaluated in this paper).
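
A hedged PyTorch sketch of the transfer-learning recipe in point 4 (the papers used Caffe/CUDA-ConvNet; the layer choices and learning rates here are assumptions): replace the final ImageNet classifier layer and fine-tune with a small learning rate on the pretrained layers.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(pretrained=True)   # ImageNet-pretrained weights
model.classifier[6] = nn.Linear(4096, 2)  # new head for a binary CADe task

# smaller learning rate for pretrained layers, larger for the new head
optimizer = torch.optim.SGD([
    {"params": model.features.parameters(), "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-3},
], momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```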
Visualization of Transfer Learning (Learned from
Thoracoabdominal LNs)
Better Localization after Fine-tuning?
Failure Cases
(II) Semantic (Free-form) Organ Segmentation

[Farag et al., arXiv-1407.8497, 2014; Roth et al., arXiv-1504.03967; Roth et al., MICCAI 2015]
(II) Candidate Region Generation (Hand-crafted
Image Features + RF) [Farag et al., arXiv-1407.8497]

[A. Farag et al., 2014]

e.g., thresholding the candidate probability map at p > 0.5 yields ~97% avg.
sensitivity/recall but only ~27% avg. Dice score
(over-segmentation)

Refinement: Multi-Level Regional and Patch ConvNets Fusion


Convolutional Neural Networks (AlexNet)

Trained first-level filter kernels

CUDA-ConvNet: open-source GPU-accelerated code by
[Krizhevsky et al., NIPS 2012]
Multi-Scale Zoom-out R-ConvNet

Zoom-out
P-ConvNet: Deep Patch Classification

[Figure: ground truth vs. random forest vs. 2.5D patch ConvNet probability maps.]
R2-ConvNet: Regional ConvNet

[Figure: Dice scores of ~27% (RF candidate region), ~57% (P-ConvNet), and ~68% (R2-ConvNet).]
Training & Testing Performance (4-fold Cross-
Validation)

Probability maps are thresholded at p0=0.2, p1=0.5, and p2=0.6, calibrated in training and applied
in testing.
Dice coefficients: 84.2% (+/- 3.6%) in training and 75.8% (+/- 5.4%) in testing (more stable as
measured by the std values).
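
A minimal sketch of this evaluation step: threshold the calibrated probability map and compute the Dice similarity coefficient against the manual ground-truth mask (the volumes below are stand-ins):

```python
import numpy as np

def dice(pred_mask, gt_mask):
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return 2.0 * inter / (pred_mask.sum() + gt_mask.sum())

prob_map = np.random.rand(64, 64, 64)   # stand-in probability map
gt = np.zeros_like(prob_map, dtype=bool)
gt[20:40, 20:40, 20:40] = True          # stand-in ground-truth mask

for p in (0.2, 0.5, 0.6):               # p0, p1, p2 from the slide
    print(p, dice(prob_map > p, gt))
```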
4-fold CV Performance

Minimum surface distances improve to 0.94 +/- 0.6 mm (p < 0.01) with R2-ConvNet,
from 1.46 +/- 1.5 mm when only P-ConvNet is applied.

Previous state-of-the-art: [46.6% to 69.1%] DSC, all under LOO (Leave-one-patient-out).


An Above-Average Example

a) The manual ground truth annotation (in red outline)


b) The G(P2(x)) probability map
c) The final segmentation (in green outline) at p2=0.6

DSC=82.7%.
[Figure: per-case surface distance maps.]
Surface distance statistics: mean 0.936 mm, std 0.586 mm, min 0.297 mm, max 2.204 mm.
(III) Interleaved Text/Image Deep Mining on a Large-Scale Radiology
Database (780K reports / 60K patients) for Automated Image Interpretation

Hoo-Chang Shin, Le Lu, Lauren Kim, Ari Seff, Jianhua Yao, Ronald M. Summers, IEEE
Conf. CVPR 2015, to appear; JMLR special issue on large-scale health informatics (in submission)
Interleaved Text/Image Deep Mining on a Large-Scale Radiology Database

Example words embedded in the vector space using the open-source Google word2vec
model (visualized in 2D), trained on 1B words from 780K radiology reports and
0.2B words from OpenI, an open-access biomedical image search engine (http://openi.nlm.nih.gov).
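
A hedged sketch of this word-embedding step using gensim's word2vec implementation (the slide used Google's open-source word2vec tool; the corpus, tokenization and hyperparameters below are placeholders, not the authors' pipeline, and `vector_size` assumes gensim >= 4):

```python
from gensim.models import Word2Vec

# stand-in tokenized report sentences; the real corpus has ~1.2B words
sentences = [
    ["right", "lower", "lobe", "nodule", "unchanged"],
    ["no", "mediastinal", "lymphadenopathy"],
]
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1)
similar = model.wv.most_similar("nodule", topn=5)  # nearest words in the space
```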
Interleaved Text/Image Deep Mining on a Large-Scale Radiology Database

Disease Ontology (DO) is to our database what WordNet is to ImageNet.


http://arxiv.org/abs/1603.08486

Shin et al., IEEE CVPR 2015, JMLR 2016 (http://arxiv.org/abs/1505.00670)


Unsupervised Category Discovery via Looped Deep Pseudo-Task
Optimization Using a Large Scale Radiology Image Database [Wang et al.
2016] http://arxiv.org/abs/1603.07965

Obtaining semantic labels on a large-scale radiology image database
(215,786 key images from 61,845 unique patients) is a prerequisite for,
yet a bottleneck to, training highly effective deep convolutional neural
network (CNN) models for image recognition.
Nevertheless, conventional methods for collecting image labels (e.g.,
Google search followed by crowd-sourcing) are not applicable due to
the formidable difficulties of medical annotation tasks for those who
are not clinically trained.
This type of image labeling task remains non-trivial even for
radiologists due to uncertainty and possible drastic inter-observer
variation or inconsistency.

In this paper, we present a looped deep pseudo-task optimization


(LDPO) procedure for automatic category discovery of visually
coherent and clinically semantic (concept) clusters.
Unsupervised Category Discovery via Looped Deep Pseudo-Task
Optimization Using a Large Scale Radiology Image Database [Wang et al.
2016] http://arxiv.org/abs/1603.07965
Our system can be initialized by domain-specific (CNN trained on
radiology images and text report derived labels) or generic
(ImageNet based) CNN models.
Afterwards, a sequence of pseudo-tasks is exploited by looping
deep image feature clustering (to refine image labels) and deep
CNN training/classification using the new labels (to obtain more task-
representative deep features).
Our method is conceptually simple and based on the hypothesized
"convergence" of better labels leading to better trained CNN models
which in turn feed more effective deep image features to facilitate
more meaningful clustering/labels.
We have empirically validated the convergence and demonstrated
promising quantitative and qualitative results.

Category labels of significantly higher quality than those in previous


work are discovered. This allows for further investigation of the
hierarchical semantic nature of the given large-scale radiology
image database.
Framework of LDPO

[Diagram, summarized:]
1. Initialize with either a fine-tuned CNN model (with text-topic labels) or a generic ImageNet CNN model.
2. Extract deep CNN features and perform clustering with feature encoding (k-means or RIM) on randomly shuffled images.
3. Fine-tune the CNN using the renewed cluster labels (train 70% / val 10% / test 20% for each iteration).
4. Check convergence by evaluating the clusters; if not converged, loop back to step 2.
5. Upon convergence, run NLP on the text reports of each cluster to output image clusters with semantic text labels.
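
A schematic Python sketch of this loop; `extract_features` and `finetune_cnn` are hypothetical stand-ins for the Caffe-based components, and convergence is checked with a permutation-invariant comparison of consecutive clusterings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def extract_features(model, images):
    # stand-in: in the paper these are deep CNN activations (AlexNet/GoogLeNet)
    return images.reshape(len(images), -1)

def finetune_cnn(model, images, labels):
    # stand-in: fine-tune the CNN on renewed cluster labels (Caffe in the paper)
    return model

def ldpo(images, model, k=80, max_iters=10, tol=0.95):
    """Alternate feature extraction, clustering, and fine-tuning until
    consecutive clusterings stabilize (NMI is permutation-invariant)."""
    prev = None
    for _ in range(max_iters):
        feats = extract_features(model, images)
        labels = KMeans(n_clusters=k).fit_predict(feats)
        if prev is not None and normalized_mutual_info_score(prev, labels) >= tol:
            break  # converged: cluster labels barely change between iterations
        model = finetune_cnn(model, images, labels)
        prev = labels
    return model, labels

model, labels = ldpo(np.random.rand(500, 16, 16), model=None)
```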
CNN Models and Feature Encoding

LDPO is applicable to a variety of CNN models; we analyze the CNN
activations from layers of different depths in AlexNet and GoogLeNet.
The Caffe CNN implementation is used to perform fine-tuning of pre-trained CNNs.
Cluster Labeling Samples
Five-level Hierarchical Categorization

Form a hierarchical category tree (ontology semantics?) of (270,
64, 15, 4, 1) different class labels from bottom (leaf) to top (root).
The randomly color-coded category tree is shown.
A Sample Branch of Category Hierarchy

The vast majority of images in the clusters of this branch are
verified as CT chest scans by radiologists.
[Figure: a sample branch of the category hierarchy; node numbers denote discovered cluster indices.]
With a radiologist-in-the-loop protocol, build an annotated
large-scale radiology image database (a radiology analogue
of Flickr 30K or MS COCO?)
Take Home Messages
1. High performance CAD systems can be built using stratified, heterogeneous
cascades or stacking, progressively pruning from large-dimensional model state
spaces to handle the unbalanced negative learning challenge (negatives
need to be appropriately sampled).

2. Full 3D approaches may capture more holistic patterns but can be very challenging to
train effectively/compactly, even with modern learning systems, and are not always
optimal by default. The issue is one of complexity and composability: the curse of
dimensionality affects trainability and generality, so a proper balance of representation
granularity/scale and model size is needed.

3. Proper image representations (e.g., random 2D/2.5D view sampling and aggregation,
mid-level cues, 20-questions hypothesis testing, etc.) can be critical alternatives.

4. A multi-staged algorithmic flow is not end-to-end trainable, but it offers great flexibility
in leveraging heterogeneous components, shallow or deep, as long as the performance
goal of each step/stage is clearly defined and the stages can compensate for each other.

5. Generally speaking, it seems that Deeper is better if carefully handled!


Thank you!
Imaging Biomarkers and Computer-Aided Diagnosis Laboratory
Clinical Image Processing & Services
Radiology and Imaging Sciences
National Institutes of Health Clinical Center

le.lu@nih.gov; rms@nih.gov

Thanks to the NIH Intramural Research Program for support and to NVIDIA for
donating Tesla K40 GPUs! All code and data (except full radiology
reports) discussed are in the process of being made publicly available, or are
already shared at the NCI Cancer Imaging Archive or on GitHub (upon approval).

CVPR 2015, 2016 Workshop on Medical Computer Vision: How Big Data is Possible for Medical
Image Analysis, invited talks only, Boston, MA, June 11th, 2015; Las Vegas, NV, July 1st, 2016
