MDai Algorithm

Summary

An ensemble model built on multiple ResNets. The networks produce probability maps to detect nodules, which are then localized with bounding boxes. The final prediction of whether the patient has lung cancer combines features derived from the detected nodules with additional features such as the patient's sex. The ensemble consists of a very large number of neural networks and is expensive to run.

Source

**Author:** MD.ai
**Repository:** https://github.com/mdai/kaggle-lung-cancer
6th place at the Kaggle Data Science Bowl 2017

License

Apache 2.0

Prerequisites

Dependency   Name         Version
Language     Python       3.5
ML engine    Keras        1.2.2
ML backend   TensorFlow   1.0.0
OS           Ubuntu       16.04
Processor    CPU          not tested
             GPU          yes
GPU driver   CUDA         8
             cuDNN        5.1

Dependency packages:

numpy==1.12.1
scipy==0.19.0
pandas==0.19.2
scikit-image==0.13.0
scikit-learn==0.18.1
joblib==0.9.4
pillow==4.0.0
xgboost==0.6a2
keras==1.2.2
tensorflow-gpu==1.0.0
hyperopt==0.1
h5py==2.7.0
redis-py==2.10.5

Additional installation instructions:

Install pydicom 1.0.0a1 from source.

git clone https://github.com/pydicom/pydicom.git pydicom-src
cd pydicom-src
git reset --hard bbaa74e9d02596afc03b924fe8ffbe7b95b6ff55
python setup.py install

Algorithm design

Preprocessing

The DICOM images are preprocessed in the following steps:

1) The DICOM images are loaded into a 3D NumPy ndarray.
2) The pixel values are converted into Hounsfield Units (HU).
3) The volume is resampled to isotropic spacing.
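Steps 2 and 3 can be illustrated with a minimal sketch. It assumes the usual linear DICOM rescale (the slope and intercept normally come from the `RescaleSlope` and `RescaleIntercept` attributes) and uses a simple nearest-neighbour resampling for brevity. The function names and the 1 mm target spacing are hypothetical and are not taken from the repository, which may use a different interpolation.

```python
import numpy as np


def to_hounsfield(raw, slope, intercept):
    """Map stored pixel values to Hounsfield Units: HU = raw * slope + intercept."""
    return raw.astype(np.float32) * slope + intercept


def resample_isotropic(volume, spacing, new_spacing=1.0):
    """Nearest-neighbour resampling of a (z, y, x) volume to cubic voxels.

    spacing is the original voxel size in mm along each axis;
    new_spacing is the assumed target edge length in mm.
    """
    spacing = np.asarray(spacing, dtype=np.float64)
    scaled = np.array(volume.shape) * spacing / new_spacing
    new_shape = np.maximum(np.round(scaled), 1).astype(int)
    # Map every output index back to a source index
    # (floor of its physical position, clipped to the volume).
    index_per_axis = [
        np.minimum((np.arange(n) * new_spacing / s).astype(int), dim - 1)
        for n, s, dim in zip(new_shape, spacing, volume.shape)
    ]
    return volume[np.ix_(*index_per_axis)]


# Synthetic scan: 4 slices of 8x8 pixels, 2.5 mm between slices, 0.5 mm in-plane.
raw = np.zeros((4, 8, 8), dtype=np.int16)
hu = to_hounsfield(raw, slope=1.0, intercept=-1024.0)
iso = resample_isotropic(hu, spacing=(2.5, 0.5, 0.5))
print(iso.shape)
```

A real pipeline would more likely interpolate (for example linearly) instead of picking the nearest voxel; the index arithmetic stays the same.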

Sex detection

As an additional feature for model training, the patient's sex is predicted by a deep residual network (ResNet) that was trained on hand-labeled DICOM images. Its residual units follow the architecture proposed by He et al. (see References). The network architecture is as follows:

[Figure: mdai_cnn, network architecture diagram]

Nodule detection

Two ResNet variants each produce a set of probability maps from the isotropic volume created from the DICOM image. These maps provide the rest of the algorithm with regions of interest. Patches are extracted from the full isotropic volume, and the model predicts for each patch whether a nodule is present. The model architectures are as follows:

[Figure: mdai_cnn, network architecture diagram]
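The patch-wise scoring described above can be sketched as a sliding window over the volume. This is an illustration only: the patch size, the stride, and the `toy_model` stand-in for a trained ResNet are assumptions, not values from the repository.

```python
import numpy as np


def probability_map(volume, predict_patch, patch=8, stride=8):
    """Slide a cubic window over the volume and store one score per position."""
    steps = [range(0, dim - patch + 1, stride) for dim in volume.shape]
    out = np.zeros([len(s) for s in steps], dtype=np.float32)
    for i, z in enumerate(steps[0]):
        for j, y in enumerate(steps[1]):
            for k, x in enumerate(steps[2]):
                cube = volume[z:z + patch, y:y + patch, x:x + patch]
                out[i, j, k] = predict_patch(cube)
    return out


def toy_model(cube):
    """Hypothetical stand-in for a ResNet: fraction of voxels denser than -400 HU."""
    return float((cube > -400).mean())


volume = np.full((16, 16, 16), -1000.0)  # air everywhere
volume[8:16, 0:8, 0:8] = 50.0            # one dense block in a single octant
pmap = probability_map(volume, toy_model)
print(pmap.shape)
```

With a stride smaller than the patch size the windows overlap and the map becomes finer, at the cost of more model evaluations.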

The rest of the prediction process is run with 750 randomly drawn hyperparameter sets. The parameters include:

  • Which probability map to use (model m05a or m09a)
  • ResNet used to fit the bounding box
  • ResNet used to predict the cancer probability
  • Cancer probability threshold
  • Size threshold
  • Number of nodules used in the cancer prediction
  • Inner or outer bounding box
  • Aggregation function for p(cancer)
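Drawing such random configurations might look like the sketch below. Only the kinds of parameters and the model names m05a and m09a come from the text; every other name and candidate value in `SEARCH_SPACE` is a placeholder chosen for illustration.

```python
import random

# Hypothetical search space: the candidate values are placeholders.
SEARCH_SPACE = {
    "probability_map_model": ["m05a", "m09a"],
    "bbox_model": ["bbox_a", "bbox_b"],        # placeholder model names
    "cancer_model": ["cancer_a", "cancer_b"],  # placeholder model names
    "cancer_prob_threshold": [0.1, 0.3, 0.5],
    "size_threshold": [4, 8, 12],
    "n_nodules": [1, 3, 5],
    "bbox_kind": ["inner", "outer"],
    "aggregation": ["max", "mean", "noisy_or"],
}


def sample_configs(space, n, seed=0):
    """Draw n independent random configurations from the search space."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n)
    ]


configs = sample_configs(SEARCH_SPACE, 750)
print(len(configs), sorted(configs[0]))
```

Each configuration then drives one full pass through the detection and classification steps, which is why the number of sets dominates the run time.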

A threshold is applied to the probability map. Then, for each configuration, bounding boxes are fit around the connected objects in the thresholded map. A second ResNet then refines the bounding boxes. These ResNets are designed as follows:

[Figure: mdai_cnn, network architecture diagram]
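The thresholding and box fitting can be sketched with a plain flood fill. The 0.5 threshold and the 6-connected neighbourhood are assumptions, since the text does not state them, and the ResNet refinement step is not shown.

```python
from collections import deque
from itertools import product

import numpy as np

# 6-connected neighbourhood (shared faces only); an assumption, not from the source.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def bounding_boxes(prob_map, threshold=0.5):
    """Return one (min_corner, max_corner) pair per connected region above threshold.

    Both corners are inclusive voxel indices in (z, y, x) order.
    """
    mask = prob_map > threshold
    seen = np.zeros(mask.shape, dtype=bool)
    boxes = []
    for start in product(*(range(n) for n in mask.shape)):
        if not mask[start] or seen[start]:
            continue
        seen[start] = True
        queue, voxels = deque([start]), []
        while queue:  # breadth-first flood fill of one component
            voxel = queue.popleft()
            voxels.append(voxel)
            for offset in NEIGHBOURS:
                nxt = tuple(v + o for v, o in zip(voxel, offset))
                inside = all(0 <= c < n for c, n in zip(nxt, mask.shape))
                if inside and mask[nxt] and not seen[nxt]:
                    seen[nxt] = True
                    queue.append(nxt)
        coords = np.array(voxels)
        low = tuple(int(c) for c in coords.min(axis=0))
        high = tuple(int(c) for c in coords.max(axis=0))
        boxes.append((low, high))
    return boxes


prob_map = np.zeros((6, 6, 6))
prob_map[1:3, 1:3, 1:3] = 0.9  # a 2x2x2 blob
prob_map[4, 4, 5] = 0.8        # an isolated voxel
boxes = bounding_boxes(prob_map)
print(boxes)
```

In practice a library routine such as `scipy.ndimage.label` followed by `scipy.ndimage.find_objects` does the same job much faster; the explicit loop is used here only to keep the sketch dependency-light.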

Bounding boxes that are too small are discarded. The remaining bounding boxes are used to cut 3D sub-volumes (ROIs) out of the isotropic volume. These ROIs are fed to yet another ResNet, which predicts whether each ROI contains a nodule. This ResNet has the following architecture:

[Figure: mdai_cnn, network architecture diagram]

ROIs whose predicted probability of containing a nodule is too low are discarded.
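The two discard rules (size and nodule probability) amount to a simple filter, sketched below. How the authors measure box size is not stated, so the "shortest edge" criterion, the threshold values, and the dictionary layout are all assumptions. The nodule probabilities are taken as given here, whereas the pipeline obtains them from the ResNet after the size filter.

```python
def box_edges(box):
    """Edge lengths in voxels of an inclusive (min_corner, max_corner) box."""
    low, high = box
    return tuple(h - l + 1 for l, h in zip(low, high))


def filter_rois(rois, min_edge, min_prob):
    """Keep ROIs whose shortest edge and nodule probability both pass."""
    kept = []
    for roi in rois:
        if min(box_edges(roi["box"])) < min_edge:
            continue  # bounding box too small
        if roi["nodule_prob"] < min_prob:
            continue  # unlikely to contain a nodule
        kept.append(roi)
    return kept


rois = [
    {"box": ((0, 0, 0), (9, 9, 9)), "nodule_prob": 0.92},        # passes both checks
    {"box": ((4, 4, 5), (4, 4, 5)), "nodule_prob": 0.95},        # too small
    {"box": ((20, 20, 20), (31, 29, 30)), "nodule_prob": 0.05},  # low probability
]
kept = filter_rois(rois, min_edge=3, min_prob=0.5)
print(len(kept))
```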

Prediction of cancer probability

The ROIs that remain at the end of the nodule detection are fed into a further ResNet, which predicts whether each one is cancerous. These ResNets are designed as follows:

[Figure: mdai_cnn, network architecture diagram]

The resulting probabilities are then aggregated. In addition, further features are derived from the geometry of the fitted bounding boxes, the number of ROIs, and the number of predicted nodules.
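The aggregation of per-nodule probabilities into one patient-level value could look like the sketch below. The three aggregation functions (max, mean, noisy-OR) and the `top_n` cut-off are common choices used here for illustration; the text does not say which functions the authors actually use.

```python
def aggregate(probs, how="max", top_n=None):
    """Combine per-nodule cancer probabilities into a single value.

    Only the top_n most suspicious nodules are used when top_n is given.
    """
    ranked = sorted(probs, reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    if not ranked:
        return 0.0  # no nodules left: assumed to mean no evidence of cancer
    if how == "max":
        return ranked[0]
    if how == "mean":
        return sum(ranked) / len(ranked)
    if how == "noisy_or":
        # Probability that at least one nodule is cancerous,
        # treating the nodules as independent.
        none_cancerous = 1.0
        for p in ranked:
            none_cancerous *= 1.0 - p
        return 1.0 - none_cancerous
    raise ValueError("unknown aggregation: " + how)


nodule_probs = [0.2, 0.5, 0.9]
print(aggregate(nodule_probs, "max"))
print(aggregate(nodule_probs, "mean", top_n=2))
print(aggregate([0.5, 0.5], "noisy_or"))
```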

All features (sex and nodule features), together with the cancer probabilities produced by each hyperparameter configuration, are fed into an XGBoost ensemble model, which outputs the final probability that the patient has cancer.

Trained model

A trained model is not publicly available, but one has been requested.

Source: -

Usage instructions: -

Model Performance

Training- / prediction time

Test system: AWS p2.16xlarge instance

Component   Spec                Count
vCPU        Broadwell 2.7 GHz   64
GPU         NVIDIA GK210        16
RAM         732 GB

Training time: Not specified.
Prediction time: Several days for the whole DSB dataset.

Model Evaluation

Dataset:

Metric Score
LogLoss 0.41629
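For reference, the log loss metric reported above is the mean negative log-likelihood of the true labels under the predicted probabilities. A minimal sketch of the standard binary definition follows; the clipping constant `eps` is a conventional choice and an assumption here.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss: -mean(y * log(p) + (1 - y) * log(1 - p))."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Predicting 0.5 for every patient gives ln(2), about 0.693.
baseline = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(round(baseline, 3))
```

Lower is better: an uninformative constant prediction of 0.5 scores about 0.693, which puts the reported 0.41629 in context.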

Use cases

When to use this algorithm

  • If you want to take the patient’s sex into account (it is used only for the final prediction of whether the patient has cancer).
  • If you want bounding boxes for probable nodules.

When to avoid this algorithm

  • Many models are run in parallel for a single prediction, so this algorithm is likely to be computationally very expensive.

Adaptation into Concept To Clinic

Porting to Python 3.5+

Already written in Python 3.5+.

Porting to run on CPU and GPU

Running it on a CPU has not been tested yet. It is probably possible, but the authors' instructions for running the pipeline on multiple GPUs would have to be adjusted.

Improvements on the code base

  • There is almost no documentation (docstrings, comments) explaining the purpose of each function or module.
  • Some functions are extremely long and handle many separate tasks. Breaking them up into several functions would improve readability and maintainability.
  • There are no unit tests so far.

Adapting the model

The goal of this project is to detect nodules and classify each one, not to predict whether the patient has cancer overall. The feature generation for the ensemble model, including the prediction of the patient’s sex, is therefore not needed. In production the patient’s sex would probably be known to the doctor anyway and would not have to be predicted. The parts of the prediction process that come before the ensemble can be used mostly unchanged. The final aggregation of the per-nodule cancer probabilities has to be removed, and the ensemble model has to be adjusted so that it aggregates the predictions of all hyperparameter combinations for each nodule instead of producing an overall probability that the patient has cancer. If possible, the number of models in the ensemble should be reduced. The authors also suggest that some parts of the algorithm can be optimized.

Comments

The model identifies single nodules and marks them with bounding boxes, which would be useful for labeling them for the user. It also calculates a cancer probability for each nodule, which fits the goal of this project. However, it may be one of the most computationally expensive models and might not run on a PC. The authoring team appears to have recently founded a startup (MD.ai) that aims to build deep-learning image recognition solutions for medical data, so they may be reluctant to share the trained model.

References

[Identity Mappings in Deep Residual Networks](https://arxiv.org/pdf/1603.05027.pdf)