Aidence¶
Summary¶
Authors: Tim Salimans, Mark-Jan Harte, Gerben van Veenendaal

Repository: https://bitbucket.org/aidence/kaggle-data-science-bowl-2017/src/38c4f2f67294?at=master

The solution took 3rd place on the private leaderboard of the Data Science Bowl 2017.
Prerequisites¶
Dependency | Name | Version |
---|---|---|
Language | Python | 3.4 |
ML engine | ||
ML backend | Tensorflow | 1.1 |
OS | ||
Processor | CPU | yes |
GPU | Nvidia K80 | |
GPU driver | CUDA | 8.0 |
GPU driver | cuDNN | 6.0 |
Dependency packages:
tensorflow==1.1
opencv>=3.1
scipy==0.17.0
numpy==1.13
scikit-learn==0.19.0
pydicom==0.9.9
SimpleITK==1.0.1
pandas==0.20.3
pycuda==2017.1.1
Algorithm design¶
Preprocessing¶
Resampling to the isotropic resolutions of and
for the final model.
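
The resampling step can be sketched as follows. This is a minimal illustration, not the Aidence implementation: the function name `resample_isotropic`, the nearest-neighbour scheme, and the 1.0 mm target in the usage lines are assumptions made for this example (the actual target resolutions are not given above). A real pipeline would more likely use an interpolating routine.

```python
import numpy as np


def resample_isotropic(volume, spacing_mm, target_mm):
    """Nearest-neighbour resampling of a 3D volume to an isotropic voxel size.

    volume:     3D array ordered (z, y, x)
    spacing_mm: original voxel size per axis, in millimetres
    target_mm:  desired voxel size (hypothetical value, chosen by the caller)
    """
    spacing = np.asarray(spacing_mm, dtype=float)
    old_shape = np.array(volume.shape)
    # The physical extent stays fixed, so the voxel count scales by spacing / target.
    new_shape = np.maximum(1, np.round(old_shape * spacing / target_mm)).astype(int)
    # For each output voxel centre, look up the input voxel that contains it.
    index = [
        np.minimum(((np.arange(n) + 0.5) * target_mm / s).astype(int), dim - 1)
        for n, s, dim in zip(new_shape, spacing, old_shape)
    ]
    return volume[np.ix_(*index)]


# Illustrative values only: 2 mm slices, 1 mm in-plane, resampled to 1 mm isotropic.
scan = np.arange(4 * 8 * 8, dtype=np.float32).reshape(4, 8, 8)
resampled = resample_isotropic(scan, spacing_mm=(2.0, 1.0, 1.0), target_mm=1.0)
```

Here the slice axis is upsampled by a factor of two while the in-plane axes are left unchanged, so every voxel of `resampled` covers the same physical size in all three directions.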
Nodule detection¶
A fully convolutional ResNet predicts, for each voxel, whether it lies at the center of a nodule. It was trained on the LIDC/IDRI dataset. Two such models were trained: one for normal-sized nodules and one for masses. The masses in the Kaggle training data were annotated, and the mass network was trained on masses from both LIDC/IDRI and Kaggle. The logit output of the network over the whole volume is thresholded to determine candidates, and candidates outside the lung are masked out.
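
The thresholding and lung-masking step might look roughly like the sketch below. The function name `find_candidates`, the default threshold of 0.0, and the toy arrays are assumptions for illustration; the source does not state the threshold the authors used or how neighbouring voxels are merged into one candidate.

```python
import numpy as np


def find_candidates(logits, lung_mask, threshold=0.0):
    """Return candidate voxel coordinates (z, y, x) and scores, strongest first.

    logits:    3D array of per-voxel network outputs
    lung_mask: boolean 3D array, True inside the lung
    threshold: logit cut-off (assumed value; 0.0 corresponds to probability 0.5)
    """
    # Keep voxels that exceed the threshold AND lie inside the lung.
    keep = (logits > threshold) & lung_mask
    coords = np.argwhere(keep)   # row-major order, same order as logits[keep]
    scores = logits[keep]
    order = np.argsort(-scores)  # highest logit first
    return coords[order], scores[order]


# Toy volume: one strong response inside the lung, one outside it.
logits = np.full((3, 3, 3), -5.0)
logits[1, 1, 1] = 2.0  # inside the lung
logits[0, 0, 0] = 3.0  # outside the lung, should be dropped
lung_mask = np.zeros((3, 3, 3), dtype=bool)
lung_mask[1, :, :] = True

coords, scores = find_candidates(logits, lung_mask)
```

This sketch returns every voxel above the threshold; grouping adjacent voxels into a single candidate (for example with connected-component labelling) would be a natural next step but is left out here.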
Prediction of cancer probability¶
A multi-task model takes the candidates and is trained jointly to predict some attributes from the LIDC dataset (malignancy, etc.) and the cancer label for the Kaggle scans.
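
One way to picture such a multi-task objective is a weighted sum of a cancer-label loss and an attribute loss, where each sample contributes only to the tasks it has labels for. The sketch below is an assumption-laden illustration: the choice of binary cross-entropy and squared error, the `attr_weight` parameter, and the masking scheme are not taken from the Aidence code.

```python
import numpy as np


def multitask_loss(cancer_logit, cancer_label, has_cancer_label,
                   attr_pred, attr_target, has_attr, attr_weight=1.0):
    """Combine a cancer-label loss and an attribute loss (hypothetical form).

    The has_* arrays are 0/1 masks: Kaggle scans carry a cancer label,
    LIDC nodules carry attribute annotations such as malignancy.
    """
    # Numerically stable binary cross-entropy computed directly on logits.
    bce = (np.maximum(cancer_logit, 0.0) - cancer_logit * cancer_label
           + np.log1p(np.exp(-np.abs(cancer_logit))))
    # Squared error for the auxiliary attribute (assumed loss choice).
    mse = (attr_pred - attr_target) ** 2
    # Average each task only over the samples that are labelled for it.
    cancer_term = np.sum(bce * has_cancer_label) / max(np.sum(has_cancer_label), 1)
    attr_term = np.sum(mse * has_attr) / max(np.sum(has_attr), 1)
    return cancer_term + attr_weight * attr_term


# Two samples: the first has only a cancer label, the second only an attribute.
loss = multitask_loss(
    cancer_logit=np.array([0.0, 0.0]),
    cancer_label=np.array([1.0, 0.0]),
    has_cancer_label=np.array([1.0, 0.0]),
    attr_pred=np.array([0.0, 2.0]),
    attr_target=np.array([0.0, 1.0]),
    has_attr=np.array([0.0, 1.0]),
)
```

Masking per task lets one network learn from both datasets even though neither dataset carries every label.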
Trained model¶
Source: According to the issue description, the trained model has already been requested.
There are two .pkl models, in localization and localization-large.
Usage instructions: README
Model Performance¶
Training- / prediction time¶
Test system:
Component | Spec | Count |
---|---|---|
CPU | ||
GPU | Nvidia K80 | 4 for everything but the final model <br/> 8 for the final model |
RAM | ||
Training time:
It takes about 3-5 days to run everything (inference + training) on a decent machine with 8 GPUs.

Prediction time:
Unknown, but it must be at most about 14 minutes per CT, since the whole run processes the 506 CTs within 5 days.
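
The per-scan bound follows from simple arithmetic on the figures quoted above (506 CTs, 3-5 days). The helper name `max_minutes_per_scan` is made up for this example, and the result is only an upper bound, because the quoted wall-clock time also includes training.

```python
def max_minutes_per_scan(days, n_scans):
    """Upper bound on minutes per scan if the whole run took `days` days."""
    minutes_total = days * 24 * 60
    return minutes_total / n_scans


# Bounds for the shortest and longest quoted run durations.
bound_3_days = max_minutes_per_scan(3, 506)
bound_5_days = max_minutes_per_scan(5, 506)
print("3 days: {:.1f} min per CT".format(bound_3_days))
print("5 days: {:.1f} min per CT".format(bound_5_days))
```

So the bound ranges from roughly 8.5 minutes (3-day run) to roughly 14.2 minutes (5-day run) per CT.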
Use cases¶
When to use this algorithm¶
- The mass and nodule annotations over the Kaggle dataset, provided by the Aidence team, can be used for further fine-tuning / retraining.
When to avoid this algorithm¶
- Even with GPU support, examining every voxel may take a huge amount of time. The authors used 8 Nvidia K80 GPUs.
Adaptation into Concept To Clinic¶
Porting to Python 3.5+¶
The solution is already compatible with Python 3.5+
Porting to run on CPU and GPU¶
The approach consists of two deep 3D residual networks for classification, which run over every voxel of a CT scan. Even prediction alone would take a huge amount of time with this pipeline on CPU only.
Improvements on the code base¶
The code itself looks good to me.
Adapting the model¶
Comments¶
The major benefit of the Aidence approach for Concept To Clinic is the provided mass and nodule annotations over the Kaggle dataset, which can be included in the overall dataset for retraining other models.
References¶
- Aidence algorithm
- README
- Mass annotations over Kaggle data
- Nodule annotations over Kaggle data