Alex | Andre | Gilberto | Shize algorithm


The approach consists of a 3D CNN data model that slides along the z coordinate of a CT volume, followed by xgboost and extraTree models trained on different subsets of the extracted features.

A sliding 3D data model was custom built to reflect how radiologists review lung CT scans to diagnose cancer risk. As part of this data model, which allows any nodule to be analyzed multiple times, a neural network nodule identifier was implemented and trained on the Luna CT dataset. Non-traditional, unsegmented inputs (i.e. full CT scans) were used for training to ensure that no nodules, in particular those on the lung perimeter, are missed.


Authors: Alexander Ryzhkov, Gilberto Titericz Junior, Andre, Shize Su
The approach took 8th place in the Data Science Bowl 2017.


MIT License


Andre                              Shize
Dependency   Name     Version      Dependency   Name     Version
Language     Python   3.5          Language     Python   2.7
ML engine    Keras    1.2.2        ML engine    Keras    1.2.2
ML backend   Theano   0.8+         ML backend   Theano
OS PC Linux
AWS Linux

OS AWS Linux C3.8
Processor CPU PC i7
Processor CPU Intel Xeon
GPU no
GPU driver CUDA

Some of the cell values were inferred from the AWS setups and CUDA compatibility.

Dependency packages: Neither the repository nor the authors specified exact versions of the Python packages:

Andre         Shize
Keras 1.2.2   numpy
Theano        pandas
spyder        xgboost
opencv        scikit-learn

Algorithm design


  1. Resampling all patient CT scans to a relatively coarse resolution of 2x2x2 mm.
  2. Standardising CT voxel values to the Hounsfield scale.
  3. Segmenting the lungs.
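The first two steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the nearest-neighbour interpolation and the default rescale slope/intercept are assumptions (in practice the slope and intercept would be read from the scan metadata), and lung segmentation is left out.

```python
import numpy as np

TARGET_SPACING_MM = (2.0, 2.0, 2.0)  # (z, y, x) resolution named in step 1


def to_hounsfield(raw, slope=1.0, intercept=-1024.0):
    """Map stored voxel values to Hounsfield units: HU = raw * slope + intercept.

    The default slope/intercept are illustrative assumptions only.
    """
    return raw.astype(np.float32) * slope + intercept


def resample_nearest(volume, spacing_mm, target_mm=TARGET_SPACING_MM):
    """Resample a (z, y, x) volume to target_mm using nearest-neighbour lookup."""
    spacing = np.asarray(spacing_mm, dtype=float)
    target = np.asarray(target_mm, dtype=float)
    # Output size so that the physical extent of the scan is preserved.
    new_shape = np.maximum(1, np.round(np.array(volume.shape) * spacing / target)).astype(int)
    # For every output index, pick the nearest source index along each axis.
    idx = [
        np.clip(np.rint(np.arange(n) * t / s).astype(int), 0, dim - 1)
        for n, t, s, dim in zip(new_shape, target, spacing, volume.shape)
    ]
    return volume[np.ix_(*idx)]
```

Nearest-neighbour lookup is the cheapest choice; a higher-order interpolation would give smoother volumes at a higher computational cost.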

Nodule detection

A nodule identifier is trained on a slicing architecture using the Luna dataset and the generated intermediate files (3 options are provided).
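The idea of sliding along the z axis, so that a nodule is seen by the model more than once, can be illustrated with overlapping slabs. This is a sketch under assumptions: the function name, the slab depth and the stride are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def z_slabs(volume, depth=8, stride=4):
    """Yield (z_start, slab) pairs of `depth` consecutive slices, stepping by `stride`.

    With stride < depth the slabs overlap, so a structure at a given z position
    appears in several slabs. depth and stride are illustrative values only.
    """
    n_slices = volume.shape[0]
    if n_slices < depth:
        raise ValueError("volume has fewer slices than the slab depth")
    for z0 in range(0, n_slices - depth + 1, stride):
        yield z0, volume[z0:z0 + depth]
```

For a 16-slice volume with the defaults above, the slabs start at z = 0, 4 and 8, so slice 5 is covered twice.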

The slicing architecture itself is built from U-Nets. One of the aforementioned options also serves as a good data augmentation method:

[..] Special mosaic-type aggregation of training of the nodule identifier has been deployed, as illustrated below.

Prediction of cancer probability

The most important feature is the existence of nodule(s), followed by their size, location and their other characteristics. For instance, a very significant number of patients for which no nodule has been found, proved to be no cancer cases. [..] Key features include existence/size of the largest nodule, and its vertical location, existence of emphysema, volume of all nodules, and their diversity.

The authors also mentioned that encoding the location of nodules relative to the centre of gravity of the segmented lungs is a more significant feature than the conventional upper/lower lung part feature.
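A location feature of this kind could be computed roughly as below. The sketch assumes a binary lung mask and a nodule position given in voxel coordinates (z, y, x); the function names and the normalisation by the lung extent are assumptions, not the authors' definition.

```python
import numpy as np


def centre_of_gravity(mask):
    """Mean (z, y, x) coordinate of all voxels set in a binary mask."""
    coords = np.argwhere(mask)
    if coords.size == 0:
        raise ValueError("empty mask")
    return coords.mean(axis=0)


def relative_nodule_position(lung_mask, nodule_zyx):
    """Offset of a nodule from the lung centre of gravity, scaled by the lung extent.

    The scaling by the bounding-box size is an illustrative choice that makes
    the feature comparable between patients of different sizes.
    """
    coords = np.argwhere(lung_mask)
    extent = coords.max(axis=0) - coords.min(axis=0) + 1
    offset = np.asarray(nodule_zyx, dtype=float) - centre_of_gravity(lung_mask)
    return offset / extent
```

Unlike a binary upper/lower flag, this yields a continuous value per axis.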

As outlined, our combined approach uses the neural network as a feature generator and then applying xgboost and extraTree models on the extracted features to generate predictions and submissions. To make the model performance more stable, we also run some of the models with multiple random seeds (e.g., for xgb, use 50 random runs; for extraTree, use 10 random runs) and take the average. Our final winning submission (private LB0.430) is a linear combination of a couple of xgb models and extraTree models.
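The seed averaging and the linear combination described above can be sketched generically. The helper names and the `fit_predict` callable are hypothetical: the callable stands in for training an xgboost or extraTree model with a given random seed and returning its predicted probabilities. The blending weights actually used by the authors are not given here.

```python
import numpy as np


def seed_average(fit_predict, seeds):
    """Average the predictions returned by fit_predict(seed) over several seeds."""
    preds = [np.asarray(fit_predict(seed), dtype=float) for seed in seeds]
    return np.mean(preds, axis=0)


def linear_blend(predictions, weights):
    """Weighted sum of several prediction vectors; weights must sum to 1."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack([np.asarray(p, dtype=float) for p in predictions])
    if len(weights) != len(stacked):
        raise ValueError("one weight per prediction vector is required")
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    return weights @ stacked
```

Following the description above, `seed_average` would be called with 50 seeds for an xgb model and 10 seeds for an extraTree model, and the averaged vectors would then be passed to `linear_blend`.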

Trained model

Source: nodule_identifiers

Usage instructions: Shize algorithm, Andre algorithm

Model Performance

Training- / prediction time

Test system:

Component Spec Count
CPU C3.8 Intel Xeon  

Training time: days on AWS

Training some of the nodule models took days using high end 12GB GPUs.

Prediction time: unknown, but it must be at most about 14 min per CT, since the 506 CTs were processed within 5 days (5 × 24 × 60 / 506 ≈ 14.2 min)

Model Evaluation

Dataset: Data Science Bowl evaluation dataset

Metric Score
Log Loss 0.43019
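For reference, binary log loss can be computed as in the sketch below. The function is a plain re-implementation for illustration, and the clipping constant `eps` is an assumption.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

A constant prediction of 0.5 for every patient gives ln 2 ≈ 0.693, so lower values such as the score above indicate predictions that carry real information.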

Use cases

When to use this algorithm

  • The nodule detection system seems to be a good contribution to a Concept To Clinic ensemble, for the reasons listed in the comments.

When to avoid this algorithm

  • The nodule detection method provided by the authors requires an inconveniently coarse CT spacing (2x2x2 mm), which may conflict with other pipelines. If high-order interpolation polynomials are employed, the additional spacing conversion may considerably increase computation time.
  • Training from scratch, as mentioned by the authors, may take days for just one of the sliding architectures, even on AWS P2 instances equipped with NVIDIA K80 12GB GPUs.

Adaptation into Concept To Clinic

Porting to Python 3.5+

Andre's part was already written in Python 3.5, whereas Shize used Python 2. The main difficulty seems to be the lack of specified package versions for Shize's part. Nonetheless, Shize's part consists merely of ensembling already extracted features via xgboost and extraTree models, and GPUs are not required.

Porting to run on CPU and GPU

The nodule detector is written in Keras with Theano as the backend, so it should run on CPU out of the box.
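One way to select the device is through environment variables set before Keras or Theano is imported. This is a sketch: the helper name is hypothetical, and `device=gpu` refers to the older Theano CUDA backend (newer Theano releases use `device=cuda` instead).

```python
import os


def configure_theano(use_gpu=False):
    """Set the Keras backend and Theano device via environment variables.

    Must be called before `import keras` or `import theano`, because both
    read these variables at import time.
    """
    device = "gpu" if use_gpu else "cpu"
    os.environ["KERAS_BACKEND"] = "theano"
    os.environ["THEANO_FLAGS"] = "device={},floatX=float32".format(device)
    return os.environ["THEANO_FLAGS"]
```

Calling `configure_theano()` at the top of an entry script keeps the same code path for CPU and GPU runs.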

Improvements on the code base

Adapting the model

It is worth noting that a simpler model consisting of a single xgb performed similarly (0.434 on the private LB). It would therefore be better to drop the cumbersome combination of different xgb and extraTree models, some of which used predictions averaged over 50 or 10 random runs (i.e. 50 or 10 different random seeds).


The whole pipeline relies on the nodule detector, and given that the approach reached 8th place on the DSB17 private LB, the method deserves attention and should be taken into account. Moreover, the authors stated that they did not use the nodule malignancy information, as they incorrectly assumed it was unavailable; training the model from scratch or fine-tuning it on data with malignancy status therefore seems beneficial.

