The Results

I hate it when the final result is used to maintain suspense throughout an article. So here are the results up front:

The fact that there were payouts for the top 10 finishers, and that the competition was for a good cause (beating lung cancer), was also quite motivating.

Preprocessing

The beginning of the competition was focused on data prep. This turned out to be fairly straightforward, and I kept using the preprocessing code that I wrote on the second day of the competition until the very end.

Basically, CT scans are a collection of 2D greyscale slices (regular images). A tricky detail that I found while reading about the LUNA competition is that different CT machines produce scans with different sampling rates in the 3rd dimension.

So, in order to apply the same model to scans of different thickness (and to make a model generalize to new scans), you need to resize the images so that they all have the same resolution.
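
As a rough illustration of that resizing step, here is a minimal sketch that resamples a stack of slices to a common spacing along the scan axis using linear interpolation. The function name `resample_slices`, the 1 mm target, and the nested-list representation are assumptions made for the example, not the code I actually used; in practice a library routine such as `scipy.ndimage.zoom` would probably do this job for all three axes at once.

```python
def resample_slices(slices, spacing_mm, target_mm=1.0):
    """Resample a stack of 2D slices to `target_mm` spacing along the scan axis.

    `slices` is a list of 2D slices (each a list of rows of numbers) that are
    `spacing_mm` apart. New slices are linearly interpolated between neighbours.
    Hypothetical helper for illustration only.
    """
    n = len(slices)
    extent_mm = (n - 1) * spacing_mm          # distance from first to last slice
    new_n = int(extent_mm / target_mm + 1e-9) + 1
    out = []
    for i in range(new_n):
        pos = i * target_mm / spacing_mm      # position in original slice units
        lo = min(int(pos), n - 1)             # slice just below the new position
        hi = min(lo + 1, n - 1)               # slice just above (clamped at the end)
        w = pos - lo                          # interpolation weight towards `hi`
        out.append([
            [(1 - w) * a + w * b for a, b in zip(row_lo, row_hi)]
            for row_lo, row_hi in zip(slices[lo], slices[hi])
        ])
    return out


# Three 1x1 "slices" taken 2 mm apart, resampled to 1 mm spacing.
volume = [[[0.0]], [[10.0]], [[20.0]]]
resampled = resample_slices(volume, spacing_mm=2.0, target_mm=1.0)
print(len(resampled))
```

The same idea applies within each slice when the in-plane pixel spacing differs between machines.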

It contains about additional CT scans. This is from a small 3D chunk of a full scan. To put this nodule in context, look at the first big. See, finding nodules in a CT scan is hard for a computer.

An average CT scan is 30 x 30 x 40 centimeters, while an average nodule is 1 cubic centimeter. For an automated system with zero knowledge of human anatomy (and actually zero prior knowledge at all), figuring out which one or two areas in a scan really matter is a very hard task.
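
To make the needle-in-a-haystack problem concrete, the arithmetic can be spelled out in a few lines. This is only a back-of-the-envelope sketch, and it assumes the nodule figure means a volume of one cubic centimeter.

```python
# Rough volume comparison between a whole scan and a single nodule.
scan_dims_cm = (30, 30, 40)      # average scan size quoted above
nodule_volume_cm3 = 1            # assumption: "1cm cubed" read as 1 cubic centimeter

scan_volume_cm3 = scan_dims_cm[0] * scan_dims_cm[1] * scan_dims_cm[2]
fraction = nodule_volume_cm3 / scan_volume_cm3

print(f"scan volume: {scan_volume_cm3} cm^3")
print(f"nodule share of the scan: {fraction:.6%}")
```

In other words, under these numbers the part of the scan that matters is roughly one part in 36,000 of the total volume.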

So this LUNA data was very important. However, it is extremely relevant to the task of predicting cancer diagnosis. Ultimately this means that I had nodules and radiologist estimations of their properties. The properties that I chose to use were:

Figuring out that the LIDC dataset had malignancy labels turned out to be one of the biggest separators between teams in the top 5 and the top

A month into the competition, someone made a submission to the stage 1 leaderboard that was insanely good.

I assumed they had discovered some great additional source of data, so I dug around more and found the LIDC malignancy labels! Julian proceeded in much the same way, and independently discovered and used the LUNA and malignancy annotations.

For at least a week I tried a few things without success, namely:

At first I built a model using 64 mm cube chunks, where the model was trained to predict the probability of a given chunk containing a nodule.

To make sure not to miss any parts, the model needs to be scored a few hundred times. I then aggregated these with some simple stats like max, stdev, and the location of the max probability prediction.
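
The scan-and-aggregate idea can be sketched roughly as follows. The helper names, the 48 mm stride, and the use of the population standard deviation are assumptions chosen for illustration; the per-chunk probabilities would come from the trained nodule model, which is not shown here.

```python
import itertools
import statistics


def chunk_origins(scan_shape_mm, chunk_mm=64, stride_mm=48):
    """Return the corner coordinates of overlapping cubes that cover the scan.

    Assumes the scan has been resampled to 1 mm voxels, so mm and voxel
    indices coincide. The stride is a hypothetical choice.
    """
    axes = []
    for size in scan_shape_mm:
        last = max(size - chunk_mm, 0)          # last valid start on this axis
        starts = list(range(0, last + 1, stride_mm))
        if starts[-1] != last:
            starts.append(last)                 # make sure the far edge is covered
        axes.append(starts)
    return list(itertools.product(*axes))


def aggregate(origins, probs):
    """Summarise per-chunk nodule probabilities into a few scan-level features."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "max": probs[best],
        "stdev": statistics.pstdev(probs),      # population stdev (an assumption)
        "argmax_origin": origins[best],         # where the most suspicious chunk is
    }


origins = chunk_origins((300, 300, 400))
print(len(origins))  # number of model evaluations needed for one scan
```

With a 300 x 300 x 400 mm scan and these settings the model is scored 288 times per scan, which is in line with the "few hundred" evaluations mentioned above.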

This model is trained and validated on the Kaggle DSB dataset. After doing some initial tests on the training set (cross validation), I was expecting my leaderboard score to be around 0.

