What to do when you don’t have all the data?

Missing data

A physicist might say that the world is an approximation of their equations. In the same way, many machine learning models assume that the data is “ideal”: complete and regularly sampled. The real-world data available today rarely is.

A lot of data today is recorded by medical institutions as time series. This information can be very important and should be taken advantage of if possible.

Medical institutions have been collecting data for decades, but not necessarily in the same “format”: measurements can be taken at different time intervals, and different properties may be measured each time. This results in “missing data”. Take doctor visits, for example: how often do you go to the doctor? I bet it is not every single Monday throughout the year. I also doubt the doctor measures the same things at every visit. You would probably find it somewhat inconvenient if you caught a cold and the doctor insisted on doing a colonoscopy just to collect data.
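To make this concrete, here is a minimal sketch of how irregular visits turn into a table with gaps. The visit days, the measurement names and the values are all made up for illustration; they are not real patient data.

```python
import numpy as np

# Hypothetical visits: (day, {measurement: value}). Purely illustrative.
visits = [
    (0, {"temperature": 38.5}),
    (3, {"temperature": 37.2, "crp": 12.0}),
    (10, {"crp": 4.0}),
]
features = ["temperature", "crp"]

# One row per visit, one column per feature; NaN marks "not measured".
values = np.full((len(visits), len(features)), np.nan)
for row, (_, measured) in enumerate(visits):
    for name, value in measured.items():
        values[row, features.index(name)] = value

# Mask: 1 where a value was observed, 0 where it is missing.
mask = (~np.isnan(values)).astype(int)

# The visit days are kept too, since the gaps between them are uneven.
days = np.array([day for day, _ in visits])
```

Half of the cells in this tiny table are already empty, and the rows are 3 and 7 days apart, which is exactly the situation most models are not built for.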

While measuring patients more extensively could be an option, it wouldn’t let us apply machine learning to the vast amount of data that already exists. We need to deal with the missing values somehow.


Recurrent Neural Networks

In deep learning, recurrent neural networks (RNNs) are networks built from cells that remember a state over time.

RNNs are a special class of neural networks characterized by internal self-connections. The output of an RNN is computed recursively from the current input and the past internal state, which stores a decaying memory of past inputs. In short, it takes a value at a certain point in time and predicts a new value based on values from earlier points in time. This makes RNNs very suitable for working with time series. Model parameters are commonly trained with gradient descent, but other deep learning training techniques can be used as well.
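The recursion described above can be sketched in a few lines of NumPy. This is a plain (Elman-style) RNN with random, untrained weights; the sizes and the names `rnn_step` and `run_rnn` are chosen for illustration only.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a plain RNN: new state from current input and old state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def run_rnn(inputs, W_x, W_h, b):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = np.zeros(W_h.shape[0])  # the memory starts out empty
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return np.stack(states)

# Illustrative sizes and random (untrained) weights.
rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 2, 4, 5
W_x = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

series = rng.normal(size=(n_steps, n_in))
states = run_rnn(series, W_x, W_h, b)  # shape: (n_steps, n_hidden)
```

Each row of `states` depends on every earlier input through `h`, which is the "memory" of the network. Note that this cell expects a complete input vector at every step, so it has no built-in way of handling missing values.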


Image from wildml.com

Gated Recurrent Unit Decay Cell

One approach is explored in depth by Zhengping Che et al. in their paper Recurrent Neural Networks for Multivariate Time Series with Missing Values. They model this kind of data as time series and predict the missing intermediate values by learning how these values change over time. The paper proposes a specific cell called “Gated Recurrent Unit Decay” (or GRU-D for short). It is much like the GRU (introduced by Cho et al. in their paper), but the missing data is estimated using decay weights, time intervals, the last observed value and the overall mean of the value. The machine-learning-inclined reader is encouraged to read the paper.

To see how GRU-D compares to the normal GRU, the paper uses this image:


Image from the GRU-D paper by Zhengping Che et al.

The mask is a way to determine whether or not the data is missing and the gammas are the decay weights.
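To give a feel for how the mask, the time intervals and the gammas work together, here is a rough NumPy sketch of the input-decay part of GRU-D. It follows the formulas in the paper as we read them, but it is not the authors' code: in the real cell the decay parameters are learned together with the rest of the network and the hidden state is decayed as well, while here `w`, `b`, the mean and the data are fixed, made-up numbers.

```python
import numpy as np

def time_since_last_observation(timestamps, mask):
    """delta[t, d]: time elapsed at step t since feature d was last observed."""
    delta = np.zeros(mask.shape, dtype=float)
    for t in range(1, len(timestamps)):
        gap = timestamps[t] - timestamps[t - 1]
        # Observed at the previous step: restart the clock. Otherwise: keep counting.
        delta[t] = np.where(mask[t - 1] == 1, gap, gap + delta[t - 1])
    return delta

def decay_impute(values, mask, delta, w, b, mean):
    """Fill missing inputs by decaying the last observed value towards the mean."""
    gamma = np.exp(-np.maximum(0.0, w * delta + b))  # decay weights in (0, 1]
    filled = np.empty_like(values)
    last = mean.astype(float).copy()  # nothing observed yet: fall back to the mean
    for t in range(len(values)):
        estimate = gamma[t] * last + (1.0 - gamma[t]) * mean
        filled[t] = np.where(mask[t] == 1, values[t], estimate)
        last = np.where(mask[t] == 1, values[t], last)
    return filled, gamma

# One feature measured at t=0 and t=4, missing in between (made-up numbers).
timestamps = np.array([0.0, 1.0, 3.0, 4.0])
values = np.array([[10.0], [np.nan], [np.nan], [6.0]])
mask = np.array([[1], [0], [0], [1]])

delta = time_since_last_observation(timestamps, mask)
filled, gamma = decay_impute(
    values, mask, delta,
    w=np.array([0.5]), b=np.array([0.0]),  # fixed here, learned in GRU-D
    mean=np.array([4.0]),                  # assumed overall mean of the feature
)
```

The longer a value has been missing, the smaller its gamma becomes and the closer the estimate drifts from the last observed value (10) towards the overall mean (4). Observed values are passed through untouched.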

This has been tested on various datasets and shows really promising results!

Analysed time series data

We applied the GRU-D technique mainly to two datasets: the Physionet 2010 Challenge and data from the University Hospital of North Norway.

Colorectal cancer surgery prediction by blood samples

Using GRU-D to predict wound infections from blood sample data of patients who have undergone surgery for colorectal cancer proved very successful. The GRU-D method yielded substantially better results than the original GRU.


The Physionet 2010 challenge data contains information about patients who have spent 48 hours in a hospital ICU. We attempted to predict the likelihood of mortality from this time series data and got some promising results. GRU-D again performed better than the GRU. However, this dataset is very unbalanced, which presents challenges. It would be very interesting to experiment with class balancing methods and possibly achieve even better results.
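As an illustration of what such a class balancing method could look like, here is a small sketch of inverse-frequency class weights plugged into a weighted log loss. This is not something we have tried on the data. The labels below are made up, and the weighting formula is the same heuristic that scikit-learn uses for `class_weight="balanced"`.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def weighted_log_loss(y_true, p_pred, class_weights, eps=1e-12):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1.0 - p))
    sample_w = np.array([class_weights[int(y)] for y in y_true])
    return float(np.sum(sample_w * per_sample) / np.sum(sample_w))

# Hypothetical labels: 1 = died, 0 = survived (made-up 10% positive rate).
labels = np.array([0] * 9 + [1])
weights = balanced_class_weights(labels)

# Loss of a model that always predicts 0.5, just to exercise the function.
loss = weighted_log_loss(labels, np.full(len(labels), 0.5), weights)
```

With these weights, the single positive example counts as much in the loss as the nine negative ones combined, so a model can no longer score well by simply predicting the majority class.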