Stratified k-fold with Keras

I was working for laughs on a (somewhat) private Kaggle competition held for a class a friend was enrolled in. I decided to try my hand at it, since it was a chance to pick up some experience with the Keras functional Model API instead of the Sequential one.

Frankly, I did really poorly. My initial submission was a painful case of overfitting and landed me somewhere at position 40/60. Who would’ve thought, Deep Learning can’t just waltz in everywhere and take the top spot!

At any rate, I found that the validation split and its results did not accurately reflect performance on the test set (a bad case of overfitting). Given it was a relatively small, simple dataset (4700 training examples with 500-ish features), I decided it might be interesting to implement cross-validation and see where that would get me, despite using a deep learning setup.

Intermezzo: k-fold cross-validation

I’m going to assume you’re at least vaguely familiar with cross-validation as a principle, and I’ll just briefly explain what k-fold (and its stratified cousin) entail.

Brief disclaimer: I dabble in statistics for laughs, and don’t actually fully grasp most of its concepts. I’ll try to explain to the best of my abilities, but there’s a very real chance I get things wrong. Make sure to check Google. You’ve been warned.

k-fold cross-validation basically means the following:

  • Take all of your labeled data and divide it into K batches
  • Train your model on K-1 of those batches
  • Validate on the last, remaining batch
  • Repeat K times, so that each batch serves as the validation set exactly once
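The steps above can be sketched in a few lines of plain numpy. This is just a toy illustration of the idea (the function name and seed are mine, and it is not how sklearn implements it):

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and cut them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    # Each fold takes one turn as the validation set
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# 100 samples, 10 folds: every sample lands in a validation fold exactly once
splits = list(kfold_indices(100, k=10))
```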

Stratified k-fold means that you also look at the relative distribution of your classes: if one class/label appears more often than another, stratified k-fold makes sure that imbalance is reflected in each batch it creates.

How do you choose K? Honest answer: I don’t know. The default appears to be 3 in sklearn, and most of the industry seems to use 10. I stuck to 10. Let me know if you find a better strategy.
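To see what the stratification actually buys you, here’s a quick check on toy data (the 80/20 class imbalance is my own choice) showing that every validation fold preserves the class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced dataset: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # the features don't matter for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for _, val_idx in skf.split(X, y):
    # Every validation fold keeps the 80/20 ratio: 8 zeros, 2 ones
    print(np.bincount(y[val_idx]))  # → [8 2]
```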

Implementing (stratified) k-fold in Keras

Of course, I wanted to implement this in Keras. A good starting point was this issue from the Keras GitHub repo. However, the sklearn API for stratified k-fold has since been updated (StratifiedKFold moved from sklearn.cross_validation to sklearn.model_selection), so we’ll have to adapt the code to the new interface (not very hard though).

Let’s discuss what the sklearn.model_selection.StratifiedKFold class does:

  • It takes as input the number of folds (n_splits) you want; for our case, 10. It can also shuffle the data first (shuffle=True), with an optional random_state you can set for reproducibility or leave at its default.
  • Once instantiated, its split() method takes our data as numpy arrays, X, and its labels, y, and returns the indices of the batches in the original training data. This may seem like a mindfuck at first, but look at how elegantly numpy lets us use those indices to generate the batches (right after the first print statement)!
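A tiny example (toy shapes of my own choosing) makes the index-juggling concrete: split() yields pairs of integer index arrays into the original data, and numpy fancy indexing turns them back into batches:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(12).reshape(6, 2)   # 6 toy samples, 2 features each
y = np.array([0, 0, 0, 1, 1, 1])  # balanced toy labels

skf = StratifiedKFold(n_splits=3)
train_idx, val_idx = next(iter(skf.split(X, y)))
print(train_idx, val_idx)  # integer row indices, not the rows themselves
xval = X[val_idx]          # fancy indexing pulls out the actual batch
```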

That means that all we have to do is instantiate the validator, have it spit out the indices, generate batches from those indices, and feed the batches to our model. I’ve slightly adapted the code from the GitHub issue linked above, and brought it from Python 2 up to Python 3.

from sklearn.model_selection import StratifiedKFold

kfold_splits = 10
# Instantiate the cross validator
skf = StratifiedKFold(n_splits=kfold_splits, shuffle=True)
# Loop through the indices the split() method returns
for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
    print("Training on fold " + str(index + 1) + "/" + str(kfold_splits) + "...")
    # Generate batches from indices
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]
    # Clear model, and create it
    model = None
    model = create_model()
    # Debug message I guess
    # print("Training new iteration on " + str(xtrain.shape[0]) + " training samples, " + str(xval.shape[0]) + " validation samples, this may be a while...")
    history = train_model(model, xtrain, ytrain, xval, yval)
    accuracy_history = history.history['acc']
    val_accuracy_history = history.history['val_acc']
    print("Last training accuracy: " + str(accuracy_history[-1]) + ", last validation accuracy: " + str(val_accuracy_history[-1]))

Some additional explanations:

  • The create_model() function is one I wrote myself. All it does is create a model in Keras, compile it, and return the compiled model.
  • I also wrote train_model() myself. All that happens in there is that I call fit() multiple times, with different learning rates and numbers of epochs, and return the History object of the last fit() call. More information on the History object returned by fit() can be found in the Keras documentation. I use it to read back my last accuracy, because I usually set verbose=0 when calling fit(); you may want to do that differently.
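For concreteness, here’s a minimal sketch of what those two helpers might look like, using the functional Model API. The architecture, optimizer settings, and epoch counts below are placeholders of my own choosing, not the ones from the competition:

```python
import numpy as np
from tensorflow import keras

def create_model(input_dim=500, n_classes=2):
    """Build and compile a small dense network (placeholder architecture)."""
    inputs = keras.Input(shape=(input_dim,))
    hidden = keras.layers.Dense(64, activation="relu")(inputs)
    outputs = keras.layers.Dense(n_classes, activation="softmax")(hidden)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["acc"])
    return model

def train_model(model, xtrain, ytrain, xval, yval):
    """Fit in two stages, lowering the learning rate for the second one,
    and return the History object of the last fit() call."""
    model.fit(xtrain, ytrain, validation_data=(xval, yval), epochs=2, verbose=0)
    # Recompile with a smaller learning rate for the second stage
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["acc"])
    return model.fit(xtrain, ytrain, validation_data=(xval, yval),
                     epochs=2, verbose=0)
```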

That’s all there is to it! I might upload the final Jupyter notebook to my GitHub later on; if so, I’ll update this post. Additionally, if you’re not into stratified k-fold, I’m relatively confident you can do this with vanilla k-fold without changing much.
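For the vanilla version, the swap really is minimal: KFold exposes the same split() interface (it simply ignores the class labels when slicing), so the loop body stays identical. A quick sketch with random toy data:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Same loop shape as with StratifiedKFold, different cross validator
kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_indices, val_indices in kf.split(X, y):
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]
```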