I was working, for laughs, on a (somewhat) private Kaggle competition held for a class a friend was enrolled in. I decided to try my hand at it, given that I could pick up some new experience using the Keras `Model` API.
Frankly, I did really poorly. My initial submission was a painful case of overfitting and landed me somewhere around position 40/60. Who would’ve thought, Deep Learning can’t just waltz in everywhere and take the top spot!
At any rate, I found that the validation split and its results did not accurately reflect performance on the test set (bad case of overfitting). Given it was a relatively small, simple dataset (4700 training examples with 500-ish features), I decided it might be interesting to implement cross validation, and see where that would get me, despite using a deep learning setup.
I’m going to assume you’re at least vaguely familiar with cross-validation as a principle, and I’ll just briefly explain what k-fold (and its stratified cousin) entails.
Brief disclaimer: I dabble in data statistics for laughs, and don’t actually fully grasp most of its concepts. I’ll try to explain to the best of my abilities, but there’s a very real chance I get things wrong. Make sure to check Google. You’ve been warned.
k-fold cross-validation basically means the following:

- Split your training data into k equally sized parts ("folds").
- Train k models in turn: each one holds out a different fold as its validation set and trains on the remaining k-1 folds.
- Average the k validation scores to get a performance estimate that is far more robust than a single train/validation split.
Stratified k-fold means that you also look at the relative distribution of your classes: if one class/label appears more often than another, stratified k-fold will make sure to represent that imbalance when it creates batches.
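To see what stratification buys you, here’s a tiny self-contained sketch (toy data and class counts of my own choosing, not the competition set) showing that every validation fold preserves the original 90/10 class balance:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced dataset: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Every validation fold keeps the 90/10 balance: 18 zeros, 2 ones
    print(np.bincount(y[val_idx]))  # -> [18  2] for each fold
```

With plain `KFold` on the same data, a shuffled fold could easily end up with zero or four samples of the minority class, which would make its validation score meaningless.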
How do you choose a k? Honest answer: I don’t know. The default appears to be 3 in sklearn, and most of the industry seems to use 10. I stuck to 10. Let me know if you find a better strategy.
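One way to at least frame the choice: a larger k means each model trains on more data, but you also have to fit more models. A quick back-of-the-envelope calculation for the post’s roughly 4700 training examples:

```python
n_samples = 4700  # approximate size of the training set from the post

for k in (3, 5, 10):
    val_size = n_samples // k
    print("k=%2d: train on ~%d, validate on ~%d, fit %d models"
          % (k, n_samples - val_size, val_size, k))
    # k=10: train on ~4230, validate on ~470, fit 10 models
```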
Of course, I wanted to implement this in Keras. A good starting point was this issue from the Keras GitHub repo. However, it looks like the sklearn package was updated in the meantime (stratified k-fold now lives in `sklearn.model_selection`), so we’ll have to work with its new API (not very hard, though).
Let’s discuss what the `sklearn.model_selection.StratifiedKFold` class does. Its `split()` method takes our data as numpy arrays — the samples `X` and their labels `y` — and returns the indices of the batches in the original training data. This may seem like a mindfuck at first, but look at how elegantly numpy lets us use those indices to generate the batches (right after the first print statement)!
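Before the full training loop, here’s a tiny standalone demo (toy `X` and `y` of my own invention) of what `split()` actually hands back, and how numpy’s fancy indexing turns those index arrays into batches:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for the real training data: 6 samples, 2 balanced classes
X = np.array([[10], [20], [30], [40], [50], [60]])
y = np.array([0, 0, 0, 1, 1, 1])

skf = StratifiedKFold(n_splits=3)
for train_indices, val_indices in skf.split(X, y):
    # split() yields index arrays, not the data itself...
    print(train_indices, val_indices)
    # ...and fancy indexing turns them straight into batches
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]
```

Each of the three iterations produces a validation batch of 2 samples (one per class, thanks to stratification) and a training batch of the remaining 4.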
That means that all we have to do is instantiate the validator, have it spit out the indices, generate batches from the indices, and feed those batches to our model. I’ve slightly adapted the code from the GitHub issue linked above. The code is in Python 2; it shouldn’t be hard to adapt, though.
```python
from sklearn.model_selection import StratifiedKFold

# Instantiate the cross validator
skf = StratifiedKFold(n_splits=kfold_splits, shuffle=True)

# Loop through the indices the split() method returns
for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
    print "Training on fold " + str(index + 1) + "/" + str(kfold_splits) + "..."

    # Generate batches from indices
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]

    # Clear model, and create it
    model = None
    model = create_model()

    # Debug message I guess
    # print "Training new iteration on " + str(xtrain.shape) + " training samples, " + str(xval.shape) + " validation samples, this may be a while..."

    history = train_model(model, xtrain, ytrain, xval, yval)

    accuracy_history = history.history['acc']
    val_accuracy_history = history.history['val_acc']
    print "Last training accuracy: " + str(accuracy_history[-1]) + ", last validation accuracy: " + str(val_accuracy_history[-1])
```
Some additional explanations:

- The `create_model()` function is one I wrote myself. All it does is create a model in Keras, compile it, and return the compiled model.
- I also wrote `train_model()` myself. The only thing that happens in there is that I call `model.fit()` multiple times, with different learning rates and numbers of epochs. It returns the `history` object of its last `model.fit()` operation.
- More information on the `history` object returned by the `fit()` method can be found in the Keras documentation. I use this object to read off my last accuracies because of how I usually set up `fit()`; you may want to do that differently.
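One thing the loop above doesn’t do is aggregate the per-fold results into a single score, which is the whole point of cross-validation. A minimal Python 3 sketch; the `val_accuracies` list is a made-up stand-in for the final validation accuracies you would collect inside the loop:

```python
import numpy as np

# Hypothetical final validation accuracies, one per fold
# (these numbers are made up purely for illustration)
val_accuracies = [0.81, 0.78, 0.83, 0.80, 0.79,
                  0.82, 0.77, 0.84, 0.80, 0.81]

# The cross-validation estimate is the mean across folds;
# the standard deviation hints at how stable that estimate is
cv_mean = np.mean(val_accuracies)
cv_std = np.std(val_accuracies)
print("CV accuracy: %.3f +/- %.3f" % (cv_mean, cv_std))
# -> CV accuracy: 0.805 +/- 0.021
```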
That’s all there is to it! I might upload the final Jupyter notebook onto my GitHub later on, if so, I’ll update the post here. Additionally, if you’re not into stratified k-fold, I’m relatively confident you can do this with vanilla k-fold without changing much.