MixMatch is a form of data augmentation and pseudo-labeling proposed by Google’s Berthelot et al. in their paper MixMatch: A Holistic Approach to Semi-Supervised Learning (arXiv:1905.02249). Essentially, it combines several already-known semi-supervised techniques, but the paper shows that combining them leads to significant performance improvements on datasets like CIFAR-10 and CIFAR-100.
The basic premise of pseudo-labeling is that there is knowledge in the unlabeled data that we’re not leveraging when we train only on the labeled set. MixMatch blends labeled and unlabeled data in a way that essentially leads to a more regularised model. Too vague? Let’s hop in. I’ll try to provide a more intuitive explanation at the end of this post.
Note: If you’re not familiar with MixUp, an augmentation technique that MixMatch is based on, it’s best you quickly familiarise yourself with it, perhaps by reading the paper?
It’s important to understand that pseudo-labeling (and by extension MixMatch) doesn’t touch the architecture of the network that’s being used. It only touches two parts of the processing pipeline: batch generation and, of course, the loss function. Most of the magic happens during batch generation, though; the loss function is just a combination of plain old cross-entropy and Mean Squared Error (MSE), as we’ll see below.
Essentially, you feed MixMatch two batches, X and U, and it modifies those batches by blending them. X is just your standard batch of training data. U is a batch of your unlabeled data. Let’s forget about X for a second, and focus on what MixMatch does with U.
For each element u in U, MixMatch does the following:

1. Augment u twice, creating u1 and u2 (discarding u). The augments are your standard augments like flips and crops for images.
2. Run u1 and u2 through the model, and average the resulting predictions (just a standard average). Let’s call this average distribution q.
3. Sharpen the average: q = Sharpen(q).
4. Keep the two pseudo-labeled pairs: (u1, with label q) and (u2, also with label q).
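As a rough sketch (not the authors’ code), this label-guessing step might look like the following in PyTorch. Here `model` is assumed to return logits and `augment` is whatever stochastic augmentation you already use; both are placeholder names.

```python
import torch
import torch.nn.functional as F

def sharpen(p, T=0.5):
    # Temperature sharpening: raise each probability to 1/T and renormalise.
    # T < 1 pushes the distribution towards one-hot (lower entropy).
    p = p ** (1.0 / T)
    return p / p.sum(dim=1, keepdim=True)

@torch.no_grad()
def guess_labels(model, u, augment, K=2, T=0.5):
    """Build the pseudo-labeled pairs for one unlabeled batch u."""
    augmented = [augment(u) for _ in range(K)]                 # u1, u2, ...
    preds = [F.softmax(model(ua), dim=1) for ua in augmented]  # model's guesses
    q = torch.stack(preds).mean(dim=0)                         # average distribution
    q = sharpen(q, T)                                          # q = Sharpen(q)
    # Every augmented copy gets the same (sharpened) pseudo-label q.
    return torch.cat(augmented), q.repeat(K, 1)                # -> "pseudoU"
```

Sharpen() here is the “temperature adjustment” that comes up again in the hyper-parameter section further down.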
We’ll throw all of these pseudo-labeled input pairs into one big tensor, pseudoU. Let’s now blend pseudoU with our X (which has also been augmented with your usual augments, crop, flip etc):

1. Concatenate X and pseudoU into one big tensor, and shuffle its order. We’ll call it SuperBatch.
2. Take each element X[i], and mix it with SuperBatch[i]. We discard the individual elements from SuperBatch after we’ve blended them.
3. Take each element of pseudoU, and mix it with what’s left in SuperBatch.

That’s it for MixMatch. We should now have two batches, X_mixedup and U_mixedup. We’ll now feed both of these through the network.
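Here is one way that blending could look in PyTorch (my own sketch, not the reference implementation). It assumes x and y are the augmented labeled batch with one-hot labels, and pseudo_u and q are the tensors from the label-guessing step above:

```python
import numpy as np
import torch

def mixup(x1, p1, x2, p2, alpha=0.75):
    # MixUp as used by MixMatch: the coefficient is clamped with
    # max(lam, 1 - lam) so the result stays closest to (x1, p1).
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * x1 + (1 - lam) * x2, lam * p1 + (1 - lam) * p2

def mix_batches(x, y, pseudo_u, q, alpha=0.75):
    # Step 1: concatenate everything into one "SuperBatch" and shuffle it.
    super_x = torch.cat([x, pseudo_u])
    super_y = torch.cat([y, q])
    perm = torch.randperm(super_x.size(0))
    super_x, super_y = super_x[perm], super_y[perm]

    n = x.size(0)
    # Step 2: mix X with the first part of the SuperBatch...
    x_mixedup, y_mixedup = mixup(x, y, super_x[:n], super_y[:n], alpha)
    # Step 3: ...and mix pseudoU with what's left of it.
    u_mixedup, q_mixedup = mixup(pseudo_u, q, super_x[n:], super_y[n:], alpha)
    return x_mixedup, y_mixedup, u_mixedup, q_mixedup
```

For simplicity this draws a single MixUp coefficient per batch; sampling one per example is also common.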
The loss function is relatively straightforward:

- For X_mixedup, it’s just your usual CrossEntropyLoss.
- For U_mixedup, we apply MSE. The reason for using MSE is that MSE is bounded, and that it doesn’t care about getting the class right (which is important, as it’s only a pseudo-label!).
- The two are combined as loss(X_mixedup) + lambda * loss(U_mixedup).
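A minimal sketch of that combined loss is below (my wording, not the paper’s code). Because the MixUp’d labels are soft distributions rather than class indices, the supervised term is written out by hand instead of calling nn.CrossEntropyLoss directly:

```python
import torch.nn.functional as F

def mixmatch_loss(model, x_mixedup, y_mixedup, u_mixedup, q_mixedup, lambda_u=100.0):
    # Supervised term: cross-entropy against the (now soft, MixUp'd) labels.
    loss_x = -(y_mixedup * F.log_softmax(model(x_mixedup), dim=1)).sum(dim=1).mean()

    # Unsupervised term: bounded MSE between predicted probabilities and
    # the (MixUp'd) pseudo-labels.
    probs_u = F.softmax(model(u_mixedup), dim=1)
    loss_u = F.mse_loss(probs_u, q_mixedup)

    # loss(X_mixedup) + lambda * loss(U_mixedup)
    return loss_x + lambda_u * loss_u
```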
There are quite a few hyper-parameters for MixMatch; thankfully the researchers propose a couple of good starting points:

- They augment each unlabeled element (u) only twice (K = 2).
- The Sharpen() function is actually just a “temperature adjustment”, where they propose to keep the temperature parameter T at 0.5.
- For lambda in the combined loss function, they propose to start out with a value of 100, but it may have to be adjusted depending on the dataset.
- For alpha, which is a parameter that’s relevant to the MixUp operation, they propose to start with 0.75 and adjust depending on the dataset.

My understanding of pseudo-labeling in general is shaky to say the least, but I’ll make an attempt at verbalising my understanding so far. In essence, the properties of any good model aren’t limited to “can it reliably classify an input correctly?”. If we take a step back, we can think of a few more properties we want our model to have:

- It should make consistent predictions for slightly perturbed (augmented) versions of the same input.
- It should make confident, low-entropy predictions rather than hedging across many classes.
Pseudo-labeling (and thus MixMatch) may not be able to help the model learn to correctly classify, but it can help to teach the model to have the above label-independent meta-properties!
- Assigning the same pseudo-label to every augmentation of an element of U makes sure that the model spits out robust predictions across augmentations (note that it doesn’t matter what the model predicts, just that it predicts the same for all augmentations). This is called consistency regularisation.
- Training against the sharpened q nudges the model towards confident, low-entropy predictions. This is what the paper calls entropy minimization.
- You might worry that U could be confusing the legitimate inputs from X through blending. However, MixUp works in such a way that it only adds a “little” of the second mix-in label; it’s still mostly the original X (say 75-80%).
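If you want to convince yourself of that last point, the coefficient trick is easy to simulate. This is just a quick check of the max(lam, 1 - lam) clamping described above, using the suggested alpha = 0.75; the exact average fraction depends on alpha:

```python
import numpy as np

# MixMatch draws the MixUp coefficient from Beta(alpha, alpha) and keeps
# max(lam, 1 - lam), so the first ingredient always makes up at least half
# of the blend.
alpha = 0.75
lam = np.random.beta(alpha, alpha, size=100_000)
lam = np.maximum(lam, 1.0 - lam)

print(f"average weight on the original input: {lam.mean():.2f}")
print(f"share of draws where that weight exceeds 0.7: {(lam > 0.7).mean():.0%}")
```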