MixMatch is a form of data augmentation and pseudo-labeling proposed by Berthelot et al. at Google in their paper MixMatch: A Holistic Approach to Semi-Supervised Learning (arXiv:1905.02249). Essentially, it just combines some already-known semi-supervised techniques, but the paper shows that the act of combining these various techniques leads to significant performance improvements on benchmarks like CIFAR-10 and SVHN.
The basic premise of pseudo-labeling is that there is knowledge in unlabeled data that we’re not leveraging when we train on the labeled set alone. MixMatch blends labeled and unlabeled data in a way that essentially leads to a more regularised model. Too vague? Let’s hop in. I’ll try to provide a more intuitive explanation at the end of this post.
Note: If you’re not familiar with MixUp, the augmentation technique that MixMatch builds on, it’s best to quickly familiarise yourself with it first, perhaps by reading the MixUp paper (arXiv:1710.09412).
It’s important to understand that pseudo-labeling (and by extension MixMatch) doesn’t touch the architecture of the network that’s being used. It only touches two parts of the training pipeline: batch generation and, of course, the loss function. Most of the magic happens during batch generation, though; the loss function is just a combination of two plain old losses, cross-entropy and Mean Squared Error (MSE).
Essentially, you feed MixMatch two batches, `X` and `U`, and it modifies those batches by blending them. `X` is just your standard batch of labeled training data. `U` is a batch of your unlabeled data. Let’s forget about `X` for a second and focus on what MixMatch does with `U`.
For each element in `U`, we’ll call it `u`, MixMatch does the following:

1. Generates two augmented versions of `u` (call them `u1` and `u2`). The augments are your standard augments like flips and crops for images.
2. Feeds both `u1` and `u2` through the model, and averages the resulting predictions (just a standard average). Let’s call this average distribution `q`.
3. Sharpens the distribution: `q = Sharpen(q)` (more on `Sharpen()` below).
4. Creates the pseudo-labeled pairs (`u1`, `q`) and (`u2`, `q`). We’ll throw all of these pseudo-labeled input pairs into one big tensor, `pseudoU`.
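The pseudo-labeling steps above can be sketched in NumPy. This is a minimal sketch, not the paper’s code: `model` and `augment` are placeholders for your network and augmentation pipeline, and the function names are my own.

```python
import numpy as np

def sharpen(q, T=0.5):
    """Temperature sharpening: raise each class probability to the power
    1/T and renormalise. T < 1 pushes q towards a one-hot distribution."""
    q = q ** (1.0 / T)
    return q / q.sum(axis=-1, keepdims=True)

def guess_labels(model, u, augment, K=2):
    """Guess a pseudo-label for a batch of unlabeled inputs u:
    average the model's predictions over K augmentations, then sharpen."""
    q = np.mean([model(augment(u)) for _ in range(K)], axis=0)
    return sharpen(q)
```

Note that `sharpen` operates on the *averaged* distribution, so both augmented copies of `u` end up with the same pseudo-label `q`.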
Let’s now blend `pseudoU` with our `X` (which has also been augmented with your usual augments, crop, flip etc.):

1. Concatenate `X` and `pseudoU` into one big tensor, and shuffle its order. We’ll call it `SuperBatch`.
2. Take each element `X[i]` and mix it (using MixUp) with `SuperBatch[i]`. We discard the individual elements from `SuperBatch` after we’ve blended them.
3. Take each element of `pseudoU`, and mix it with what’s left in `SuperBatch`.
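A minimal sketch of the mixing step, assuming inputs and one-hot labels are NumPy arrays (the `mixup` name and signature are mine). One detail worth highlighting: MixMatch tweaks vanilla MixUp by taking `max(lam, 1 - lam)`, so the result always stays closer to the first argument.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.75):
    """Blend two examples (or batches) and their labels.
    Taking max(lam, 1 - lam) keeps the result dominated by (x1, y1),
    which is the MixMatch-specific tweak to vanilla MixUp."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# X_mixedup[i] blends X[i] with SuperBatch[i];
# U_mixedup blends pseudoU with the remaining SuperBatch elements.
```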
That’s it for MixMatch. We should now have two batches, `X_mixedup` and `U_mixedup`. We’ll now feed both of these through the network.
Computing the loss is relatively straightforward:

- For `X_mixedup`, it’s just your usual cross-entropy loss.
- For `U_mixedup`, we apply MSE. The reason for using MSE is that it’s bounded, and that it punishes a wrong class far less harshly than cross-entropy would (which is important, as the target is only a pseudo-label!).

The combined loss is then `loss(X_mixedup) + lambda * loss(U_mixedup)`.
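Sketched in NumPy (the names are mine), with `p_x`/`p_u` the model’s predicted class probabilities for the two mixed-up batches and `y_x`/`q_u` their (mixed) targets:

```python
import numpy as np

def mixmatch_loss(p_x, y_x, p_u, q_u, lam=100.0):
    """Cross-entropy on the labeled mixed-up batch plus
    lambda-weighted MSE on the unlabeled one."""
    cross_entropy = -np.mean(np.sum(y_x * np.log(p_x + 1e-8), axis=1))
    mse = np.mean((p_u - q_u) ** 2)
    return cross_entropy + lam * mse
```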
There are quite a few hyper-parameters in MixMatch; thankfully the researchers propose a couple of good starting points:

- They augment each unlabeled element (`u`) only twice (`K = 2`).
- The `Sharpen()` function is actually just a “temperature adjustment”, where they propose to keep the temperature parameter `T` at 0.5.
- For `lambda` in the combined loss function, they propose to start out with a value of 100, but it may have to be adjusted depending on the dataset.
- For `alpha`, a parameter that’s relevant to the MixUp operation, they propose to start with 0.75 and adjust depending on the dataset.
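Collected into one place (the dict and key names are my own shorthand; the values are the paper’s suggested starting points):

```python
# Suggested MixMatch starting hyper-parameters;
# dict and key names are my own, not the authors'.
MIXMATCH_DEFAULTS = {
    "K": 2,             # augmentations per unlabeled example
    "T": 0.5,           # Sharpen() temperature
    "lambda_u": 100.0,  # weight of the unlabeled MSE loss term
    "alpha": 0.75,      # Beta(alpha, alpha) parameter for MixUp
}
```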
My understanding of pseudo-labeling in general is shaky to say the least, but I’ll make an attempt at verbalising my understanding so far. In essence, the properties of a good model aren’t limited to “can it reliably classify an input correctly?”. If we take a step back, there are a few more properties we’d like our model to have, and they don’t depend on the labels at all. Pseudo-labeling (and thus MixMatch) may not be able to help the model learn to classify correctly, but it can teach the model these label-independent meta-properties:
- Using the same averaged pseudo-label `q` as the target for every augmentation of `u` makes sure that the model spits out robust predictions across augmentations (note that it doesn’t matter *what* the model predicts, just that it predicts the same thing for all augmentations). This is called consistency regularisation.
- Sharpening `q` encourages the model to output confident, low-entropy predictions. This is what the paper calls entropy minimization.
- You might worry that mixing in elements of `U` could be corrupting the legitimate inputs from `X` through blending. However, MixUp works in such a way that it only adds a “little” of the second mix-in input and label, so the result is still mostly the original `X` example.