MixMatch is a form of data augmentation and pseudo-labeling proposed by Google’s Berthelot et al. in their paper *MixMatch: A Holistic Approach to Semi-Supervised Learning* (arXiv:1905.02249). All it does is essentially combine some already-known techniques for pseudo-labeling, but the paper shows that the act of *combining* these various techniques leads to significant performance improvements on datasets like CIFAR-10 and CIFAR-100.

The basic premise of pseudo-labeling is that there is knowledge in unlabeled data that we’re currently not leveraging by only using the training set. MixMatch blends labeled and unlabeled data in a way that essentially leads to a more regularised model. Too vague? Let’s hop in. I’ll try to provide a more intuitive explanation at the end of this post.

**Note:** If you’re not familiar with MixUp, an augmentation technique that MixMatch is based on, it’s best you quickly familiarise yourself with it, perhaps by reading the paper?

It’s important to understand that pseudo-labeling (and by extension MixMatch) doesn’t touch the architecture of the network that’s being used. It only touches two parts of the processing pipeline: batch generation and of course the loss function. Most of the magic happens during batch generation though, the loss function is just plain old Mean Squared Error (MSE).

Essentially, you feed MixMatch two batches, `X`

and `U`

, and it modifies those batches by blending them. `X`

is just your standard batch of training data. `U`

is a batch of your unlabeled data. Let’s forget about `X`

for a second, and focus on what MixMatch does with `U`

.

For each element in `U`

, we’ll call it `u`

, MixMatch does the following:

- It augments
`u`

twice, creating`u1`

and`u2`

(discarding`u`

). The augments are your standard augments like flips and crops for images. - It puts both
`u1`

and`u2`

through the model, and averages the resulting predictions (just a standard average). Let’s call this average distribution`q`

. - Because it wants to lower entropy, it’s going to sharpen up the distribution:
`q = Sharpen(q)`

- The result is that we now have two pseudo-labeled inputs: (
`u1`

, with label`q`

) and (`u2`

, also with label`q`

). We’ll throw all of these pseudo-labeled input pairs into one big tensor,`pseudoU`

.

Let’s now blend `pseudoU`

with our `X`

(which has also been augmented with your usual augments, crop, flip etc):

- We’ll throw
`X`

and`pseudoU`

into one big tensor, and shuffle its order. We’ll call it`SuperBatch`

. - Now we’ll apply MixUp:
- We loop through each input in
`X[i]`

, and mix it with`SuperBatch[i]`

. We discard the individual elements from`SuperBatch`

after we’ve blended them. - We will also loop through
`pseudoU`

, and mix it with what’s left in`SuperBatch`

. - The paper claims that one of the things that leads to better performance is actually the blending of training data with pseudo labels!

- We loop through each input in

That’s it for MixMatch. We should now have two batches, `X_mixedup`

and `U_mixedup`

. We’ll now feed both of these through the network.

This is relatively straightforward:

- For
`X_mixedup`

, it’s just your usual CrossEntropyLoss - For
`U_mixedup`

, we apply MSE. The reason for using MSE is that MSE is bounded, and that it doesn’t care about getting the class right (which is important, as it’s only a pseudo-label!). - The final loss is just a weighted combination:
`loss(X_mixedup) + lambda * loss(U_mixedup)`

There’s quite a few hyper-parameters for MixMatch, thankfully the researchers propose a couple of good starting points:

- The researchers propose you augment each unlabeled input (
`u`

) only twice (`K`

= 2) - Their proposed augments are just standard crops and flips
- The
`Sharpen()`

function is actually just a “temperature adjustment” where they propose to keep the temperature parameter`T`

at 0.5. - For
`lambda`

in the combined loss function, they propose to start out with a value of 100, but it may have to be adjusted depending on the dataset - For
`alpha`

, which is a parameter that’s relevant to the MixUp operation, they propose to start with 0.75 and adjust depending on the dataset

My understanding of pseudo-labeling in general is shaky to say the least, but I’ll make an attempt at verbalising my understanding so far. In essence, the properties of any good model aren’t limited to “can it reliably classify an input correctly?”. If we take a step back, we can think of a few more properties we want our model to have:

- Predictions of the model should be stable across augmented versions of the same input
- It should be confident in its predictions: one class should have a high predicted probability, while the rest should be negligible in comparison
- The model shouldn’t just memorise its training data, but actually understand what it is being taught

Pseudo-labeling (and thus MixMatch) may not be able to help the model learn to correctly classify, but it can help to teach the model to have the above label-independent meta-properties!

- The data augmentation we apply to
`U`

makes sure that the model spits out robust predictions across augmentations (note that it doesn’t matter**what**the model predicts, just that it predicts the same for all augmentations). This is called*consistency regularisation*. - We teach the model to be confident by sharpening the pseudo-distribution
`q`

. This is what the paper calls*entropy minimization*. - The random blending performed by MixUp means that the inputs change a bit every time, making it impossible for it to simply memorise the training data. This is just
*traditional regularisation*.

At this stage, you might be worried that the guessed pseudo-labels from`U`

could be confusing the legitimate inputs from`X`

through blending. However, MixUp works in such a way that it only adds a “little” of the second mix-in label, it’s still mostly the original`X`

(say 75-80%).