I was recently challenged to apply image matting to the CelebA dataset. If that doesn’t ring a bell, matting is the art of separating a subject from its background. Your intuition is probably telling you that’s something you’d do with Photoshop, and you’d be right. It then shouldn’t be a surprise that Adobe released a paper in 2017 on training a network that does exactly this. I thought it’d be interesting to quickly cover what they propose and how I implemented it, and then spend some time on the results. If you really want a painfully detailed write-up, check out my Jupyter notebook on it (warning: it’s really damn long).
Let’s build up an intuitive understanding of what we’re trying to do here. You’re probably familiar with image segmentation: the network detects various things in an image and colors each pixel based on what type of thing it thinks it is. We could apply this as a first step to the challenge: can a network reliably segment the human foreground from the base class (background)?
You’ll notice it’s not a terribly convincing performance though: the cut-out looks super blocky. That’s because the segmentation is essentially binary: either a pixel is the human/foreground (1), or it’s the background (0). That’s far too coarse when you’re trying to segment thin structures like hair or fur.
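To make the binary case concrete, here’s a minimal sketch of what such a hard cut-out boils down to. The file names and the source of the mask are placeholders; any segmentation model that produces a person/background mask would do.

```python
import numpy as np
from PIL import Image

# Load an RGB image and a binary segmentation mask (person vs. background).
# "portrait.jpg" and "person_mask.png" are placeholder file names.
image = np.asarray(Image.open("portrait.jpg")).astype(np.float32) / 255.0
mask = np.asarray(Image.open("person_mask.png").convert("L")) > 127  # bool, HxW

# A hard cut-out: every pixel is either fully kept or fully discarded,
# which is exactly what produces blocky edges around hair.
cutout = image * mask[..., None]

Image.fromarray((cutout * 255).astype(np.uint8)).save("cutout.png")
```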
Okay, so what if we make it a bit more subtle: we’ll continue to do segmentation, but now there are three classes: background, foreground and uncertain. The first two classes get the easy work, while the uncertain class contains the edge of our subject. At a later stage, we could for example just set the “uncertain” area to half transparency, which would make the edge less brutal. We call this a trimap, where the “tri” part refers to the three classes. Trimaps are fundamental to the approach the DIM paper takes.
The standard images you and I know, like .jpgs, have three channels: Red, Green and Blue (RGB). You’ll probably also have come across 4-channel images, like .pngs. These have an additional channel: alpha. Alpha is different because it doesn’t describe the color of a pixel. Instead, it states how “transparent” that pixel should be. At 1 it is fully visible, while at 0 it is fully transparent and shows the background (whatever that may be). Any value in between lets the background partially “shine through”. A company’s logo will often have this.
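In code, that “shining through” is just a weighted blend between the foreground and whatever sits behind it, which is the standard compositing equation. A minimal sketch, with placeholder file names:

```python
import numpy as np
from PIL import Image

# Each output pixel is a blend of foreground and background, weighted by alpha.
# "logo.png" and "background.jpg" are placeholder file names.
fg = np.asarray(Image.open("logo.png").convert("RGBA")).astype(np.float32) / 255.0
bg = np.asarray(Image.open("background.jpg").convert("RGB")).astype(np.float32) / 255.0
bg = bg[: fg.shape[0], : fg.shape[1]]  # naive crop so the shapes match

alpha = fg[..., 3:4]                   # 1 = fully visible, 0 = fully transparent
composite = alpha * fg[..., :3] + (1.0 - alpha) * bg

Image.fromarray((composite * 255).astype(np.uint8)).save("composited.jpg")
```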
Alright, so what does the paper propose? In short, a network in two parts:
I trained my model on the SegmentAI dataset, which contains 40k images of humans with an alpha layer that separates them from their background: ideal for my challenge.
Armed with this dataset, I proceeded to do the following:
Let’s run the model on some CelebA samples and see how it does.
That’s mildly disappointing. Some thoughts: