.. title: Autoregressive Autoencoders
.. slug: autoregressive-autoencoders
.. date: 2017-10-14 10:02:15 UTC-04:00
.. tags: autoencoders, autoregressive, generative models, MADE, MNIST, mathjax
.. category: 
.. link: 
.. description: A write up on Masked Autoencoder for Distribution Estimation (MADE).
.. type: text

.. |br| raw:: html

   <br />
.. |H2| raw:: html

   <h2>

.. |H2e| raw:: html

   </h2>

.. |H3| raw:: html

   <h3>

.. |H3e| raw:: html

   </h3>

.. |center| raw:: html

   <center>

.. |centere| raw:: html

   </center>
You might think that I'd be bored with autoencoders by now but I still find
them extremely interesting!  In this post, I'm going to be explaining a cute
little idea that I came across in the paper `MADE: Masked Autoencoder for
Distribution Estimation`_.  Traditional autoencoders are great because they
can perform unsupervised learning by mapping an input to a latent
representation.  However, one drawback is that they don't have a solid
probabilistic basis (of course there are other variants of autoencoders that
do, see previous posts `here`__, `here`__, and `here`__).  By using what the
authors define as the *autoregressive property*, we can transform the
traditional autoencoder approach into a fully probabilistic model with very
little modification!  As usual, I'll provide some intuition, math and an
implementation.

.. TEASER_END

|h2| Vanilla Autoencoders |h2e|

The basic `autoencoder`_ is a pretty simple idea.  Our primary goal is to take
an input sample :math:`x` and transform it to some latent dimension :math:`z`
(the *encoder*), which hopefully is a good representation of the original
data.  As usual, we need to ask ourselves: what makes a good representation?
An autoencoder's answer: "*A good representation is one where you can
reconstruct the original input!*"  The process of transforming the latent
dimension :math:`z` back to a reconstructed version of the input
:math:`\hat{x}` is called the *decoder*.  It's an "autoencoder" because it
uses the same value :math:`x` on both the input and the output.  Figure 1
shows a picture of what this looks like.

.. figure:: /images/autoencoder_structure.png
  :width: 400px
  :alt: Vanilla Autoencoder
  :align: center

  Figure 1: Vanilla Autoencoder (source: `Wikipedia`_)

From Figure 1, we typically will use a neural network as the encoder and a
different (usually similar) neural network as the decoder.  Additionally,
we'll typically put a sensible loss function on the output to ensure
:math:`x` and :math:`\hat{x}` are as close as possible:
.. math::

    \mathcal{L}_{\text{binary}}({\bf x}) &= \sum_{i=1}^D -x_i\log \hat{x}_i - (1-x_i)\log(1-\hat{x}_i) \tag{1} \\
    \mathcal{L}_{\text{real}}({\bf x}) &= \sum_{i=1}^D (x_i - \hat{x}_i)^2 \tag{2}

Here we assume that our data point :math:`{\bf x}` has :math:`D` dimensions.
The loss function we use will depend on the form of the data.  For binary
data, we'll use cross entropy, and for real-valued data we'll use the mean
squared error.  These correspond to modelling :math:`x` as a Bernoulli and a
Gaussian respectively (see the box below).

.. admonition:: Negative Log-Likelihoods (NLL) and Loss Functions

    The loss functions we typically use in training machine learning models
    are usually derived from an assumption on the probability distribution of
    each data point (typically assuming independent and identically
    distributed (IID) data).  It just doesn't look that way because we
    typically use the negative log-likelihood as the loss function.  We can do
    this because we're usually just looking for a point estimate
    (i.e. optimizing), so we don't need to worry about the entire
    distribution, just a single point that gives us the highest probability.

    For example, if our data is binary, then we can model it as a
    `Bernoulli`__ with parameter :math:`p` on the interval :math:`(0,1)`.
    The probability of seeing a given 0/1 value :math:`x` is then:

    .. math::

        P(x) = p^x(1-p)^{(1-x)} \tag{3}

    If we take the logarithm and negate it, we get the binary cross entropy
    loss function:

    .. math::

        \mathcal{L}_{\text{binary}}(x) = -x\log p - (1-x)\log(1-p) \tag{4}

    This is precisely the expression from Equation 1, except we replace
    :math:`x` with :math:`x_i` and :math:`p` with :math:`\hat{x}_i`, where
    the former is the observed data and the latter is the estimate of the
    parameter that our model gives.

    Similarly, we can do the same trick with a `normal distribution`__.
    Given an observed real-valued data point :math:`x`, the probability
    density for parameters :math:`\mu, \sigma^2` is given by:
    .. math::

        p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{5}

    Taking the negative logarithm of this function, we get:

    .. math::

        -\log p(x) = \frac{1}{2}\log(2\pi \sigma^2) + \frac{1}{2\sigma^2} (x-\mu)^2 \tag{6}

    Now if we assume that the variance is the same fixed value for all our
    data points, then the only parameter we're optimizing for is :math:`\mu`.
    Adding and multiplying our main expression by constants doesn't change
    where the optimal (highest probability) point occurs, so we can simplify
    (when optimizing) and still get the same point solution:

    .. math::

        \underset{\mu}{\operatorname{argmin}} -\log p(x) = \underset{\mu}{\operatorname{argmin}} \mathcal{L}_{\text{real}}(x) = \underset{\mu}{\operatorname{argmin}} (x-\mu)^2 \tag{7}

    Here our observation is :math:`x`, and our model produces an estimate of
    the parameter :math:`\mu`, i.e. :math:`\hat{x}` in this case.  I have
    some more details on this in one of my previous posts on
    `regularization`__.

|h3| Losing Your Identity |h3e|

Now this is all well and good, but an astute observer will notice that unless
we add some additional constraints, our autoencoder can just set :math:`z=x`
(i.e. the identity function) and generate a perfect reconstruction.  What
better representation for a reconstruction than *exactly* the original data?
This is not desirable because we originally wanted to find a good latent
representation :math:`z`, not just regurgitate :math:`x`!

We can easily solve this, though, by making it difficult to learn the
identity function.  The easiest method is to make the dimensionality of
:math:`z` smaller than that of :math:`x`.  For example, if your image has 900
pixels (30 x 30), then make the dimension of :math:`z`, say, 100.  In this
way, you're "forcing" the autoencoder to learn a more compact representation.
Another method, used in *denoising autoencoders*, is to artificially
introduce noise on the input, :math:`x' = \text{noise}(x)` (e.g. Gaussian
noise), but still compare the output of the decoder with the clean value of
:math:`x`.  The intuition here is that a good representation is robust to any
noise that you might give it.  Again, this prevents the autoencoder from just
learning the identity mapping (because your input is no longer the same as
your output).  In both cases, you will eventually end up with a pretty good
latent representation of :math:`x` that can be used in all sorts of
applications such as `semi-supervised learning`__.

|h3| Proper Probability Distributions |h3e|

Although vanilla autoencoders do pretty well at learning a latent
representation of data in an unsupervised manner, they don't have a proper
probabilistic interpretation.  We put a loss function on the outputs of the
autoencoder in Equations 1 and 2, but that doesn't automatically mean our
autoencoder models a proper distribution of the data!  Let me explain.

Ideally, we would like the unsupervised autoencoder to learn the distribution
of the data.  That is, for each one of our :math:`\bf x` values, we would
like to be able to evaluate the probability :math:`P({\bf x})` to see how
often we would expect to see this data point.  Implicitly, this means that if
we sum over all *possible* :math:`\bf x` values, we should get :math:`1`,
i.e. :math:`\sum_{\bf x} P({\bf x}) = 1`.

For traditional autoencoders, we can show that this property is not
guaranteed.  Consider two samples :math:`\bf x_1` and :math:`\bf x_2`.  Let's
say (regardless of what type of autoencoder we use) our neural network
"memorizes" these two samples and is able to reconstruct them perfectly.
That is, pass :math:`\bf x_1` into the autoencoder and get *exactly*
:math:`\bf x_1` back; pass :math:`\bf x_2` into the autoencoder and get
*exactly* :math:`\bf x_2` back.  If this happened, it would be a good thing
(as long as we had a bottleneck or a denoising autoencoder) because we have
learned a really powerful latent representation that can reconstruct the data
perfectly!
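Here is that memorization scenario in a few lines of NumPy (a sketch of my
own for illustration, not code from the paper; ``binary_xent`` just
implements Equation 1):

```python
import numpy as np

# Two training samples that the autoencoder has "memorized" perfectly.
x1 = np.array([1.0, 0.0, 1.0, 1.0])
x2 = np.array([0.0, 1.0, 0.0, 0.0])

def binary_xent(x, x_hat, eps=1e-12):
    """Equation 1, with a small epsilon to avoid log(0)."""
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return np.sum(-x * np.log(x_hat) - (1 - x) * np.log(1 - x_hat))

# Perfect reconstruction: x_hat = x, so each loss is (essentially) zero.
loss1, loss2 = binary_xent(x1, x1), binary_xent(x2, x2)

# Reading exp(-loss) as a probability assigns BOTH samples P(x) ~= 1,
# so these "probabilities" already sum to ~2 rather than 1.
p1, p2 = np.exp(-loss1), np.exp(-loss2)
```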
However, this implies that the loss from Equation 1 (or 2 in the continuous
case) is :math:`0`.  If we negate and take the exponential to translate it to
a probability, this means both :math:`P({\bf x_1})=1` and
:math:`P({\bf x_2})=1`, which of course is not a valid probability
distribution.  In contrast, if our model did model the data distribution
properly, then we would end up with a fully `generative model`__, where we
could do nice things like sample from it (e.g. generate new images).

For vanilla autoencoders, we started with some neural network and then tried
to apply some sort of probabilistic interpretation that didn't quite work
out.  I like it the other way around: start with a probabilistic model and
then figure out how to use neural networks to help you add more capacity and
scale it.

|h2| Autoregressive Autoencoders |h2e|

So vanilla autoencoders don't quite get us to a proper probability
distribution, but is there a way to modify them to get us there?  Let's
review the `product rule`__:

.. math::

    p({\bf x}) = \prod_{i=1}^{D} p(x_i | {\bf x}_{<i}) \tag{8}
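To see why this factorization automatically yields a valid distribution,
here's a small self-contained sketch (my own, not from the paper): no matter
how the conditionals :math:`p(x_i | {\bf x}_{<i})` are chosen, the joint
probability built from the product rule sums to exactly one over all binary
vectors — precisely the property the vanilla autoencoder lacked.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D = 4  # dimensionality of the binary vector x

# One arbitrary conditional p(x_i = 1 | x_{<i}) for every possible prefix.
# (These numbers are made up; MADE computes them with a masked network.)
cond = {
    (i, prefix): rng.uniform(0.05, 0.95)
    for i in range(D)
    for prefix in itertools.product([0, 1], repeat=i)
}

def p(x):
    """Joint probability via the product rule: prod_i p(x_i | x_{<i})."""
    prob = 1.0
    for i, xi in enumerate(x):
        p1 = cond[(i, x[:i])]  # p(x_i = 1 | the dimensions before i)
        prob *= p1 if xi == 1 else 1.0 - p1
    return prob

# Summing over all 2^D binary vectors gives exactly 1: a proper distribution.
total = sum(p(x) for x in itertools.product([0, 1], repeat=D))
```

The sum telescopes: for any fixed prefix, the two values of the last
dimension contribute :math:`p_1 + (1-p_1) = 1`, and this collapses dimension
by dimension back to :math:`1`.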