A Probabilistic Interpretation of Regularization

Brian Keng

2016-08-29 08:52

This post is going to look at a probabilistic (Bayesian) interpretation of regularization. We'll take a look at both L1 and L2 regularization in the context of ordinary linear regression. The discussion will start off with a quick introduction to regularization, followed by a back-to-basics explanation starting with the maximum likelihood estimate (MLE), then on to the maximum a posteriori estimate (MAP), and finally playing around with priors to end up with L1 and L2 regularization.

Regularization

Regularization is the process of introducing additional information in order to solve ill-posed problems or prevent overfitting. A trivial example is when trying to fit a simple linear regression but you only have one point. In this case, you can't estimate both the slope and intercept (you need at least two points) so any MLE estimate (which only uses the data) will be ill-formed. Instead, if you provide some "additional information" (i.e. prior information 1), you can get a much more reasonable estimate.

To make things a bit more concrete, let's talk about things in the context of a ordinary linear regression. Recall from my previous post on linear regression (Equation 11 in that post) that the maximum likelihood estimate for ordinary linear regression is given by:

\begin{align*} {\bf \hat{\beta}_{\text{MLE}}} &= \arg\min_{\bf \beta} \sum_{i=1}^{n} (y_i- (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2 \\ &= \arg\min_{\bf \beta} \sum_{i=1}^{n} (y_i-\hat{y_i})^2 \tag{1} \end{align*}

The estimate is quite intuitive: pick the coefficients (\(\beta_j\)) that minimizes the squared error between the observed values (\(y_i\)) and those generated by our linear model (\(\hat{y_i}\)).

In a similar vein as above, consider what happens when we only have one data point \((y_0, {\bf x_0})\) but more than one coefficient. There are any number of possible "lines" or equivalently coefficients that we could draw to minimize Equation 1. Thinking back to high school math, this is analogous to estimating the slope and intercept for a line but with only one point. Definitely a problem they didn't teach you in high school. I'm using examples where we don't have enough data but there could other types of issues such as colinearity that may not outright prevent fitting of the model but will probably produce an unreasonable estimate.

Two common schemes for regularization add a simple modification to Equation 1:

\begin{align*} {\bf \hat{\beta}_{\text{L1}}} = \arg\min_{\bf \beta} \big( \sum_{i=1}^{n} (y_i- (\beta_0 + \beta_1 x_{i, 1} + ... + \beta_p x_{i,p}))^2 + \lambda \sum_{j=0}^{p} | \beta_j | \big) \\ \tag{2} \end{align*}

\begin{equation*} {\bf \hat{\beta}_{\text{L2}}} = \arg\min_{\bf \beta} \big( \sum_{i=1}^{n} (y_i-(\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2 + \lambda \sum_{j=0}^{p} | \beta_j | ^2 \big) \tag{3} \end{equation*}

L1 regularization (also known as LASSO in the context of linear regression) promotes sparsity of coefficients. Sparsity translates to some coefficients having values, while others are zero (or closer to zero). This can be seen as a form of feature selection. L2 regularization (also known as ridge regression in the context of linear regression and generally as Tikhonov regularization) promotes smaller coefficients (i.e. no one coefficient should be too large). This type of regularization is pretty common and typically will help in producing reasonable estimates. It also has a simple probabilistic interpretation (at least in the context of linear regression) which we will see below. (You can skip the next two sections if you're already familiar with the basics of MLE and MAP estimates.)

The Likelihood Function

Recall Equation 1 can be derived from the likelihood function (without \(\log(\cdot)\)) for ordinary linear regression:

\begin{align*} \mathcal{L({\bf \beta}|{\bf y})} &:= P({\bf y} | {\bf \beta}) \\ &= \prod_{i=1}^{n} P_Y(y_i|{\bf \beta}, \sigma^2) \\ &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i- (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}} \tag{4} \end{align*}

where \({\bf y}\) are our observed data points (\(y_1, \ldots, y_N\)) and \(P_Y(y_i|\mu, \sigma^2)\) is the probability of observing the data point \(y_i\). The implicit assumption in linear regression is that the data points are normally distributed about the regression line (see my past post on linear regression for more on this).

Classical statistics focuses on maximizing this likelihood function, which usually provides a pretty good estimate -- except when it doesn't like in the case of "small data". However, looking at the problem from a more probabilistic point of view (i.e. Bayesian), we don't just want to maximize the likelihood function but rather the posterior probability. Let's see how that works.

The Posterior

We'll start by reviewing Bayes Theorem where we usually denote the parameters we're trying to estimate by \(\theta\) and the data as \(y\):

\begin{align*} P(\theta | y) &= \frac{P(y | \theta) P(\theta)}{P(y)} \\ \text{posterior} &= \frac{\text{likelihood} \cdot \text{prior}}{\text{evidence}} \tag{5} \end{align*}

In Bayesian inference, we're primarily concerned with the posterior: "the probability of the parameters given the data". Put in another way, we're looking to estimate the probability distribution of the parameters (\(\theta\)) given the data we have observed (\(y\)). Contrast this with classical methods which instead try to find the best parameters to maximize likelihood: the probability of observing data (\(y\)) given a different values of the parameters. Definitely a subtle difference but I think most would agree the Bayesian interpretation is much more natural 2.

Looking at Equation 5 in more detail, we already know how to compute the likelihood but the two new parts are the prior and the evidence. This is where proponents of frequentist statistics usually have a philosophical dilemma. The prior is actually something we (the modeler) explicitly choose that is not based on the data 3 (\(y\)) *gasp*!

Without getting into a whole philosophical spiel, adding some additional prior information is exactly what we want in certain situations! For example when we don't have enough data, we probably have some idea about what is reasonable given our knowledge of the problem. This prior allows us to encode this knowledge. Even in cases where we don't explicitly have this problem, we can choose a "weak" prior which will only bias the result slightly from the MLE estimate. In cases where we have lots of the data, the likelihood dominates Equation 5 anyways so the result will be similar in these cases.

From Equation 5, a full Bayesian analysis would look at the distribution of the parameters (\(\theta\)). However, most people will settle for something a bit less involved: finding the maximum of the posterior (which turns out to be an easier problem in most cases). This is known as the maximum a posteriori probability estimate, usually abbreviated by MAP. This simplifies the analysis and in particular allows us to ignore the evidence (\(P(y)\)), which is constant relative to the parameters we're trying to estimate (\(\theta\)).

Formalizing the MAP estimate, we can write it as:

\begin{align*} {\bf \hat{\theta}_{\text{MAP}}} &= \arg\max_{\bf \theta} P(\theta | y) \\ &= \arg\max_{\bf \theta} \frac{P(y | \theta) P(\theta)}{P(y)} \\ &= \arg\max_{\bf \theta} P(y | \theta) P(\theta) \\ &= \arg\max_{\bf \theta} \log(P(y | \theta) P(\theta)) \\ &= \arg\max_{\bf \theta} \log P(y | \theta) + \log P(\theta) \tag{6} \end{align*}

Notice that we can get rid of the evidence term (\(P(y)\)) because it's constant with respect to the maximization and also take the \(\log\) of the inner function because it's monotonically increasing. Contrast this with the MLE estimate:

\begin{equation*} {\bf \hat{\theta}_{\text{MLE}}} = \arg\max_{\bf \theta} \log P(y | \theta) \tag{7} \end{equation*}

Selecting Priors for Linear Regression

The main idea is to select Bayesian priors on the coefficients of linear regression that get us to L1 and L2 regularization (Equation 2 and 3). Let's see how this works.

Normally Distributed Priors

We'll start with our good old friend the normal distribution and place a zero-mean normally distributed prior on each \(\beta_i\) value, all with identical variance \(\tau^2\). From Equation 6 and filling in the likelihood function from Equation 4 and our prior:

\begin{align*} &\arg\max_{\bf \beta} \Big[ \log \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i- (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}} + \log \prod_{j=0}^{p} \frac{1}{\tau\sqrt{2\pi}}e^{-\frac{\beta_j^2}{2\tau^2}} \Big] \\ &= \arg\max_{\bf \beta} \Big[- \sum_{i=1}^{n} {\frac{(y_i- (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}} - \sum_{j=0}^{p} {\frac{\beta_j^2}{2\tau^2}} \Big]\\ &= \arg\min_{\bf \beta} \frac{1}{2\sigma^2} \big[ \sum_{i=1}^{n} (y_i-(\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2 + \frac{\sigma^2}{\tau^2} \sum_{j=0}^{p} \beta_j^2 \big] \\ &= \arg\min_{\bf \beta} \big[ \sum_{i=1}^{n} (y_i-(\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2 + \lambda \sum_{j=0}^{p} \beta_j^2 \big] \tag{8} \end{align*}

Notice that we dropped many of the constants (with respect to \(\beta\)) and factored a bit to simplify the expression. You can see this is the same expression as Equation 3 (L2 Regularization) with \(\lambda = \sigma^2/\tau^2\) (remember \(\sigma\) is assumed to be constant in ordinary linear regression, and we get to pick \(\tau\) for our prior). We can adjust the amount of regularization we want by changing \(\lambda\). Equivalently, we can adjust how much we want to weight the priors carry on the coefficients (\(\beta\)). If we have a very small variance (large \(\lambda\)) then the coefficients will be very close to 0; if we have a large variance (small \(\lambda\)) then the coefficients will not be affected much (similar to as if we didn't have any regularization).

Laplacean Priors

Let's first review the density of the Laplace distribution (something that's usually not introduced in beginner probability classes):

\begin{equation*} Laplace(\mu, b) = \frac{1}{2b} e^{-\frac{|x-\mu|}{b}} \end{equation*}

This is sometimes called the "double exponential" distribution because it looks like two exponential distributions placed back to back (appropriately scaled with a location parameter). It's also quite similar to our Gaussian in form, perhaps you can see how we can get to L1 regularization already?

Starting with a zero-mean Laplacean prior on all the coefficients like we did in the previous subsection:

\begin{align*} &\arg\max_{\bf \beta} \Big[ \log \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i- (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}} + \log \prod_{j=0}^{p} \frac{1}{2b}e^{-\frac{|\beta_j|}{b}} \Big] \\ &= \arg\max_{\bf \beta} \Big[- \sum_{i=1}^{n} {\frac{(y_i- (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}} - \sum_{j=0}^{p} {\frac{|\beta_j|}{b}} \Big]\\ &= \arg\min_{\bf \beta} \frac{1}{2\sigma^2} \big[ \sum_{i=1}^{n} (y_i-(\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2 + \frac{2\sigma^2}{b} \sum_{j=0}^{p} |\beta_j| \big] \\ &= \arg\min_{\bf \beta} \big[ \sum_{i=1}^{n} (y_i-(\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2 + \lambda \sum_{j=0}^{p} |\beta_j| \big] \tag{9} \end{align*}

Again we can see that Equation 9 contains the same expression as L1 Regularization in Equation 2.

The Laplacean prior has a slightly different effect compared to L2 regularization. Instead of preventing any of the coefficients from being too large (due to the squaring), L1 promotes sparsity. That is, zeroing out some of the coefficients. This makes some sense if you look at the density of a Laplacean prior where there is a sharp increase in the density at its mean.

Another way to intuitively see this is to compare two solutions 4. Let's imagine we are estimating two coefficients in a regression. In L2 regularization, the solution \({\bf \beta} = (1, 0)\) has the same weight as \({\bf \beta} = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})\) so they are both treated equally. In L1 regularization, the same two solutions favor the sparse one:

\begin{equation*} ||(1, 0)||_1 = 1 < ||(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})||_1 = \sqrt{2} \tag{10} \end{equation*}

So L2 regularization doesn't have any specific built in mechanisms to favor zeroed out coefficients, while L1 regularization actually favors these sparser solutions.

Conclusion

L1 and L2 regularization are such intuitive techniques when viewed shallowly as just extra terms in the objective function (i.e. "shrink the coefficients"). However, I was delighted to find out that it also has a Bayesian interpretation, it just seems so much more elegant that way. As I have mentioned before, I'm a big fan probabilistic interpretations of machine learning and you can expect many more posts on this subject!