# Variational Bayes and The Mean-Field Approximation

This post is going to cover variational Bayesian methods and, in particular, the most common one, the mean-field approximation. This is a topic that I've been trying to understand for a while now, but I didn't quite have all the background that I needed. After picking up the main ideas from variational calculus and getting more fluent in manipulating probability statements, as in my EM post, this variational Bayes stuff seems a lot easier.

Variational Bayesian methods are a set of techniques to approximate posterior distributions in Bayesian inference. If this sounds a bit terse, keep reading! I hope to provide some intuition so that the big ideas are easy to understand (which they are), but of course we can't do that well unless we have a healthy dose of mathematics. For some of the background concepts, I'll try to refer you to good sources (including my own), since I find that missing background is the main blocker to understanding this subject (admittedly, the math can sometimes be a bit cryptic too). Enjoy!
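To make "approximate posterior distributions" a bit more concrete, here is the standard textbook statement of the problem. The notation is my assumption, not necessarily what the full post uses: $X$ is the observed data, $Z$ the latent variables, $\mathcal{Q}$ a tractable family of densities, and $M$ the number of groups the latent variables are split into.

$$
q^*(Z) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X)\big),
\qquad
q(Z) = \prod_{i=1}^{M} q_i(Z_i)
$$

The first expression says we pick the member of $\mathcal{Q}$ closest to the true posterior in KL divergence; the second is the mean-field assumption, which restricts $\mathcal{Q}$ to densities that factorize over the groups $Z_1, \dots, Z_M$.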

# The Calculus of Variations

This post is going to describe a specialized type of calculus called variational calculus. Analogous to the usual methods of calculus that we learn in university, this one deals with functions of functions (functionals) and how to minimize or maximize them. It's used extensively in physics problems such as finding the minimum energy path a particle takes under certain conditions. As you can imagine, it's also used in machine learning and statistics, where you want to find a density that optimizes an objective [1]. The explanation I'm going to use (at least for the first part) is heavily based upon Svetitsky's Notes on Functionals, which so far is the most intuitive explanation I've read. I'll try to follow Svetitsky's notes to give some intuition on how we arrive at variational calculus from regular calculus, with a bunch of examples along the way. Eventually we'll get to an application that relates back to probability. I think with the right intuition and explanation, it's actually not too difficult. Enjoy!

# Maximum Entropy Distributions

This post will talk about a method to find the probability distribution that best fits your given state of knowledge. Using the principle of maximum entropy and some testable information (e.g. the mean), you can find the distribution that makes the fewest assumptions about your data (the one with maximal information entropy). As you may have guessed, this is used often in Bayesian inference to determine prior distributions and also (at least implicitly) in natural language processing applications with maximum entropy (MaxEnt) classifiers (i.e. a multinomial logistic regression). As usual, I'll go through some intuition, some math, and some examples. Hope you find this topic as interesting as I do!
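As a small taste of the idea, here is a minimal sketch of the classic loaded-die example: among all distributions on the faces 1 to 6 with a given mean, find the one with maximal entropy. It relies on the standard result that the solution has the form p_i ∝ exp(λ·x_i). The function names (`maxent_die`, `entropy`) and the bisection search for λ are my own choices for illustration, not something taken from the post.

```python
import math


def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum-entropy distribution over `faces` with a fixed mean.

    Assumes the standard exponential-family form p_i ∝ exp(lam * x_i).
    The mean is increasing in lam, so lam can be found by bisection.
    """

    def dist(lam):
        weights = [math.exp(lam * x) for x in faces]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(x * p for x, p in zip(faces, dist(lam)))

    lo, hi = -50.0, 50.0  # wide bracket for lam (an arbitrary choice)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2.0)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Usage: a die whose long-run average is 4.5 instead of the fair 3.5.
loaded = maxent_die(4.5)
print([round(p, 4) for p in loaded], round(entropy(loaded), 4))
```

With a target mean of 3.5 the constraint is uninformative and the sketch recovers the uniform distribution; pushing the mean to 4.5 tilts the probabilities toward the higher faces while keeping the distribution as spread out as the constraint allows.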

# Lagrange Multipliers

This post is going to be about finding the maxima or minima of a function subject to some constraints. This is usually introduced in a multivariate calculus course. Unfortunately (or fortunately?), I never got the chance to take a multivariate calculus course that covered this topic. In my undergraduate program, computer engineers only took three half-year engineering calculus courses, and the fourth one (for electrical engineers) seems to have covered other multivariate calculus topics, such as the theorems of Green, Gauss, and Stokes (I could be wrong though, I never did take that course!). You know what I always imagined Newton saying: "It's never too late to learn multivariate calculus!"

In that vein, this post will discuss one widely used method for finding optima subject to constraints: Lagrange multipliers. The concepts behind it are actually quite intuitive once we come up with the right analogue in physical reality, so as usual we'll start there. We'll work through some problems and hopefully by the end of this post, this topic won't seem as mysterious anymore [1].
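To hint at the geometric intuition, here is a minimal sketch on a toy problem of my own choosing (not from the post): maximize f(x, y) = x + y on the unit circle g(x, y) = x² + y² − 1 = 0. The sketch finds the constrained maximum by brute force and then checks the Lagrange condition ∇f = λ∇g, i.e. that the two gradients are parallel at the optimum. All function names are hypothetical.

```python
import math


def f(x, y):
    return x + y


def grad_f(x, y):
    return (1.0, 1.0)


def grad_g(x, y):
    # Gradient of the constraint g(x, y) = x^2 + y^2 - 1.
    return (2.0 * x, 2.0 * y)


def check_lagrange_condition(n=100_000):
    """Brute-force the constrained maximum, then test grad f = lam * grad g."""
    # Walk around the circle so every candidate satisfies the constraint.
    thetas = (2.0 * math.pi * k / n for k in range(n))
    best = max(thetas, key=lambda t: f(math.cos(t), math.sin(t)))
    x, y = math.cos(best), math.sin(best)

    fx, fy = grad_f(x, y)
    gx, gy = grad_g(x, y)
    cross = fx * gy - fy * gx  # zero exactly when the gradients are parallel
    lam = fx / gx              # the multiplier implied by the first component
    return x, y, lam, cross


x, y, lam, cross = check_lagrange_condition()
print(f"x={x:.4f}, y={y:.4f}, lambda={lam:.4f}, cross={cross:.2e}")
```

The brute-force search never uses the multiplier, yet the gradients line up at the point it finds. Solving ∇f = λ∇g together with the constraint gets you to the same point directly, without the search.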

# The Expectation-Maximization Algorithm

This post is going to talk about the Expectation-Maximization (EM) algorithm, a widely used method to find the maximum likelihood (MLE) or maximum a posteriori (MAP) estimate of parameters in latent variable models. You have probably heard about the most famous variant of this algorithm, the k-means algorithm for clustering. Even though it's so ubiquitous, whenever I've tried to understand why this algorithm works, I never quite got the intuition right. Now that I've taken the time to work through the math, I'm going to attempt to explain the algorithm, hopefully with a bit more clarity. We'll start by going back to the basics with latent variable models and likelihood functions, then move on to the math with a simple Gaussian mixture model [1].
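For a rough feel of what the algorithm does, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture using only the standard library. The function names, the crude initialization (means at the data extremes, unit variances), and the fixed iteration count are all my assumptions for illustration; this is not the post's implementation.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_gmm_1d(data, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture (maximum likelihood)."""
    # Crude initialization (an assumption, not a recommendation).
    mu = [min(data), max(data)]
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component k.
        resp = []
        for x in data:
            w = [pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(w)
            resp.append([wk / total for wk in w])

        # M-step: responsibility-weighted updates of the parameters.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = math.sqrt(var)
            pi[k] = nk / len(data)

    return pi, mu, sigma


# Usage: synthetic data from two well-separated Gaussians.
random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(300)]
data += [random.gauss(3.0, 1.0) for _ in range(300)]
print(em_gmm_1d(data))
```

The sketch skips things a real implementation would need, such as a convergence check on the log-likelihood and guards against a component's variance collapsing to zero. Replacing the soft responsibilities in the E-step with a hard assignment to the nearest mean gives an update that looks a lot like k-means.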