Maximum Entropy Distributions

This post will talk about a method to find the probability distribution that best fits your given state of knowledge. Using the principle of maximum entropy and some testable information (e.g. the mean), you can find the distribution that makes the fewest assumptions about your data (the one with maximal information entropy). As you may have guessed, this is used often in Bayesian inference to determine prior distributions and also (at least implicitly) in natural language processing applications with maximum entropy (MaxEnt) classifiers (i.e. a multinomial logistic regression). As usual, I'll go through some intuition, some math, and some examples. Hope you find this topic as interesting as I do!

Read more…

Lagrange Multipliers

This post is going to be about finding the maxima or minima of a function subject to some constraints. This is usually introduced in a multivariate calculus course, unfortunately (or fortunately?) I never got the chance to take a multivariate calculus course that covered this topic. In my undergraduate class, computer engineers only took three half year engineering calculus courses, and the fourth one (for electrical engineers) seems to have covered other basic multivariate calculus topics such as all the various theorems such as Green's, Gauss', Stokes' (I could be wrong though, I never did take that course!). You know what I always imagined Newton saying, "It's never too late to learn multivariate calculus!".

In that vein, this post will discuss one widely used method for finding optima subject to constraints: Lagrange multipliers. The concepts behind it are actually quite intuitive once we come up with the right analogue in physical reality, so as usual we'll start there. We'll work through some problems and hopefully by the end of this post, this topic won't seem as mysterious anymore 1.

Read more…

The Expectation-Maximization Algorithm

This post is going to talk about a widely used method to find the maximum likelihood (MLE) or maximum a posteriori (MAP) estimate of parameters in latent variable models called the Expectation-Maximization algorithm. You have probably heard about the most famous variant of this algorithm called the k-means algorithm for clustering. Even though it's so ubiquitous, whenever I've tried to understand why this algorithm works, I never quite got the intuition right. Now that I've taken the time to work through the math, I'm going to attempt to explain the algorithm hopefully with a bit more clarity. We'll start by going back to the basics with latent variable models and the likelihood functions, then moving on to showing the math with a simple Gaussian mixture model 1.

Read more…

A Probabilistic Interpretation of Regularization

This post is going to look at a probabilistic (Bayesian) interpretation of regularization. We'll take a look at both L1 and L2 regularization in the context of ordinary linear regression. The discussion will start off with a quick introduction to regularization, followed by a back-to-basics explanation starting with the maximum likelihood estimate (MLE), then on to the maximum a posteriori estimate (MAP), and finally playing around with priors to end up with L1 and L2 regularization.

Read more…

Beyond Collaborative Filtering

I wrote a couple of posts about some of the work on recommendation systems and collaborative filtering that we're doing at my job as a Data Scientist at Rubikloud:

Here's a blurb:

Here at Rubikloud, a big focus of our data science team is empowering retailers in delivering personalized one-to-one communications with their customers. A big aspect of personalization is recommending products and services that are tailored to a customer’s wants and needs. Naturally, recommendation systems are an active research area in machine learning with practical large scale deployments from companies such as Netflix and Spotify. In Part 1 of this series, I’ll describe the unique challenges that we have faced in building a retail specific product recommendation system and outline one of the main components of our recommendation system: a collaborative filtering algorithm. In Part 2, I’ll follow up with several useful applications of collaborative filtering and end by highlighting some of its limitations.

Hope you like it!

I'm Brian Keng, a former academic, current data scientist and engineer. This is the place where I write about all things technical.

Twitter: @bjlkeng

Signup for Email Blog Posts