Hamiltonian Monte Carlo

Here's a topic I thought that I would never get around to learning because it was "too hard". When I first started learning about Bayesian methods, I knew enough that I should learn a thing or two about MCMC since that's the backbone of most Bayesian analysis; so I learned something about it (see my previous post). But I didn't dare attempt to learn about the infamous Hamiltonian Monte Carlo (HMC). Even though it is among the standard algorithms used in Bayesian inference, it always seemed too daunting because it required "advanced physics" to understand. As usual, things only seem hard because you don't know them yet. After having some time to digest MCMC methods, getting comfortable learning more maths (see here, here, and here), all of a sudden learning "advanced physics" didn't seem so tough (but there sure was a lot of background needed)!

This post is the culmination of many different rabbit holes (many much deeper than I needed to go) where I'm going to attempt to explain HMC in simple and intuitive terms to a satisfactory degree (that's the tag line of this blog after all). I'm going to begin by briefly motivating the topic by reviewing MCMC and the Metropolis-Hastings algorithm then move on to explaining Hamiltonian dynamics (i.e., the "advanced physics"), and finally discuss the HMC algorithm along with some toy experiments I put together. Most of the material is based on [1] and [2], which I've found to be great sources for their respective areas.

Read more…

Lossless Compression with Latent Variable Models using Bits-Back Coding

A lot of modern machine learning is related to this idea of "compression", or maybe to use a fancier term "representations". Taking a huge dimensional space (e.g. images of 256 x 256 x 3 pixels = 196608 dimensions) and somehow compressing it into a 1000 or so dimensional representation seems like pretty good compression to me! Unfortunately, it's not a lossless compression (or representation). Somehow though, it seems intuitive that there must be a way to use what is learned in these powerful lossy representations to help us better perform lossless compression, right? Of course there is! (It would be too anti-climatic of a setup otherwise.)

This post is going to introduce a method to perform lossless compression that leverages the learned "compression" of a machine learning latent variable model using the Bits-Back coding algorithm. Depending on how you first think about it, this seems like it should either be (a) really easy or (b) not possible at all. The reality is kind of in between with an elegant theoretical algorithm that is brought down by the realities of discretization and imperfect learning by the model. In today's post, I'll skim over some preliminaries (mostly referring you to previous posts), go over the main Bits-Back coding algorithm in detail, and discuss some of the implementation details and experiments that I did while trying to write a toy version of the algorithm.

Read more…

Lossless Compression with Asymmetric Numeral Systems

During my undergraduate days, one of the most interesting courses I took was on coding and compression. Here was a course that combined algorithms, probability and secret messages, what's not to like? 1 I ended up not going down that career path, at least partially because communications systems had its heyday around the 2000s with companies like Nortel and Blackberry and its predecessors (some like to joke that all the major theoretical breakthroughs were done by Shannon and his discovery of information theory around 1950). Fortunately, I eventually wound up studying industrial applications of classical AI techniques and then machine learning, which has really grown like crazy in the last 10 years or so. Which is exactly why I was so surprised that a new and better method of lossless compression was developed in 2009 after I finished my undergraduate degree when I was well into my PhD. It's a bit mind boggling that something as well-studied as entropy-based lossless compression still had (have?) totally new methods to discover, but I digress.

In this post, I'm going to write about a relatively new entropy based encoding method called Asymmetrical Numeral Systems (ANS) developed by Jaroslaw (Jarek) Duda [2]. If you've ever heard of Arithmetic Coding (probably best known for its use in JPEG compression), ANS runs in a very similar vein. It can generate codes that are close to the theoretical compression limit (similar to Arithmetic coding) but is much more efficient. It's been used in modern compression algorithms since 2014 including compressors developed by Facebook, Apple and Google [3]. As usual, I'm going to go over some background, some math, some examples to help with intuition, and finally some experiments with a toy ANS implementation I wrote. I hope you're as excited as I am, let's begin!

Read more…

Model Explainability with SHapley Additive exPlanations (SHAP)

One of the big criticisms of modern machine learning is that it's essentially a blackbox -- data in, prediction out, that's it. And in some sense, how could it be any other way? When you have a highly non-linear model with high degrees of interactions, how can you possibly hope to have a simple understanding of what the model is doing? Well, turns out there is an interesting (and practical) line of research along these lines.

This post will dive into the ideas of a popular technique published in the last few years call SHapely Additive exPlanations (or SHAP). It builds upon previous work in this area by providing a unified framework to think about explanation models as well as a new technique with this framework that uses Shapely values. I'll go over the math, the intuition, and how it works. No need for an implementation because there is already a nice little Python package! Confused yet? Keep reading and I'll explain.

Read more…

A Note on Using Log-Likelihood for Generative Models

One of the things that I find is usually missing from many ML papers is how they relate to the fundamentals. There's always a throwaway line where it assumes something that is not at all obvious (see my post on Importance Sampling). I'm the kind of person who likes to understand things to a satisfactory degree (it's literally in the subtitle of the blog) so I couldn't help myself investigating a minor idea I read about in a paper.

This post investigates how to use continuous density outputs (e.g. a logistic or normal distribution) to model discrete image data (e.g. 8-bit RGB values). It seems like it might be something obvious such as setting the loss as the average log-likelihood of the continuous density and that's almost the whole story. But leaving it at that skips over so many (interesting) and non-obvious things that you would never know if you didn't bother to look. I'm a curious fellow so come with me and let's take a look!

Read more…

Hi, I'm Brian Keng. This is the place where I write about all things technical.

Twitter: @bjlkeng



Signup for Email Blog Posts