Bounded Rationalityhttp://bjlkeng.github.io/Understanding math, machine learning, and data to a satisfactory degree.enFri, 24 Dec 2021 01:41:16 GMTNikola (getnikola.com)http://blogs.law.harvard.edu/tech/rssHamiltonian Monte Carlohttp://bjlkeng.github.io/posts/hamiltonian-monte-carlo/Brian Keng<div><p>Here's a topic I thought that I would never get around to learning because it was "too hard".
When I first started learning about Bayesian methods, I knew enough that I
should learn a thing or two about MCMC since that's the backbone
of most Bayesian analysis; so I learned something about it
(see my <a class="reference external" href="http://bjlkeng.github.io/posts/markov-chain-monte-carlo-mcmc-and-the-metropolis-hastings-algorithm/">previous post</a>).
But I didn't dare attempt to learn about the infamous Hamiltonian Monte Carlo (HMC).
Even though it is among the standard algorithms used in Bayesian inference, it
always seemed too daunting because it required "advanced physics" to
understand. As usual, things only seem hard because you don't know them yet.
After having some time to digest MCMC methods, getting comfortable learning
more maths (see
<a class="reference external" href="http://bjlkeng.github.io/posts/tensors-tensors-tensors/">here</a>,
<a class="reference external" href="http://bjlkeng.github.io/posts/manifolds/">here</a>, and
<a class="reference external" href="http://bjlkeng.github.io/posts/hyperbolic-geometry-and-poincare-embeddings/">here</a>),
all of a sudden learning "advanced physics" didn't seem so tough (but there
sure was a lot of background needed)!</p>
<p>This post is the culmination of many different rabbit holes (many much deeper
than I needed to go) where I'm going to attempt to explain HMC in simple and
intuitive terms to a satisfactory degree (that's the tag line of this blog
after all). I'm going to begin by briefly motivating the topic by reviewing
MCMC and the Metropolis-Hastings algorithm, then move on to explaining
Hamiltonian dynamics (i.e., the "advanced physics"), and finally discuss the HMC
algorithm along with some toy experiments I put together. Most of the material
is based on [1] and [2], which I've found to be great sources for their
respective areas.</p>
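To make the "advanced physics" a bit more concrete before you dive in, here's a minimal sketch (not from the post itself) of a single HMC step using a leapfrog integrator, targeting a standard normal; the step size and trajectory length are just illustrative choices:

```python
import math
import random

def leapfrog(q, p, grad_U, eps, L):
    """Simulate Hamiltonian dynamics for L leapfrog steps."""
    p -= 0.5 * eps * grad_U(q)            # half step for momentum
    for _ in range(L - 1):
        q += eps * p                      # full step for position
        p -= eps * grad_U(q)              # full step for momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)            # final half step for momentum
    return q, -p                          # negate momentum for reversibility

def hmc_step(q, U, grad_U, eps=0.2, L=10):
    p0 = random.gauss(0, 1)               # resample momentum from N(0, 1)
    q_new, p_new = leapfrog(q, p0, grad_U, eps, L)
    h0 = U(q) + 0.5 * p0 * p0             # Hamiltonian = potential + kinetic
    h1 = U(q_new) + 0.5 * p_new * p_new
    if random.random() < math.exp(h0 - h1):   # Metropolis accept/reject
        return q_new
    return q

# Target: standard normal, so U(q) = q^2 / 2 and grad_U(q) = q.
random.seed(0)
q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q, lambda t: 0.5 * t * t, lambda t: t)
    samples.append(q)
```

Because the leapfrog integrator nearly conserves the Hamiltonian, almost every proposal is accepted even though each one moves far from the current point; that's the whole trick.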
<p><a href="http://bjlkeng.github.io/posts/hamiltonian-monte-carlo/">Read more…</a> (52 min remaining to read)</p></div>BayesianHamiltonianmathjaxMCMCMonte Carlohttp://bjlkeng.github.io/posts/hamiltonian-monte-carlo/Fri, 24 Dec 2021 00:07:05 GMTLossless Compression with Latent Variable Models using Bits-Back Codinghttp://bjlkeng.github.io/posts/lossless-compression-with-latent-variable-models-using-bits-back-coding/Brian Keng<div><p>A lot of modern machine learning is related to this idea of "compression", or
maybe, to use a fancier term, "representations". Taking a huge-dimensional space
(e.g. images of 256 x 256 x 3 pixels = 196608 dimensions) and somehow compressing it into
a representation of 1000 or so dimensions seems like pretty good compression to
me! Unfortunately, it's not a lossless compression (or representation).
Somehow though, it seems intuitive that there must be a way to use what is learned in
these powerful lossy representations to help us better perform <em>lossless</em>
compression, right? Of course there is! (It would be too anti-climactic of a
setup otherwise.)</p>
<p>This post is going to introduce a method to perform lossless compression that
leverages the learned "compression" of a machine learning latent variable
model using the Bits-Back coding algorithm. Depending on how you first think
about it, this <em>seems</em> like it should either be (a) really easy or (b) not possible at
all. The reality is kind of in between with an elegant theoretical algorithm
that is brought down by the realities of discretization and imperfect learning
by the model. In today's post, I'll skim over some preliminaries (mostly
referring you to previous posts), go over the main Bits-Back coding algorithm
in detail, and discuss some of the implementation details and experiments that
I did while trying to write a toy version of the algorithm.</p>
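As a back-of-the-envelope sketch of the accounting (with made-up probabilities, not numbers from the post), the net codelength Bits-Back achieves is the cost of coding x given z plus the cost of z, minus the bits you get back by decoding z off the stream:

```python
import math

log2 = lambda p: math.log(p, 2)

# Hypothetical probabilities for one data point x and one latent z ~ q(z|x)
p_x_given_z = 0.125   # decoder likelihood p(x|z)
p_z = 0.25            # prior p(z)
q_z_given_x = 0.5     # encoder posterior q(z|x)

bits_sent = -log2(p_x_given_z) - log2(p_z)  # code x given z, then code z
bits_back = -log2(q_z_given_x)              # bits recovered from the stream
net_bits = bits_sent - bits_back            # = -log2 p(x, z) + log2 q(z|x)
```

Averaged over z ~ q(z|x), that net cost is exactly the negative ELBO measured in bits, which is why a better-trained latent variable model compresses better.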
<p><a href="http://bjlkeng.github.io/posts/lossless-compression-with-latent-variable-models-using-bits-back-coding/">Read more…</a> (23 min remaining to read)</p></div>asymmetric numeral systemsBits-BackcompressionlosslessmathjaxMNISTvariational autoencoderhttp://bjlkeng.github.io/posts/lossless-compression-with-latent-variable-models-using-bits-back-coding/Tue, 06 Jul 2021 16:00:00 GMTLossless Compression with Asymmetric Numeral Systemshttp://bjlkeng.github.io/posts/lossless-compression-with-asymmetric-numeral-systems/Brian Keng<div><p>During my undergraduate days, one of the most interesting courses I took was on
coding and compression. Here was a course that combined algorithms,
probability and secret messages, what's not to like? <a class="footnote-reference brackets" href="http://bjlkeng.github.io/posts/lossless-compression-with-asymmetric-numeral-systems/#id2" id="id1">1</a> I ended up not going
down that career path, at least partially because communications systems had
their heyday around the 2000s with companies like Nortel and Blackberry and their
predecessors (some like to joke that all the major theoretical breakthroughs
were done by Shannon and his discovery of information theory around 1950). Fortunately, I
eventually wound up studying industrial applications of classical AI techniques
and then machine learning, which has really grown like crazy in the last 10
years or so. Which is exactly why I was so surprised that a <em>new</em> and <em>better</em>
method of lossless compression was developed in 2009 <em>after</em> I finished my
undergraduate degree, when I was well into my PhD. It's a bit mind-boggling that
something as well-studied as entropy-based lossless compression still had
(has?) totally new methods to discover, but I digress.</p>
<p>In this post, I'm going to write about a relatively new entropy-based encoding
method called Asymmetric Numeral Systems (ANS) developed by Jaroslaw (Jarek)
Duda [2]. If you've ever heard of Arithmetic Coding (probably best known for
its use in JPEG compression), ANS runs in a very similar vein. It can
generate codes that are close to the theoretical compression limit
(similar to Arithmetic coding) but is <em>much</em> more efficient. It's been used in
modern compression algorithms since 2014 including compressors developed
by Facebook, Apple and Google [3]. As usual, I'm going to go over some
background, some math, some examples to help with intuition, and finally some
experiments with a toy ANS implementation I wrote. I hope you're as
excited as I am, let's begin!</p>
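As a taste of how simple the core idea is, here's an illustrative toy range-ANS (rANS) encoder/decoder using Python big integers with no renormalization (real codecs keep the state in a fixed-width window); the two-symbol frequency table is an arbitrary choice of mine:

```python
# Toy rANS: state x grows by ~log2(1/p(s)) bits per encoded symbol s.
freqs = {"a": 3, "b": 1}   # symbol frequencies, total M = 4
cum = {"a": 0, "b": 3}     # cumulative counts
M = sum(freqs.values())

def encode(msg, x=1):
    for s in reversed(msg):            # encode in reverse: ANS is a stack
        x = (x // freqs[s]) * M + cum[s] + (x % freqs[s])
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        r = x % M                      # which frequency slot the state is in
        s = next(k for k in freqs if cum[k] <= r < cum[k] + freqs[k])
        out.append(s)
        x = freqs[s] * (x // M) + r - cum[s]   # exact inverse of encode
    return "".join(out)
```

Because encoding pushes symbols onto a stack, encoding the message in reverse makes `decode` pop them back out in the original order: `decode(encode("abaa"), 4)` returns `"abaa"`.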
<p><a href="http://bjlkeng.github.io/posts/lossless-compression-with-asymmetric-numeral-systems/">Read more…</a> (32 min remaining to read)</p></div>Arithmetic Codingasymmetric numeral systemscompressionentropyHuffman codingmathjaxhttp://bjlkeng.github.io/posts/lossless-compression-with-asymmetric-numeral-systems/Sat, 26 Sep 2020 14:37:43 GMTModel Explainability with SHapley Additive exPlanations (SHAP)http://bjlkeng.github.io/posts/model-explanability-with-shapley-additive-explanations-shap/Brian Keng<div><p>One of the big criticisms of modern machine learning is that it's essentially
a black box -- data in, prediction out, that's it. And in some sense, how could
it be any other way? When you have a highly non-linear model with high degrees
of interactions, how can you possibly hope to have a simple understanding of
what the model is doing? Well, turns out there is an interesting (and
practical) line of research along these lines.</p>
<p>This post will dive into the ideas of a popular technique published in the last
few years called <em>SHapley Additive exPlanations</em> (or SHAP). It builds upon
previous work in this area by providing a unified framework to think
about explanation models as well as a new technique within this framework that
uses Shapley values. I'll go over the math, the intuition, and how it works.
No need for an implementation because there is already a nice little Python
package! Confused yet? Keep reading and I'll <em>explain</em>.</p>
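For a bit of intuition ahead of time: Shapley values are easy to compute exactly for tiny games. Here's an illustrative brute-force sketch (exponential in the number of players, which is exactly why SHAP's approximations matter), with a made-up additive toy game:

```python
from itertools import combinations
from math import factorial

def shapley_values(v, players):
    """Exact Shapley values: phi_i is the weighted average of the marginal
    contribution v(S + {i}) - v(S) over all subsets S not containing i."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                S = frozenset(S)
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (v(S | {i}) - v(S))
        phi[i] = total
    return phi

# Toy additive "model": feature 1 contributes 1, feature 2 contributes 2.
v = lambda S: (1.0 if 1 in S else 0.0) + (2.0 if 2 in S else 0.0)
phi = shapley_values(v, [1, 2])
```

For an additive game like this the Shapley values just recover each feature's own contribution; the interesting cases are games with interactions.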
<p><a href="http://bjlkeng.github.io/posts/model-explanability-with-shapley-additive-explanations-shap/">Read more…</a> (26 min remaining to read)</p></div>explainabilitygame theorymathjaxSHAPhttp://bjlkeng.github.io/posts/model-explanability-with-shapley-additive-explanations-shap/Wed, 12 Feb 2020 11:24:22 GMTA Note on Using Log-Likelihood for Generative Modelshttp://bjlkeng.github.io/posts/a-note-on-using-log-likelihood-for-generative-models/Brian Keng<div><p>One of the things that I find is usually missing from many ML papers is how
they relate to the fundamentals. There's always a throwaway line where it
assumes something that is not at all obvious (see my post on
<a class="reference external" href="http://bjlkeng.github.io/posts/importance-sampling-and-estimating-marginal-likelihood-in-variational-autoencoders/">Importance Sampling</a>). I'm the kind of person who likes to
understand things to a satisfactory degree (it's literally in the subtitle of
the blog) so I couldn't help myself investigating a minor idea I read about in
a paper.</p>
<p>This post investigates how to use continuous density outputs (e.g. a logistic
or normal distribution) to model discrete image data (e.g. 8-bit RGB values).
It seems like it might be something obvious such as setting the loss as the
average log-likelihood of the continuous density and that's <em>almost</em> the
whole story. But leaving it at that skips over so many interesting (and
non-obvious) things that you would never know if you didn't bother to look. I'm
a curious fellow so come with me and let's take a look!</p>
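As a sketch of the "almost" part: one standard trick is to integrate the continuous density over each discrete pixel bin. Here's an illustrative discretized-logistic log-PMF (the edge bins are left open so the 256 probabilities sum to exactly one; a real implementation would also guard against underflow far in the tails):

```python
import math

def discretized_logistic_logpmf(x, mu, s):
    """log P(pixel = x) for x in {0, ..., 255}: a logistic(mu, s) density
    integrated over the bin [x - 0.5, x + 0.5] via its CDF (the sigmoid),
    with the two edge bins soaking up the remaining tail mass."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    upper = 1.0 if x == 255 else sigmoid((x + 0.5 - mu) / s)
    lower = 0.0 if x == 0 else sigmoid((x - 0.5 - mu) / s)
    return math.log(upper - lower)
```

Because the bins partition the real line, the CDF differences telescope and the discrete distribution is properly normalized, so the log-likelihoods of different models are directly comparable.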
<p><a href="http://bjlkeng.github.io/posts/a-note-on-using-log-likelihood-for-generative-models/">Read more…</a> (15 min remaining to read)</p></div>generative modelslog-likelihoodmathjaxhttp://bjlkeng.github.io/posts/a-note-on-using-log-likelihood-for-generative-models/Tue, 27 Aug 2019 11:50:09 GMTPixelCNNhttp://bjlkeng.github.io/posts/pixelcnn/Brian Keng<div><p>It's been a long time coming but I'm finally getting this post out! I read
this paper a couple of years ago and wanted to really understand it because it
was state of the art at the time (still pretty close even now). As usual
though, once I started down the variational autoencoder line of posts, there
was always <em>yet</em> another VAE paper to look into so I never got around to
looking at this one.</p>
<p>This post is all about a proper probabilistic generative model called Pixel
Convolutional Neural Networks or PixelCNN. It was originally proposed
as a side contribution of Pixel Recurrent Neural Networks in [1] and later
expanded upon in [2,3] (and I'm sure many other papers). The really cool thing
about it is that it's (a) probabilistic and (b) autoregressive. It's still
counter-intuitive to me that you can generate images one pixel at a time, but
I'm jumping ahead of myself here. We'll go over some background material, the
method, and my painstaking attempts at an implementation (and what I learned
from it). Let's get started!</p>
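As a hint of how the pixel-by-pixel ordering is enforced, here's an illustrative sketch of the masked convolution kernels PixelCNN uses (a type-"A" mask in the first layer, type "B" afterwards):

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """k x k mask for a PixelCNN convolution kernel: ones only at positions
    that come before the centre pixel in raster-scan order. Type "A"
    (first layer) also blocks the centre pixel; type "B" lets it through."""
    m = np.zeros((k, k), dtype=np.float32)
    m[: k // 2, :] = 1.0        # all rows above the centre
    m[k // 2, : k // 2] = 1.0   # same row, strictly left of the centre
    if mask_type == "B":
        m[k // 2, k // 2] = 1.0
    return m
```

Multiplying the convolution weights elementwise by this mask before each forward pass guarantees that the prediction for pixel i only ever sees pixels that precede it, which is what makes the model a valid autoregressive factorization p(x) = ∏ p(x_i | x_&lt;i).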
<p><a href="http://bjlkeng.github.io/posts/pixelcnn/">Read more…</a> (23 min remaining to read)</p></div>autoregressiveCIFAR10generative modelsmathjaxhttp://bjlkeng.github.io/posts/pixelcnn/Mon, 22 Jul 2019 11:11:09 GMTImportance Sampling and Estimating Marginal Likelihood in Variational Autoencodershttp://bjlkeng.github.io/posts/importance-sampling-and-estimating-marginal-likelihood-in-variational-autoencoders/Brian Keng<div><p>It took a while but I'm back! This post is kind of a digression (which seems
to happen a lot) along my journey of learning more about probabilistic
generative models. There's so much in ML that you can't help learning a lot
of random things along the way. That's why it's interesting, right?</p>
<p>Today's topic is <em>importance sampling</em>. It's a really old idea that you may
have learned in a statistics class (I didn't) but somehow is useful in deep learning;
what's old is new again, right? It's relevant here because when
we have a large latent variable model (e.g. a variational
autoencoder), we want to be able to efficiently estimate the marginal likelihood
given data. The marginal likelihood is kind of taken for granted in the
experiments of some VAE papers when comparing different models. I was curious
how it was actually computed and it took me down this rabbit hole. Turns out
it's actually pretty interesting! As usual, I'll have a mix of background
material, examples, math and code to build some intuition around this topic.
Enjoy!</p>
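As a minimal sketch of the idea (using a toy Gaussian model where the marginal is known in closed form, standing in for the VAE case): draw z from a proposal q and average the importance weights p(x|z)p(z)/q(z):

```python
import math
import random

def normal_pdf(t, mu, var):
    return math.exp(-(t - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model: z ~ N(0, 1) and x|z ~ N(z, 1), so the marginal is p(x) = N(0, 2).
x = 1.5
true_px = normal_pdf(x, 0.0, 2.0)

# Proposal q(z) = N(x/2, 1): deliberately wider than the exact posterior
# N(x/2, 1/2). In a VAE, the learned encoder q(z|x) plays this role.
random.seed(1)
n = 100000
total = 0.0
for _ in range(n):
    z = random.gauss(x / 2, 1.0)
    total += normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) / normal_pdf(z, x / 2, 1.0)
est_px = total / n
```

The estimator is unbiased for any proposal that covers the posterior, but its variance depends on how close q is to the true posterior, which is exactly why the encoder of a well-trained VAE makes a good proposal.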
<p><a href="http://bjlkeng.github.io/posts/importance-sampling-and-estimating-marginal-likelihood-in-variational-autoencoders/">Read more…</a> (22 min remaining to read)</p></div>autoencodersautoregressiveCIFAR10generative modelsimportance samplingmathjaxMNISTMonte Carlovariational calculushttp://bjlkeng.github.io/posts/importance-sampling-and-estimating-marginal-likelihood-in-variational-autoencoders/Wed, 06 Feb 2019 12:20:11 GMTLabel Refinery: A Softer Approachhttp://bjlkeng.github.io/posts/label-refinery/Brian Keng<div><p>This post is going to be about a really simple idea that is surprisingly effective
from a paper by Bagherinezhad et al. called <a class="reference external" href="https://arxiv.org/abs/1805.02641">Label Refinery: Improving ImageNet
Classification through Label Progression</a>.
The title pretty much says it all but I'll also discuss some intuition and show
some experiments on the CIFAR10 and SVHN datasets. The idea is both simple and
surprising, my favourite kind of idea! Let's take a look.</p>
<p><a href="http://bjlkeng.github.io/posts/label-refinery/">Read more…</a> (10 min remaining to read)</p></div>CIFAR10label refinerymathjaxresidual networkssvhnhttp://bjlkeng.github.io/posts/label-refinery/Tue, 04 Sep 2018 11:26:02 GMTUniversal ResNet: The One-Neuron Approximatorhttp://bjlkeng.github.io/posts/universal-resnet-the-one-neuron-approximator/Brian Keng<div><p><em>"In theory, theory and practice are the same. In practice, they are not."</em></p>
<p>I read a very interesting paper titled <em>ResNet with one-neuron hidden layers is
a Universal Approximator</em> by Lin and Jegelka [1].
The paper describes a simplified Residual Network as a universal approximator,
giving some theoretical backing to the wildly successful ResNet architecture.
In this post, I'm going to talk about this paper and a few of the related
universal approximation theorems for neural networks.
Instead of going through all the theoretical stuff, I'm simply going to introduce
some theorems and play around with some toy datasets to see if we can get close
to the theoretical limits.</p>
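To make the architecture concrete, here's an illustrative forward pass of the paper's basic building block, a residual block whose hidden layer has a single ReLU unit (the weights and input here are made up):

```python
import numpy as np

def one_neuron_res_block(x, u, V, b):
    """Residual block with one hidden ReLU unit:
    out = x + V * relu(u . x + b), with x, u, V in R^d and scalar bias b."""
    return x + V * max(0.0, float(u @ x) + b)

# Illustrative forward pass with hypothetical weights.
out = one_neuron_res_block(
    np.array([1.0, 2.0]),   # input x
    np.array([1.0, 0.0]),   # input weights u (d -> 1)
    np.array([0.5, 0.5]),   # output weights V (1 -> d)
    -0.5,                   # bias b
)
```

Each block can only nudge the input along one direction, yet the paper shows that stacking enough of them is sufficient for universal approximation.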
<p>(You might also want to checkout my previous post where I played around with
ResNets: <a class="reference external" href="http://bjlkeng.github.io/posts/residual-networks/">Residual Networks</a>)</p>
<p><a href="http://bjlkeng.github.io/posts/universal-resnet-the-one-neuron-approximator/">Read more…</a> (11 min remaining to read)</p></div>hidden layersmathjaxneural networksresidual networksResNetuniversal approximatorhttp://bjlkeng.github.io/posts/universal-resnet-the-one-neuron-approximator/Fri, 03 Aug 2018 12:03:28 GMTHyperbolic Geometry and Poincaré Embeddingshttp://bjlkeng.github.io/posts/hyperbolic-geometry-and-poincare-embeddings/Brian Keng<div><p>This post is finally going to get back to some ML related topics.
In fact, the original reason I took that whole math-y detour in the previous
posts was to more deeply understand this topic. It turns out trying to
understand tensor calculus and differential geometry (even to a basic level) takes a
while! Who knew? In any case, we're getting back to our regularly scheduled program.</p>
<p>In this post, I'm going to explain one of the applications of an abstract
area of mathematics called hyperbolic geometry. The reason why this area is of
interest is because there has been a surge of research showing its
application in various fields, chief among them is a paper by Facebook
researchers [1] in which they discuss how to utilize a model of hyperbolic
geometry to represent hierarchical relationships. I'll cover some of
the math, weighted more towards intuition, show some of their results, and also
show some sample code from Gensim. Don't worry, this time I'll try much harder
not to go down the rabbit hole of trying to explain all the math (no
promises though).</p>
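As a small taste of the math, here's the distance function of the Poincaré ball model; notice how the distance to a point near the boundary blows up relative to the Euclidean distance, which is what gives hierarchies exponentially more room near the edge:

```python
import math

def poincare_dist(u, v):
    """Distance in the Poincaré ball model (points with norm < 1):
    d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    nu = sum(a * a for a in u)               # squared norm of u
    nv = sum(a * a for a in v)               # squared norm of v
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * d2 / ((1.0 - nu) * (1.0 - nv)))
```

For example, the point (0.9, 0) is Euclidean distance 0.9 from the origin but hyperbolic distance of almost 3, and the gap grows without bound as the point approaches the boundary.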
<p>(Note: If you're unfamiliar with tensors or manifolds, I suggest getting a quick
overview with my previous two posts:
<a class="reference external" href="http://bjlkeng.github.io/posts/tensors-tensors-tensors/">Tensors, Tensors, Tensors</a> and
<a class="reference external" href="http://bjlkeng.github.io/posts/manifolds/">Manifolds: A Gentle Introduction</a>)</p>
<p><a href="http://bjlkeng.github.io/posts/hyperbolic-geometry-and-poincare-embeddings/">Read more…</a> (34 min remaining to read)</p></div>embeddingsgeometryhyperbolicmanifoldsmathjaxPoincaréhttp://bjlkeng.github.io/posts/hyperbolic-geometry-and-poincare-embeddings/Sun, 17 Jun 2018 12:20:18 GMT