In this post, I'm going to share some notes on implementing a variational autoencoder (VAE) on the Street View House Numbers (SVHN) dataset. My last post on variational autoencoders showed a simple example on the MNIST dataset but because it was so simple I thought I might have missed some of the subtler points of VAEs -- boy was I right! The fact that I'm not really a computer vision guy nor a deep learning guy didn't help either. Through this exercise, I picked up some of the basics in the "craft" of computer vision/deep learning area; there are a lot of subtle points that are easy to gloss over if you're just reading someone else's tutorial. I'll share with you some of the details in the math (that I initially got wrong) and also some of the implementation notes along with a notebook that I used to train the VAE. Please check out my previous post on variational autoencoders to get some background.
Update 2017-08-09: I actually found a bug in my original code where I was only using a small subset of the data! I fixed it up in the notebooks and I've added some inline comments below to say what I've changed. For the most part, things have stayed the same but the generated images are a bit blurry because the dataset isn't so easy anymore.