Bayesian Machine Learning and Julia are a match made in heaven

As I argued in an earlier article, Bayesian Machine Learning can be quite powerful. Building actual Bayesian models in Python, however, is sometimes a bit of a hassle. Most solutions that you will find online are either relatively complex or require learning yet another domain specific language. The latter could easily constrain your expressiveness when you need a highly customized solution.

Doing Bayesian Machine Learning in Julia, on the other hand, allows you to mitigate both these issues. In fact, you just need a few lines of raw Julia code to build, for example, a Bayesian Neural Network for regression. Julia’s Flux and Turing packages will then handle the heavy workload under the hood.

Hence today, I want to show you how to implement and train a Bayesian Neural Network in less than 30 lines of Julia. Before showing you the code, let us briefly recall the main theoretical aspects:

Bayesian Machine Learning in three steps

As always, we want to find a posterior distribution via Bayes’ law:

p(\theta|D)=\frac{p(D|\theta)p(\theta)}{p(D)}

As the data term in the denominator is a constant, we can simplify the above:

p(\theta|D)\propto p(D|\theta)p(\theta) \\

To avoid confusion, let us use the following standard wording:

\begin{gathered}
p(\theta)\text{ := 'prior distribution'} \\
p(D|\theta)\text{ := 'likelihood function'}
\end{gathered}

For Bayesian Neural Network regression, we further specify the likelihood function:

p(D|\theta)=\prod_{i=1}^N\mathcal{N}(y_i|f_W(X_i),\sigma^2)

This denotes a product of independent normal distributions with means defined by the outputs of a Neural Network. The variance of the Normal distribution is chosen to be a constant.

The corresponding prior distribution could look as follows:

p(\theta)=p(W,\sigma)=\prod_{k=1}^K\mathcal{N}(W_k|0,1)\cdot\Gamma(1,1)

The priors for K networks weights are independent standard normal distributions. For the square root of the variance (a.k.a. standard deviation), we use a standard Gamma distribution. So, from a theory perspective, we are all set up and ready to go.

Ideally, we now want to implement the Bayesian Neural Network in the following steps:

  1. Define the likelihood function
  2. Define the prior distribution
  3. Train the model

Having these three steps separate from each other in the code, will help us to

  • Maintain readability – Besides the corresponding functions being smaller, a potential reader can also easier discern the likelihood from the prior.
  • Keep the code testable at a granular level – Likelihood and prior distribution are clearly separate concerns. Thus, we should also be able to test them individually.

With this in mind, let us start building the model in Julia.

Defining the likelihood function

The Flux library provides everything we need to build and work with Neural Networks. It has Dense to build feedforward layers and Chain to combine the layers into a network. In fact, they work just like the Layer and Network structs that we defined in this article.

Our Likelihood struct therefore consists of the Neural Network, network, and the standard deviation, sigma. In the feedforward pass, we use the network’s output and sigma to define conditional mean and standard deviation of the Gaussian likelihood:

The dot in Normal.(...) lets us define one Normal distribution per network output, each with standard deviation sigma. We could combine this with logpdf(...) from the Distributions library in order to train the model with maximum likelihood gradient descent. To perform Bayesian Machine Learning, however, we need to add a few more elements.

This leads us to the central function of this article, namely Flux.destructure(). From the documentation:

destructure(m)
Flatten a model's parameters into a single weight vector.

julia> m = Chain(Dense(10, 5, σ), Dense(5, 2), softmax)
Chain(Dense(10, 5, σ), Dense(5, 2), softmax)

julia> θ, re = destructure(m);

julia> θ
67-element Vector{Float32}:
-0.1407104
...
The second return value re allows you to reconstruct the original network after making modifications to the weight vector (for example, with a hypernetwork).

julia> re(θ .* 2)
Chain(Dense(10, 5, σ), Dense(5, 2), softmax)

In summary, destructure(...) takes an instantiated model struct and returns a tuple with two elements:

  1. The model parameters, concatenated into a single vector
  2. A reconstructor function that takes a parameter vector as in 1. as input and returns the model with those parameters

The latter is important as we can feed an arbitrary parameter vector to the reconstructor. As long as its length is valid, it returns the corresponding model with the given parameter configuration. In code:

The last function will allow us to provide weights and standard deviation parameters separately to the reconstructor. This is a necessary step in order for Turing to handle the Bayesian inference part.

From here, we are ready to move to the prior distribution.

Defining the prior distribution

This part is very short – we only need to define the prior distributions for the weight vector and the standard deviation scalar:

Having defined both likelihood and prior, we can take samples from the prior predictive distribution,

p(y|X)=\prod_{i=1}^N\int \mathcal{N}(y_i|f_W(X_i),\sigma^2)p(W)p(\sigma) dW d\sigma.

While this might look complicated as a formula, we are basically just drawing Monte Carlo samples:

The prior predictive distribution itself includes the noise from sigma. Prior predictive draws from the network alone, i.e. the prior predictive mean, yield nice and smooth samples:

Prior predictive draws from the mean function (the output of the Bayesian Neural Network)

Now, we can actually train the model.

Training the Bayesian Neural Network

For this example, we’ll be using synthetic data, sampled from

p(y,X)=\prod_{i=1}^{50}\mathcal{N}(y_i|sin(X_i),0.25^2)\cdot\mathcal{U}(X_i|-2,2).

The latter factor denotes a uniform density over (-2,2).

Synthetic dataset – 50 draws from a Normal-Uniform simulation

In order to use Turing, we need to define a model as explained in their documentation. Applied on our example, we get the following:

Finally, we need to choose an algorithm for Bayesian posterior inference. As our model is comparatively small, Hamiltonian Monte Carlo (HMC) is a suitable choice. In fact, HMC is generally considered the gold standard algorithm for Bayesian Machine Learning. Unfortunately, it becomes quite inefficient in high dimensions.

Nevertheless, we now use HMC via Turing and collect the resulting draws from the MCMC posterior:

From here, we can visualize the full posterior predictive distribution,

p(y^*|X^*,X,y)=\prod_{i=1}^{N^*}\mathcal{N}(y^*_i|f_W(X^*_i),\sigma^2)\cdot p(W,\sigma|X,y) dW d\sigma.

This is done in a similar fashion as for the prior predictive distribution (star-variables denote new inputs outside the training set). The only difference is that we now use the samples from the MCMC posterior distribution.

Posterior predictive distribution with 90% confidence intervals. As typical for Bayesian Machine Learning, epistemic uncertainty increases outside the range of observed data

Using the above example, it is easy to try out other prior distributions.

Plug-and-play with different prior distributions

As another big advantage, Turing can use almost all distributions from the Distributions library as a prior. This also allows us to try out some exotic weight priors, say a Semicircle distribution with radius 0.5. All we have to do is replace the Gaussian prior:

With the same setup as before, we get the following posterior predictive distribution:

Posterior predictive distribution for independent Semicircle(0.5) prior weight distributions

The possibilities obviously don’t stop here. Another fruitful adjustment could be the introduction of hyperprior distributions, e.g. over the standard deviation of the weight priors.

Using a model different from a Neural Network is also quite simple. You only need to adjust the Likelihood struct and corresponding functions.

Summary

This article was a brief introduction to Bayesian Machine Learning with Julia. As matter of fact, Julia is not just fast but can also make coding much easier and more efficient.

While Julia is still a young language with some caveats to deploying Julia programs to production, it is definitely an awesome language for research and prototyping. Especially the seamless interoperability between different libraries can considerably shorten iteration cycles both inside and outside of academia.

At this point, I am also quite hopeful that we will, at some point in the near future, see more Julia being used in industry-grade production code.

Leave a Reply

Your email address will not be published.

WordPress Theme by RichWP