Introduction
As I argued in an earlier article, Bayesian Machine Learning can be quite powerful. Building actual Bayesian models in Python, however, is sometimes a bit of a hassle. Most solutions that you will find online are either relatively complex or require learning yet another domain specific language. The latter could easily constrain your expressiveness when you need a highly customized solution.
Doing Bayesian Machine Learning in Julia, on the other hand, allows you to mitigate both these issues. In fact, you just need a few lines of raw Julia code to build, for example, a Bayesian Neural Network for regression. Julia’s Flux and Turing packages will then handle the heavy workload under the hood.
Hence today, I want to show you how to implement and train a Bayesian Neural Network in less than 30 lines of Julia. Before showing you the code, let us briefly recall the main theoretical aspects:
Bayesian Machine Learning in three steps
As always, we want to find a posterior distribution via Bayes’ law: \[ p(\theta \mid D)=\frac{p(D \mid \theta) p(\theta)}{p(D)} \] As the data term in the denominator is a constant, we can simplify the above: \[ p(\theta \mid D) \propto p(D \mid \theta) p(\theta) \] To avoid confusion, let us use the following standard wording: \[ \begin{gathered} p(\theta):=\text { 'prior distribution' } \\ p(D \mid \theta):=\text { 'likelihood function' } \end{gathered} \] For Bayesian Neural Network regression, we further specify the likelihood function: \[ p(D \mid \theta)=\prod_{i=1}^N \mathcal{N}\left(y_i \mid f_W\left(X_i\right), \sigma^2\right) \] This denotes a product of independent normal distributions with means defined by the outputs of a Neural Network. The variance of the Normal distribution is chosen to be a constant.
The corresponding prior distribution could look as follows: \[ p(\theta)=p(W, \sigma)=\prod_{k=1}^K \mathcal{N}\left(W_k \mid 0,1\right) \cdot \Gamma(\sigma \mid 1,1) \] The priors for the \(K\) network weights are independent standard normal distributions. For the square root of the variance (a.k.a. the standard deviation), we use a standard Gamma distribution. So, from a theory perspective, we are all set up and ready to go.
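Putting likelihood and prior together, the (unnormalized) log-posterior that we will later explore with MCMC is simply the sum of the corresponding log-densities: \[ \log p(\theta \mid D)=\sum_{i=1}^N \log \mathcal{N}\left(y_i \mid f_W\left(X_i\right), \sigma^2\right)+\sum_{k=1}^K \log \mathcal{N}\left(W_k \mid 0,1\right)+\log \Gamma(\sigma \mid 1,1)+\text { const. } \]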
Ideally, we now want to implement the Bayesian Neural Network in the following steps:
- Define the likelihood function
- Define the prior distribution
- Train the model
Having these three steps separate from each other in the code will help us to
- Maintain readability - Besides the corresponding functions being smaller, a potential reader can also more easily discern the likelihood from the prior.
- Keep the code testable at a granular level - Likelihood and prior distribution are clearly separate concerns. Thus, we should also be able to test them individually (a small sketch of such checks follows further below, once both are defined).
With this in mind, let us start building the model in Julia.
Defining the likelihood function
The Flux library provides everything we need to build and work with Neural Networks. It has Dense to build feedforward layers and Chain to combine the layers into a network.
Our Likelihood struct therefore consists of the Neural Network, network, and the standard deviation, sigma. In the feedforward pass, we use the network's output and sigma to define the conditional mean and standard deviation of the Gaussian likelihood:
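using Flux, Distributions

struct Likelihood
    network
    sigma
end

Flux.@functor Likelihood #tell Flux to look for trainable parameters in Likelihood

(p::Likelihood)(x) = Normal.(p.network(x)[:], p.sigma[1]); #Flux only recognizes Matrix parameters but Normal() needs a scalar for sigma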
The dot in Normal.(...) lets us define one Normal distribution per network output, each with standard deviation sigma. We could combine this with logpdf(...) from the Distributions library in order to train the model with maximum likelihood gradient descent. To perform Bayesian Machine Learning, however, we need to add a few more elements.
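For illustration, such maximum likelihood training could look roughly like the following sketch - it assumes training data X and y as 1-by-N matrices and uses the placeholder names mle_model and neg_loglik (depending on your Flux version, the optimiser may be spelled Adam instead of ADAM):

#rough maximum likelihood sketch; X and y are assumed to be 1-by-N training data
mle_model = Likelihood(Chain(Dense(1,5,tanh), Dense(5,1)), ones(1,1))
neg_loglik(m, X, y) = -sum(logpdf.(m(X), y[:])) #negative log-likelihood of the independent Normals

ps = Flux.params(mle_model) #implicit-parameter style; relies on the @functor declaration above
opt = ADAM(0.01)

for epoch in 1:1000
    gs = Flux.gradient(() -> neg_loglik(mle_model, X, y), ps)
    Flux.Optimise.update!(opt, ps, gs)
end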
This leads us to the central function of this article, namely Flux.destructure(). From the documentation:
@doc Flux.destructure
destructure(model) -> vector, reconstructor
Copies all trainable, isnumeric parameters in the model to a vector, and returns also a function which reverses this transformation. Differentiable.
Example
julia> v, re = destructure((x=[1.0, 2.0], y=(sin, [3.0 + 4.0im])))
(ComplexF64[1.0 + 0.0im, 2.0 + 0.0im, 3.0 + 4.0im], Restructure(NamedTuple, ..., 3))
julia> re([3, 5, 7+11im])
(x = [3.0, 5.0], y = (sin, ComplexF64[7.0 + 11.0im]))
If model contains various number types, they are promoted to make vector, and are usually restored by Restructure. Such restoration follows the rules of ChainRulesCore.ProjectTo, and thus will restore floating point precision, but will permit more exotic numbers like ForwardDiff.Dual.
If model contains only GPU arrays, then vector will also live on the GPU. At present, a mixture of GPU and ordinary CPU arrays is undefined behaviour.
In summary, destructure(...) takes an instantiated model struct and returns a tuple with two elements:
- The model parameters, concatenated into a single vector
- A reconstructor function that takes a parameter vector as in 1. as input and returns the model with those parameters
The latter is important as we can feed an arbitrary parameter vector to the reconstructor. As long as its length is valid, it returns the corresponding model with the given parameter configuration. In code:
likelihood = Likelihood(Chain(Dense(1,5,tanh),Dense(5,1)), ones(1,1))

params, likelihood_reconstructor = Flux.destructure(likelihood)
n_weights = length(params) - 1 #the last entry of params is sigma; the rest are the network weights

likelihood_conditional(weights, sigma) = likelihood_reconstructor(vcat(weights...,sigma));
The last function will allow us to provide weights and standard deviation parameters separately to the reconstructor. This is a necessary step in order for Turing to handle the Bayesian inference part.
From here, we are ready to move to the prior distribution.
Defining the prior distribution
This part is very short - we only need to define the prior distributions for the weight vector and the standard deviation scalar:
weight_prior = MvNormal(zeros(n_weights), ones(n_weights))
sigma_prior = Gamma(1.,1.);
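As promised earlier, this separation also means that likelihood and prior can be sanity-checked in isolation. A rough sketch of such checks, using Julia's Test standard library (the test values below are placeholders), could look like this:

using Test

@testset "likelihood and prior" begin
    @test length(rand(weight_prior)) == n_weights #weight prior matches the number of network weights
    @test rand(sigma_prior) > 0 #Gamma draws are strictly positive

    test_model = likelihood_conditional(rand(weight_prior), rand(sigma_prior))
    test_dists = test_model(randn(1, 10)) #10 random inputs for the Dense(1,5,tanh) input layer
    @test length(test_dists) == 10 #one Normal distribution per input point
    @test all(d -> d isa Normal, test_dists)
end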
Having defined both likelihood and prior, we can take samples from the prior predictive distribution,
\[ p(y \mid X)=\prod_{i=1}^N \int \mathcal{N}\left(y_i \mid f_W\left(X_i\right), \sigma^2\right) p(W) p(\sigma) d W d \sigma \]
While this might look complicated as a formula, we are basically just drawing Monte Carlo samples. The full prior predictive distribution includes the noise from sigma; draws of the prior predictive mean alone, i.e. the network output without that noise, yield nice and smooth curves:
Xline = Matrix(transpose(collect(-3:0.1:3)[:,:]))
likelihood_conditional(rand(weight_prior), rand(sigma_prior))(Xline)
using Random, Plots
Random.seed!(54321)
plot(Xline[:],mean.(likelihood_conditional(rand(weight_prior), rand(sigma_prior))(Xline)),color=:red, legend=:none, fmt=:png)
plot!(Xline[:],mean.(likelihood_conditional(rand(weight_prior), rand(sigma_prior))(Xline)),color=:red)
plot!(Xline[:],mean.(likelihood_conditional(rand(weight_prior), rand(sigma_prior))(Xline)),color=:red)
plot!(Xline[:],mean.(likelihood_conditional(rand(weight_prior), rand(sigma_prior))(Xline)),color=:red)
plot!(Xline[:],mean.(likelihood_conditional(rand(weight_prior), rand(sigma_prior))(Xline)),color=:red)
Now, we can actually train the model.
Training the Bayesian Neural Network
For this example, we’ll be using synthetic data, sampled from \[ p(y, X)=\prod_{i=1}^{50} \mathcal{N}\left(y_i \mid \sin \left(X_i\right), 0.25^2\right) \cdot \mathcal{U}\left(X_i \mid-2,2\right) \] The latter factor denotes a uniform density over \((-2,2)\).
Random.seed!(54321)
X = rand(1,50) .* 4 .- 2
y = sin.(X) .+ randn(1,50).*0.25
scatter(X[:], y[:],color=:green,legend=:none, fmt=:png)
In order to use Turing, we need to define a model as explained in their documentation. Applied to our example, we get the following:
using Turing
@model function TuringModel(likelihood_conditional, weight_prior, sigma_prior, X, y)
    weights ~ weight_prior
    sigma ~ sigma_prior

    predictions = likelihood_conditional(weights,sigma)(X)

    y[:] ~ Product(predictions)
end;
Finally, we need to choose an algorithm for Bayesian posterior inference. As our model is comparatively small, Hamiltonian Monte Carlo (HMC) is a suitable choice. In fact, HMC is generally considered the gold standard algorithm for Bayesian Machine Learning. Unfortunately, it becomes quite inefficient in high dimensions.
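For such models, Turing makes it easy to swap in a different algorithm by changing only the sampler object - as a rough, optional aside, the No-U-Turn sampler with a target acceptance rate of 0.65 could be used like this:

#optional alternative to HMC(0.025, 4); NUTS tunes its step size during adaptation
ch_nuts = sample(TuringModel(likelihood_conditional, weight_prior, sigma_prior, X, y), NUTS(0.65), 5000);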
Nevertheless, we now use HMC via Turing and collect the resulting draws from the MCMC posterior:
using Random
Random.seed!(54321)
N = 5000
ch = sample(TuringModel(likelihood_conditional, weight_prior, sigma_prior, X, y), HMC(0.025, 4), N);

weights = Array(MCMCChains.group(ch, :weights).value) #get posterior MCMC samples for network weights
sigmas = Array(MCMCChains.group(ch, :sigma).value); #get posterior MCMC samples for standard deviation
From here, we can visualize the full posterior predictive distribution, \[ p\left(y^* \mid X^*, X, y\right)=\prod_{i=1}^{N^*} \int \mathcal{N}\left(y_i^* \mid f_W\left(X_i^*\right), \sigma^2\right) \cdot p(W, \sigma \mid X, y) \, d W \, d \sigma \] This is done in a similar fashion as for the prior predictive distribution (star-variables denote new inputs outside the training set). The only difference is that we now use the samples from the MCMC posterior distribution.
Random.seed!(54321)
posterior_predictive_mean_samples = []
posterior_predictive_full_samples = []

for _ in 1:10000
    samp = rand(1:5000,1)
    W = weights[samp,:,1]
    sigma = sigmas[samp,:,1]

    posterior_predictive_model = likelihood_reconstructor(vcat(W[:],sigma[:]))

    predictive_distribution = posterior_predictive_model(Xline)
    postpred_full_sample = rand(Product(predictive_distribution))

    push!(posterior_predictive_mean_samples, mean.(predictive_distribution))
    push!(posterior_predictive_full_samples, postpred_full_sample)
end
posterior_predictive_mean_samples = hcat(posterior_predictive_mean_samples...)

pp_mean = mean(posterior_predictive_mean_samples, dims=2)[:]
pp_mean_lower = mapslices(x -> quantile(x,0.05), posterior_predictive_mean_samples, dims=2)[:]
pp_mean_upper = mapslices(x -> quantile(x,0.95), posterior_predictive_mean_samples, dims=2)[:]

posterior_predictive_full_samples = hcat(posterior_predictive_full_samples...)

pp_full_lower = mapslices(x -> quantile(x,0.05), posterior_predictive_full_samples, dims=2)[:]
pp_full_upper = mapslices(x -> quantile(x,0.95), posterior_predictive_full_samples, dims=2)[:]
plot(Xline[:],pp_mean, ribbon = (pp_mean.-pp_full_lower, pp_full_upper.-pp_mean),legend=:bottomright, label="Full posterior predictive distribution", fmt=:png)
plot!(Xline[:], pp_mean, ribbon = (pp_mean.-pp_mean_lower, pp_mean_upper.-pp_mean), label="Posterior predictive mean distribution (a.k.a. epistemic uncertainty)")
scatter!(X[:],y[:],color=:green, label = "Training data")
Using the above example, it is easy to try out other prior distributions.
Plug-and-play with different prior distributions
As another big advantage, Turing can use almost all distributions from the Distributions library as a prior. This also allows us to try out some exotic weight priors, say a Semicircle distribution with radius 0.5. All we have to do is replace the Gaussian prior:
weight_prior = Product([Semicircle(0.5) for _ in 1:n_weights]);
With the same setup as before, we get the following posterior predictive distribution:
Random.seed!(54321)
N = 5000
ch = sample(TuringModel(likelihood_conditional, weight_prior, sigma_prior, X, y), HMC(0.025, 4), N);

weights = MCMCChains.group(ch, :weights).value #get posterior MCMC samples for network weights
sigmas = MCMCChains.group(ch, :sigma).value #get posterior MCMC samples for standard deviation
posterior_predictive_mean_samples = []
posterior_predictive_full_samples = []

for _ in 1:10000
    samp = rand(1:5000,1)
    W = weights[samp,:,1]
    sigma = sigmas[samp,:,1]

    posterior_predictive_model = likelihood_reconstructor(vcat(W[:],sigma[:]))

    predictive_distribution = posterior_predictive_model(Xline)
    postpred_full_sample = rand(Product(predictive_distribution))

    push!(posterior_predictive_mean_samples, mean.(predictive_distribution))
    push!(posterior_predictive_full_samples, postpred_full_sample)
end

posterior_predictive_mean_samples = hcat(posterior_predictive_mean_samples...)

pp_mean = mean(posterior_predictive_mean_samples, dims=2)[:]
pp_mean_lower = mapslices(x -> quantile(x,0.05), posterior_predictive_mean_samples, dims=2)[:]
pp_mean_upper = mapslices(x -> quantile(x,0.95), posterior_predictive_mean_samples, dims=2)[:]

posterior_predictive_full_samples = hcat(posterior_predictive_full_samples...)

pp_full_lower = mapslices(x -> quantile(x,0.05), posterior_predictive_full_samples, dims=2)[:]
pp_full_upper = mapslices(x -> quantile(x,0.95), posterior_predictive_full_samples, dims=2)[:]
plot(Xline[:],pp_mean, ribbon = (pp_mean.-pp_full_lower, pp_full_upper.-pp_mean),legend=:bottomright, label="Full posterior predictive distribution", fmt=:png)
plot!(Xline[:], pp_mean, ribbon = (pp_mean.-pp_mean_lower, pp_mean_upper.-pp_mean), label="Posterior predictive mean distribution (a.k.a. epistemic uncertainty)")
scatter!(X[:],y[:],color=:green, label = "Training data")
The possibilities obviously don’t stop here. Another fruitful adjustment could be the introduction of hyperprior distributions, e.g. over the standard deviation of the weight priors.
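As a rough sketch of that idea (with the placeholder names HyperpriorModel and weight_std), the Turing model from above could be extended like this:

@model function HyperpriorModel(likelihood_conditional, sigma_prior, n_weights, X, y)
    weight_std ~ Gamma(1.,1.) #hyperprior over the standard deviation of the weight prior
    weights ~ MvNormal(zeros(n_weights), weight_std .* ones(n_weights))
    sigma ~ sigma_prior

    predictions = likelihood_conditional(weights,sigma)(X)

    y[:] ~ Product(predictions)
end;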
Using a model different from a Neural Network is also quite simple. You only need to adjust the Likelihood struct and corresponding functions.
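For example, a plain Bayesian linear regression could reuse the exact same interface - a rough sketch with the placeholder name LinearLikelihood:

struct LinearLikelihood
    coefficients #1-by-D matrix of regression weights
    sigma
end

Flux.@functor LinearLikelihood
(p::LinearLikelihood)(x) = Normal.((p.coefficients * x)[:], p.sigma[1]);

#destructure and plug into the same Turing workflow as before
linear_params, linear_reconstructor = Flux.destructure(LinearLikelihood(ones(1,1), ones(1,1)));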
Conclusion
This article was a brief introduction to Bayesian Machine Learning with Julia. As a matter of fact, Julia is not just fast but can also make coding much easier and more efficient.
While Julia is still a young language with some caveats around deploying Julia programs to production, it is definitely an awesome language for research and prototyping. Especially the seamless interoperability between different libraries can considerably shorten iteration cycles both inside and outside of academia.
At this point, I am also quite hopeful that we will, at some point in the near future, see more Julia being used in industry-grade production code.
References
[1] Bezanson, Jeff, et al. Julia: A fast dynamic language for technical computing. arXiv preprint arXiv:1209.5145, 2012.
[2] Ge, Hong, Kai Xu, and Zoubin Ghahramani. Turing: A language for flexible probabilistic inference. International Conference on Artificial Intelligence and Statistics, PMLR, 2018, pp. 1682-1690.