This gives a brief walkthrough of the intuition behind simulation-based inference
(also known as likelihood-free inference, SBI, or LFI) aimed at scientists with a bit of a stats background,
but without machine learning experience.
We go through a simple inference example with the package
sbi, and make
comparisons to an analytic likelihood. We also touch the basics of PyTorch as a generic linear algebra package.
Parts of notebook are taken from SBI tutorials and Peter Melchior’s LFI introduction.
An interactive version of this notebook
in Colab is available here.
For a literature and a review of current LFI methods, see “The frontier of simulation-based inference”. The methods used in
sbi are Automatic Posterior Transformation for likelihood-free inference by Greenberg D, Nonnenmacher M, and Macke J;
Sequential Neural Likelihood by Papamakarios G, Sterrat DC and Murray I;
Likelihood-free Inference with Amortized Approximate Likelihood Ratios by Hermans J, Begy V, and Louppe G.;
and On Contrastive Learning for Likelihood-free Inference by Durkan C, Murray I, and Papamakarios G.
(don’t worry about the error message “…has requirement nflows==0.11…”, it will work)
I will assume you have PyTorch, and the main scientific computing packages.
Simulation-Based/Likelihood-Free Inference Quick Start
We would like to perform inference on a model’s parameters given observations :
We normally do this with Bayes rule:
which relies on the likelihood function .
However, what if we don’t know the likelihood? What if all we can do is simulate outputs given ?
Well, say we take this simulator, and can simulate enough examples such that we can measure the frequency of examples reproducing the data (or getting close to it). This is called Approximate Bayesian Computation, or ABC:
But, what if the simulation is very expensive, and the data is high dimensional? Then we would need to run far too many simulations. E.g., images of a galaxy - we can’t simply simulate enough galaxies until the pixels match up!
So we turn to Likelihood-Free Inference. Instead of needing to call the simulation over and over, we instead run it only a few times, and fit a function:
for several examples of , such that is maximized over the data.
But what is and how can we ensure it’s normalized?
This is where density estimation-based LFI (DELFI) comes into play. We would like to fit a distribution over , rather than a point-to-point function. One example is we would like to fit a function that predicts for a 1D Gaussian to model , given a datapoint . I.e.,
This gives us log-probabilities of , and we know that this function is normalized. And we maximize the likelihood of that Gaussian over the dataset by tweaking the parameters of and . This can be done with Neural Networks with some imposed smoothness regularization on so they don’t overfit.
Furthermore, if one does not expect the posterior to look Gaussian (or a Gaussian mixture model), one can turn to “Normalizing Flows”.
A normalizing flow is a flexible transformation between probability distributions:
A normalizing flow learns an invertible dynamical model for samples of the distribution. Sample a Gaussian, and go forward through the dynamical model to sample the learned distribution. Invert the model and go backward from a sample to the compute the likelihood.
Each step is autoregressive^ (=change in y depends on x, followed by change in x depends on y), allowing it to be invertible. The invertibility implies normalization, since we aren’t creating additional Monte Carlo samples.
If you learn the dynamics with a flexible model, like a neural network, you can actually model an arbitrary distribution. This is great for non-Gaussian distributions.
For an astronomy example, check out this paper where we learn an accurate data-driven HR diagram directly from Gaia data with a normalizing flow. No theory and no assumption of Gaussianity.
Modern LFI techniques model as normalizing flows, to learn very flexible likelihoods. Here is the procedure:
- Optionally learn a compression of to a smaller number of features.
- Learn a normalizing flow that maximizes over examples. We can then both sample from and evaluate the posterior of given samples.
PyTorch quick start
Here, I introduce PyTorch, which is used as the backend for
sbi. It’s more of
an introduction for scientists who don’t do machine learning typically, as I
want to encourage them to use these DL frameworks for scientific computing.
I think it’s wise to write your array-based simulations by piggy-backing
on a package which is relied on and optimized by a
many-billion dollar industry.
Feel free to skip this section if you are comfortable with PyTorch.
PyTorch is like numpy in syntax, but adds: vectorization, GPU-acceleration, and autodifferentiation. It also has deep learning kits. Let’s do some linear algebra:
np.array, we use
(array([5., 3.]), tensor([5., 3.]))
But we can pass data back and forth between numpy and torch like so:
(array([10., 5., 1.]), tensor([10., 5., 1.], dtype=torch.float64), array([10., 5., 1.]))
Create a vector of 10 zeros:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Add 1 to each element:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Generate random numbers:
tensor([ 0.8344, 0.1738, -0.7568, 0.1369, 0.1513, 0.0052, 0.2035, -1.2633, -0.9020, 0.8539])
Do some basic operations, all with numpy syntax:
(tensor([0.6716, 0.9849, 0.7271, 0.9906, 0.9886, 1.0000, 0.9794, 0.3027, 0.6201, 0.6571]), tensor([0.6067, 0.1602, 0.5635, 0.1283, 0.1409, 0.0052, 0.1852, 0.8168, 0.6429, 0.6173]), tensor(-0.5630), tensor(0.7062))
Now, let’s do stuff on the GPU:
tensor([[-1.3129, -0.6998, -0.1639, ..., -0.2922, 0.9799, 0.2962], [ 0.3176, -1.4630, 0.5830, ..., 0.2930, 0.2625, 0.1684], [-2.1551, 2.2514, -0.8173, ..., -2.6452, 0.9850, -0.2836], ..., [ 1.1963, 0.5753, 1.4584, ..., -0.1940, -0.6713, 0.2051], [ 0.6018, -0.7904, -0.4918, ..., 0.4556, 0.2463, -1.3733], [ 0.9120, 0.4902, -0.9096, ..., 0.2018, 0.1721, -0.2624]], device='cuda:0')
So the vector is on the GPU. We can do vector operations all on the GPU now:
tensor([[ -2.0060, 0.4243, 0.9842, ..., 0.9347, 1.5000, 0.9844], [ 0.9840, -3.0218, 1.0350, ..., 0.9845, 0.9858, 0.9926], [-10.5596, 10.7842, 0.1401, ..., -19.3868, 1.5106, 0.9393], ..., [ 2.0799, 1.0315, 3.2161, ..., 0.9759, 0.4825, 0.9897], [ 1.0443, 0.2117, 0.7645, ..., 0.9946, 0.9868, -2.3916], [ 1.3727, 1.0020, -0.1365, ..., 0.9899, 0.9923, 0.9497]], device='cuda:0')
And move back to the CPU like so:
tensor([[ -2.0060, 0.4243, 0.9842, ..., 0.9347, 1.5000, 0.9844], [ 0.9840, -3.0218, 1.0350, ..., 0.9845, 0.9858, 0.9926], [-10.5596, 10.7842, 0.1401, ..., -19.3868, 1.5106, 0.9393], ..., [ 2.0799, 1.0315, 3.2161, ..., 0.9759, 0.4825, 0.9897], [ 1.0443, 0.2117, 0.7645, ..., 0.9946, 0.9868, -2.3916], [ 1.3727, 1.0020, -0.1365, ..., 0.9899, 0.9923, 0.9497]])
PyTorch also lets you do autodifferentiation, which is a big part of deep learning. Let’s look at the gradient of :
First, we make our variable know that it should record gradients as it is
operated upon with the
Now, we take the analytic gradient at each of , which is :
We can also do this all on the GPU! We can also do higher-order derivatives by repeatedly calling grad.
(tensor([-1.0000, -0.7071], device='cuda:0'),)
With gradient information, one can do gradient descent operation on a model
One can write a class that inherits from
torch.nn.Parameters(...) around its parameters, and then
optimize those. This is how all deep learning works: gradient-based optimization
of some highly-flexible differentiable model.
PyTorch is used as the backend for SBI. Let’s now move to likelihood-free inference.
Say that there is the following true model—a 2D Gaussian with known width:
The true vector for the Gaussian is , which we are unaware of. We make 5 observations and attempt to reconstruct a posterior over .
The likelihood of this model is given by
which gives us a way to recover the parameter given data and a prior: .
But now, say that we don’t know this likelihood. Pretend we are not given the likelihood of a Gaussian. We only know how to draw samples from a Gaussian: this is our “simulation.”
How can we compute a distribution over , without a likelihood?
For LFI, you need to provide two ingredients:
- a prior distribution that allows to sample parameter sets (your guess at )
- a simulator that takes parameter sets and produces simulation outputs (samples of a Gaussian with a given )
For example, let’s pretend we have a reasonable idea that each is within with uniform probability. We also have our simulator that generates samples of a Gaussian.
sbi provides a simple interface to run simulation-based inference.
Let’s write out our uniform prior:
This will be used to generate proposals for .
Now, let’s write out our simulator for a given :
This simulator will take an estimate for , and output a “simulation result” as a random sample of a Gaussian. If we pick the right , the output of the simulator will be close to our observed data.
Step 1: Let’s learn a likelihood from the simulator:
sbi to record 200 simulations from the above function, and fit
the joint surface of with a neural network using
one of the three LFI algorithms.
This will be a bit slow because we are training the neural network on a CPU:
Neural network successfully converged after 57 epochs.
Now we have a neural network that acts as our likelihood!
Step 2, let’s record our 5 “observations” of the true distribution:
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm
Here’s the data:
We can now use our learned likelihood (a neural network) to do inference on !
Step 3, Inference
We can sample the posterior for a single datapoint like so:
Tuning bracket width...: 100%|██████████| 50/50 [00:00<00:00, 58.36it/s] Generating samples: 100%|██████████| 20/20 [00:00<00:00, 20.97it/s] Generating samples: 100%|██████████| 200/200 [00:09<00:00, 20.73it/s]
And then plot this (it should look approximately Gaussian):
For multiple observations, one should compute the total log-probability (see below), and MCMC.
Let’s create a grid of close to the expected value of , and calculate the total log likelihood for each value:
Let’s calculate the log probability across the grid. One can usually use
posterior.log_prob, but one can also vectorize it as follows:
Text(0, 0.5, '$$\\mu_2$$')
Now, this was a simple example and we could have written out an explicit likelihood. But the real power of Likelihood-Free Inference comes for high dimensional and complicated distributions, when there is no explicit analytic likelihood available. The exact same algorithm as above can simply be scaled up to a higher dimensional space.
Let’s compare to the true likelihood:
Text(0, 0.5, '$$\\mu_2$$')
Not bad considering we only used examples to learn the likelihood, which is a complex function ! The more simulations you use (much faster to train with GPUs), the more accurate the likelihood will be.
LFI is a very general approach that will eventually be used everywhere in astrophysics where we don’t have clear likelihoods. To help give examples of what it could do:
- Photo-z: parameter is z, you then generate some SEDs - that is your
“simulator”. Then use
sbito compare to data. You get a posterior over z.
- Galaxy images: parameters are galaxy properties, orientation, distance, you
then simulate some images of galaxies. You then use
sbito infer the galaxy properties.
- Here, one would probably want to use an autoencoder to first compress the images to vectors describing the properties.
- The likelihood is then learned over that vector.
Answers to questions during my tutorial:
- Your simulator does not need to be differentiable. All you need is the input and output.
- LFI was invented for very expensive & high-dimensional simulations; the toy example we did was a 2D Gaussian just to help with building intuition.
- The range of parameters which you simulate is linked to the accuracy of the likelihood you learn (the closer the parameters you simulate from are to the true parameters, the less the model needs to interpolate). We chose a wide prior of , but a tighter prior would result in a more accurate likelihood. (i.e., generally you want your simulations to have realistic parameters).
- This may sound strange for those who have not done deep learning, but the time difference between our 2-D case and a 100-D case is small, as the internal model (“hidden layers”) is the same for both.
SNPE = sequential neural posterior estimation = https://papers.nips.cc/paper/6728-flexible-statistical-inference-for- mechanistic-models-of-neural-dynamics.pdf.
Outputs as a Gaussian mixture model. Difference: generates parameter samples from a learned proposal that make generated data closer to the observed data.
SNLE = sequential neural likelihood estimation = http://proceedings.mlr.press/v89/papamakarios19a/papamakarios19a.pdf.
Train a likelihood as a conditional normalizing flow. Will learn summary statistics internally.
SNRE = Sequential Neural Ratio Estimation = https://arxiv.org/abs/1903.04057.
Targets the likelihood ratio between two choices of parameters.