Bayesian scoring rule calibration for approximate models

Joshua J Bon

CEREMADE, Université Paris-Dauphine

Talk overview

  • Motivation
  • Method
  • Examples
  • Theory

Joint work with

  • Christopher Drovandi, QUT
  • David Warne, QUT
  • David Nott, NUS

Bayesian Inference

Approximations in Bayesian inference

Bayesian inference requires likelihoods, but…

  • Intractable likelihoods
    • Use approximations
  • Computationally expensive likelihoods
    • Use approximations
  • Posterior evaluation
    • Use approximations

Approximations everywhere…

Likelihood approximations

  • Limiting distributions
  • Whittle likelihood
  • Model simplification (e.g. SDE \(\rightarrow\) ODE)
  • Linear noise approximation
  • Kalman filter
  • Particle filters

Posterior approximations

  • Laplace
  • INLA
  • Variational inference
  • Pseudo-marginal MCMC
  • Approximate Bayesian Computation

Sampling approximations

  • Importance sampling
  • Sequential Monte Carlo
  • Markov chain Monte Carlo (MH, ULA, MALA, HMC)

A good approximation?

Can we check posteriors based on approximation(s)?

Can we correct posteriors based on approximation(s)?

Correcting approximate posteriors

\[\begin{aligned}&\text{True posterior}:\quad &\Pi(\cdot\vert y) \\ &\text{Approximate posterior}:\quad &\check{\Pi}(\cdot\vert y) \\ &\text{Finite-sample approximate posterior}:\quad &\hat{\Pi}(\cdot\vert y)\end{aligned}\]

Correcting approximate posteriors

  • There is generally theory motivating \(\Pi(\cdot\vert y) \approx \check{\Pi}(\cdot\vert y)\)
    • But not for finite samples and/or a given realisation of the data
  • Theory for \(\check{\Pi}(\cdot\vert y) \approx \hat{\Pi}(\cdot\vert y)\) as \(N \rightarrow \infty\)
    • and empirical tests for finite \(N\)

What about \(\Pi(\cdot\vert y) \approx \hat{\Pi}(\cdot\vert y)\)?

Transforming approximate posteriors

Find a function to correct approximate posteriors

\[\Pi(\cdot\vert y) \approx {\color{green}f}_\sharp\hat{\Pi}(\cdot\vert y)\] using samples from \(\hat{\Pi}(\cdot\vert y)\)

  • No need to rerun MC/MCMC/SMC
    • unlike correcting the likelihood directly
  • Correction acts on all levels of approximation
  • But we still can’t evaluate \(\Pi(\cdot\vert y)\)!

Use a calibration-based approach…

Illustration

Correcting a posterior

Figure 1: Ideal case (approximation in blue)

Correcting a posterior

Figure 2: Real case (approximation in blue)

Correcting many posteriors

Figure 3: Correct on average? (approximation in blue)

Measure the similarity

Let \({\color{blue}K}\) be a Markov kernel, e.g. \({\color{blue}K}(\cdot\vert\tilde{y}) = \hat{\Pi}(\cdot\vert\tilde{y})\).

Measure the similarity to the truth with \(S:\mathcal{P} \times \Theta \rightarrow \mathbb{R}\)

\[S({\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta})\]

\[\tilde{y} \sim P(\cdot \vert {\color{red}\theta}), \quad {\color{red}\theta} \sim {\color{purple}\Pi}\]

  • \({\color{red}\theta}\) is the true data-generating parameter
  • \(({\color{red}\theta}, \tilde{y})\) are prior-predictive pairs
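Generating such prior-predictive pairs by simulation is straightforward; a minimal sketch for a hypothetical conjugate-normal model, where the prior and likelihood are illustrative stand-ins for \(\Pi\) and \(P\):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_predictive(m, n=50):
    """Draw m prior-predictive pairs (theta, y_tilde) from a toy
    model: theta ~ N(0, 1) prior, y_i | theta ~ N(theta, 1)."""
    theta = rng.normal(0.0, 1.0, size=m)                    # theta ~ Pi
    y_tilde = rng.normal(theta[:, None], 1.0, size=(m, n))  # y ~ P(. | theta)
    return theta, y_tilde

theta, y_tilde = sample_prior_predictive(100)
```

Each row of `y_tilde` is one simulated dataset \(\tilde{y}\) paired with its generating \(\theta\).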

Optimise the similarity

Consider the average over prior-predictive pairs \(({\color{red}\theta}, \tilde{y})\)

\[\max_{{\color{blue}K}\in\mathcal{K}}\mathbb{E}[S({\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta})]\]

Optimise the similarity

Consider the average over prior-predictive pairs \(({\color{red}\theta}, \tilde{y})\)

Choose \(\mathcal{K} = \{{\color{green}f}_\sharp\hat\Pi: {\color{green}f} \in \mathcal{F}\}\)

\[\max_{{\color{green}f} \in \mathcal{F}}\mathbb{E}[S({\color{green}f}_\sharp\hat\Pi(\cdot\vert \tilde{y}),{\color{red}\theta})]\]

Correcting many posteriors

Figure 4: Before correction

Correcting many posteriors

Figure 5: After correction

Model calibration by simulation

  1. Correct coverage? Frequentist
    • \(S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert y), {\color{red}\theta}) = \vert 1({\color{red}\theta} \in {\color{purple}\text{CR}_\alpha}) - \alpha \vert\)
  2. Can we correct the entire distribution? Bayesian
    • Proper scoring rules

BSRC method

Scoring rules


\(S(U,x)\) compares probabilistic forecast \(U\) to ground truth \(x\).

\[S(U,V) = \mathbb{E}_{x\sim V} S(U,x)\]

where \(V\) is a probability measure.

Proper scoring rules

(Gneiting and Raftery 2007)


\(S\) is strictly proper with respect to a family \(\mathcal{U}\) if

  • \(S(V,V) \geq S(U, V)\) for all \(U, V \in \mathcal{U}\), and
  • equality holds iff \(U = V\).

\[V = \arg\max_{U \in \mathcal{U}} S(U, V)\]

What we can’t do


\[\Pi(\cdot ~\vert~y ) = \underset{ {\color{green}f}\in \mathcal{F}}{\arg\max}~ S[{\color{green}f}_\sharp {\color{blue}K}(\cdot\vert y),\Pi(\cdot ~\vert~y )]\]

What we can’t do


\[\Pi(\cdot ~\vert~y ) = \underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{\theta \sim \Pi(\cdot ~\vert~y )}\left[S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert y),\theta) \right]\]

What we can do


Consider the joint data-parameter space instead

\[\Pi(\text{d}\theta ~\vert~\tilde{y} )P(\text{d}\tilde{y}) = P(\tilde{y}~\vert~\theta)\Pi(\text{d}\theta )\]

See Pacchiardi and Dutta (2022) and Bon et al. (2022)

What we can do


\[{\color{purple}\Pi}(\cdot ~\vert~y ) = \underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{\tilde{y} \sim P} \mathbb{E}_{{\color{red}\theta} \sim {\color{purple}\Pi}(\cdot ~\vert~\tilde{y} )}\left[S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta}) \right]\]

What we can do


\[{\color{purple}\Pi}(\cdot ~\vert~y ) = \underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{ {\color{red}\theta} \sim {\color{purple}\Pi}} \mathbb{E}_{\tilde{y} \sim P(\cdot ~\vert~{\color{red}\theta})}\left[S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta}) \right]\]

Two steps further


  1. Replace \(P(\text{d}\tilde{y})\) with \(Q(\text{d}\tilde{y}) \propto P(\text{d}\tilde{y})v(\tilde{y})\)

  2. Replace \(\Pi\) with \(\bar{\Pi}\)

    • correct with importance weight

It still works!

BSRC optimisation problem

\[\underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{ {\color{red}\theta} \sim \bar\Pi} \mathbb{E}_{\tilde{y} \sim P(\cdot ~\vert~{\color{red}\theta})}\left[w({\color{red}\theta}, \tilde{y})S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta}) \right]\]

\[w({\color{red}\theta}, \tilde{y}) = \frac{\pi({\color{red}\theta})}{\bar\pi({\color{red}\theta})} v(\tilde{y})\]
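When both prior and importance densities are available, the weight is a direct density ratio; a minimal sketch with hypothetical normal prior and importance distributions, and a unit stability function \(v\) by default:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bsrc_weights(theta, prior=(0.0, 1.0), importance=(0.0, 2.0), v=None):
    """w(theta, y) = pi(theta)/pibar(theta) * v(y); unit stability
    function (v = 1) unless one is supplied."""
    w = normal_pdf(theta, *prior) / normal_pdf(theta, *importance)
    if v is not None:
        w = w * v
    return w
```

With the importance distribution twice as wide as the prior (as in the examples later, where \(\bar\Pi\) is a scaled \(\hat\Pi\)), the ratio down-weights draws in the tails.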

Moment-correcting transformation

Need a family of functions, \(f \in \mathcal{F}\).

Start simple:

\[f(x) = A[x - \hat{\mu}(y)] + \hat{\mu}(y) + b\]

  • Mean \(\hat{\mu}(y) = \mathbb{E}(\hat{\theta}),\quad \hat{\theta} \sim \hat\Pi(\cdot~\vert~y)\) for \(y \in \mathsf{Y}\).

  • \(A\) is a square matrix with positive diagonal elements, such that \(AA^\top\) is positive definite.
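Applied to posterior draws, the push-forward \({\color{green}f}_\sharp\hat\Pi\) is just this affine map evaluated sample-by-sample; a minimal sketch, with the particular \(A\) and \(b\) chosen purely for illustration:

```python
import numpy as np

def moment_correct(samples, A, b):
    """f(x) = A [x - mu_hat(y)] + mu_hat(y) + b applied to each draw;
    mu_hat is the empirical mean of the approximate posterior."""
    mu = samples.mean(axis=0)
    return (samples - mu) @ A.T + mu + b

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(5000, 2))  # draws from a toy approximate posterior
A = np.diag([2.0, 0.5])                   # rescales the spread per coordinate
b = np.array([1.0, -1.0])                 # shifts the mean
y = moment_correct(x, A, b)
```

The transformed draws have mean shifted by \(b\) and spread rescaled by \(A\), while leaving the centring point \(\hat\mu(y)\) data-dependent.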

Bayesian scoring rule calibration

  • Focus on Energy Score: \(S(\Pi, \theta) = \frac{1}{2}\mathbb{E}_{X,X^\prime \sim \Pi}\Vert X - X^\prime\Vert_2^\beta - \mathbb{E}_{X \sim \Pi}\Vert X - \theta\Vert_2^\beta\)
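A Monte Carlo estimate of the energy score from posterior draws can be sketched as follows; the split-half pairwise term is one simple unbiased choice for \(\mathbb{E}\Vert X - X^\prime\Vert^\beta\), not necessarily the estimator used in the paper:

```python
import numpy as np

def energy_score(samples, theta, beta=1.0):
    """Monte Carlo estimate of the (positively oriented) energy score
    0.5 E||X - X'||^beta - E||X - theta||^beta from draws X ~ Pi."""
    n = len(samples)
    # independent halves give i.i.d. pairs (X, X') for the first term
    x, xp = samples[: n // 2], samples[n // 2:]
    m = min(len(x), len(xp))
    pair = np.linalg.norm(x[:m] - xp[:m], axis=-1) ** beta
    close = np.linalg.norm(samples - theta, axis=-1) ** beta
    return 0.5 * pair.mean() - close.mean()
```

As a sanity check, a forecast centred on the truth scores higher than the same forecast evaluated against a distant \(\theta\).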

Examples

OU Process

\[\text{d}X_t = \gamma (\mu - X_t) \text{d}t + \sigma\text{d}W_t\]

Observe the process at the final time \(T\) (\(n = 100\) observations):

\[X_T \sim \mathcal{N}\left( \mu + (x_0 - \mu)e^{-\gamma T}, \frac{D}{\gamma}(1- e^{-2\gamma T}) \right)\]

where \(D = \frac{\sigma^2}{2}\). Fix \(\gamma = 2\), \(T=1\), \(x_0 = 10\)
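Sampling \(X_T\) from this exact transition density is immediate; a minimal sketch using the parameter values fixed above:

```python
import numpy as np

def ou_exact_sample(mu, D, gamma=2.0, T=1.0, x0=10.0, n=100, rng=None):
    """Draw n i.i.d. copies of X_T from the exact OU transition
    N(mu + (x0 - mu) e^{-gamma T}, (D/gamma)(1 - e^{-2 gamma T}))."""
    rng = rng or np.random.default_rng()
    mean = mu + (x0 - mu) * np.exp(-gamma * T)
    var = (D / gamma) * (1.0 - np.exp(-2.0 * gamma * T))
    return rng.normal(mean, np.sqrt(var), size=n)
```

The limiting approximation below simply replaces this with the stationary distribution \(\mathcal{N}(\mu, D/\gamma)\), dropping the \(x_0\)-dependent terms.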

OU Process (limiting approximation)

Infer \(\mu\) and \(D\) with approximate likelihood based on

\[X_\infty \sim \mathcal{N}\left(\mu, \frac{D}{\gamma}\right)\]

OU Process (limiting approximation)


\(M = 100\) and \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\) scaled by 2

OU Process (limiting approximation)

Comparison for \(\mu\)

| Posterior    | MSE  | Bias  | St. Dev. | Coverage (90%) |
|--------------|------|-------|----------|----------------|
| Approx       | 1.54 | 1.21  | 0.22     | 0%             |
| Adjust (α=0) | 0.12 | 0.15  | 0.20     | 64%            |
| Adjust (α=1) | 0.12 | 0.15  | 0.23     | 82%            |
| True         | 0.12 | -0.01 | 0.26     | 94%            |

Estimated from independent replications of the method.

OU Process (limiting approximation)

Comparison for \(D\)

| Posterior    | MSE  | Bias | St. Dev. | Coverage (90%) |
|--------------|------|------|----------|----------------|
| Approx       | 4.73 | 0.18 | 1.46     | 85%            |
| Adjust (α=0) | 4.83 | 0.28 | 1.24     | 72%            |
| Adjust (α=1) | 5.13 | 0.42 | 1.45     | 83%            |
| True         | 5.00 | 0.37 | 1.48     | 85%            |

Estimated from independent replications of the method.

Lotka-Volterra SDE

\[\text{d} X_{t} = (\beta_1 X_{t} - \beta_2 X_{t} Y_{t} ) \text{d} t+ \sigma_1 \text{d} B_{t}^{1}\] \[\text{d} Y_{t} = (\beta_4 X_{t} Y_{t} - \beta_3 Y_{t} ) \text{d} t+ \sigma_2 \text{d} B_{t}^{2}\]

Figure 6: Realisation of Lotka-Volterra SDE
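A realisation like the one in Figure 6 can be generated with a basic Euler-Maruyama discretisation; a minimal sketch, in which the parameter values, step size, and zero-clipping of the states are illustrative assumptions:

```python
import numpy as np

def lotka_volterra_em(beta, sigma, x0=100.0, y0=50.0, dt=0.01, T=10.0, rng=None):
    """Euler-Maruyama sketch of the Lotka-Volterra SDE; states are
    clipped at zero to keep populations non-negative."""
    rng = rng or np.random.default_rng()
    b1, b2, b3, b4 = beta
    n = int(round(T / dt))
    x, y = np.empty(n + 1), np.empty(n + 1)
    x[0], y[0] = x0, y0
    for i in range(n):
        dB1, dB2 = rng.normal(0.0, np.sqrt(dt), size=2)
        x[i + 1] = max(x[i] + (b1 * x[i] - b2 * x[i] * y[i]) * dt + sigma[0] * dB1, 0.0)
        y[i + 1] = max(y[i] + (b4 * x[i] * y[i] - b3 * y[i]) * dt + sigma[1] * dB2, 0.0)
    return x, y
```

This forward simulator is what the EKF approximation on the next slide replaces with a sequence of Gaussian predict/update steps.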

Lotka-Volterra SDE

Use the extended Kalman filter (EKF) as an approximate likelihood

  • Discretise the SDE at time points \(t_0,t_1,\ldots,t_n\)
  • Assume Gaussianity at each \(t_i\)
  • Approximate the transition dynamics by linearisation
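The three steps above amount to a standard EKF recursion; a generic predict/update sketch, where the function names and the linear-Gaussian observation model \(y = Hx + \text{noise}\) are assumptions for illustration:

```python
import numpy as np

def ekf_step(m, P, f, F_jac, Q, y, H, R):
    """One extended Kalman filter cycle: propagate the Gaussian state
    through the (nonlinear) transition f, linearised via its Jacobian
    F_jac, then condition on observation y = H x + N(0, R) noise."""
    # predict: mean through f, covariance through the Jacobian
    m_pred = f(m)
    F = F_jac(m)
    P_pred = F @ P @ F.T + Q
    # update: standard Kalman correction against the observation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = (np.eye(len(m)) - K @ H) @ P_pred
    return m_new, P_new
```

The Gaussian predictive density \(\mathcal{N}(H m_\text{pred}, S)\) evaluated at each observation is what supplies the approximate likelihood.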

BSRC settings

  • \(M = 200\)
  • \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\) scaled by 2

Lotka-Volterra SDE


Lotka-Volterra SDE


MAPK: Biological network

Two-step Mitogen Activated Protein Kinase enzymatic cascade (Dhananjaneyulu et al. 2012; Warne et al. 2022)

\[ \begin{aligned} X + E &\overset{k_1}{\rightarrow} [XE],\\ X^{a} + P_1 &\overset{k_4}{\rightarrow} [X^{a}P_1], \\ X^{a} + Y &\overset{k_7}{\rightarrow} [X^{a}Y], \\ Y^{a} + P_2 &\overset{k_{10}}{\rightarrow} [Y^{a}P_2] \\ \end{aligned}\]

\[ \begin{aligned} ~[XE] &\overset{k_2}{\rightarrow} X + E, \\ [X^{a}P_1] &\overset{k_5}{\rightarrow} X^{a} + P_1, \\ [X^{a}Y] &\overset{k_8}{\rightarrow} X^{a} + Y, \\ [Y^{a}P_2] &\overset{k_{11}}{\rightarrow} Y^{a} + P_2, \\ \end{aligned}\]

\[ \begin{aligned} ~[XE] &\overset{k_3}{\rightarrow} X^{a}+ E, \\ [X^{a}P_1] &\overset{k_6}{\rightarrow} X + P_1, \\ [X^{a}Y] &\overset{k_9}{\rightarrow} X^{a} + Y^{a}, \\ [Y^{a}P_2] &\overset{k_{12}}{\rightarrow} Y + P_2. \\ \end{aligned}\]

  • \(k_1,\ldots,k_{12}\) are kinetic rate parameters
  • Cascade reactions essential for cell signalling processes
  • Translates to Markov jump process

MAPK: Biological network

Figure 7: Realisation of MAPK MJP

MAPK: Biological network

Figure 8: Realisation of MAPK MJP

MAPK: Biological network

Figure 9: Observed molecules from MAPK MJP

MAPK: Biological network

Assume observation process

\[[X^\text{obs}_t~Y^\text{obs}_t] \sim \mathcal{N}([X^a_t, Y^a_t], \sigma^2)\] at times \(t \in \{0,4,8,\ldots,200\}\) with \(\sigma = 1\).

Approximations:

  1. Large count limit (MJP \(\rightarrow\) SDE)
  2. EKF for likelihood approximation (\(dt = 1\))
  3. \(N=1000\) samples from HMC

BSRC settings: \(M = 200\) and \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\), scaled by 1.5

MAPK: Biological network


MAPK: Biological network


Theory

Supporting theory

  • Strictly proper scoring rule \(S\) w.r.t. \(\mathcal{P}\)

  • Importance distribution \(\bar\Pi\) on \((\Theta,\vartheta)\)

    • \(\Pi \ll \bar\Pi\) with density \(\bar\pi\)
  • Stability function \(v:\mathsf{Y} \rightarrow [0,\infty)\)

    • measurable under \(P\) on \((\mathsf{Y}, \mathcal{Y})\)
  • \(Q(\text{d} \tilde{y}) \propto P(\text{d} \tilde{y})v(\tilde{y})\)

  • Family of kernels \(\mathcal{K}\)

Supporting theory: Theorem 1

If \(\mathcal{K}\) is sufficiently rich, then the Markov kernel

\[\bbox[5pt,border: 1px solid blue]{\color{black}K^{\star} \equiv \underset{K \in \mathcal{K}}{\arg\max}~\mathbb{E}_{\theta \sim \bar\Pi} \mathbb{E}_{\tilde{y} \sim P(\cdot ~\vert~ \theta)}\left[w(\theta, \tilde{y}) S(K(\cdot ~\vert~ \tilde{y}),\theta) \right]}\]

where \(w(\theta, \tilde{y}) = \frac{\pi(\theta)}{\bar\pi(\theta)} v(\tilde{y})\), satisfies

\[\bbox[5pt,border: 1px solid blue]{K^{\star}(\cdot ~\vert~ \tilde{y}) = \Pi(\cdot ~\vert~ \tilde{y})}\] almost surely.

Sufficiently rich kernel family

  • \(\mathcal{K}\) be a family of Markov kernels
  • \(\mathcal{P}\) be a class of probability measures
  • \(Q\) be a probability measure on \((\mathsf{Y},\mathcal{Y})\), and \(\tilde{y}\sim Q\)
  • \(\Pi(\cdot ~\vert~ \tilde{y})\) be the true posterior at \(\tilde{y}\).

We say \(\mathcal{K}\) is sufficiently rich with respect to \((Q,\mathcal{P})\) if for all \(U \in \mathcal{K}\), \(U(\cdot ~\vert~ \tilde{y}) \in \mathcal{P}\) almost surely and there exists \(U \in \mathcal{K}\) such that \(U(\cdot ~\vert~ \tilde{y}) = \Pi(\cdot ~\vert~ \tilde{y})\) almost surely.

What about the weights?

\[w(\theta, \tilde{y}) = \frac{\pi(\theta)}{\bar\pi(\theta)} v(\tilde{y})\]

Unstable, high variance?

What about the weights?


Unit weights

\[\hat{w}(\theta, \tilde{y}) = 1\]

Justified asymptotically using the flexibility of \(v(\tilde{y})\)

Unit weights: Theorem 2

Let \(g(x) = \bar \pi(x) / \pi(x)\) be positive and continuous for \(x \in \Theta\).

If an estimator \(\theta^{\ast}_n \equiv \theta^{\ast}(\tilde{y}_{1:n})\) exists such that \(\theta^{\ast}_n \rightarrow z\) a.s. as \(n \rightarrow \infty\) when \(\tilde{y}_i \sim P(\cdot ~\vert~ z)\) for \(z \in \Theta\), then the error from using \(\hat{w} = 1\) satisfies \[\hat{w} - w(\theta,\tilde{y}_{1:n}) \rightarrow 0\] a.s. as \(n \rightarrow \infty\).

Justification of unit weights

Theorem 2 is possible because of the stability function \(v\), whose role is justified by Theorem 1.

\[w(\theta, \tilde{y}) = \frac{\pi(\theta)}{\bar\pi(\theta)} v(\tilde{y})\]

Trick: set \(v(\tilde{y}) = \frac{\bar\pi(\theta^{\ast}_n )}{\pi(\theta^{\ast}_n )}\) and study its large-sample properties.

We don't need to know the estimator \(\theta^{\ast}_n\) explicitly!

Approximate conclusion

Ongoing work: happy to talk

  • Test more approximate likelihoods
  • Test more flexible transformation families

Thanks to

  • Collaborators: Chris, David\(^2\)
  • Helpful commentators: Ming Xu, Aad van der Vaart

Contact

References

Bon, Joshua J, David J Warne, David J Nott, and Christopher Drovandi. 2022. “Bayesian Score Calibration for Approximate Models.” arXiv Preprint arXiv:2211.05357.
Dhananjaneyulu, Venkata, Vidya Nanda Sagar P, Gopalakrishnan Kumar, and Ganesh A Viswanathan. 2012. “Noise Propagation in Two-Step Series MAPK Cascade.” PloS One 7 (5): e35958.
Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477): 359–78.
Ionides, Edward L. 2008. “Truncated Importance Sampling.” Journal of Computational and Graphical Statistics 17 (2): 295–311.
Pacchiardi, Lorenzo, and Ritabrata Dutta. 2022. “Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization.” arXiv Preprint arXiv:2205.15784.
Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2024. “Pareto Smoothed Importance Sampling.” Journal of Machine Learning Research 25 (72): 1–58. http://jmlr.org/papers/v25/19-556.html.
Warne, David J, Thomas P Prescott, Ruth E Baker, and Matthew J Simpson. 2022. “Multifidelity Multilevel Monte Carlo to Accelerate Approximate Bayesian Parameter Inference for Partially Observed Stochastic Processes.” Journal of Computational Physics 469: 111543.

Appendix: Theory vs practice

Sufficiently rich kernel family

  • \(\mathcal{K}\) be a family of Markov kernels
  • \(\mathcal{P}\) be a class of probability measures
  • \(Q\) be a probability measure on \((\mathsf{Y},\mathcal{Y})\), and \(\tilde{y}\sim Q\)
  • \(\Pi(\cdot ~\vert~ \tilde{y})\) be the true posterior at \(\tilde{y}\).

We say \(\mathcal{K}\) is sufficiently rich with respect to \((Q,\mathcal{P})\) if for all \(U \in \mathcal{K}\), \(U(\cdot ~\vert~ \tilde{y}) \in \mathcal{P}\) almost surely and there exists \(U \in \mathcal{K}\) such that \(U(\cdot ~\vert~ \tilde{y}) = \Pi(\cdot ~\vert~ \tilde{y})\) almost surely.

Sufficiently rich kernel family

  • The moment-correcting transformation is typically not sufficiently rich
  • Unit weights give the stabilising function the following property

\(v(\tilde{y}_{1:n}) \rightarrow \delta_{\hat{\theta}_0}(\tilde{\theta}^\ast_n)\) for \(n\rightarrow\infty\) where

  • \(\hat{\theta}_0\) is the MLE for the true data \(y\)
  • \(\tilde{\theta}^\ast_n\) is the MLE for simulated data \(\tilde{y}\)

Sufficiently rich kernel family

\(v(\tilde{y}_{1:n}) \rightarrow \delta_{\hat{\theta}_0}(\tilde{\theta}^\ast_n)\)

Recall the distribution is defined as

\[ Q(\text{d}\tilde{y}) \propto P(\text{d}\tilde{y}) v(\tilde{y}) \]

  • concentrating on simulated datasets consistent with \(\hat{\theta}_0\)
  • simulated datasets with the same sufficient statistics
  • approximate posteriors \(\hat\Pi(\cdot~\vert~\tilde{y})\) will be equal
  • global bias and variance correction terms are sufficiently rich asymptotically

Sufficiently rich kernel family

Results are targeted towards \(\bar\Pi\)

  • finite sample approximation using importance sampling

  • manipulation of the weights (unit weights)

  • Trade-off between:

    • insufficiency of moment-correcting transformation + approximate posterior
    • targeting high-probability regions of \(\bar\Pi\)

Appendix: Another example

Bivariate OU Process (VI)

\[\text{d}X_t = \gamma (\mu - X_t) \text{d}t + \sigma\text{d}W_t\] \[\text{d}Y_t = \gamma (\mu - Y_t) \text{d}t + \sigma\text{d}W_t\] \[Z_t = \rho X_t + (1-\rho)Y_t\]

Model \((X_t,Z_t)\) with the same setup as the univariate case, \((x_0,z_0)=(5,5)\), and use a mean-field variational approximation.

Bivariate OU Process (VI)


Bivariate OU Process (VI)

Correlation summaries

| Posterior      | Mean | St. Dev. |
|----------------|------|----------|
| Approx         | 0.00 | 0.02     |
| Adjust (α=0)   | 0.18 | 0.41     |
| Adjust (α=0.5) | 0.37 | 0.16     |
| Adjust (α=1)   | 0.37 | 0.15     |
| True           | 0.42 | 0.06     |

\(M = 100\) and \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\) scaled by 2

Bivariate OU Process (VI)
