Bayesian scoring rule calibration for approximate models

Joshua J Bon

CEREMADE, Université Paris-Dauphine

Talk overview

  • Motivation
  • Method
  • Examples
  • Theory

Joint work with

  • Christopher Drovandi, QUT
  • David Warne, QUT
  • David Nott, NUS

Bayesian Inference

Approximations in Bayesian inference

Bayesian inference requires likelihoods, but…

  • Intractable likelihoods
    • Use approximations
  • Computationally expensive likelihoods
    • Use approximations
  • Posterior evaluation
    • Use approximations

Approximations everywhere…

Likelihood approximations

  • Limiting distributions
  • Whittle likelihood
  • Model simplification (e.g. SDE \(\rightarrow\) ODE)
  • Linear noise approximation
  • Kalman filter
  • Particle filters

Posterior approximations

  • Laplace
  • INLA
  • Variational inference
  • Pseudo-marginal MCMC
  • Approximate Bayesian Computation

Sampling approximations

  • Importance sampling
  • Sequential Monte Carlo
  • Markov chain Monte Carlo (MH, ULA, MALA, HMC)

A good approximation?

Can we check posteriors based on approximation(s)?

Can we correct posteriors based on approximation(s)?

Correcting approximate posteriors

\[\begin{aligned}&\text{True posterior}:\quad &\Pi(\cdot\vert y) \\ &\text{Approximate posterior}:\quad &\check{\Pi}(\cdot\vert y) \\ &\text{Finite-sample approximate posterior}:\quad &\hat{\Pi}(\cdot\vert y)\end{aligned}\]

Correcting approximate posteriors

  • There is generally theory motivating \(\Pi(\cdot\vert y) \approx \check{\Pi}(\cdot\vert y)\)
    • But not for finite samples and/or a given realisation of the data
  • Theory for \(\check{\Pi}(\cdot\vert y) \approx \hat{\Pi}(\cdot\vert y)\) as \(N \rightarrow \infty\)
    • and empirical tests for finite \(N\)

What about \(\Pi(\cdot\vert y) \approx \hat{\Pi}(\cdot\vert y)\)?

Transforming approximate posteriors

Find a function to correct approximate posteriors

\[\Pi(\cdot\vert y) \approx {\color{green}f}_\sharp\hat{\Pi}(\cdot\vert y)\] using samples from \(\hat{\Pi}(\cdot\vert y)\)

  • No need to rerun MC/MCMC/SMC
    • unlike correcting the likelihood directly
  • Correction acts on all levels of approximation
  • But we still can’t evaluate \(\Pi(\cdot\vert y)\)!

Use a calibration-based approach…

Illustration

Correcting a posterior

Figure 1: Ideal case (approximation in blue)

Correcting a posterior

Figure 2: Real case (approximation in blue)

Correcting many posteriors

Figure 3: Correct on average? (approximation in blue)

Measure the similarity

Let \({\color{blue}K}\) be a Markov kernel, e.g. \({\color{blue}K}(\cdot\vert\tilde{y}) = \hat{\Pi}(\cdot\vert\tilde{y})\).

Measure the similarity to the truth with \(S:\mathcal{P} \times \Theta \rightarrow \mathbb{R}\)

\[S({\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta})\]

\[\tilde{y} \sim P(\cdot \vert {\color{red}\theta}), \quad {\color{red}\theta} \sim {\color{purple}\Pi}\]

  • \({\color{red}\theta}\) is the true data-generating parameter
  • \(({\color{red}\theta}, \tilde{y})\) are prior-predictive pairs
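Generating such prior-predictive pairs by simulation is straightforward; a minimal sketch for a hypothetical conjugate-normal model, where the prior and likelihood are illustrative stand-ins for \(\Pi\) and \(P\):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_predictive(m, n=50):
    """Draw m prior-predictive pairs (theta, y_tilde) from a toy
    model: theta ~ N(0, 1) prior, y_i | theta ~ N(theta, 1)."""
    theta = rng.normal(0.0, 1.0, size=m)                    # theta ~ Pi
    y_tilde = rng.normal(theta[:, None], 1.0, size=(m, n))  # y ~ P(. | theta)
    return theta, y_tilde

theta, y_tilde = sample_prior_predictive(100)
```

Each row of `y_tilde` is one simulated dataset \(\tilde{y}\) paired with its generating \(\theta\).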

Optimise the similarity

Consider the average over prior-predictive pairs \(({\color{red}\theta}, \tilde{y})\)

\[\max_{{\color{blue}K}\in\mathcal{K}}\mathbb{E}[S({\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta})]\]

Optimise the similarity

Consider the average over prior-predictive pairs \(({\color{red}\theta}, \tilde{y})\)

Choose \(\mathcal{K} = \{{\color{green}f}_\sharp\hat\Pi: {\color{green}f} \in \mathcal{F}\}\)

\[\max_{{\color{green}f} \in \mathcal{F}}\mathbb{E}[S({\color{green}f}_\sharp\hat\Pi(\cdot\vert \tilde{y}),{\color{red}\theta})]\]

Correcting many posteriors

Figure 4: Before correction

Correcting many posteriors

Figure 5: After correction

Model calibration by simulation

  1. Correct coverage? Frequentist
    • \(S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert y), {\color{red}\theta}) = \vert 1({\color{red}\theta} \in {\color{purple}\text{CR}_\alpha}) - \alpha \vert\)
  2. Can we correct the entire distribution? Bayesian
    • Proper scoring rules

BSRC method

Scoring rules


\(S(U,x)\) compares probabilistic forecast \(U\) to ground truth \(x\).

\[S(U,V) = \mathbb{E}_{x\sim V} S(U,x)\]

where \(V\) is a probability measure.

Proper scoring rules

(Gneiting and Raftery 2007)


\(S\) is strictly proper with respect to a family \(\mathcal{U}\) if

  • \(S(V,V) \geq S(U, V)\) for all \(U, V \in \mathcal{U}\), and
  • equality holds iff \(U = V\).

\[V = \arg\max_{U \in \mathcal{U}} S(U, V)\]

What we can’t do


\[\Pi(\cdot ~\vert~y ) = \underset{ {\color{green}f}\in \mathcal{F}}{\arg\max}~ S[{\color{green}f}_\sharp {\color{blue}K}(\cdot\vert y),\Pi(\cdot ~\vert~y )]\]

What we can’t do


\[\Pi(\cdot ~\vert~y ) = \underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{\theta \sim \Pi(\cdot ~\vert~y )}\left[S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert y),\theta) \right]\]

What we can do


Consider the joint data-parameter space instead

\[\Pi(\text{d}\theta ~\vert~\tilde{y} )P(\text{d}\tilde{y}) = P(\tilde{y}~\vert~\theta)\Pi(\text{d}\theta )\]

See Pacchiardi and Dutta (2022) and Bon et al. (2022)

What we can do


\[{\color{purple}\Pi}(\cdot ~\vert~y ) = \underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{\tilde{y} \sim P} \mathbb{E}_{{\color{red}\theta} \sim {\color{purple}\Pi}(\cdot ~\vert~\tilde{y} )}\left[S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta}) \right]\]

What we can do


\[{\color{purple}\Pi}(\cdot ~\vert~y ) = \underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{ {\color{red}\theta} \sim {\color{purple}\Pi}} \mathbb{E}_{\tilde{y} \sim P(\cdot ~\vert~{\color{red}\theta})}\left[S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta}) \right]\]

Two steps further


  1. Replace \(P(\text{d}\tilde{y})\) with \(Q(\text{d}\tilde{y}) \propto P(\text{d}\tilde{y})v(\tilde{y})\)

  2. Replace \(\Pi\) with \(\bar{\Pi}\)

    • correct with importance weight

It still works!

BSRC optimisation problem

\[\underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{ {\color{red}\theta} \sim \bar\Pi} \mathbb{E}_{\tilde{y} \sim P(\cdot ~\vert~{\color{red}\theta})}\left[w({\color{red}\theta}, \tilde{y})S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta}) \right]\]

\[w({\color{red}\theta}, \tilde{y}) = \frac{\pi({\color{red}\theta})}{\bar\pi({\color{red}\theta})} v(\tilde{y})\]
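When both prior and importance densities are available, the weight is a direct density ratio; a minimal sketch with hypothetical normal prior and importance distributions, and a unit stability function \(v\) by default:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bsrc_weights(theta, prior=(0.0, 1.0), importance=(0.0, 2.0), v=None):
    """w(theta, y) = pi(theta)/pibar(theta) * v(y); unit stability
    function (v = 1) unless one is supplied."""
    w = normal_pdf(theta, *prior) / normal_pdf(theta, *importance)
    if v is not None:
        w = w * v
    return w
```

With the importance distribution twice as wide as the prior (as in the examples later, where \(\bar\Pi\) is a scaled \(\hat\Pi\)), the ratio down-weights draws in the tails.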

Moment-correcting transformation

Need a family of functions, \(f \in \mathcal{F}\).

Start simple:

\[f(x) = A[x - \hat{\mu}(y)] + \hat{\mu}(y) + b\]

  • Mean \(\hat{\mu}(y) = \mathbb{E}(\hat{\theta}),\quad \hat{\theta} \sim \hat\Pi(\cdot~\vert~y)\) for \(y \in \mathsf{Y}\).

  • \(A\) is a square matrix with positive diagonal elements, such that \(AA^\top\) is positive definite.
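Applied to posterior draws, the push-forward \({\color{green}f}_\sharp\hat\Pi\) is just this affine map evaluated sample-by-sample; a minimal sketch, with the particular \(A\) and \(b\) chosen purely for illustration:

```python
import numpy as np

def moment_correct(samples, A, b):
    """f(x) = A [x - mu_hat(y)] + mu_hat(y) + b applied to each draw;
    mu_hat is the empirical mean of the approximate posterior."""
    mu = samples.mean(axis=0)
    return (samples - mu) @ A.T + mu + b

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(5000, 2))  # draws from a toy approximate posterior
A = np.diag([2.0, 0.5])                   # rescales the spread per coordinate
b = np.array([1.0, -1.0])                 # shifts the mean
y = moment_correct(x, A, b)
```

The transformed draws have mean shifted by \(b\) and spread rescaled by \(A\), while leaving the centring point \(\hat\mu(y)\) data-dependent.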

Bayesian scoring rule calibration

  • Focus on Energy Score: \(S(\Pi, \theta) = \frac{1}{2}\mathbb{E}_{X,X^\prime \sim \Pi}\Vert X - X^\prime\Vert_2^\beta - \mathbb{E}_{X \sim \Pi}\Vert X - \theta\Vert_2^\beta\)
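A Monte Carlo estimate of the energy score from posterior draws can be sketched as follows; the split-half pairwise term is one simple unbiased choice for \(\mathbb{E}\Vert X - X^\prime\Vert^\beta\), not necessarily the estimator used in the paper:

```python
import numpy as np

def energy_score(samples, theta, beta=1.0):
    """Monte Carlo estimate of the (positively oriented) energy score
    0.5 E||X - X'||^beta - E||X - theta||^beta from draws X ~ Pi."""
    n = len(samples)
    # independent halves give i.i.d. pairs (X, X') for the first term
    x, xp = samples[: n // 2], samples[n // 2:]
    m = min(len(x), len(xp))
    pair = np.linalg.norm(x[:m] - xp[:m], axis=-1) ** beta
    close = np.linalg.norm(samples - theta, axis=-1) ** beta
    return 0.5 * pair.mean() - close.mean()
```

As a sanity check, a forecast centred on the truth scores higher than the same forecast evaluated against a distant \(\theta\).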

Examples

OU Process

\[\text{d}X_t = \gamma (\mu - X_t) \text{d}t + \sigma\text{d}W_t\]

Observe the process at the final time \(T\) (\(n = 100\) observations):

\[X_T \sim \mathcal{N}\left( \mu + (x_0 - \mu)e^{-\gamma T}, \frac{D}{\gamma}(1- e^{-2\gamma T}) \right)\]

where \(D = \frac{\sigma^2}{2}\). Fix \(\gamma = 2\), \(T=1\), \(x_0 = 10\)
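Sampling \(X_T\) from this exact transition density is immediate; a minimal sketch using the parameter values fixed above:

```python
import numpy as np

def ou_exact_sample(mu, D, gamma=2.0, T=1.0, x0=10.0, n=100, rng=None):
    """Draw n i.i.d. copies of X_T from the exact OU transition
    N(mu + (x0 - mu) e^{-gamma T}, (D/gamma)(1 - e^{-2 gamma T}))."""
    rng = rng or np.random.default_rng()
    mean = mu + (x0 - mu) * np.exp(-gamma * T)
    var = (D / gamma) * (1.0 - np.exp(-2.0 * gamma * T))
    return rng.normal(mean, np.sqrt(var), size=n)
```

The limiting approximation below simply replaces this with the stationary distribution \(\mathcal{N}(\mu, D/\gamma)\), dropping the \(x_0\)-dependent terms.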

OU Process (limiting approximation)

Infer \(\mu\) and \(D\) with approximate likelihood based on

\[X_\infty \sim \mathcal{N}\left(\mu, \frac{D}{\gamma}\right)\]

OU Process (limiting approximation)


\(M = 100\) and \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\) scaled by 2

OU Process (limiting approximation)

Comparison for \(\mu\)

| Posterior    | MSE  | Bias  | St. Dev. | Coverage (90%) |
|--------------|------|-------|----------|----------------|
| Approx       | 1.54 | 1.21  | 0.22     | 0%             |
| Adjust (α=0) | 0.12 | 0.15  | 0.20     | 64%            |
| Adjust (α=1) | 0.12 | 0.15  | 0.23     | 82%            |
| True         | 0.12 | -0.01 | 0.26     | 94%            |

Estimated from independent replications of the method.

OU Process (limiting approximation)

Comparison for \(D\)

| Posterior    | MSE  | Bias | St. Dev. | Coverage (90%) |
|--------------|------|------|----------|----------------|
| Approx       | 4.73 | 0.18 | 1.46     | 85%            |
| Adjust (α=0) | 4.83 | 0.28 | 1.24     | 72%            |
| Adjust (α=1) | 5.13 | 0.42 | 1.45     | 83%            |
| True         | 5.00 | 0.37 | 1.48     | 85%            |

Estimated from independent replications of the method.

Lotka-Volterra SDE

\[\text{d} X_{t} = (\beta_1 X_{t} - \beta_2 X_{t} Y_{t} ) \text{d} t+ \sigma_1 \text{d} B_{t}^{1}\] \[\text{d} Y_{t} = (\beta_4 X_{t} Y_{t} - \beta_3 Y_{t} ) \text{d} t+ \sigma_2 \text{d} B_{t}^{2}\]

Figure 6: Realisation of Lotka-Volterra SDE
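A realisation like the one in Figure 6 can be generated with a basic Euler-Maruyama discretisation; a minimal sketch, in which the parameter values, step size, and zero-clipping of the states are illustrative assumptions:

```python
import numpy as np

def lotka_volterra_em(beta, sigma, x0=100.0, y0=50.0, dt=0.01, T=10.0, rng=None):
    """Euler-Maruyama sketch of the Lotka-Volterra SDE; states are
    clipped at zero to keep populations non-negative."""
    rng = rng or np.random.default_rng()
    b1, b2, b3, b4 = beta
    n = int(round(T / dt))
    x, y = np.empty(n + 1), np.empty(n + 1)
    x[0], y[0] = x0, y0
    for i in range(n):
        dB1, dB2 = rng.normal(0.0, np.sqrt(dt), size=2)
        x[i + 1] = max(x[i] + (b1 * x[i] - b2 * x[i] * y[i]) * dt + sigma[0] * dB1, 0.0)
        y[i + 1] = max(y[i] + (b4 * x[i] * y[i] - b3 * y[i]) * dt + sigma[1] * dB2, 0.0)
    return x, y
```

This forward simulator is what the EKF approximation on the next slide replaces with a sequence of Gaussian predict/update steps.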

Lotka-Volterra SDE

Use the extended Kalman filter (EKF) as an approximate likelihood

  • Discretise the SDE at time points \(t_0,t_1,\ldots,t_n\)
  • Assume Gaussianity at each \(t_i\)
  • Approximate the transition dynamics by linearisation
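The three steps above amount to a standard EKF recursion; a generic predict/update sketch, where the function names and the linear-Gaussian observation model \(y = Hx + \text{noise}\) are assumptions for illustration:

```python
import numpy as np

def ekf_step(m, P, f, F_jac, Q, y, H, R):
    """One extended Kalman filter cycle: propagate the Gaussian state
    through the (nonlinear) transition f, linearised via its Jacobian
    F_jac, then condition on observation y = H x + N(0, R) noise."""
    # predict: mean through f, covariance through the Jacobian
    m_pred = f(m)
    F = F_jac(m)
    P_pred = F @ P @ F.T + Q
    # update: standard Kalman correction against the observation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = (np.eye(len(m)) - K @ H) @ P_pred
    return m_new, P_new
```

The Gaussian predictive density \(\mathcal{N}(H m_\text{pred}, S)\) evaluated at each observation is what supplies the approximate likelihood.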

BSRC settings

  • \(M = 200\)
  • \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\) scaled by 2

Lotka-Volterra SDE


Lotka-Volterra SDE


MAPK: Biological network

Two-step Mitogen Activated Protein Kinase enzymatic cascade (Dhananjaneyulu et al. 2012; Warne et al. 2022)

\[ \begin{aligned} X + E &\overset{k_1}{\rightarrow} [XE],\\ X^{a} + P_1 &\overset{k_4}{\rightarrow} [X^{a}P_1], \\ X^{a} + Y &\overset{k_7}{\rightarrow} [X^{a}Y], \\ Y^{a} + P_2 &\overset{k_{10}}{\rightarrow} [Y^{a}P_2] \\ \end{aligned}\]

\[ \begin{aligned} ~[XE] &\overset{k_2}{\rightarrow} X + E, \\ [X^{a}P_1] &\overset{k_5}{\rightarrow} X^{a} + P_1, \\ [X^{a}Y] &\overset{k_8}{\rightarrow} X^{a} + Y, \\ [Y^{a}P_2] &\overset{k_{11}}{\rightarrow} Y^{a} + P_2, \\ \end{aligned}\]

\[ \begin{aligned} ~[XE] &\overset{k_3}{\rightarrow} X^{a}+ E, \\ [X^{a}P_1] &\overset{k_6}{\rightarrow} X + P_1, \\ [X^{a}Y] &\overset{k_9}{\rightarrow} X^{a} + Y^{a}, \\ [Y^{a}P_2] &\overset{k_{12}}{\rightarrow} Y + P_2. \\ \end{aligned}\]

  • \(k_1,\ldots,k_{12}\) are kinetic rate parameters
  • Cascade reactions essential for cell signalling processes
  • Translates to Markov jump process

MAPK: Biological network

Figure 7: Realisation of MAPK MJP

MAPK: Biological network

Figure 8: Realisation of MAPK MJP

MAPK: Biological network

Figure 9: Observed molecules from MAPK MJP

MAPK: Biological network

Assume observation process

\[[X^\text{obs}_t~Y^\text{obs}_t] \sim \mathcal{N}([X^a_t, Y^a_t], \sigma^2)\] at times \(t \in \{0,4,8,\ldots,200\}\) with \(\sigma = 1\).

Approximations:

  1. Large count limit (MJP \(\rightarrow\) SDE)
  2. EKF for likelihood approximation (\(dt = 1\))
  3. \(N=1000\) samples from HMC

BSRC settings: \(M = 200\) and \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\), scaled by 1.5

MAPK: Biological network


MAPK: Biological network


Theory

Supporting theory

  • Strictly proper scoring rule \(S\) w.r.t. \(\mathcal{P}\)

  • Importance distribution \(\bar\Pi\) on \((\Theta,\vartheta)\)

    • \(\Pi \ll \bar\Pi\) with density \(\bar\pi\)
  • Stability function \(v:\mathsf{Y} \rightarrow [0,\infty)\)

    • measurable under \(P\) on \((\mathsf{Y}, \mathcal{Y})\)
  • \(Q(\text{d} \tilde{y}) \propto P(\text{d} \tilde{y})v(\tilde{y})\)

  • Family of kernels \(\mathcal{K}\)

Supporting theory: Theorem 1

If \(\mathcal{K}\) is sufficiently rich, then the Markov kernel

\[\bbox[5pt,border: 1px solid blue]{\color{black}K^{\star} \equiv \underset{K \in \mathcal{K}}{\arg\max}~\mathbb{E}_{\theta \sim \bar\Pi} \mathbb{E}_{\tilde{y} \sim P(\cdot ~\vert~ \theta)}\left[w(\theta, \tilde{y}) S(K(\cdot ~\vert~ \tilde{y}),\theta) \right]}\]

where \(w(\theta, \tilde{y}) = \frac{\pi(\theta)}{\bar\pi(\theta)} v(\tilde{y})\), satisfies

\[\bbox[5pt,border: 1px solid blue]{K^{\star}(\cdot ~\vert~ \tilde{y}) = \Pi(\cdot ~\vert~ \tilde{y})}\] almost surely.

Sufficiently rich kernel family

  • \(\mathcal{K}\) be a family of Markov kernels
  • \(\mathcal{P}\) be a class of probability measures
  • \(Q\) be a probability measure on \((\mathsf{Y},\mathcal{Y})\), and \(\tilde{y}\sim Q\)
  • \(\Pi(\cdot ~\vert~ \tilde{y})\) be the true posterior at \(\tilde{y}\).

We say \(\mathcal{K}\) is sufficiently rich with respect to \((Q,\mathcal{P})\) if for all \(U \in \mathcal{K}\), \(U(\cdot ~\vert~ \tilde{y}) \in \mathcal{P}\) almost surely and there exists \(U \in \mathcal{K}\) such that \(U(\cdot ~\vert~ \tilde{y}) = \Pi(\cdot ~\vert~ \tilde{y})\) almost surely.

What about the weights?

\[w(\theta, \tilde{y}) = \frac{\pi(\theta)}{\bar\pi(\theta)} v(\tilde{y})\]

Unstable, high variance?

What about the weights?


Unit weights

\[\hat{w}(\theta, \tilde{y}) = 1\]

Justified asymptotically using the flexibility of \(v(\tilde{y})\)

Unit weights: Theorem 2

Let \(g(x) = \bar \pi(x) / \pi(x)\) be positive and continuous for \(x \in \Theta\).

If an estimator \(\theta^{\ast}_n \equiv \theta^{\ast}(\tilde{y}_{1:n})\) exists such that \(\theta^{\ast}_n \rightarrow z\) a.s. as \(n \rightarrow \infty\) when \(\tilde{y}_i \sim P(\cdot ~\vert~ z)\) for \(z \in \Theta\), then the error from using \(\hat{w} = 1\) satisfies \[\hat{w} - w(\theta,\tilde{y}_{1:n}) \rightarrow 0\] a.s. as \(n \rightarrow \infty\).

Justification of unit weights

Theorem 2 is possible because of the stability function \(v\), whose role is justified by Theorem 1.

\[w(\theta, \tilde{y}) = \frac{\pi(\theta)}{\bar\pi(\theta)} v(\tilde{y})\]

Trick: set \(v(\tilde{y}) = \frac{\bar\pi(\theta^{\ast}_n )}{\pi(\theta^{\ast}_n )}\) and study its large-sample properties.

We don't need to know the estimator \(\theta^{\ast}_n\) explicitly!

Approximate conclusion

Ongoing work: happy to talk

  • Test more approximate likelihoods
  • Test more flexible transformation families

Thanks to

  • Collaborators: Chris, David\(^2\)
  • Helpful commentators: Ming Xu, Aad van der Vaart

Contact

References

Bon, Joshua J, David J Warne, David J Nott, and Christopher Drovandi. 2022. “Bayesian Score Calibration for Approximate Models.” arXiv Preprint arXiv:2211.05357.
Dhananjaneyulu, Venkata, Vidya Nanda Sagar P, Gopalakrishnan Kumar, and Ganesh A Viswanathan. 2012. “Noise Propagation in Two-Step Series MAPK Cascade.” PloS One 7 (5): e35958.
Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477): 359–78.
Ionides, Edward L. 2008. “Truncated Importance Sampling.” Journal of Computational and Graphical Statistics 17 (2): 295–311.
Pacchiardi, Lorenzo, and Ritabrata Dutta. 2022. “Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization.” arXiv Preprint arXiv:2205.15784.
Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2024. “Pareto Smoothed Importance Sampling.” Journal of Machine Learning Research 25 (72): 1–58. http://jmlr.org/papers/v25/19-556.html.
Warne, David J, Thomas P Prescott, Ruth E Baker, and Matthew J Simpson. 2022. “Multifidelity Multilevel Monte Carlo to Accelerate Approximate Bayesian Parameter Inference for Partially Observed Stochastic Processes.” Journal of Computational Physics 469: 111543.

Appendix: Theory vs practice

Sufficiently rich kernel family

  • \(\mathcal{K}\) be a family of Markov kernels
  • \(\mathcal{P}\) be a class of probability measures
  • \(Q\) be a probability measure on \((\mathsf{Y},\mathcal{Y})\), and \(\tilde{y}\sim Q\)
  • \(\Pi(\cdot ~\vert~ \tilde{y})\) be the true posterior at \(\tilde{y}\).

We say \(\mathcal{K}\) is sufficiently rich with respect to \((Q,\mathcal{P})\) if for all \(U \in \mathcal{K}\), \(U(\cdot ~\vert~ \tilde{y}) \in \mathcal{P}\) almost surely and there exists \(U \in \mathcal{K}\) such that \(U(\cdot ~\vert~ \tilde{y}) = \Pi(\cdot ~\vert~ \tilde{y})\) almost surely.

Sufficiently rich kernel family

  • The moment-correcting transformation is typically not sufficiently rich
  • Unit weights give the stabilising function the following property

\(v(\tilde{y}_{1:n}) \rightarrow \delta_{\hat{\theta}_0}(\tilde{\theta}^\ast_n)\) for \(n\rightarrow\infty\) where

  • \(\hat{\theta}_0\) is the MLE for the true data \(y\)
  • \(\tilde{\theta}^\ast_n\) is the MLE for simulated data \(\tilde{y}\)

Sufficiently rich kernel family

\(v(\tilde{y}_{1:n}) \rightarrow \delta_{\hat{\theta}_0}(\tilde{\theta}^\ast_n)\)

Recall the distribution is defined as

\[ Q(\text{d}\tilde{y}) \propto P(\text{d}\tilde{y}) v(\tilde{y}) \]

  • concentrating on simulated datasets consistent with \(\hat{\theta}_0\)
  • simulated datasets with the same sufficient statistics
  • approximate posteriors \(\hat\Pi(\cdot~\vert~\tilde{y})\) will be equal
  • global bias and variance correction terms are sufficiently rich asymptotically

Sufficiently rich kernel family

Results are targeted towards \(\bar\Pi\)

  • finite sample approximation using importance sampling

  • manipulation of the weights (unit weights)

  • Trade-off between:

    • insufficiency of moment-correcting transformation + approximate posterior
    • targeting high-probability regions of \(\bar\Pi\)

Appendix: Another example

Bivariate OU Process (VI)

\[\text{d}X_t = \gamma (\mu - X_t) \text{d}t + \sigma\text{d}W_t\] \[\text{d}Y_t = \gamma (\mu - Y_t) \text{d}t + \sigma\text{d}W_t\] \[Z_t = \rho X_t + (1-\rho)Y_t\]

Model \((X_t,Z_t)\) with the same setup as the univariate case, \((x_0,z_0)=(5,5)\), and use a mean-field variational approximation.

Bivariate OU Process (VI)


Bivariate OU Process (VI)

Correlation summaries

| Posterior      | Mean | St. Dev. |
|----------------|------|----------|
| Approx         | 0.00 | 0.02     |
| Adjust (α=0)   | 0.18 | 0.41     |
| Adjust (α=0.5) | 0.37 | 0.16     |
| Adjust (α=1)   | 0.37 | 0.15     |
| True           | 0.42 | 0.06     |

\(M = 100\) and \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\) scaled by 2

Bivariate OU Process (VI)
