CEREMADE, Université Paris-Dauphine
Joint work with
Bayesian inference requires likelihoods, but…
\[\begin{aligned}&\text{True posterior}:\quad &\Pi(\cdot\vert y) \\ &\text{Approximate posterior}:\quad &\check{\Pi}(\cdot\vert y) \\ &\text{Finite-sample approximate posterior}:\quad &\hat{\Pi}(\cdot\vert y)\end{aligned}\]
What about \(\Pi(\cdot\vert y) \approx \hat{\Pi}(\cdot\vert y)\)?
Find a function to correct approximate posteriors
\[\Pi(\cdot\vert y) \approx {\color{green}f}_\sharp\hat{\Pi}(\cdot\vert y)\] using samples from \(\hat{\Pi}(\cdot\vert y)\)
Use a calibration-based approach…
Let \({\color{blue}K}\) be a Markov kernel, e.g. \({\color{blue}K}(\cdot\vert\tilde{y}) = \hat{\Pi}(\cdot\vert\tilde{y})\).
Measure the similarity to the truth with \(S:\mathcal{P} \times \Theta \rightarrow \mathbb{R}\)
\[S({\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta})\]
\[\tilde{y} \sim P(\cdot \vert {\color{red}\theta}), \quad {\color{red}\theta} \sim {\color{purple}\Pi}\]
Consider the average over prior-predictive pairs \(({\color{red}\theta}, \tilde{y})\)
\[\max_{{\color{blue}K}\in\mathcal{K}}\mathbb{E}[S({\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta})]\]
Choose \(\mathcal{K} = \{{\color{green}f}_\sharp\hat\Pi: {\color{green}f} \in \mathcal{F}\}\)
\[\max_{{\color{green}f} \in \mathcal{F}}\mathbb{E}[S({\color{green}f}_\sharp\hat\Pi(\cdot\vert \tilde{y}),{\color{red}\theta})]\]
\(S(U,x)\) compares probabilistic forecast \(U\) to ground truth \(x\).
\[S(U,V) = \mathbb{E}_{x\sim V} S(U,x)\]
where \(V\) is a probability measure.
\(S\) is strictly proper (w.r.t. \(\mathcal{U}\)) if \(V\) is the unique maximiser:
\[V = \arg\max_{U \in \mathcal{U}} S(U, V)\]
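A common strictly proper choice is the energy score. Below is a minimal Monte Carlo sketch of a sample-based estimator; the function name, the split-half estimate of \(\mathbb{E}\lVert X - X'\rVert\), and the example values are illustrative, not taken from the slides.

```python
import numpy as np

def energy_score(samples, x):
    """Negatively oriented energy score ES(U, x) = E||X - x|| - 0.5 E||X - X'||,
    estimated from `samples` of shape (m, d) drawn from U against truth x of shape (d,).
    The slides maximise a positively oriented S, so take S(U, x) = -ES(U, x)."""
    term1 = np.mean(np.linalg.norm(samples - x, axis=1))
    half = samples.shape[0] // 2  # split-half estimate of E||X - X'||
    term2 = np.mean(np.linalg.norm(samples[:half] - samples[half:2 * half], axis=1))
    return term1 - 0.5 * term2

# toy check: a forecast matching the truth scores better (higher -ES) than a biased one
rng = np.random.default_rng(0)
truth = rng.normal(size=2)
good = rng.normal(size=(2000, 2))   # forecast matching the true N(0, I)
bad = good + 3.0                    # badly biased forecast
print(-energy_score(good, truth), -energy_score(bad, truth))
```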
\[\Pi(\cdot ~\vert~y ) = \underset{ {\color{green}f}\in \mathcal{F}}{\arg\max}~ S[{\color{green}f}_\sharp {\color{blue}K}(\cdot\vert y),\Pi(\cdot ~\vert~y )]\]
\[\Pi(\cdot ~\vert~y ) = \underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{\theta \sim \Pi(\cdot ~\vert~y )}\left[S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert y),\theta) \right]\]
Consider the joint data-parameter space instead
\[\Pi(\text{d}\theta ~\vert~\tilde{y} )P(\text{d}\tilde{y}) = P(\text{d}\tilde{y}~\vert~\theta)\Pi(\text{d}\theta )\]
\[{\color{purple}\Pi}(\cdot ~\vert~y ) = \underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{\tilde{y} \sim P} \mathbb{E}_{{\color{red}\theta} \sim {\color{purple}\Pi}(\cdot ~\vert~\tilde{y} )}\left[S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta}) \right]\]
\[{\color{purple}\Pi}(\cdot ~\vert~y ) = \underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{ {\color{red}\theta} \sim {\color{purple}\Pi}} \mathbb{E}_{\tilde{y} \sim P(\cdot ~\vert~{\color{red}\theta})}\left[S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta}) \right]\]
Replace \(P(\text{d}\tilde{y})\) with \(Q(\text{d}\tilde{y}) \propto P(\text{d}\tilde{y})v(\tilde{y})\)
Replace \(\Pi\) with \(\bar{\Pi}\)
It still works!
\[\underset{ {\color{green}f} \in \mathcal{F}}{\arg\max}~ \mathbb{E}_{ {\color{red}\theta} \sim \bar\Pi} \mathbb{E}_{\tilde{y} \sim P(\cdot ~\vert~{\color{red}\theta})}\left[w({\color{red}\theta}, \tilde{y})S({\color{green}f}_\sharp {\color{blue}K}(\cdot\vert \tilde{y}),{\color{red}\theta}) \right]\]
\[w({\color{red}\theta}, \tilde{y}) = \frac{\pi({\color{red}\theta})}{\bar\pi({\color{red}\theta})} v(\tilde{y})\]
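In practice this expectation is estimated with \(M\) prior-predictive pairs drawn from \(\bar\Pi\); a minimal Monte Carlo sketch of the resulting objective (the superscript \((i)\) notation is mine):

\[\theta^{(i)} \sim \bar\Pi, \qquad \tilde{y}^{(i)} \sim P(\cdot~\vert~\theta^{(i)}), \qquad i = 1,\ldots,M,\]
\[\underset{{\color{green}f} \in \mathcal{F}}{\arg\max}~ \frac{1}{M}\sum_{i=1}^{M} w(\theta^{(i)}, \tilde{y}^{(i)})\, S\!\left({\color{green}f}_\sharp {\color{blue}K}(\cdot~\vert~\tilde{y}^{(i)}), \theta^{(i)}\right)\]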
Need a family of functions, \(f \in \mathcal{F}\).
Start simple:
\[f(x) = A[x - \hat{\mu}(y)] + \hat{\mu}(y) + b\]
Mean \(\hat{\mu}(y) = \mathbb{E}(\hat{\theta}),\quad \hat{\theta} \sim \hat\Pi(\cdot~\vert~y)\) for \(y \in \mathsf{Y}\).
\(A\) is a square matrix with positive elements on the diagonal, such that \(AA^\top\) is positive definite.
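To make the calibration step concrete, here is a minimal sketch on a toy conjugate-Gaussian problem. The toy prior and data model, the deliberately biased Gaussian stand-in for \(\hat\Pi(\cdot~\vert~y)\), the scalar parameterisation of \(A\) via \(\exp(\log a)\), the unit weights, and the Nelder-Mead optimiser are all illustrative assumptions, not the slides' implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
M, n, m = 200, 10, 200   # calibration pairs, data size, posterior sample size

def energy_score(samples, x):
    # negatively oriented energy score for scalar theta (lower is better)
    half = len(samples) // 2
    return np.mean(np.abs(samples - x)) - 0.5 * np.mean(
        np.abs(samples[:half] - samples[half:2 * half]))

def approx_posterior_samples(ybar, size):
    # stand-in for the approximate posterior: deliberately biased and overconfident
    return rng.normal(ybar + 0.5, 0.3, size=size)

# prior-predictive pairs: theta ~ N(0, 1), ybar = mean of n obs from N(theta, 1)
theta = rng.normal(0.0, 1.0, size=M)
ybar = theta + rng.normal(0.0, 1.0 / np.sqrt(n), size=M)

post = np.array([approx_posterior_samples(yb, m) for yb in ybar])
mu_hat = post.mean(axis=1, keepdims=True)            # \hat\mu(y) per dataset

def objective(params):
    log_a, b = params
    a = np.exp(log_a)                                 # keeps the "A" factor positive
    adjusted = a * (post - mu_hat) + mu_hat + b       # f applied sample-wise
    # unit weights w = 1 (exact here: theta is drawn from the prior and v = 1)
    return np.mean([energy_score(adjusted[i], theta[i]) for i in range(M)])

res = minimize(objective, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
a_opt, b_opt = np.exp(res.x[0]), res.x[1]
print("scale a =", round(a_opt, 2), "shift b =", round(b_opt, 2))
```

The intent is that \(b\) absorbs the bias of the approximate posterior while \(a\) corrects its understated spread.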
\[\text{d}X_t = \gamma (\mu - X_t) \text{d}t + \sigma\text{d}W_t\]
Observe the final value at time \(T\) (\(n = 100\) observations):
\[X_T \sim \mathcal{N}\left( \mu + (x_0 - \mu)e^{-\gamma T}, \frac{D}{\gamma}(1- e^{-2\gamma T}) \right)\]
where \(D = \frac{\sigma^2}{2}\). Fix \(\gamma = 2\), \(T=1\), \(x_0 = 10\).
Infer \(\mu\) and \(D\) with an approximate likelihood based on the stationary distribution
\[X_\infty \sim \mathcal{N}\left(\mu, \frac{D}{\gamma}\right)\]
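A minimal sketch contrasting the exact likelihood of \(X_T\) with the stationary approximation; the chosen true values \(\mu = 1\), \(D = 1\) and the simulated data are illustrative assumptions, not the slides' setup.

```python
import numpy as np
from scipy.stats import norm

gamma, T, x0 = 2.0, 1.0, 10.0           # fixed in the slides
mu_true, D_true = 1.0, 1.0              # illustrative "true" values (assumption)

def exact_loglik(y, mu, D):
    # exact Gaussian distribution of X_T for the OU process
    mean = mu + (x0 - mu) * np.exp(-gamma * T)
    var = (D / gamma) * (1.0 - np.exp(-2.0 * gamma * T))
    return norm.logpdf(y, mean, np.sqrt(var)).sum()

def approx_loglik(y, mu, D):
    # approximate likelihood based on the stationary law X_inf ~ N(mu, D / gamma)
    return norm.logpdf(y, mu, np.sqrt(D / gamma)).sum()

rng = np.random.default_rng(2)
mean_T = mu_true + (x0 - mu_true) * np.exp(-gamma * T)
var_T = (D_true / gamma) * (1.0 - np.exp(-2.0 * gamma * T))
y = rng.normal(mean_T, np.sqrt(var_T), size=100)   # n = 100 observations of X_T

# The approximation drops the e^{-gamma T}(x0 - mu) pull towards x0 = 10, so treating
# X_T as stationary tends to bias mu upwards (cf. the Approx row in the table below).
print(exact_loglik(y, mu_true, D_true), approx_loglik(y, mu_true, D_true))
```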
\(M = 100\) and \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\) scaled by 2
Comparison for \(\mu\)
Posterior | MSE | Bias | St. Dev. | Coverage of 90% CI (%) |
---|---|---|---|---|
Approx | 1.54 | 1.21 | 0.22 | 0 |
Adjust (α=0) | 0.12 | 0.15 | 0.20 | 64 |
Adjust (α=1) | 0.12 | 0.15 | 0.23 | 82 |
True | 0.12 | -0.01 | 0.26 | 94 |
Estimated from independent replications of the method.
Comparison for \(D\)
Posterior | MSE | Bias | St. Dev. | Coverage of 90% CI (%) |
---|---|---|---|---|
Approx | 4.73 | 0.18 | 1.46 | 85 |
Adjust (α=0) | 4.83 | 0.28 | 1.24 | 72 |
Adjust (α=1) | 5.13 | 0.42 | 1.45 | 83 |
True | 5.00 | 0.37 | 1.48 | 85 |
Estimated from independent replications of the method.
\[\text{d} X_{t} = (\beta_1 X_{t} - \beta_2 X_{t} Y_{t} ) \text{d} t+ \sigma_1 \text{d} B_{t}^{1}\] \[\text{d} Y_{t} = (\beta_4 X_{t} Y_{t} - \beta_3 Y_{t} ) \text{d} t+ \sigma_2 \text{d} B_{t}^{2}\]
Use an extended Kalman filter as the approximate likelihood
BSRC settings
Two-step Mitogen-Activated Protein Kinase (MAPK) enzymatic cascade (Dhananjaneyulu et al. 2012; Warne et al. 2022)
\[ \begin{aligned} X + E &\overset{k_1}{\rightarrow} [XE],\\ X^{a} + P_1 &\overset{k_4}{\rightarrow} [X^{a}P_1], \\ X^{a} + Y &\overset{k_7}{\rightarrow} [X^{a}Y], \\ Y^{a} + P_2 &\overset{k_{10}}{\rightarrow} [Y^{a}P_2] \\ \end{aligned}\]
\[ \begin{aligned} ~[XE] &\overset{k_2}{\rightarrow} X + E, \\ [X^{a}P_1] &\overset{k_5}{\rightarrow} X^{a} + P_1, \\ [X^{a}Y] &\overset{k_8}{\rightarrow} X^{a} + Y, \\ [Y^{a}P_2] &\overset{k_{11}}{\rightarrow} Y^{a} + P_2, \\ \end{aligned}\]
\[ \begin{aligned} ~[XE] &\overset{k_3}{\rightarrow} X^{a}+ E, \\ [X^{a}P_1] &\overset{k_6}{\rightarrow} X + P_1, \\ [X^{a}Y] &\overset{k_9}{\rightarrow} X^{a} + Y^{a}, \\ [Y^{a}P_2] &\overset{k_{12}}{\rightarrow} Y + P_2. \\ \end{aligned}\]
Assume the observation process
\[[X^\text{obs}_t, Y^\text{obs}_t] \sim \mathcal{N}([X^a_t, Y^a_t], \sigma^2 I)\] at times \(t \in \{0,4,8,\ldots,200\}\) with \(\sigma = 1\).
Approximations:
BSRC settings: \(M = 200\) and \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\), scaled by 1.5
Strictly proper scoring rule \(S\) w.r.t. \(\mathcal{P}\)
Importance distribution \(\bar\Pi\) on \((\Theta,\vartheta)\)
Stability function \(v:\mathsf{Y} \rightarrow [0,\infty)\)
\(Q(\text{d} \tilde{y}) \propto P(\text{d} \tilde{y})v(\tilde{y})\)
Family of kernels \(\mathcal{K}\)
If \(\mathcal{K}\) is sufficiently rich then the Markov kernel,
\[\bbox[5pt,border: 1px solid blue]{\color{black}K^{\star} \equiv \underset{K \in \mathcal{K}}{\arg\max}~\mathbb{E}_{\theta \sim \bar\Pi} \mathbb{E}_{\tilde{y} \sim P(\cdot ~\vert~ \theta)}\left[w(\theta, \tilde{y}) S(K(\cdot ~\vert~ \tilde{y}),\theta) \right]}\]
where \(w(\theta, \tilde{y}) = \frac{\pi(\theta)}{\bar\pi(\theta)} v(\tilde{y})\), satisfies
\[\bbox[5pt,border: 1px solid blue]{K^{\star}(\cdot ~\vert~ \tilde{y}) = \Pi(\cdot ~\vert~ \tilde{y})}\] almost surely.
We say \(\mathcal{K}\) is sufficiently rich with respect to \((Q,\mathcal{P})\) if for all \(U \in \mathcal{K}\), \(U(\cdot ~\vert~ \tilde{y}) \in \mathcal{P}\) almost surely and there exists \(U \in \mathcal{K}\) such that \(U(\cdot ~\vert~ \tilde{y}) = \Pi(\cdot ~\vert~ \tilde{y})\) almost surely.
\[w(\theta, \tilde{y}) = \frac{\pi(\theta)}{\bar\pi(\theta)} v(\tilde{y})\]
Unstable, high variance?
Unit weights
\[\hat{w}(\theta, \tilde{y}) = 1\]
Justified asymptotically using the flexibility of \(v(\tilde{y})\)
Let \(g(x) = \bar \pi(x) / \pi(x)\) be positive and continuous for \(x \in \Theta\).
If an estimator \(\theta^{\ast}_n \equiv \theta^{\ast}(\tilde{y}_{1:n})\) exists such that \(\theta^{\ast}_n \rightarrow z\) a.s. as \(n \rightarrow \infty\) when \(\tilde{y}_i \sim P(\cdot ~\vert~ z)\) for \(z \in \Theta\), then the error when using \(\hat{w} = 1\) satisfies \[\hat{w} - w(\theta,\tilde{y}_{1:n}) \rightarrow 0\] a.s. as \(n \rightarrow \infty\).
Theorem 2 is possible because of the stability function justified by Theorem 1.
\[w(\theta, \tilde{y}) = \frac{\pi(\theta)}{\bar\pi(\theta)} v(\tilde{y})\]
Trick: set \(v(\tilde{y}) = \frac{\bar\pi(\theta^{\ast}_n )}{\pi(\theta^{\ast}_n )}\) and look at large sample properties.
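A one-line sketch of why this choice of \(v\) justifies \(\hat{w} = 1\): since \(\theta^{\ast}_n \rightarrow \theta\) a.s. under \(\tilde{y}_i \sim P(\cdot~\vert~\theta)\) and \(g = \bar\pi/\pi\) is continuous,

\[w(\theta, \tilde{y}_{1:n}) = \frac{\pi(\theta)}{\bar\pi(\theta)}\, \frac{\bar\pi(\theta^{\ast}_n)}{\pi(\theta^{\ast}_n)} \;\longrightarrow\; \frac{\pi(\theta)}{\bar\pi(\theta)}\, \frac{\bar\pi(\theta)}{\pi(\theta)} = 1 = \hat{w} \quad \text{a.s. as } n \rightarrow \infty.\]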
Don’t need to explicitly know the estimator \(\theta^{\ast}_n\)!
Ongoing work: happy to talk
Thanks to
We say \(\mathcal{K}\) is sufficiently rich with respect to \((Q,\mathcal{P})\) if for all \(U \in \mathcal{K}\), \(U(\cdot ~\vert~ \tilde{y}) \in \mathcal{P}\) almost surely and there exists \(U \in \mathcal{K}\) such that \(U(\cdot ~\vert~ \tilde{y}) = \Pi(\cdot ~\vert~ \tilde{y})\) almost surely.
\(v(\tilde{y}_{1:n}) \rightarrow \delta_{\hat{\theta}_0}(\tilde{\theta}^\ast_n)\) as \(n\rightarrow\infty\)
Recall the distribution \(Q\) is defined as
\[ Q(\text{d}\tilde{y}) \propto P(\text{d}\tilde{y}) v(\tilde{y}) \]
Results are targeted towards \(\bar\Pi\)
Trade-off between:
- finite-sample approximation using importance sampling
- manipulation of the weights (unit weights)
\[\text{d}X_t = \gamma (\mu - X_t) \text{d}t + \sigma\text{d}W_t^{(1)}\] \[\text{d}Y_t = \gamma (\mu - Y_t) \text{d}t + \sigma\text{d}W_t^{(2)}\] \[Z_t = \rho X_t + (1-\rho)Y_t\]
Model \((X_t,Z_t)\) with the same setup as the univariate case, \((x_0,z_0)=(5,5)\), and use a mean-field variational approximation. The mean-field family factorises, so it cannot represent posterior correlation.
Correlation summaries
Posterior | Mean | St. Dev. |
---|---|---|
Approx | 0.00 | 0.02 |
Adjust (α=0) | 0.18 | 0.41 |
Adjust (α=0.5) | 0.37 | 0.16 |
Adjust (α=1) | 0.37 | 0.15 |
True | 0.42 | 0.06 |
\(M = 100\) and \(\bar\Pi = \hat\Pi(\cdot~\vert~y)\) scaled by 2