class: center, middle, inverse, title-slide

# Scaling samplers for high-dimensional models with stochastic nets
### Joshua J Bon
Queensland University of Technology
MCM2019 – July 11th

---

## Talk outline

<br>

1. **Bayesian Lasso**: An interesting reinterpretation (at least to me). <br><br><br>
2. **Stochastic constraint variables**: A good name? <br><br><br>
3. **Stochastic nets**: Continuous shrinkage priors as stochastic constraints. <br><br><br>
4. **Geometry**: I'll show some pictures and wave my hands. <br><br><br>
5. **A new Gibbs sampler**: A bit speedier, a bit more stable. *Or the MC part*.

---

## The Lasso

<img src="imgs/aquaman-hammerhead-1967-cartoon.JPG" width="831" />

<!--- If aquaman can use a lasso under water, what can I do? "Weird" --->

---

## The standard Lasso

### Lagrangian form

`$$\max_{\boldsymbol{\theta} \in \mathbb{R}^{p}} \log \pi(\boldsymbol{x}|\boldsymbol{\theta}) - \lambda \Vert \boldsymbol{\theta} \Vert_{1}$$`

### Constrained form

`$$\max_{\tilde{\boldsymbol{\theta}} \in \mathbb{R}^{p}} \log \pi(\boldsymbol{x}|\tilde{\boldsymbol{\theta}}) \quad\text{ s.t. } \quad \Vert \tilde{\boldsymbol{\theta}} \Vert_{1} \leq \tilde{\omega}$$`

---

## The Bayesian Lasso

### Standard probabilistic model:

`$$(\boldsymbol{x}|\boldsymbol{\theta}) \sim \pi(\boldsymbol{x}|\boldsymbol{\theta})$$`

`$$\boldsymbol{\theta} \overset{\text{iid}}{\sim} \text{DExp}(\lambda)$$`

### Constrained probabilistic model:

`$$(\boldsymbol{x} \vert \boldsymbol{\theta}) \sim \pi(\boldsymbol{x} \vert \boldsymbol{\theta})$$`

`$$(\boldsymbol{\theta},\omega) \overset{\text{d}}{=} (\tilde{\boldsymbol{\theta}},\tilde{\omega}\;\; \text{s.t.}\;\; \Vert \tilde{\boldsymbol{\theta}} \Vert_{1} \leq \tilde{\omega})$$`

`$$\tilde{\boldsymbol{\theta}} \sim f_{\tilde{\boldsymbol{\theta}}} \propto 1 \qquad \tilde{\omega} \sim \text{Exp}(\lambda)$$`

<!--- More complicated, but exposes inner facets, Marginalising over `\(\omega\)` results in the original Bayesian Lasso. --->

---

## The constrained Bayesian Lasso

This looks suspiciously like changing the constrained Lasso

<br><br>

`$$\max_{\tilde{\boldsymbol{\theta}} \in \mathbb{R}^{p}} \log \pi(\boldsymbol{x} \vert \tilde{\boldsymbol{\theta}}) \quad\text{ s.t. } \quad \Vert \tilde{\boldsymbol{\theta}} \Vert_{1} \leq \tilde{\omega}$$`

<br>

to a Bayesian problem by setting priors

<br><br>

`$$\tilde{\boldsymbol{\theta}} \sim f_{\tilde{\boldsymbol{\theta}}} \propto 1 \qquad \tilde{\omega} \sim \text{Exp}(\lambda)\qquad \text{s.t.}\qquad \Vert \tilde{\boldsymbol{\theta}} \Vert_{1} \leq \tilde{\omega}$$`

<br>

just as in the general stochastic constraint framework defined next.

<!--- \item Closer connection between Bayesian and standard Lasso \item Connected by more than just the MAP \end{itemize} --->

---

## Duality and analogy to regularisation

<br><br><br>

| *Inference*  | *Regularisation* |
|--------------|-----------------------------------------------------------|
| Optimisation | Penalty `\(\Longleftrightarrow\)` constraint |
| Bayesian     | Prior `\(\Longleftrightarrow\)` stochastic constraint prior |

--

<br><br>

<!--- The duality between stochastic constraint priors and their standard representation is analogous to the duality between penalty and constraint regularisation in optimisation.--->

There is a duality in the optimisation setting, *and for the priors in a Bayesian analysis*; a small numerical check follows on the next slide.

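---

### Sketch: checking the Lasso prior duality

A minimal sketch (R, purely illustrative), assuming a single coefficient and `\(\lambda = 1\)`. In one dimension the flat base prior truncated at `\(\vert\theta\vert \leq \omega\)` tilts the `\(\text{Exp}(\lambda)\)` constraint variable to a `\(\text{Gamma}(2, \lambda)\)` marginal, with `\(\theta \vert \omega \sim \text{U}(-\omega, \omega)\)`; the resulting draws of `\(\theta\)` should match the `\(\text{DExp}(\lambda)\)` prior of the standard Bayesian Lasso.

```r
# Sketch only: 1-d constrained Bayesian Lasso prior vs the double-exponential prior.
set.seed(1)
lambda <- 1
n      <- 1e5
omega  <- rgamma(n, shape = 2, rate = lambda)  # omega-marginal after the joint truncation
theta  <- runif(n, min = -omega, max = omega)  # flat base prior restricted to |theta| <= omega

# direct draws from the DExp(lambda) prior for comparison
theta_dexp <- rexp(n, rate = lambda) * sample(c(-1, 1), n, replace = TRUE)

qqplot(theta, theta_dexp, main = "Constrained representation vs DExp(1)")
abline(0, 1, col = "red")
```
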
---

## General stochastic constraint framework

For continuous random variables:

`$$p(\boldsymbol{\theta},\boldsymbol{\omega}) \propto f_{\tilde{\boldsymbol{\theta}}}(\boldsymbol{\theta})~f_{\tilde{\boldsymbol{\omega}}}(\boldsymbol{\omega})~1(\boldsymbol{r}(\boldsymbol{\theta}) \preceq \boldsymbol{\omega})$$`

--

| | |
|-------------------------|--------------|
| Base prior | `\(\tilde{\boldsymbol{\theta}} \sim f_{\tilde{\boldsymbol{\theta}}}, \quad \tilde{\boldsymbol{\theta}} \in \Theta \subseteq \mathbb{R}^{p}\)` |
| Constraint variable | `\(\tilde{\boldsymbol{\omega}} \sim f_{\tilde{\boldsymbol{\omega}}}, \quad \tilde{\boldsymbol{\omega}} \in \Omega \subseteq \mathbb{R}^{q}\)` |
| Penalty function | `\(\boldsymbol{r}: \Theta \rightarrow \Omega\)` |

--

**Key ideas**:

- Truncation is applied *jointly* to base and constraint variables.

--

- Decomposition exposes marginal and multivariate components of the prior.

--

**Connections**: Scale mixtures of normals/uniforms, slice sampling, skew random variables, exponentially tilted random variables, ...

<!---#### Base prior: `\(\tilde{\boldsymbol{\theta}} \sim f_{\tilde{\boldsymbol{\theta}}}, \quad \tilde{\boldsymbol{\theta}} \in \Theta \subseteq \mathbb{R}^{p}\)`. #### Constraint variable: `\(\tilde{\boldsymbol{\omega}} \sim f_{\tilde{\boldsymbol{\omega}}}, \quad \tilde{\boldsymbol{\omega}} \in \Omega \subseteq \mathbb{R}^{q}\)` #### Penalty function: `\(\boldsymbol{r}(\boldsymbol{\theta}): \Theta \rightarrow \Omega\)` ## Stochastic constraint framework ### Structure `$$(\boldsymbol{\theta}, \boldsymbol{\omega} \vert \boldsymbol{\lambda}) \overset{\text{d}}{=} (\tilde{\boldsymbol{\theta}}, \tilde{\boldsymbol{\omega}} \;\vert\; \boldsymbol{r}(\tilde{\boldsymbol{\theta}}) \preceq \tilde{\boldsymbol{\omega}}, \boldsymbol{\lambda})$$` `$$\tilde{\boldsymbol{\theta}} \sim f_{\tilde{\boldsymbol{\theta}}}$$` `$$(\tilde{\boldsymbol{\omega}} \vert \boldsymbol{\lambda}) \sim f_{\tilde{\boldsymbol{\omega}} \vert \boldsymbol{\lambda}}$$` The joint support of the joint distribution `\((\tilde{\boldsymbol{\theta}}, \tilde{\boldsymbol{\omega}})\)` is constrained element-wise by the inequality `\(\boldsymbol{r}(\tilde{\boldsymbol{\theta}} ) \preceq \tilde{\boldsymbol{\omega}}\)`.-->

---

## Examples

<img src="imgs/aquaman-lasso2.gif" width="732" height="488" />

---

## 1D example

| Base prior | Penalty | Constraint |
|------------|---------|------------|
| `\(\text{N}(0,1)\)` | `\(-\)` | `\(-\)` |

<img src="sc-sampler_files/figure-html/sc-example-1-1.png" width="600" height="400" style="display: block; margin: auto;" />

---

## 1D example

| Base prior | Penalty | Constraint |
|------------|---------|------------|
| `\(\text{N}(0,1)\)` | `\(\vert\theta\vert\)` | `\(\text{Exp}(1)\)` |

<img src="sc-sampler_files/figure-html/sc-example-2-1.png" width="600" height="400" style="display: block; margin: auto;" />

---

## 1D example

| Base prior | Penalty | Constraint |
|------------|---------|------------|
| `\(\text{N}(0,1)\)` | `\(\vert\theta\vert\)` | `\(\text{Gamma}(0.5,1)\)` |

<img src="sc-sampler_files/figure-html/sc-example-3-1.png" width="600" height="400" style="display: block; margin: auto;" />

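---

### Sketch: sampling a 1D stochastic constraint prior

A minimal rejection-sampling sketch (R, purely illustrative) of the second 1D example: `\(\text{N}(0,1)\)` base prior, penalty `\(\vert\theta\vert\)`, constraint `\(\text{Exp}(1)\)`. Draws of the base and constraint variables are kept only when the constraint holds; this illustrates the generative process, not the sampler discussed later.

```r
# Sketch only: joint rejection sampling of (theta, omega) under |theta| <= omega.
set.seed(1)
n_draws <- 1e5
theta <- rnorm(n_draws)            # base prior N(0, 1)
omega <- rexp(n_draws, rate = 1)   # constraint variable Exp(1)
keep  <- abs(theta) <= omega       # joint truncation r(theta) <= omega
hist(theta[keep], breaks = 100, freq = FALSE,
     main = "Stochastic constraint prior: N(0,1) base, Exp(1) constraint")
```
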
---

## Horseshoe prior

`$$(\boldsymbol{\theta}, \boldsymbol{\omega} \;\vert\; \boldsymbol{\lambda}) \overset{\text{d}}{=} \left( \tilde{\boldsymbol{\theta}}, \tilde{\boldsymbol{\omega}} \;\vert\; \tilde{\theta}_{i}^{2} \leq \tilde{\omega}_{i} \;\forall i \in P, \boldsymbol{\lambda} \right)$$`

`$$\tilde{\boldsymbol{\theta}} \sim f_{\tilde{\boldsymbol{\theta}}} \propto 1$$`

`$$(\tilde{\omega}_{i}\vert\lambda_{i},\tau) \sim \text{Exp}(2^{-1} [\lambda_{i}\tau]^{-2})$$`

`$$\lambda_{i} \sim \text{Cauchy}_{+}(0,1)$$`

`$$\tau \sim \text{Cauchy}_{+}(0,1)$$`

`$$\text{for } i \in P = \{1,2,3,\ldots,p\}$$`

--

Similar representations exist for

- Scale-normal mixtures
- Regularised horseshoe
- Dirichlet-Laplace
- R2-D2 prior

<!--- Add box around SC part to demo marginalisation?--->

---

## Stochastic nets

- A restricted family of stochastic constraint priors contains all(?) continuous shrinkage priors (horseshoe etc.) in the Bayesian literature.

- I refer to these as *stochastic nets*. **... why?**

--

<img src="imgs/aquaman-flying-fish.gif" style="display: block; margin: auto;" />

---

## Visualisation of generative process

`$$(x,y) \sim \text{N}(0,1)$$`

`$$\boldsymbol{\lambda} \sim \text{Dir}(1,1)$$`

`$$\tau \sim \text{Exp}(2)$$`

`$$\text{s.t. } \lambda_{1}\vert x \vert + \lambda_{2}\vert y \vert \leq \tau$$`

<img src="imgs/sc-rejection-viz1.gif" style="display: block; margin: auto;" />

---

## Visualisation of generative process

`$$(x,y) \sim \text{N}(0,1)$$`

`$$\boldsymbol{\lambda} \sim \text{Dir}(1,1)$$`

`$$\tau \sim \text{Exp}(2)$$`

`$$\text{s.t. } \lambda_{1}\vert x \vert + \lambda_{2}\vert y \vert \leq \tau$$`

<img src="imgs/sc-rejection-viz1-pause.png" width="1333" style="display: block; margin: auto;" />

---

## A stochastic net: constraining variables

<img src="imgs/sc-rej-shape-only.png" width="600" height="500" style="display: block; margin: auto;" />

---

## Gibbs samplers for continuous-shrinkage priors

**Normal model**: `\((\boldsymbol{y} ~\vert~ \boldsymbol{X}, \boldsymbol{\beta},\sigma^2) \sim \text{N}(\boldsymbol{X} \boldsymbol{\beta}, \sigma^2\boldsymbol{I})\)`.

--

### Ingredients of standard algorithm

--

(a) Sample `\((\boldsymbol{\beta}~\vert~ \boldsymbol{\lambda}, \sigma^2, \tau) \sim \text{N}(\boldsymbol{\mu},\boldsymbol{V})\)`.

- `\(\boldsymbol{\mu} = \boldsymbol{V}\boldsymbol{X}^{\top}\boldsymbol{y}\)`
- `\(\boldsymbol{V} = (\boldsymbol{X}^{\top}\boldsymbol{X} + \boldsymbol{S}^{-1})^{-1}\)`
- `\(\boldsymbol{S}\)` = diagonal matrix formed from `\(\boldsymbol{\lambda}\)` and `\(\tau\)`.

--

(b) Sample `\((\sigma^2~\vert~\boldsymbol{\beta}, \boldsymbol{\lambda}, \tau) \sim \text{IG}(a_{1},a_{2})\)`

--

(c) Sample `\((\boldsymbol{\lambda}~\vert~\boldsymbol{\beta}, \sigma^2, \tau)\)`

--

(d) Sample `\((\tau~\vert~ \boldsymbol{\beta}, \boldsymbol{\lambda}, \sigma^2)\)`

--

**Q:** What is the computational bottleneck of this algorithm?

--

**A:** Inverting (or decomposing) `\(\boldsymbol{V}\)` is `\(\mathcal{O}(p^3)\)`; a sketch of this step follows.

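---

### Sketch: step (a), the bottleneck

A minimal sketch (R, illustrative only) of step (a): drawing `\(\boldsymbol{\beta}\)` from `\(\text{N}(\boldsymbol{\mu},\boldsymbol{V})\)` with `\(\boldsymbol{V} = (\boldsymbol{X}^{\top}\boldsymbol{X} + \boldsymbol{S}^{-1})^{-1}\)` and `\(\boldsymbol{\mu} = \boldsymbol{V}\boldsymbol{X}^{\top}\boldsymbol{y}\)`, assuming `\(\sigma^2\)` is absorbed into `\(\boldsymbol{S}\)` for simplicity. The Cholesky factorisation of the `\(p \times p\)` precision matrix is the `\(\mathcal{O}(p^3)\)` bottleneck.

```r
# Sketch only: one conditional draw of beta given the shrinkage scales (S_diag).
draw_beta <- function(X, y, S_diag) {
  p    <- ncol(X)
  prec <- crossprod(X) + diag(1 / S_diag, p)                 # X'X + S^{-1}
  R    <- chol(prec)                                         # O(p^3): the bottleneck
  mu   <- backsolve(R, forwardsolve(t(R), crossprod(X, y)))  # solve prec %*% mu = X'y
  drop(mu + backsolve(R, rnorm(p)))                          # N(mu, prec^{-1}) via Cholesky factor
}
```
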
---

### Truncated-Gibbs sampler

**Aim**: Exploit the geometry by decomposing the prior to create a better Gibbs sampler.

--

(a1) Sample `\((\boldsymbol{\omega}~\vert~\boldsymbol{\beta}, \boldsymbol{\lambda}, \sigma^2, \tau) \sim \text{sExp}(b_1)\)`

--

- Use `\(\boldsymbol{\omega}\)` (the current magnitudes) to choose an appropriate sampling step, namely

--

- Order `\(\boldsymbol{\omega}\)` by size and split into two groups, `\(I\)` and `\(J\)`, where `\(\omega_{i} \leq \epsilon\)` for `\(i \in I\)`.

--

(a2) Sample `\((\beta_{i}~\vert~ \boldsymbol{\beta}_{(i)},\boldsymbol{\omega}, \boldsymbol{\lambda}, \sigma^2, \tau) \sim \text{U}(-\omega_{i}^{1/\nu},\omega_{i}^{1/\nu})\)` with MH correction for `\(i \in I\)`.

--

- The dependence on `\(\boldsymbol{\beta}_{(i)}\)` exists but is very small. You can argue that `\((\beta_{i}~\vert~ \boldsymbol{\beta}_{(i)},\boldsymbol{\omega}, \boldsymbol{\lambda}, \sigma^2, \tau) \overset{d}{\approx} (\beta_{i}~\vert~\boldsymbol{\omega}, \boldsymbol{\lambda}, \sigma^2, \tau)\)` if `\(\epsilon\)` is small enough.

--

(a3) Sample `\((\boldsymbol{\beta}_{J}~\vert~ \boldsymbol{\beta}_{I}, \boldsymbol{\lambda}, \sigma^2, \tau) \sim \text{N}(\boldsymbol{\mu}_{J|I},\boldsymbol{V}_{J|I})\)`.

--

- Note: `\(\boldsymbol{\omega}\)` is marginalised out here. Not necessary, but practical.

---

### Truncated-Gibbs sampler

**Aim**: Exploit the geometry by decomposing the prior to create a better Gibbs sampler.

(a1) Sample `\((\boldsymbol{\omega}~\vert~\boldsymbol{\beta}, \boldsymbol{\lambda}, \sigma^2, \tau) \sim \text{sExp}(b_1)\)`

(a2) Sample `\((\beta_{i}~\vert~ \boldsymbol{\beta}_{(i)},\boldsymbol{\omega}, \boldsymbol{\lambda}, \sigma^2, \tau) \sim \text{U}(-\omega_{i}^{1/\nu},\omega_{i}^{1/\nu})\)` with MH correction for `\(i \in I\)`.

(a3) Sample `\((\boldsymbol{\beta}_{J}~\vert~ \boldsymbol{\beta}_{I}, \boldsymbol{\lambda}, \sigma^2, \tau) \sim \text{N}(\boldsymbol{\mu}_{J|I},\boldsymbol{V}_{J|I})\)`.

--

**Complexity**:

- `\(\mathcal{O}(\vert J \vert^{3} + p \vert I\vert)\)` in each iteration.

--

- We can fix `\(\vert J \vert\)` by selecting `\(\epsilon\)` adaptively. Then the complexity is `\(\mathcal{O}(p^2)\)`.

---

- In high dimensions with shrinkage priors, the majority of coefficients are forced to be very close to zero.

--

- Understanding the structure and geometry of the problem is important!

--

Be like Aquaman!

<br><br>

<img src="imgs/aquaman-jetski.gif" style="display: block; margin: auto;" />

---

## Results on some toy problems

Model: Linear regression with iid Gaussian errors, using the R2-D2 prior.

Simulation settings:

`\(n = 100\)`

`\(p \in \{500,1000,2500,5000\}\)`

`\(\rho \in \{0.1,0.5\}\)`, controls correlation in `\(\boldsymbol{X}\)`

`\(\epsilon \in \{0.1,0.2\}\)`

`\(\boldsymbol{\beta} = [\boldsymbol{0}_{10}^{\top}~\boldsymbol{b}_{1}^{\top}~\boldsymbol{0}_{30}^{\top}~~\boldsymbol{b}_{2}^{\top}~\boldsymbol{0}_{p-50}^{\top}]^{\top}\)`

`\(\boldsymbol{b}_{1} = [2,~2,-5,-5,-5]\)`

`\(\boldsymbol{b}_{2} = [-2,-2,-2,~5,~5]\)`

`\(2000\)` samples per run, repeated independently 100 times.

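---

### Sketch: one simulated dataset

A minimal sketch (R, illustrative only) of generating a dataset under the settings above. The exact correlation structure of `\(\boldsymbol{X}\)` is not specified on the previous slide, so an AR(1)-type structure with parameter `\(\rho\)` is assumed here purely for illustration.

```r
# Sketch only: build beta as on the previous slide and simulate (X, y).
set.seed(1)
n <- 100; p <- 500; rho <- 0.5
b1 <- c(2, 2, -5, -5, -5)
b2 <- c(-2, -2, -2, 5, 5)
beta <- c(rep(0, 10), b1, rep(0, 30), b2, rep(0, p - 50))

Sigma <- rho ^ abs(outer(1:p, 1:p, "-"))          # assumed AR(1)-type correlation
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)   # correlated design
y <- drop(X %*% beta + rnorm(n))                  # iid Gaussian errors
```
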
---

### Mean (SD) computation time

<img src="imgs/avg-computation-time.png" width="500" height="500" style="display: block; margin: auto;" />

---

### Posterior mean (SD) of coefficient SSE

<img src="imgs/table-sse.png" width="1563" height="500" style="display: block; margin: auto;" />

---

### Grouping and uniform proposal

<img src="imgs/table-mh-accept.png" width="550" height="500" style="display: block; margin: auto;" />

---

<!---### Mean (SD) computation time <img src="imgs/table-time.png" width="450" height="550" style="display: block; margin: auto;" /> --->

## Conclusions

- One technique for scaling with dimension where sparsity is desired.
- Can be combined with other techniques.
- Other areas under investigation with stochastic constraints:
    - Theoretical results
    - Discrete version
    - Prior construction
    - Variable selection from continuous-shrinkage priors
    - You tell me!
- Acknowledgements:
    - Berwin Turlach, Kevin Murray, Chris Drovandi
    - Pawsey HPC

---

## (A random sample of) Key references

<p><cite>Carvalho, C. M., N. G. Polson, and J. G. Scott (2010). “The horseshoe estimator for sparse signals”. In: <em>Biometrika</em> 97.2, pp. 465–480. DOI: <a href="https://doi.org/10.1093/biomet/asq017">10.1093/biomet/asq017</a>.</cite></p>

<p><cite>Gelman, A. (2004). “Parameterization and Bayesian modeling”. In: <em>Journal of the American Statistical Association</em> 99.466, pp. 537–545.</cite></p>

<p><cite>Johndrow, J. E., P. Orenstein, and A. Bhattacharya (2018). “Bayes shrinkage at GWAS scale: Convergence and approximation theory of a scalable MCMC algorithm for the horseshoe prior”. In: <em>arXiv preprint arXiv:1705.00841</em>.</cite></p>

<p><cite>Park, T. and G. Casella (2008). “The Bayesian Lasso”. In: <em>Journal of the American Statistical Association</em> 103.482, pp. 681–686.</cite></p>

<p><cite>Piironen, J. and A. Vehtari (2017). “Sparsity information and regularization in the horseshoe and other shrinkage priors”. In: <em>Electronic Journal of Statistics</em> 11.2, pp. 5018–5051. DOI: <a href="https://doi.org/10.1214/17-ejs1337si">10.1214/17-ejs1337si</a>.</cite></p>

<p><cite>Walker, S., P. Damien, and M. Meyer (1997). “On scale mixtures of uniform distributions and the latent weighted least squares method”. In: <em>Working papers series, University of Michigan Ross School of Business</em>.</cite></p>