26 Lecture 17 - 2019

26.1 Varying slopes

Slopes are another feature of the response that can vary by cluster, just as intercepts do.

Making any parameter into a varying effect:

  1. Split it into a vector of parameters, one per cluster
  2. Define a population distribution (an adaptive prior) for those cluster parameters

Any batch of parameters with exchangeable index values can (“and probably should”) be pooled. Exchangeable means the index labels are unordered: they carry no information themselves.

You could treat slopes as a distinct varying effect, but it is even better to relate the intercepts and slopes directly. Since intercepts and slopes are related in the population, the features of these units have a correlation structure that can itself be estimated.

26.1.1 Example - cafes

Cafe visits in the morning and afternoon, recording the wait. Intercepts: average morning wait. Slopes: average difference between afternoon and morning wait.

Are the slopes and intercepts related? Yes: a busy café has a long morning wait (large intercept) and a big drop in the afternoon (large negative slope). So there is pooling across parameters, not only across cafés.

The prior is a 2-dimensional Gaussian: a vector of means (average intercept, average slope) and a 2×2 variance-covariance matrix.

26.2 Variance-covariance matrix

\(S = \begin{pmatrix} \sigma_{\alpha}^{2} & \sigma_{\alpha}\sigma_{\beta}\rho \\ \sigma_{\alpha}\sigma_{\beta}\rho & \sigma_{\beta}^{2} \end{pmatrix}\)

Equivalently, \(S = \text{diag}(\sigma_{\alpha}, \sigma_{\beta}) \, R \, \text{diag}(\sigma_{\alpha}, \sigma_{\beta})\), where \(R\) is the correlation matrix. This decomposition lets us assign separate priors to the standard deviations and to the correlation.
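
A minimal numpy sketch of this construction (the numerical values below are illustrative assumptions, not estimates from the lecture): build \(S\) from the two standard deviations and the correlation, then draw café-level (intercept, slope) pairs from the 2-dimensional Gaussian prior.

```python
import numpy as np

a, b = 3.5, -1.0              # population mean intercept and slope (assumed)
sigma_a, sigma_b = 1.0, 0.5   # sd of intercepts and of slopes (assumed)
rho = -0.7                    # correlation between intercepts and slopes (assumed)

# S = diag(sigma) @ R @ diag(sigma), with R the correlation matrix
R = np.array([[1.0, rho],
              [rho, 1.0]])
sigmas = np.array([sigma_a, sigma_b])
S = np.diag(sigmas) @ R @ np.diag(sigmas)

# Draw cafe-level (alpha_cafe, beta_cafe) pairs from the 2-D Gaussian prior
rng = np.random.default_rng(42)
cafe_params = rng.multivariate_normal(mean=[a, b], cov=S, size=20)
print(cafe_params[:3])  # each row is one cafe's (intercept, slope)
```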

26.3 Varying slopes model

\(W_{i} \sim \text{Normal}(\mu_{i}, \sigma)\)

\(\mu_{i} = \alpha_{\text{cafe}[i]} + \beta_{\text{cafe}[i]} A_{i}\)

\(\begin{bmatrix} \alpha_{\text{cafe}} \\ \beta_{\text{cafe}} \end{bmatrix} \sim \text{MVNormal}\left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, S \right)\)

\(\mu_{i}\) combines the varying intercepts and varying slopes. \(A_{i}\) is an indicator: 1 for afternoon, 0 for morning.

Multivariate prior: each café has a pair of parameters \((\alpha_{\text{cafe}}, \beta_{\text{cafe}})\), distributed as a 2-dimensional normal with mean vector \((\alpha, \beta)\) and covariance matrix \(S\).

\(R \sim \text{LKJcorr}(2)\)

You can’t assign priors to the correlations independently, because the matrix as a whole must remain valid (positive definite). Each pairwise correlation lies between -1 and 1, and as the number of dimensions grows the correlations constrain one another: if one is really big, the others are necessarily smaller.

The LKJcorr prior has a single parameter, \(\eta\) (eta), which controls how concentrated the distribution is around the identity matrix. The implied density for each correlation lives on \([-1, 1]\). \(\eta = 1\) gives an essentially uniform density over correlation matrices; \(\eta > 1\) concentrates mass around 0, making the prior more skeptical of extreme correlations.
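
In two dimensions this is easy to visualize: LKJ(\(\eta\)) over a \(2 \times 2\) correlation matrix implies a density over the single correlation \(r\) proportional to \((1 - r^{2})^{\eta - 1}\). A short matplotlib sketch (the particular \(\eta\) values plotted are my choice):

```python
import numpy as np
import matplotlib.pyplot as plt

# For a 2x2 correlation matrix, LKJ(eta) implies a density over the
# single correlation r proportional to (1 - r^2)^(eta - 1).
r = np.linspace(-0.99, 0.99, 400)
dr = r[1] - r[0]
for eta in [1, 2, 4]:
    density = (1 - r**2) ** (eta - 1)
    density /= density.sum() * dr  # normalize numerically
    plt.plot(r, density, label=f"eta = {eta}")
plt.xlabel("correlation r")
plt.ylabel("prior density")
plt.legend()
plt.show()
```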

26.4 Multidimensional shrinkage

The joint distribution of varying effects pools information across slopes and intercepts. The estimated correlation induces shrinkage across dimensions: data about a cluster's intercept also informs its slope, and vice versa, which increases accuracy.
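
One way to see the cross-dimension pooling, using the café example's notation: under a bivariate-normal population, the conditional mean of a cluster's slope given its intercept is \(\text{E}[\beta_{\text{cafe}} \mid \alpha_{\text{cafe}}] = \beta + \rho \frac{\sigma_{\beta}}{\sigma_{\alpha}} (\alpha_{\text{cafe}} - \alpha)\). A sketch with hypothetical population values:

```python
# Population parameters (hypothetical values, for illustration only)
alpha_bar, beta_bar = 3.5, -1.0
sigma_a, sigma_b, rho = 1.0, 0.5, -0.7

def expected_slope(alpha_cafe):
    # Bivariate-normal conditional mean: knowing a cafe's intercept
    # shifts the expectation for its slope, even with no slope data.
    return beta_bar + rho * (sigma_b / sigma_a) * (alpha_cafe - alpha_bar)

print(expected_slope(4.5))  # -1.35: large intercept -> more negative slope
print(expected_slope(2.5))  # -0.65: small intercept -> less negative slope
```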

26.4.1 Example - prosocial chimps, many clusters

4 treatments: the combinations of partner present/absent and prosocial option on the left/right side of the table. Treatment effects can vary by actor and by block.

\(L_{i} \sim \text{Binomial}(1, p_{i})\)

\(\text{logit}(p_{i}) = \gamma_{\text{treatment}[i]} + \alpha_{\text{actor}[i], \text{treatment}[i]} + \beta_{\text{block}[i], \text{treatment}[i]}\)

\(\gamma\) is the mean effect of each treatment; \(\alpha\) and \(\beta\) give each actor and each block its own deviation in each treatment.

\(\alpha_{\text{actor}, \text{treatment}}\) is a matrix of deviations from the mean, one for each actor in each treatment.

\(\beta_{\text{block}, \text{treatment}}\) is a matrix of deviations from the mean, one for each block in each treatment.

How many parameters is this? 7 actors × 4 treatments (28) + 6 blocks × 4 treatments (24) + 4 treatment means + two 4×4 correlation matrices (6 free correlations each, so 12) + two sets of 4 standard deviations (8) = 76 parameters, as the count below confirms.
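
The accounting, spelled out (a 4×4 correlation matrix has 4·3/2 = 6 free entries; the actors and the blocks each get one correlation matrix and one vector of standard deviations):

```python
actors, blocks, treatments = 7, 6, 4

actor_effects = actors * treatments                      # 28 alpha deviations
block_effects = blocks * treatments                      # 24 beta deviations
treatment_means = treatments                             # 4 gammas
correlations = 2 * (treatments * (treatments - 1) // 2)  # 6 per 4x4 matrix, 2 matrices
sigmas = 2 * treatments                                  # 4 per varying-effect matrix

print(actor_effects + block_effects + treatment_means + correlations + sigmas)  # 76
```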

With shrinkage, the effective number of parameters will be much lower than 76.

26.4.2 Divergences

Divergent transitions are more common in these models, so we need to use the non-centered parameterizations.

This is simpler for univariate models: we factor all the parameters out of the prior and into the linear model. But how do we factor out a correlation matrix?

With a Cholesky factor. Any covariance matrix \(S\) can be decomposed as \(S = L L^{\top}\) with \(L\) lower-triangular, and multiplying uncorrelated standard-normal z-scores by \(L\) produces draws with covariance \(S\). The z-scores then carry a fixed Normal(0, 1) prior, and the correlated effects are computed deterministically.
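
A minimal numpy sketch of the idea (variable names and values are mine, not from the lecture): take the Cholesky factor of the correlation matrix, sample uncorrelated z-scores, and reconstruct the correlated effects deterministically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlation matrix and standard deviations (hypothetical values)
R = np.array([[1.0, -0.7],
              [-0.7, 1.0]])
sigmas = np.array([1.0, 0.5])

# Cholesky factor: R = L_R @ L_R.T, with L_R lower-triangular
L_R = np.linalg.cholesky(R)

# Non-centered parameterization: sample uncorrelated z-scores...
z = rng.standard_normal(size=(2, 10_000))  # 2 effects x 10,000 clusters

# ...then scale and correlate them deterministically.
# Columns of v are MVNormal(0, S) draws, S = diag(sigmas) @ R @ diag(sigmas)
v = np.diag(sigmas) @ L_R @ z

print(np.cov(v))  # approximately S
```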