3 Building your first mixed model

3.1 Building from the ground up

Let’s start with the simplest linear model: only an intercept. This looks like: \[\begin{equation} \vec{y} = \begin{bmatrix} 1\\...\\1 \end{bmatrix} * \beta_0 + \epsilon \tag{3.1} \end{equation}\]

where \(\vec{y}\) is a vector with the measurements we want to model (in our example, Effect_Size), \(\beta_0\) is the intercept, and \(\epsilon\) is residual error. The length of the ones vector is the same as the number of observations in \(\vec{y}\). Essentially, this model is predicting the same number for every single measurement.

In R syntax, we can say that \(\vec{y}\) is proportional to some value (and estimate the coefficients that make this proportion work out). In this case, we assume that each value in \(\vec{y}\) is a constant and therefore proportional to 1:

\[\begin{equation} y \propto 1 \tag{3.2} \end{equation}\]

Here’s what it looks like in the code:
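
(The object name m.lin.noCond is the one this model gets in the comparison in section 3.3.)

m.lin.noCond <- lm(Effect_Size ~ 1, data = data.md.red)  # intercept-only model
summary(m.lin.noCond)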

## 
## Call:
## lm(formula = Effect_Size ~ 1, data = data.md.red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2640 -0.4187 -0.0466  0.3720  3.6304 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.41824    0.01044   40.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.717 on 4719 degrees of freedom

The most interesting section in this summary is the “coefficients” section, which shows us the estimates for the fixed effects. Here, it’s just the intercept, which has the estimated value of 0.418 and is significantly greater than 0 (p < .001).

What does this number mean in practice? This is just the average response level across all our datapoints. That’s what happens when you don’t add any structure to your model - the average is the best estimate.
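
You can verify this yourself - the fitted intercept of an intercept-only model is simply the grand mean of the data:

mean(data.md.red$Effect_Size)  # ~0.418, same as the intercept estimate above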

3.2 A model with 1 fixed effect

Let’s add a single fixed predictor. Now our effect size \(\vec{y}\) depends not only on the mean response, but also on the experimental condition. The general formula for such a design looks like this:

\[\begin{equation} \vec{y} = X * \vec{\beta} + \epsilon \tag{3.3} \end{equation}\]

What’s X? X is the design matrix; it specifies our fixed effects for this model (fixed effects are like regular regression terms - in fact, if you’ve seen matrix-form regression equations before, this should all be familiar).

The size of X will depend on the number of conditions. How many conditions do we have here?
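
(The exact call that produced the printout below isn't shown; any inspection of the Condition factor will do, for example:)

str(unique(data.md.red$Condition))   # or simply: levels(data.md.red$Condition)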

##  Factor w/ 2 levels "W","S": 2 1

We have 2 conditions, words (W) and sentences (S). So our design matrix will look something like the following:

\[\begin{equation} X = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ ...\\ 1 & 1 \end{bmatrix} \tag{3.4} \end{equation}\]

where the first column is our intercept and the second column has 1 whenever that row comes from the Sentence condition and 0 otherwise.

Where is the Word condition?

Turns out that, by default, the first level of a factor variable gets treated as the reference level (\(\beta_{0}\)). So here, the response to words is our intercept, and the response to sentences is the stuff you would get “on top” of the word-induced activation - in other words, the magnitude of the S>W contrast. The intercept, in turn, would give you the average magnitude of the W>0 contrast.

This might sound a bit counterintuitive, but imagine if we tried to specify the intercept, word and sentence as 3 separate effects to estimate. Then we could still compare sentences to words, but would have trouble distinguishing the words from the intercept. If our observed response to words is 1.2, does the intercept contribute 0 and words 1.2? intercept 1.2 and words 0? intercept 0.6 and words 0.6? There is an infinite number of combinations.

(Mathematically, we can state that the matrix (3.4) would become rank-deficient if we added a separate predictor column for W, since it is a direct complement of the S column. Thus, there would be an infinite number of coefficient vectors \(\vec{\beta}\) solving equation (3.3).)

Bottom line: if your factor has n levels, you can estimate at most n effects.

Ok, with all those preliminaries behind us, let’s finally run the model!
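
(As before, the object name m.lin is the one used in the model comparison in section 3.3.)

m.lin <- lm(Effect_Size ~ Condition, data = data.md.red)  # one fixed effect: Condition
summary(m.lin)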

## 
## Call:
## lm(formula = Effect_Size ~ Condition, data = data.md.red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2080 -0.4233 -0.0426  0.3806  3.5744 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.47423    0.01471  32.228  < 2e-16 ***
## ConditionS  -0.11198    0.02081  -5.381 7.77e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7148 on 4718 degrees of freedom
## Multiple R-squared:  0.0061, Adjusted R-squared:  0.005889 
## F-statistic: 28.95 on 1 and 4718 DF,  p-value: 7.77e-08

We do not need to explicitly specify an intercept: a model with + 1 in the formula is the same as a model without it.
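
In other words, these two calls fit exactly the same model:

lm(Effect_Size ~ Condition, data = data.md.red)      # implicit intercept
lm(Effect_Size ~ 1 + Condition, data = data.md.red)  # explicit intercept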

## 
## Call:
## lm(formula = Effect_Size ~ 1 + Condition, data = data.md.red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2080 -0.4233 -0.0426  0.3806  3.5744 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.47423    0.01471  32.228  < 2e-16 ***
## ConditionS  -0.11198    0.02081  -5.381 7.77e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7148 on 4718 degrees of freedom
## Multiple R-squared:  0.0061, Adjusted R-squared:  0.005889 
## F-statistic: 28.95 on 1 and 4718 DF,  p-value: 7.77e-08

We can also explicitly suppress the intercept. The resulting model is the same in terms of its predicted y values, but the coefficients and their interpretation are different.

THINK: Which two contrasts does this no-intercept model evaluate? (Hint: you can compare these numbers with the numbers produced by the previous model)
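
A 0 (or a -1) in the formula suppresses the intercept; the object name m.noR is the one this model gets in the comparison in section 3.4.1:

m.noR <- lm(Effect_Size ~ 0 + Condition, data = data.md.red)  # no intercept
summary(m.noR)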

## 
## Call:
## lm(formula = Effect_Size ~ 0 + Condition, data = data.md.red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2080 -0.4233 -0.0426  0.3806  3.5744 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## ConditionW  0.47423    0.01471   32.23   <2e-16 ***
## ConditionS  0.36225    0.01471   24.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7148 on 4718 degrees of freedom
## Multiple R-squared:  0.2585, Adjusted R-squared:  0.2582 
## F-statistic: 822.4 on 2 and 4718 DF,  p-value: < 2.2e-16

3.3 Was it worth it? Model comparison

Does the model with Condition as a fixed effect explain more variance than the intercept-only model? To find out, we can compare the two models using the anova function. This function implements the likelihood ratio test.

NOTE: the likelihood ratio (LR) test only works for nested models, i.e. models that have the same structure except that one model has some additional terms. The LR test would then tell us whether these terms are worth adding - in formal words, whether they significantly improve model fit. If you want to compare non-nested models, you can use other methods, such as AIC.
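
In code, the comparison is a single call (a sketch using the model objects defined above; the exact call that generated the table below is not shown):

anova(m.lin.noCond, m.lin)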

## Data: data.md.red
## Models:
## m.lin.noCond: Effect_Size ~ 1
## m.lin: Effect_Size ~ Condition
##              Df   AIC   BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## m.lin.noCond  2 10257 10270 -5126.4    10253                             
## m.lin         3 10230 10249 -5111.9    10224 28.878      1  7.709e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the models are significantly different from each other (\(\chi^2\)(1) = 28.9, p < .001) and that the second model explains more variance: under m.lin, the data have a higher (“less negative”) log likelihood. Thus, we can conclude that the second model provides a better fit to the data.

3.4 Adding random intercepts

3.4.1 Random intercept #1: participant

Now we’re ready to actually build a mixed effect model.

Our previous model assumed that all our participants had the same response to the S and W conditions - any variation was put in the error term. However, we know that some participants tend to have generally high responses while others have generally low ones; thus, we might expect someone with a really high response to words to also have a high response to sentences. Can we incorporate that knowledge into the model?

We can! Let’s tell the model that the intercept varies across participants by adding an additional term to it:

\[\begin{equation} \vec{y} = X * \vec{\beta} + Z * \vec{b} + \epsilon \tag{3.5} \end{equation}\]

Here, \(Z\) is an indicator matrix with one column per participant (a 1 marks the rows that come from that participant), and \(\vec{b}\) specifies a participant-specific offset in the response strength - some participants will be above average, and some below average. This offset is not condition-specific (yet).

Why do we call it a random effect? Unlike Condition, each value of \(\vec{b}\) is participant-specific. Still, we could simply fold the columns of \(Z\) into \(X\) - one column per participant, with 1’s marking all trials completed by that participant. Then the participant-specific estimates would just become part of \(\vec{\beta}\).

In some cases, modeling participant-specific variation as a fixed effect might be a valid approach - for instance, if you have 3 participants and a ton of data for each. However, in our case, we have many participants but not much data per participant, so we would just get a lot of noisy estimates. Moreover, all the participant-specific terms have something in common - they are not fully independent. For example, if your participant intercepts are 1, 1.1, 0.9, 1.2, you have a pretty strong hunch that the next one is not going to be 15. Mathematically, we can state that participant-specific intercepts come from a normal distribution whose mean is the group intercept (estimated as the first term in \(\vec{\beta}\)) and whose variance is a free parameter we need to estimate.

Thus, our \(\vec{b}\) term is never evaluated directly! It is a random variable of the form

\[\begin{equation} \vec{b} \sim \mathcal{N}(0,\sigma^{2}) \end{equation}\]

(the mean is 0 because this term is only estimating the participant-specific offset from the overall intercept)

With this ‘random effect’ trick, we achieve three goals:

  1. We capture the intercept variation across all participants by estimating just one parameter - \(\sigma^{2}\).

  2. We leverage the power of the full dataset, not just the participant-specific data. Thus, if any one participant’s values are unusually high, they will not have a strong effect on the model - the normal distribution will “regularize” the estimates for that participant.

  3. We make generalizable inferences: instead of estimating the effects for each specific participant, we produce an estimate for the entire population.

LET’S RUN OUR FIRST MIXED EFFECT MODEL!

The previous models did not have any random terms, and so we evaluated them using the ‘lm’ function. For mixed models, we use ‘lmer’.
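
Here is what the call might look like. The lmerTest package wraps lme4’s lmer() and adds the Satterthwaite-based p-values you see in the output; REML = FALSE is an assumption based on the “fit by maximum likelihood” line below.

library(lmerTest)  # loads lme4 and adds p-values for the fixed effects
m.ri1 <- lmer(Effect_Size ~ Condition + (1 | SubjectID),
              data = data.md.red, REML = FALSE)
summary(m.ri1)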

## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: Effect_Size ~ Condition + (1 | SubjectID)
##    Data: data.md.red
## 
##      AIC      BIC   logLik deviance df.resid 
##   7735.1   7761.0  -3863.6   7727.1     4716 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.4796 -0.5633 -0.0710  0.4803  5.7666 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  SubjectID (Intercept) 0.2365   0.4864  
##  Residual              0.2758   0.5252  
## Number of obs: 4720, groups:  SubjectID, 115
## 
## Fixed effects:
##               Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)    0.47069    0.04663  121.41170  10.094  < 2e-16 ***
## ConditionS    -0.11198    0.01529 4604.97757  -7.324 2.82e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## ConditionS -0.164

Note that the fixed effects structure is the same as before, but now we also have the “random effects” section, which tells us how much variance was explained by the intercept varying across participants and how much is left in the residuals.

The fixed effect estimates themselves have changed - adding the random effect changes the way the model is fitted to the data.

Does this random effect add to explained variance compared to the fixed-effect-only model?
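
Again, a single anova() call answers this; lme4’s anova method also accepts the plain lm fit for comparison:

anova(m.ri1, m.noR)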

## Data: data.md.red
## Models:
## m.noR: Effect_Size ~ 0 + Condition
## m.ri1: Effect_Size ~ Condition + (1 | SubjectID)
##       Df     AIC   BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## m.noR  3 10229.9 10249 -5111.9  10223.9                             
## m.ri1  4  7735.1  7761 -3863.6   7727.1 2496.7      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4.2 Random intercept #2: experiment

Another way to think about random effects is that they specify additional clusters in the data. For instance, most datapoints from participant A will be shifted above the mean, while most datapoints from participant B will be below it. A’s datapoints and B’s datapoints will form two different clusters.

The dataset we’re using has another obvious source of clusters - it was collected across multiple experiments. Although we would hope that the recorded data would be unaffected by the specifics of each experiment, in practice this is rarely the case. The data might be collected on different MRI scanners, during different seasons, the Sentence & Word blocks might have had different lengths, etc. If you have all this information and you think that each of these variables might affect your results, you can go ahead and enter them as terms directly. Here, however, we will just use Experiment.

We assume that the mean responses to both conditions will vary across experiments (and, as before, across participants).

To save space, from now on we will only print out fixed and random effect estimates.
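
The model itself (same assumptions as before), along with calls that produce printouts like the ones below:

m.ri2 <- lmer(Effect_Size ~ Condition + (1 | SubjectID) + (1 | Experiment),
              data = data.md.red, REML = FALSE)
coef(summary(m.ri2))  # fixed effect estimates
VarCorr(m.ri2)        # random effect standard deviations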

##               Estimate Std. Error          df   t value     Pr(>|t|)
## (Intercept)  0.4563056 0.08056281    7.579646  5.663973 5.752963e-04
## ConditionS  -0.1119761 0.01526275 4603.939404 -7.336561 2.576588e-13
##  Groups     Name        Std.Dev.
##  SubjectID  (Intercept) 0.44329 
##  Experiment (Intercept) 0.16633 
##  Residual               0.52429

Previous model (m.ri1) reminder: (Intercept) 0.47069, ConditionS -0.11198

We see again that changing the random effect structure affects the fixed effects too.

Let’s compare our models to see if adding the experiment intercept actually improved model fit.

## Data: data.md.red
## Models:
## m.ri1: Effect_Size ~ Condition + (1 | SubjectID)
## m.ri2: Effect_Size ~ Condition + (1 | SubjectID) + (1 | Experiment)
##       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## m.ri1  4 7735.1 7761.0 -3863.6   7727.1                             
## m.ri2  5 7709.9 7742.2 -3850.0   7699.9 27.236      1  1.801e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It did! We have a significant difference between the models, with model 2 having a higher log likelihood of the data.

3.4.3 Random intercept #3: region

Finally, we know that different brain regions have different baseline levels of BOLD activity: for example, regions close to large vessels will have strong responses to everything. So let’s add a random intercept by Region as well.
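
The call (the formula matches the one shown in the comparison below):

m.ri3 <- lmer(Effect_Size ~ Condition + (1 | SubjectID) + (1 | Experiment) +
                (1 | Region),
              data = data.md.red, REML = FALSE)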

##               Estimate Std. Error         df   t value     Pr(>|t|)
## (Intercept)  0.4566623 0.09492157   12.84781  4.810944 3.514693e-04
## ConditionS  -0.1119761 0.01390210 4584.90113 -8.054616 1.007830e-15
##  Groups     Name        Std.Dev.
##  SubjectID  (Intercept) 0.44491 
##  Region     (Intercept) 0.21771 
##  Experiment (Intercept) 0.16917 
##  Residual               0.47755

Previous model (m.ri2) reminder: (Intercept) 0.4563056, ConditionS -0.1119761

This time, the fixed effect estimates remained almost the same.

But did adding a random intercept by region improve model fit further? Yes, it did.

## Data: data.md.red
## Models:
## m.ri2: Effect_Size ~ Condition + (1 | SubjectID) + (1 | Experiment)
## m.ri3: Effect_Size ~ Condition + (1 | SubjectID) + (1 | Experiment) + 
## m.ri3:     (1 | Region)
##       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## m.ri2  5 7709.9 7742.2 -3850.0   7699.9                             
## m.ri3  6 6927.0 6965.8 -3457.5   6915.0 784.88      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.5 It gets better - random slopes

Ok, our model is getting pretty complicated. But as you were reading about all those random intercepts, you might have been wondering - what if it’s not just the mean activity that varies across participants/regions/experiments? What if participant A not only has stronger activity overall but also has stronger responses to Sentences specifically?

This kind of structure can be captured too. We call it a random slope because if we plot Condition on the x axis and Response on the y axis, the effect of interest will be the slope of the line going from the W value to the S value. High slope = steeper increase in activation as we go from W to S.

Remember that so far we have been specifying our random effects as (1 | RandomVar). What does this mean? It means: let the terms on the left-hand side of the | vary across the levels of the grouping variable on the right. So far we’ve only had 1 on the left, which is our intercept. But now we also want to say that the effect of Condition can vary across those groups, so we add Condition to the left side too.
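
A “maximal” version of such a model might look like the sketch below. This is an assumption - the formula of the model that produced the warning is not shown, and the object name m.rs.full is made up for illustration:

m.rs.full <- lmer(Effect_Size ~ Condition + (1 + Condition | SubjectID) +
                    (1 + Condition | Experiment) + (1 + Condition | Region),
                  data = data.md.red, REML = FALSE)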

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.0294832
## (tol = 0.002, component 1)

Uh oh, what happened here? We get a warning saying that our model failed to converge. That’s not good - it means that our effect estimates might be imprecise.

What can we do to solve this? There are a few technical tricks we can use to force the model to converge (see section 4.3). But generally, the reason our model doesn’t converge here is that it’s too complicated - we have too many parameters and not enough data. So for now, let’s simplify and just add one random slope.
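
The simplified model keeps the by-participant random slope only (this is the formula used for m.ris in the comparison below):

m.ris <- lmer(Effect_Size ~ Condition + (1 + Condition | SubjectID) +
                (1 | Experiment) + (1 | Region),
              data = data.md.red, REML = FALSE)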

##               Estimate Std. Error        df   t value     Pr(>|t|)
## (Intercept)  0.4552242 0.09588469  13.57937  4.747622 0.0003385923
## ConditionS  -0.1088011 0.03287757 115.00448 -3.309282 0.0012494232
##  Groups     Name        Std.Dev. Corr  
##  SubjectID  (Intercept) 0.47456        
##             ConditionS  0.32323  -0.343
##  Region     (Intercept) 0.21798        
##  Experiment (Intercept) 0.16834        
##  Residual               0.44852

Note that the model also estimates the correlation between the random intercept and the random slope (in essence, an extra parameter to take care of). Here, the intercept and the slope are negatively correlated: participants with higher overall responses tend to show a smaller (more negative) S>W difference.

THINK: how can we modify equation (3.5) to incorporate the random slope by condition?

Let’s check: did adding a random slope help beyond random intercepts? It did.

## Data: data.md.red
## Models:
## m.ri3: Effect_Size ~ Condition + (1 | SubjectID) + (1 | Experiment) + 
## m.ri3:     (1 | Region)
## m.ris: Effect_Size ~ Condition + (1 + Condition | SubjectID) + (1 | 
## m.ris:     Experiment) + (1 | Region)
##       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## m.ri3  6 6927.0 6965.8 -3457.5   6915.0                             
## m.ris  8 6567.3 6619.0 -3275.7   6551.3 363.69      2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that here I decided to only keep the random slope by participant, and it happened to actually be helpful. If possible, you should be more systematic when deciding on model structure. For details, see section 4.3.4.

3.6 Interactions, or Fixed effects can be complicated too

I will leave you with a final twist on our model.

So far we’ve only looked at Condition as our fixed effect: our goal was to get to random effects as soon as possible. But there is another important source of variation in our data - Hemisphere.

Since there are only two hemispheres, there is no reason to include it as a random effect (in this case, we’d have to imagine left and right hemispheres as samples from an infinite population of hemispheres…). So this is going to be our second fixed effect.
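
A sketch of the call (the object name m.ris2 is a guess - the final comparison only names m.ris and m.ris3):

m.ris2 <- lmer(Effect_Size ~ Condition + Hemisphere + (1 + Condition | SubjectID) +
                 (1 | Experiment) + (1 | Region),
               data = data.md.red, REML = FALSE)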

##               Estimate Std. Error        df   t value     Pr(>|t|)
## (Intercept)  0.5154497 0.10594841  18.13017  4.865101 0.0001220173
## ConditionS  -0.1088011 0.03287916 114.99242 -3.309121 0.0012501139
## HemisphereR -0.1204575 0.09446622  19.44581 -1.275139 0.2172850860
##  Groups     Name        Std.Dev. Corr  
##  SubjectID  (Intercept) 0.47453        
##             ConditionS  0.32325  -0.343
##  Region     (Intercept) 0.20921        
##  Experiment (Intercept) 0.16812        
##  Residual               0.44852

Great, now we know that the left and right hemispheres do not have significantly different overall activity levels¹. But we also want to know whether our Condition effects vary by Hemisphere. What do we need to include? An interaction.
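
The call, using * to request both main effects and their interaction (the formula matches the one shown in the comparison at the end of this section):

m.ris3 <- lmer(Effect_Size ~ Condition * Hemisphere + (1 + Condition | SubjectID) +
                 (1 | Experiment) + (1 | Region),
               data = data.md.red, REML = FALSE)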

##                           Estimate Std. Error         df    t value
## (Intercept)             0.49408827 0.10613083   18.29735  4.6554643
## ConditionS             -0.06607851 0.03537294  153.88236 -1.8680525
## HemisphereR            -0.07773581 0.09537081   20.18837 -0.8150901
## ConditionS:HemisphereR -0.08544343 0.02608228 4469.92124 -3.2759190
##                            Pr(>|t|)
## (Intercept)            0.0001890031
## ConditionS             0.0636561836
## HemisphereR            0.4245326473
## ConditionS:HemisphereR 0.0010612211
##  Groups     Name        Std.Dev. Corr  
##  SubjectID  (Intercept) 0.47461        
##             ConditionS  0.32334  -0.343
##  Region     (Intercept) 0.20923        
##  Experiment (Intercept) 0.16803        
##  Residual               0.44798

Here, * is a shortcut meaning “estimate the main effects and the interaction”. Thus Condition*Hemisphere is equivalent to Condition + Hemisphere + Condition:Hemisphere, where : stands for the interaction itself.

And hey, looks like the Sentence responses in the right hemisphere MD regions are weaker than in the left.

Let’s do a final formal check to see whether adding the hemisphere terms was useful.

## Data: data.md.red
## Models:
## m.ris: Effect_Size ~ Condition + (1 + Condition | SubjectID) + (1 | 
## m.ris:     Experiment) + (1 | Region)
## m.ris3: Effect_Size ~ Condition * Hemisphere + (1 + Condition | SubjectID) + 
## m.ris3:     (1 | Experiment) + (1 | Region)
##        Df    AIC    BIC  logLik deviance Chisq Chi Df Pr(>Chisq)   
## m.ris   8 6567.3 6619.0 -3275.7   6551.3                           
## m.ris3 10 6559.1 6623.7 -3269.5   6539.1 12.28      2   0.002155 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yup, we explained additional variance.

3.7 Summary

  • Fixed effects in a linear mixed effect model act just like regular regression terms.
  • For categorical variables, the first level of a fixed effect acts as the reference level and is absorbed into the intercept (unless you explicitly suppress the intercept with a 0 in the formula).
  • Random effects are not simple regression terms; they are random variables. As such, you don’t estimate a value for each level (e.g. mean activation for each participant) but instead find the variance around the mean. For instance, you might estimate how much each participant’s data are likely to deviate from the mean value.
  • You can estimate random intercepts (how much does a random effect influence the intercept) and random slopes (how much does a random effect modulate a fixed effect).
  • You can decide whether adding a particular term improves model fit by conducting a likelihood ratio test.

  1. during word reading - see section 4.2.2