16. October 2025
Why Mixed Models?
Mixed modelling has become pretty popular over the last ten years or so. These models combine conventional fixed effects with random effects (thus, mixed). Wait, random effects? What? My data already has plenty of randomness thanks, why do I need some in my model too?
Okay, some background.
In the Before Time
Statistics, like most of math and science, has been sort of cobbled together. “Cobble” is what cobblers do, which is fix shoes. So statistics is sewn together from lots of different pieces, repaired and patched as bits wear out. It’s tough leather too, so you know the people doing the sewing stabbed themselves a few times along the way.
You may have noticed this, er, patchiness, when trying to figure out what statistical test you should use. Well, this one works on this kind of data, with these assumptions, and this one….
Let’s take an example. Suppose you had some MRI scans from patients with multiple sclerosis. For each one you measure how directional the diffusion of water is (called the fractional anisotropy, or FA) in their white matter. This is an important measure of how healthy the white matter is, because healthy white matter is composed of long, parallel tracts of nerve cells wrapped up in fatty insulation and the water would much rather move along those tracts than across them. Your patients are on one of two different treatments and you’d like to know if there’s a difference. You might choose to use a t-test for this. Great! t-tests are super common. Even Excel can do them. You could even do one by hand on paper if you needed to, and maybe some statistics professor made you do that once upon a time.
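(If paper isn’t your thing, here’s a minimal sketch in Python with scipy; the FA values are invented purely for illustration, not from any study.)

```python
# Two-sample t-test comparing FA between two treatment groups.
# Numbers are made up for illustration.
from scipy import stats

fa_treatment_1 = [0.41, 0.44, 0.39, 0.43, 0.40]
fa_treatment_2 = [0.46, 0.42, 0.47, 0.45, 0.44]

t_stat, p_value = stats.ttest_ind(fa_treatment_1, fa_treatment_2)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```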
BUT, suppose you have three treatments, not two. Now what? Two t-tests? No, bad. This is the source of the “difference of differences” error. THREE t-tests? Yeah, like that’s better. No, you need to use a one-way ANOVA. Good news, the one-way ANOVA is pretty common too, AND it’s kind of general: you can use it for three treatment groups, or four, or five, etc. You can even use it on two groups instead of a t-test (they are equivalent).
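The one-way ANOVA is just as little work in software; a sketch with invented numbers:

```python
# One-way ANOVA across three treatment groups (again, invented numbers).
from scipy import stats

fa_1 = [0.41, 0.44, 0.39, 0.43]
fa_2 = [0.46, 0.42, 0.47, 0.45]
fa_3 = [0.40, 0.38, 0.42, 0.41]

f_stat, p_value = stats.f_oneway(fa_1, fa_2, fa_3)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```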
BUT, suppose some of your patients are adults and some are children and you think this might also be important. Now you’ve got two explanatory variables, or effects, treatment and adult/child. Well, perhaps you were tipped off by the “one-way” in the name of the ANOVA. Turns out there’s also a two-way ANOVA.
BUT, suppose you’ve got three explanatory variables? Yup, three-way ANOVA. And now you’re in seldom explored territory.
BUT, suppose your explanatory variables aren’t categorical (adult vs. child) but are continuous (or close enough you can squint and pretend) like age or height or temperature? Well, there’s regression! And if only some of your variables are categorical there’s multiple regression, or ANCOVA, or….
Maybe there’s something simpler? Maybe something more general?
The General Linear Model
We can describe (or model) our t-test situation with this equation:
$$ y = {\color{yellow}\beta_0} + {\color{yellow}\beta_1} \cdot {\color{lightblue}treatment} + \epsilon $$
where
- \(y\) is the response variable (FA)
- \({\color{lightblue}treatment}\) is zero if they had the first treatment and one if they had the second
- \({\color{yellow}\beta_0}\) is the mean FA in patients with treatment 1
- \({\color{yellow}\beta_1}\) is how much higher the mean FA is in patients with treatment 2 (can be negative)
- \(\epsilon\) is the leftover (residual) error, which we expect to be normally distributed with mean 0 and standard deviation \(\sigma\), written in fancy probabilistic notation like this: \(\epsilon \sim N(0,\sigma)\)
Notice, despite the Greek letters, this is just straightforward adding with a wee bit of multiplying. For patients with treatment 1, \({\color{lightblue}treatment}\) is zero, so \({\color{yellow}\beta_1} \cdot {\color{lightblue}treatment}\) is also zero and has no effect, and their observed value \(y\) is just equal to \({\color{yellow}\beta_0}\) plus whatever error is left over (\(\epsilon\)). For patients on treatment 2, \({\color{lightblue}treatment}\) is one, so they get \({\color{yellow}\beta_1}\) added on.
\({\color{yellow}\beta_0}\) and \({\color{yellow}\beta_1}\) are the parameters and when we fit the model we’re just trying to find good values for them that make the error as small as possible. Once we find those, we can look at the residuals, calculate their standard deviation \(\sigma\) the usual way, and then use a t-test (or an F-test) to see whether the value of \({\color{yellow}\beta_1}\) is significantly different than zero (i.e. it has a significant effect on \(y\)).
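Fitting this is a one-liner in most stats software. Here’s a sketch with statsmodels, using invented data and hypothetical column names (FA, treatment):

```python
# The two-treatment comparison written as a linear model and fit by
# ordinary least squares. Data and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "FA":        [0.41, 0.44, 0.39, 0.43, 0.46, 0.42, 0.47, 0.45],
    "treatment": [0, 0, 0, 0, 1, 1, 1, 1],  # 0 = treatment 1, 1 = treatment 2
})

fit = smf.ols("FA ~ treatment", data=df).fit()
print(fit.params)   # beta_0 (Intercept) and beta_1 (treatment)
print(fit.pvalues)  # the test on each beta, done for you
```

The p-value reported for the treatment coefficient is exactly the t-test on \({\color{yellow}\beta_1}\).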
If you’re still paying attention you may have noticed that I just described a t-test by using a fancy equation and then saying you still have to do a t-test. BUT, this equation has a trick up its sleeve. It can do this:
$$ y = {\color{yellow}\beta_0} + {\color{yellow}\beta_1} \cdot {\color{lightblue}treatment_1} + {\color{yellow}\beta_2} \cdot {\color{lightblue}treatment_2} + \epsilon $$
That’s right, another plus sign, and another \(\beta\). Now we’ve got three treatments, like in the one-way ANOVA. We can keep going too. Four, five, whatever. Just add more terms.
BUT, those terms don’t all have to be \({\color{lightblue}treatment}\). We can do this too:
$$ y = {\color{yellow}\beta_0} + {\color{yellow}\beta_1} \cdot {\color{lightblue}treatment_1} + {\color{yellow}\beta_2} \cdot {\color{lightblue}adult} + {\color{yellow}\beta_3} \cdot {\color{lightblue}child} + \epsilon $$
Hello two-way (plus) ANOVA. We can do this other neat thing too:
$$ y = {\color{yellow}\beta_0} + {\color{yellow}\beta_1} \cdot {\color{lightblue}treatment_1} + {\color{yellow}\beta_2} \cdot {\color{lightblue}adult} + {\color{yellow}\beta_3} \cdot {\color{lightblue}child} + {\color{yellow}\beta_4} \cdot {\color{lightblue}child} \cdot {\color{lightblue}treatment_1} + \epsilon $$
That \({\color{yellow}\beta_4} \cdot {\color{lightblue}child} \cdot {\color{lightblue}treatment_1}\) term is an interaction because it models the effect of both having \({\color{lightblue}treatment_1}\) and being a \({\color{lightblue}child}\), over and above the effect of each one individually.
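In software this is still just one model with a longer formula. A sketch of the two-way-with-interaction version (invented data, hypothetical column names):

```python
# Treatment, age group, and their interaction, all in one formula.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "FA":        [0.41, 0.44, 0.46, 0.42, 0.40, 0.38,
                  0.45, 0.47, 0.39, 0.43, 0.44, 0.41],
    "treatment": [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
    "age_group": ["adult"] * 6 + ["child"] * 6,
})

# C() marks a column as categorical; '*' gives main effects plus the interaction.
fit = smf.ols("FA ~ C(treatment) * C(age_group)", data=df).fit()
print(fit.params)
```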
If you’re into compact t-shirt worthy equations, we can use the magic of matrices to write all these models generally and compactly as:
$$ \boldsymbol{Y} = {\color{lightblue}\boldsymbol{X}}{\color{yellow}\boldsymbol{\beta}} + \boldsymbol{\epsilon} $$
Note that the symbols are all bold now (and \(y\) became \(\boldsymbol{Y}\)) because they’re now matrices or vectors:
- \(\boldsymbol{Y}\) is a column of all the FA measurements
- \({\color{yellow}\boldsymbol{\beta}}\) is a column of all the \({\color{yellow}\beta}\) parameters (note there are only a few of these because all observations share the same parameter values)
- \(\boldsymbol{\epsilon}\) is a column of all the \(\epsilon\) residuals.
- \({\color{lightblue}\boldsymbol{X}}\) is called the design matrix and it is a matrix with a row of all the predictor values for each FA measurement.
The \({\color{lightblue}\boldsymbol{X}}\) and \({\color{yellow}\boldsymbol{\beta}}\) switched order, but this is just convention, because mathematicians like column vectors better than row vectors.
All these equations are called linear predictors because they’re linear (notice all the plus signs) and they predict the measurements in \(\boldsymbol{Y}\). The model represented by the linear predictor is called the general linear model or GLM.
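If you want to see a design matrix with your own eyes, here’s the two-treatment example again in matrix form, with plain numpy and invented numbers:

```python
# The two-treatment model in matrix form Y = X @ beta + eps.
import numpy as np

y = np.array([0.41, 0.44, 0.39, 0.43, 0.46, 0.42, 0.47, 0.45])
treatment = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Design matrix: a column of ones (for beta_0) and the treatment column (for beta_1).
X = np.column_stack([np.ones_like(y), treatment])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimates
residuals = y - X @ beta

print(beta)  # [beta_0, beta_1]
```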
So there you go. No need to try and find (or write) some software that can do a six-way ANOVA, or try and figure out whether you need an ANCOVA or multiple regression. Just use a GLM. Oh, and by the way, if you did manage to find a six-way ANOVA, it would almost certainly be a GLM pretending to be an ANOVA.
By the way, there’s also the generalized linear model that runs the linear prediction through a nonlinear function in order to become logistic regression, Poisson regression, etc. The neural net types might be getting a tingling sensation right now. But that’s a story for another article, to be written another day.
The General Linear Mixed Model
But wait a minute, someone in the audience says. What about an ANOVA with replication?
Yeah, there are yet more kinds of ANOVA. Paired t-tests are also a thing.
Imagine you didn’t just do one MRI per patient, but you scanned them each two or more times. Each patient’s scans are related to each other, because they’re of the same person: they’re not independent. We might also be very suspicious of a study that was a hundred scans of one person instead of one scan each of a hundred people.
On the other hand, we might also suspect that knowing the measurements are grouped according to a particular structure (e.g. within subject, geographically, temporally, etc.) might be an important thing to model. In this case we have scans that are naturally grouped by being from the same person at different times or under different conditions.
So back to ANOVA? No, you have not suffered all this for nothing. Instead we’ll use the GLM’s superpower. We’ll just add some more terms:
$$ y = {\color{yellow}\beta_0} + {\color{yellow}\beta_1} \cdot {\color{lightblue}treatment} + {\color{lightgreen}\beta_2\cdot{\color{lightblue}patient_2}} + {\color{lightgreen}\beta_3\cdot{\color{lightblue}patient_3}} + … + \epsilon $$
This is completely fine (so long as we have multiple measurements per patient) and works great! Subject ID is a fantastic predictor! We can even have interactions between the regular terms and the \({\color{lightblue}patient}\) terms, so (if we had enough observations under appropriate conditions) we could do things like model patient-specific treatment responses. I’ve stuck with \({\color{lightblue}treatment}\) as a factor here, but it could be \({\color{lightblue}time}\) if you’re into longitudinal studies, or anything else, or all of the above with interactions between.
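A sketch of that per-patient-dummy approach (hypothetical names; I’ve used time as the shared predictor, per the longitudinal option above, since it varies within each patient):

```python
# Subject ID as an ordinary fixed effect: one dummy parameter per patient.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "FA":      [0.41, 0.43, 0.45, 0.46, 0.39, 0.40, 0.47, 0.48],
    "time":    [0, 1, 0, 1, 0, 1, 0, 1],   # scan 1 and scan 2 for each patient
    "subject": ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"],
})

# C(subject) expands into a separate dummy (and parameter) for every patient but one.
fit = smf.ols("FA ~ time + C(subject)", data=df).fit()
print(fit.params)
```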
There’s a wee problem though. “If we had enough observations” should make your eyebrow twitch. MRIs are expensive, and the “…” in the equation means there are lots of extra parameters, so many I won’t even write them out. Also, are we really interested in modelling Patient47’s specific response anyway? Interested enough to spend statistical power on it?
Suppose we write our model in the matrix form like this:
$$ \boldsymbol{Y} = {\color{lightblue}\boldsymbol{X}}{\color{yellow}\boldsymbol{\beta}} + {\color{lightblue}\boldsymbol{Z}}{\color{lightgreen}\boldsymbol{\gamma}} + \boldsymbol{\epsilon} $$
where I’ve split up the \({\color{yellow}\boldsymbol{\beta}}\) into the regular ones and the patient-specific ones, which I’ve called \({\color{lightgreen}\boldsymbol{\gamma}}\). \({\color{lightblue}\boldsymbol{Z}}\) is the corresponding design matrix, just like \({\color{lightblue}\boldsymbol{X}}\) but containing the relevant patient-specific predictors, like subject ID.
Now, I don’t really need to fit an independent \({\color{lightgreen}\gamma}\) for each patient. I could assume, for example, that the patient-specific effects are normally distributed, estimate the standard deviation (mean is zero) for that distribution, and then draw the individual \({\color{lightgreen}\gamma}\)s from it. The model might not fit quite as well but it would have a lot fewer parameters.
And that is called the general linear mixed model (GLMM), or mixed model, or random effects model. The \({\color{yellow}\boldsymbol{\beta}}\) are the fixed effects while the \({\color{lightgreen}\boldsymbol{\gamma}}\), because they’re drawn from a random distribution, are the random effects. The random effects are things we’ve decided we’re not interested in enough to model them exhaustively with individual parameters for each group, but we still think they’re important enough we want to include them in a less specific form.
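In statsmodels that change is a different call rather than a different formula: MixedLM with a random intercept per subject. Hypothetical names and invented data again:

```python
# Mixed-model version of the sketch above: time stays a fixed effect, but
# each subject gets a random intercept drawn from a normal distribution
# whose variance is estimated, instead of its own fixed parameter.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "FA":      [0.41, 0.43, 0.44, 0.45, 0.46, 0.47,
                0.39, 0.40, 0.41, 0.47, 0.48, 0.50],
    "time":    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
    "subject": ["p1"] * 3 + ["p2"] * 3 + ["p3"] * 3 + ["p4"] * 3,
})

# groups=... defines the random-effect grouping: a random intercept per subject.
fit = smf.mixedlm("FA ~ time", data=df, groups=df["subject"]).fit()
print(fit.summary())
```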
Okay, But Why??
\({\color{lightgreen}\gamma}\) is a pretty cool Greek letter. \({\color{lightblue}\boldsymbol{Z}}\) is kind of fun too, so the GLMM equation is objectively cooler than the GLM one.
If you need more justification than that, here’s an actual example. A real one, not one I made up.
Figure 1: White matter fractional anisotropy over time in children with multiple sclerosis (green) and healthy controls (blue).
We had a bunch of white matter FA measurements over time from a group of children who had MS, and a group of healthy controls. We expected FA to go up in the kids as their brains developed, although maybe not enough for us to measure. We wanted to know whether that development was compromised in kids with MS, maybe their FA didn’t go up as fast as in the controls, or maybe it even went down.
Yes, I know this sounds like the example I made up, but this one is real. There’s a paper and everything, and the journal’s name consists of only one word so it must be important.
Figure 1 shows what the data looked like.
Not encouraging.
But whatever, if we could analyze data by eyeballing it we wouldn’t use statistics so let’s model it and see. We’ve got longitudinal measurements so we’ll use a mixed model, like so:
$$ Y = {\color{yellow}\beta_1} \cdot {\color{lightblue}{disease}} + {\color{yellow}\beta_2} \cdot {\color{lightblue}{ageC}} + {\color{yellow}\beta_3} \cdot {\color{lightblue}{ageOnsetC}} + {\color{lightblue}subject}\cdot{\color{lightgreen}\boldsymbol{\gamma}} + \boldsymbol{\epsilon} $$
Figure 2: Mixed model predictions (green and blue lines) with 95% confidence intervals (shaded areas).
I’ve written the equation out so you can see all the terms, and I’ve taken some liberty with the notation on the random effect to avoid lots and lots of indices. We have a plain subject-specific offset, and we’re going to estimate it as a random effect, drawn from a distribution, not as a full fixed effect for each kid.
- \({\color{lightblue}{disease}}\) is 1 for MS, 0 for healthy control
- \({\color{lightblue}{ageC}}\) is the centred age. Centred just means the group mean is subtracted off so the mean ageC is 0; you can see this on the x-axis in the graph.
- \({\color{lightblue}{ageOnsetC}}\) is the age of first presentation for the disease group, zero for the controls, likewise centred.
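For the curious, here’s roughly what fitting a model like this looks like in statsmodels. The data below are a synthetic stand-in with hypothetical column names, not the study data:

```python
# Fit a mixed model with fixed effects for disease, ageC and ageOnsetC
# and a random intercept per subject. Synthetic stand-in data only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_scans = 20, 3
subject = np.repeat(np.arange(n_subj), n_scans)
disease = np.repeat(rng.integers(0, 2, n_subj), n_scans)        # 1 = MS, 0 = control
ageC = rng.uniform(-3, 3, n_subj * n_scans)                     # centred age
ageOnsetC = np.where(disease == 1,
                     np.repeat(rng.uniform(-2, 2, n_subj), n_scans), 0.0)
FA = (0.45 + 0.004 * ageC - 0.01 * disease
      + np.repeat(rng.normal(0, 0.02, n_subj), n_scans)         # subject offsets
      + rng.normal(0, 0.01, n_subj * n_scans))                  # residual noise

df = pd.DataFrame({"FA": FA, "disease": disease, "ageC": ageC,
                   "ageOnsetC": ageOnsetC, "subject": subject})

fit = smf.mixedlm("FA ~ disease + ageC + ageOnsetC", data=df,
                  groups=df["subject"]).fit()
print(fit.summary())  # fixed effects with p-values, plus the random-effect variance
```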
And, what do you know, all the effects are significant! Really?? What do the predictions look like (Figure 2)?
Figure 3: Original data with modelled effects of age at onset removed. Compare to Figure 2.
If you see a graph like Figure 2, you should be suspicious. Very suspicious.
Aside: if you see a graph that omits the raw data and just shows the fit lines you should ALSO be suspicious.
In fact, everyone was suspicious of my mixed model graphs so I wrote my own stats module that let me subtract whichever modelled effects (including the random ones) I want from the original data before I graph it.
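The core of that trick is small. Continuing from a fitted MixedLM like the sketch above (names still hypothetical):

```python
# Subtract chosen modelled effects from the raw FA values before plotting.

# Remove the ageOnsetC contribution (as if everyone had the mean age at onset):
df["FA_adj"] = df["FA"] - fit.fe_params["ageOnsetC"] * df["ageOnsetC"]

# ...and, when wanted, each subject's random intercept as well
# (.iloc[0] because the random intercept is the only random effect here):
offsets = df["subject"].map(lambda s: fit.random_effects[s].iloc[0])
df["FA_adj"] = df["FA_adj"] - offsets
```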
Our graph has an axis for FA, naturally, and we’re using the x-axis for \({\color{lightblue}{ageC}}\). But we’ve got this other term, \({\color{lightblue}{ageOnsetC}}\) that’s not shown because we’ve already run out of axes. The point of including a term for age of onset is that the kids’ brains might have been developing normally until they got their first MS attack, and then changed trajectories after. So maybe projecting this data down onto our 2D screen is what’s confusing things? We could make a 3D graph with \({\color{lightblue}{ageOnsetC}}\) on the new axis, but instead let’s subtract our \({\color{lightblue}{ageOnsetC}}\) predictions from the original data so it’s as if all the patients had the same age at onset (Figure 3).
Figure 4: Original data with modelled effects of age at onset and subject-specific random effects removed.
Well, some stuff moved around. Not much though, and it doesn’t look any better. But what if we remove the random effects too (Figure 4)?
Now that looks believable.
Figure 4 shows the data with every modelled effect either shown on an axis (\({\color{lightblue}{ageC}}\)), displayed in a different colour (\({\color{lightblue}{disease}}\)), or subtracted off (\({\color{lightblue}{ageOnsetC}}\) and the random effect), so it clearly shows where those mean trajectories and error bars come from.
So why use mixed models? The random effects are the dramatic difference between Figure 3 (still a mess) and Figure 4 (hey, actually looks like something). We modelled \({\color{lightblue}{disease}}\) and \({\color{lightblue}{ageOnsetC}}\) as fixed effects but the myriad other things that cause simple offsets in FA between people are packed into the random effect, allowing the trajectory to become clear. All for the relatively cheap statistical price of modelling the distribution of the random effects. PLUS we get confidence intervals and p-values that take into account the within-subject correlation from our repeated measurements.