Describe, conceptually, what the likelihood function and maximum likelihood estimation are
Describe the differences between maximum likelihood and restricted maximum likelihood
Conduct statistical tests for fixed effects, and use the small-sample correction when needed
Use the likelihood ratio test to test random slopes
Estimate multilevel models with the Bayesian/Markov Chain Monte Carlo estimator in the brms package
Regression: OLS
MLM: Maximum likelihood, Bayesian
># Linear mixed model fit by REML ['lmerMod']
># Formula: Reaction ~ Days + (Days | Subject)
>#    Data: sleepstudy
># REML criterion at convergence: 1744
># Random effects:
>#  Groups   Name        Std.Dev. Corr
>#  Subject  (Intercept) 24.74
>#           Days         5.92    0.07
>#  Residual             25.59
># Number of obs: 180, groups: Subject, 18
># Fixed Effects:
># (Intercept)         Days
>#       251.4         10.5
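For reference, here is a minimal sketch of the lme4 call that produces output like the above (the sleepstudy data ship with lme4; the object name is illustrative):

library(lme4)
# Random intercepts and random slopes of Days across Subjects
fm_sleep <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
fm_sleep  # printing the fit gives the summary shown above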
Find the $\gamma$s, $\tau$s, and $\sigma$ that maximize the likelihood function
$$\ell(\boldsymbol{\gamma}, \boldsymbol{\tau}, \sigma; \mathbf{y}) = -\frac{1}{2} \left\{\log \lvert \mathbf{V}(\boldsymbol{\tau}, \sigma) \rvert + (\mathbf{y} - \mathbf{X}\boldsymbol{\gamma})^\top \mathbf{V}^{-1}(\boldsymbol{\tau}, \sigma) (\mathbf{y} - \mathbf{X}\boldsymbol{\gamma})\right\} + K$$
Here's the log-likelihood function for the coefficient of meanses
(see code in the provided Rmd):
># iteration: 1
># f(x) = 47022.519159
># iteration: 2
># f(x) = 47151.291766
># iteration: 3
># f(x) = 47039.480137
># iteration: 4
># f(x) = 46974.909593
># iteration: 5
># f(x) = 46990.872588
># iteration: 6
># f(x) = 46966.453125
># iteration: 7
># f(x) = 46961.719993
># iteration: 8
># f(x) = 46965.890703
># iteration: 9
># f(x) = 46961.367013
># iteration: 10
># f(x) = 46961.288830
># iteration: 11
># f(x) = 46961.298898
># iteration: 12
># f(x) = 46961.284848
># iteration: 13
># f(x) = 46961.285238
># iteration: 14
># f(x) = 46961.284845
># iteration: 15
># f(x) = 46961.284848
># iteration: 16
># f(x) = 46961.284845
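A trace like this can be printed by asking lme4's optimizer to be verbose; a sketch, assuming the random-intercept model and hsball data used later in these slides:

# verbose = 1 prints the optimizer's progress at each iteration
m_lv2 <- lmer(mathach ~ meanses + (1 | id), data = hsball, verbose = 1)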
REML corrects the degrees of freedom when estimating the variance components (analogous to dividing by N − 1 instead of by N when estimating a variance)
REML is generally preferred in smaller samples
The difference is small when the number of clusters is large
Technically speaking, REML only estimates the variance components[1]
[1]: The fixed effects are integrated out and are not part of the likelihood function. They are solved in a second step, usually by the generalized least squares (GLS) method
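In lme4, REML is the default; a minimal sketch of fitting the same model both ways (model and variable names assumed from the HSB example used in these slides):

# REML is the default in lmer()
m_reml <- lmer(mathach ~ meanses + (1 | id), data = hsball)
# Set REML = FALSE for full maximum likelihood
m_ml <- lmer(mathach ~ meanses + (1 | id), data = hsball, REML = FALSE)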
| | REML | ML |
|---|---|---|
| (Intercept) | 12.649 | 12.650 |
| | (0.149) | (0.148) |
| meanses | 5.864 | 5.863 |
| | (0.361) | (0.359) |
| SD (Intercept id) | 1.624 | 1.610 |
| SD (Observations) | 6.258 | 6.258 |
| Num.Obs. | 7185 | 7185 |
| R2 Marg. | 0.123 | 0.123 |
| R2 Cond. | 0.179 | 0.178 |
| AIC | 46969.3 | 46969.3 |
| BIC | 46996.8 | 46996.8 |
| ICC | 0.1 | 0.1 |
| RMSE | 6.21 | 6.21 |
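A side-by-side table like the one above can be produced with, e.g., the modelsummary package (an assumption about how these tables were generated):

library(modelsummary)
# Place the REML and ML fits side by side
msummary(list(REML = m_reml, ML = m_ml))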
With a random subsample of 16 schools (686 students), the difference between REML and ML is more noticeable:

| | REML | ML |
|---|---|---|
| (Intercept) | 12.809 | 12.808 |
| | (0.504) | (0.471) |
| meanses | 6.577 | 6.568 |
| | (1.281) | (1.197) |
| SD (Intercept id) | 1.726 | 1.581 |
| SD (Observations) | 5.944 | 5.944 |
| Num.Obs. | 686 | 686 |
| R2 Marg. | 0.145 | 0.146 |
| R2 Cond. | 0.211 | 0.203 |
| AIC | 4419.6 | 4419.7 |
| BIC | 4437.7 | 4437.8 |
| ICC | 0.1 | 0.1 |
| RMSE | 5.89 | 5.89 |
geepack R package[1]
[1]: See van der Leeden et al. (2008) and Lai (2021)
Likelihood ratio: $\dfrac{L(\gamma = 0)}{L(\gamma = \hat{\gamma})}$

Deviance: $-2 \times \log\left(\dfrac{L(\gamma = 0)}{L(\gamma = \hat{\gamma})}\right)$
$= -2\,\mathrm{LL}(\gamma = 0) - [-2\,\mathrm{LL}(\gamma = \hat{\gamma})]$
$= \mathrm{Deviance} \mid_{\gamma = 0} - \mathrm{Deviance} \mid_{\gamma = \hat{\gamma}}$
For likelihood ratio tests of fixed effects, ML (instead of REML) should be used
...
># Linear mixed model fit by maximum likelihood ['lmerMod']
># Formula: mathach ~ (1 | id)
>#      AIC      BIC   logLik deviance df.resid
>#    47122    47142   -23558    47116     7182
...
...
># Linear mixed model fit by maximum likelihood ['lmerMod']
># Formula: mathach ~ meanses + (1 | id)
>#      AIC      BIC   logLik deviance df.resid
>#    46967    46995   -23480    46959     7181
...
pchisq(47115.81 - 46959.11, df = 1, lower.tail = FALSE)
># [1] 5.95e-36
In lme4, you can also use
anova(m_lv2, ran_int)  # Automatically refits with ML
LRT assumes that the deviance under the null follows a χ2 distribution, which is not likely to hold in small samples
E.g., 16 Schools
It is based on the Wald test (not the LRT):
The small-sample correction does two things: it adjusts the standard errors of the fixed effects for small-sample bias, and it approximates the denominator degrees of freedom of the F test (which can be fractional, as in the output below)
Generally performs well even with fewer than 50 clusters
# Wald
anova(m_contextual, ddf = "lme4")

># Analysis of Variance Table
>#         npar Sum Sq Mean Sq F value
># meanses    1    860     860    26.4
># ses        1   1874    1874    57.5
# K-R
anova(m_contextual, ddf = "Kenward-Roger")

># Type III Analysis of Variance Table with Kenward-Roger's method
>#         Sum Sq Mean Sq NumDF DenDF F value  Pr(>F)
># meanses    324     324     1    16    9.96  0.0063 **
># ses       1874    1874     1   669   57.53 1.1e-13 ***
># ---
># Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For meanses, the critical value (and the p value) is determined based on an $F(1, 15.51)$ distribution, which has a critical value of
qf(.95, df1 = 1, df2 = 15.51)
># [1] 4.52
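The p value reported in the table can be recovered from the same distribution:

pf(9.96, df1 = 1, df2 = 15.51, lower.tail = FALSE)  # approximately .0063, as reported above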
Theoretically, yes, unless you're certain that the slopes are the same for every group
However, frequentist estimation often runs into convergence problems with more than two random slopes
The LRT ($\chi^2$) is generally a two-tailed test. But for random slopes,
$H_0: \tau_1 = 0$ is a one-tailed hypothesis
A quick solution is to divide the resulting p value by 2[1]
[1]: Originally proposed by Snijders & Bosker; tested in simulation by LaHuis & Ferguson (2009, https://doi.org/10.1177/1094428107308984)
...
># Formula: mathach ~ meanses + ses_cmc + (ses_cmc | id)
>#    Data: hsball
># REML criterion at convergence: 46558
...

...
># Formula: mathach ~ meanses + ses_cmc + (1 | id)
>#    Data: hsball
># REML criterion at convergence: 46569
...
$$\begin{bmatrix} \tau^2_0 & \\ \tau_{01} & \tau^2_1 \end{bmatrix} \quad \text{vs.} \quad \begin{bmatrix} \tau^2_0 & \\ 0 & 0 \end{bmatrix}$$
pchisq(10.92681, df = 2, lower.tail = FALSE)
># [1] 0.00424
Need to divide the resulting p value by 2 (the test of the random slope is one-tailed)
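In lme4, the same comparison can be run with anova(); a sketch, assuming m_rs and m_ri are the random-slope and random-intercept fits shown above (hypothetical names):

# refit = FALSE keeps the REML fits, which is appropriate for testing random effects
anova(m_rs, m_ri, refit = FALSE)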
Better small sample properties
Less likely to have convergence issues
CIs available for any quantities (R2, predictions, etc)
Supports complex models not possible with lme4
Increasingly popular in recent research
Computationally more intensive
Need to learn some new terminology
So, in addition to the likelihood function (as in ML/REML), we need prior information
Prior: Belief about a parameter before looking at the data
For this course, we use the default priors set by the brms package
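A minimal sketch of fitting the random-intercept model in brms (same formula syntax as lme4; the seed argument is just for reproducibility):

library(brms)
m_bayes <- brm(mathach ~ meanses + (1 | id), data = hsball, seed = 123)
summary(m_bayes)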
| | MCMC |
|---|---|
| b_Intercept | 12.647 |
| | [12.348, 12.939] |
| b_meanses | 5.854 |
| | [5.141, 6.606] |
| sd_id__Intercept | 1.633 |
| | [1.386, 1.901] |
| sigma | 6.258 |
| | [6.159, 6.365] |
| Num.Obs. | 7185 |
| R2 | 0.173 |
| R2 Adj. | 0.158 |
| R2 Marg. | 0.123 |
| ELPD | −23433.7 |
| ELPD s.e. | 49.7 |
| LOOIC | 46867.5 |
| LOOIC s.e. | 99.4 |
| WAIC | 46867.1 |
| RMSE | 6.21 |
| r2.adjusted.marginal | 0.116 |
A coefficient is statistically different from zero when the 95% CI does not contain zero.
A simulation-based approach to approximate the sampling distribution of fixed and random effects
Parametric, Residual, and Cases bootstrap
In my own work,[1] the residual bootstrap was found to perform best, especially when data are not normally distributed and the number of clusters is small
See R code for this week
[1]: Lai (2021, https://doi.org/10.1080/00273171.2020.1746902)
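As an illustration, here is a parametric bootstrap of the fixed effects with lme4::bootMer (the residual bootstrap recommended above is implemented in, e.g., the bootmlm package; m_lv2 is assumed from earlier):

library(boot)
boo <- bootMer(m_lv2, FUN = fixef, nsim = 1999, seed = 123)  # parametric bootstrap
boot.ci(boo, type = "perc", index = 2)  # percentile CI for the meanses coefficient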
We need to make some assumptions:
Known SD: $\sigma = 8$
The scores are normally distributed in the population
Assume that we have scores from 5 representative students
Student | Score |
---|---|
1 | 23 |
2 | 16 |
3 | 5 |
4 | 14 |
5 | 7 |
If we assume that $\mu = 10$, how likely are we to get 5 students with these scores?
Student | Score | $P(Y_i = y_i \mid \mu = 10)$ |
---|---|---|
1 | 23 | 0.013 |
2 | 16 | 0.038 |
3 | 5 | 0.041 |
4 | 14 | 0.044 |
5 | 7 | 0.046 |
Multiplying them all together: $P(Y_1 = 23, Y_2 = 16, Y_3 = 5, Y_4 = 14, Y_5 = 7 \mid \mu = 10)$ = product of the probabilities =
># [1] 4.21e-08
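In R, these probabilities are normal densities, so the product can be computed directly (a sketch reproducing the value above):

prod(dnorm(c(23, 16, 5, 14, 7), mean = 10, sd = 8))  # = 4.21e-08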
Student | Score | $P(Y_i = y_i \mid \mu = 13)$ |
---|---|---|
1 | 23 | 0.023 |
2 | 16 | 0.046 |
3 | 5 | 0.030 |
4 | 14 | 0.049 |
5 | 7 | 0.038 |
Multiplying them all together: $P(Y_1 = 23, Y_2 = 16, Y_3 = 5, Y_4 = 14, Y_5 = 7 \mid \mu = 13)$ = product of the probabilities =
># [1] 5.98e-08
Compute the likelihood for a range of $\mu$ values
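A quick sketch of this grid search in R ($\sigma = 8$ assumed known):

y <- c(23, 16, 5, 14, 7)
mu_grid <- seq(5, 20, by = 0.1)
# Sum of log densities = log-likelihood at each candidate mu
loglik <- sapply(mu_grid, function(mu) sum(dnorm(y, mean = mu, sd = 8, log = TRUE)))
mu_grid[which.max(loglik)]  # 13, the sample mean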
$\hat{\mu} = 13$ maximizes the (log) likelihood function
Maximum likelihood estimator (MLE)
[Figure: likelihood functions for $N = 5$ vs. $N = 20$]