
Model Estimation and Testing

PSYC 575

Mark Lai

University of Southern California

2020/09/01 (updated: 2022-09-17)


Week Learning Objectives

  • Describe, conceptually, what the likelihood function and maximum likelihood estimation are

  • Describe the differences between maximum likelihood and restricted maximum likelihood

  • Conduct statistical tests for fixed effects, and use the small-sample correction when needed

  • Use the likelihood ratio test to test random slopes

  • Estimate multilevel models with the Bayesian/Markov Chain Monte Carlo estimator in the brms package


Estimation

Regression: OLS

MLM: Maximum likelihood, Bayesian


Why should I learn about estimation methods?

  • Understand software options

  • Know when to use better methods

  • Needed for reporting

The most commonly used methods in MLM are

maximum likelihood (ML) and restricted maximum likelihood (REML)

># Linear mixed model fit by REML ['lmerMod']
># Formula: Reaction ~ Days + (Days | Subject)
>#    Data: sleepstudy
># REML criterion at convergence: 1744
># Random effects:
>#  Groups   Name        Std.Dev. Corr
>#  Subject  (Intercept) 24.74
>#           Days         5.92    0.07
>#  Residual             25.59
># Number of obs: 180, groups:  Subject, 18
># Fixed Effects:
># (Intercept)         Days
>#       251.4         10.5

Estimation Methods for MLM


For MLM

Find the γ's, τ's, and σ that maximize the likelihood function

$$\ell(\boldsymbol{\gamma}, \boldsymbol{\tau}, \sigma; \mathbf{y}) = -\frac{1}{2} \left\{ \log \lvert\mathbf{V}(\boldsymbol{\tau}, \sigma)\rvert + (\mathbf{y} - \mathbf{X}\boldsymbol{\gamma})^\top \mathbf{V}^{-1}(\boldsymbol{\tau}, \sigma) (\mathbf{y} - \mathbf{X}\boldsymbol{\gamma}) \right\} + K$$

Here's the log-likelihood function for the coefficient of meanses (see code in the provided Rmd):
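One way to visualize this is to profile the likelihood over values of a coefficient. A minimal sketch, assuming a fitted lmer() model named m_lv2 (as in the example later in these slides):

library(lme4)
library(lattice)
# Profile the deviance over values of the fixed effects;
# "beta_" selects the fixed-effect coefficients
pp <- profile(m_lv2, which = "beta_")
xyplot(pp)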


Numerical Algorithms

The estimates are found iteratively: a numerical optimizer tries successive parameter values until the criterion (a function of the log-likelihood) stops improving, as in this trace:

># iteration: 1
># f(x) = 47022.519159
># iteration: 2
># f(x) = 47151.291766
># iteration: 3
># f(x) = 47039.480137
># iteration: 4
># f(x) = 46974.909593
># iteration: 5
># f(x) = 46990.872588
># iteration: 6
># f(x) = 46966.453125
># iteration: 7
># f(x) = 46961.719993
># iteration: 8
># f(x) = 46965.890703
># iteration: 9
># f(x) = 46961.367013
># iteration: 10
># f(x) = 46961.288830
># iteration: 11
># f(x) = 46961.298898
># iteration: 12
># f(x) = 46961.284848
># iteration: 13
># f(x) = 46961.285238
># iteration: 14
># f(x) = 46961.284845
># iteration: 15
># f(x) = 46961.284848
># iteration: 16
># f(x) = 46961.284845


ML vs. REML

REML has corrected degrees of freedom for the variance component estimates (like dividing by N − 1 instead of by N in estimating variance)

  • REML is generally preferred in smaller samples

  • The difference is small when the number of clusters is large

Technically speaking, REML only estimates the variance components¹

[1] The fixed effects are integrated out and are not part of the likelihood function. They are solved in a second step, usually by the generalized least squares (GLS) method
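In lme4, the two methods are requested through the REML argument. A minimal sketch, assuming the hsball data used in the examples below:

library(lme4)
# REML is the default in lmer()
m_reml <- lmer(mathach ~ meanses + (1 | id), data = hsball)
# Set REML = FALSE for full maximum likelihood
m_ml <- lmer(mathach ~ meanses + (1 | id), data = hsball, REML = FALSE)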


160 Schools

                    REML      ML
(Intercept)         12.649    12.650
                    (0.149)   (0.148)
meanses             5.864     5.863
                    (0.361)   (0.359)
SD (Intercept id)   1.624     1.610
SD (Observations)   6.258     6.258
Num.Obs.            7185      7185
R2 Marg.            0.123     0.123
R2 Cond.            0.179     0.178
AIC                 46969.3   46969.3
BIC                 46996.8   46996.8
ICC                 0.1       0.1
RMSE                6.21      6.21

16 Schools

                    REML      ML
(Intercept)         12.809    12.808
                    (0.504)   (0.471)
meanses             6.577     6.568
                    (1.281)   (1.197)
SD (Intercept id)   1.726     1.581
SD (Observations)   5.944     5.944
Num.Obs.            686       686
R2 Marg.            0.145     0.146
R2 Cond.            0.211     0.203
AIC                 4419.6    4419.7
BIC                 4437.7    4437.8
ICC                 0.1       0.1
RMSE                5.89      5.89

Other Estimation Methods

Generalized estimating equations (GEE)

  • Robust to some misspecification and non-normality
  • May be inefficient in small samples (i.e., with lower power)
  • See Snijders & Bosker 12.2; the geepack R package

Markov Chain Monte Carlo (MCMC)/Bayesian

  • Researchers set prior distributions for the parameters
    • Different from "empirical Bayes," where the prior comes from the data
  • Does not depend on normality of the sampling distributions
    • More stable in small samples with the use of priors
  • Can handle complex models
  • See Snijders & Bosker 12.1

Testing


Fixed effects (γ)

  • Usually, the likelihood-based CI/likelihood-ratio (LRT; χ²) test is sufficient
    • Requires ML (as fixed effects are not part of the likelihood function in REML)
  • Small samples (10–50 clusters): Kenward-Roger approximation of degrees of freedom
  • Non-normality: Residual bootstrap¹

Random effects (τ)

  • LRT (with p values divided by 2)

Testing Fixed Effects


Likelihood Ratio (Deviance) Test

$H_0: \gamma = 0$

Likelihood ratio: $\dfrac{L(\gamma = 0)}{L(\gamma = \hat{\gamma})}$

Deviance: $-2 \times \log\left(\dfrac{L(\gamma = 0)}{L(\gamma = \hat{\gamma})}\right)$
$= -2\,\mathrm{LL}(\gamma = 0) - [-2\,\mathrm{LL}(\gamma = \hat{\gamma})]$
$= \mathrm{Deviance}_{\gamma = 0} - \mathrm{Deviance}_{\gamma = \hat{\gamma}}$

ML (instead of REML) should be used

Example

...
># Linear mixed model fit by maximum likelihood ['lmerMod']
># Formula: mathach ~ (1 | id)
># AIC BIC logLik deviance df.resid
># 47122 47142 -23558 47116 7182
...
...
># Linear mixed model fit by maximum likelihood ['lmerMod']
># Formula: mathach ~ meanses + (1 | id)
># AIC BIC logLik deviance df.resid
># 46967 46995 -23480 46959 7181
...
pchisq(47115.81 - 46959.11, df = 1, lower.tail = FALSE)
># [1] 5.95e-36
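For reference, a minimal sketch of how the two models compared above might be fit (assuming the hsball data):

library(lme4)
# Fit with ML (REML = FALSE) so the deviances are comparable
ran_int <- lmer(mathach ~ (1 | id), data = hsball, REML = FALSE)
m_lv2 <- lmer(mathach ~ meanses + (1 | id), data = hsball, REML = FALSE)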

In lme4, you can also use

anova(m_lv2, ran_int) # Automatically use ML

Problem of LRT in Small Samples

The LRT assumes that the deviance under the null follows a χ² distribution, which is not likely to hold in small samples

  • Inflated Type I error rates

E.g., 16 Schools

  • LRT critical value with α = .05: 3.84
  • Simulation-based critical value: 4.72
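Such a simulation-based critical value can be obtained with a parametric bootstrap of the LRT statistic under the null. A sketch, assuming ML fits named ran_int and m_lv2 as in the earlier example:

# Simulate the LRT statistic under H0 (null model is true)
simulate_lrt <- function() {
  y_null <- simulate(ran_int)[[1]]          # data generated under H0
  dev0 <- deviance(refit(ran_int, y_null))  # null model
  dev1 <- deviance(refit(m_lv2, y_null))    # alternative model
  dev0 - dev1
}
lrt_null <- replicate(1999, simulate_lrt())
quantile(lrt_null, .95)  # simulation-based critical value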

F Test With Small-Sample Correction

It is based on the Wald test (not the LRT):

  • $t = \hat{\gamma} / \widehat{se}(\hat{\gamma})$,
  • or equivalently, $F = t^2$ (for a one-parameter test)

The small-sample correction does two things:

  • Adjust $\widehat{se}(\hat{\gamma})$, as it tends to be underestimated in small samples
  • Determine the critical value based on an F distribution, with an approximate denominator degrees of freedom (ddf)

Kenward-Roger (1997) Correction

Generally performs well with < 50 clusters

# Wald
anova(m_contextual, ddf = "lme4")
># Analysis of Variance Table
># npar Sum Sq Mean Sq F value
># meanses 1 860 860 26.4
># ses 1 1874 1874 57.5
# K-R
anova(m_contextual, ddf = "Kenward-Roger")
># Type III Analysis of Variance Table with Kenward-Roger's method
># Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
># meanses 324 324 1 16 9.96 0.0063 **
># ses 1874 1874 1 669 57.53 1.1e-13 ***
># ---
># Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For meanses, the p value is determined based on an F(1, 15.51) distribution, which has a critical value of

qf(.95, df1 = 1, df2 = 15.51)
># [1] 4.52

Testing Random Effects


LRT for Random Slopes

Should you include random slopes?

Theoretically, yes, unless you're certain that the slopes are the same for every group

However, frequentist methods usually crash with more than two random slopes

  • Test the random slopes one by one, and identify which ones are needed
  • Bayesian methods are better equipped for complex models

"One-tailed" LRT

The LRT (χ²) is generally a two-tailed test. But for random slopes,

$H_0: \tau^2_1 = 0$ is a one-tailed hypothesis

A quick solution is to divide the resulting p value by 2¹

[1]: Originally proposed by Snijders & Bosker; tested in simulation by LaHuis & Ferguson (2009, https://doi.org/10.1177/1094428107308984)

Example: LRT for $\tau^2_1$

...
># Formula: mathach ~ meanses + ses_cmc + (ses_cmc | id)
>#    Data: hsball
># REML criterion at convergence: 46558
...
...
># Formula: mathach ~ meanses + ses_cmc + (1 | id)
>#    Data: hsball
># REML criterion at convergence: 46569
...

G Matrix

$$\begin{bmatrix} \tau^2_0 & \tau_{01} \\ \tau_{01} & \tau^2_1 \end{bmatrix} \quad \text{vs.} \quad \begin{bmatrix} \tau^2_0 & 0 \\ 0 & 0 \end{bmatrix}$$

pchisq(10.92681, df = 2, lower.tail = FALSE)
># [1] 0.00424

df = 2 because removing the random slope drops two parameters ($\tau^2_1$ and $\tau_{01}$); remember to divide the resulting p value by 2
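A minimal sketch of the whole comparison (hypothetical model names; data and formulas as shown above):

library(lme4)
m_slope <- lmer(mathach ~ meanses + ses_cmc + (ses_cmc | id), data = hsball)
m_noslope <- lmer(mathach ~ meanses + ses_cmc + (1 | id), data = hsball)
# Difference in REML criteria = 46569 - 46558 (10.92681 before rounding);
# halve the two-tailed p value for the one-tailed test
pchisq(10.92681, df = 2, lower.tail = FALSE) / 2  # = 0.00424 / 2 ≈ 0.00212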


Bayesian Estimation


Benefits

  • Better small-sample properties

  • Less likely to have convergence issues

  • CIs available for any quantity (R², predictions, etc.)

  • Supports complex models not possible with lme4

    • E.g., 2+ random slopes

  • Increasingly popular in recent research


Downsides

  • Computationally more intensive

    • But dealing with convergence issues with ML/REML may end up taking more time
  • Need to learn some new terminology


Bayes's Theorem

Posterior ∝ Likelihood × Prior

  • So, in addition to the likelihood function (as in ML/REML), we need prior information

  • Prior: Belief about a parameter before looking at the data

  • For this course, we use default priors set by the brms package

    • Note that the default priors may change when software gets updated, so keep track of the package version when you run analyses
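A minimal sketch of fitting the random-intercept model from earlier with brms (assuming the hsball data, with the default priors left in place):

library(brms)
# Bayesian/MCMC estimation via Stan; seed set for reproducibility
m_bayes <- brm(mathach ~ meanses + (1 | id), data = hsball, seed = 575)
summary(m_bayes)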

                      MCMC
b_Intercept           12.647 [12.348, 12.939]
b_meanses              5.854 [5.141, 6.606]
sd_id__Intercept       1.633 [1.386, 1.901]
sigma                  6.258 [6.159, 6.365]
Num.Obs.              7185
R2                     0.173
R2 Adj.                0.158
R2 Marg.               0.123
ELPD                  −23433.7
ELPD s.e.              49.7
LOOIC                 46867.5
LOOIC s.e.             99.4
WAIC                  46867.1
RMSE                   6.21
r2.adjusted.marginal   0.116

Testing for Bayesian Estimation

A coefficient is statistically different from zero when the 95% CI does not contain zero.
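For example, a minimal sketch, assuming the m_bayes fit from the earlier sketch:

fixef(m_bayes)  # posterior means, est. errors, and 95% CIs of the fixed effects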


Multilevel Bootstrap


A simulation-based approach to approximate the sampling distribution of fixed and random effects

  • Useful for obtaining CIs
  • Especially for statistics that are functions of fixed/random effects (e.g., R²)

Parametric, Residual, and Cases bootstrap

In my own work,¹ the residual bootstrap was found to perform best, especially when data are not normally distributed and when the number of clusters is small

See R code for this week
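One possible sketch with the lmeresampler package (assuming a two-level model m fitted with lmer()):

library(lmeresampler)
# Residual bootstrap of the fixed effects with 1,999 replications
boo <- bootstrap(m, .f = fixef, type = "residual", B = 1999)
confint(boo, type = "perc")  # percentile bootstrap CIs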


Bonus: More on Maximum Likelihood Estimation


What is Likelihood?

Let's say we want to estimate the population mean math achievement score (μ)

We need to make some assumptions:

  • Known SD: σ = 8

  • The scores are normally distributed in the population


Learning the Parameter From the Sample

Assume that we have scores from 5 representative students

Student   Score
1         23
2         16
3          5
4         14
5          7

Likelihood

If we assume that μ = 10, how likely will we get 5 students with these scores?

Student   Score   P(Yi = yi | μ = 10)
1         23      0.013
2         16      0.038
3          5      0.041
4         14      0.044
5          7      0.046

Multiplying them all together:
P(Y1 = 23, Y2 = 16, Y3 = 5, Y4 = 14, Y5 = 7 | μ = 10) = product of the probabilities =

># [1] 4.21e-08
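A minimal sketch of this computation in R, assuming σ = 8:

# Joint likelihood of the five scores if mu = 10
prod(dnorm(c(23, 16, 5, 14, 7), mean = 10, sd = 8))
># [1] 4.21e-08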

If μ = 13

Student   Score   P(Yi = yi | μ = 13)
1         23      0.023
2         16      0.046
3          5      0.030
4         14      0.049
5          7      0.038

Multiplying them all together:
P(Y1 = 23, Y2 = 16, Y3 = 5, Y4 = 14, Y5 = 7 | μ = 13) = product of the probabilities =

># [1] 5.98e-08

Compute the likelihood for a range of μ values

[Figure: Likelihood function of μ]

[Figure: Log-likelihood (LL) function of μ]
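A sketch of how such curves can be computed, assuming the same five scores and σ = 8:

y <- c(23, 16, 5, 14, 7)
mu <- seq(5, 21, by = 0.1)
# Log-likelihood: sum of log densities at each candidate mu
ll <- sapply(mu, function(m) sum(dnorm(y, mean = m, sd = 8, log = TRUE)))
plot(mu, ll, type = "l", xlab = "mu", ylab = "Log-likelihood")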

Maximum Likelihood

$\hat{\mu} = 13$ maximizes the (log-)likelihood function

Maximum likelihood estimator (MLE)

[Figure: Estimating σ]

Curvature and Standard Errors

[Figure: log-likelihood functions for N = 5 and N = 20]

The curvature of the log-likelihood function around its maximum determines the standard error: with a larger sample (N = 20 vs. N = 5), the function is more sharply peaked, so the SE is smaller.
