nlme
for TSCS DataWe’re going to be analyzing data on voting rates (% of adult population voting) in US presidential elections from 1978 to 2014. Load the data files and combine them in preparation for the analysis.
library(tidyverse)
library(reshape2)
## load population data
pop < read.csv('pop.csv') %>%
rename(year = V1, state = V2, population = pop) %>%
mutate(population = scale(population))
## load voting rate data
vote < read.csv('vote.csv') %>% melt() %>%
mutate(variable = gsub('X', '', variable), variable = as.numeric(variable)) %>%
rename(year = variable, vote = value)
## load region data
regions < read.csv('regions.csv') %>%
select(state = State, State.Code, region = Region)
## merge data
dat < vote %>% left_join(regions) %>% select(state) %>% rename(state = State.Code) %>% left_join(pop)
nlme
uses a slightly different syntax for mixed effects models than lme4
does. We specify the fixed part of the model using a regular formula like we would for lm()
, but the random component has to be specified as an argument to the random
option e.g. random = ~ 1 group
would produce a random intercept by group
just like (1  group)
would in lme4
.
library(nlme)
## fit model
mod1 < lme(vote ~ population + region, data = dat, random = ~ 1  year)
## Error in na.fail.default(structure(list(vote = c(40.9, 49, 35.7, 36.5, : missing values in object
Uh oh, looks like we have some missing values in our data and lme()
doesn’t perform listwise deletion like lmer
does. Unfortunately, the error doesn’t tell us where the problem is in our data. Use apply()
and anyNA()
to figure out where we should start looking.
apply(dat, 2, anyNA)
## year vote state region population
## FALSE FALSE TRUE TRUE TRUE
It looks there are missing values for state
, region
, and population
. This means that there aren’t any issues with our voting data, but something’s going wrong with our region and population data. Since we join the region data before joining the population data, let’s start there.
dat < vote %>% left_join(regions)
## list missing states
dat$state[is.na(dat$State.Code)]
## [1] "Rhode Island " "Utah " "Rhode Island " "Utah "
## [5] "Rhode Island " "Utah " "Rhode Island " "Utah "
## [9] "Rhode Island " "Utah " "Rhode Island " "Utah "
## [13] "Rhode Island " "Utah " "Rhode Island " "Utah "
## [17] "Rhode Island " "Utah " "Rhode Island " "Utah "
Our voting data have an extra trailing space for Rhode Island and Utah. left_join()
uses exact string matching, so this extra space is causing a problem. Look up the sub()
function by typing ?sub
. How can we use this to fix our data ensure that all our data get properly merged.
## drop trailing spaces in RI and UT
vote < vote %>% mutate(state = sub('\\ $' , '', state))
## merge data
dat < vote %>% left_join(regions) %>% select(state) %>% rename(state = State.Code) %>% left_join(pop)
## check for NAs again
apply(dat, 2, anyNA)
## year vote state region population
## FALSE FALSE FALSE FALSE FALSE
Now we’re good and can go ahead and estimate our model.
library(texreg)
mod1 < lme(vote ~ population + region, data = dat, random = ~ 1  year)
htmlreg(mod1, stars = .05)
Model 1  

(Intercept)  50.85^{*}  
(0.90)  
population  2.36^{*}  
(0.27)  
regionNortheast  3.26^{*}  
(0.86)  
regionSouth  8.67^{*}  
(0.73)  
regionWest  3.00^{*}  
(0.78)  
AIC  3327.07  
BIC  3356.65  
Log Likelihood  1656.54  
Num. obs.  510  
Num. groups  10  
^{*}p < 0.05 
While nlme
isn’t as advanced as lme4
in a lot of ways, it has certain functionalities that offer much more control over your model. What if we’re worried that we not only need a random intercept by year, but that the errors might vary by year? We can account for this possiblity with the varIdent()
function and the weights
argument to lme()
.
mod2 < lme(vote ~ population + region, data = dat, random = ~ 1  year,
weights = varIdent(~ 1  year))
htmlreg(mod2, stars = .05)
Model 1  

(Intercept)  50.85^{*}  
(0.90)  
population  2.36^{*}  
(0.27)  
regionNortheast  3.26^{*}  
(0.86)  
regionSouth  8.67^{*}  
(0.73)  
regionWest  3.00^{*}  
(0.78)  
AIC  3327.07  
BIC  3356.65  
Log Likelihood  1656.54  
Num. obs.  510  
Num. groups  10  
^{*}p < 0.05 
The previous model assumes that errors are correlated within years, but not between them. Let’s take things a step further and fit an explicit time series model to our data. Take a look at what the correlation
argument to lme()
does and use the corAR1()
function to fit a model with a first order autoregressive structure with states as the grouping variable.
mod3 < lme(vote ~ population + region, data = dat, random = ~ 1  state,
correlation = corAR1(form = ~ year  state))
htmlreg(mod3, stars = .05)
Model 1  

(Intercept)  50.86^{*}  
(1.31)  
population  3.17^{*}  
(0.55)  
regionNortheast  3.18  
(2.01)  
regionSouth  8.61^{*}  
(1.72)  
regionWest  3.09  
(1.82)  
AIC  3219.59  
BIC  3253.39  
Log Likelihood  1601.80  
Num. obs.  510  
Num. groups  51  
^{*}p < 0.05 
The default corStruct
for an lme
model is compound symmetry, which enforces all off diagonal entries of the variancecovariance matrix to be 0. This is a restrictive assumption, but it matches the assumption used by lmer()
. Fit the a model with random intecepts by year using both lmer()
and lme()
and compare the results.
library(lme4)
mod_nlme < lme(vote ~ population + region, data = dat, random = ~ 1  state,
correlation = corCompSymm())
mod_lme4 < lmer(vote ~ population + region + (1  state), data = dat)
htmlreg(list(mod_nlme, mod_lme4), stars = .05,
custom.model.names = c('<code>nlme</code>', '<code>lme4</code>'))
nlme

lme4



(Intercept)  50.86^{*}  50.86^{*}  
(1.31)  (1.31)  
population  3.17^{*}  3.17^{*}  
(0.55)  (0.55)  
regionNortheast  3.18  3.18  
(2.01)  (2.01)  
regionSouth  8.61^{*}  8.61^{*}  
(1.72)  (1.72)  
regionWest  3.09  3.09  
(1.82)  (1.82)  
AIC  3219.59  3217.59  
BIC  3253.39  3247.23  
Log Likelihood  1601.80  1601.80  
Num. obs.  510  510  
Num. groups  51  
Num. groups: state  51  
Var: state (Intercept)  18.10  
Var: Residual  26.22  
^{*}p < 0.05 
They’re identical! Well, at least the coefficient estimates, standard errors, variances of the random effects, and loglikelihood are identical. AIC and BIC are different because lmer()
and lme()
calculate the number of parameters in a model slightly differently, which we can verify by running AIC(mod_nlme, mod_lme4)
and comparing the degrees of freedom.
Since there are so many different ways we can specify a mixed effects model (different random intercepts and slopes, different correlation structures, different levels of groups, etc.), it’s important to think about how we decide between different specifications. The simplest way is to perform a likelihood ratio test with the anova()
function. Fit a regular linear model with the same fixed effects as one of your multilevel models and compare the two; remember to list the unrestricted model first!
mod_lm < lm(vote ~ population + region, data = dat)
anova(mod_lme4, mod_lm)
## Data: dat
## Models:
## mod_lm: vote ~ population + region
## mod_lme4: vote ~ population + region + (1  year)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mod_lm 6 3365 3390 1676 3353
## mod_lme4 7 3331 3360 1658 3317 35.9 1 2e09 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our unrestricted model is more likely to have generated the data than our restricted one, so we should definitely be fitting some type of multilevel model to these data. However,
## parametric bootstrap of likelihood ratio test
lme.boot < function(model_r, model_ur){
new_y << simulate(model_r, 1)[[1]] # simulate data according to null DGP
restricted_dev < 2 * logLik(update(model_r, new_y ~ .))
unrestricted_dev < 2 * logLik(update(model_ur, new_y ~ .))
return(restricted_dev  unrestricted_dev) # LRT under null
}
## 100 bootstraps
boot_dev_nre < replicate(100, expr = lme.boot(mod_lm, mod_lme4))
## the boostrapped approximation: (this is preferred)
test_stat < 2 * logLik(mod_lm)  2 * logLik(mod_lme4)
mean(boot_dev_nre > test_stat) #approximate pvalue
## [1] 0
Not a single bootstrapped likelihood ratio test is greater than our test statistic, so once again our unrestricted model better fits the data than the restricted one. Let’s try comparing the first order autoregressive correlation model with the heteroskedastic model.
## 100 bootstraps
boot_dev_nre < replicate(100, expr = lme.boot(mod2, mod3))
## Error in simulate.lme(model_r, 1): models with "corStruct" and/or "varFunc" objects not allowed
Unfortunately simulate.lme()
doesn’t allow us to simulate outcomes from models with corStruct()
correlation structures, so we have to try another option.
anova(mod3, mod2)
## Model df AIC BIC logLik Test L.Ratio pvalue
## mod3 1 8 3325 3359 1654
## mod2 2 7 3327 3357 1657 1 vs 2 4.2 0.041
Our unrestricted AR1 model is statistically more likely than the heteroskedastic model.