Let’s apply these techniques to a subject that I know nothing about: differences in voting by economic status across the world. Read in the datafile votes.RData, which is a subset of the data contained in Kasara and Suryanarayan (2015).1 These data2 provide an excellent illustration of some of the benefits of tree methods. Due to their small sample size of 211, the authors have to run 6 separate regressions to test each of their proposed explanations in turn. With a tree-based approach, we can include all 6 explanatory variables, and controls, in a single analysis and explore how much explanatory power each contributes.

library(reshape2) # reshaping data for plots
library(ggplot2) # plots

# source custom ggplot theme
source('https://raw.githubusercontent.com/jayrobwilliams/RWmisc/master/theme_rw.R')

# load data
load('votes.RData')

# split into test and training data
train <- sample(1:nrow(votes), floor((2 / 3) * nrow(votes)))
votes_train <- votes[train, ]
votes_test <- votes[-train, ]
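
Because this split is random, the exact MSE values reported below will vary from run to run. A minimal sketch if you want your own results to be reproducible (the seed value here is arbitrary) is to set a seed before drawing the split:

# any fixed seed set before sampling makes the train/test partition reproducible
set.seed(1234)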

The first variable in the data is the estimated \(\beta_j\) effect of income on turnout. The next six are the theoretically driven explanatory variables: voting polarization, electoral distance between the top and bottom income quintiles, bureaucratic quality, government effectiveness, direct tax revenue, and log GDP per capita. The remaining variables are the controls: proportional representation, concurrent elections, compulsory voting, polity score, infant mortality, gross gini coefficient, homicide rate, and ethnic fractionalization.

All of these methods select which predictors to include and which to exclude in a model. Some of them even allow for \(n\)-way interactions and \(n^{th}\)-order polynomials. This can result in very complex models where it is difficult to recover the marginal effect of a specific variable. If we were using synthetic data and controlled the data generating process, we could compare the choices made by these models to the true DGP. When working with real data, we don’t have this option, so we rely on predictive accuracy as a measure of model fit. Many packages use mean squared error (MSE) as their metric of fit:

\[ \begin{align} \frac{1}{n}\sum_{i=1}^{n} \left(\hat{Y_i} - Y_i \right)^2 \end{align} \]

Some packages don’t directly provide an MSE, but they all allow us to calculate predicted \(\hat{y}\)s from our test data, and we can then compute the MSE ourselves. Let’s go ahead and write a function to calculate MSE, because we’re going to be doing this a lot.

# calculate mean squared error
mse <- function(y, yhat) {
  
  mean((yhat - y)^2)
  
}
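
A quick sanity check with some made-up numbers (an illustration, not part of the lab data) confirms the function behaves as expected:

# errors of 0.5, 0, and -0.5 give an MSE of (0.25 + 0 + 0.25) / 3 ≈ 0.167
mse(y = c(1, 2, 3), yhat = c(1.5, 2, 2.5))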

Single Tree Approaches

The tree package is a little outdated and doesn’t have a lot of options compared to newer packages like rpart, but it does offer some really easy plotting functions to help us understand how regression trees work.

library(tree)
fit_tree <- tree(formula = beta_exp ~., data = votes_train)
summary(fit_tree)
## 
## Regression tree:
## tree(formula = beta_exp ~ ., data = votes_train)
## Variables actually used in tree construction:
## [1] "gdp_pc_ln_z"       "infantmortality_z" "dtax_per_rev_z"   
## [4] "homicide_z"        "VP_z"              "giniall_gross_z"  
## [7] "prs2_z"            "ge_est_z"          "elec_dist_1_5_z"  
## Number of terminal nodes:  18 
## Residual mean deviance:  0.0043 = 0.52 / 122 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.132  -0.041  -0.004   0.000   0.042   0.167

This is a relatively complex tree that uses only nine of the 14 variables to make its splitting decisions. Let’s visually inspect the tree to see which variables are doing most of the heavy lifting in sorting outcomes. Use the plot() and text() commands on our model object to get a visual version of this decision tree. The text() command is finicky, so make sure you execute it in the same command as plot().

# need to execute these as part of the same command
plot(fit_tree); text(fit_tree, cex = .5)

This plot tells us a couple of things, the most important of which is that this model will only ever predict a small number of different \(\beta_j\) values, even though we have 211 unique values in the data. The plot shows multiple leaves with a value of 0.10. This is due to rounding; the actual values are different. Access the frame within our tree fit object. The yval column for any row that is a <leaf> is a predicted value that our model can produce. See for yourself that we actually have eight different predicted values, rather than the seven that the plot shows.

fit_tree$frame
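
One way to check this yourself, using the frame columns described above (a minimal sketch; the rounding here is illustrative):

# pull the predicted value at each leaf and count how many are distinct
leaf_preds <- fit_tree$frame$yval[fit_tree$frame$var == '<leaf>']
length(unique(leaf_preds))            # distinct values the model can actually produce
length(unique(round(leaf_preds, 2)))  # distinct values once rounded for display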

Before we get into more advanced methods, let’s spend a little more time with tree() and use it to get a better idea of how this partitioning works. Fit a new regression tree that only uses GDP per capita and direct tax revenue (the two predictors after the initial split in our tree). Plot these two variables against each other, with the color of the points reflecting the estimated effect of income on turnout (the grey() and findInterval() functions will be helpful here, if you don’t want to have to use colorRampPalette()). Then use the partition.tree() command on our tree fit object, and be sure to set add = T to overlay the partitions. partition.tree() is also finicky, so make sure it’s executed in the same command as plot().

fit_tree_simple <- tree(formula = beta_exp ~ gdp_pc_ln_z + dtax_per_rev_z,
                        data = votes_train)
plot(x = votes_train$gdp_pc_ln_z, y = votes_train$dtax_per_rev_z,
     col = grey(seq(0, 1, length.out = 7))[findInterval(votes_train$beta_exp,
                                                        seq(min(votes_train$beta_exp),
                                                            max(votes_train$beta_exp),
                                                            length.out = 7))], pch = 20,
     xlab = 'GDP per capita', ylab = 'direct tax revenue')
partition.tree(fit_tree_simple, add = T, cex = .5)

We can see where each of the countries in the training dataset fall along the two variables used in our decision tree, and each square represents a leaf at the bottom of the tree. The colors range from lightest to darkest, covering the range of \(\beta_j\) in the data.

The last thing we want to do with tree() is see how changing the parameters of the tree algorithm can alter the resulting model. The decision of whether to grow a new node or not is controlled by a number of parameters, but let’s focus on the mindev argument. This determines how much a potential node must reduce the error for the algorithm to grow a new node. The default value is .01. Try setting it to something larger and plotting the resulting tree.

fit_tree <- tree(beta_exp ~., data = votes_train, mindev = .05)
plot(fit_tree); text(fit_tree, cex = .75)

Our tree is now much simpler and makes splits based only on GDP per capita, infant mortality, and direct tax revenue, the first three variables used in the original tree. Access the frame dataframe within our fit object, and write a line of code that returns the prediction at each leaf (terminal node).

fit_tree$frame[which(fit_tree$frame$var == '<leaf>'), 'yval']
## [1] 0.164 0.063 0.038 0.231

With our minimum deviance value of .05 we’re getting fewer predictions than before, which means we’re covering less of the range of potential outcomes than we were earlier. Calculate some predicted values from this model so we can compare it with other methods later, and then let’s move on to the rpart package, which is considerably faster with large datasets and offers us more control over the algorithm.

pred_tree <- predict(fit_tree, votes_test)

Load up rpart and take a look at the main function, which is called – surprisingly enough – rpart().

library(rpart)
fit_rpart <- rpart(formula = beta_exp ~., data = votes_train, method = 'anova')

Just like with tree(), we can plot the resulting tree to see where our model is making decisions.

plot(fit_rpart); text(fit_rpart, cex = .65)

We can prune an rpart fit object by supplying a complexity parameter. Pruning removes branches farther down the tree that do not improve the fit by at least the complexity parameter; for anova trees, this means each retained split must increase the overall \(R^2\) by at least cp. We can also set the complexity parameter in rpart() itself when fitting a tree. Try pruning your tree using a complexity parameter larger than the default of .01. The pruned tree will be simpler because the higher complexity parameter penalizes splits that do not sufficiently increase the fit of the model.

par(mfrow = c(1, 2))
plot(fit_rpart); text(fit_rpart, cex = .5)
plot(prune(fit_rpart, cp = .033)); text(prune(fit_rpart, cp = .033), cex = .5)

par(mfrow = c(1, 1))

Another advantage of rpart over tree is that we can change the spacing on a tree plot to make the node labels more readable. By default, the vertical spacing between nodes represents the amount of error reduction that node produces. While more accurate, these plots can be harder to read as the number of nodes increases. Try plotting the rpart object with uniform = T to see the difference.

plot(prune(fit_rpart, cp = .005), uniform = T); text(prune(fit_rpart, cp = .005),
                                                     cex = .75)

Keep in mind that overfitting is just as much of a concern here as it is in the linear model context. We could fit a tree with a complexity parameter of .00001, but it would have a huge number of branches, and would probably fare poorly in out of sample prediction. Try this out for yourself, and compare the MSE for complexity parameters ranging from .1 to .00001.

fit_rpart_of <- rpart(formula = beta_exp ~., data = votes_train, method = 'anova',
                   control = rpart.control(cp = .00001))

# combine MSE for all 5 complexity parameters into a dataframe
mse_cp <- t(data.frame(mse(predict(prune(fit_rpart_of, cp = .1),
                                   newdata = votes_test), votes_test$beta_exp),
                       mse(predict(prune(fit_rpart_of, cp = .01),
                                   newdata = votes_test), votes_test$beta_exp),
                       mse(predict(prune(fit_rpart_of, cp = .001),
                                   newdata = votes_test), votes_test$beta_exp),
                       mse(predict(prune(fit_rpart_of, cp = .0001),
                                   newdata = votes_test), votes_test$beta_exp),
                       mse(predict(fit_rpart_of, newdata = votes_test),
                           votes_test$beta_exp)))

# rename rows and columns in dataframe
colnames(mse_cp) <- 'MSE'; rownames(mse_cp) <- c('.1', '.01', '.001', '.0001', '.00001')
mse_cp
##           MSE
## .1     0.0084
## .01    0.0114
## .001   0.0112
## .0001  0.0112
## .00001 0.0112

We can see that, for this particular train/test split, the most heavily pruned tree (cp = .1) actually achieves the lowest MSE, and everything at cp = .01 and below performs about the same. Let’s fit a tree with a complexity parameter of .001 and save its predictions for comparison with the other methods later.

fit_rpart <- rpart(formula = beta_exp ~., data = votes_train, method = 'anova',
                   control = rpart.control(cp = .001))
pred_rpart <- predict(fit_rpart, votes_test, type = 'vector')

Random Forests

Random forests are an ensemble technique designed to overcome some of the limitations of single trees. As we saw above, trees that have grown very deep are prone to overfitting the data. An additional shortcoming of trees is that they are deterministic: given the same parameters, the algorithm will grow an identical tree each time.

One way to overcome this problem is to use bootstrap aggregating, or bagging. Bagging involves drawing samples from the training data with replacement, applying the same tree algorithm to each one, and then making predictions on test data by averaging the predictions from each individual tree. The prediction \(f(\cdot)\) for a random forest is thus given by:

\[ \begin{align} f(X_i) = \frac{1}{M}\sum_{m=1}^M T_m(X_i,\Theta_m) \end{align} \]

where \(M\) is the number of trees, \(T_m\) is an individual tree, and \(\Theta_m\) are the parameters that define that tree.

Where bootstrap aggregating introduces randomness only in the sampling of subsets, a random forest adds further randomness by selecting a random subset of the variables as candidates for the splitting decision at each node. Just like bagging, a random forest makes a prediction for each tree and then averages over them.
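
To make the bagging logic concrete, here is a minimal hand-rolled sketch (not what randomForest() does internally; it omits the per-split random variable selection, and the number of trees is arbitrary) that grows rpart trees on bootstrap samples of the training data and averages their test-set predictions:

# hand-rolled bagging: grow trees on bootstrap samples and average their predictions
M <- 100  # number of bootstrap trees, chosen arbitrarily for illustration
bag_preds <- sapply(1:M, function(m) {
  boot <- sample(1:nrow(votes_train), nrow(votes_train), replace = TRUE)
  predict(rpart(beta_exp ~ ., data = votes_train[boot, ], method = 'anova'),
          newdata = votes_test)
})
mse(rowMeans(bag_preds), votes_test$beta_exp)  # ensemble prediction = average over trees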

library(randomForest)
fit_rf <- randomForest(x = votes_train[, 2:ncol(votes_train)], y = votes_train[, 1])

We can use the predict() function on random forest objects to obtain predictions from the ensemble, and then compare them to true values to assess the forest’s predictive accuracy.

pred_rf <- predict(fit_rf, votes_test)
mse(pred_rf, votes_test$beta_exp)
## [1] 0.0093

Just like with single tree methods, we can alter the parameters of the algorithm used to grow the ensemble. Try changing some of the arguments to randomForest and seeing if you can improve the accuracy of the resulting ensemble. Some of the arguments may have counterintuitive effects; try increasing the minimum size of final nodes (thus decreasing the number of nodes in the tree) and see what happens to your MSE.

fit_rf <- randomForest(x = votes_train[, 2:ncol(votes_train)], y = votes_train[, 1],
                       ntree = 1000, mtry = floor(ncol(votes_train)/5),
                       sampsize = ceiling(.98*nrow(votes_train)),
                       nodesize = 10)
pred_rf <- predict(fit_rf, votes_test)
mse(pred_rf, votes_test$beta_exp)
## [1] 0.0089

Once we’ve got an ensemble whose predictive power we’re happy with, we can use the varImpPlot() command to get a sense of which variables are contributing the most explanatory power.

varImpPlot(fit_rf)
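
If you want the underlying numbers rather than a plot, the importance() function returns the same measures that varImpPlot() displays:

# numeric variable importance measures from the random forest
importance(fit_rf)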

Gradient Boosting Machines

Gradient boosting machines are similar to random forests or bagging approaches, but instead of growing a large number of trees from independent random samples of the data, we grow them sequentially, with each new tree fit to the errors (the negative gradient of the loss) of the ensemble grown so far. For each new tree we optimize

\[ \hat{\Theta}_m = \arg\min_{\Theta_m}\sum_{i=1}^N L(y_i, f_{m-1}(X_i) + T_m (X_i; \Theta_m)) \]

which can be approximated by

\[ \hat{\Theta}_m = \arg\min_{\Theta_m}\sum_{i=1}^N (-g_{im} - T_m (X_i; \Theta_m))^2 \]

where \(g_{im}\) is the gradient of the loss function \(L\) with respect to \(f_{m-1}(X_i)\). For the Gaussian (squared error) loss we use below, the negative gradient is just the residual \(y_i - f_{m-1}(X_i)\), so each new tree is fit to what the current ensemble gets wrong. Since we have a very small dataset, we want to be wary of overfitting in our ensemble. We can guard against this by setting the bag.fraction argument below its default of .5, which means each tree is trained on a smaller subset of the data, and by setting the n.minobsinnode argument below its default of 10, since 10 observations is a substantial portion of our small training set.

library(gbm)
fit_gbm <- gbm(formula = beta_exp ~ ., data = votes_train, distribution = 'gaussian',
               n.trees = 1e3, cv.folds = 3, interaction.depth = 5, n.minobsinnode = 7,
               shrinkage = .001, bag.fraction = .35, keep.data = T,
               n.cores = parallel::detectCores())

We can easily continue growing trees from an existing gbm object with the gbm.more() function. If you didn’t set keep.data = T in the initial call to gbm(), you need to supply the data as an additional argument to gbm.more(). Note that gbm.more() does not perform cross-validation to estimate generalization error.

fit_gbm <- gbm.more(object = fit_gbm, n.new.trees = 1e3)

We can easily make predictions from gbm objects with the predict() function, which requires us to supply the gbm object, our test data, and the tree(s) we wish to use to make predictions. Because gradient boosting machines are an iterative process, the gbm object stores all trees grown. We can use this iterative nature to visualize how the predictions for each observation evolve as the number of trees increases.

# get predictions for all 2000 trees
pred_gbm_init <- predict(object = fit_gbm, newdata = votes_test, n.trees = 1:2000)

# get predictions for first 10 test observations
pred_gbm_plot <- pred_gbm_init[1:10, ]

# reshape from wide to long and plot
pred_gbm_plot <- melt(pred_gbm_plot, id.vars = 'Tree')
ggplot(pred_gbm_plot, aes(x = Var2, y = value, color = as.factor(Var1))) +
  geom_line() +
  labs(x = 'Tree', y = 'f(x)', color = expression(x[i])) +
  theme_rw()

This means we can also see how the mean squared error of the gbm object evolves as the number of trees increases.

gbm_mse_seq <- apply(pred_gbm_init, 2, function(x) mse(votes_test$beta_exp, x))
ggplot(data = data.frame(Tree = 1:2000, MSE = gbm_mse_seq), aes(x = Tree, y = MSE)) +
  geom_line() +
  theme_rw()

Looks like we have our lowest MSE somewhere in the neighborhood of tree 1500. Let’s find the tree with the actual lowest MSE and save that for our model comparison later.

pred_gbm <- predict(object = fit_gbm, newdata = votes_test,
                    n.trees = which.min(gbm_mse_seq))

Bayesian Additive Regression Trees

Bayesian additive regression trees (BART) are similar to random forests, except they are estimated in a Bayesian statistical framework, so they are able to produce estimates of uncertainty around predictions. The model is very similar to tree bagging, except with the addition of an error term:

\[ \begin{align} f(X_i) = \frac{1}{M}\sum_{m=1}^M T_m(X_i,\Theta_m) + \epsilon_i; ~~~~ \epsilon_i \sim \mathcal{N}(0,\sigma^2) \end{align} \]

The model places independent priors over the error variance \(\sigma\) and all the parameters contained in \(\Theta_m\): the depth of each tree, the variables randomly chosen as candidates for splitting decisions at each node, the cutoff value of the variable for each split, and the predicted value at each terminal node.

Since this is a Bayesian model, we estimate it using Markov chain Monte Carlo sampling. At each iteration, the sampler makes a draw from the joint posterior \((f, \sigma)|(x, y)\). This means that at each iteration we sample an entire forest and error variance, so this process can be time-consuming. The upside is that we end up with a posterior distribution of forests and error variances. Try fitting a BART forest to our training data, but only grow a forest of 50 trees with 500 iterations. Note the specification of the function; the number of burn-in iterations is added to, not subtracted from, the number of iterations to draw.

library(BayesTree)
fit_bart <- bart(x.train = votes_train[, 2:ncol(votes_train)],
                 y.train = votes_train[, 1],
                 ntree = 50, ndpost = 500)

We can also plot bart objects to get an idea of the posterior distribution we’ve obtained. The left panel plots \(\sigma\) over the iterations of the sampler; once it has stopped decreasing, we have evidence that the sampler has reached its stationary distribution and that subsequent draws can be treated as samples from the posterior. The right panel plots \(y_i\) against the \(f(X_i)\) draws from the sampler, along with credible intervals.

plot(fit_bart)

If you look at the help page for bart() you’ll notice that we can’t directly specify many of the algorithm arguments that we can for the other tree functions. Instead, we control these parameters through the choice of priors that we impose on the model. To take advantage of the additive nature of the model, we want priors that give all variables a relatively equal chance of being chosen at any given split; if important variables can easily overwhelm the others, they will dominate almost every tree, losing some of the power of ensemble methods. The weakly informative priors we want here are known as regularization priors in this context. The priors on the tree parameters \(\Theta_m\) are fixed in the function, but we can change the parameters of the prior for the error variance \(\sigma\), which we specify through the sigdf and sigquant arguments.

Before we start tweaking these prior parameters, we need to make sure we can make predictions to assess predictive fit. Unlike the other tree functions, bart objects do not have a predict() method. Instead, we need to supply the test data when we run the function. We can then access the predicted values from the bart fit object. Try tweaking some of the prior parameter values and the number of trees to grow to see how this affects the MSE of our Bayesian forest.

fit_bart <- bart(x.train = votes_train[, 2:ncol(votes_train)],
                 y.train = votes_train[, 1],
                 x.test = votes_test[, 2:ncol(votes_test)], sigdf = 5,
                 sigquant = .7, k = 2.5, ntree = 500, nskip = 100)
mse(fit_bart$yhat.test.mean, votes_test$beta_exp)

Model Comparison

One way we can assess the accuracy of a given tree-based model is to compare its predictive fit with that of other alternatives. An advantage of this approach is that we can compare ensembles generated from the same method with different parameters, as well as ensembles resulting from different methods. Calculate the MSE for your fits from each of the methods we’ve explored and compare them.

mse_tree <- mse(pred_tree, votes_test$beta_exp)
mse_rpart <- mse(pred_rpart, votes_test$beta_exp)
mse_rf <- fit_rf$mse[length(fit_rf$mse)] # out-of-bag MSE stored in the randomForest object
mse_gbm <- mse(votes_test$beta_exp, pred_gbm)
mse_bart <- mse(fit_bart$yhat.test.mean, votes_test$beta_exp)

mse_comp <- t(data.frame(tree = mse_tree, rpart = mse_rpart,
                         randomforest = mse_rf, gbm = mse_gbm,
                         bart = mse_bart))
colnames(mse_comp) <- 'MSE'
options(digits = 4)
knitr::kable(mse_comp)
                 MSE
tree          0.0095
rpart         0.0112
randomforest  0.0078
gbm           0.0080
bart          0.0089
options(digits = 2)

Model Interpretation and Quantities of Interest

Machine learning methods are frequently criticized for their black box nature, which can inhibit our ability to uncover the relationship between predictors and outcomes. Much of the machine learning literature focuses on feature selection, or trying to identify which predictors contribute the most explanatory power to the model. However, these importance measures can be complicated, and they often don’t tell us much about the direction of a predictor’s effect on the outcome.

One simple approach is to treat \(f(x)\) the same way we do a generalized linear model when we want to generate quantities of interest. Pick one of your ensemble fits (randomForest or gbm will be easier than BART since they have predict() methods and won’t require you to re-estimate the fit). Generate simulated data profiles that vary direct tax revenue from its minimum to maximum observed values while holding all other predictors at their mean, minimum, and maximum. Generate predicted beta_exp values of the effect of income on turnout for these profiles, and then plot all three together. This is called a partial dependence plot in the machine learning literature.

# get sequence of direct tax revenue b/w observed min/max
tax_seq <- seq(min(votes$dtax_per_rev_z), max(votes$dtax_per_rev_z),
                  length.out = 500)

# predicted values with other predictors at their mean
votes_sim <- votes[, -1]
votes_sim <- apply(votes_sim, 2, mean)
votes_sim <- matrix(rep(votes_sim, 500), nrow = 500, byrow = T)
votes_sim[, 6] <- tax_seq
pred_sim_mean <- predict(fit_rf, votes_sim)
me_df <- data.frame(tax_seq = tax_seq, beta_mean = pred_sim_mean)

# predicted values with other predictors at their maximum
votes_sim <- votes[, -1]
votes_sim <- apply(votes_sim, 2, max)
votes_sim <- matrix(rep(votes_sim, 500), nrow = 500, byrow = T)
votes_sim[, 6] <- tax_seq
pred_sim_max <- predict(fit_rf, votes_sim)
me_df$beta_max <- pred_sim_max

# predicted values with other predictors at their minimum
votes_sim <- votes[, -1]
votes_sim <- apply(votes_sim, 2, min)
votes_sim <- matrix(rep(votes_sim, 500), nrow = 500, byrow = T)
votes_sim[, 6] <- tax_seq
pred_sim_min <- predict(fit_rf, votes_sim)
me_df$beta_min <- pred_sim_min

# convert to long to plot each prediction separately
me_df <- melt(me_df, id.vars = 'tax_seq')

# plot
ggplot(data = me_df, aes(x = tax_seq, y = value, color = variable)) +
  geom_line() +
  theme_bw() +
  ggtitle('Random Forest Partial Dependence Plot') +
  labs(x = 'Direct Tax Revenue', y = 'Predicted Income Effect on Turnout') +
  scale_color_discrete(name = 'Other predictors at:',
                       labels = c('Mean', 'Max', 'Min')) +
  theme_rw()

As with most things, someone’s written a package to make this easier. We can use the pdp package to easily create partial dependence plots from many different types of machine learning models.

library(pdp) # partial dependence plots
library(doParallel) # parallel processing

# register parallel backend
registerDoParallel(makeCluster(parallel::detectCores()))

# create partial dependence object for gbm
gbm_part <- partial(fit_gbm, pred.var = c('dtax_per_rev_z', 'infantmortality_z'),
                   train = votes_train, rug = T,
                   n.trees = which.min(gbm_mse_seq))

# plot of partial dependence
plotPartial(gbm_part, train = votes_train, rug = T)

# create partial dependence object for rf; need to export package used to create model in parallel
rf_part <- partial(fit_rf, pred.var = c('dtax_per_rev_z', 'infantmortality_z'),
                   train = votes_train, rug = T, parallel = T,
                   paropts = list(.packages = "randomForest"))

# lattice plot of partial dependence
plotPartial(rf_part, train = votes_train, rug = T)

# fancy ggplot of partial dependence
autoplot(rf_part, train = votes_train, contour = T, rug = T)

# 3D plot of partial dependence
plotPartial(rf_part, levelplot = F, drape = T, screen = list(z = 220, x = -60))

The plots for our randomForest and gbm models are pretty similar, and overall there’s not a lot of meaningful difference between them. It looks like direct tax revenue might have a slightly negative relationship with the effect of income on probability of voting, which is the opposite of what the original paper finds. Let’s investigate this more closely. Create a new partial dependence object, but this time set the ice argument to true, which will produce individual conditional expectation (ICE) curves showing how our model’s prediction changes with direct tax revenue for each country in our training data.

# partial dependence object for just one predictor; generate ICE curves
rf_part <- partial(fit_rf, pred.var = 'dtax_per_rev_z',
                   train = votes_train, rug = T,
                   ice = T,
                   parallel = T,  paropts = list(.packages = "randomForest"))

# plot ICE curves and average partial dependence
plotPartial(rf_part, train = votes_train, rug = T)

It looks like there’s a somewhat negative relationship here, but only for cases in the lower portion of direct tax revenue in our sample. This is the opposite of the relationship that the authors find in the original paper. It could be that testing all six explanatory variables simultaneously, rather than one at a time, would have revealed the same result we see here. However, remember when I told you that \(\beta_j\) had non-random measurement error, and that you shouldn’t average imputed datasets? We can’t draw any valid conclusions from this analysis because we’ve neglected two serious sources of uncertainty and variance in these analyses.

Individual Exercise

In addition to regression trees, we can also fit classification trees when we have binary or categorical outcomes. Use fl2003.RData, which is a subset of the data in Fearon and Laitin (2003), and the package of your choice, to fit an ensemble model that explains onset as a function of all other variables. Determine the most important variables in the ensemble, and then produce a partial dependence plot showing the relationship between two variables that are not the most important, and the predicted probability of civil war in a given observation. Discuss this relationship.

References

Fearon, James D., and David D. Laitin. 2003. “Ethnicity, Insurgency, and Civil War.” The American Political Science Review 97 (1): 75–90.

Kasara, Kimuli, and Pavithra Suryanarayan. 2015. “When Do the Rich Vote Less Than the Poor and Why? Explaining Turnout Inequality Across the World.” American Journal of Political Science 59 (3): 613–27.


  1. The dependent variable in these data, beta_exp, is the coefficient estimate from a logistic regression of the effect of income on the probability of voting, estimated separately for each country using individual level survey data. As such, there is inherent uncertainty around each \(\beta_j\). The original paper accounts for this by treating each \(\beta_j\) as a draw from a \(\mathcal{N}(\beta_j,\sigma_j)\) distribution. This lab does not do this.

  2. I’ve multiply imputed these data due to missingness on some predictors, and then averaged the imputed datasets. Don’t do this; if you don’t remember why, come talk to me.