I have too many control variables … which ones should I include in my regression model? – Health economist


Let’s say you have some data on healthcare spending for different individuals and you want to know which patient characteristics increase healthcare spending. While this seems like something any health economist could do, measuring the relationship requires knowing both (i) which independent variables to include in the data analysis and (ii) their functional form. Point (i) can be settled on the basis of previous studies and input from expert clinicians, but even this is imperfect. Point (ii) is very difficult to work out. Is there a data-driven way to do this?

A paper by Belloni, Chernozhukov and Hansen (2014) proposes using post-double selection (PDS) to identify relevant controls and their functional form. Consider the case where you want to model the following:

$$y_i = g(w_i) + \zeta_i$$

where

$$E[\zeta_i \mid g(w_i)] = 0$$

The Belloni paper treats $g(w)$ as a high-dimensional, approximately linear model where:

$$g(w_i) = \sum_{j=1}^{p} \beta_j x_{i,j} + r_{p,i}$$

Note that in the Belloni framework it is possible for the number of control variables (p) to be greater than the number of observations (n). How can you have more regressors than observations? Basically because Belloni requires that the causal relationship be approximately sparse, in the sense that out of the p control variables, only s of them have coefficients different from 0, where s ≪ n.
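To see why sparsity is needed, here is a minimal sketch of what happens to ordinary least squares when p > n. The sample sizes and the simulated data are assumptions made purely for illustration; nothing here comes from the Belloni paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 50, 120, 5                     # hypothetical sizes with s << n < p

X = rng.normal(size=(n, p))              # simulated controls
y_noise = rng.normal(size=n)             # outcome unrelated to any control

# With p > n the design matrix cannot have full column rank,
# so the OLS coefficients are not uniquely determined.
rank_full = np.linalg.matrix_rank(X)

# lstsq returns the minimum-norm solution, which reproduces even a
# pure-noise outcome exactly in sample: the fit is uninformative.
b_all = np.linalg.lstsq(X, y_noise, rcond=None)[0]
perfect_fit = np.allclose(X @ b_all, y_noise)

# If only s controls matter, the problem restricted to those columns
# is an ordinary well-posed regression with full column rank.
rank_sparse = np.linalg.matrix_rank(X[:, :s])

print(rank_full, perfect_fit, rank_sparse)
```

The difficulty, of course, is that in practice you do not know which s columns to keep, which is where variable selection comes in.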

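Belloni proposes identifying these s important variables using a Least Absolute Shrinkage and Selection Operator (LASSO) model from Frank and Friedman (1993) as follows:

$$\hat{\beta} = \arg\min_{b} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{i,j} b_j \Big)^2 + \lambda \sum_{j=1}^{p} |b_j| \gamma_j$$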

In LASSO, coefficients are chosen to minimize the sum of squared residuals plus a penalty term that penalizes the size of the model through the sum of the absolute values of the coefficients. The λ term is the penalty level, which governs how heavily the model is penalized for including variables with non-zero coefficients. Papers such as Belloni et al. (2012) and Belloni et al. (2016) provide some reasonable estimates for the value of λ. The γ coefficients are the “penalty loadings”, which aim to ensure the equivariance of the coefficient estimates to rescaling of x. For example, if one variable is education in years on a scale of 1 to 16 and another is income in dollars, a 1-year increase in education is a far larger change than a $1 increase in annual income. Penalty loadings aim to correct this disparity. The authors note that:

The penalty function in the LASSO is special in that it has a kink at 0, which results in a sparse estimator with many coefficients set exactly to zero.
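Both points, the kink at zero and the role of the penalty loadings, can be seen in a small sketch with a single regressor, where the LASSO solution has a closed form (soft-thresholding). The loading used here, the root mean square of x, is an assumption chosen for illustration and is not necessarily the loading the authors recommend; the income data and the value of `lam` are likewise made up.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; exactly zero whenever |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_one_regressor(x, y, lam, use_loading=True):
    """Minimize sum((y - x*b)**2) + lam * gamma * |b| over a scalar b."""
    # Assumed loading: root mean square of x (illustrative choice only).
    gamma = np.sqrt(np.mean(x ** 2)) if use_loading else 1.0
    return soft_threshold(x @ y, lam * gamma / 2.0) / (x @ x)

# The kink: a weak signal is set exactly to zero, a strong one is shrunk.
weak = soft_threshold(0.3, 0.5)      # inside the threshold
strong = soft_threshold(2.0, 0.5)    # outside the threshold

# Rescaling: income measured in dollars versus thousands of dollars.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(50_000, 15_000, size=n)      # hypothetical incomes
x = income - income.mean()
y = 0.0002 * x + rng.normal(0, 1, size=n)        # hypothetical outcome

lam = 40.0                                       # hand-picked penalty level
b_dollars = lasso_one_regressor(x, y, lam)
b_thousands = lasso_one_regressor(x / 1000.0, y, lam)

# Without a loading, the same change of units alters the fit.
b_dollars_raw = lasso_one_regressor(x, y, lam, use_loading=False)
b_thousands_raw = lasso_one_regressor(x / 1000.0, y, lam, use_loading=False)

print(weak, strong)
print(b_thousands / b_dollars, b_thousands_raw / b_dollars_raw)
```

With the loading, changing the units of x rescales the coefficient by exactly the inverse factor, so the fitted values do not depend on the units. Without it, the amount of shrinkage depends on how the variable happens to be measured.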

One of the problems with the LASSO approach, however, is that the resulting coefficients are biased towards zero. To address this, Belloni proposes post-LASSO estimation, which is the following two-step approach:

First, LASSO is applied to determine which variables can be dropped from the standpoint of prediction. Then, the coefficients on the remaining variables are estimated by ordinary least squares regression using only the variables with non-zero first-step estimated coefficients. The Post-LASSO estimator is convenient to implement and … works as well as and often better than LASSO in terms of rates of convergence and bias.
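The two steps can be sketched as follows, on simulated data with more controls than observations. This is only an illustration under stated assumptions: the LASSO step is solved with a simple coordinate descent, `lam` is hand-picked rather than chosen by the data-driven rules in the cited papers, the loadings are the root mean square of each column, and there is no intercept because the simulated data are mean zero by construction. The function names are hypothetical.

```python
import numpy as np

def weighted_lasso(X, y, lam, loadings, max_sweeps=500, tol=1e-8):
    """Coordinate descent for sum((y - X b)**2) + lam * sum(loadings * |b|)."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)          # sum of squares of each column
    resid = y.astype(float)                # residual when b = 0
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            old = b[j]
            resid += X[:, j] * old         # take variable j out of the fit
            rho = X[:, j] @ resid
            shrunk = max(abs(rho) - lam * loadings[j] / 2.0, 0.0)
            b[j] = np.sign(rho) * shrunk / col_ss[j]
            resid -= X[:, j] * b[j]        # put variable j back in
            max_change = max(max_change, abs(b[j] - old))
        if max_change < tol:
            break
    return b

def post_lasso(X, y, lam, loadings):
    """Step 1: LASSO selects variables. Step 2: OLS on the selected ones."""
    b_lasso = weighted_lasso(X, y, lam, loadings)
    selected = np.flatnonzero(b_lasso)
    b_post = np.zeros(X.shape[1])
    if selected.size:
        coef = np.linalg.lstsq(X[:, selected], y, rcond=None)[0]
        b_post[selected] = coef
    return b_lasso, b_post, selected

# Simulated sparse design: p > n, only the first three controls matter.
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=n)

loadings = np.sqrt((X ** 2).mean(axis=0))  # assumed loadings
lam = 50.0                                 # hand-picked for illustration
b_lasso, b_post, selected = post_lasso(X, y, lam, loadings)

print("selected columns:", selected)
print("LASSO estimates:     ", b_lasso[:3])
print("post-LASSO estimates:", b_post[:3])
```

On the three controls that truly matter, the LASSO estimates are pulled toward zero by the penalty, while the OLS refit on the selected columns removes that shrinkage.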

More details are in the paper, along with a variety of empirical examples. Do read the whole study.

Additionally, a recent paper by Kugler et al. (2021), published last month, used the Belloni approach to examine the impact of wage expectations on the decision to become a nurse.

