14 Endogeneity and Instrumental Variable Estimation
In this section we will be using the mroz data, obtained from the official companion site of Wooldridge's Econometric Analysis of Cross Section and Panel Data. It contains "PSID data on the wages of 428 working, married women".
We start by installing the required packages and loading the libraries. Note that you do not need to reinstall a package that is already installed if you are using your personal computer, but you will have to install it each time you need it on a university machine.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven)    # to import data, which is in Stata format
library(ivreg)    # IV estimation
library(sandwich) # for robust se calculations
library(lmtest)   # for coeftest
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
library(stargazer) # create formatted tables
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
#library(Hmisc)       # add labels to variables
#library(ggplot2)
#library(dplyr)       # for data manipulation
#library(plm)         # to estimate linear panel data models
#library(fastDummies) # create dummies based on categorical (factor) variable
Import the mroz_v2.dta data. mroz_v2 is provided to you in Stata format. Stata is another statistical package, widely used for data analysis and econometric modelling. Even if Stata is not installed on your system, you can import the file into R using the haven package.
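Assuming mroz_v2.dta sits in your working directory (adjust the path if you saved it elsewhere), the import might look like this:

```r
# Import the Stata-format data file with haven
mroz <- haven::read_dta("mroz_v2.dta")
```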
The instrumental variable approach is widely referred to as two-stage least squares; the two terms will be used interchangeably. Please also note the abbreviations IV and 2SLS, respectively, for the former and the latter.
14.1 Case I: 2SLS with one endogenous, one exogenous variable
14.1.1 Task 1
Estimate a regression of logarithmic wage lwage using education (educ) and experience (exper) as independent variables. Include experience in quadratic form.
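The output below was produced by a call along these lines (the object name ols matches the coeftest call that follows):

```r
# OLS of log wage on education, experience and experience squared
ols <- lm(lwage ~ educ + exper + expersq, data = mroz)
summary(ols)
```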
Call:
lm(formula = lwage ~ educ + exper + expersq, data = mroz)
Residuals:
Min 1Q Median 3Q Max
-3.08404 -0.30627 0.04952 0.37498 2.37115
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5220407 0.1986321 -2.628 0.00890 **
educ 0.1074896 0.0141465 7.598 1.94e-13 ***
exper 0.0415665 0.0131752 3.155 0.00172 **
expersq -0.0008112 0.0003932 -2.063 0.03974 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6664 on 424 degrees of freedom
Multiple R-squared: 0.1568, Adjusted R-squared: 0.1509
F-statistic: 26.29 on 3 and 424 DF, p-value: 1.302e-15
Because it is highly likely that we will observe heteroscedasticity, let us summarise the results with heteroscedasticity-robust standard errors. For this, we will use coeftest, which comes with the lmtest package.
coeftest(ols, vcov = vcovHC, type = "HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.52204068 0.20165046 -2.5888 0.009961 **
educ 0.10748965 0.01321897 8.1315 4.72e-15 ***
exper 0.04156651 0.01527304 2.7216 0.006765 **
expersq -0.00081119 0.00042007 -1.9311 0.054139 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Have a look at the coefficient estimates, standard errors, t-statistics and the p-values reported with and without the heteroscedasticity-robust standard errors. How do they compare?
Calculation of robust standard errors corrects for the bias in standard error estimates due to heteroscedasticity or autocorrelation (heteroscedasticity in the case of this example). Therefore, the coefficient estimates remain the same while standard errors are adjusted for the bias. Because of this change in the standard errors, the t-statistics and the p-values (which use standard error in calculations) also change.
14.1.2 Task 2
Explain why there may be an endogeneity issue in the model we estimated above.
Ability is an important determinant of wages, which is not included in the given regression. It is also expected to be highly correlated with education. If that’s the case, omission of ability from the model will lead to a correlation between education and the error term. This causes an issue of endogeneity.
14.1.3 Task 3
Identify the potential instruments that you may use in the data you are given and explain your choice.
The three potential instruments provided in the data are mother's education, father's education and husband's education. Each of these variables is likely to be highly correlated with the individual's education, but is expected to affect the wage only through education (i.e. to be uncorrelated with the error term of the wage equation).
14.1.4 Task 4
Estimate the above equation by two-stage least squares (i.e. instrumental variable estimation) manually.
14.1.4.1 Guidance
Step 1
Let’s say we want to use husband’s education huseduc as an instrument for the woman’s education.
Regress the endogenous variable on the instrument and all other exogenous variables of the model.
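A sketch of the two stages, consistent with the output below (the object name stage2_1 appears in the later coeftest call; stage1_1 is an assumed name for the first-stage fit):

```r
# Stage 1: regress the endogenous variable (educ) on the instrument
# (huseduc) and the other exogenous regressors of the model
stage1_1 <- lm(educ ~ huseduc + exper + expersq, data = mroz)

# Save the fitted values of education
mroz$educ_hat <- fitted(stage1_1)

# Stage 2: replace educ with its fitted values in the wage equation
stage2_1 <- lm(lwage ~ educ_hat + exper + expersq, data = mroz)
summary(stage2_1)
```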
Call:
lm(formula = lwage ~ educ_hat + exper + expersq, data = mroz)
Residuals:
Min 1Q Median 3Q Max
-3.12116 -0.34011 0.05149 0.38784 2.36570
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2980989 0.3248571 -0.918 0.359334
educ_hat 0.0893851 0.0250210 3.572 0.000394 ***
exper 0.0425893 0.0138836 3.068 0.002296 **
expersq -0.0008457 0.0004148 -2.039 0.042080 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6999 on 424 degrees of freedom
Multiple R-squared: 0.07, Adjusted R-squared: 0.06342
F-statistic: 10.64 on 3 and 424 DF, p-value: 9.339e-07
Let us report these results using the heteroscedasticity-robust standard errors:
coeftest(stage2_1, vcov = vcovHC, type = "HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.29809893 0.33474934 -0.8905 0.3736950
educ_hat 0.08938509 0.02448649 3.6504 0.0002945 ***
exper 0.04258927 0.01591354 2.6763 0.0077327 **
expersq -0.00084567 0.00044034 -1.9205 0.0554634 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
14.1.5 Task 5
Estimate the above equation by 2SLS (IV estimation) using R’s built-in command ivreg. Compare the coefficients and standard errors with what you obtained manually.
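The call producing the output below (the object name iv_1 is taken from the coeftest call that follows; diagnostics = TRUE asks summary for the weak-instrument, Wu-Hausman and Sargan tests shown in the output):

```r
# 2SLS with ivreg: regressors | instruments + exogenous variables
iv_1 <- ivreg(lwage ~ educ + exper + expersq | huseduc + exper + expersq,
              data = mroz)
summary(iv_1, diagnostics = TRUE)
```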
Call:
ivreg(formula = lwage ~ educ + exper + expersq | huseduc + exper +
expersq, data = mroz)
Residuals:
Min 1Q Median 3Q Max
-3.07677 -0.32148 0.03525 0.37605 2.36256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2980989 0.3099189 -0.962 0.336668
educ 0.0893851 0.0238705 3.745 0.000206 ***
exper 0.0425893 0.0132451 3.215 0.001402 **
expersq -0.0008457 0.0003957 -2.137 0.033155 *
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 424 230.900 <2e-16 ***
Wu-Hausman 1 423 0.892 0.346
Sargan 0 NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6677 on 424 degrees of freedom
Multiple R-Squared: 0.1536, Adjusted R-squared: 0.1476
Wald test: 11.69 on 3 and 424 DF, p-value: 2.26e-07
The coefficients reported by ivreg are the same as those from our manual two-stage estimation. The standard errors are slightly different. This is because, in the manual approach, the second stage uses predictions from the first stage (educ_hat), which introduces additional uncertainty that plain OLS standard errors ignore. Statistical packages such as R follow a similar two-stage approach, but they correct the standard errors for this before reporting them.
Note that this is a different adjustment from the calculation of heteroscedasticity-robust standard errors, so let us integrate that too:
coeftest(iv_1, vcov = vcovHC, type = "HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.29809893 0.31885872 -0.9349 0.3503753
educ 0.08938509 0.02306960 3.8746 0.0001237 ***
exper 0.04258927 0.01525285 2.7922 0.0054716 **
expersq -0.00084567 0.00041999 -2.0135 0.0446901 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
14.1.6 Task 6
Compare the coefficient on education in OLS and 2SLS approaches.
14.1.6.1 Guidance
We can compare the three models by summarising their results in one table using the stargazer package. Please note that the table below does not report heteroscedasticity-corrected standard errors. We could replace the conventional standard errors with a matrix of robust ones, but this requires a few more steps that are beyond this module. You may change these manually for the moment.
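A minimal sketch of such a comparison table, assuming the ols, stage2_1 and iv_1 objects estimated earlier in this section (the column labels are illustrative):

```r
# Side-by-side comparison of OLS, manual 2SLS and ivreg estimates;
# type = "text" prints a plain-text table to the console
stargazer(ols, stage2_1, iv_1, type = "text",
          column.labels = c("OLS", "2SLS manual", "2SLS ivreg"))
```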
As expected, the coefficient on education is higher in the OLS estimation than the coefficient obtained through 2SLS. This is because education captures not only the genuine impact of years of schooling but also ability. Each of these is expected to have a positive impact on wages, and they are positively correlated with each other. Hence, omission of ability from the wage regression creates a positive bias in the coefficient on education.
14.2 Case II: 2SLS with one endogenous and multiple exogenous variables
14.2.1 Task 7
Replicate the 2SLS estimation manually (using OLS), this time with 3 instruments for education: husband’s education (huseduc), mother’s education (motheduc) and father’s education (fatheduc).
14.2.1.1 Guidance
Step 1
First, estimate the first stage regression: regress the endogenous variable on the instrument and all other exogenous variables of the model.
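A sketch consistent with the output below and with the object names used later in this section (stage1_2 appears again in Task 11, educ_hat_2 and stage2_2 in the output and coeftest call here):

```r
# Stage 1: educ on all three instruments plus the exogenous regressors
stage1_2 <- lm(educ ~ huseduc + motheduc + fatheduc + exper + expersq,
               data = mroz)

# Save fitted education and run the second stage
mroz$educ_hat_2 <- fitted(stage1_2)
stage2_2 <- lm(lwage ~ educ_hat_2 + exper + expersq, data = mroz)
summary(stage2_2)
```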
Call:
lm(formula = lwage ~ educ_hat_2 + exper + expersq, data = mroz)
Residuals:
Min 1Q Median 3Q Max
-3.1407 -0.3382 0.0594 0.3798 2.3860
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1868574 0.2985449 -0.626 0.531722
educ_hat_2 0.0803918 0.0227772 3.529 0.000462 ***
exper 0.0430973 0.0138760 3.106 0.002024 **
expersq -0.0008628 0.0004144 -2.082 0.037957 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7001 on 424 degrees of freedom
Multiple R-squared: 0.06935, Adjusted R-squared: 0.06277
F-statistic: 10.53 on 3 and 424 DF, p-value: 1.078e-06
Let us report these results using the heteroscedasticity-robust standard errors:
coeftest(stage2_2, vcov = vcovHC, type = "HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.18685736 0.31671471 -0.5900 0.5555141
educ_hat_2 0.08039177 0.02303514 3.4900 0.0005337 ***
exper 0.04309732 0.01606065 2.6834 0.0075728 **
expersq -0.00086280 0.00044548 -1.9368 0.0534340 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
14.2.2 Task 8
Estimate the above equation by 2SLS (IV estimation) using R’s built-in command ivreg. Compare the coefficients and standard errors with what you obtained manually.
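The corresponding ivreg call (the object name iv_2 is taken from the coeftest and residuals calls that follow):

```r
# 2SLS with three instruments for educ
iv_2 <- ivreg(lwage ~ educ + exper + expersq |
                huseduc + motheduc + fatheduc + exper + expersq,
              data = mroz)
summary(iv_2, diagnostics = TRUE)
```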
Let us also integrate heteroscedasticity-robust standard errors.
coeftest(iv_2, vcov = vcovHC, type = "HC1")
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.18685736 0.30126251 -0.6202 0.5354280
educ 0.08039177 0.02170330 3.7041 0.0002402 ***
exper 0.04309732 0.01530642 2.8156 0.0050951 **
expersq -0.00086280 0.00042166 -2.0462 0.0413549 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
14.2.3 Task 9
Test for the relevance of the chosen instruments.
14.2.3.1 Guidance
The chosen instruments should sufficiently explain the variation in the endogenous variable (i.e. the education level). The F-statistic obtained in the first stage is 63.3, which is greater than the widely accepted threshold of 10. Hence, we conclude that the instrument set sufficiently explains the variation in education (at least one of the instruments has an impact different from zero). The instruments in this example are relevant.
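Assuming the first-stage object stage1_2 from the manual estimation above, one way to read off the first-stage F-statistic is from its summary:

```r
# Overall F-statistic of the first-stage regression
# (value, numerator df, denominator df)
summary(stage1_2)$fstatistic
```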
14.2.4 Task 10
Test for the overidentifying restrictions in the above estimation. Explain what this test does.
14.2.4.1 Guidance
Instruments used in IV estimation should not belong to the main model of interest (i.e., in this example, they should not be direct determinants of the individual's wage), and they should satisfy the relevance and exogeneity assumptions. We confirmed above that the instrument set is relevant. We can check the exogeneity assumption by applying Sargan's J test of overidentifying restrictions.
First, we save the residuals from the IV estimation.
mroz$resid_iv_2 <- residuals(iv_2)
We then regress these saved residuals on all available exogenous variables (the instruments plus the exogenous variables of the wage model).
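The auxiliary regression matches the output below (the object name overid_test is the one used in the test-statistic calculation that follows):

```r
# Regress the IV residuals on all instruments and exogenous variables
overid_test <- lm(resid_iv_2 ~ huseduc + motheduc + fatheduc + exper + expersq,
                  data = mroz)
summary(overid_test)
```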
Call:
lm(formula = resid_iv_2 ~ huseduc + motheduc + fatheduc + exper +
expersq, data = mroz)
Residuals:
Min 1Q Median 3Q Max
-3.07503 -0.32777 0.04156 0.37759 2.33621
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.606e-03 1.773e-01 0.049 0.961
huseduc 6.781e-03 1.143e-02 0.593 0.553
motheduc -1.039e-02 1.187e-02 -0.875 0.382
fatheduc 6.734e-04 1.138e-02 0.059 0.953
exper 5.603e-05 1.323e-02 0.004 0.997
expersq -8.882e-06 3.956e-04 -0.022 0.982
Residual standard error: 0.67 on 422 degrees of freedom
Multiple R-squared: 0.002605, Adjusted R-squared: -0.009212
F-statistic: 0.2205 on 5 and 422 DF, p-value: 0.9537
We then calculate the chi-squared test statistic by multiplying the number of observations in the sample (n) by the R-squared from the above regression. Note how we obtain these two quantities in the R code provided below.
# Calculate chi-squared test statistic
sum_stat <- nrow(mroz) * summary(overid_test)$r.squared
print(sum_stat)
[1] 1.115043
We can either compare this with a chi-squared table value with 3 - 1 = 2 degrees of freedom (the number of instruments minus the number of endogenous regressors), or ask R to calculate the corresponding p-value. I find the latter easier:
# p-value for the calculated test-statistic with 2 degrees of freedom
pchisq(sum_stat, df = 2, lower.tail = FALSE)
[1] 0.5726264
The p-value is 0.57. There is not enough evidence to reject the null hypothesis that "the instrument set is exogenous". Hence, the overidentifying restrictions are valid.
The results of this task and the previous one confirm that we have a valid set of instruments.
14.2.5 Task 11
Test for the existence of endogeneity in the wage regression.
14.2.6 Guidance
We will apply the Durbin-Wu-Hausman test for endogeneity. We need the saved residuals from the first stage of the 2SLS estimation. Let's save them under the name resid_2 (to differentiate them from the single-instrument case we ran at the beginning).
mroz$resid_2 <- residuals(stage1_2)
We then estimate the original model of interest by additionally including these saved residuals.
dwh_test <- lm(lwage ~ educ + exper + expersq + resid_2, data = mroz)
# Report the results with robust standard errors
coeftest(dwh_test, vcov = vcovHC, type = "HC1")