14  Endogeneity and Instrumental Variable Estimation

In this section we will be using the mroz data, obtained from the official companion site of Wooldridge's Econometric Analysis of Cross Section and Panel Data. These are PSID data on the wages of 428 working, married women.

We start by installing the required packages and loading the required libraries. Note that you do not need to reinstall a package that is already installed if you are using your personal computer, but you will have to install it in each session if you are working on a university machine.

# install required packages
#install.packages("tidyverse")
#install.packages("haven")
#install.packages("ivreg")
#install.packages("sandwich")
#install.packages("lmtest")
#install.packages("stargazer")

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven) # to import data, which is in Stata format
library(ivreg) # IV estimation
library(sandwich) # for robust se calculations
library(lmtest) # for coeftest
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
library(stargazer) # create formatted tables

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
#library(Hmisc) # add labels to variables
#library(ggplot2)
#library(dplyr) # for data manipulation
#library(plm) # to estimate linear panel data models
#library(fastDummies) # create dummies based on categorical (factor) variable

Import the mroz_v2.dta data, which is provided to you in Stata format. Stata is another statistical package, widely used for data analysis and econometric modelling. You will see that even if Stata is not installed on your system, you can import its data files into R using the haven package.

mroz <- read_dta("assets/data/mroz_v2.dta")
#View(mroz)

See the first few rows:

head(mroz)
# A tibble: 6 × 19
  hours kidslt6 kidsge6   age  educ  wage repwage hushrs husage huseduc huswage
  <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1  1610       1       0    32    12  3.35    2.65   2708     34      12    4.03
2  1656       0       2    30    12  1.39    2.65   2310     30       9    8.44
3  1980       1       3    35    12  4.55    4.04   3072     40      12    3.58
4   456       0       3    34    12  1.10    3.25   1920     53      10    3.54
5  1568       1       2    31    14  4.59    3.60   2000     32      12   10   
6  2032       0       0    54    12  4.74    4.70   1040     57      11    6.71
# ℹ 8 more variables: faminc <dbl>, motheduc <dbl>, fatheduc <dbl>, unem <dbl>,
#   city <dbl>, exper <dbl>, lwage <dbl>, expersq <dbl>

The instrumental variable approach is widely implemented through two-stage least squares, and the two terms are often used interchangeably. Please also note the abbreviations IV and 2SLS, respectively, for the former and the latter.

14.1 Case I: 2SLS with one endogenous, one exogenous variable

14.1.1 Task 1

Estimate a regression of the logarithm of wage (lwage) on education (educ) and experience (exper). Include experience in quadratic form (expersq).

ols <- lm(lwage ~ educ + exper + expersq, data = mroz)
summary(ols)

Call:
lm(formula = lwage ~ educ + exper + expersq, data = mroz)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.08404 -0.30627  0.04952  0.37498  2.37115 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.5220407  0.1986321  -2.628  0.00890 ** 
educ         0.1074896  0.0141465   7.598 1.94e-13 ***
exper        0.0415665  0.0131752   3.155  0.00172 ** 
expersq     -0.0008112  0.0003932  -2.063  0.03974 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6664 on 424 degrees of freedom
Multiple R-squared:  0.1568,    Adjusted R-squared:  0.1509 
F-statistic: 26.29 on 3 and 424 DF,  p-value: 1.302e-15

Because it is highly likely that we will observe heteroscedasticity, let us summarise the results with heteroscedasticity-robust standard errors. For this, we will use coeftest, which comes with the lmtest package.

coeftest(ols, vcov = vcovHC, type = "HC1")

t test of coefficients:

               Estimate  Std. Error t value Pr(>|t|)    
(Intercept) -0.52204068  0.20165046 -2.5888 0.009961 ** 
educ         0.10748965  0.01321897  8.1315 4.72e-15 ***
exper        0.04156651  0.01527304  2.7216 0.006765 ** 
expersq     -0.00081119  0.00042007 -1.9311 0.054139 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Have a look at the coefficient estimates, standard errors, t-statistics and the p-values reported with and without the heteroscedasticity-robust standard errors. How do they compare?

Calculation of robust standard errors corrects for the bias in standard error estimates due to heteroscedasticity or autocorrelation (heteroscedasticity in the case of this example). Therefore, the coefficient estimates remain the same while standard errors are adjusted for the bias. Because of this change in the standard errors, the t-statistics and the p-values (which use standard error in calculations) also change.
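To see this directly, we can compute both sets of standard errors side by side. This is a small sketch using vcov (base R) and vcovHC (from the sandwich package loaded above):

```r
# conventional and HC1-robust standard errors for the same OLS fit;
# the coefficients themselves are untouched by the correction
conventional_se <- sqrt(diag(vcov(ols)))
robust_se <- sqrt(diag(vcovHC(ols, type = "HC1")))
cbind(conventional_se, robust_se)
```

The two columns should reproduce the standard errors of the two summaries above, confirming that only the variance estimate changes.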

14.1.2 Task 2

Explain why there may be an endogeneity issue in the model we estimated above.

Ability is an important determinant of wages, which is not included in the given regression. It is also expected to be highly correlated with education. If that’s the case, omission of ability from the model will lead to a correlation between education and the error term. This causes an issue of endogeneity.

14.1.3 Task 3

Identify the potential instruments that you may use in the data you are given and explain your choice.

The three potential instruments provided in the data are mother's education (motheduc), father's education (fatheduc) and husband's education (huseduc). Each of these variables is likely to be highly correlated with the individual's own education (relevance), while affecting her wage only through education, i.e. being uncorrelated with the error term of the wage equation (exogeneity).

14.1.4 Task 4

Estimate the above equation by two-stage least squares (i.e. instrumental variable estimation) manually.

14.1.4.1 Guidance

Step 1

Let’s say we want to use husband’s education huseduc as an instrument for the woman’s education.

Regress the endogenous variable on the instrument and all other exogenous variables of the model.

stage1_1 <- lm(educ ~ huseduc + exper + expersq, data = mroz)
summary(stage1_1)

Call:
lm(formula = educ ~ huseduc + exper + expersq, data = mroz)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5093 -0.9785 -0.2310  1.2857  6.3994 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.694524   0.454849  14.718   <2e-16 ***
huseduc      0.448194   0.029495  15.195   <2e-16 ***
exper        0.042220   0.036338   1.162    0.246    
expersq     -0.001017   0.001085  -0.937    0.349    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.841 on 424 degrees of freedom
Multiple R-squared:  0.3558,    Adjusted R-squared:  0.3512 
F-statistic: 78.05 on 3 and 424 DF,  p-value: < 2.2e-16

Obtain predictions of education using the above model.

mroz$educ_hat <- predict(stage1_1)

Step 2

Estimate the wage regression, replacing the endogenous educ variable with its fitted (exogenous) version educ_hat.

stage2_1 <- lm(lwage ~ educ_hat + exper + expersq, data = mroz)
summary(stage2_1)

Call:
lm(formula = lwage ~ educ_hat + exper + expersq, data = mroz)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.12116 -0.34011  0.05149  0.38784  2.36570 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.2980989  0.3248571  -0.918 0.359334    
educ_hat     0.0893851  0.0250210   3.572 0.000394 ***
exper        0.0425893  0.0138836   3.068 0.002296 ** 
expersq     -0.0008457  0.0004148  -2.039 0.042080 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6999 on 424 degrees of freedom
Multiple R-squared:   0.07, Adjusted R-squared:  0.06342 
F-statistic: 10.64 on 3 and 424 DF,  p-value: 9.339e-07

Let us report these results using the heteroscedasticity-robust standard errors:

coeftest(stage2_1, vcov = vcovHC, type = "HC1")

t test of coefficients:

               Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) -0.29809893  0.33474934 -0.8905 0.3736950    
educ_hat     0.08938509  0.02448649  3.6504 0.0002945 ***
exper        0.04258927  0.01591354  2.6763 0.0077327 ** 
expersq     -0.00084567  0.00044034 -1.9205 0.0554634 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14.1.5 Task 5

Estimate the above equation by 2SLS (IV estimation) using the ivreg command from the ivreg package. Compare the coefficients and standard errors with what you obtained manually.

iv_1 <- ivreg(lwage ~ educ + exper + expersq | huseduc + exper + expersq, data = mroz)
summary(iv_1)

Call:
ivreg(formula = lwage ~ educ + exper + expersq | huseduc + exper + 
    expersq, data = mroz)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.07677 -0.32148  0.03525  0.37605  2.36256 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.2980989  0.3099189  -0.962 0.336668    
educ         0.0893851  0.0238705   3.745 0.000206 ***
exper        0.0425893  0.0132451   3.215 0.001402 ** 
expersq     -0.0008457  0.0003957  -2.137 0.033155 *  

Diagnostic tests:
                 df1 df2 statistic p-value    
Weak instruments   1 424   230.900  <2e-16 ***
Wu-Hausman         1 423     0.892   0.346    
Sargan             0  NA        NA      NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6677 on 424 degrees of freedom
Multiple R-Squared: 0.1536, Adjusted R-squared: 0.1476 
Wald test: 11.69 on 3 and 424 DF,  p-value: 2.26e-07 

The coefficients reported by the ivreg are the same as our manual two-stage estimation. The standard errors are slightly different. This is because in the manual calculations, during the estimation of the second stage, we use predictions from the first stage (educ_hat). This introduces additional uncertainty into the model. Although the statistical packages (such as R) follow a similar approach, they correct for the bias in standard error calculations before reporting these numbers.
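The correction is conceptually simple: the 2SLS residuals should be computed with the original educ, not with educ_hat. A minimal sketch of this adjustment, using the objects defined above:

```r
# take the 2SLS coefficients from the manual second stage,
# but compute residuals with the ORIGINAL regressors (educ, not educ_hat)
b <- coef(stage2_1)
X <- cbind(1, mroz$educ, mroz$exper, mroz$expersq)
u <- mroz$lwage - X %*% b
# residual standard error based on these corrected residuals
sqrt(sum(u^2) / (nrow(mroz) - length(b)))
```

This should reproduce the residual standard error reported by ivreg (0.6677) rather than the 0.6999 of the naive second stage.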

Note that this is a different adjustment from the calculation of heteroscedasticity-robust standard errors, so let us apply that correction too:

coeftest(iv_1, vcov = vcovHC, type = "HC1")

t test of coefficients:

               Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) -0.29809893  0.31885872 -0.9349 0.3503753    
educ         0.08938509  0.02306960  3.8746 0.0001237 ***
exper        0.04258927  0.01525285  2.7922 0.0054716 ** 
expersq     -0.00084567  0.00041999 -2.0135 0.0446901 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14.1.6 Task 6

Compare the coefficient on education in OLS and 2SLS approaches.

14.1.6.1 Guidance

We can compare the three models by summarising their results in one table using the stargazer package. Please note that the table below does not report heteroscedasticity-corrected standard errors. We could replace the conventional standard errors with a set of robust ones, but this requires a few more steps that are beyond the scope of this module. You may change these manually for the moment.

stargazer(ols, stage2_1, iv_1, type = "text")

===============================================================
                                     Dependent variable:       
                               --------------------------------
                                            lwage              
                                       OLS         instrumental
                                                     variable  
                                  (1)       (2)        (3)     
---------------------------------------------------------------
educ                           0.107***              0.089***  
                                (0.014)              (0.024)   
                                                               
educ_hat                                 0.089***              
                                          (0.025)              
                                                               
exper                          0.042***  0.043***    0.043***  
                                (0.013)   (0.014)    (0.013)   
                                                               
expersq                        -0.001**  -0.001**    -0.001**  
                               (0.0004)  (0.0004)    (0.0004)  
                                                               
Constant                       -0.522***  -0.298      -0.298   
                                (0.199)   (0.325)    (0.310)   
                                                               
---------------------------------------------------------------
Observations                      428       428        428     
R2                               0.157     0.070      0.154    
Adjusted R2                      0.151     0.063      0.148    
Residual Std. Error (df = 424)   0.666     0.700      0.668    
F Statistic (df = 3; 424)      26.286*** 10.638***             
===============================================================
Note:                               *p<0.1; **p<0.05; ***p<0.01

As expected, the coefficient on education is higher in the OLS estimation than in 2SLS. This is because in OLS the education variable captures not only the genuine impact of years of schooling but also ability. Both schooling and ability are expected to have a positive impact on wages, and they are positively correlated with each other. Hence, omitting ability from the wage regression creates a positive bias in the coefficient on education.
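Although substituting robust standard errors into the table is beyond the scope of this module, for the curious: stargazer accepts a list of standard-error vectors through its se argument, so the robust ones can be passed in. A sketch of that substitution:

```r
# helper returning HC1-robust standard errors for a fitted model
rob_se <- function(model) sqrt(diag(vcovHC(model, type = "HC1")))

# same comparison table, but with robust standard errors substituted in
stargazer(ols, stage2_1, iv_1, type = "text",
          se = list(rob_se(ols), rob_se(stage2_1), rob_se(iv_1)))
```

The coefficients and significance stars may then differ slightly from the table above, because the stars are recomputed from the supplied standard errors.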

14.2 Case II: 2SLS with one endogenous and multiple exogenous variables

14.2.1 Task 7

Replicate the 2SLS estimation manually (using OLS), this time with 3 instruments for education: husband’s education (huseduc), mother’s education (motheduc) and father’s education (fatheduc).

14.2.1.1 Guidance

Step 1

First, estimate the first-stage regression: regress the endogenous variable on the instruments and all other exogenous variables of the model.

stage1_2 <- lm(educ ~ huseduc + motheduc + fatheduc + exper + expersq, 
               data = mroz)
summary(stage1_2)

Call:
lm(formula = educ ~ huseduc + motheduc + fatheduc + exper + expersq, 
    data = mroz)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6882 -1.1519  0.0097  1.0640  5.7302 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.5383110  0.4597824  12.046  < 2e-16 ***
huseduc      0.3752548  0.0296347  12.663  < 2e-16 ***
motheduc     0.1141532  0.0307835   3.708 0.000237 ***
fatheduc     0.1060801  0.0295153   3.594 0.000364 ***
exper        0.0374977  0.0343102   1.093 0.275059    
expersq     -0.0006002  0.0010261  -0.585 0.558899    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.738 on 422 degrees of freedom
Multiple R-squared:  0.4286,    Adjusted R-squared:  0.4218 
F-statistic:  63.3 on 5 and 422 DF,  p-value: < 2.2e-16

Obtain predictions of education using the above model.

mroz$educ_hat_2 <- predict(stage1_2)

Step 2

Estimate the wage regression, replacing the endogenous educ variable with its fitted (exogenous) version educ_hat_2.

stage2_2 <- lm(lwage ~ educ_hat_2 + exper + expersq, data = mroz)
summary(stage2_2)

Call:
lm(formula = lwage ~ educ_hat_2 + exper + expersq, data = mroz)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1407 -0.3382  0.0594  0.3798  2.3860 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.1868574  0.2985449  -0.626 0.531722    
educ_hat_2   0.0803918  0.0227772   3.529 0.000462 ***
exper        0.0430973  0.0138760   3.106 0.002024 ** 
expersq     -0.0008628  0.0004144  -2.082 0.037957 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7001 on 424 degrees of freedom
Multiple R-squared:  0.06935,   Adjusted R-squared:  0.06277 
F-statistic: 10.53 on 3 and 424 DF,  p-value: 1.078e-06

Let us report these results using the heteroscedasticity-robust standard errors:

coeftest(stage2_2, vcov = vcovHC, type = "HC1")

t test of coefficients:

               Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) -0.18685736  0.31671471 -0.5900 0.5555141    
educ_hat_2   0.08039177  0.02303514  3.4900 0.0005337 ***
exper        0.04309732  0.01606065  2.6834 0.0075728 ** 
expersq     -0.00086280  0.00044548 -1.9368 0.0534340 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14.2.2 Task 8

Estimate the above equation by 2SLS (IV estimation) using the ivreg command from the ivreg package. Compare the coefficients and standard errors with what you obtained manually.

iv_2 <- ivreg(lwage ~ educ + exper + expersq | huseduc + motheduc + fatheduc +
                exper + expersq, data = mroz)
summary(iv_2)

Call:
ivreg(formula = lwage ~ educ + exper + expersq | huseduc + motheduc + 
    fatheduc + exper + expersq, data = mroz)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.08378 -0.32135  0.03538  0.36934  2.35829 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.1868574  0.2853959  -0.655 0.512996    
educ         0.0803918  0.0217740   3.692 0.000251 ***
exper        0.0430973  0.0132649   3.249 0.001250 ** 
expersq     -0.0008628  0.0003962  -2.178 0.029976 *  

Diagnostic tests:
                 df1 df2 statistic p-value    
Weak instruments   3 422   104.294  <2e-16 ***
Wu-Hausman         1 423     2.732  0.0991 .  
Sargan             2  NA     1.115  0.5726    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6693 on 424 degrees of freedom
Multiple R-Squared: 0.1495, Adjusted R-squared: 0.1435 
Wald test: 11.52 on 3 and 424 DF,  p-value: 2.817e-07 

Let us also integrate heteroscedasticity-robust standard errors.

coeftest(iv_2, vcov = vcovHC, type = "HC1")

t test of coefficients:

               Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) -0.18685736  0.30126251 -0.6202 0.5354280    
educ         0.08039177  0.02170330  3.7041 0.0002402 ***
exper        0.04309732  0.01530642  2.8156 0.0050951 ** 
expersq     -0.00086280  0.00042166 -2.0462 0.0413549 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14.2.3 Task 9

Test for the relevance of the chosen instruments.

14.2.3.1 Guidance

The chosen instruments should sufficiently explain the variation in the endogenous variable (here, the education level). The F-statistic of the first-stage regression is 63.3, which is greater than the widely accepted threshold of 10. Strictly speaking, relevance is judged by the joint F-test on the excluded instruments only; this is the "Weak instruments" statistic of 104.29 reported in the ivreg diagnostics above, which is also well above 10. Either way, we conclude that the instrument set sufficiently explains the variation in education (at least one of the instruments has a non-zero effect): the instruments in this example are relevant.
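The joint F-test on the excluded instruments can be computed by comparing the first stage with and without them. A sketch using base R's anova and the objects defined above:

```r
# restricted first stage: drop the three instruments
stage1_r <- lm(educ ~ exper + expersq, data = mroz)

# F-test of H0: the coefficients on huseduc, motheduc and fatheduc are all zero
anova(stage1_r, stage1_2)
```

The resulting F-statistic should match the "Weak instruments" value (104.29) in the ivreg diagnostics.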

14.2.4 Task 10

Test for the overidentifying restrictions in the above estimation. Explain what this test does.

14.2.4.1 Guidance

Instruments used in IV estimation should not normally belong to the main model of interest (i.e. in the case of this example, should not be a determinant of the individual's wage) and they should satisfy the relevance and exogeneity assumptions. We confirmed above that the instrument set is relevant. We can check the exogeneity assumption by applying Sargan's J test of overidentifying restrictions.

First, we save the residuals from the iv estimation.

mroz$resid_iv_2 <- residuals(iv_2)

We then regress these saved residuals on all available exogenous variables (the instruments plus the exogenous variables of the wage model).

overid_test <- lm(resid_iv_2 ~ huseduc + motheduc + fatheduc + exper + expersq
                  , data = mroz)
summary(overid_test)

Call:
lm(formula = resid_iv_2 ~ huseduc + motheduc + fatheduc + exper + 
    expersq, data = mroz)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.07503 -0.32777  0.04156  0.37759  2.33621 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.606e-03  1.773e-01   0.049    0.961
huseduc      6.781e-03  1.143e-02   0.593    0.553
motheduc    -1.039e-02  1.187e-02  -0.875    0.382
fatheduc     6.734e-04  1.138e-02   0.059    0.953
exper        5.603e-05  1.323e-02   0.004    0.997
expersq     -8.882e-06  3.956e-04  -0.022    0.982

Residual standard error: 0.67 on 422 degrees of freedom
Multiple R-squared:  0.002605,  Adjusted R-squared:  -0.009212 
F-statistic: 0.2205 on 5 and 422 DF,  p-value: 0.9537

We then calculate the chi-squared test statistic by multiplying the number of observations in the sample (n) by the R-squared from the above regression. Note how we obtain these two quantities in the R code provided below.

# Calculate chi-squared test statistic
sum_stat <- nrow(mroz) * summary(overid_test)$r.squared

print(sum_stat)
[1] 1.115043

We can either compare this with a chi-squared critical value with 3 - 1 = 2 degrees of freedom (the number of instruments minus the number of endogenous regressors), or ask R to calculate the corresponding p-value. I find the latter easier:

# p-value for the calculated test-statistic with 2 degrees of freedom
pchisq(sum_stat, df = 2, lower.tail = FALSE)
[1] 0.5726264

The p-value is 0.57. There is not enough evidence to reject the null hypothesis that "the instrument set is exogenous". Hence, the overidentifying restrictions are valid. Note that our manual statistic and p-value match the Sargan line (1.115, p = 0.5726) of the ivreg diagnostics reported above.

The results of this task and the previous one confirm that we have a valid set of instruments.

14.2.5 Task 11

Test for the existence of endogeneity in the wage regression.

14.2.5.1 Guidance

We will apply the Durbin-Wu-Hausman test for endogeneity. We need the saved residuals from the first stage of the 2SLS estimation. Let us save these under the name resid_2 (to distinguish them from the single-instrument case we ran at the beginning).

mroz$resid_2 <- residuals(stage1_2)

We then estimate the original model of interest by additionally including these saved residuals.

dwh_test <- lm(lwage ~ educ + exper + expersq + resid_2,
               data = mroz)
# Report the results with robust standard errors
coeftest(dwh_test, vcov = vcovHC, type = "HC1")

t test of coefficients:

               Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) -0.18685736  0.29772776 -0.6276 0.5305971    
educ         0.08039177  0.02136857  3.7621 0.0001922 ***
exper        0.04309732  0.01510368  2.8534 0.0045373 ** 
expersq     -0.00086280  0.00041557 -2.0762 0.0384826 *  
resid_2      0.04718901  0.02630679  1.7938 0.0735598 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient on resid_2 has a p-value of 0.074: we cannot reject the null that educ is exogenous at the 5% level, but there is weak evidence of endogeneity at the 10% level. This is consistent with the Wu-Hausman diagnostic reported by ivreg above (p = 0.0991 with conventional standard errors).

14.3 Further Reading

See https://cran.r-project.org/web/packages/ivreg/vignettes/ivreg.html for more information and another empirical example.