15  IV Estimation: The Role of Institutions in Economic Growth

In their study titled The colonial origins of comparative development, Acemoglu, Johnson and Robinson (2001) explore the role of institutions in economic performance, measured as income per capita.

Acemoglu, D., Johnson, S., Robinson, J.A. (2001) The colonial origins of comparative development: an empirical investigation, The American Economic Review, 91(5): 1369-1401. Available from https://economics.mit.edu/sites/default/files/publications/colonial-origins-of-comparative-development.pdf

Data Source: https://economics.mit.edu/people/faculty/daron-acemoglu/data-archive

Some of the variables in their data are listed below:

Variable    Definition
shortnam    3 letter country name
logpgp95    log PPP GDP pc in 1995, World Bank
avexpr      average protection against expropriation risk
f_brit      British Colony (Flopsexpsn)
f_french    French Colony (Flopsexpans)
logem4      log settler mortality

Before continuing with the analysis below, it is recommended that you read through the highlighted text in the paper (provided on the module Aula page). Try to find answers to the following:

- Why might institutions or institutional quality be considered endogenous in a growth or national income model?
- What is the authors’ strategy to break this endogeneity?

On the first page, the authors explain that “[c]ountries with better institutions, more secure property rights and less distortionary policies will invest more in physical and human capital, and will use these factors more efficiently to achieve a greater level of income.” While better institutions have a positive impact on national income, more developed countries are also more likely to have better institutions and more established property rights. This simultaneity between national income and institutional quality creates an endogeneity problem: any shock affecting national income (through the error term) will in turn influence institutions, creating a correlation between the error term and the institution variable.
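The feedback loop can be made explicit with a stylised two-equation system. This is only a sketch under simplifying assumptions: the colony dummies are dropped, and the symbols \(\gamma_1\), \(\gamma_2\) and \(v\) are introduced here for illustration and do not come from the paper.

\[logpgp95 = \beta_1 + \beta_2 avexpr + u\]

\[avexpr = \gamma_1 + \gamma_2 logpgp95 + v\]

Substituting the first equation into the second and solving for \(avexpr\) (assuming \(\gamma_2 \beta_2 \neq 1\)) gives

\[avexpr = \frac{\gamma_1 + \gamma_2 \beta_1 + \gamma_2 u + v}{1 - \gamma_2 \beta_2}\]

If, in addition, \(u\) and \(v\) are assumed to be uncorrelated, then

\[Cov(avexpr, u) = \frac{\gamma_2}{1 - \gamma_2 \beta_2} Var(u)\]

which is non-zero whenever income feeds back into institutions (\(\gamma_2 \neq 0\)). This is exactly the correlation between regressor and error term that makes OLS inconsistent.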

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ivreg) # IV estimation
library(sandwich) # for robust se calculations
library(lmtest) # for coeftest
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
library(stargazer) # create formatted tables

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 

Load the acemoglu_2001 data. The data is provided to you in RData format.

load("./assets/data/acemoglu_2001.RData")

Let’s assign a shorter name for our data

df <- acemoglu_2001

15.0.1 Task 1

Estimate the following regression using OLS and comment on the estimation results.

\[logpgp95 = \beta_1 + \beta_2 avexpr + \beta_3 f\_brit + \beta_4 f\_french + u\]

15.0.1.1 Guidance

ols <- lm(logpgp95 ~ avexpr + f_brit + f_french, data = df)
coeftest(ols, vcov = vcovHC, type = "HC1")

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  4.838521   0.359791 13.4481 < 2.2e-16 ***
avexpr       0.527621   0.051401 10.2648 7.873e-15 ***
f_brit      -0.306681   0.218114 -1.4061   0.16486    
f_french    -0.377069   0.204934 -1.8400   0.07072 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

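The British and French colony dummies are statistically insignificant at the 5% level (the French colony dummy is significant only at the 10% level). The identity of the colonising country therefore does not appear to have a statistically significant effect on the GDP per capita of the colony.

The institution variable is statistically significant with a positive sign: a one-point increase in average protection against expropriation risk is associated with an increase of about 0.53 in log GDP per capita. As expected, better institutions go together with higher national income, although this OLS estimate cannot be read as a causal effect if institutions are endogenous.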
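Because the dependent variable is in logs, the coefficient can be converted into an approximate percentage effect. The short calculation below is a sketch that simply copies the OLS estimate from the output above; the object names `b_avexpr` and `pct_change` are chosen here for illustration.

```r
# OLS estimate of the avexpr coefficient, copied from the output above
b_avexpr <- 0.527621

# Implied percentage difference in GDP per capita for a one-point rise in avexpr
pct_change <- (exp(b_avexpr) - 1) * 100
round(pct_change, 1)
```

Under this reading, a one-point difference in the institution index goes together with GDP per capita that is roughly 69.5% higher, which is a descriptive association rather than a causal estimate.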

15.0.2 Task 2

Re-estimate the above equation, this time by using settler mortality as an instrument for institutional quality.

15.0.2.1 Guidance

Instrumental variable estimation is also referred to as two-stage least squares because of the two-stage estimation that it requires.
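Written out for this application, the two stages correspond to the equations below. The coefficients \(\pi_1, \dots, \pi_4\) and the error terms \(v\) and \(e\) are notation introduced here for illustration; \(\widehat{avexpr}\) denotes the fitted values from the first stage.

\[avexpr = \pi_1 + \pi_2 logem4 + \pi_3 f\_brit + \pi_4 f\_french + v\]

\[logpgp95 = \beta_1 + \beta_2 \widehat{avexpr} + \beta_3 f\_brit + \beta_4 f\_french + e\]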

Stage 1: Regress the endogenous variable on all exogenous variables (instruments and the independent variables in the model other than the endogenous variable) and calculate predictions for the endogenous variable.

# Regress endogenous variable on IV and other exogenous variables 
step1 <- lm(avexpr ~ logem4 + f_brit + f_french, data = df)

# Obtain predictions for the endogenous variable
df$avexpr_hat <- predict(step1)

Stage 2: In stage two, we use the predicted values of the endogenous variable from the first stage. We estimate the main model by replacing the endogenous variable with its predictions.

step2 <- lm(logpgp95 ~ avexpr_hat + f_brit + f_french, data = df)
coeftest(step2, vcov = vcovHC, type = "HC1")

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  1.37240    0.94952  1.4454   0.15356    
avexpr_hat   1.07785    0.15791  6.8257 4.957e-09 ***
f_brit      -0.77770    0.29962 -2.5957   0.01185 *  
f_french    -0.11697    0.21860 -0.5351   0.59456    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient of the institution variable is lower in the OLS estimation (0.528) than in the 2SLS estimation (1.078). This implies that, because of endogeneity, OLS underestimates the impact of institutions (i.e. it is biased downwards).

The comparison of standard errors, on the other hand, reveals the inefficiency of the 2SLS estimation. This is the price of instrumenting the institution variable: 2SLS uses only the part of the variation in institutions that is explained by the instrument. The higher the correlation between the endogenous variable and the instrument, the smaller the gap between the OLS and 2SLS standard errors; the lower the correlation, the larger the 2SLS standard errors will be. The latter case indicates a weak instrument.
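The link between instrument strength and precision can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the Acemoglu et al. data: the function `iv_se()`, the variables `z`, `x`, `y`, `e` and all parameter values are assumptions chosen for illustration. It uses the textbook standard error formula for the case of one regressor and one instrument, in which the error variance is divided by the total variation in `x` times the squared correlation between `x` and `z`.

```r
set.seed(123)
n <- 5000

# Hypothetical data-generating process: z is the instrument and e is a shared
# shock that makes x endogenous. 'strength' scales the effect of z on x.
iv_se <- function(strength) {
  z <- rnorm(n)
  e <- rnorm(n)
  x <- strength * z + e + rnorm(n)
  y <- 1 + 0.5 * x + e + rnorm(n)

  # IV estimates for one regressor and one instrument
  b_iv <- cov(z, y) / cov(z, x)
  a_iv <- mean(y) - b_iv * mean(x)

  # Conventional IV standard error, with residuals based on the observed x
  u_hat  <- y - a_iv - b_iv * x
  sigma2 <- sum(u_hat^2) / (n - 2)
  sqrt(sigma2 / (sum((x - mean(x))^2) * cor(x, z)^2))
}

se_strong <- iv_se(strength = 1.0)  # instrument strongly related to x
se_weak   <- iv_se(strength = 0.1)  # instrument only weakly related to x
c(strong = se_strong, weak = se_weak)
```

In this setup the standard error under the weak instrument should come out several times larger than under the strong one, which is the pattern described above.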

15.0.3 Task 3

Is settler mortality a good (valid) instrument? Please test and discuss.

15.0.3.1 Guidance

Let’s print the results of the first stage regression (step1) from above.

coeftest(step1, vcov = vcovHC, type = "HC1")

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  8.746647   0.741542 11.7952 < 2.2e-16 ***
logem4      -0.534399   0.156066 -3.4242  0.001117 ** 
f_brit       0.629348   0.371462  1.6942  0.095404 .  
f_french     0.047405   0.402817  0.1177  0.906712    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the instrument, log settler mortality (\(logem4\)), has a statistically significant impact on the endogenous variable, average protection against expropriation risk (\(avexpr\)), implying that it has explanatory power in predicting \(avexpr\). This confirms its relevance as an instrument. We can also check the correlation coefficient between the two.

cor(df$avexpr, df$logem4)
[1] -0.5197417

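The pairwise correlation coefficient between the two variables is moderate (-0.5197). Log settler mortality is a relevant instrument, though stronger alternatives could also be sought. Note that these checks cover relevance only: with one instrument for one endogenous variable the model is exactly identified, so the exogeneity of the instrument cannot be tested with these data and has to be argued on economic and historical grounds.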
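A complementary check, not part of the original guidance, is the first-stage F statistic for the instrument. With a single instrument it equals the squared t value of that instrument, and a widely used rule of thumb treats values above 10 as a sign that the instrument is not weak. The sketch below copies the robust t value from the first-stage output above; the object names are chosen here for illustration.

```r
# Robust t value of logem4, copied from the first-stage output above
t_logem4 <- -3.4242

# With one instrument, the first-stage F statistic equals the squared t value
f_first_stage <- t_logem4^2
f_first_stage

# Compare with the common rule-of-thumb threshold of 10
f_first_stage > 10
```

The implied F statistic is about 11.7, only slightly above the threshold, which fits the reading that the instrument is relevant but not especially strong.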

15.0.4 Task 4

Test whether there is an endogeneity problem in the estimation of the above equation.

15.0.4.1 Guidance

We apply the Durbin-Wu-Hausman test. The test consists of two stages. The first stage is the same as the first-stage regression of the IV estimation (i.e. two-stage least squares): we regress the endogenous variable on all exogenous variables (the instrument and the other exogenous variables in the main model) and save the residuals from this model. We then estimate the main model by OLS, this time additionally including the residuals from the first stage. Statistical significance of this residual term implies endogeneity.

The null hypothesis of this test is that there is no endogeneity, and the alternative hypothesis is that there is endogeneity.

Step 1 Save the residuals from the first stage of 2SLS.

df$resid_step1 <- residuals(step1)

Step 2 Estimate the main model of interest, adding these residuals as one of the independent variables.

dwh <- lm(logpgp95 ~ avexpr + f_brit + f_french + resid_step1, 
          data = df)
coeftest(dwh, vcov = vcovHC, type = "HC1")

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  1.37240    0.76424  1.7958  0.077651 .  
avexpr       1.07785    0.12441  8.6639 4.166e-12 ***
f_brit      -0.77770    0.23425 -3.3199  0.001548 ** 
f_french    -0.11697    0.17953 -0.6515  0.517228    
resid_step1 -0.68466    0.15548 -4.4037 4.545e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual term is highly statistically significant. We reject the null hypothesis of no endogeneity. This implies that there is endogeneity. Since we have a valid instrument, we choose IV regression over OLS.
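One detail worth noticing in the output above is that the coefficients on avexpr, f_brit and f_french in the augmented regression are identical to the 2SLS estimates from Task 2 (for example 1.07785 for avexpr). This is a general algebraic property of adding the first-stage residuals as a regressor. The sketch below checks it on simulated data; the variables `z`, `w`, `x`, `y`, `e` and all parameter values are hypothetical and unrelated to the Acemoglu et al. data.

```r
set.seed(42)
n <- 500

# Simulated data (hypothetical): z is the instrument, w an exogenous control,
# and e a common shock that makes x endogenous.
z <- rnorm(n)
w <- rnorm(n)
e <- rnorm(n)
x <- 1 + 0.8 * z + 0.3 * w + e + rnorm(n)
y <- 2 + 0.5 * x - 0.4 * w + e + rnorm(n)

# First stage: fitted values and residuals of the endogenous regressor
first <- lm(x ~ z + w)
x_hat <- fitted(first)
v_hat <- residuals(first)

# Manual second stage versus the augmented (Durbin-Wu-Hausman) regression
second    <- lm(y ~ x_hat + w)
augmented <- lm(y ~ x + w + v_hat)

# Intercept, slope on x_hat / x, and slope on w, side by side
cbind(second = coef(second)[1:3], augmented = coef(augmented)[1:3])
```

The two columns agree up to numerical precision, so the augmented regression delivers the 2SLS point estimates and the endogeneity test in a single step.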

15.0.5 Task 5

Use the ivreg() function (from the ivreg package loaded above) to obtain the 2SLS estimation results.

15.0.5.1 Guidance

iv <- ivreg(logpgp95 ~ avexpr + f_brit + f_french | 
           logem4 + f_brit + f_french, 
          data = df)
coeftest(iv, vcov = vcovHC, type = "HC1")

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  1.37240    1.60014  0.8577   0.39448    
avexpr       1.07785    0.24746  4.3556 5.264e-05 ***
f_brit      -0.77770    0.37703 -2.0627   0.04348 *  
f_french    -0.11697    0.34731 -0.3368   0.73744    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

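The manually calculated 2SLS results have the correct coefficients but incorrect standard errors. The second-stage lm() call computes its residuals from the predicted values avexpr_hat rather than from the observed avexpr, so the estimated error variance, and with it every standard error, is wrong. Here the manual robust standard error of the institution coefficient is 0.158, against 0.247 from ivreg(), so the manual approach understates the uncertainty. The ivreg() function computes the standard errors correctly. Hence, although we have estimated the first and second stage regressions by OLS (the lm() function) to conduct the necessary checks and tests, use ivreg() when it is time to report the results!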
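The source of the discrepancy can be sketched with simulated data. The example below is an illustration under assumed values, not a reproduction of ivreg() internals: the variables `z`, `x`, `y`, `e` and the parameters are hypothetical. It compares the residuals that lm() uses in a manual second stage (built from the fitted values) with the residuals appropriate for 2SLS inference (built from the observed regressor), and rescales the conventional standard error accordingly.

```r
set.seed(1)
n <- 1000

# Simulated data (hypothetical): z instruments the endogenous regressor x;
# the shock e enters x and y with opposite signs.
z <- rnorm(n)
e <- rnorm(n)
x <- 1 + 0.8 * z + e + rnorm(n)
y <- 2 + 0.5 * x - e + rnorm(n)

# Manual two-step estimation
x_hat  <- fitted(lm(x ~ z))
second <- lm(y ~ x_hat)
b      <- coef(second)

# Residuals as lm() computes them (from x_hat) versus the residuals
# that 2SLS inference should use (from the observed x)
resid_naive   <- residuals(second)
resid_correct <- y - b[1] - b[2] * x

sigma_naive   <- sqrt(sum(resid_naive^2) / (n - 2))
sigma_correct <- sqrt(sum(resid_correct^2) / (n - 2))

# Conventional standard errors are proportional to the residual standard error
se_naive   <- summary(second)$coefficients["x_hat", "Std. Error"]
se_correct <- se_naive * sigma_correct / sigma_naive
c(naive = se_naive, corrected = se_correct)
```

In this simulated setup the naive standard error is smaller than the corrected one. The same direction shows up in this chapter: in the Task 6 table the residual standard error is 0.772 for the manual second stage and 1.043 for ivreg(), and the standard errors of the institution coefficient (0.161 and 0.218) differ by almost exactly that ratio.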

15.0.6 Task 6

Compare the estimates for institutional quality in OLS and 2SLS regressions.

15.0.6.1 Guidance

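The table below presents OLS, the manual second stage of 2SLS and the results of ivreg(). Note that stargazer() reports conventional (non-robust) standard errors here, so they differ from the HC1 standard errors shown by coeftest() above.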

stargazer(ols, step2, iv, type = "text")

==============================================================
                                    Dependent variable:       
                              --------------------------------
                                          logpgp95            
                                      OLS         instrumental
                                                    variable  
                                 (1)       (2)        (3)     
--------------------------------------------------------------
avexpr                        0.528***              1.078***  
                               (0.065)              (0.218)   
                                                              
avexpr_hat                              1.078***              
                                         (0.161)              
                                                              
f_brit                         -0.307   -0.778***   -0.778**  
                               (0.211)   (0.262)    (0.354)   
                                                              
f_french                       -0.377    -0.117      -0.117   
                               (0.232)   (0.262)    (0.355)   
                                                              
Constant                      4.839***    1.372      1.372    
                               (0.436)   (1.027)    (1.388)   
                                                              
--------------------------------------------------------------
Observations                     64        64          64     
R2                              0.565     0.479      0.048    
Adjusted R2                     0.543     0.453      0.001    
Residual Std. Error (df = 60)   0.705     0.772      1.043    
F Statistic (df = 3; 60)      25.950*** 18.387***             
==============================================================
Note:                              *p<0.1; **p<0.05; ***p<0.01

The coefficient of the institution variable is roughly half as large under OLS (0.528, column 1) as under 2SLS (1.078, columns 2 and 3). This implies that, because of endogeneity, OLS underestimates the impact of institutions (i.e. it is biased downwards).

The standard errors, on the other hand, reveal the inefficiency of the 2SLS estimation: the standard error of the institution coefficient rises from 0.065 under OLS to 0.218 under ivreg(). Column 2 reproduces the 2SLS coefficients but not the correct standard errors, so inference should be based on column 3. As discussed under Task 2, the weaker the correlation between the endogenous variable and the instrument, the larger the 2SLS standard errors will be, and a very low correlation indicates a weak instrument.