12 Panel Data Models

European Countries Gasoline Consumption data is obtained from Abay Mulatu 316ECN Applied Econometrics lecture material, Coventry University.

We will be using the all country sample of the gasoline demand data.

We start by loading the required libraries.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(Hmisc) # add labels to variables


Attaching package: 'Hmisc'

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units

library(ggplot2)
library(dplyr) # for data manipulation
library(plm) # to estimate linear panel data models


Attaching package: 'plm'

The following objects are masked from 'package:dplyr':

    between, lag, lead

library(fastDummies) # create dummies based on categorical (factor) variable

12.1 Least Squares Dummy Variables (LSDV) Approach

We estimated this model in the previous section using two-country sample of gasoline consumption data. We will replicate it with full sample of countries.

12.1.1 Task 1

Import the gasoline-demand-all-countries.csv data, label variables and create country dummies.

12.1.1.1 Guidance

This is a replication of what we have done in the previous section.

df <- read.csv("~/Desktop/R-workshops/assets/data/gasoline-demand-all-countries.csv", stringsAsFactors=TRUE)
#View(df)
# label variables
label(df$L_gas_cons_pcar) <- "Logarithm of gasoline consumption per car"
label(df$L_income_pc) <- "Logarithm of real income per capita"
label(df$L_gas_price) <- "Logarithm of real gasoline price per gallon"
label(df$L_cars_pc) <- "Logarithm of number of cars per capita"
# create country dummies
# below is what we have done before
# df$italy <- ifelse(df$country == "ITALY", 1, 0)
# df$denmark <- ifelse(df$country == "DENMARK", 1, 0)

Because the data now has 18 countries, I will use another approach to create the country dummies:

df <- dummy_cols(df, select_columns = "country", remove_first_dummy = FALSE, remove_selected_columns = FALSE)

View(df) and you will see that country dummies are created with names country_AUSTRIA, country_BELGIUM, etc. You may remove the country_ in front of all these dummy country names and convert letters to lower case by running the following line:

df <- df %>%
  rename_with(~ tolower(gsub("country_", "", .)), starts_with("country_"))

12.1.2 Task 2

Provide a scatter plot of L_gas_cons_pcar and L_income_pc. Differentiate each country data point by adding color.

ggplot(df, aes(x = L_income_pc, y = L_gas_cons_pcar, color = country)) +
  geom_point() +
  labs(x = "Log income per capita", y = "Log gasoline consumption per car",
       title = "Negative Relationship Between Income and Gasoline Consumption")

#ggsave("plots/panel-data-analysis/scatter-all-countries.png")

12.1.3 Task 3

Estimate a Least Squares Dummy Variable Model that regresses logarithm of gasoline consumption per car (L_gas_cons_pcar) on log of income per capita (L_income_pc), log of gasoline price per gallon (L_gas_price) and number of cars per capita (L_cars_pc). Comment on the results.

12.1.3.1 Guidance

There are multiple ways of approaching data analysis. We can use the country dummies that we created above or alternatively, we can ask R to add dummies based on our country variable. I will show the latter below. Try replicating this using dummies we created in the previous task.

Least Squares Dummy Variable Estimation:

# Least Squares Dummy Variable
lsdv <- lm(L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc + factor(country), data = df)
summary(lsdv)


Call:
lm(formula = L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc + 
    factor(country), data = df)

Residuals:
Logarithm of gasoline consumption per car 
     Min       1Q   Median       3Q      Max 
-0.37877 -0.03976  0.00465  0.04541  0.36286 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 2.28586    0.22832  10.011  < 2e-16 ***
L_income_pc                 0.66225    0.07339   9.024  < 2e-16 ***
L_gas_price                -0.32170    0.04410  -7.295 2.35e-12 ***
L_cars_pc                  -0.64048    0.02968 -21.580  < 2e-16 ***
factor(country)BELGIUM     -0.12030    0.03415  -3.523 0.000489 ***
factor(country)CANADA       0.75598    0.04075  18.554  < 2e-16 ***
factor(country)DENMARK      0.10360    0.03660   2.830 0.004944 ** 
factor(country)FRANCE      -0.08108    0.03356  -2.416 0.016256 *  
factor(country)GERMANY     -0.13599    0.03188  -4.266 2.63e-05 ***
factor(country)GREECE       0.05125    0.04153   1.234 0.218049    
factor(country)IRELAND      0.30647    0.03529   8.683  < 2e-16 ***
factor(country)ITALY       -0.05331    0.03711  -1.436 0.151868    
factor(country)JAPAN        0.09007    0.03861   2.333 0.020262 *  
factor(country)NETHERLANDS -0.05106    0.03358  -1.521 0.129280    
factor(country)NORWAY      -0.06916    0.04041  -1.711 0.087967 .  
factor(country)SPAIN       -0.60408    0.09122  -6.622 1.49e-10 ***
factor(country)SWEDEN       0.74049    0.18008   4.112 4.99e-05 ***
factor(country)SWITZERLAND  0.11665    0.03471   3.360 0.000872 ***
factor(country)TURKEY       0.22413    0.04764   4.704 3.79e-06 ***
factor(country)UK           0.05959    0.03019   1.974 0.049237 *  
factor(country)USA          0.76940    0.04458  17.260  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09233 on 321 degrees of freedom
Multiple R-squared:  0.9734,    Adjusted R-squared:  0.9717 
F-statistic: 586.6 on 20 and 321 DF,  p-value: < 2.2e-16

12.1.4 Task 4

Test if the the country dummies in the LSDV model above are jointly statistically significant.

12.1.4.1 Guidance

We need to perform an F test for restrictions. The LSDV model estimated above is the unrestricted model. We may re-estimate a restricted version of it by removing the country dummies. This model is our pooled OLS.

pooled_ols <- lm(L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, data = df)
summary(pooled_ols)


Call:
lm(formula = L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, 
    data = df)

Residuals:
Logarithm of gasoline consumption per car 
     Min       1Q   Median       3Q      Max 
-0.38412 -0.15307 -0.04981  0.16529  0.59684 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.39133    0.11693   20.45   <2e-16 ***
L_income_pc  0.88996    0.03581   24.86   <2e-16 ***
L_gas_price -0.89180    0.03031  -29.42   <2e-16 ***
L_cars_pc   -0.76337    0.01861  -41.02   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.21 on 338 degrees of freedom
Multiple R-squared:  0.8549,    Adjusted R-squared:  0.8536 
F-statistic:   664 on 3 and 338 DF,  p-value: < 2.2e-16

\(H_0 : \beta_{belgium} = \beta_{canada} = \dots = \beta_{usa} = 0\)

\(H_1 : \text{at least one is different than zero}\)

Now, compare the restricted model with the unrestricted one:

# anova_r <- anova(restricted_model, full_model)
anova_r <- anova(pooled_ols, lsdv)

print(anova_r)

Analysis of Variance Table

Model 1: L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc
Model 2: L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc + factor(country)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    338 14.9044                                  
2    321  2.7365 17    12.168 83.961 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject \(H_0\) . The country dummies are jointly statistically significant.

12.1.5 Task 5

Replicate the above test, this time using R’s fixed effects estimation.

plm is the function we will use to estimate a panel linear model, with the index showing our panel data cross-section and time identifiers and the model telling R which panel linear model to estimate:

# Panel Data Fixed Effects Model
fixed_1 <- plm(L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, 
               data = df, index = c("country", "year"), model = "within")
summary(fixed_1)

Oneway (individual) effect Within Model

Call:
plm(formula = L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, 
    data = df, model = "within", index = c("country", "year"))

Balanced Panel: n = 18, T = 19, N = 342

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-0.378774 -0.039758  0.004650  0.045412  0.362856 

Coefficients:
             Estimate Std. Error  t-value  Pr(>|t|)    
L_income_pc  0.662250   0.073386   9.0242 < 2.2e-16 ***
L_gas_price -0.321702   0.044099  -7.2950 2.355e-12 ***
L_cars_pc   -0.640483   0.029679 -21.5804 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    17.061
Residual Sum of Squares: 2.7365
R-Squared:      0.8396
Adj. R-Squared: 0.82961
F-statistic: 560.093 on 3 and 321 DF, p-value: < 2.22e-16

Note that the slope coefficients we obtain above are the same as those we obtained in our LSDV model.

The test between Fixed Effects and Pooled OLS models

pFtest(fixed_1, pooled_ols)


    F test for individual effects

data:  L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc
F = 83.961, df1 = 17, df2 = 321, p-value < 2.2e-16
alternative hypothesis: significant effects

Again, note that the F-statistic reported above (83.961) is the same as what we obtained through anova() comparison.

12.1.6 Task 6

Estimate the Within-groups model manually.

12.1.6.1 Guidance

For this, we will need to regress the deviations of observations from group means using a linear model without a constant.

Let’s first create the group averages using the dplyr package

df <- df %>%
  group_by(country) %>%
  mutate(m_L_gas_cons_pcar = mean(L_gas_cons_pcar, na.rm = TRUE),
         m_L_income_pc = mean(L_income_pc, na.rm = TRUE),
         m_L_gas_price = mean(L_gas_price, na.rm = TRUE),
         m_L_cars_pc = mean(L_cars_pc, na.rm = TRUE)) %>%
  ungroup()

We then find the deviations of each observation from the group average.

df <- df %>% 
  mutate(wi_L_gas_cons_pcar = L_gas_cons_pcar - m_L_gas_cons_pcar,
         wi_L_income_pc = L_income_pc - m_L_income_pc,
         wi_L_gas_price = L_gas_price - m_L_gas_price,
        wi_L_cars_pc = L_cars_pc - m_L_cars_pc)

In the final step, we use these deviations in a regression without a constant term. But let us see the scatter plot of gasoline consumption per car and income per capita after this data transformation.

ggplot(df, aes(x = wi_L_income_pc, y = wi_L_gas_cons_pcar, color = country)) + 
  geom_point()

Regression (within-groups estimation):

wg <- lm(wi_L_gas_cons_pcar ~ 0 + wi_L_income_pc + wi_L_gas_price + wi_L_cars_pc,
         data = df)
summary(wg)


Call:
lm(formula = wi_L_gas_cons_pcar ~ 0 + wi_L_income_pc + wi_L_gas_price + 
    wi_L_cars_pc, data = df)

Residuals:
Logarithm of gasoline consumption per car 
     Min       1Q   Median       3Q      Max 
-0.37877 -0.03976  0.00465  0.04541  0.36286 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
wi_L_income_pc  0.66225    0.07141   9.274  < 2e-16 ***
wi_L_gas_price -0.32170    0.04291  -7.497 5.77e-13 ***
wi_L_cars_pc   -0.64048    0.02888 -22.177  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08985 on 339 degrees of freedom
Multiple R-squared:  0.8396,    Adjusted R-squared:  0.8382 
F-statistic: 591.5 on 3 and 339 DF,  p-value: < 2.2e-16

12.1.7 Task 7

Estimate a random effects model and compare this with Pooled OLS.

To estimate a panel random effect model, we use the same panel data linear model plm() function as above, but will change the model to random.

# Panel Data Random Effects Model
random_1 <- plm(L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, 
               data = df, index = c("country", "year"), model = "random")
summary(random_1)

Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, 
    data = df, model = "random", index = c("country", "year"))

Balanced Panel: n = 18, T = 19, N = 342

Effects:
                   var  std.dev share
idiosyncratic 0.008525 0.092330 0.182
individual    0.038238 0.195545 0.818
theta: 0.8923

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.3977058 -0.0520350  0.0050877  0.0582288  0.3763726 

Coefficients:
             Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept)  1.996698   0.184326  10.8324 < 2.2e-16 ***
L_income_pc  0.554986   0.059128   9.3861 < 2.2e-16 ***
L_gas_price -0.420389   0.039978 -10.5155 < 2.2e-16 ***
L_cars_pc   -0.606840   0.025515 -23.7836 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    18.054
Residual Sum of Squares: 3.0817
R-Squared:      0.82931
Adj. R-Squared: 0.8278
Chisq: 1642.2 on 3 DF, p-value: < 2.22e-16

The comparison between the above the Pooled OLS is done by Breusch-Pagan LM Test.

plmtest(random_1, type = "bp")


    Lagrange Multiplier Test - (Breusch-Pagan)

data:  L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc
chisq = 1465.6, df = 1, p-value < 2.2e-16
alternative hypothesis: significant effects

12.1.8 Task 8

Apply the Hausman Test to compare Random Effects and Fixed Effects Models.

phtest(fixed_1, random_1)


    Hausman Test

data:  L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc
chisq = 302.8, df = 3, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

The null of Random Effects is rejected. We choose the Fixed Effects.

12.1.9 Task 9

Estimate a two-way fixed effects model and test for the joint significance of the time effects.

# Panel Data Fixed Effects Model
fixed_2 <- plm(L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc + factor(year), 
               data = df, index = c("country", "year"), model = "within")
summary(fixed_2)

Oneway (individual) effect Within Model

Call:
plm(formula = L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc + 
    factor(year), data = df, model = "within", index = c("country", 
    "year"))

Balanced Panel: n = 18, T = 19, N = 342

Residuals:
       Min.     1st Qu.      Median     3rd Qu.        Max. 
-0.41920085 -0.03886111  0.00018502  0.04199566  0.23067839 

Coefficients:
                  Estimate Std. Error  t-value  Pr(>|t|)    
L_income_pc       0.051369   0.091386   0.5621 0.5744611    
L_gas_price      -0.192850   0.042860  -4.4995 9.718e-06 ***
L_cars_pc        -0.593448   0.027669 -21.4479 < 2.2e-16 ***
factor(year)1961  0.040970   0.027248   1.5036 0.1337236    
factor(year)1962  0.044249   0.027595   1.6035 0.1098635    
factor(year)1963  0.064744   0.028277   2.2897 0.0227268 *  
factor(year)1964  0.105995   0.029297   3.6179 0.0003479 ***
factor(year)1965  0.124134   0.030049   4.1310 4.677e-05 ***
factor(year)1966  0.167830   0.031046   5.4058 1.310e-07 ***
factor(year)1967  0.198832   0.032048   6.2042 1.801e-09 ***
factor(year)1968  0.230077   0.033201   6.9299 2.537e-11 ***
factor(year)1969  0.242999   0.035304   6.8831 3.374e-11 ***
factor(year)1970  0.275080   0.037182   7.3982 1.365e-12 ***
factor(year)1971  0.304198   0.038516   7.8980 5.262e-14 ***
factor(year)1972  0.332136   0.040532   8.1944 7.160e-15 ***
factor(year)1973  0.369707   0.043209   8.5562 5.913e-16 ***
factor(year)1974  0.327938   0.042232   7.7652 1.266e-13 ***
factor(year)1975  0.362392   0.041877   8.6538 2.984e-16 ***
factor(year)1976  0.370891   0.043626   8.5016 8.651e-16 ***
factor(year)1977  0.385702   0.044559   8.6559 2.939e-16 ***
factor(year)1978  0.400956   0.046409   8.6397 3.295e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    17.061
Residual Sum of Squares: 1.997
R-Squared:      0.88295
Adj. R-Squared: 0.86827
F-statistic: 108.839 on 21 and 303 DF, p-value: < 2.22e-16

# fixed_1 : One-way fixed effects model (cross-section effects included)
# fixed_2 : Two-way fixed effects model (cross-section and time-effects included)
# Joint statistical significance of the time effects
pFtest(fixed_2, fixed_1)


    F test for individual effects

data:  L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc + factor(year)
F = 6.2338, df1 = 18, df2 = 303, p-value = 5.36e-13
alternative hypothesis: significant effects

Time effects are jointly statistically significant. We choose the two-way FE model over one-way FE model.

Note that we can estimate the above two-way model by adding an effect = "twoways" option in R, without the need for an explicit inclusion of time dummies.

# Panel Data Fixed Effects Model
fixed_3 <- plm(L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, 
               data = df, index = c("country", "year"), model = "within", effect = "twoways")
summary(fixed_3)

Twoways effects Within Model

Call:
plm(formula = L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, 
    data = df, effect = "twoways", model = "within", index = c("country", 
        "year"))

Balanced Panel: n = 18, T = 19, N = 342

Residuals:
       Min.     1st Qu.      Median     3rd Qu.        Max. 
-0.41920085 -0.03886111  0.00018502  0.04199566  0.23067839 

Coefficients:
             Estimate Std. Error  t-value  Pr(>|t|)    
L_income_pc  0.051369   0.091386   0.5621    0.5745    
L_gas_price -0.192850   0.042860  -4.4995 9.718e-06 ***
L_cars_pc   -0.593448   0.027669 -21.4479 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    10.644
Residual Sum of Squares: 1.997
R-Squared:      0.81239
Adj. R-Squared: 0.78886
F-statistic: 437.338 on 3 and 303 DF, p-value: < 2.22e-16

12.1.10 Task 10

Estimate a two-way random effects model.

# Panel Data Fixed Effects Model
random_2 <- plm(L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, 
               data = df, index = c("country", "year"), 
               model = "random", effect = "twoways")
summary(random_2)

Twoways effects Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc, 
    data = df, effect = "twoways", model = "random", index = c("country", 
        "year"))

Balanced Panel: n = 18, T = 19, N = 342

Effects:
                   var  std.dev share
idiosyncratic 0.006591 0.081183 0.147
individual    0.038340 0.195805 0.853
time          0.000000 0.000000 0.000
theta: 0.9053 (id) 0 (time) 0 (total)

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.3956900 -0.0499700  0.0075613  0.0543043  0.3748214 

Coefficients:
             Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept)  2.040793   0.191508  10.6564 < 2.2e-16 ***
L_income_pc  0.564562   0.060854   9.2773 < 2.2e-16 ***
L_gas_price -0.404936   0.040369 -10.0309 < 2.2e-16 ***
L_cars_pc   -0.609360   0.025970 -23.4641 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    17.829
Residual Sum of Squares: 3.0145
R-Squared:      0.83092
Adj. R-Squared: 0.82942
Chisq: 1661.05 on 3 DF, p-value: < 2.22e-16

12.1.11 Task 11

Perform a Hausman Test comparing two-way fixed and random effect models.

phtest(fixed_2, random_2)


    Hausman Test

data:  L_gas_cons_pcar ~ L_income_pc + L_gas_price + L_cars_pc + factor(year)
chisq = 134.07, df = 3, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

We reject the null hypothesis of random effects and choose the fixed effects model.

To sum up, we started by estimating a Pooled OLS model, followed by a fixed effects specification. We have seen alternative ways of estimating the Fixed Effects Model, i.e. the Least Squares Dummy Variable Apporoach, the Within Groups Estimator and R’s built in function plm with the argument model = "within".

We then moved on to Random Effects Model using the plm function with the argument model = "random".

The F-test comparison between the Pooled OLS and Fixed Effects Model indicated the existence of unobserved heterogeneity (time invariant country-specific effects).

The Breusch-Pagan Test was applied to compare the Pooled and Random Effects models. This test also rejected the Pooled OLS and indicated that the variation of the country-specific effects is different than zero.

A comparison between the Fixed Effects and Random Effects model through the Hausman Test rejected the null of Random Effects specification in favour of a Fixed Effects Model.

We then moved on to check existence of time-specific effects that are common to all cross-sectional units. This was done by an F-test, which this time, compared one-way and two way fixed effects specification and tested for the joint significance of the time effects.

The null of time effects being jointly equal to zero is rejected, suggesting estimation of two-way models.

A final comparison was between the Two-way Fixed Effects and Two-way Random effects models. The Hausman Test again rejected the Random Effects specification, which led us to decide on the Two-way Fixed Effects model as our final specification.