5  Introduction to Regression Analysis

5.1 Example 1: Crime data

The example and instructions provided in this section are taken from Riegler (2022).

Suppose you are examining the relationship between the number of crimes and the number of police officers. Below, we will generate descriptive statistics, create a scatter plot and estimate an OLS regression.

We will use the crime data set, which is already saved in Rdata format.

5.1.1 Task 1

Open crime_v2.Rdata (if it is not already open). You may do this through the menu or from the command line using the load() function:

load("~/Desktop/R-workshops/assets/data/crime_v2.Rdata")

5.1.2 Task 2

Check the summary statistics for the crimes and officers variables.

summary(crime[, c("crimes", "officers")])
     crimes          officers     
 Min.   :  5276   Min.   : 109.0  
 1st Qu.: 19658   1st Qu.: 402.8  
 Median : 32518   Median : 694.5  
 Mean   : 38123   Mean   : 902.1  
 3rd Qu.: 49434   3rd Qu.:1212.0  
 Max.   :152962   Max.   :4092.0  
# Standard deviation for the 'crimes' variable
sd_crimes <- sd(crime$crimes, na.rm = TRUE)

# Standard deviation for the 'officers' variable
sd_officers <- sd(crime$officers, na.rm = TRUE)

# Print the results
sd_crimes
[1] 27660.3
sd_officers
[1] 721.7255

Note the na.rm = TRUE above. This argument ensures that any NA (missing) values are removed before the calculation.
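A quick illustration of what na.rm does, using a made-up vector with one missing value:

```r
x <- c(2, 4, NA, 6)

sd(x)                # returns NA because of the missing value
sd(x, na.rm = TRUE)  # drops the NA first: sd of c(2, 4, 6), which is 2
```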

5.1.3 Task 3

In addition to checking summary statistics, it is always wise to visualise your data before getting into more complicated modelling.

For this task, generate a scatter plot with the number of police officers on the y-axis and the number of crimes on the x-axis.

plot(crime$officers~crime$crimes, 
     main = "Relationship between number of police officers and crime")

5.1.4 Task 4

Calculate the Covariance and the Correlation Coefficient between number of crimes and number of police officers. Comment on their values.

5.1.4.1 Guidance

A scatter plot is a good start for identifying relationships between two variables, but it is not sufficient to identify accurately how strong the relationship between crimes and officers is. There are two numerical statistics that provide information about the relationship between two variables: the Covariance and the Correlation Coefficient.

To calculate the covariance between the two variables, use the following command:

cov(crime$officers, crime$crimes)
[1] 18212436

The result is 18,212,436! This number may appear very large, but the value we obtain as the covariance depends on the measurement units of the variables. This measure (i.e. the covariance) does not provide any information on how strong the relationship between crimes and officers is. It only reveals that there is a positive relationship between the number of police officers and the number of crimes committed.
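Because the covariance depends on the scale of the variables, rescaling one of them changes its value, while the correlation coefficient does not. A small simulated sketch (the numbers below are made up, loosely mimicking the officers and crimes variables):

```r
set.seed(1)
officers_sim <- rnorm(50, mean = 900, sd = 700)            # hypothetical officer counts
crimes_sim   <- 40 * officers_sim + rnorm(50, sd = 5000)   # hypothetical crime counts

# The covariance changes with the units of measurement ...
cov(crimes_sim, officers_sim)
cov(crimes_sim / 1000, officers_sim)  # measuring crimes in thousands shrinks it 1000-fold

# ... but the correlation coefficient is unaffected by rescaling
cor(crimes_sim, officers_sim)
cor(crimes_sim / 1000, officers_sim)
```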

Instead of using the covariance, we can use a standardised covariance: the correlation coefficient. To calculate the correlation, we only have to adjust the covariance command slightly.

cor(crime$officers, crime$crimes)
[1] 0.9123032

The correlation coefficient between number of officers and crimes is 0.91. We conclude that there is a strong positive relationship between our two variables.

5.1.5 Task 5

Regress the number of police officers on crimes and comment on:

  • the sign and magnitude of the regression coefficients

  • the goodness of fit of the estimated model.

lm(officers ~ crimes, data = crime)

Call:
lm(formula = officers ~ crimes, data = crime)

Coefficients:
(Intercept)       crimes  
    -5.4183       0.0238  

We can save this estimation as an object (please note that we use model_1 below, but you may give it any name as long as it satisfies R's naming conventions):

model_1 <- lm(officers ~ crimes, data = crime)
# display the model
model_1

Call:
lm(formula = officers ~ crimes, data = crime)

Coefficients:
(Intercept)       crimes  
    -5.4183       0.0238  

Although we see the estimated coefficients in the above output, we do not have information about the other statistics that we need to proceed. We use the summary() function below:

summary(model_1)

Call:
lm(formula = officers ~ crimes, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-756.64 -153.71  -25.75   89.64 1000.97 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.418291  75.587257  -0.072    0.943    
crimes       0.023804   0.001611  14.777   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 298.9 on 44 degrees of freedom
Multiple R-squared:  0.8323,    Adjusted R-squared:  0.8285 
F-statistic: 218.4 on 1 and 44 DF,  p-value: < 2.2e-16

The intercept term is not statistically significant.

The crimes variable is statistically significant at the 0.1% level (note the significance codes in the output).

The slope coefficient states that for every additional crime, we observe, on average, 0.024 more police officers. Using more reader-friendly numbers, we can also say that for every 1,000 additional crimes committed within a city, 24 more police officers are employed. Note how the latter phrasing makes more intuitive sense.

\(R^2\) is the measure that provides information on the overall goodness of fit of the model. In this case it is 0.83. This means that 83% of the variation in police officers can be explained with the variation in number of crimes committed. Our estimated model has a good degree of explanatory power.

Looking at the F-statistic (218.4 with a p-value of almost zero), we can conclude that the model, overall, is statistically significant.
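The statistics discussed above can also be pulled out of a fitted model object programmatically, rather than read off the printed output. A minimal sketch, using the built-in mtcars data as a stand-in since it ships with every R installation:

```r
# Fit a simple regression on built-in data
m <- lm(mpg ~ wt, data = mtcars)
s <- summary(m)

coef(m)                        # intercept and slope estimates
s$r.squared                    # R-squared (goodness of fit)
s$coefficients[, "Pr(>|t|)"]   # p-values for each coefficient
s$fstatistic                   # F-statistic with its degrees of freedom
```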

5.1.6 Task 6

Add a regression line to the scatter plot you created in Task 3.

5.1.6.1 Guidance

To add a regression line to the plot, we have to use the previously saved regression object model_1 and add it to the previous scatter plot.

plot(crime$officers~crime$crimes, 
     main = "Relationship between number of police officers and crime")
abline(model_1)

5.2 Example 2: Wage data

5.2.1 Task 1

5.2.1.1 Task

Import wage2.xls data into R, view the first few rows of the data to get an idea of the contents of the variables, and then save the data in R format.

5.2.1.2 Guidance

Use the read_excel() and head() functions.

# install.packages("readxl")
library(readxl)

# Import Excel data
wage2 <- read_excel("./assets/data/wage2.xls", sheet = "wage2")
head(wage2)
# A tibble: 6 × 15
   wage hours    IQ   KWW  educ exper tenure   age married south urban  sibs
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>
1   769    40    93    35    12    11      2    31       1     0     1     1
2   808    50   119    41    18    11     16    37       1     0     1     1
3   825    40   108    46    14    11      9    33       1     0     1     1
4   650    40    96    32    12    13      7    32       1     0     1     4
5   562    40    74    27    11    14      5    34       1     0     1    10
6  1400    40   116    43    16    14      2    35       1     0     1     1
# ℹ 3 more variables: brthord <dbl>, meduc <dbl>, feduc <dbl>
# Save data in R format
save(wage2, file = "./assets/data/wage2.Rdata")
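save() writes one or more R objects to a single file, and load() later restores them under their original names. A small self-contained round trip, using a hypothetical demo_df data frame and a temporary file so nothing in your project folder is touched:

```r
demo_df <- data.frame(id = 1:3, value = c(10, 20, 30))

path <- tempfile(fileext = ".Rdata")
save(demo_df, file = path)   # write the object to disk

rm(demo_df)                  # remove it from the workspace ...
load(path)                   # ... and restore it under its original name
demo_df
```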

5.2.2 Task 2

5.2.2.1 Task

Label variable educ as “years of schooling” and exper as “years of experience”.

5.2.2.2 Guidance

We will need the expss package to label the variables. The installation command is deactivated (commented out) below since we have already completed this step above. After running the command below, check the changes in the data from the Environment window at the top right.

# install.packages("expss")
library(expss)
Loading required package: maditr

To drop variable use NULL: let(mtcars, am = NULL) %>% head()

Use 'expss_output_viewer()' to display tables in the RStudio Viewer.
 To return to the console output, use 'expss_output_default()'.
wage2 <- apply_labels(wage2, 
                      educ = "years of schooling", 
                      exper = "years of experience")

5.2.3 Task 3

5.2.3.1 Task

Generate two new variables: hourly wage and logarithmic wage.

5.2.3.2 Guidance

# Generate new variables
wage2$hourly_wage <- wage2$wage / wage2$hours
wage2$ln_wage <- log(wage2$wage)

5.2.4 Task 4

5.2.4.1 Task

Check the summary statistics for (i) the wage variable, (ii) for all variables.

5.2.4.2 Guidance

We will use the summary() function.

# Summary statistics for the wage variable only
summary(wage2$wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  115.0   669.0   905.0   957.9  1160.0  3078.0 
# Summary statistics for all variables in wage2 data
summary(wage2)
      wage            hours             IQ             KWW       
 Min.   : 115.0   Min.   :20.00   Min.   : 50.0   Min.   :12.00  
 1st Qu.: 669.0   1st Qu.:40.00   1st Qu.: 92.0   1st Qu.:31.00  
 Median : 905.0   Median :40.00   Median :102.0   Median :37.00  
 Mean   : 957.9   Mean   :43.93   Mean   :101.3   Mean   :35.74  
 3rd Qu.:1160.0   3rd Qu.:48.00   3rd Qu.:112.0   3rd Qu.:41.00  
 Max.   :3078.0   Max.   :80.00   Max.   :145.0   Max.   :56.00  
                                                                 
      educ           exper           tenure            age       
 Min.   : 9.00   Min.   : 1.00   Min.   : 0.000   Min.   :28.00  
 1st Qu.:12.00   1st Qu.: 8.00   1st Qu.: 3.000   1st Qu.:30.00  
 Median :12.00   Median :11.00   Median : 7.000   Median :33.00  
 Mean   :13.47   Mean   :11.56   Mean   : 7.234   Mean   :33.08  
 3rd Qu.:16.00   3rd Qu.:15.00   3rd Qu.:11.000   3rd Qu.:36.00  
 Max.   :18.00   Max.   :23.00   Max.   :22.000   Max.   :38.00  
                                                                 
    married          south            urban             sibs       
 Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 1.000  
 Median :1.000   Median :0.0000   Median :1.0000   Median : 2.000  
 Mean   :0.893   Mean   :0.3412   Mean   :0.7176   Mean   : 2.941  
 3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 4.000  
 Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :14.000  
                                                                   
    brthord           meduc           feduc        hourly_wage    
 Min.   : 1.000   Min.   : 0.00   Min.   : 0.00   Min.   :  2.30  
 1st Qu.: 1.000   1st Qu.: 8.00   1st Qu.: 8.00   1st Qu.: 15.07  
 Median : 2.000   Median :12.00   Median :10.00   Median : 21.02  
 Mean   : 2.277   Mean   :10.68   Mean   :10.22   Mean   : 22.32  
 3rd Qu.: 3.000   3rd Qu.:12.00   3rd Qu.:12.00   3rd Qu.: 27.70  
 Max.   :10.000   Max.   :18.00   Max.   :18.00   Max.   :102.60  
 NA's   :83       NA's   :78      NA's   :194                     
    ln_wage     
 Min.   :4.745  
 1st Qu.:6.506  
 Median :6.808  
 Mean   :6.779  
 3rd Qu.:7.056  
 Max.   :8.032  
                

5.2.5 Task 5

5.2.5.1 Task

Calculate the correlation coefficient between wage and education.

5.2.5.2 Guidance

We can calculate the correlation coefficients using the cor() function. In the first example below, the correlation coefficient is reported as a single number, while in the second example, we get a correlation matrix.

In most empirical work, we are usually interested in the pairwise correlations among all variables. Hence, we may use the correlation matrix to check the pairwise correlations among all variables in our sample. This is provided in the third example below.

The use = "complete.obs" argument added to the commands below asks R to handle missing values by casewise deletion (the third example uses use = "pairwise.complete.obs", which computes each correlation from all observations that are complete on that pair of variables).

# Correlation
cor(wage2$wage, wage2$educ)
[1] 0.3271087
# Correlation
cor(wage2[, c("wage", "educ")], use = "complete.obs")
          wage      educ
wage 1.0000000 0.3271087
educ 0.3271087 1.0000000
cor(wage2, use = "pairwise.complete.obs")
                    wage        hours          IQ         KWW        educ
wage         1.000000000 -0.009504302  0.30908783  0.32613058  0.32710869
hours       -0.009504302  1.000000000  0.07383930  0.11388938  0.09100889
IQ           0.309087827  0.073839301  1.00000000  0.41351552  0.51569701
KWW          0.326130577  0.113889381  0.41351552  1.00000000  0.38813424
educ         0.327108690  0.091008888  0.51569701  0.38813424  1.00000000
exper        0.002189702 -0.062126227 -0.22491253  0.01745245 -0.45557312
tenure       0.128266391 -0.055528006  0.04215883  0.14139800 -0.03616655
age          0.156701761  0.024811636 -0.04374091  0.39305297 -0.01225396
married      0.136582670  0.032563350 -0.01466753  0.08994782 -0.05856602
south       -0.159387287 -0.029519177 -0.20978466 -0.09439242 -0.09703298
urban        0.198406472  0.016573046  0.03893553  0.09819025  0.07215091
sibs        -0.159203728 -0.049602555 -0.28477277 -0.28497534 -0.23928810
brthord     -0.145485385 -0.043129582 -0.17943947 -0.15358472 -0.20499246
meduc        0.214831839  0.076619806  0.33180383  0.24079168  0.36423913
feduc        0.237586922  0.063172297  0.34390758  0.23488927  0.42692545
hourly_wage  0.931240501 -0.317645466  0.26502635  0.26059936  0.27167136
ln_wage      0.953141156 -0.047219079  0.31478770  0.30627128  0.31211665
                   exper      tenure          age      married       south
wage         0.002189702  0.12826639  0.156701761  0.136582670 -0.15938729
hours       -0.062126227 -0.05552801  0.024811636  0.032563350 -0.02951918
IQ          -0.224912532  0.04215883 -0.043740911 -0.014667528 -0.20978466
KWW          0.017452446  0.14139800  0.393052967  0.089947816 -0.09439242
educ        -0.455573115 -0.03616655 -0.012253956 -0.058566019 -0.09703298
exper        1.000000000  0.24365440  0.495329763  0.106349115  0.02125724
tenure       0.243654402  1.00000000  0.270601647  0.072605374 -0.06169141
age          0.495329763  0.27060165  1.000000000  0.106980249 -0.02947768
married      0.106349115  0.07260537  0.106980249  1.000000000  0.02275672
south        0.021257241 -0.06169141 -0.029477681  0.022756718  1.00000000
urban       -0.047385845 -0.03848582 -0.006749288 -0.040248179 -0.10989797
sibs         0.064310470 -0.03916116 -0.040719238 -0.004327422  0.06631979
brthord      0.088300019 -0.02847775  0.005435916 -0.014737189  0.09370679
meduc       -0.186317286 -0.01496769 -0.029319099 -0.022763437 -0.15787359
feduc       -0.256792630 -0.05924123 -0.071303285 -0.020324390 -0.17236334
hourly_wage  0.017757793  0.13541822  0.126683019  0.115115701 -0.14716118
ln_wage      0.020601158  0.18585262  0.161822314  0.149975894 -0.19481092
                   urban         sibs      brthord       meduc       feduc
wage         0.198406472 -0.159203728 -0.145485385  0.21483184  0.23758692
hours        0.016573046 -0.049602555 -0.043129582  0.07661981  0.06317230
IQ           0.038935525 -0.284772765 -0.179439471  0.33180383  0.34390758
KWW          0.098190247 -0.284975345 -0.153584717  0.24079168  0.23488927
educ         0.072150908 -0.239288104 -0.204992462  0.36423913  0.42692545
exper       -0.047385845  0.064310470  0.088300019 -0.18631729 -0.25679263
tenure      -0.038485824 -0.039161158 -0.028477749 -0.01496769 -0.05924123
age         -0.006749288 -0.040719238  0.005435916 -0.02931910 -0.07130328
married     -0.040248179 -0.004327422 -0.014737189 -0.02276344 -0.02032439
south       -0.109897970  0.066319792  0.093706790 -0.15787359 -0.17236334
urban        1.000000000 -0.031468824  0.002419787  0.03402366  0.11223944
sibs        -0.031468824  1.000000000  0.593913799 -0.28715120 -0.23202649
brthord      0.002419787  0.593913799  1.000000000 -0.27593376 -0.23037060
meduc        0.034023660 -0.287151198 -0.275933760  1.00000000  0.57649476
feduc        0.112239438 -0.232026494 -0.230370600  0.57649476  1.00000000
hourly_wage  0.189240304 -0.131364072 -0.120293460  0.18348733  0.20469678
ln_wage      0.203797585 -0.152809172 -0.141852712  0.21357476  0.22338514
            hourly_wage     ln_wage
wage         0.93124050  0.95314116
hours       -0.31764547 -0.04721908
IQ           0.26502635  0.31478770
KWW          0.26059936  0.30627128
educ         0.27167136  0.31211665
exper        0.01775779  0.02060116
tenure       0.13541822  0.18585262
age          0.12668302  0.16182231
married      0.11511570  0.14997589
south       -0.14716118 -0.19481092
urban        0.18924030  0.20379758
sibs        -0.13136407 -0.15280917
brthord     -0.12029346 -0.14185271
meduc        0.18348733  0.21357476
feduc        0.20469678  0.22338514
hourly_wage  1.00000000  0.89974921
ln_wage      0.89974921  1.00000000

The above table is informative, but the reported numbers have far too many decimals, which distracts from the patterns. Below, we round these to two decimal places, which is enough to get an idea of the strength of the correlation between our variables.

# Calculate pairwise correlations and store them under name cor_matrix
cor_matrix <- cor(wage2, use = "pairwise.complete.obs")

# Round the correlation values to 2 decimal places and save them under the name rounded_cor_matrix
rounded_cor_matrix <- round(cor_matrix, 2)

# Display the rounded correlation matrix
print(rounded_cor_matrix)
             wage hours    IQ   KWW  educ exper tenure   age married south
wage         1.00 -0.01  0.31  0.33  0.33  0.00   0.13  0.16    0.14 -0.16
hours       -0.01  1.00  0.07  0.11  0.09 -0.06  -0.06  0.02    0.03 -0.03
IQ           0.31  0.07  1.00  0.41  0.52 -0.22   0.04 -0.04   -0.01 -0.21
KWW          0.33  0.11  0.41  1.00  0.39  0.02   0.14  0.39    0.09 -0.09
educ         0.33  0.09  0.52  0.39  1.00 -0.46  -0.04 -0.01   -0.06 -0.10
exper        0.00 -0.06 -0.22  0.02 -0.46  1.00   0.24  0.50    0.11  0.02
tenure       0.13 -0.06  0.04  0.14 -0.04  0.24   1.00  0.27    0.07 -0.06
age          0.16  0.02 -0.04  0.39 -0.01  0.50   0.27  1.00    0.11 -0.03
married      0.14  0.03 -0.01  0.09 -0.06  0.11   0.07  0.11    1.00  0.02
south       -0.16 -0.03 -0.21 -0.09 -0.10  0.02  -0.06 -0.03    0.02  1.00
urban        0.20  0.02  0.04  0.10  0.07 -0.05  -0.04 -0.01   -0.04 -0.11
sibs        -0.16 -0.05 -0.28 -0.28 -0.24  0.06  -0.04 -0.04    0.00  0.07
brthord     -0.15 -0.04 -0.18 -0.15 -0.20  0.09  -0.03  0.01   -0.01  0.09
meduc        0.21  0.08  0.33  0.24  0.36 -0.19  -0.01 -0.03   -0.02 -0.16
feduc        0.24  0.06  0.34  0.23  0.43 -0.26  -0.06 -0.07   -0.02 -0.17
hourly_wage  0.93 -0.32  0.27  0.26  0.27  0.02   0.14  0.13    0.12 -0.15
ln_wage      0.95 -0.05  0.31  0.31  0.31  0.02   0.19  0.16    0.15 -0.19
            urban  sibs brthord meduc feduc hourly_wage ln_wage
wage         0.20 -0.16   -0.15  0.21  0.24        0.93    0.95
hours        0.02 -0.05   -0.04  0.08  0.06       -0.32   -0.05
IQ           0.04 -0.28   -0.18  0.33  0.34        0.27    0.31
KWW          0.10 -0.28   -0.15  0.24  0.23        0.26    0.31
educ         0.07 -0.24   -0.20  0.36  0.43        0.27    0.31
exper       -0.05  0.06    0.09 -0.19 -0.26        0.02    0.02
tenure      -0.04 -0.04   -0.03 -0.01 -0.06        0.14    0.19
age         -0.01 -0.04    0.01 -0.03 -0.07        0.13    0.16
married     -0.04  0.00   -0.01 -0.02 -0.02        0.12    0.15
south       -0.11  0.07    0.09 -0.16 -0.17       -0.15   -0.19
urban        1.00 -0.03    0.00  0.03  0.11        0.19    0.20
sibs        -0.03  1.00    0.59 -0.29 -0.23       -0.13   -0.15
brthord      0.00  0.59    1.00 -0.28 -0.23       -0.12   -0.14
meduc        0.03 -0.29   -0.28  1.00  0.58        0.18    0.21
feduc        0.11 -0.23   -0.23  0.58  1.00        0.20    0.22
hourly_wage  0.19 -0.13   -0.12  0.18  0.20        1.00    0.90
ln_wage      0.20 -0.15   -0.14  0.21  0.22        0.90    1.00

5.2.6 Task 6

5.2.6.1 Task

Examine the relationship between education and wage using a scatter plot.

5.2.6.2 Guidance

We use the ggplot2 package to draw plots. First install the package (if necessary) and load the library.

# install.packages("ggplot2")
 library(ggplot2)

Attaching package: 'ggplot2'
The following object is masked from 'package:expss':

    vars

Education is expected to have a positive impact on wage. In our scatter plot, educ will be on the horizontal-axis while wage will be on the vertical-axis.

# Scatter plot
ggplot(wage2, aes(x = educ, y = wage)) +
  geom_point() +
  labs(title = "Scatter plot of Wage vs. Education", x = "Years of Schooling", y = "Wage")

You see above the full set of lines to create this plot. But let us do this step by step to have a better understanding. First, we bring the educ and wage variables from the wage2 data and position these on our plot.

ggplot(wage2, aes(x = educ, y = wage))

We then add (using the + sign) the observations in our data, represented by dots.

ggplot(wage2, aes(x = educ, y = wage)) +
  geom_point() 

It is always good practice to give your plot a title. Notice also that the horizontal and vertical axes above are labelled with the variable names. We may replace these with proper definitions of the variables, to make it easier for readers to understand your plots:

ggplot(wage2, aes(x = educ, y = wage)) +
  geom_point() +
  labs(title = "Scatter plot of Wage vs. Education", x = "Years of Schooling", y = "Wage")

5.2.7 Task 7

5.2.7.1 Task

Tabulate the urban variable to see the distribution of observations in rural and urban areas.

5.2.7.2 Guidance

We use the table() function for that purpose.

table(wage2$urban)

  0   1 
264 671 
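If shares are more informative than raw counts, the table can also be passed to prop.table(). A quick sketch with a made-up 0/1 dummy vector:

```r
urban_demo <- c(0, 1, 1, 0, 1, 1, 1, 0)  # hypothetical dummy: 0 = rural, 1 = urban

counts <- table(urban_demo)
counts              # frequencies in each category
prop.table(counts)  # the same table expressed as proportions
```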

5.2.8 Task 8

5.2.8.1 Task

Let’s say we are interested in plotting the education-wage relationship, differentiating between people in rural and urban areas. Replicate the scatter plot above, but this time use different colours for rural and urban.

5.2.8.2 Guidance

Notice how we add the color = urban option below. We do the same for the label too.

# Scatter plot - colored by urban
ggplot(wage2, aes(x = educ, y = wage, color = urban)) +
  geom_point() +
  labs(title = "Scatter plot of Wage vs. Education", x = "Years of Schooling", y = "Wage", color = "urban")

The labelling of the above plot looks as if we have a range of values for the urban variable, changing continuously from zero to one. The urban variable, in fact, is a dummy, taking two values only: zero for rural and one for urban residence. If you inspect this variable in more detail, you will see that it is stored as num (numeric). We can change this using the factor() function. Instead of overriding the urban variable, let’s create a new variable, urban_residence, to see a comparison.

wage2$urban_residence <- factor(wage2$urban, levels = c(0,1), labels = c("rural", "urban"))
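What factor() does here is easiest to see on a tiny made-up vector: levels names the raw values and labels supplies the text each level should display as.

```r
# Hypothetical 0/1 dummy, as a variable might be stored in a data set
dummy <- c(0, 1, 1, 0)

residence <- factor(dummy, levels = c(0, 1), labels = c("rural", "urban"))
residence          # values now display as "rural"/"urban"
levels(residence)  # the two categories, in order
```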

Below, we view the two variables using R’s dplyr package.

# install.packages("dplyr")
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:expss':

    compute, contains, na_if, recode, vars, where
The following objects are masked from 'package:maditr':

    between, coalesce, first, last
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
#View(select(wage2, urban, urban_residence))

Let’s re-run our scatter plot code again (but replacing urban with urban_residence):

# Scatter plot - colored by urban_residence
ggplot(wage2, aes(x = educ, y = wage, color = urban_residence)) +
  geom_point() +
  labs(title = "Scatter plot of Wage vs. Education", x = "Years of Schooling", y = "Wage", color = "urban_residence")

5.2.9 Task 9

5.2.9.1 Task

Estimate a regression model where wage is regressed on education. Interpret the results.

5.2.9.2 Guidance

We use the lm() function to estimate linear regression models. You may read ~ in wage ~ educ below as “approximately modelled as” (James et al. 2023). We may also say “wage is regressed on education”.

# Linear regression
model_1 <- lm(wage ~ educ, data = wage2)
summary(model_1)

Call:
lm(formula = wage ~ educ, data = wage2)

Residuals:
    Min      1Q  Median      3Q     Max 
-877.38 -268.63  -38.38  207.05 2148.26 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  146.952     77.715   1.891   0.0589 .  
educ          60.214      5.695  10.573   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 382.3 on 933 degrees of freedom
Multiple R-squared:  0.107, Adjusted R-squared:  0.106 
F-statistic: 111.8 on 1 and 933 DF,  p-value: < 2.2e-16

In the above regression output, we see that education has a statistically significant impact on wages. Each additional year of schooling increases the wage by around £60, on average. The F-test tells us that the regression model has explanatory power, even though the R-squared value is low.

5.2.10 Task 10

5.2.10.1 Task

Using the regression model above, predict the wage for given values of education (how much do we expect the wage to be for given years of schooling).

5.2.10.2 Guidance

Below, we recall model_1 to calculate predicted values and save the predictions under the name wage_hat in the wage2 data.

# Save predicted values under name wage_hat
wage2$wage_hat <- predict(model_1)
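predict() with no further arguments returns the fitted values for the estimation sample. To predict at specific values of the regressor, pass a newdata data frame whose column name matches the regressor. A sketch of the same idea using the built-in mtcars data as a stand-in, since it is always available:

```r
# Fit a simple regression on built-in data
m <- lm(mpg ~ wt, data = mtcars)

# Predict mpg for two chosen car weights (wt is measured in 1,000s of lbs)
new_cars <- data.frame(wt = c(2, 3))
predict(m, newdata = new_cars)
```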

5.2.11 Task 11

5.2.11.1 Task

Add the estimated regression line to the wage-education scatter plot.

5.2.11.2 Guidance

We will be adding the regression line to the scatter plot we produced above. We use geom_smooth for this purpose. Let’s first remember what we did before:

# Scatter plot of education and wage
ggplot(wage2, aes(x = educ, y = wage)) +
  geom_point() +
  labs(title = "Scatter plot with Fitted Line", x = "Years of Schooling", y = "Wage")

Now, let’s add the regression line:

# Scatter plot with fitted line
ggplot(wage2, aes(x = educ, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatter plot with Fitted Line", x = "Years of Schooling", y = "Wage")
`geom_smooth()` using formula = 'y ~ x'

The geom_smooth(method = "lm") layer asks R to add a line from estimating a “linear model” (i.e. a regression) of wage on educ.

Note that we could save this plot as an object by assigning it a name on the left-hand side of the command. We will do that below and name the plot scatter_wage_educ.

Can you guess what the plot would look like if we changed se = FALSE to se = TRUE above? We can also try that below:

# Scatter plot with fitted line
scatter_wage_educ <- ggplot(wage2, aes(x = educ, y = wage)) +
                        geom_point() +
                        geom_smooth(method = "lm", se = TRUE) +
                        labs(title = "Scatter plot with Fitted Line", x = "Years of Schooling", y = "Wage")
print(scatter_wage_educ)
`geom_smooth()` using formula = 'y ~ x'

We could also add this sample regression line by using the wage_hat variable. wage_hat shows the predicted value of wage given observed values of education.

# Scatter plot with fitted line
# we add the wage_hat variable
ggplot(wage2, aes(x = educ, y = wage)) +
  geom_point() +
  geom_line(aes(y = wage_hat), color = "blue", linewidth = 1) +
  labs(title = "Scatter plot with Fitted Line", x = "Years of Schooling", y = "Wage")

Note that we used geom_line() this time to add a line plot of an already existing variable in the data set.

  • ggplot(wage2, aes(x = educ, y = wage)) creates a canvas, a plot area with educ at the horizontal and wage at the vertical axis
  • geom_point() adds a scatterplot of wage against educ.
  • geom_line(aes(y = wage_hat)) adds the line for the predicted wage_hat values. The aes(y = wage_hat) ensures the line graph uses wage_hat on the y-axis while sharing the x-axis (educ).
  • color and linewidth are optional arguments for styling the line (in ggplot2 versions before 3.4.0, linewidth was called size). Try experimenting with these and observe the changes.

5.2.12 Task 12

5.2.12.1 Task

Estimate a multiple regression model by adding experience and urban residence into the above regression. Save it under name model_2

5.2.12.2 Guidance

We will add exper and urban variables into the regression model using + sign.

# Linear regression
model_2 <- lm(wage ~ educ + exper + urban, data = wage2)
summary(model_2)

Call:
lm(formula = wage ~ educ + exper + urban, data = wage2)

Residuals:
    Min      1Q  Median      3Q     Max 
-799.67 -234.04  -34.26  197.89 2119.62 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -362.821    106.419  -3.409 0.000679 ***
educ          74.119      6.193  11.968  < 2e-16 ***
exper         17.940      3.105   5.777 1.03e-08 ***
urban        160.306     26.920   5.955 3.69e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 369.5 on 931 degrees of freedom
Multiple R-squared:  0.1676,    Adjusted R-squared:  0.1649 
F-statistic: 62.47 on 3 and 931 DF,  p-value: < 2.2e-16

How does model_2 compare to model_1?

5.2.13 Task 13

5.2.13.1 Task

Save your data to keep the newly created hourly_wage and ln_wage variables.

5.2.13.2 Guidance

# Save data in R format
save(wage2, file = "./assets/data/wage2.Rdata")

5.2.14 A Gentle Introduction to dplyr library

The dplyr library is part of R’s tidyverse collection of packages. The ggplot2 library we used above to produce plots is also part of the tidyverse.

I will replicate below a few of the tasks that we performed above using the dplyr library.

5.2.14.1 Viewing data

We have seen before how to use head() to print the first few rows and View() to see the contents of data in a spreadsheet format:

head(wage2)
# A tibble: 6 × 19
   wage hours    IQ   KWW educ      exper tenure   age married south urban  sibs
  <dbl> <dbl> <dbl> <dbl> <labelle> <lab>  <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>
1   769    40    93    35 12        11         2    31       1     0     1     1
2   808    50   119    41 18        11        16    37       1     0     1     1
3   825    40   108    46 14        11         9    33       1     0     1     1
4   650    40    96    32 12        13         7    32       1     0     1     4
5   562    40    74    27 11        14         5    34       1     0     1    10
6  1400    40   116    43 16        14         2    35       1     0     1     1
# ℹ 7 more variables: brthord <dbl>, meduc <dbl>, feduc <dbl>,
#   hourly_wage <dbl>, ln_wage <dbl>, urban_residence <fct>, wage_hat <dbl>
#View(wage2)

We may use dplyr to select variables for viewing. Using select allows us to “keep or drop columns using their names and types”.

#View(select(wage2, wage, educ, exper, urban, urban_residence))

5.2.14.2 Generating new variables

We used the following lines to create hourly_wage and ln_wage variables:

# Generate new variables
wage2$hourly_wage <- wage2$wage / wage2$hours
wage2$ln_wage <- log(wage2$wage)

dplyr’s mutate() is used to “create, modify, and delete columns”. Let us create a new data frame, wage2_new, to see what it does:

wage2_new <- wage2 %>% 
  mutate (
    hourly_wage_n = wage / hours,
    ln_wage_n = log(wage)
  )

In the above lines, we create a new data frame based on wage2. Note the %>% above. This is part of the command and is called the pipe operator. It helps us to simplify the code and carry out the operations one step after another. We first call wage2 and then create the new variables, hourly_wage_n and ln_wage_n.

Note how we avoided the use of wage2$ every time we referred to a variable in wage2 data.
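The same step-by-step idea also works with base R’s native pipe |> (available since R 4.1), which behaves like %>% for simple calls: x |> f() means f(x). A quick illustration with the built-in mtcars data and base functions only:

```r
# Without a pipe: nested calls read inside-out
head(subset(mtcars, cyl == 4), 2)

# With the native pipe: read left to right, one step at a time
mtcars |>
  subset(cyl == 4) |>
  head(2)
```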

Another example we used to create a new variable was when we predicted values of wage for given levels of education after estimating model_1.

Below is the code we used:

wage2$wage_hat <- predict(model_1)

We can do this as follows using dplyr:

wage2 <- wage2 %>% 
  mutate(
    wage_hat_n = predict(model_1)
  )

5.2.14.3 Tabulating Variables

We used the code below to tabulate the values of the urban variable:

table(wage2$urban)

  0   1 
264 671 

We may use count in dplyr for this purpose:

wage2 %>% 
  count(urban)
# A tibble: 2 × 2
  urban     n
  <dbl> <int>
1     0   264
2     1   671

Remember that we could save this as a new object:

urban_table <- wage2 %>% 
  count(urban)
print(urban_table)
# A tibble: 2 × 2
  urban     n
  <dbl> <int>
1     0   264
2     1   671

Which output do you prefer?

5.3 Further Exercises

Download the data set called EAWE21.Rdata from the module page on Aula and save it. This is a subset of the Educational Attainment and Wage Equations data set used in Dougherty (2016) available from https://global.oup.com/uk/orc/busecon/economics/dougherty5e/student/datasets/eawe/. For this exercise we are interested in two variables:

  • EXP : Total out-of-school work experience (years) as of the 2002 interview

  • EARNINGS : Current hourly earnings in $ reported at the 2002 interview

5.3.1 Tasks

  1. Calculate summary statistics (mean, median, minimum, maximum) for the variables EXP and EARNINGS

  2. Draw a scatter plot of EARNINGS on EXP.

  3. Calculate the covariance and correlation between EARNINGS and EXP and comment on the values

  4. Regress EARNINGS on EXP and comment on

    1. the sign and size of the regression coefficients
    2. the goodness of fit of the estimated model.
  5. Add a regression line to the scatter plot.