4  Importing Data into R

4.1 Example 1: Crime data

The example and instructions provided in this section is taken from (Riegler 2022).

The following exercise gives you a hands-on introduction to basic operations in R using a real-world data set. It begins with importing a MS-Excel data set into R and asks you to perform some basic operations to familiarise yourself with some of the commands that will be relevant for the coursework and in subsequent computer classes.

Please download the Excel data set called crime.xls from Aula and save it into a drive of your choice. This is a data set that contains crime levels and other socio-economic information on 46 cities across the US for the year 1982. The full version of the data set can be accessed at http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html. The variables are defined as follows:

Variable Definition
pop actual population in number
crimes total number of crimes
unem unemployment rate (%)
officers number of police officers
pcinc per capita income, $
area land area, square miles
lawexpc law enforcement expenditure per capita, $

From here on, you need to open a R script to save all your commands to be able to replicate your results:

4.1.1 Task 1

4.1.1.1 Task

Import the Excel data set into R.

4.1.1.2 Guidance

The native data format of R is .Rdata, however, you can also open other formats, such as .xlsx, .csv, etc. Non-native data formats have to be imported rather than just opened. Before we can we can import Excel spreadsheets directly into R, we have to activate a R-library first.

You can either use the package manager window (in the right bottom corner of the R screen) and tick the box next to the package name or you type the following into the terminal window (in the left bottom of the R screen)

library(readxl)

This line loads the necessary readxl library. But you will probably receive an error message when you run the above line. This is because we first need to install the read_excel package. (Note that you will need to type the below line without the pound (hashtag) sign at the beginning of the line).

# install.packages("readxl")
library(readxl)

There are two ways to import:

  1. Through command line:

    crime <- read_excel("./assets/data/crime.xls")

    In the above line, we import the dataset with the read_excel function and store it under the name crime. Notice how the new crime data is added as an object in the R environment.

  2. Through menu:

    File –> Import Dataset –> From Excel

Don’t forget to tick the “First Row as Names” box if it is not ticked!

4.1.2 Task 2

4.1.2.1 Task

View the dataset in R’s data viewer.

4.1.2.2 Guidance

To open the data viewer, use the View function.

# View(crime)

Note that the first letter of View is capitalised.

4.1.3 Task 3

4.1.3.1 Task

View the first few (six) entries of the crime data to get a feeling of what the values look like.

4.1.3.2 Guidance

Use the head function

head(crime)
# A tibble: 6 × 7
     pop crimes  unem officers pcinc  area lawexpc
   <dbl>  <dbl> <dbl>    <dbl> <dbl> <dbl>   <dbl>
1 229528  17136  8.20      326  8532  44.6    851.
2 814054  75654  8.10     1621  7551 375      875.
3 374974  31352  9         633  8343  49.8   1122.
4 176496  15698 12.6       245  7592  74      744.
5 288446  31202 12.6       504  7558  97.3    974.
6 122768  16806 13.9       186  6411  55.3    762.

4.1.4 Task 4

4.1.4.1 Task

Label the variables using the definitions given above.

4.1.4.2 Guidance

You have to attach a variable label to each variable. There is already a library available which facilitates the allocation of labels to variables. First, we need to install the package!

# install.packages("expss")
library(expss)
Loading required package: maditr

Use magrittr pipe '%>%' to chain several operations:
             mtcars %>%
                 let(mpg_hp = mpg/hp) %>%
                 take(mean(mpg_hp), by = am)
        

Use 'expss_output_rnotebook()' to display tables inside R Notebooks.
 To return to the console output, use 'expss_output_default()'.
crime <- apply_labels(crime, 
                      pop = "actual population in number",
                      crimes = "total number of crimes",
                      unem = "unemployment rate (%)", 
                      officers = "number of police officers", 
                      pcinc = "per capita income, $",
                      area = "land area, square miles", 
                      lawexpc = "law enforcement expenditure per capita, $"
                      )

4.1.5 Task 5

4.1.5.1 Task

Create a new variable which measures the population density for each city.

4.1.5.2 Guidance

To generate a new variable and add it to the existing crime dataset, we use the following command:

crime$popdens <- crime$pop / crime$area

You may wonder why we add crime$ in front of every variable. The reason is that R can store more than one data frame, matrix, list, vector etc., at the same time, so the prefix crime$ is necessary to avoid ambiguity and ensure that we are working with variables in the crime data. Think of crime$ as an address where e.g. the variable pop stays. If you have loaded another data frame that contains a pop variable, R would know that we want to use the variable from the crime dataset and not from the other data frame. There are library packages that can facilitate the process, however, we will cover them later in the module.

Note that the newly created population density variable is now labelled as the original population variable (pop). Let’s update the label with the method we introduced in Task 4. Note that we do not need to call the library again, as it is already called.

crime <- apply_labels(crime, 
                      popdens = "population density: pop / area")

4.1.6 Task 6

4.1.6.1 Task

Sort the data with respect to the population density of each city.

4.1.6.2 Guidance

Sorting data is a useful action to get a general feeling for the data, e.g. are there any outliers in the dataset? Are there any unusual patterns?

To change the order of the rows in a data frame, we will apply the order function. We first rank all observations with respect to the population density and store this information in a vector called rank. The rank vector contains indices that we can use to sort the crime data frame. Below, we save the sorted data under a new name, crime.popdens1

rank <- order(crime$popdens)
crime.popdens1 <- crime[rank,]

Let’s see the result (note. how the population density variable is now sorted from the smallest to the largest):

head(crime.popdens1)
# A tibble: 6 × 8
  pop        crimes     unem       officers   pcinc      area  lawexpc popdens  
  <labelled> <labelled> <labelled> <labelled> <labelled> <lab> <label> <labelle>
1 425093     38195       4.7        767       7991       604.0  570.00  703.7964
2 268887     14537       5.5        400       7704       255.9  570.63 1050.7503
3 462657     34736      10.4        937       7585       352.0  582.56 1314.3665
4 451397     45503      10.4       1145       7480       316.4 1054.17 1426.6656
5 412661     47128       8.3        719       7336       258.5  554.70 1596.3675
6 173630     18915       8.7        366       7409       100.5  827.16 1727.6617
# you may alternatively use 
# View(crime.popdens1)

This procedure sorts the data from the smallest to the largest value. To sort the data from the largest to the smallest number, we set the order argument decreasing to TRUE.

crime.popdens2 <- crime[order(crime$popdens, decreasing = TRUE),]
head(crime.popdens2)
# A tibble: 6 × 8
  pop        crimes     unem       officers   pcinc      area  lawexpc popdens  
  <labelled> <labelled> <labelled> <labelled> <labelled> <lab> <label> <labelle>
1  708287     68598      8.4       1971       9265        46.4 1050.00 15264.806
2  334414     36172     15.4       1166       4525        24.1 1139.32 13876.099
3  365506     52901     12.3        979       6084        34.3  714.00 10656.152
4 1181868    152962     20.3       4092       6251       135.6 1483.52  8715.840
5  360493     28592     16.9       1034       5929        41.8  749.44  8624.235
6  158533     15233     11.3        408       6169        18.9  661.50  8387.990

Have you observed a slight difference in the way we sorted the data? We can save some time and space by merging the two steps into one line, however, it is sometimes easier to understand a command if it is split into separate stages.

4.1.7 Task 7

4.1.7.1 Task

What is the minimum and maximum value for population density in the crime data?

4.1.7.2 Guidance

The minimum and maximum values can be produced by generating standard descriptive statistics of the variables.

summary(crime$popdens)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  703.8  2797.1  4236.8  4967.5  7052.2 15264.8 

Before you finish, save the dataset under a new name. Never overwrite your original data!

save(crime, file = "./assets/data/crime_v2.Rdata")

The above command tells R to use the crime dataset and save it as crime_v2.Rdata. Rdata is an R specific format. R can also save data in .csv format, that can be opened with any text editor or spreadsheet software:

write.csv(crime, file = "./assets/data/crime_v2.csv", row.names = TRUE) 

Now you are ready to answer the following questions on your own:

4.1.8 Further Exercises

  1. Find the minimum and maximum number of police officers in the data set.
  2. Create a new variable which measures the crime rate per 1,000 of population.
  3. Is the city with the highest number of police officers also the city with the highest crime density?
  4. How many crimes occurred in the richest city?
  5. Is the richest city also the one with the highest number of police officers?
  6. What is the average unemployment rate across these 46 U.S. cities?
  7. Does the city with the highest unemployment rate also have the highest crime level?