library(readxl)4 Importing Data into R
4.1 Example 1: Crime data
The example and instructions provided in this section is taken from (Riegler 2022).
The following exercise gives you a hands-on introduction to basic operations in R using a real-world data set. It begins with importing a MS-Excel data set into R and asks you to perform some basic operations to familiarise yourself with some of the commands that will be relevant for the coursework and in subsequent computer classes.
Please download the Excel data set called crime.xls from Aula and save it into a drive of your choice. This is a data set that contains crime levels and other socio-economic information on 46 cities across the US for the year 1982. The full version of the data set can be accessed at http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html. The variables are defined as follows:
| Variable | Definition |
|---|---|
| pop | actual population in number |
| crimes | total number of crimes |
| unem | unemployment rate (%) |
| officers | number of police officers |
| pcinc | per capita income, $ |
| area | land area, square miles |
| lawexpc | law enforcement expenditure per capita, $ |
From here on, you need to open a R script to save all your commands to be able to replicate your results:
4.1.1 Task 1
4.1.1.1 Task
Import the Excel data set into R.
4.1.1.2 Guidance
The native data format of R is .Rdata, however, you can also open other formats, such as .xlsx, .csv, etc. Non-native data formats have to be imported rather than just opened. Before we can we can import Excel spreadsheets directly into R, we have to activate a R-library first.
You can either use the package manager window (in the right bottom corner of the R screen) and tick the box next to the package name or you type the following into the terminal window (in the left bottom of the R screen)
This line loads the necessary readxl library. But you will probably receive an error message when you run the above line. This is because we first need to install the read_excel package. (Note that you will need to type the below line without the pound (hashtag) sign at the beginning of the line).
# install.packages("readxl")
library(readxl)There are two ways to import:
Through command line:
crime <- read_excel("./assets/data/crime.xls")In the above line, we import the dataset with the
read_excelfunction and store it under the namecrime. Notice how the newcrimedata is added as an object in the R environment.Through menu:
File –> Import Dataset –> From Excel
Don’t forget to tick the “First Row as Names” box if it is not ticked!
4.1.2 Task 2
4.1.2.1 Task
View the dataset in R’s data viewer.
4.1.2.2 Guidance
To open the data viewer, use the View function.
# View(crime)Note that the first letter of View is capitalised.
4.1.3 Task 3
4.1.3.1 Task
View the first few (six) entries of the crime data to get a feeling of what the values look like.
4.1.3.2 Guidance
Use the head function
head(crime)# A tibble: 6 × 7
pop crimes unem officers pcinc area lawexpc
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 229528 17136 8.20 326 8532 44.6 851.
2 814054 75654 8.10 1621 7551 375 875.
3 374974 31352 9 633 8343 49.8 1122.
4 176496 15698 12.6 245 7592 74 744.
5 288446 31202 12.6 504 7558 97.3 974.
6 122768 16806 13.9 186 6411 55.3 762.
4.1.4 Task 4
4.1.4.1 Task
Label the variables using the definitions given above.
4.1.4.2 Guidance
You have to attach a variable label to each variable. There is already a library available which facilitates the allocation of labels to variables. First, we need to install the package!
# install.packages("expss")
library(expss)Loading required package: maditr
Use magrittr pipe '%>%' to chain several operations:
mtcars %>%
let(mpg_hp = mpg/hp) %>%
take(mean(mpg_hp), by = am)
Use 'expss_output_rnotebook()' to display tables inside R Notebooks.
To return to the console output, use 'expss_output_default()'.
crime <- apply_labels(crime,
pop = "actual population in number",
crimes = "total number of crimes",
unem = "unemployment rate (%)",
officers = "number of police officers",
pcinc = "per capita income, $",
area = "land area, square miles",
lawexpc = "law enforcement expenditure per capita, $"
)4.1.5 Task 5
4.1.5.1 Task
Create a new variable which measures the population density for each city.
4.1.5.2 Guidance
To generate a new variable and add it to the existing crime dataset, we use the following command:
crime$popdens <- crime$pop / crime$areaYou may wonder why we add crime$ in front of every variable. The reason is that R can store more than one data frame, matrix, list, vector etc., at the same time, so the prefix crime$ is necessary to avoid ambiguity and ensure that we are working with variables in the crime data. Think of crime$ as an address where e.g. the variable pop stays. If you have loaded another data frame that contains a pop variable, R would know that we want to use the variable from the crime dataset and not from the other data frame. There are library packages that can facilitate the process, however, we will cover them later in the module.
Note that the newly created population density variable is now labelled as the original population variable (pop). Let’s update the label with the method we introduced in Task 4. Note that we do not need to call the library again, as it is already called.
crime <- apply_labels(crime,
popdens = "population density: pop / area")4.1.6 Task 6
4.1.6.1 Task
Sort the data with respect to the population density of each city.
4.1.6.2 Guidance
Sorting data is a useful action to get a general feeling for the data, e.g. are there any outliers in the dataset? Are there any unusual patterns?
To change the order of the rows in a data frame, we will apply the order function. We first rank all observations with respect to the population density and store this information in a vector called rank. The rank vector contains indices that we can use to sort the crime data frame. Below, we save the sorted data under a new name, crime.popdens1
rank <- order(crime$popdens)
crime.popdens1 <- crime[rank,]Let’s see the result (note. how the population density variable is now sorted from the smallest to the largest):
head(crime.popdens1)# A tibble: 6 × 8
pop crimes unem officers pcinc area lawexpc popdens
<labelled> <labelled> <labelled> <labelled> <labelled> <lab> <label> <labelle>
1 425093 38195 4.7 767 7991 604.0 570.00 703.7964
2 268887 14537 5.5 400 7704 255.9 570.63 1050.7503
3 462657 34736 10.4 937 7585 352.0 582.56 1314.3665
4 451397 45503 10.4 1145 7480 316.4 1054.17 1426.6656
5 412661 47128 8.3 719 7336 258.5 554.70 1596.3675
6 173630 18915 8.7 366 7409 100.5 827.16 1727.6617
# you may alternatively use
# View(crime.popdens1)This procedure sorts the data from the smallest to the largest value. To sort the data from the largest to the smallest number, we set the order argument decreasing to TRUE.
crime.popdens2 <- crime[order(crime$popdens, decreasing = TRUE),]
head(crime.popdens2)# A tibble: 6 × 8
pop crimes unem officers pcinc area lawexpc popdens
<labelled> <labelled> <labelled> <labelled> <labelled> <lab> <label> <labelle>
1 708287 68598 8.4 1971 9265 46.4 1050.00 15264.806
2 334414 36172 15.4 1166 4525 24.1 1139.32 13876.099
3 365506 52901 12.3 979 6084 34.3 714.00 10656.152
4 1181868 152962 20.3 4092 6251 135.6 1483.52 8715.840
5 360493 28592 16.9 1034 5929 41.8 749.44 8624.235
6 158533 15233 11.3 408 6169 18.9 661.50 8387.990
Have you observed a slight difference in the way we sorted the data? We can save some time and space by merging the two steps into one line, however, it is sometimes easier to understand a command if it is split into separate stages.
4.1.7 Task 7
4.1.7.1 Task
What is the minimum and maximum value for population density in the crime data?
4.1.7.2 Guidance
The minimum and maximum values can be produced by generating standard descriptive statistics of the variables.
summary(crime$popdens) Min. 1st Qu. Median Mean 3rd Qu. Max.
703.8 2797.1 4236.8 4967.5 7052.2 15264.8
Before you finish, save the dataset under a new name. Never overwrite your original data!
save(crime, file = "./assets/data/crime_v2.Rdata")The above command tells R to use the crime dataset and save it as crime_v2.Rdata. Rdata is an R specific format. R can also save data in .csv format, that can be opened with any text editor or spreadsheet software:
write.csv(crime, file = "./assets/data/crime_v2.csv", row.names = TRUE) Now you are ready to answer the following questions on your own:
4.1.8 Further Exercises
- Find the minimum and maximum number of police officers in the data set.
- Create a new variable which measures the crime rate per 1,000 of population.
- Is the city with the highest number of police officers also the city with the highest crime density?
- How many crimes occurred in the richest city?
- Is the richest city also the one with the highest number of police officers?
- What is the average unemployment rate across these 46 U.S. cities?
- Does the city with the highest unemployment rate also have the highest crime level?