Chapter 4 Importing Data

We’ll learn how to import data sets into R, including from packages, and from spreadsheets, such as Excel or .csv files. We’ll then see how to make changes to the data, or how to do simple tasks using data frames.

4.1 Importing From a Package

We saw in a previous chapter that one easy way to read data in is from packages- this is often used for teaching exercises . Remember, to use data from a package, we first install the package if we haven’t already. Recall to do that, we can use the Rstudio menu bar “Tools -> Install Packages…” mouse action.

So that we don’t have many data sets unnecessarily taking up R’s memory, remember from the previous chapter we have to go through a two-step process of making sure that the package is installed on the computer, and then loading the desired data set into our current R session. Once the package is installed, we can load the data into our session via the following command:

data('alfalfa', package='faraway')   # load the data set 'alfalfa' from the package 'faraway'

Because R tries to avoid loading datasets until they are needed, the object alfalfa isn’t initially loaded as a data.frame but rather as a “promise” that it eventually will be loaded whenever you first use it. So let’s first access it by viewing it.

View(alfalfa)

There are two ways to enter the view command. Either executing the View() function from the console, or perhaps more simply by clicking on either the white table or the object name in the Environment tab.

4.2 Import from .csv or .xls files

A common way of transmitting data is using “csv” (Comma Separated Values) files (with the file suffix of .csv). Most software packages will use this format and it’s probably the most common interchange format.

A short file available here might look like this:

Start,End,Name
1837,1901,Victoria
1901,1910,Edward VII
1910,1936,George V
1936,1936,Edward VIII
1936,1952,George VI
1952,2022,Elizabeth II
2022,,Charles III

where the rows in the file represent the data frame rows, and the columns are just separated by commas. The first row of the file is usually the column titles.

The best way to import a file is using the import wizard accessed via ‘File -> Import Dataset’. This will then give you a choice of file types to read from (.csv files are in the “Text” options). Navigate to the file you want to import, and click on import. Note that R will preview the file as you import it, and here’s a good time to check the settings to make sure that you are importing the right file, and the parameters are correct.

When you import a file using the import wizard, R generates code that does the actual import. We MUST copy that code into our R Script file or else the import won’t happen when we run the script. So only use the import wizard to generate the import code! The code generated by the import wizard ends with a View() command and which can be removed. The code that I’ll paste into my R Script file typically looks like this:

library(readr)
kingsAndQueens <- read_csv("http://www.ben-parker.co.uk/wp-content/uploads/2022/10/kingsAndQueens.csv")
View(kingsAndQueens)

Similar principles apply to reading in data from excel files, using the readxl library. This is best accessed via the Import Dataset window, and for real data can be quite messy as excel spreadsheets are often not formatted well for import.

4.3 Types of data in R

It is worth thinking briefly here about how R stores data. Remember the diamondsdata set:

library(ggplot2)
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y                z         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900   Max.   :31.800  
## 

You will see in the summary command above for the diamonds data set that R knows that the mean of price can be explained, but for example, that color is categorical data, and does not attempt to find a mean color, which would not make sense.

R stores data as different types, and if we look at the structure of the diamonds dataset using the str command, we can see this structure:

str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

carat, for example is noted as num (numerical data, meaning continuous), where price is an int, or an integer, as price has only taken integer prices. The unusual thing R does, as a statistical language, is to have a data type which is a factor, which is how R stores categorical data. factorscan be ordered (ordinal data), as here with cut,color,and clarity, or unordered (nominal data).

Other data types that R uses are strings (str), and logicals.

oneName<-"Kier"
myNames<-c("Tony", "Gordon", "David", "Jeremy")
leader<-TRUE
myFriend<-c(FALSE,TRUE,FALSE,TRUE)

R is generally pretty good at recognising which data type is which, but if we explicitly want to define something as a categorical data, we usually have to tell R

# This is just a vector containing two strings
y<-c("Bert","Ernie")
y
## [1] "Bert"  "Ernie"
# Here we tell R that this is categorical data
z<-as.factor(c("Bert","Ernie"))
z
## [1] Bert  Ernie
## Levels: Bert Ernie

More details on the types can be found in the appendix

4.4 Exercises

  1. There is a spreadsheet with the football (soccer) results from the English Premiership 2019-2020 season available at this url. Because the readr package doesn’t care whether a file is on your local computer or on the Internet, we’ll use this file.
    1. Start the import wizard using: “File -> Import Dataset -> From Text (readr) …” and input the above web URL. Click the update button near the top to cause the wizard to preview the result.
    2. Save the generated code to your R Script file and show the first few rows using the head() command.
    3. The goals scored by the home side are in the ‘FTHG’ column. Calculate the mean number of goals for the home side.
    4. The result of the match (Home win, Draw, Away Win) are in the FTR column. Produce a crosstab using the table command to show the number of times each HomeTeam achieved each result. Which team had the most draws in the season?
    5. Produce bar charts showing the number of goals scored in each game (from the FTHG column)