Chapter 4 Importing Data
We’ll learn how to import data sets into R, including from packages, and from spreadsheets, such as Excel or .csv files. We’ll then see how to make changes to the data, or how to do simple tasks using data frames.
4.1 Importing From a Package
We saw in a previous chapter that one easy way to read data in is from packages- this is often used for teaching exercises . Remember, to use data from a package, we first install the package if we haven’t already. Recall to do that, we can use the Rstudio menu bar “Tools -> Install Packages…” mouse action.
So that we don’t have many data sets unnecessarily taking up R’s memory, remember from the previous chapter we have to go through a two-step process of making sure that the package is installed on the computer, and then loading the desired data set into our current R session. Once the package is installed, we can load the data into our session via the following command:
data('alfalfa', package='faraway') # load the data set 'alfalfa' from the package 'faraway'
Because R tries to avoid loading datasets until they are needed,
the object alfalfa
isn’t initially loaded as a data.frame
but rather as a
“promise” that it eventually will be loaded whenever you first use it. So let’s
first access it by viewing it.
View(alfalfa)
There are two ways to enter the view command. Either executing the View()
function
from the console, or perhaps more simply by clicking on either the white table or the object name in the
Environment
tab.
4.2 Import from .csv
or .xls
files
A common way of transmitting data is using “csv” (Comma Separated Values) files (with the file suffix of .csv). Most software packages will use this format and it’s probably the most common interchange format.
A short file available here might look like this:
Start,End,Name
1837,1901,Victoria
1901,1910,Edward VII
1910,1936,George V
1936,1936,Edward VIII
1936,1952,George VI
1952,2022,Elizabeth II 2022,,Charles III
where the rows in the file represent the data frame rows, and the columns are just separated by commas. The first row of the file is usually the column titles.
The best way to import a file is using the import wizard accessed via ‘File -> Import Dataset’. This will then give you a choice of file types to read from (.csv files are in the “Text” options). Navigate to the file you want to import, and click on import. Note that R will preview the file as you import it, and here’s a good time to check the settings to make sure that you are importing the right file, and the parameters are correct.
When you import a file using the import wizard, R generates code that does the actual import. We MUST copy that code into our R Script file or else the import won’t happen when we run the script. So only use
the import wizard to generate the import code! The code generated by the import wizard ends with a View()
command and which can be removed. The code that I’ll paste into my R Script
file typically looks like this:
library(readr)
<- read_csv("http://www.ben-parker.co.uk/wp-content/uploads/2022/10/kingsAndQueens.csv")
kingsAndQueens View(kingsAndQueens)
Similar principles apply to reading in data from excel files, using the readxl
library. This is best accessed via the Import Dataset window, and for real data can be quite messy as excel spreadsheets are often not formatted well for import.
4.3 Types of data in R
It is worth thinking briefly here about how R stores data. Remember the diamonds
data set:
library(ggplot2)
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y z
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710 Median : 3.530
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900 Max. :31.800
##
You will see in the summary command above for the diamonds
data set that R knows that the mean of price can be explained, but for example, that color is categorical data, and does not attempt to find a mean color, which would not make sense.
R stores data as different types, and if we look at the structure of the diamonds
dataset using the str
command, we can see this structure:
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
carat
, for example is noted as num
(numerical data, meaning continuous), where price
is an int
, or an integer, as price has only taken integer prices. The unusual thing R does, as a statistical language, is to have a data type which is a factor
, which is how R stores categorical data. factors
can be ordered (ordinal data), as here with cut
,color
,and clarity
, or unordered (nominal data).
Other data types that R uses are strings (str
), and logicals.
<-"Kier"
oneName<-c("Tony", "Gordon", "David", "Jeremy")
myNames<-TRUE
leader<-c(FALSE,TRUE,FALSE,TRUE) myFriend
R is generally pretty good at recognising which data type is which, but if we explicitly want to define something as a categorical data, we usually have to tell R
# This is just a vector containing two strings
<-c("Bert","Ernie")
y y
## [1] "Bert" "Ernie"
# Here we tell R that this is categorical data
<-as.factor(c("Bert","Ernie"))
z z
## [1] Bert Ernie
## Levels: Bert Ernie
More details on the types can be found in the appendix
4.4 Exercises
- There is a spreadsheet with the football (soccer) results from the English Premiership 2019-2020 season available at this url. Because the
readr
package doesn’t care whether a file is on your local computer or on the Internet, we’ll use this file.- Start the import wizard using: “File -> Import Dataset -> From Text (readr) …” and input the above web URL. Click the update button near the top to cause the wizard to preview the result.
- Save the generated code to your R Script file and show the first few rows using the
head()
command. - The goals scored by the home side are in the ‘FTHG’ column. Calculate the mean number of goals for the home side.
- The result of the match (Home win, Draw, Away Win) are in the FTR column. Produce a crosstab using the table command to show the number of times each
HomeTeam
achieved each result. Which team had the most draws in the season? - Produce bar charts showing the number of goals scored in each game (from the FTHG column)