Chapter 15 Data imports and graphing
Much of this chapter is based strongly (for large parts identical) on the STAT444 course by Derek Sonderegger. There is a Video Lecture that accompanies this chapter.
15.1 Importing Data
We’ll learn how to import data sets into R, including from packages, and from spreadsheets, such as Excel or .csv files. We’ll then see how to make changes to the data, or how to do simple tasks using data frames.
15.1.1 Importing From a Package
We saw in a previous chapter that one easy way to read data in is from packages; this is often used for teaching exercises. Remember, to use data from a package, we first install the package if we haven’t already. To do that, we can use the RStudio menu bar: “Tools -> Install Packages…”.
So that we don’t have many data sets unnecessarily taking up R’s memory, remember from the previous chapter we have to go through a two-step process of making sure that the package is installed on the computer, and then loading the desired data set into our current R session. Once the package is installed, we can load the data into our session via the following command:
data('alfalfa', package='faraway') # load the data set 'alfalfa' from the package 'faraway'
Because R tries to avoid loading datasets until they are needed, the object alfalfa isn’t initially loaded as a data.frame but rather as a “promise” that it eventually will be loaded whenever you first use it. So let’s first access it by viewing it.
View(alfalfa)
There are two ways to view the data: either execute the View() function from the console or, perhaps more simply, click on either the white table icon or the object name in the Environment tab.
15.1.2 Import from .csv or .xls files
A common way of transmitting data is using “csv” (Comma Separated Values) files (with the file suffix of .csv). Most software packages will use this format and it’s probably the most common interchange format.
A short file might look like this:
Start,End,Name
1837,1901,Victoria
1901,1910,Edward VII
1910,1936,George V
1936,1936,Edward VIII
1936,1952,George VI
1952,2022,Elizabeth II
2022,,Charles III
where the rows in the file represent the data frame rows, and the columns are just separated by commas. The first row of the file is usually the column titles.
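As a minimal sketch of how such a file becomes a data frame, the base R function read.csv() can read the same text directly. Here a shortened copy of the sample is supplied inline through the text= argument so the example is self-contained; the object names csv_text and monarchs are made up for illustration.

```r
# A shortened copy of the sample file, supplied inline so no file on disk is needed.
csv_text <- "Start,End,Name
1837,1901,Victoria
1901,1910,Edward VII
2022,,Charles III"

# header = TRUE treats the first row as the column titles.
monarchs <- read.csv(text = csv_text, header = TRUE)

monarchs        # the empty End field for Charles III is read as NA
str(monarchs)   # shows the type R chose for each column
```

With a real file we would pass its path as the first argument instead of using text=.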
If the data is stored as an Excel file, we just tell R where the file is and which worksheet tab to import.
The best way to import a file is using the import wizard accessed via ‘File -> Import Dataset’. This will then give you a choice of file types to read from (.csv files are in the “Text” options). Navigate to the file you want to import, and click on import. Note that R will preview the file as you import it, and here’s a good time to check the settings to make sure that you are importing the right file, and the parameters are correct.
When you import a file using the import wizard, R generates code that does the actual import. We MUST copy that code into our R Script file or else the import won’t happen when we run the script. So only use the import wizard to generate the import code! The code generated by the import wizard ends with a View() command, which can be removed. The code that I’ll paste into my R Script file typically looks like this:
# NB Not real code as the file is not a real one
library(readxl)
lotto <- read_excel("/MyFiles/DataFile/WinningLotteryNumbers.xlsx")
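Since the lottery file above is not real, here is a sketch that should run as-is, assuming the readxl package is installed. It uses the demonstration workbook that ships with readxl (located with readxl_example()) and picks one worksheet tab with the sheet argument. The object names demo_path and chick are made up for illustration.

```r
library(readxl)

# readxl_example() returns the path to a demo workbook bundled with readxl.
demo_path <- readxl_example("datasets.xlsx")

# List the worksheet tabs in the workbook.
excel_sheets(demo_path)

# Import a single tab by name; a tab number would also work.
chick <- read_excel(demo_path, sheet = "chickwts")
head(chick)
```

For your own data, replace demo_path with the path to your file and the sheet name with the tab you want.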
15.1.3 Types of data in R
It is worth thinking briefly here about how R stores data. Remember the diamonds
data set:
library(ggplot2)
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y z
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710 Median : 3.530
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900 Max. :31.800
##
You will see in the output of the summary command above that R knows that a mean of price can be calculated, but that color, for example, is categorical data, so R does not attempt to find a mean color, which would not make sense.
R stores data as different types, and if we look at the structure of the diamonds
dataset using the str
command, we can see this structure:
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
carat, for example, is noted as num (numerical data, meaning continuous), whereas price is an int, or an integer, as price has only taken integer values. The unusual thing R does, as a statistical language, is to have a data type called a factor, which is how R stores categorical data. Factors can be ordered (ordinal data), as here with cut, color, and clarity, or unordered (nominal data).
Other data types that R uses are strings (the character type, which str() shows as chr) and logicals (shown as logi).
oneName <- "Kier"
myNames <- c("Tony", "Gordon", "David", "Jeremy")
leader <- TRUE
myFriend <- c(FALSE,TRUE,FALSE,TRUE)
R is generally pretty good at recognising which data type is which, but if we explicitly want to define something as categorical data, we usually have to tell R:
# This is just a vector containing two strings
y <- c("Bert","Ernie")
y
## [1] "Bert" "Ernie"
# Here we tell R that this is categorical data
z <- as.factor(c("Bert","Ernie"))
z
## [1] Bert Ernie
## Levels: Bert Ernie
More details on the types can be found in the appendix.
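To make the difference between unordered and ordered factors concrete, here is a small base R sketch. The answers vector is invented for illustration and is not part of any data set used in this chapter.

```r
# Hypothetical survey answers stored as plain strings
answers <- c("medium", "low", "high", "low")

# Unordered (nominal) factor: by default the levels are sorted alphabetically
nominal <- factor(answers)
levels(nominal)

# Ordered (ordinal) factor: we state the order of the levels ourselves
ordinal <- factor(answers, levels = c("low", "medium", "high"), ordered = TRUE)
ordinal

# Comparisons such as "less than" only make sense for ordered factors
ordinal < "high"

# class() reports how R is storing each object
class(answers)
class(nominal)
class(ordinal)
```

Setting the levels explicitly is what makes plots and summaries list the categories in a sensible order rather than alphabetically.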
15.2 More general graphs: ggplot2
We have seen how to do simple graphs in a previous chapter. A more general approach, and increasingly popular, is the ‘tidyverse’, which is a collection of modern packages that do some very useful things. One part of the tidyverse is an advanced graphing library called ggplot2.
To install the package the very first time, we either use the “Tools -> Install Packages…” menu and install tidyverse, or use the following command.
install.packages("tidyverse")
We must load the package into memory using the library command every time:
library(tidyverse)
To make the most of ggplot2
it is important to wrap your mind around “The Grammar of Graphics”. Briefly, the act of building a graph can be broken down into three steps.
Define what data set we are using.
What is the major relationship we wish to examine?
In what way should we present that relationship? These relationships can be presented in multiple ways, and the process of creating a good graph relies on building layers upon layers of information. For example, we might start with printing the raw data and then overlay a regression line over the top.
Next, it should be noted that ggplot2
is designed to act on data frames. It is actually hard to just draw three data points and for simple graphs it might be easier to use the base graphing system in R. However for any real data analysis project, the data will already be in a data frame and this is not an annoyance.
These notes are sufficient for creating simple graphs using ggplot2, but are not intended to be exhaustive. There are many places online to get help with ggplot2. One very nice resource is the website http://www.cookbook-r.com/Graphs/, which gives much of the information available in the book R Graphics Cookbook, which I highly recommend. A second option is simply to google your problem and see what you can find on websites such as StackExchange.
One way that ggplot2
makes it easy to form very complicated graphs is that it provides a large number of basic building blocks that, when stacked upon each other, can produce extremely complicated graphs. A full list is available at http://docs.ggplot2.org/current/ but the following list gives some idea of different building blocks. These different geometries are different ways to display the relationship between variables and can be combined in many interesting ways.
Geom | Description | Required Aesthetics |
---|---|---|
geom_histogram | A histogram | x |
geom_bar | A barplot (y is number of rows) | x |
geom_col | A barplot (y is given by a column) | x, y |
geom_density | A density plot of data (smoothed histogram) | x |
geom_boxplot | Boxplots | x, y |
geom_line | Draw a line (after sorting x-values) | x, y |
geom_path | Draw a line (without sorting x-values) | x, y |
geom_point | Draw points (for a scatterplot) | x, y |
geom_smooth | Add a ribbon that summarizes a scatterplot | x, y |
geom_ribbon | Enclose a region, and color the interior | x, ymin, ymax |
geom_errorbar | Error bars | x, ymin, ymax |
geom_text | Add text to a graph (without box) | x, y, label |
geom_label | Add text to a graph (with box) | x, y, label |
geom_tile | Create Heat map | x, y, fill |
A graph can be built up layer by layer, where:
- Each layer corresponds to a geom, each of which requires a dataset and a mapping between an aesthetic and a column of the data set.
  - If you don’t specify either, then the layer inherits everything defined in the ggplot() command.
  - You can have different datasets for each layer!
- Layers can be added with a +, or you can define two plots and add them together (second one over-writes anything that conflicts).
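The idea that a layer can either inherit the data set or bring its own can be sketched as follows, assuming ggplot2 is installed. The summary data frame species_means and the plot name layered are made up for this illustration; the first layer inherits both the data and the mapping from ggplot(), while the second keeps the mapping but swaps in a different data set.

```r
library(ggplot2)

# Hypothetical summary data: the mean of each measurement for each species.
species_means <- aggregate(cbind(Sepal.Length, Petal.Length) ~ Species,
                           data = iris, FUN = mean)

layered <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(color = "grey60") +               # inherits data and mapping from ggplot()
  geom_point(data = species_means, size = 4)   # same mapping, different data set

layered
```

This works because species_means has columns with the same names as those used in the inherited aes() mapping.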
15.2.1 Scatterplots
We’ll again use the iris
dataset to demonstrate here.
ggplot( data=iris, aes(x=Sepal.Length, y=Petal.Length) ) +
geom_point( )
- The data set we wish to use is specified using data=iris.
- The relationship we want to explore is x=Sepal.Length and y=Petal.Length. This means the x-axis will be the Sepal Length and the y-axis will be the Petal Length.
- The way we want to display this relationship is through graphing 1 point for every observation.
We can define other attributes that might reflect other aspects of the data. For example, we might want for the color of the data point to change dynamically based on the species of iris.
ggplot( data=iris, aes(x=Sepal.Length, y=Petal.Length) ) +
geom_point( aes(color=Species) )
The aes()
command (short for aesthetic) inside the previous section of code is quite mysterious. The way to think about the aes()
is that it gives you a way to define relationships that are data dependent. In the previous graph, the x-value and y-value for each point was defined dynamically by the data, as was the color. If we just wanted all the data points to be colored blue and larger, then the following code would do that
ggplot( data=iris, aes(x=Sepal.Length, y=Petal.Length) ) +
geom_point( color='blue', size=4 )
The important part isn’t that color and size were defined in the geom_point()
but that they were defined outside of an aes()
function!
- Anything set inside an aes() command will be of the form attribute=Column_Name and will change based on the data.
- Anything set outside an aes() command will be in the form attribute=value and will be fixed.
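A common slip that illustrates this rule is putting a fixed value inside aes(). The sketch below, assuming ggplot2 is installed, builds both versions so they can be compared; the plot names fixed_colour and mapped_colour are made up for illustration.

```r
library(ggplot2)

# Intended result: color is set outside aes(), so every point is drawn in blue.
fixed_colour <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(color = "blue")

# Pitfall: inside aes(), "blue" is treated as data (a category with one level),
# so ggplot2 chooses a color from its default palette and adds a legend.
mapped_colour <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = "blue"))

fixed_colour
mapped_colour
```

If a plot shows an unexpected color together with an odd one-entry legend, checking whether a constant has ended up inside aes() is a good first step.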
15.2.2 Box Plots
Boxplots are a common way to show a categorical variable on the x-axis and continuous on the y-axis.
ggplot(iris, aes(x=Species, y=Petal.Length)) +
geom_boxplot()
The boxes show the \(25^{th}\), \(50^{th}\), and \(75^{th}\) percentile and the lines coming off the box extend to the smallest and largest non-outlier observation.
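To see the numbers behind the boxes, the three percentiles can be computed directly in base R. This is a sketch for checking the plot rather than a description of ggplot2's internals; the object name quartiles is made up for illustration.

```r
# The 25th, 50th and 75th percentiles of Petal.Length within each species.
# These should correspond to the bottom, middle line and top of each box.
quartiles <- sapply(split(iris$Petal.Length, iris$Species),
                    quantile, probs = c(0.25, 0.50, 0.75))
quartiles
```

Each column of the result belongs to one species and each row to one percentile, so the values can be compared against the boxes in the graph.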
15.3 Faceting
The goal with faceting is to make many panels of graphics where each panel represents the same relationship between variables, but something changes between each panel. For example using the iris
dataset we could look at the relationship between Sepal.Length
and Petal.Length
either with all the data in one graph, or one panel per species.
library(ggplot2)
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length)) +
geom_point() +
facet_grid( . ~ Species )
The line facet_grid( formula ) tells ggplot2 to make panels, and the formula tells how to orient the panels. In R, formulas are always interpreted in the order y ~ x. Because I want the species to change as we go across the page, but don’t have anything I want to change vertically, we use . ~ Species to represent that. If we had wanted three graphs stacked, then we could use Species ~ . instead.
For a second example, we look at a dataset that examines the amount a waiter was tipped by 244 parties. Covariates that were measured include the day of the week, size of the party, total amount of the bill, amount tipped, whether there were smokers in the group, and the gender of the person paying the bill.
library(reshape)
## Warning: package 'reshape' was built under R version 4.1.3
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
data(tips, package='reshape')
head(tips)
## total_bill tip sex smoker day time size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
It is easy to look at the relationship between the size of the bill and the percentage tipped.
ggplot(tips, aes(x = total_bill, y = tip / total_bill )) +
geom_point()
Next we ask if there is a difference in tipping percent based on gender or day of the week by plotting this relationship for each combination of gender and day.
ggplot(tips, aes(x = total_bill, y = tip / total_bill, color=time )) +
geom_point() +
facet_grid( sex ~ day )
# facet_grid( day ~ sex ) # changing orientation emphasizes certain comparisons!
Sometimes we want multiple rows and columns of the facets, but there is only one categorical variable with many levels. In that case we use facet_wrap which takes a one-sided formula.
ggplot(tips, aes(x = total_bill, y = tip / total_bill )) +
geom_point() +
# facet_grid( . ~ day) # Four graphs in a row, Too Squished left/right!
facet_wrap( ~ day ) # spread graphs out both left/right and up/down.
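When a variable has more levels, it can help to tell facet_wrap() how many columns to use through its ncol argument. The sketch below assumes ggplot2 is installed and uses the diamonds data set from earlier in the chapter, since its clarity column has eight levels; the plot name wrapped is made up for illustration.

```r
library(ggplot2)

wrapped <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +           # transparent points, as there are many observations
  facet_wrap( ~ clarity, ncol = 4)    # eight clarity levels laid out in rows of four

wrapped
```

Leaving ncol out lets ggplot2 choose the layout itself, which is often good enough.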
Finally we can allow the x and y scales to vary between the panels by setting “free”, “free_x”, or “free_y”. In the following code, the y-axis scale changes between the gender groups.
ggplot(tips, aes(x = total_bill, y = tip / total_bill )) +
geom_point() +
facet_grid( sex ~ day, scales="free_y" )
15.4 Annotation
15.4.1 Axis Labels and Titles
To make a graph more understandable, it is necessary to tweak the axis labels and add a main title and such. Here we’ll adjust labels in a graph, including the legend labels.
# Save the graph before I add more to it.
P <- ggplot( data=iris, aes(x=Sepal.Length, y=Petal.Length, color=Species) ) +
  geom_point( aes(color=Species) ) +
  labs( title='Sepal Length vs Petal Length' ) +
  labs( x="Sepal Length (cm)", y="Petal Length (cm)" ) +
  labs( color="Species Name") +
  labs( caption = "Iris data from Edgar Anderson (1935)" )
# Print out the plot
P
You could either call the labs()
command repeatedly with each label, or you could provide multiple arguments to just one labs()
call.
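For comparison, here is the single-call version as a sketch, assuming ggplot2 is installed. It should produce the same labels as the repeated labs() calls; the name P2 is made up so that it does not overwrite the plot P used in the rest of this section.

```r
library(ggplot2)

P2 <- ggplot( data=iris, aes(x=Sepal.Length, y=Petal.Length, color=Species) ) +
  geom_point() +
  labs( title   = 'Sepal Length vs Petal Length',
        x       = "Sepal Length (cm)",
        y       = "Petal Length (cm)",
        color   = "Species Name",                          # legend title
        caption = "Iris data from Edgar Anderson (1935)" )

P2
```

Which style to use is a matter of taste; one call keeps all the labels in one place.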
15.4.2 Text Labels
One way to improve the clarity of a graph is to remove the legend and label the points directly on the graph. For example, we could instead have the species names near the cloud of data points for the species.
Usually our annotations aren’t stored in the data.frame
that contains our data of interest. So we need to either create a new (usually small) data.frame
that contains all the information needed to create the annotation or we need to set the necessary information in-place. Either way, we need to specify the x
and y
coordinates, and the label
to be printed, as well as any other attribute that is set in the global aes()
command. That means if color
has been set globally, the annotation layer also needs to address the color
attribute.
15.4.2.1 Using a data.frame
To do this in ggplot, we need to make a data frame that has the columns Sepal.Length
and Petal.Length
so that we can specify where each label should go, as well as the label that we want to print. Also, because color is matched to the Species
column, this small dataset should also have a Species
column.
This step always requires a bit of fussing with the graph because the text size and location should be chosen based on the size of the output graphic and if I rescale the image it often looks awkward. Typically I leave this step until the figure is being prepared for final publication.
# create another data frame that has the text labels I want to add to the graph.
annotation.data <- data.frame(
  Sepal.Length = c(4.5, 6.5, 7.0), # Figured out the label location by eye.
  Petal.Length = c(2.25, 3.75, 6.5), # If I rescale the graph, I would redo this step.
  Species = c('setosa', 'versicolor', 'virginica'),
  Text = c('SETOSA', 'VERSICOLOR', 'VIRGINICA')
)
# Use the previous plot I created, along with the
# aes() options already defined.
P +
  geom_text( data=annotation.data, aes(label=Text), size=2.5) + # write the labels
  theme( legend.position = 'none' ) # remove the legend
15.4.2.2 Setting attributes in-line
Instead of creating a new data frame, we could just add a new layer and set all of the graph attributes manually. To do this, we have to have one layer for each text we want to add to the graph. The annotate
function takes a geom layer type and the necessary inputs and allows us to avoid the annoyance of building a labels data frame.
P +
  annotate('text', x=4.5, y=2.25, size=6, color='#F8766D', label='SETOSA' ) +
  annotate('text', x=6.5, y=3.75, size=6, color='#00BA38', label='VERSICOLOR' ) +
  annotate('text', x=7.0, y=6.50, size=6, color='#619CFF', label='VIRGINICA' ) +
  theme(legend.position = 'none')
Finally there is a geom_label
layer that draws a nice box around what you want to print.
P +
  annotate('label', x=4.5, y=2.25, size=6, color='#F8766D', label='SETOSA' ) +
  annotate('label', x=6.5, y=3.50, size=6, color='#00BA38', label='VERSICOLOR' ) +
  annotate('label', x=7.0, y=6.75, size=6, color='#619CFF', label='VIRGINICA' ) +
  theme(legend.position = 'none')
My recommendation is to just set the x, y, and label attributes manually inside an annotate() call if you have one or two annotations to print on the graph. If you have many annotations to print, then create a data frame that contains all of them and use the data= argument in the geom to use that created annotation data set.
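When there are many labels, the positions do not have to be chosen by eye. As an alternative sketch (not the approach used above), the label data frame can be computed from the data by placing each label at the mean position of its species. This assumes ggplot2 is installed; the names label.data and labelled are made up for illustration, and labels placed this way will sit on top of the point clouds rather than beside them.

```r
library(ggplot2)

# Hypothetical label positions: the mean Sepal.Length and Petal.Length of each species.
label.data <- aggregate(cbind(Sepal.Length, Petal.Length) ~ Species,
                        data = iris, FUN = mean)
label.data$Text <- toupper(as.character(label.data$Species))

labelled <- ggplot( data=iris, aes(x=Sepal.Length, y=Petal.Length, color=Species) ) +
  geom_point( alpha=0.4 ) +                                       # faded points so labels stand out
  geom_label( data=label.data, aes(label=Text), size=3 ) +        # one boxed label per species
  theme( legend.position = 'none' )                               # remove the legend

labelled
```

Because label.data contains the Sepal.Length, Petal.Length and Species columns, the label layer can reuse the x, y and color mappings set in ggplot().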
15.5 Exercises
Examine the dataset trees, which should already be pre-loaded. Look at the help file using ?trees for more information about this data set. We wish to build a scatterplot that compares the height and girth of these cherry trees to the volume of lumber that was produced.
- Create a graph using ggplot2 with Height on the x-axis, Volume on the y-axis, and Girth as either the size of the data point or the color of the data point. Which do you think is a more intuitive representation?
- Add appropriate labels for the main title and the x and y axes.
- The R-squared value for a regression through these points is 0.36 and the p-value for the statistical significance of height is 0.00038. Add text labels “R-squared = 0.36” and “p-value = 0.0004” somewhere on the graph.
Consider the following small dataset that represents the number of times per day a mother played “Ring around the Rosy” with her daughter relative to the number of days since she has learned this game. The column yhat represents the best fitting line through the data, and lwr and upr represent a 95% confidence interval for the predicted value on that day. Because these questions ask you to produce several graphs and evaluate which is better and why, please include each graph and response with each sub-question.
Rosy <- data.frame(
  times = c(15, 11, 9, 12, 5, 2, 3),
  day = 1:7,
  yhat = c(14.36, 12.29, 10.21, 8.14, 6.07, 4.00, 1.93),
  lwr = c( 9.54, 8.5, 7.22, 5.47, 3.08, 0.22, -2.89),
  upr = c(19.18, 16.07, 13.2, 10.82, 9.06, 7.78, 6.75))
- Using ggplot() and geom_point(), create a scatterplot with day along the x-axis and times along the y-axis.
- Add a line to the graph where the x-values are the day values but now the y-values are the predicted values which we’ve called yhat. Notice that you have to set the aesthetic y=times for the points and y=yhat for the line. Because each geom_ will accept an aes() command, you can specify the y attribute to be different for different layers of the graph.
- Add a ribbon that represents the confidence region of the regression line. The geom_ribbon() function requires the x, ymin, and ymax columns to be defined. For examples of using geom_ribbon() see the online documentation: http://docs.ggplot2.org/current/geom_ribbon.html.
- What happened when you added the ribbon? Did some points get hidden? If so, why?
- Reorder the statements that created the graph so that the ribbon is on the bottom and the data points are on top and the regression line is visible.
- The color of the ribbon fill is ugly. Use Google to find a list of named colors available to ggplot2. For example, I googled “ggplot2 named colors” and found the following link: http://sape.inf.usi.ch/quick-reference/ggplot2/colour. Choose a color for the fill that is pleasing to you.
- Add labels for the x-axis and y-axis that are appropriate along with a main title.
We’ll next make some density plots that relate several factors towards the birth weight of a child. Because these questions ask you to produce several graphs and evaluate which is better and why, please include each graph and response with each sub-question.
- The MASS package contains a dataset called birthwt which contains information about 189 babies and their mothers. In particular there are columns for the mother’s race and smoking status during the pregnancy. Load the birthwt dataset by either using the data() command or loading the MASS library.
- Read the help file for the dataset using ?MASS::birthwt. The covariates race and smoke are not stored in a user friendly manner. For example, smoking status is labeled using a 0 or a 1. Because it is not obvious which should represent that the mother smoked, we’ll add better labels to the race and smoke variables. For more information about dealing with factors and their levels, see the Factors chapter in these notes.
  library(tidyverse)
  data('birthwt', package='MASS')
  birthwt <- birthwt %>% mutate(
    race = factor(race, labels=c('White','Black','Other')),
    smoke = factor(smoke, labels=c('No Smoke', 'Smoke')))
- Graph a histogram of the birth weights bwt using ggplot(birthwt, aes(x=bwt)) + geom_histogram().
- Make separate graphs that denote whether a mother smoked during pregnancy by appending a + facet_grid() command to your original graphing command.
- Perhaps race matters in relation to smoking. Make our grid of graphs vary with smoking status changing vertically, and race changing horizontally (that is, the formula in facet_grid() should have smoking be the y variable and race as the x).
- Remove race from the facet grid (so go back to the graph you had in part d). I’d like to next add an estimated density line to the graphs, but to do that, I need to first change the y-axis to be density (instead of counts), which we do by using aes(y=..density..) in the ggplot() aesthetics command.
- Next we can add the estimated smooth density using the geom_density() command.
- To really make this look nice, let’s change the fill color of the histograms to be something less dark; let’s use fill='cornsilk' and color='grey60'. To play with different colors that have names, check out the following: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.
- Change the order in which the histogram and the density line are added to the plot. Does it matter and which do you prefer?
- Finally consider if you should have the histograms side-by-side or one on top of the other (i.e. . ~ smoke or smoke ~ .). Which do you think better displays the decrease in mean birthweight and why?
Load the dataset ChickWeight, which comes preloaded in R, and get the background on the dataset by reading the manual page ?ChickWeight. Because these questions ask you to produce several graphs and evaluate which is better and why, please include each graph and response with each sub-question.
- Produce a separate scatter plot of weight vs age for each chick. Use color to distinguish the four different Diet treatments.
- We could examine these data by producing a scatterplot for each diet. Most of the code below is readable, but if we don’t add the group aesthetic the lines would not connect the dots for each Chick but would instead connect the dots across different chicks.
  data(ChickWeight)
  ggplot(ChickWeight, aes(x=Time, y=weight, group=Chick )) +
    geom_point() + geom_line() +
    facet_grid( ~ Diet)
There is a spreadsheet with the football (soccer) results from the English Premiership 2019-2020 season available at this url. Because the readr package doesn’t care whether a file is on your local computer or on the Internet, we’ll use this file.
- Start the import wizard using: “File -> Import Dataset -> From Text (readr) …” and input the above web URL. Click the update button near the top to cause the wizard to preview the result.
- Save the generated code to your R Script file and show the first few rows using the head() command.
- The goals scored by the home side are in the ‘FTHG’ column. Calculate the mean number of goals for the home side.
- The result of the match (Home win, Draw, Away Win) is in the FTR column. Produce a crosstab using the table command to show the number of times each HomeTeam achieved each result. Which team had the most draws in the season?
- Produce bar charts showing the number of goals scored in each game (from the FTHG column).