Chapter 8 Further Hypothesis Testing
In the previous chapter we introduced some hypothesis tests; here we introduce just a couple more types of hypothesis test.
8.1 Tests for contingency tables
For tests on contingency tables, we'll demonstrate using another data set, which shows the composition of the UK House of Lords by gender as of February 2021.
library(readr)
HouseOfLords <- read_csv("https://www.dropbox.com/s/9ev9kg5vnm20x0m/HouseOfLords.csv?dl=1")
HouseOfLords
## # A tibble: 14 x 4
## `Party/Group` Men Women Total
## <chr> <dbl> <dbl> <dbl>
## 1 Conservative 194 68 262
## 2 Crossbench 136 47 183
## 3 Labour 118 62 180
## 4 Liberal Democrat 55 32 87
## 5 Non-affiliated 40 9 49
## 6 Bishops 21 5 26
## 7 Democratic Unionist Party 5 0 5
## 8 Green Party 0 2 2
## 9 Ulster Unionist Party 2 0 2
## 10 Conservative Independent 1 0 1
## 11 Independent Social Democrat 1 0 1
## 12 Labour Independent 0 1 1
## 13 Lord Speaker 1 0 1
## 14 Plaid Cymru 1 0 1
Essentially we see that the House of Lords is dominated by a few main parties. (The crossbenchers are not a political party, but sit as a group and work together loosely.) There are also some Church of England bishops, and some other small groups. We also see that there are more men than women.
We may wish to test whether the men:women ratio is different for different party groups.
To make it easy, let’s ignore the smaller groups, and remove the total:
HLReduced <- HouseOfLords[1:4, 2:3]
HLReduced
## # A tibble: 4 x 2
## Men Women
## <dbl> <dbl>
## 1 194 68
## 2 136 47
## 3 118 62
## 4 55 32
In order to test the hypothesis, all we need to do is run the chi-squared test:
chisq.test(HLReduced)
##
## Pearson's Chi-squared test
##
## data: HLReduced
## X-squared = 7.2133, df = 3, p-value = 0.0654
This is testing the hypotheses that
- \(H_0\): Gender is independent of Party/Group
- \(H_1\): Gender is not independent of Party/Group
We see that the p-value is 0.065402. We would therefore not reject the null hypothesis of independence at the 5% level of significance (but only just).
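It is worth seeing where the statistic comes from. chisq.test stores the expected counts it computes under \(H_0\) (row total \(\times\) column total divided by the grand total). A minimal sketch, re-entering the reduced table as a matrix so the snippet stands alone:

```r
# The reduced House of Lords table, with the counts shown above
HLReduced <- matrix(c(194, 68,
                      136, 47,
                      118, 62,
                       55, 32),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(c("Conservative", "Crossbench",
                                      "Labour", "Liberal Democrat"),
                                    c("Men", "Women")))
test <- chisq.test(HLReduced)
# Expected count for each cell = (row total * column total) / grand total
round(test$expected, 1)
# Observed minus expected shows which cells drive the statistic
round(test$observed - test$expected, 1)
```

The test statistic is then the familiar sum of (observed - expected)^2 / expected over all eight cells.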
The chi-squared test is very flexible, and useful for a lot of situations, but there are some flaws. Let’s see what happens if we do the test for a bigger data set, including all the minor parties:
HLFull <- HouseOfLords[, 2:3]
HLFull
## # A tibble: 14 x 2
## Men Women
## <dbl> <dbl>
## 1 194 68
## 2 136 47
## 3 118 62
## 4 55 32
## 5 40 9
## 6 21 5
## 7 5 0
## 8 0 2
## 9 2 0
## 10 1 0
## 11 1 0
## 12 0 1
## 13 1 0
## 14 1 0
chisq.test(HLFull)
## Warning in chisq.test(HLFull): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: HLFull
## X-squared = 23.18, df = 13, p-value = 0.03957
Note how the p-value has changed quite a lot, even though we have added very little data. The reason is that the chi-squared test does not cope well with very small expected counts in the cells, so when doing this by hand we would group categories together (here we could, for example, combine the smallest groups into an "other" category). Although flexible, the chi-squared test has some faults!
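The grouping fix is easy to carry out in R. As a sketch, re-entering the counts so the snippet stands alone, and combining everything below the Bishops row into a single "other" row (a slightly coarser grouping than strictly necessary, chosen to lift the expected counts):

```r
# Full Men/Women counts, in the order shown in the table above
Men   <- c(194, 136, 118, 55, 40, 21, 5, 0, 2, 1, 1, 0, 1, 1)
Women <- c( 68,  47,  62, 32,  9,  5, 0, 2, 0, 0, 0, 1, 0, 0)
# Keep the six largest groups; pool the eight tiny ones into "other"
HLGrouped <- rbind(cbind(Men, Women)[1:6, ],
                   other = c(sum(Men[7:14]), sum(Women[7:14])))
HLGrouped
chisq.test(HLGrouped)
```

This reduces (though, with such tiny groups, may not entirely remove) the small-expected-count problem that triggered the warning above.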
For very small data sets, Fisher's Exact Test is an alternative: it does not make the distributional assumptions that the chi-squared test does.
fisher.test(HLReduced)
##
## Fisher's Exact Test for Count Data
##
## data: HLReduced
## p-value = 0.06611
## alternative hypothesis: two.sided
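For bigger tables the exact test becomes computationally heavy. fisher.test can instead estimate the p-value by Monte Carlo simulation via its simulate.p.value argument; a sketch on the full table (the simulated p-value will vary a little from run to run):

```r
# Full Men/Women counts, in the order shown earlier
Men   <- c(194, 136, 118, 55, 40, 21, 5, 0, 2, 1, 1, 0, 1, 1)
Women <- c( 68,  47,  62, 32,  9,  5, 0, 2, 0, 0, 0, 1, 0, 0)
HLFull <- cbind(Men, Women)
# B sets the number of Monte Carlo replicates
fisher.test(HLFull, simulate.p.value = TRUE, B = 10000)
```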
8.1.1 Forming tables quickly and doing a chi-squared test.
Note that we can also use the table command to form tables of data very quickly. Here we see whether the Full Time Result varies by team ("H" = Home Win, "A" = Away Win, "D" = Draw).
library(readr)
football <- read_csv("https://www.football-data.co.uk/mmz4281/1819/E0.csv")
## Rows: 380 Columns: 62
## -- Column specification ---------------------------------------------------------------------
## Delimiter: ","
## chr (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl (55): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, HR, AR, B365H...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
table(football$FTR,football$HomeTeam)
##
## Arsenal Bournemouth Brighton Burnley Cardiff Chelsea Crystal Palace Everton Fulham
## A 2 6 8 10 11 1 9 5 10
## D 3 5 5 2 2 6 5 4 3
## H 14 8 6 7 6 12 5 10 6
##
## Huddersfield Leicester Liverpool Man City Man United Newcastle Southampton Tottenham
## A 14 8 0 1 3 10 6 5
## D 3 3 2 0 6 1 8 2
## H 2 8 17 18 10 8 5 12
##
## Watford West Ham Wolves
## A 8 6 5
## D 3 4 4
## H 8 9 10
chisq.test(table(football$FTR,football$HomeTeam))
## Warning in chisq.test(table(football$FTR, football$HomeTeam)): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: table(football$FTR, football$HomeTeam)
## X-squared = 95.34, df = 38, p-value = 7.893e-07
What do we conclude?
8.2 Non-parametric tests
We have so far looked at parametric tests: each test's critical values come from some distribution (e.g. Z, t, \(\chi^2\)), and these are parametric distributions (they are defined by a small number of parameters).
Most of these tests rely on a reasonably large sample size \(n\); so what do we do if we have only a small amount of data?
For example, if we're manufacturing a new chemical reagent, we might only be able to make or test a small number of samples. We solve this problem with non-parametric testing.
Essentially, we do not need to know anything about the distribution of the underlying statistic.
8.2.1 An example of a non-parametric test
We are testing wind turbine blades to find out which produces the most power when attached to a turbine. The experiment is expensive to conduct, so we can only test 12 blades: six are bought from a Scottish supplier (S), and six from a Chinese supplier (C).
We put the power produced in one week by the 12 turbines in ascending order, according to a long trial for each. The claim is that the Chinese supplier produces better turbines.
In this example, we might not be able to use a parametric test: we only have a very small sample size, and we don’t have any knowledge of the distribution of the wind power.
Let us say that the order of the turbines, from least power to most, is:
S | S | S | C | C | C | S | C | S | C | S | C |
---|---|---|---|---|---|---|---|---|---|---|---|
45 | 49 | 51 | 52 | 54 | 55 | 57 | 59 | 61 | 63 | 64 | 68 |
Here, we could use a non-parametric test called the Wilcoxon rank-sum test (sometimes presented as the Mann-Whitney U test, which is identical in effect but uses a slightly different method).
Our null hypothesis is that both samples come from the same distribution.
For each C, we add up its rank, giving RankSum = 4 + 5 + 6 + 8 + 10 + 12 = 45.
Our test statistic \(W\) is given by \(\mbox{RankSum}-\frac{n_c(n_c+1)}{2}=45-\frac{6\times 7}{2}=24\). R will calculate the probability of an (adjusted) rank sum at least this large under the null hypothesis that the turbines are the same (i.e. that the ordering of C and S is random).
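The rank-sum calculation can be reproduced directly in R, with the power values and supplier labels taken from the table above:

```r
power    <- c(45, 49, 51, 52, 54, 55, 57, 59, 61, 63, 64, 68)
supplier <- c("S", "S", "S", "C", "C", "C", "S", "C", "S", "C", "S", "C")
RankSum <- sum(rank(power)[supplier == "C"])  # 4 + 5 + 6 + 8 + 10 + 12 = 45
W <- RankSum - 6 * (6 + 1) / 2                # the adjusted statistic, 24
```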
It’s very simple to do in R:
C <- c(68, 63, 59, 55, 54, 52)
S <- c(64, 61, 57, 51, 49, 45)
wilcox.test(C, S, paired = FALSE, alternative = "greater")
##
## Wilcoxon rank sum exact test
##
## data: C and S
## W = 24, p-value = 0.197
## alternative hypothesis: true location shift is greater than 0
As with previous tests, we just need to look at the p-value (0.197): there is not enough evidence to reject \(H_0\), so we cannot conclude that the Chinese supplier is better than the Scottish one.
8.3 Confidence intervals, Hypothesis Testing, p-values
We’ve talked about confidence intervals, hypothesis testing, and using p-values as separate things. However, each of these uses the same distributions; so saying that \(\mu=\mu_0\) is not in a 95% confidence interval, or that \(\mu\ne\mu_0\) at the 5% level of significance, or that \(\mu\ne\mu_0\) with a p-value of less than \(5\%\) are all equivalent.
As long as we clearly present our hypotheses, our assumptions, and our methods of testing, all are valid.
p-values are becoming increasingly popular, as we can present one number and let the reader determine whether it is good enough. For example, "population A has a bigger mean than population B when we do a t-test to compare two means, with a p-value of 0.042" is a useful statement.
Remember, we can look at the output of a t.test (and many other hypothesis tests in R) to get either a p-value or a confidence interval:
library(ggplot2)
largeCars <- mpg[mpg$cyl >= 6, ] # Select cars with 6 or 8 cylinders.
t1 <- t.test(largeCars$cty, largeCars$hwy)
t1
##
## Welch Two Sample t-test
##
## data: largeCars$cty and largeCars$hwy
## t = -14.241, df = 239.18, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.692458 -5.065931
## sample estimates:
## mean of x mean of y
## 14.50336 20.38255
Here we are testing the hypotheses:
- \(H_0\): The mean mpg of cars in the city is the same as that on the highway.
- \(H_1\): The mean mpg of cars in the city is different from that on the highway.
- Either (1) we can test this formally, and present the p-value. As the p-value is incredibly small, we have strong evidence to reject the null hypothesis.
- Or (2) we can report the 95% confidence interval for the difference in mean mpg from city to highway, (-6.69, -5.07). This means that the mean city mpg is between 5.07 and 6.69 lower than the mean highway mpg. As this confidence interval does not contain zero, we can reject \(H_0\) at the 5% level of significance.
The first statement may be more useful if we simply wish to test for a difference. In many contexts the second is more useful, as we then have some measure of how big the difference is likely to be. R does not mind which we use; as we have seen, the same command gives us both the hypothesis test and the confidence interval.
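The components of the htest object that t.test returns can also be extracted directly, which is handy when reporting results programmatically; a quick sketch using the same data:

```r
library(ggplot2)                 # provides the built-in mpg data set
largeCars <- mpg[mpg$cyl >= 6, ] # cars with 6 or 8 cylinders
t1 <- t.test(largeCars$cty, largeCars$hwy)
t1$p.value                       # just the p-value
t1$conf.int                      # just the 95% confidence interval
```

The other tests above (chisq.test, wilcox.test, fisher.test) return the same kind of object, so their p-values can be extracted in the same way.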
8.3.1 Using Hypothesis Tests in Practice
There are myriad other tests that can be done, and it's really a question of picking the appropriate test off a menu. We need to know:
- The type of our data (e.g. categorical, continuous, discrete)
- Any assumptions we can make about our populations
- What we want to test (our hypotheses).
For particular cases, it’s best to consult an experienced statistician in the field, normally before you do the experiment. For practical experimentation, they can also advise on things like sample sizes, and will tell you how much work you need to do to get a result with sufficient statistical power.
8.4 Exercises
- The built-in data set warpbreaks lists the number of breaks in yarn during weaving, for different types of wool (A, B) and different tensions (L, M, H).
- Examine the data and present appropriate graph(s) to see what factors make a difference to the number of breaks.
- Put the data in the form of a table by using the xtabs command:
xtabs(breaks ~ tension + wool, data = warpbreaks)
## wool
## tension A B
## L 401 254
## M 216 259
## H 221 169
and use a chi-squared test of independence to see if there are differences in the number of breaks for the different wool and tension combinations. Also perform a Fisher's Exact Test for these data. What are your conclusions?
- Which wool and tension would you suggest to the mill owner to minimise the number of breaks?
- A business analyst suggests ignoring the tension, and just performing a t-test to see if wool A or wool B produces a different number of breaks. Perform this test, and explain why its conclusion may be misleading.
- The amount of yield in a chemical process is sampled over 8 randomly chosen days for each of two different catalysts, A and B. The data are presented below
A
## [1] 19 22 21 21 24 11 20 17
B
## [1] 34 20 22 33 34 16 32 22
- Perform a t-test to test whether there is a difference between the chemical processes. What assumptions are necessary for a t-test?
- Perform a Wilcoxon rank-sum test to test whether both samples come from the same distribution. What assumptions are necessary for such a test?
- At the 5% level of significance, do your results for the two tests agree? Why/why not?
- It becomes clear that the yields reported above have been presented as log yields, and the variables should be replaced by their exponentials (exp(A), exp(B)). If we test these exponentiated results, does it change your conclusions?