Chapter 11 Model Selection
Choosing a model manually is often difficult. Suppose we have ten possible regressor variables \(x_1,x_2,\ldots,x_{10}\), each of which may or may not appear in the model. Even with a first order model (where we allow terms such as \(\beta_1x_1\) and \(\beta_5x_5\), but not second order terms like \(\beta_{24}x_2x_4\)), each term can be in or out of the model, giving \(2^{10}=1024\) possible models; thus we need some automatic way of choosing between them.
There are many techniques we can use, but one popular approach is to assume that every term is needed in the model, and then sequentially delete terms until we find a good model. This is a backwards stepwise procedure, and the details are as follows:
- Start with full model, e.g. \(y=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_{10}x_{10}\).
- Find all terms whose removal from the model would improve some criterion.
- Among these, remove the term whose removal improves the criterion most.
- If removing no term improves the criterion, stop; otherwise repeat from the previous step.
This procedure is called a backwards stepwise regression (or backward elimination); a minimal code sketch of the loop appears below, after the criterion is introduced.
There are many criteria which can be used to determine which is the best model, and here I use the AIC (Akaike Information Criterion). The formula is \[\mbox{AIC}=2p-2\log L,\] where \(p\) is the number of parameters in the model and \(\log L\) is the log-likelihood of the model given the data observed. We want this to be as low as possible (it is often negative).
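As a sanity check on this formula, we can compute the AIC by hand for a fitted model and compare it with R's built-in AIC() function. This is a minimal sketch, using the built-in cars data purely for illustration:

# Computing AIC by hand for an illustrative model
fit<-lm(dist~speed, data=cars)     # built-in 'cars' data, purely illustrative
p<-attr(logLik(fit), "df")         # number of estimated parameters (including sigma)
2*p - 2*as.numeric(logLik(fit))    # AIC computed from the definition above
AIC(fit)                           # agrees with the manual calculation

One caveat: the step() function used below reports AIC values computed by extractAIC(), which for linear models omits an additive constant depending only on the number of observations; its values therefore differ from those of AIC(), but the ranking of models, and hence the model chosen, is unaffected.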
The effect of this criterion is to reward goodness of fit whilst simultaneously penalising the number of parameters, i.e. we get models which fit well, but have a low number of parameters. In general, statisticians like to work with simpler models, as we are simple people, and simpler models are easier to conceptualise.
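To make the backwards stepwise procedure concrete, here is a minimal sketch of the elimination loop written by hand using R's drop1(), which reports the AIC of each single-term deletion. The mtcars data and the starting model are purely illustrative; in practice we let step() automate this, as in the example below.

# A hand-rolled sketch of backward elimination by AIC (what step() automates)
full<-lm(mpg~., data=mtcars)            # illustrative full model on built-in data
repeat {
  d<-drop1(full)                        # AIC of <none> and of each single-term deletion
  best<-rownames(d)[which.min(d$AIC)]   # deletion giving the lowest AIC
  if (best=="<none>") break             # no deletion improves the AIC: stop
  full<-update(full, as.formula(paste(". ~ . -", best)))  # drop that term and repeat
}
formula(full)                           # the chosen model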
11.1 Example
Data on 100 executive salaries are listed in execsal2.dat. The dependent variable is the (logarithm of the) salary of each executive, and there are 10 possible explanatory variables, as below.
- y Salary of executive
- x1 Experience (in years)
- x2 Education (in years)
- x3 Gender (1 if male, 0 if female)
- x4 Number of employees supervised
- x5 Corporate assets (in millions of USD)
- x6 Board member (1 if yes, 0 if no)
- x7 Age (in years)
- x8 Company profits (in millions of USD)
- x9 Has international responsibility (1 if yes, 0 if no)
- x10 Company’s total sales (in millions of USD)
We want to find a model that shows how these variables determine executive salary.
# Read the data in. We will only use columns 2 to 12, so let's just call this exec.
exec<-read.table("https://www.dropbox.com/s/jefnudvc130xhzp/EXECSAL2.txt?dl=1",
                 header=TRUE)
exec<-exec[2:12]
head(exec)
## Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 11.4436 12 15 1 240 170 1 44 5 0 21
## 2 11.7753 25 14 1 510 160 1 53 9 0 28
## 3 11.3874 20 14 0 370 170 1 56 5 0 26
## 4 11.2172 3 19 1 170 170 1 26 9 0 24
## 5 11.6553 19 12 1 520 150 1 43 7 0 27
## 6 11.1619 14 13 0 420 160 1 53 9 0 27
We start with a full model containing all terms X1 to X10. We use
lm1<-lm(Y~ .,data=exec)
to fit the full model; the formula Y ~ . tells lm() to regress Y on all the other columns of exec.
lm1<-lm(Y~ .,data=exec)
# Perform stepwise backwards regression
slm1<-step(lm1)
## Start: AIC=-504.84
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
##
## Df Sum of Sq RSS AIC
## - X10 1 0.00063 0.51583 -506.71
## - X7 1 0.00073 0.51593 -506.70
## - X8 1 0.00153 0.51673 -506.54
## - X6 1 0.00482 0.52002 -505.91
## - X9 1 0.00984 0.52504 -504.94
## <none> 0.51520 -504.84
## - X5 1 0.08810 0.60330 -491.05
## - X2 1 0.41581 0.93102 -447.66
## - X4 1 0.63133 1.14653 -426.84
## - X3 1 0.99872 1.51393 -399.05
## - X1 1 1.43512 1.95032 -373.72
##
## Step: AIC=-506.71
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9
##
## Df Sum of Sq RSS AIC
## - X7 1 0.00050 0.51633 -508.62
## - X8 1 0.00149 0.51732 -508.43
## - X6 1 0.00448 0.52031 -507.85
## - X9 1 0.00992 0.52575 -506.81
## <none> 0.51583 -506.71
## - X5 1 0.08769 0.60352 -493.01
## - X2 1 0.41593 0.93176 -449.59
## - X4 1 0.63878 1.15461 -428.14
## - X3 1 1.03375 1.54959 -398.72
## - X1 1 1.52826 2.04409 -371.02
##
## Step: AIC=-508.62
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X8 + X9
##
## Df Sum of Sq RSS AIC
## - X8 1 0.0015 0.5178 -510.33
## - X6 1 0.0040 0.5203 -509.85
## - X9 1 0.0096 0.5260 -508.77
## <none> 0.5163 -508.62
## - X5 1 0.0898 0.6061 -494.58
## - X2 1 0.4243 0.9406 -450.64
## - X4 1 0.6384 1.1547 -430.13
## - X3 1 1.0503 1.5666 -399.62
## - X1 1 3.9764 4.4927 -294.27
##
## Step: AIC=-510.33
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X9
##
## Df Sum of Sq RSS AIC
## - X6 1 0.0033 0.5211 -511.69
## - X9 1 0.0089 0.5267 -510.64
## <none> 0.5178 -510.33
## - X5 1 0.0885 0.6064 -496.55
## - X2 1 0.4230 0.9408 -452.62
## - X4 1 0.6420 1.1598 -431.69
## - X3 1 1.0490 1.5668 -401.61
## - X1 1 3.9749 4.4927 -296.27
##
## Step: AIC=-511.69
## Y ~ X1 + X2 + X3 + X4 + X5 + X9
##
## Df Sum of Sq RSS AIC
## - X9 1 0.0093 0.5304 -511.93
## <none> 0.5211 -511.69
## - X5 1 0.0947 0.6159 -496.99
## - X2 1 0.4347 0.9558 -453.04
## - X4 1 0.6868 1.2079 -429.63
## - X3 1 1.0466 1.5677 -403.55
## - X1 1 3.9718 4.4929 -298.27
##
## Step: AIC=-511.93
## Y ~ X1 + X2 + X3 + X4 + X5
##
## Df Sum of Sq RSS AIC
## <none> 0.5304 -511.93
## - X5 1 0.0879 0.6183 -498.59
## - X2 1 0.4289 0.9594 -454.67
## - X4 1 0.6908 1.2212 -430.53
## - X3 1 1.0656 1.5961 -403.76
## - X1 1 3.9627 4.4932 -300.26
- This means that R considers the full model, which has AIC = -504.84.
- It considers removing each of the terms \(X_1\) to \(X_{10}\) individually, and the last column gives the AIC of the model with that variable removed.
- In this instance, removing any of \(X_6\) to \(X_{10}\) improves the model (produces a lower AIC), whereas removing any of \(X_1\) to \(X_5\) makes the model worse.
- Thus we remove \(X_{10}\) from the model (the variable whose removal improves the AIC most) and repeat the procedure.
- There is a lot more output, until we can no longer improve on the AIC of -511.93 by removing further variables, so our final model is \[Y_i=\beta_0+\beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + \beta_4X_{4i} + \beta_5X_{5i}+\epsilon_i.\] The final model chosen is stored in slm1.
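As an aside, the step-by-step trace above can be suppressed via step()'s trace argument:

slm1<-step(lm1, trace=0)   # same search, but with no printed trace

We can now inspect the chosen model with summary():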
summary(slm1)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4 + X5, data = exec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.201219 -0.056016 -0.003581 0.053656 0.187251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.9619345 0.1010567 98.578 < 2e-16 ***
## X1 0.0272762 0.0010293 26.501 < 2e-16 ***
## X2 0.0290921 0.0033367 8.719 9.71e-14 ***
## X3 0.2246932 0.0163503 13.742 < 2e-16 ***
## X4 0.0005244 0.0000474 11.064 < 2e-16 ***
## X5 0.0019623 0.0004972 3.947 0.000153 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07512 on 94 degrees of freedom
## Multiple R-squared: 0.9206, Adjusted R-squared: 0.9164
## F-statistic: 218.1 on 5 and 94 DF, p-value: < 2.2e-16
All the terms remaining (X1, X2, X3, X4, X5) appear highly significant, although of course we should investigate the residuals to check that the final proposed model is reasonable!
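A minimal sketch of such a residual check, using R's standard diagnostic plots for fitted lm objects:

# Standard diagnostic plots for the chosen model: residuals vs fitted values,
# normal Q-Q plot, scale-location plot, and residuals vs leverage
par(mfrow=c(2,2))
plot(slm1)
par(mfrow=c(1,1))   # reset the plotting layout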