Chapter 11 Model Selection
Choosing a model manually is often difficult. Suppose we have ten possible regressor variables \(x_1,x_2,\ldots,x_{10}\), each of which may or may not appear in the model. Even with a first order model (where we allow terms such as \(\beta_1x_1\) and \(\beta_5x_5\), but not second order terms like \(\beta_{24}x_2x_4\)), each term can be in or out of the model, giving \(2^{10}=1024\) possible models; thus we need some automatic way of choosing between them.
There are many techniques we can use, but one popular approach is to assume that every term is needed in the model, and then sequentially delete terms until we find a good model. This is a backwards stepwise procedure, and the details are as follows:
- Start with full model, e.g. \(y=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_{10}x_{10}\).
- Find all terms whose removal from the model would improve some criterion.
- Among these, remove the term whose removal improves the criterion most.
- If removing no term improves the criterion, stop; otherwise repeat from the previous step.
This procedure is called a backwards stepwise regression (or backward elimination); a minimal code sketch of the loop appears below, after the criterion is introduced.
There are many criteria which can be used to determine which is the best model, and here I use the AIC (Akaike Information Criterion). The formula is \[\mbox{AIC}=2p-2\log L,\] where \(p\) is the number of parameters in the model and \(\log L\) is the log-likelihood of the model given the data observed. We want this to be as low as possible (it is often negative).
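As a sanity check on this formula, we can compute the AIC by hand for a fitted model and compare it with R's built-in AIC() function. This is a minimal sketch, using the built-in cars data purely for illustration:

# Computing AIC by hand for an illustrative model
fit<-lm(dist~speed, data=cars)     # built-in 'cars' data, purely illustrative
p<-attr(logLik(fit), "df")         # number of estimated parameters (including sigma)
2*p - 2*as.numeric(logLik(fit))    # AIC computed from the definition above
AIC(fit)                           # agrees with the manual calculation

One caveat: the step() function used below reports AIC values computed by extractAIC(), which for linear models omits an additive constant depending only on the number of observations; its values therefore differ from those of AIC(), but the ranking of models, and hence the model chosen, is unaffected.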
The effect of this criterion is to reward goodness of fit whilst simultaneously penalising the number of parameters, i.e. we get models which fit well, but have a low number of parameters. In general, statisticians like to work with simpler models, as we are simple people, and simpler models are easier to conceptualise.
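To make the backwards stepwise procedure concrete, here is a minimal sketch of the elimination loop written by hand using R's drop1(), which reports the AIC of each single-term deletion. The mtcars data and the starting model are purely illustrative; in practice we let step() automate this, as in the example below.

# A hand-rolled sketch of backward elimination by AIC (what step() automates)
full<-lm(mpg~., data=mtcars)            # illustrative full model on built-in data
repeat {
  d<-drop1(full)                        # AIC of <none> and of each single-term deletion
  best<-rownames(d)[which.min(d$AIC)]   # deletion giving the lowest AIC
  if (best=="<none>") break             # no deletion improves the AIC: stop
  full<-update(full, as.formula(paste(". ~ . -", best)))  # drop that term and repeat
}
formula(full)                           # the chosen model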
11.1 Example
Data on 100 executive salaries are listed in execsal2.dat. The dependent variable is the (logarithm of the) salary of each executive, and there are 10 possible explanatory variables, as below.
- y Salary of executive
- x1 Experience (in years)
- x2 Education (in years)
- x3 Gender (1 if male, 0 if female)
- x4 Number of employees supervised
- x5 Corporate assets (in millions of USD)
- x6 Board member (1 if yes, 0 if no)
- x7 Age (in years)
- x8 Company profits (in millions of USD)
- x9 Has international responsibility (1 if yes, 0 if no)
- x10 Company’s total sales (in millions of USD)
We want to find a model that shows how these variables determine executive salary.
# Read the data in. We will only use columns 2 to 12, so let's just call this exec.
exec<-read.table("https://www.dropbox.com/s/jefnudvc130xhzp/EXECSAL2.txt?dl=1",
                 header=TRUE)
exec<-exec[2:12]
head(exec)
## Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 11.4436 12 15 1 240 170 1 44 5 0 21
## 2 11.7753 25 14 1 510 160 1 53 9 0 28
## 3 11.3874 20 14 0 370 170 1 56 5 0 26
## 4 11.2172 3 19 1 170 170 1 26 9 0 24
## 5 11.6553 19 12 1 520 150 1 43 7 0 27
## 6 11.1619 14 13 0 420 160 1 53 9 0 27
We start with a full model containing all terms X1 to X10. We use
lm1<-lm(Y~ .,data=exec)
to fit the full model; the formula Y ~ . tells lm() to regress Y on all the other columns of exec.
lm1<-lm(Y~ .,data=exec)
# Perform stepwise backwards regression
slm1<-step(lm1)
## Start: AIC=-504.84
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
##
## Df Sum of Sq RSS AIC
## - X10 1 0.00063 0.51583 -506.71
## - X7 1 0.00073 0.51593 -506.70
## - X8 1 0.00153 0.51673 -506.54
## - X6 1 0.00482 0.52002 -505.91
## - X9 1 0.00984 0.52504 -504.94
## <none> 0.51520 -504.84
## - X5 1 0.08810 0.60330 -491.05
## - X2 1 0.41581 0.93102 -447.66
## - X4 1 0.63133 1.14653 -426.84
## - X3 1 0.99872 1.51393 -399.05
## - X1 1 1.43512 1.95032 -373.72
##
## Step: AIC=-506.71
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9
##
## Df Sum of Sq RSS AIC
## - X7 1 0.00050 0.51633 -508.62
## - X8 1 0.00149 0.51732 -508.43
## - X6 1 0.00448 0.52031 -507.85
## - X9 1 0.00992 0.52575 -506.81
## <none> 0.51583 -506.71
## - X5 1 0.08769 0.60352 -493.01
## - X2 1 0.41593 0.93176 -449.59
## - X4 1 0.63878 1.15461 -428.14
## - X3 1 1.03375 1.54959 -398.72
## - X1 1 1.52826 2.04409 -371.02
##
## Step: AIC=-508.62
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X8 + X9
##
## Df Sum of Sq RSS AIC
## - X8 1 0.0015 0.5178 -510.33
## - X6 1 0.0040 0.5203 -509.85
## - X9 1 0.0096 0.5260 -508.77
## <none> 0.5163 -508.62
## - X5 1 0.0898 0.6061 -494.58
## - X2 1 0.4243 0.9406 -450.64
## - X4 1 0.6384 1.1547 -430.13
## - X3 1 1.0503 1.5666 -399.62
## - X1 1 3.9764 4.4927 -294.27
##
## Step: AIC=-510.33
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X9
##
## Df Sum of Sq RSS AIC
## - X6 1 0.0033 0.5211 -511.69
## - X9 1 0.0089 0.5267 -510.64
## <none> 0.5178 -510.33
## - X5 1 0.0885 0.6064 -496.55
## - X2 1 0.4230 0.9408 -452.62
## - X4 1 0.6420 1.1598 -431.69
## - X3 1 1.0490 1.5668 -401.61
## - X1 1 3.9749 4.4927 -296.27
##
## Step: AIC=-511.69
## Y ~ X1 + X2 + X3 + X4 + X5 + X9
##
## Df Sum of Sq RSS AIC
## - X9 1 0.0093 0.5304 -511.93
## <none> 0.5211 -511.69
## - X5 1 0.0947 0.6159 -496.99
## - X2 1 0.4347 0.9558 -453.04
## - X4 1 0.6868 1.2079 -429.63
## - X3 1 1.0466 1.5677 -403.55
## - X1 1 3.9718 4.4929 -298.27
##
## Step: AIC=-511.93
## Y ~ X1 + X2 + X3 + X4 + X5
##
## Df Sum of Sq RSS AIC
## <none> 0.5304 -511.93
## - X5 1 0.0879 0.6183 -498.59
## - X2 1 0.4289 0.9594 -454.67
## - X4 1 0.6908 1.2212 -430.53
## - X3 1 1.0656 1.5961 -403.76
## - X1 1 3.9627 4.4932 -300.26
- This means that R considers the full model, which has AIC = -504.84.
- It considers removing each of the terms \(X_1\) to \(X_{10}\) individually, and the last column gives the AIC of the model with that variable removed.
- In this instance, removing any of \(X_6\) to \(X_{10}\) improves the model (produces a lower AIC), whereas removing any of \(X_1\) to \(X_5\) makes the model worse.
- Thus we remove \(X_{10}\) from the model (the variable whose removal improves the AIC most) and repeat the procedure.
- There is a lot more output, until we can no longer improve on the AIC of -511.93 by removing further variables, so our final model is \[Y_i=\beta_0+\beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + \beta_4X_{4i} + \beta_5X_{5i}+\epsilon_i.\] The final model chosen is stored in slm1.
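As an aside, the step-by-step trace above can be suppressed via step()'s trace argument:

slm1<-step(lm1, trace=0)   # same search, but with no printed trace

We can now inspect the chosen model with summary():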
summary(slm1)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4 + X5, data = exec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.201219 -0.056016 -0.003581 0.053656 0.187251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.9619345 0.1010567 98.578 < 2e-16 ***
## X1 0.0272762 0.0010293 26.501 < 2e-16 ***
## X2 0.0290921 0.0033367 8.719 9.71e-14 ***
## X3 0.2246932 0.0163503 13.742 < 2e-16 ***
## X4 0.0005244 0.0000474 11.064 < 2e-16 ***
## X5 0.0019623 0.0004972 3.947 0.000153 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07512 on 94 degrees of freedom
## Multiple R-squared: 0.9206, Adjusted R-squared: 0.9164
## F-statistic: 218.1 on 5 and 94 DF, p-value: < 2.2e-16
All the terms remaining (X1, X2, X3, X4, X5) appear highly significant, although of course we should investigate the residuals to check that the final proposed model is reasonable!
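A minimal sketch of such a residual check, using R's standard diagnostic plots for fitted lm objects:

# Standard diagnostic plots for the chosen model: residuals vs fitted values,
# normal Q-Q plot, scale-location plot, and residuals vs leverage
par(mfrow=c(2,2))
plot(slm1)
par(mfrow=c(1,1))   # reset the plotting layout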