Chapter 11 Model Selection

Choosing a model manually is often difficult. Suppose we have ten possible regressor variables \(x_1,x_2,\ldots,x_{10}\), each of which may or may not appear in the model. Even for a first order model (where we allow terms such as \(\beta_1x_1\) and \(\beta_5x_5\), but not second order terms like \(\beta_{24}x_2x_4\)), each term can be in or out of the model, giving \(2^{10}\) possible models. We therefore need some automatic way of choosing between them.
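To see the scale of the problem, we can count the candidate models directly (a small illustrative check, not part of the analysis):

# Each of the 10 regressors is either in or out of the model,
# so each candidate model is one in/out combination
k <- 10
2^k
## [1] 1024
nrow(expand.grid(rep(list(c(FALSE, TRUE)), k)))  # one row per candidate model
## [1] 1024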

There are many techniques we can use, but one popular one is to assume that every term is needed in the model, and then sequentially delete terms until we find a good model. This is a backwards stepwise procedure, and it works as follows:

  • Start with the full model, e.g. \(y=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\beta_{10}x_{10}\).
  • Consider removing each term from the model in turn, and evaluate some criterion for each reduced model.
  • Remove the term whose removal improves the criterion most, and repeat.
  • If removing none of the terms improves the criterion, then stop.

This procedure is called backwards stepwise regression (also known as backward elimination).
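To make the procedure concrete, here is a minimal hand-rolled sketch of the loop for a fitted lm object, using drop1() to compute the AIC of every single-term deletion (R's built-in step() function, used in the example below, does the same job and is what we use in practice):

backward_step <- function(fit) {
  repeat {
    tab  <- drop1(fit, k = 2)                   # AIC for <none> and each single-term deletion
    best <- rownames(tab)[which.min(tab$AIC)]   # candidate whose removal gives the lowest AIC
    if (best == "<none>") break                 # no deletion lowers the AIC: stop
    fit  <- update(fit, as.formula(paste("~ . -", best)))
  }
  fit
}

Applied to the full model lm1 fitted below, backward_step(lm1) should follow the same sequence of deletions as step(lm1).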

There are many criteria which can be used to determine which is the best model; here I use the AIC (Akaike Information Criterion), defined as \[\mbox{AIC}=2p-2\log L,\] where \(p\) is the number of parameters in the model and \(\log L\) is the log-likelihood of the model given the observed data.

We want this to be as low as possible (it is often negative). The effect of this criterion is to reward goodness of fit (through the \(\log L\) term) whilst simultaneously penalising the number of parameters (through the \(2p\) term); i.e. we get models which fit well, but have a low number of parameters. In general, statisticians like to work with simpler models, as we are simple people, and simpler models are easier to conceptualise.
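To check the definition in practice, we can compute the AIC by hand for any fitted lm object and compare it with R's built-in AIC() function (a small sketch using the built-in mtcars data; note that for a Gaussian linear model the parameter count \(p\) includes the error variance):

fit <- lm(mpg ~ wt + hp, data = mtcars)   # any fitted linear model will do
p <- attr(logLik(fit), "df")              # number of parameters, including the error variance
2 * p - 2 * as.numeric(logLik(fit))       # the AIC formula by hand
AIC(fit)                                  # agrees with the value above

One caveat: the step() function used below reports AIC values from extractAIC(), which for linear models omits additive constants; these differ from AIC() by a constant depending only on the sample size, so comparisons between models are unaffected.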

11.1 Example

Data on 100 executive salaries is given in execsal2.dat. The dependent variable is the (logarithm of the) salary of each executive, and there are 10 possible explanatory variables, as listed below.

  • y Salary of executive
  • x1 Experience (in years)
  • x2 Education (in years)
  • x3 Gender (1 if male 0 if female)
  • x4 Number of employees supervised
  • x5 Corporate assets (in millions of USD)
  • x6 Board member (1 if yes, 0 if no)
  • x7 Age (in years)
  • x8 Company profits (in millions of USD)
  • x9 Has international responsibility (1 if yes, 0 if no)
  • x10 Company’s total sales (in millions of USD)

We want to find a model that shows how these variables determine executive salary.

# Read the data in. We will only use columns 2 to 12, so let's just call this exec.
exec<-read.table("https://www.dropbox.com/s/jefnudvc130xhzp/EXECSAL2.txt?dl=1",
                 header=TRUE)
exec<-exec[2:12]
head(exec)
##         Y X1 X2 X3  X4  X5 X6 X7 X8 X9 X10
## 1 11.4436 12 15  1 240 170  1 44  5  0  21
## 2 11.7753 25 14  1 510 160  1 53  9  0  28
## 3 11.3874 20 14  0 370 170  1 56  5  0  26
## 4 11.2172  3 19  1 170 170  1 26  9  0  24
## 5 11.6553 19 12  1 520 150  1 43  7  0  27
## 6 11.1619 14 13  0 420 160  1 53  9  0  27

We start with a full model containing all terms X1 to X10. We use lm1<-lm(Y~ .,data=exec) to fit the full model; here the . on the right-hand side of the formula stands for all the other columns of the data frame.

lm1<-lm(Y~ .,data=exec)

# Perform stepwise backwards regression
slm1<-step(lm1)
## Start:  AIC=-504.84
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
## 
##        Df Sum of Sq     RSS     AIC
## - X10   1   0.00063 0.51583 -506.71
## - X7    1   0.00073 0.51593 -506.70
## - X8    1   0.00153 0.51673 -506.54
## - X6    1   0.00482 0.52002 -505.91
## - X9    1   0.00984 0.52504 -504.94
## <none>              0.51520 -504.84
## - X5    1   0.08810 0.60330 -491.05
## - X2    1   0.41581 0.93102 -447.66
## - X4    1   0.63133 1.14653 -426.84
## - X3    1   0.99872 1.51393 -399.05
## - X1    1   1.43512 1.95032 -373.72
## 
## Step:  AIC=-506.71
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9
## 
##        Df Sum of Sq     RSS     AIC
## - X7    1   0.00050 0.51633 -508.62
## - X8    1   0.00149 0.51732 -508.43
## - X6    1   0.00448 0.52031 -507.85
## - X9    1   0.00992 0.52575 -506.81
## <none>              0.51583 -506.71
## - X5    1   0.08769 0.60352 -493.01
## - X2    1   0.41593 0.93176 -449.59
## - X4    1   0.63878 1.15461 -428.14
## - X3    1   1.03375 1.54959 -398.72
## - X1    1   1.52826 2.04409 -371.02
## 
## Step:  AIC=-508.62
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X8 + X9
## 
##        Df Sum of Sq    RSS     AIC
## - X8    1    0.0015 0.5178 -510.33
## - X6    1    0.0040 0.5203 -509.85
## - X9    1    0.0096 0.5260 -508.77
## <none>              0.5163 -508.62
## - X5    1    0.0898 0.6061 -494.58
## - X2    1    0.4243 0.9406 -450.64
## - X4    1    0.6384 1.1547 -430.13
## - X3    1    1.0503 1.5666 -399.62
## - X1    1    3.9764 4.4927 -294.27
## 
## Step:  AIC=-510.33
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X9
## 
##        Df Sum of Sq    RSS     AIC
## - X6    1    0.0033 0.5211 -511.69
## - X9    1    0.0089 0.5267 -510.64
## <none>              0.5178 -510.33
## - X5    1    0.0885 0.6064 -496.55
## - X2    1    0.4230 0.9408 -452.62
## - X4    1    0.6420 1.1598 -431.69
## - X3    1    1.0490 1.5668 -401.61
## - X1    1    3.9749 4.4927 -296.27
## 
## Step:  AIC=-511.69
## Y ~ X1 + X2 + X3 + X4 + X5 + X9
## 
##        Df Sum of Sq    RSS     AIC
## - X9    1    0.0093 0.5304 -511.93
## <none>              0.5211 -511.69
## - X5    1    0.0947 0.6159 -496.99
## - X2    1    0.4347 0.9558 -453.04
## - X4    1    0.6868 1.2079 -429.63
## - X3    1    1.0466 1.5677 -403.55
## - X1    1    3.9718 4.4929 -298.27
## 
## Step:  AIC=-511.93
## Y ~ X1 + X2 + X3 + X4 + X5
## 
##        Df Sum of Sq    RSS     AIC
## <none>              0.5304 -511.93
## - X5    1    0.0879 0.6183 -498.59
## - X2    1    0.4289 0.9594 -454.67
## - X4    1    0.6908 1.2212 -430.53
## - X3    1    1.0656 1.5961 -403.76
## - X1    1    3.9627 4.4932 -300.26
  • This means that R first considers the full model, which has AIC=-504.84.
  • It considers removing each of the terms \(X_1\) to \(X_{10}\) individually, and the last column gives the AIC for the model with that variable removed.
  • Removing any of the variables \(X_6\) to \(X_{10}\) in this instance improves the model (produces a lower AIC), whereas removing any of \(X_1\) to \(X_5\) makes the model worse.
  • Thus we remove \(X_{10}\) from the model (the variable whose removal improves the AIC most) and repeat the procedure.
  • There is a lot more output until we can no longer improve the AIC of -511.93 by removing more variables, so our final model is \[Y_i=\beta_0+\beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + \beta_4X_{4i} + \beta_5X_{5i}+\epsilon_i.\] The final model chosen is stored in slm1.
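As well as the final fit, the object returned by step() records the sequence of deletions, which gives a compact summary of the trace above (output omitted):

# Record of the terms deleted at each step, with the resulting RSS and AIC
slm1$anova

We can then summarise the chosen model in the usual way: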
summary(slm1)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3 + X4 + X5, data = exec)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.201219 -0.056016 -0.003581  0.053656  0.187251 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.9619345  0.1010567  98.578  < 2e-16 ***
## X1          0.0272762  0.0010293  26.501  < 2e-16 ***
## X2          0.0290921  0.0033367   8.719 9.71e-14 ***
## X3          0.2246932  0.0163503  13.742  < 2e-16 ***
## X4          0.0005244  0.0000474  11.064  < 2e-16 ***
## X5          0.0019623  0.0004972   3.947 0.000153 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07512 on 94 degrees of freedom
## Multiple R-squared:  0.9206, Adjusted R-squared:  0.9164 
## F-statistic: 218.1 on 5 and 94 DF,  p-value: < 2.2e-16

All the terms remaining (X1, X2, X3, X4, X5) seem highly significant, although of course we should investigate the residuals to check that the final proposed model is reasonable!
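A minimal sketch of such a check, using R's standard diagnostic plots for the selected model:

# Residuals vs fitted, normal Q-Q, scale-location and residuals-vs-leverage plots
par(mfrow = c(2, 2))
plot(slm1)
par(mfrow = c(1, 1))   # reset the plotting layout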