Keywords: multiple linear regression

Multiple Predictors (Inputs)

Going back to our first example in ‘Simple Linear Regression’, we have three inputs (the advertising budgets for TV, Radio, and Newspaper) and one output (Sales).

The first problem we come across is which model to employ.

There are two natural strategies:

  1. extending the one-input model to a multiple-input model
  2. building several models, each of which contains only one input and the output.

The second strategy just repeats our discussion in ‘Simple Linear Regression’: for each model we can get the coefficients and the related statistics, such as the ‘t-statistic’ and the ‘p-value’. Doing so gives the following results (the data are copied from the book ‘An Introduction to Statistical Learning’ [1]):

| TV-Sales | Estimate | Standard Error | t-statistic | p-value |
|---|---|---|---|---|
| $w_0$ | $7.0325$ | $0.4578$ | $15.36$ | $<0.0001$ |
| $w_1$ | $0.0475$ | $0.0027$ | $17.67$ | $<0.0001$ |

| Radio-Sales | Estimate | Standard Error | t-statistic | p-value |
|---|---|---|---|---|
| $w_0$ | $9.312$ | $0.563$ | $16.54$ | $<0.0001$ |
| $w_1$ | $0.203$ | $0.020$ | $9.92$ | $<0.0001$ |

| Newspaper-Sales | Estimate | Standard Error | t-statistic | p-value |
|---|---|---|---|---|
| $w_0$ | $12.351$ | $0.621$ | $19.88$ | $<0.0001$ |
| $w_1$ | $0.055$ | $0.017$ | $3.30$ | $0.00115$ |

These three models describe the relationship between each single input and ‘Sales’, and each of them can be used separately to predict an unknown ‘Sales’ value. However, three models produce three predictions, and it is not clear which one should be recorded as the final prediction. Averaging the three predictions, or combining them in some other way, could resolve this, but there is no evidence showing which kind of combination gives a precise prediction.

So instead we build a single model that contains all the predictors. It is written as:

$$Y=w_0+w_1X_1+w_2X_2+\dots+w_nX_n+\varepsilon \tag{1}$$

The single-input model we learned before is a special case of equation (1) with $n=1$, such as:

$$\text{Sales} =w_0+ w_1 \times \text{TV} \tag{2}$$

In the multiple linear regression model, $w_i$ is the slope of $X_i$: it tells how much $Y$ changes on average when $X_i$ increases by one unit while every other $X_j$ with $j\neq i$ is held fixed.


How to calculate each coefficient of the model is not covered in this article; different methods for solving this problem can be found in ‘Simple Linear Regression’. Though we do not investigate a particular algorithm, we assume that we have obtained correct parameters by the least squares method, whose objective function is still $\text{RSS}$:

$$\text{RSS}=\sum_{i}(y_i-\hat{w_0}-\hat{w_1}x_{i1}-\dots-\hat{w_n}x_{in})^2$$

where $x_{ij}$ is the value of $X_j$ in the $i$-th observation and $\begin{bmatrix}\hat{w_0},\hat{w_1},\dots, \hat{w_n}\end{bmatrix}$ minimizes $\text{RSS}$.
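
As a minimal sketch of this least squares step, the following Python snippet fits a three-predictor model with NumPy's `np.linalg.lstsq`. The data are synthetic and the "true" coefficients in `true_w` are made up for illustration; they are not the advertising data from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the real advertising data set).
m = 200
X = rng.uniform(0, 100, size=(m, 3))          # columns play the role of X_1, X_2, X_3
true_w = np.array([3.0, 0.05, 0.2, 0.0])      # w_0, w_1, w_2, w_3 (made-up values)
y = true_w[0] + X @ true_w[1:] + rng.normal(0, 1.0, size=m)

# Prepend a column of ones so that w_0 is estimated as the intercept.
A = np.column_stack([np.ones(m), X])

# Least squares: find w_hat minimizing RSS = sum((y - A @ w)^2).
w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

rss = float(np.sum((y - A @ w_hat) ** 2))
print("w_hat:", np.round(w_hat, 3))
print("RSS:", round(rss, 2))
```

Any other coefficient vector gives a larger $\text{RSS}$ on the same data, which is what "minimizes" means here.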

Here we analyze the model because, up to now, we still don’t know whether every predictor is necessary for the model. In other words, we want to confirm that every predictor contributes to the model.

Let’s go back to the ‘TV, Radio, Newspaper – Sales’ example.

$$\text{Sales}= w_0 + w_1\times \text{TV}+ w_2\times \text{Radio}+ w_3\times \text{Newspaper}$$

and we get the parameters:

| $\hat{w_0}$ | $\hat{w_1}$ | $\hat{w_2}$ | $\hat{w_3}$ |
|---|---|---|---|
| $2.939$ | $0.046$ | $0.189$ | $-0.001$ |

Looking at the table above, we find that $\hat{w_3}$ is a little weird: its absolute value is much smaller than those of $\hat{w_1}$ and $\hat{w_2}$. So we list all the statistics that assess the accuracy of the parameters:

| | Coefficient | SE | t-statistic | p-value |
|---|---|---|---|---|
| $\hat{w_0}$ (Intercept) | 2.939 | 0.3119 | 9.42 | $<0.0001$ |
| $\hat{w_1}$ (TV) | 0.046 | 0.0014 | 32.81 | $<0.0001$ |
| $\hat{w_2}$ (Radio) | 0.189 | 0.0086 | 21.89 | $<0.0001$ |
| $\hat{w_3}$ (Newspaper) | -0.001 | 0.0059 | -0.18 | $0.8599$ |

The ‘p-value’ and ‘t-statistic’ of $\hat{w_3}$ have changed a lot from the ‘Newspaper vs. Sales’ model. This indicates that in multiple linear regression the predictors (inputs) affect each other. When ‘Newspaper’ is the only input, it appears useful for prediction: its parameter has a large ‘t-statistic’ and a small ‘p-value’. But the plot of ‘Newspaper’ against ‘Sales’ shows that the linear relationship between them is not very strong. This weak relationship vanishes in the multiple linear regression because the other two predictors dominate the relationship.

The multiple linear regression shows that ‘TV’ and ‘Radio’ are the essential factors for predicting ‘Sales’, which the single-input models could not reveal.

But what happened to ‘Newspaper’? We suppose that there may be some connection between ‘Newspaper’ and ‘Radio’. For example, in a region where radio advertising attracts a lot of customers, these customers may also happen to love reading newspapers.

Another interesting example is that the number of shark attacks has a strong relation to the sale of ice cream near the beach. Neither causes the other; both simply rise when hot weather brings more people to the beach.
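
The ‘Newspaper’ effect can be reproduced on synthetic data. The sketch below assumes a made-up data set in which `newspaper` is constructed to follow `radio` and has no direct effect on `sales`; all numbers and the helper `t_statistics` are illustrative assumptions, not the real advertising data. It compares the t-statistic of the newspaper coefficient in the single-input model and in the three-input model.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200

# Synthetic budgets (NOT the real advertising data). "newspaper" is built to
# follow "radio", and it has no direct effect on sales.
tv = rng.uniform(0, 300, m)
radio = rng.uniform(0, 50, m)
newspaper = 0.8 * radio + rng.normal(0, 8, m)
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, m)

def t_statistics(predictors, y):
    """t-statistics of the least squares coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(y))] + list(predictors))
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ w_hat
    sigma2 = resid @ resid / (len(y) - A.shape[1])      # noise variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return w_hat / se

t_simple = t_statistics([newspaper], sales)              # Sales ~ Newspaper
t_multi = t_statistics([tv, radio, newspaper], sales)    # Sales ~ TV + Radio + Newspaper

print("newspaper t-statistic, simple model:  ", round(float(t_simple[1]), 2))
print("newspaper t-statistic, multiple model:", round(float(t_multi[3]), 2))
```

In the single-input model, `newspaper` borrows the credit that belongs to `radio`; once `radio` is in the model, the newspaper coefficient has little left to explain.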

The three predictors and the output have the following correlation matrix:


This matrix shows a strong relation between ‘Radio’ and ‘Newspaper’.
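
A correlation matrix of this kind can be computed with NumPy's `np.corrcoef`. The following sketch uses synthetic data in which `newspaper` is deliberately tied to `radio`; the values it prints are therefore not the ones from the book.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200

# Synthetic budgets (NOT the real advertising data); newspaper follows radio.
tv = rng.uniform(0, 300, m)
radio = rng.uniform(0, 50, m)
newspaper = 0.8 * radio + rng.normal(0, 8, m)
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, m)

names = ["TV", "Radio", "Newspaper", "Sales"]
data = np.vstack([tv, radio, newspaper, sales])   # one row per variable
corr = np.corrcoef(data)                           # 4 x 4 Pearson correlation matrix

for name, row in zip(names, corr):
    print(f"{name:>9}", np.round(row, 2))
```

The entry in row ‘Radio’, column ‘Newspaper’ is the one to look at when checking whether two predictors carry overlapping information.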

Deciding on Important Variables

As discussed above, ‘Newspaper’ is not an important variable for this problem. But for other problems that have a huge number of predictors, how should we decide which of them to use in the multiple linear regression?
A natural strategy is to test every combination. In the ‘Media-Sales’ example above, we could test the following combinations:

| No. | Combination |
|---|---|
| 1 | ‘TV’ |
| 2 | ‘Radio’ |
| 3 | ‘Newspaper’ |
| 4 | ‘TV’ + ‘Radio’ |
| 5 | ‘TV’ + ‘Newspaper’ |
| 6 | ‘Radio’ + ‘Newspaper’ |
| 7 | ‘TV’ + ‘Radio’ + ‘Newspaper’ |
| 8 | None of the predictors |

Combination No. 8 is the null model: it contains only the intercept, so the output is modeled as a constant plus zero-mean random noise.

The accuracy of the linear regression for each combination of predictors can be measured with many different statistics:
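
The list of combinations can be generated rather than written by hand. A small sketch with Python's `itertools.combinations` follows; note that it lists the empty combination first, so the numbering differs from the table above.

```python
from itertools import combinations

predictors = ["TV", "Radio", "Newspaper"]

# All subsets of the predictors, from the empty set (intercept-only model)
# up to the full set.
subsets = [
    combo
    for size in range(len(predictors) + 1)
    for combo in combinations(predictors, size)
]

for number, combo in enumerate(subsets, start=1):
    print(number, " + ".join(combo) if combo else "(no predictors)")
```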

  1. Mallows’s $C_p$
  2. Akaike information criterion (AIC)
  3. Bayesian information criterion (BIC)
  4. Adjusted $R^2$
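
Three of these statistics can be sketched in a few lines of Python. The forms below are common textbook ones for least squares with Gaussian errors, with AIC and BIC given only up to an additive constant; the function names and the synthetic data are assumptions made for illustration. A higher adjusted $R^2$ is better, while lower AIC and BIC are better.

```python
import numpy as np

def fit_rss(X, y, columns):
    """RSS of a least squares fit (with intercept) on the given columns of X."""
    A = np.column_stack([np.ones(len(y)), X[:, np.array(columns, dtype=int)]])
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ w_hat) ** 2))

def adjusted_r2(rss, tss, m, d):
    """Adjusted R^2 for m observations and d predictors (intercept not counted)."""
    return 1.0 - (rss / (m - d - 1)) / (tss / (m - 1))

def aic(rss, m, d):
    """AIC for Gaussian least squares, up to an additive constant."""
    return m * np.log(rss / m) + 2 * (d + 1)

def bic(rss, m, d):
    """BIC for Gaussian least squares, up to an additive constant."""
    return m * np.log(rss / m) + (d + 1) * np.log(m)

# Synthetic data: only the first two columns influence y.
rng = np.random.default_rng(3)
m = 200
X = rng.normal(size=(m, 3))
y = 1.0 + 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=m)
tss = float(np.sum((y - y.mean()) ** 2))

for columns in ([0], [0, 1], [0, 1, 2]):
    d = len(columns)
    rss = fit_rss(X, y, columns)
    print(columns,
          "adj R2:", round(adjusted_r2(rss, tss, m, d), 4),
          "AIC:", round(float(aic(rss, m, d)), 1),
          "BIC:", round(float(bic(rss, m, d)), 1))
```

Unlike plain $\text{RSS}$, which never increases when a predictor is added, all three statistics include a penalty for model size, so they can be compared across combinations of different sizes.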

This strategy has a problem: the number of combinations grows exponentially with the number of predictors. Three inputs produce $2^3=8$ combinations, and with only 10 predictors we already have to test $2^{10}=1024$ models. In practice, more than 100 predictors is quite normal, and we can never test all their combinations.

What we need is a feasible method, and there are three classical ones:

  1. Forward Selection
  2. Backward Selection
  3. Mixed Selection

Forward Selection

Forward selection picks variables one by one: each time we select the best among the remaining variables, and we stop this greedy process when some condition is reached. For instance, suppose that in the first iteration we select $x_2$ from $\{x_1,x_2,x_3,x_4\}$. In the next iteration we take each of the remaining variables in turn, combine it with $x_2$, and calculate a statistic that measures the accuracy of the new two-variable linear regression. We keep the best of these candidates and then check a stopping condition, such as a threshold on the statistic, to decide whether to continue.
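
A minimal sketch of this greedy loop in Python follows. The choice of BIC as the statistic and "stop when the score no longer improves" as the stopping condition are assumptions made for this example, as are the function names and the synthetic data, in which only $x_2$ and $x_4$ (columns 1 and 3) influence the output.

```python
import numpy as np

def bic_score(X, y, columns):
    """BIC (up to a constant) of a least squares fit on the given columns."""
    m = len(y)
    A = np.column_stack([np.ones(m), X[:, np.array(columns, dtype=int)]])
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ w_hat) ** 2))
    return m * np.log(rss / m) + (len(columns) + 1) * np.log(m)

def forward_selection(X, y):
    """Greedily add the variable that lowers the score most; stop when none does."""
    selected = []
    remaining = list(range(X.shape[1]))
    best = bic_score(X, y, selected)          # start from the intercept-only model
    while remaining:
        # Try each remaining variable together with the ones already selected.
        trial = {j: bic_score(X, y, selected + [j]) for j in remaining}
        j_best = min(trial, key=trial.get)
        if trial[j_best] >= best:             # stopping condition: no improvement
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = trial[j_best]
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))                 # columns play the role of x_1..x_4
y = 1.0 + 3.0 * X[:, 1] + 1.5 * X[:, 3] + rng.normal(size=200)

print("selected column order:", forward_selection(X, y))
```

Each iteration fits only as many models as there are remaining variables, so the total number of fits grows roughly quadratically with the number of predictors instead of exponentially.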

Backward Selection

It is the inverse of forward selection. It starts with all the variables and then removes the least important variable one by one, again until some condition is reached.
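
The backward procedure can be sketched in the same way. As before, using BIC as the statistic and stopping when no removal improves it are assumptions of this example rather than part of the method's definition, and the data are synthetic.

```python
import numpy as np

def bic_score(X, y, columns):
    """BIC (up to a constant) of a least squares fit on the given columns."""
    m = len(y)
    A = np.column_stack([np.ones(m), X[:, np.array(columns, dtype=int)]])
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ w_hat) ** 2))
    return m * np.log(rss / m) + (len(columns) + 1) * np.log(m)

def backward_selection(X, y):
    """Start with every variable; repeatedly drop the least useful one."""
    kept = list(range(X.shape[1]))
    best = bic_score(X, y, kept)
    while kept:
        # Score each model obtained by dropping exactly one variable.
        trial = {j: bic_score(X, y, [k for k in kept if k != j]) for j in kept}
        j_drop = min(trial, key=trial.get)
        if trial[j_drop] >= best:             # stopping condition: no improvement
            break
        kept.remove(j_drop)
        best = trial[j_drop]
    return kept

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = 1.0 + 3.0 * X[:, 1] + 1.5 * X[:, 3] + rng.normal(size=200)

print("kept columns:", backward_selection(X, y))
```

Because the first step fits the model with all variables, this approach needs more observations than predictors, which forward selection does not.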

Mixed Selection

Mixed selection is a mix of forward selection and backward selection. Whenever we add a new variable to the model, we test whether any previously selected variable can be removed, because the new variable may carry the same information. For instance, in the ‘Media-Sales’ example, suppose we first selected ‘TV’ and then, for some reason, ‘Newspaper’. Once ‘Radio’ is added, ‘Newspaper’ is no longer necessary, so we remove it from the set of variables.
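
One possible way to sketch mixed selection is to alternate a forward step with a backward check. In the hypothetical data below, column 2 is a noisy proxy for the sum of columns 0 and 1, so it looks like the best single variable at first but becomes redundant once both real variables are in the model. The BIC criterion, the function names, and the data are all assumptions of this example.

```python
import numpy as np

def bic_score(X, y, columns):
    """BIC (up to a constant) of a least squares fit on the given columns."""
    m = len(y)
    A = np.column_stack([np.ones(m), X[:, np.array(columns, dtype=int)]])
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ w_hat) ** 2))
    return m * np.log(rss / m) + (len(columns) + 1) * np.log(m)

def mixed_selection(X, y):
    """Alternate forward additions with backward removals until nothing changes."""
    selected, steps = [], []
    best = bic_score(X, y, selected)
    changed = True
    while changed:
        changed = False
        # Forward step: add the variable that lowers the score the most.
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        trial = {j: bic_score(X, y, selected + [j]) for j in candidates}
        if trial:
            j_add = min(trial, key=trial.get)
            if trial[j_add] < best:
                selected.append(j_add)
                best = trial[j_add]
                steps.append(("add", j_add))
                changed = True
        # Backward step: drop variables that the newcomer has made redundant.
        while len(selected) > 1:
            trial = {j: bic_score(X, y, [k for k in selected if k != j])
                     for j in selected}
            j_drop = min(trial, key=trial.get)
            if trial[j_drop] >= best:
                break
            selected.remove(j_drop)
            best = trial[j_drop]
            steps.append(("remove", j_drop))
            changed = True
    return selected, steps

rng = np.random.default_rng(6)
m = 500
x0, x1 = rng.normal(size=m), rng.normal(size=m)
proxy = x0 + x1 + rng.normal(0, np.sqrt(0.5), size=m)   # carries no new information
X = np.column_stack([x0, x1, proxy])
y = 1.0 + 2.0 * x0 + 2.0 * x1 + rng.normal(size=m)

selected, steps = mixed_selection(X, y)
print("steps:", steps)
print("final columns:", sorted(selected))
```

Every accepted step strictly lowers the score, so the loop cannot cycle between adding and removing the same variable.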


[1] James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.
Last modified: March 24, 2020