Keywords: linear regression

Accuracy of the Model [1]

An assumption about the observed data is that they are generated by the model

$$
Y=f(X)+\varepsilon\tag{1}
$$

which means there is an actual generating process behind the data, with some noise added along the way. This is a general model for most regression problems. For a simple linear regression problem, it specializes to:

$$
Y=w_1X+w_0+\varepsilon\tag{2}
$$
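As a concrete illustration, here is a minimal sketch in Python with NumPy (my own choice of tooling, not something from the original post) that draws data from model (2) with hypothetical parameter values and recovers $w_1$ and $w_0$ by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters for model (2), chosen only for illustration
w1_true, w0_true, sigma = 2.0, 1.0, 0.5

n = 100
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0, sigma, size=n)       # the noise term epsilon
y = w1_true * x + w0_true + eps          # Y = w1*X + w0 + eps

# Ordinary least-squares fit of a degree-1 polynomial: returns (slope, intercept)
w1_hat, w0_hat = np.polyfit(x, y, deg=1)
print(w1_hat, w0_hat)                    # should land near 2.0 and 1.0
```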

In the post ‘simple linear regression problems’, we mentioned a performance measure of the linear model called $\text{RSS}$ (Residual Sum of Squares). It has a strong connection with the quality of the linear model: it increases as the model fits the data worse, and it is used to direct the parameter search during the learning phase. This kind of performance measure can be regarded as one way to define the accuracy of the model:

  • Lower $\text{RSS}$, better model
  • Higher $\text{RSS}$, worse model
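A quick computational sketch of this behavior (a Python/NumPy illustration of my own, with made-up numbers): $\text{RSS}$ is just the sum of squared residuals, so predictions close to the data give a small value and poor predictions give a large one.

```python
import numpy as np

def rss(y, y_hat):
    """Residual Sum of Squares: sum over i of (y_i - y_hat_i)^2."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sum((y - y_hat) ** 2))

# Made-up observations and two candidate predictions
y    = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.0, 4.1])    # close to the data
bad  = np.array([3.0, 3.0, 3.0, 3.0])    # ignores the trend

print(rss(y, good))   # small RSS -> better model
print(rss(y, bad))    # large RSS -> worse model
```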

Besides $\text{RSS}$, in this post we introduce two more numerical measures of the accuracy of a linear model:

  1. $\text{RSE}$
  2. $\text{R}^2$

Residual Standard Error ($\text{RSE}$)

$\text{RSS}$ is a good tool for assessing the accuracy of a model, but it has a deficiency. $\text{RSE}$ is derived from $\text{RSS}$:

$$
\text{RSE}=\sqrt{\frac{1}{n-2}\text{RSS}}=\sqrt{\frac{1}{n-2}\sum^n_{i=1}(y_i-\hat{y_i})^2}\tag{3}
$$

The $2$ in $n-2$ comes from the fact that there are 2 parameters in our simple linear regression model. If the model has $m$ parameters, this factor becomes $n-m$:

$$
\text{RSE}=\sqrt{\frac{1}{n-m}\text{RSS}}=\sqrt{\frac{1}{n-m}\sum^n_{i=1}(y_i-\hat{y_i})^2}\tag{4}
$$
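Here is a minimal sketch of equation (4) in Python; the function name `rse`, the default $m=2$, and the toy numbers are my own, purely for illustration:

```python
import numpy as np

def rse(y, y_hat, m=2):
    """Residual Standard Error, equation (4); m is the number of fitted
    parameters (m=2 for simple linear regression: slope and intercept)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = y.size
    rss = np.sum((y - y_hat) ** 2)
    return float(np.sqrt(rss / (n - m)))

# Toy usage with made-up numbers
y     = np.array([1.2, 1.9, 3.1, 4.2, 4.8])
y_hat = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(rse(y, y_hat))         # n - 2 in the denominator (simple linear regression)
print(rse(y, y_hat, m=3))    # n - 3 for a model with three parameters
```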

Why is $\text{RSE}$ a better choice than $\text{RSS}$?

One of the assumptions behind equation (1) is that $\varepsilon$ is a random variable with a Gaussian distribution of mean $0$ and variance $\sigma^2$. Then every $Y_i$ also has a Gaussian distribution, with mean $w_1X_i+w_0$ and variance $\sigma^2$. In other words, each squared residual gives a one-sample estimate of this variance:

$$
\hat{\sigma}_i^2=(y_i-\hat{y_i})^2
$$

This is the essential observation for estimating the true $\sigma^2$. Since the noise terms $\varepsilon_i$ are i.i.d., every observation shares the same $\sigma^2$, and $\text{RSS}$ pools these one-sample estimates into a single statistic for it. To obtain an unbiased estimate of $\sigma^2$ (and from it an estimate of the standard deviation $\sigma$), $\text{RSS}$ is divided by $n-2$, and the square root of the result is $\text{RSE}$.

In other words, when the fitted parameters are close to the true ones, $\text{RSE}$ serves as an estimate of the standard deviation of $\varepsilon$ in equation (1) (the scaled quantity $\text{RSS}/\sigma^2$ follows a $\chi^2$ distribution with $n-2$ degrees of freedom). When the parameters are far from the correct ones, $\text{RSE}$ behaves just like $\text{RSS}$: it simply grows with the misfit.
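A small simulation sketch can make this concrete (assuming NumPy and arbitrary parameter values of my own choosing): draw many datasets from model (2), fit each by least squares, and average $\text{RSS}/(n-2)$; the result should land close to the true noise variance $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
w1, w0, sigma, n = 2.0, 1.0, 0.5, 30      # arbitrary illustrative values

estimates = []
for _ in range(2000):
    x = rng.uniform(0, 10, size=n)
    y = w1 * x + w0 + rng.normal(0, sigma, size=n)
    b1, b0 = np.polyfit(x, y, deg=1)      # least-squares slope and intercept
    rss = np.sum((y - (b1 * x + b0)) ** 2)
    estimates.append(rss / (n - 2))       # this is RSE squared

# The average of RSS/(n-2) should be close to the true variance sigma^2 = 0.25
print(np.mean(estimates))
```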

$\text{RSE}$ measures the lack of fit:

  • when $\hat{y_i}\approx y_i$, $\text{RSE}$ is small: a good fit
  • when $\hat{y_i}$ is far from $y_i$, $\text{RSE}$ is large: a bad fit

$\text{R}^2$

Consider the following situation: if the observed values are on the order of $10^9$ to $10^{10}$, the $\text{RSE}$ of our model may be larger than one million. Another model of the same data, fitted after a transformation such as $\log(\cdot)$, may have an $\text{RSE}$ below 100. Which model is better cannot be told from $\text{RSE}$ alone, because $\text{RSE}$ is an absolute quantity measured in the units of the data. We need a relative measure, not an absolute one.

A traditional way of turning an absolute measure into a relative one is to build a proportion. So we get:

$$
\text{R}^2=\frac{\text{TSS}-\text{RSS}}{\text{TSS}}=1-\frac{\text{RSS}}{\text{TSS}}
$$

where $\text{TSS}$ is the total sum of squares $\sum(y_i-\bar{y})^2$. To illustrate $\text{TSS}$, we can draw $\bar{y}$ as a horizontal line through the data:

where the gray lines show the deviation of each point from $\bar{y}$, and their squared lengths sum to $\text{TSS}$. After the linear fit, the deviations of each point from the fitted line play the same role, and their squared lengths sum to $\text{RSS}$.

$\text{TSS}$ is always greater than or equal to $\text{RSS}$:

$$
\begin{aligned}
\text{TSS}&=\sum(y_i-\bar{y})^2\\
&=\sum(y_i-\hat{y_i}+\hat{y_i}-\bar{y})^2\\
&=\sum(y_i-\hat{y_i})^2+\sum(\hat{y_i}-\bar{y})^2+2\sum(y_i-\hat{y_i})(\hat{y_i}-\bar{y})\\
&=\text{RSS}+\sum(\hat{y_i}-\bar{y})^2\\
&\geq \text{RSS}\geq 0
\end{aligned}
$$

The cross term $2\sum(y_i-\hat{y_i})(\hat{y_i}-\bar{y})$ vanishes for a least-squares fit that includes an intercept (a consequence of the normal equations). $\text{TSS}=\text{RSS}$ only when $\hat{y_i}=\bar{y}$ for all $i$, and $\text{TSS}-\text{RSS}$ is the reduction in uncertainty after fitting the model.
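A quick numerical check of this decomposition (my own sketch with arbitrary simulated data): for a least-squares fit with an intercept, $\text{TSS}$ should equal $\text{RSS}+\sum(\hat{y_i}-\bar{y})^2$, and the cross term should come out numerically zero.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=50)   # arbitrary simulated data

b1, b0 = np.polyfit(x, y, deg=1)                  # least-squares fit with intercept
y_hat = b1 * x + b0

tss   = np.sum((y - y.mean()) ** 2)
rss   = np.sum((y - y_hat) ** 2)
ess   = np.sum((y_hat - y.mean()) ** 2)
cross = np.sum((y - y_hat) * (y_hat - y.mean()))

print(tss, rss + ess)     # the two numbers agree up to floating-point error
print(cross / tss)        # the cross term is essentially zero
```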

This gives us the following conclusion:

  1. $0 \leq \text{R}^2 \leq 1$
  2. When $\text{R}^2 = 1$, $\text{RSS}=0$: the fit is perfect and the model explains all of the variation in the data.
  3. When $\text{R}^2 = 0$, $\text{RSS}=\text{TSS}$: the fit contributes nothing to prediction, because its predictions do no better than the sample mean.

So, $\text{R}^2$ is a good measure of the accuracy of the model.
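To make the definition concrete, here is a minimal Python sketch of $\text{R}^2$ computed directly from $\text{RSS}$ and $\text{TSS}$ (the helper name `r_squared` and the toy numbers are mine). Note that, unlike $\text{RSE}$, the value does not change when the data are rescaled:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS, with TSS taken around the sample mean of y."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(1.0 - rss / tss)

# Toy usage with made-up numbers
y     = np.array([1.2, 1.9, 3.1, 4.2, 4.8])
y_hat = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(r_squared(y, y_hat))                  # close to 1: a good fit
print(r_squared(1e9 * y, 1e9 * y_hat))      # identical: R^2 ignores the scale of the data
```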

Usage of $\text{R}^2$

$\text{R}^2$ is not just an indicator of how good a model is; it can also be used in other ways. For instance, in physics, suppose we know that $X$ and $Y$ have a linear relationship, yet the $\text{R}^2$ of the regression is close to 0. This $\text{R}^2$ tells us that the data contain too much noise, or even that the experiment went entirely wrong.

The second example is using $\text{R}^2$ to test whether the sample has a linear relationship: fit a linear regression and inspect its $\text{R}^2$; if $\text{R}^2$ is close to 0, the variables are not linearly related.
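A small sketch of this use, with simulated data of my own: fit a straight line to a clearly linear sample and to a clearly non-linear one, and compare the two $\text{R}^2$ values.

```python
import numpy as np

def r2_of_linear_fit(x, y):
    """Fit y = b1*x + b0 by least squares and return the R^2 of that fit."""
    b1, b0 = np.polyfit(x, y, deg=1)
    y_hat = b1 * x + b0
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=200)

y_linear    = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)   # linear + noise
y_nonlinear = x ** 2 + rng.normal(0, 0.1, size=x.size)          # quadratic + noise

print(r2_of_linear_fit(x, y_linear))      # close to 1: a linear relationship
print(r2_of_linear_fit(x, y_nonlinear))   # near 0: no linear relationship
```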

$\text{Cov}(X,Y)$ and $\text{R}^2$

We test whether the linear relationship is strong with $\text{R}^2$, much as we test whether there is some relationship between two variables with covariance. In simple linear regression, $\text{R}^2$ is in fact the squared correlation of $X$ and $Y$, i.e. a normalized version of $\text{Cov}(X,Y)$. We use $\text{R}^2$ rather than $\text{Cov}(X,Y)$ because it is scale-free and generalizes more naturally to the multi-variable case.
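A quick numerical check of this connection (again my own sketch): in simple linear regression with an intercept, $\text{R}^2$ equals the squared sample correlation, $\text{Cov}(X,Y)^2/(\text{Var}(X)\text{Var}(Y))$.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, size=100)   # arbitrary simulated data

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b1 * x + b0
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

corr = np.corrcoef(x, y)[0, 1]    # sample correlation = Cov(X,Y)/sqrt(Var(X)Var(Y))
print(r2, corr ** 2)              # the two values coincide
```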

References


[1] James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.
Last modified: March 25, 2020