Keywords: least squared error, linear regression

How much we can trust the estimated parameters of a model is always a concern. To use the methods discussed previously with more confidence, we would like to build a reliable framework under which the method stays valid.

Are you sure your decision is correct?

We have introduced a naive method for solving a linear regression problem: the relation between the TV advertising budget and the resulting sales [1]. Probability, which we used in that problem, deals with uncertainty, so the answers to such problems are uncertain too. We therefore have to gather evidence to support our result and to answer unfriendly questions like this:

  • “Are you sure your decision is correct?”
  • “95.89%”

How to obtain a percentage such as “95.89%”, and how to build the evidence that supports it, is what we are going to talk about today.


Before we start this long mathematical article, a fundamental assumption has to be accepted. Both the data we will deal with and all the problems we will come across share a common model:

y=f(x)+\varepsilon \tag{1}

where $y$ is produced by an unknown system $f(x)+\varepsilon$, but $y$ itself can be observed. Both $f(\cdot)$ and $x$ are unknown or only partly known. $\varepsilon$ is a random variable with mean zero; it can have any distribution, although the normal distribution is usually used because of its simplicity.

Why does this odd-looking equation become our essential condition? The philosophy behind it is our belief in mathematics. We have faith that the world runs on mathematics, and that a function $f(x)$ drives everything around us. However, it exists in an unknown form, and we may never know what it is. What we can do is use a model to approximate it as closely as we want. This is the basic idea of statistics. The $\varepsilon$, on the other hand, is the error we make during the whole procedure. An easy example is measuring the length of a pen: your tool, the ruler, may not be precise, or you may write down the wrong number. Mistakes of this kind are always there, and we can never get rid of them, so we use a random variable $\varepsilon$ to make the model more practical.

In short: $f(x)$ is the rule behind the world, and it is unknown. $\varepsilon$ is the error made during our observation or recording, and it is treated as a zero-mean random variable.
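To make the model in equation(1) concrete, here is a minimal sketch that simulates noisy observations. The function `f` below, the noise level `noise_sd`, and the helper name `observe` are assumptions chosen for illustration only; in practice $f$ is unknown.

```python
import random

def observe(f, x, noise_sd=1.0, rng=random):
    """Return one observation y = f(x) + eps, with eps ~ N(0, noise_sd^2)."""
    return f(x) + rng.gauss(0.0, noise_sd)

def f(x):
    # Hypothetical "true" system; in a real problem this is unknown.
    return 2.3 * x + 1.0

rng = random.Random(0)  # fixed seed so the run is reproducible
ys = [observe(f, 2.0, noise_sd=0.5, rng=rng) for _ in range(10000)]

# Because eps has mean zero, the average observation is close to f(2.0) = 5.6.
mean_y = sum(ys) / len(ys)
print(mean_y)
```

Each single observation is off by some random amount, but the errors average out, which is exactly what the zero-mean assumption on $\varepsilon$ expresses.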

Linear $f(x)$

In ‘Simple Linear Regression’, we solved the problem with a very popular method (almost every machine learning course puts this topic at the beginning), but we would like to go deeper. The first question we come across is how to analyze the estimates of the parameters.

Substituting our linear regression equation into equation(1), we get the model:

y=w_0+w_1x+\varepsilon \tag{2}

The parameters have names:

  • $w_0$ is the intercept: the expected value of $y$ when $x=0$
  • $w_1$ is called the slope: the average increase in $y$ associated with a one-unit increase in $x$

However, $\varepsilon$ doesn’t have a special name. It catches all the noise in the model, for example:

  1. the original $f(x)$ behind the data may not be linear
  2. the measurement may have errors

$\varepsilon$ plays an important part in linear regression, and then we have the second assumption:

$\varepsilon$ is independent of $x$

However, this assumption is clearly violated in the “TV-Sales” problem. We can see that in the linear regression figure:

If we take the error to be the distance between a sample point and the line, we can easily see that the error grows as $X$ (TV) increases. Nevertheless, we still treat this assumption as reasonable, because it makes our regression analysis easier.

If equation (2) is the exact model of our observations, the line $f(\cdot)$ is called the ‘population regression line’. As we said, this equation is not always known, so we can only use a regression line to approximate it. Among those regression lines, the least-squares line is one of the best linear approximations to the true relationship.

Relationship between Population Regression Line and Least Squares Lines

Now we take

Y=2.3x+1+\varepsilon \tag{3}

as our population regression line, in which $\varepsilon$ has a zero-mean Gaussian distribution. Then 5 samples are generated, each containing 20 points ($x$ is a random variable; however, we can treat it as fixed without loss of generality). Each sample is used to fit a line by the least-squares method. We draw all of them in a single figure: the red line (Generator) is the population regression line and the dashed lines are the 5 least-squares lines:

This procedure can be repeated again and again. A least-squares line that coincides exactly with the population regression line is practically impossible (the estimates are continuous random variables, so this event has probability $0$). But the averages of $\hat{w_0}$ and $\hat{w_1}$ will equal $1$ and $2.3$ if we can draw infinitely many samples. In other words:

\mathbb{E}(\hat{w_0})\stackrel{\text{number of samples}\to \infty}{=}w_0\\
\mathbb{E}(\hat{w_1})\stackrel{\text{number of samples}\to \infty}{=}w_1

This is true because the least-squares line gives unbiased estimates of the coefficients. Bias and unbiasedness are an interesting topic in statistics.

In short, if the expectation of an estimated parameter is equal to the original coefficient $(w_0,w_1)$ in equation(2), the estimate is called unbiased. Being unbiased does not mean the estimate is better than a biased one, but it has some good properties that a biased one does not.
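The repeated-sampling experiment above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used for the figure: the design points `xs`, the noise standard deviation of 1, the number of repetitions, and the helper name `fit_line` are all chosen here for the example.

```python
import random

def fit_line(xs, ys):
    """Least-squares estimates (w0_hat, w1_hat) for y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    return w0, w1

rng = random.Random(42)
xs = [i / 2 for i in range(20)]  # 20 fixed design points (assumed)
n_samples = 5000                 # many more than the 5 samples in the figure

w0_sum = 0.0
w1_sum = 0.0
for _ in range(n_samples):
    # Draw one sample from the population line Y = 2.3x + 1 + eps.
    ys = [1.0 + 2.3 * x + rng.gauss(0.0, 1.0) for x in xs]
    w0, w1 = fit_line(xs, ys)
    w0_sum += w0
    w1_sum += w1

w0_mean = w0_sum / n_samples  # should be close to 1
w1_mean = w1_sum / n_samples  # should be close to 2.3
print(w0_mean, w1_mean)
```

No single fitted line matches the population line, yet the averages of the estimates settle near the true coefficients, which is what unbiasedness means.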

Standard Error

The standard error is one of the most useful statistics in parameter estimation. Variance is an essential numerical feature of a random variable, of a distribution, and possibly of a set of data. The standard deviation is the square root of the variance, and the standard deviation of an estimate computed from a sample is called its standard error. If we want to estimate $\mu$, the mean of $y$, we can use $\hat{\mu}=\frac{1}{n}\sum^ny_i$. We then analyze the variance of $\hat{\mu}$. That is:

\text{var}(\hat{\mu})=\text{SE}(\hat{\mu})^2=\frac{\delta^2}{n} \tag{4}

where $\delta$ is the standard deviation of each point of the sample. This is a little confusing, so we go back to equation(3), where each $x_i$ is a random variable identically distributed with $x$, and the randomness of $y_i$ comes from $\varepsilon$ rather than from $x_i$. That means $y_i$ has a Gaussian distribution with mean $w_1x_i+w_0$, and $\delta$ is the square root of its variance. For now, we consider all the points in the sample to be independent, so:

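\begin{align}
\text{var}(\hat{\mu})&=\text{var}(\frac{1}{n}\sum^n y_i)\\
&=\frac{1}{n^2}\sum^n \text{var}(y_i)\\
&=\frac{1}{n^2}\cdot n\delta^2\\
&=\frac{\delta^2}{n} \tag{5}
\end{align}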

Equation(5) is a very famous equation, and we can draw the following conclusions from it:

  1. As $n$ increases, $\text{var}(\hat{\mu})$ decreases, so $\hat{\mu}$ becomes more and more certain.
  2. $\text{var}(\hat{\mu})=\text{SE}(\hat{\mu})^2$ is a good measure of how far $\hat{\mu}$ typically is from the actual $\mu$

One sentence about SE: the smaller the SE, the less the uncertainty.
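The result $\text{var}(\hat{\mu})=\delta^2/n$ can be checked by simulation. The sketch below makes simplifying assumptions that are not in the text: every $y_i$ shares the same mean (10), $\delta=2$, and $n=25$. It draws many samples, computes $\hat{\mu}$ for each, and compares the spread of those means with the formula.

```python
import random
import statistics

rng = random.Random(1)
delta = 2.0       # assumed standard deviation of each y_i
n = 25            # assumed sample size
repeats = 20000   # number of independent samples

means = []
for _ in range(repeats):
    sample = [rng.gauss(10.0, delta) for _ in range(n)]
    means.append(statistics.fmean(sample))  # mu_hat for this sample

grand_mean = statistics.fmean(means)
empirical_var = statistics.fmean([(m - grand_mean) ** 2 for m in means])
theoretical_var = delta ** 2 / n  # delta^2 / n = 0.16

print(empirical_var, theoretical_var)
```

Quadrupling `n` should cut the variance of $\hat{\mu}$ to a quarter and the standard error to a half, in line with the first conclusion in the list above.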

There is a hint here: we should separate a sample point from a realization of it. A sample point is a random variable that has the same distribution as the population, and this is why we can use a sample to estimate the parameters of the population distribution. SE plays an indispensable part in the following sections.

Use Standard Error(SE) to Solve the Question

“How sure are we about the estimates of $w_1$ and $w_0$?”

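Then we go back to the little example in the last section about the mean of $y$. This $y$ can be the $y$ in equation(3) as well as in equation(2). Least squares gave us a solution of equation(2) in the post ‘Simple Linear Regression’:

\hat{w_1}=\frac{\sum^{n}_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sum^{n}_{i=1}(x_i-\bar{x})^2} \tag{6}

\hat{w_0}=\bar{y}-\hat{w_1}\bar{x} \tag{7}

where $\bar{y}$ has the same meaning as $\hat{\mu}$ (Notation: $\bar{Y}$ is the mean of the population, and $\bar{y}$ is the mean of the sample, but here we consider them as the same thing). Applying the variance analysis of equation(5) to equations (6) and (7), we get the $\text{SE}$ of $\hat{w_0}$ and $\hat{w_1}$:

\text{SE}(\hat{w_0})^2=\delta^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum^{n}_{i=1}(x_i-\bar{x})^2}\right]\\
\text{SE}(\hat{w_1})^2=\frac{\delta^2}{\sum^{n}_{i=1}(x_i-\bar{x})^2} \tag{8}

where $\delta^2=\text{var}(\varepsilon)$. This derivation is a little complicated; however, in the multi-variable case it can be derived more easily. In equation(8), these features can be found: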

  1. When $\bar{x}=0$, $\text{SE}(\hat{w_0})^2$ has the same value as the variance of $\hat{\mu}$: $\text{SE}(\hat{w_0})^2=\frac{\delta^2}{n}$
  2. As $\sum^{n}_{i=1}(x_i-\bar{x})^2$ gets bigger, the squared $\text{SE}$s of $\hat{w_0}$ and $\hat{w_1}$ get smaller, and both estimates become more certain.

Both $\text{SE}$s contain $\delta^2$, and $\delta^2$ is a key numerical property of the population distribution, so we would like to estimate $\delta$ from the sample. This estimate is known as the residual standard error (RSE), and is given by the formula:

\text{RSE}=\sqrt{\frac{1}{n-2}\sum^{n}_{i=1}(y_i-\hat{y_i})^2},\quad \hat{y_i}=\hat{w_0}+\hat{w_1}x_i \tag{9}

Because $\delta^2$ is always unknown and is estimated from the data, $\text{SE}$ should be written as $\hat{\text{SE}}$, but for simplicity of notation we will write $\text{SE}$.
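As a worked illustration of the standard-error formulas, the sketch below fits a line to a tiny made-up data set and plugs the residual standard error in place of $\delta$. The data values and the function name `ols_with_se` are hypothetical, chosen so the arithmetic can be followed by hand.

```python
import math

def ols_with_se(xs, ys):
    """Return (w0_hat, w1_hat, rse, se_w0, se_w1) for y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    # Residual sum of squares and the residual standard error (estimate of delta).
    rss = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(rss / (n - 2))
    # Standard errors with delta replaced by its estimate.
    se_w0 = rse * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_w1 = rse / math.sqrt(sxx)
    return w0, w1, rse, se_w0, se_w1

# Hypothetical data: x_bar = 3, y_bar = 4, sum (x - x_bar)^2 = 10, RSS = 2.4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
w0, w1, rse, se_w0, se_w1 = ols_with_se(xs, ys)
print(w0, w1, rse, se_w0, se_w1)
```

For this data the slope is about 0.6 and the intercept about 2.2; the squared standard errors work out to $0.8\times(1/5+9/10)=0.88$ for $\hat{w_0}$ and $0.8/10=0.08$ for $\hat{w_1}$.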

Using the $\text{SE}$, we can build confidence intervals for $w_0$ and $w_1$. There is approximately a $95\%$ chance that the intervals:

\hat{w_0}\pm2\cdot\text{SE}(\hat{w_0})\\
\hat{w_1}\pm2\cdot\text{SE}(\hat{w_1}) \tag{10}

will contain $w_0$ and $w_1$ respectively (the factor $2$ is an approximation, so $95\%$ is not the precise probability for the interval of $\pm2\text{SE}$).

The question “Are you sure your decision is correct?” can be answered now. The interval $\hat{w_1}\pm2\text{SE}$ has about a $95\%$ probability of capturing the actual $w_1$.
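One way to see what the $95\%$ means is to repeat the experiment many times and count how often the interval $\hat{w_1}\pm2\text{SE}(\hat{w_1})$ captures the true slope. The sketch below does this for the population line $Y=2.3x+1+\varepsilon$; the design points, the noise level, the number of trials, and the helper name `fit_slope_with_se` are assumptions made for the example.

```python
import math
import random

def fit_slope_with_se(xs, ys):
    """Return (w1_hat, se_w1) for the least-squares line y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    w0 = y_bar - w1 * x_bar
    rss = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(rss / (n - 2))  # estimate of delta
    return w1, rse / math.sqrt(sxx)

rng = random.Random(7)
xs = [i / 2 for i in range(20)]  # 20 fixed design points (assumed)
true_w0, true_w1 = 1.0, 2.3
trials = 2000

hits = 0
for _ in range(trials):
    ys = [true_w0 + true_w1 * x + rng.gauss(0.0, 1.0) for x in xs]
    w1_hat, se = fit_slope_with_se(xs, ys)
    if w1_hat - 2 * se <= true_w1 <= w1_hat + 2 * se:
        hits += 1

coverage = hits / trials  # fraction of intervals that contain the true slope
print(coverage)
```

With only 20 points per sample the observed coverage tends to land slightly below $95\%$, which illustrates why the factor $2$ is only an approximation.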

“Is there a relationship between $x$ and $y$?”

To answer this question, we introduce the ‘hypothesis test’ into our post. We have two hypotheses, the null hypothesis:

H_0\text{: There is no relationship between } x \text{ and } y

versus the alternative hypothesis:

H_a\text{: There is some relationship between } x \text{ and } y

The equivalent mathematical question is “whether $w_1$ is far from $0$ or not”, because according to equation(2), when $w_1=0$ we have:

y=w_0+\varepsilon

so $y$ is a random variable with mean $w_0$, and it has no relationship with $x$ at all.

We are not sure about the actual value of $w_1$; we can only assess the uncertain $w_1$ through $\hat{w_1}$ and its $\text{SE}(\hat{w_1})$. From this view, if $\hat{w_1}$ is far away from $0$ when the distance is measured in units of $\text{SE}$, we can have some confidence that $w_1$ is not 0. Generally, $\hat{w_1}$ and $\text{SE}(\hat{w_1})$ combine as in the table:

| | $\hat{w_1}$ is close to $0$ | $\hat{w_1}$ is far from $0$ |
|---|---|---|
| $\text{SE}(\hat{w_1})$ is large | $w_1$ is more likely to be $0$ | $w_1$ may or may not be $0$ |
| $\text{SE}(\hat{w_1})$ is tiny | $w_1$ may or may not be $0$ | $w_1$ is less likely to be $0$ |

In the figures, we assume $w_1$ has a bell-shaped distribution centered at $\hat{w_1}$ whose variance is determined by $\text{SE}(\hat{w_1})$.

This is the first column of the table. When $\hat{w_1}=0.1$, a tiny $\text{SE}$ (the orange line) can ensure that $w_1$ has a very high probability of not being 0.

In contrast, this is the second column of the table. When $\hat{w_1}=1$, a large $\text{SE}$ (the red line) can do nothing to guarantee that $w_1$ is not 0.


So $\hat{w_1}$ and $\text{SE}$ should be combined into a new quantity that indicates how plausible $w_1=0$ is. For this, we present the “t-statistic”:

t=\frac{\hat{w_1}-0}{\text{SE}(\hat{w_1})} \tag{11}

This is called a t-statistic because, when $H_0$ is true, $t$ follows a $t$-distribution with $n-2$ degrees of freedom. Equation(11) can be read as “$\hat{w_1}$ lies $t$ standard errors $\text{SE}(\hat{w_1})$ away from 0”. It is a reliable distance because it is relative rather than absolute.

The bigger $|t|$ is, the less plausible $w_1=0$ is.
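A minimal sketch of the t-statistic applied to the two situations discussed with the figures. The estimates $0.1$ and $1$ come from the text; the two $\text{SE}$ values (0.01 and 2.0) are assumptions picked here to stand for "tiny" and "large".

```python
def t_statistic(w1_hat, se_w1):
    """Number of standard errors between w1_hat and 0."""
    return w1_hat / se_w1

# Small estimate, tiny SE (assumed 0.01): many standard errors away from 0.
t_tiny_se = t_statistic(0.1, 0.01)

# Larger estimate, large SE (assumed 2.0): well within one standard error of 0.
t_large_se = t_statistic(1.0, 2.0)

print(t_tiny_se, t_large_se)
```

Although $1$ is farther from $0$ than $0.1$ in absolute terms, the relative distance tells the opposite story, which is why the t-statistic rather than $\hat{w_1}$ alone is used.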


Another measurement derived from this is called the ‘p-value’, a famous figure in statistics. Here it is the probability, assuming $H_0$ holds, of observing a value whose absolute value is at least $|t|$. In the figure below, it is the size of the shaded area:

and it can also be written as:

\text{p-value}=P(|T|\geq|t|),\quad T\sim t_{n-2} \tag{12}

According to the relationship between $t$ and the distance from $\hat{w_1}$ to $0$, we have:

  • The bigger $|t|$, the greater the distance from $\hat{w_1}$ to $0$, measured in standard errors
  • The smaller the $\text{p-value}$, the greater the distance from $\hat{w_1}$ to $0$
  • The greater the distance from $\hat{w_1}$ to $0$, the stronger the evidence of a relationship between $x$ and $y$
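The p-value is the tail area of a $t$-distribution, so it can be computed by integrating the density. The sketch below does this with Simpson's rule using only the standard library; it is one possible way to get the number, not necessarily how a statistics package does it, and the function names `t_pdf` and `two_sided_p_value` are made up for this example.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|) for T ~ t(df), via Simpson's rule on [0, |t|].

    steps must be even.
    """
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2.0 * total * h / 3.0  # P(-|t| <= T <= |t|), using symmetry
    return max(0.0, 1.0 - central)

# With n = 20 points there are n - 2 = 18 degrees of freedom.
p_far = two_sided_p_value(5.0, 18)    # estimate far from 0 in SE units
p_near = two_sided_p_value(0.5, 18)   # estimate close to 0 in SE units
print(p_far, p_near)
```

A large $|t|$ leaves almost no area in the tails, so the p-value is small; a small $|t|$ leaves most of the area outside, so the p-value is large.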


[1] James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.
Last modified: March 25, 2020