Linear Regression

Linear regression is one of simplest ways of building a model that can make predictions from existing data (the “training data”). Regression models are used to predict numbers (“what will the temperature be tomorrow?”), while classification models are used to predict discrete outcomes (“will it rain tomorrow?”).

Although linear regression is an elementary method in data analysis that has existed for 200 years, it is robust, flexible, easy to compute, easy to understand, often performs quite well, and, just as importantly, the foundation upon which many modern regression models are built.

We give a more general perspective in a separate piece.

Modeling data for linear regression

There are many ways to build a model from data. In regression, the training data is given as a set of pairs of independent and dependent variables. Independent variables are the part of the data our finished model will use to make predictions; dependent variables are what the predictions should be. So our training data always has both values of the dependent and independent variables, but our testing data only has the values of the independent variables. The job of the model is exactly to predict the value of the dependent variable.

Let’s get a little more concrete. Imagine you are designing a computerized hygrometer: an instrument to read moisture content from the air. You are in charge of writing the software that will read varying amounts of voltage from the sensor and convert those readings into humidity measurements. As input, you are given a set of voltage readings and associated humidity values (the humidity values having been collected by some other trusted method). Let us call $x_i$ the voltage values, and $y_i$ the humidity values.

Our first decision is as to what model we will use for the predictions. This decision cannot come alone from the data (although we will later discuss methods to help with this). In linear regression, the fundamental assumption is that the model is, well, linear: we look for the linear combination1 of simpler, known models that best predicts the training examples.

For example, a model that says the humidity $y$ will be predicted as

\[y = a x + b\]

is a linear regression model, because we are trying to predict an unknown value ($y$) from a linear combination of known values ($x$ and $1$). To solve a linear regression model is to find the values of $a$ and $b$ that best match the training data, in hopes that it will best predict the test data.

Solving a linear regression model

In order to find the best values for $a$ and $b$, we first need to define what we mean by best. Here, we will use a very simple notion, which says that we want to minimize how bad our model does over the entirety of the training data, and that we will count how badly the model does at each training point by the square of the difference between the predicted value and the observed value:

\[E = \sum_i (y_i - (ax_i + b))^2\]

$E$ stands for the “energy” of the model, or the “error” of the model, and we want to find the model that minimizes the error.2 This is simply a matter of taking the derivative of $E$ with respect to $a$ and $b$, setting those values to zero, and solving the resulting system of equations.

The crucial observation in linear models is that, since we know the values $y_i$ and $x_i$ at training time, when we take derivatives of $E$, the resulting expressions are always linear functions of $a$ and $b$. This is true even if the model we are fitting uses non-linear functions. For example, imagine that our model was, instead,

\[y = a x^2 + b x + c.\]

Even though the model would be fitting parabolic curves to the data (instead of linear fits), the process of combining the simpler models ($1$, $x$ and $x^2$) is still linear 3.

Setting up the matrices

Solving the system of equations



TBF. Click on the plot to add points.

  1. A linear combination of a set of vectors is a weighted sum of those vectors, where the weights can be arbitrary values. 

  2. Different generalizations of this error function make up a surprisingly large fraction of modern methods in data analysis. 

  3. See linear least squares