What is regression, anyways?
Regression is a way to find relationships between different variables or data points and then use those relationships to predict new values.
An example would be data about houses. We can observe various things about them, like the number of rooms, the city a house is located in, the year it was built and so on. All of these observations are connected to the house’s price, which we know for some houses but might want to predict for others. So once we have found the relation between the observations and the price, expressed as a function of the observations, we can apply this function to new houses and get a good estimate of their price.
These relationships between data points can have different forms. One of the simplest is a linear relationship, and in that case we speak of linear regression.
Theory
Given a set \{y_i, x_{i,1}, ..., x_{i,p}\}_{i=1}^{n} of n data points, the model assumes that there is a linear relation between the scalar value y_i and the feature vector x_i = (x_{i,1}, ..., x_{i,p}) .
This can be expressed by the model with noise \varepsilon_i :
y_i = \beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p} + \varepsilon_i = x_i^T \beta + \varepsilon_i
for i = 1, ..., n . Those n equations can be expressed together in a matrix notation as
y = X\beta + \varepsilon
Here X is extended by an additional column of 1s to accommodate \beta_0 , and each row is one data point.
X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & ... & x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & ... & x_{2,p} \\ ... & & & & \\ 1 & x_{n,1} & x_{n,2} & ... & x_{n,p} \end{bmatrix}, \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ ... \\ \beta_p \end{bmatrix}
y is called the regressand or the dependent variable, because in choosing this model we assume that y depends on x in a linear fashion. We might even have a formula that expresses y in terms of x ( y = f(x) ); in that case y depends on x by design and we don’t get to choose.
The column vectors of the matrix X are known as the regressors, explanatory variables or input variables.
The values in \beta are the parameters, and they are what the linearity of the model refers to: as long as the model is linear in \beta , we call it linear. For example, y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,1}^2 + \varepsilon_i still counts as a linear model, because it is linear in \beta even though it is not linear in x.
The noise is modeled by \varepsilon . It can be interpreted as the divergence of the data from the underlying linear model.
Least Squares
Now how do we find these parameters in \beta ? One method is called the method of least squares.
For this we want to measure “how wrong” any specific vector \beta is. You can measure this in different ways depending on your data and application, but a popular choice is the sum of squared errors:
\mathcal{L}(X\beta, y) = \sum_{i=1}^{n} (x_i^T \beta - y_i)^2 = ||X\beta - y||^2
Here we take our predicted value x_i^T\beta and calculate the difference to the true value y_i . This difference is squared, which has the effect that larger differences are weighted more heavily than small ones. Whether this is appropriate depends on the application, and there are different choices here. If you want to know more about these, look up “loss functions”; that is what the \mathcal{L}( \cdot, \cdot) here is.
After squaring, the losses for all data points are summed up. This is equivalent to the common Euclidean norm (also known as the L2 norm), squared.
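Just to make this equivalence concrete, here is a tiny sketch; the y_pred and y_true values are made-up placeholders, not from the example we will use later.
import numpy as np
# hypothetical predictions and true values, only to illustrate the equivalence
y_pred = np.array([1.0, 2.0, 3.0])
y_true = np.array([1.5, 1.5, 2.0])
sum_of_squares = np.sum((y_pred - y_true) ** 2)          # sum over squared differences
l2_norm_squared = np.linalg.norm(y_pred - y_true) ** 2   # squared Euclidean (L2) norm
print(sum_of_squares, l2_norm_squared)                   # both are 1.5, up to floating-point rounding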
Now we want to minimize this loss, of course, because we want our predicted values to be as close as possible to the true values while still maintaining the linear model. In other words, our goal is the minimizer
\beta^* = \underset{\beta}{\text{argmin }} \mathcal{L} (X\beta, y)
To find the minimum of a function, we form the derivative, set it equal to zero and solve for \beta . If you’re unsure how derivatives work when you’re working with matrices and vectors, I can recommend the Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.
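Spelled out, we first expand the squared norm and then take the derivative with respect to \beta (the relevant matrix derivative rules can be found in the cookbook linked above):
\mathcal{L}(X\beta, y) = (X\beta - y)^T (X\beta - y) = \beta^T X^T X \beta - 2\beta^T X^T y + y^T y
\frac{\partial \mathcal{L}}{\partial \beta} = 2X^T X \beta - 2X^T y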
In this case, setting the derivative to zero gives the following:
-2X^Ty + 2X^TX\beta = 0 \\ \iff X^Ty = X^TX\beta \\ \iff \beta = (X^TX)^{-1} X^Ty
So that’s how we get the optimal \beta , assuming X^TX is invertible (which it is whenever the columns of X are linearly independent).
Python Implementations
Pure NumPy version
Of course you won’t be building every machine learning model from scratch whenever you need to apply one. However, I think there is always some value in understanding how it would be done with just standard Python libraries and NumPy for the matrix-vector calculations.
It also enables you to debug and analyze your data better, because you know why the model might output unexpected values. It can also help with understanding which applications the model is a good fit for.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# data
hours_studied = np.array([3, 5, 7, 8, 10, 13, 14])
points_scored = np.array([60, 75, 77, 81, 84, 85, 90])
plt.scatter(hours_studied, points_scored);
plt.xlabel('Hours studied')
plt.ylabel('Points scored');
Here we imported the numpy module with the shortcut np and the plotting library to create the plot you see above, because it’s always nice to see the data you’re working with. %matplotlib inline is something you might want to add if you’re working with a Jupyter Notebook, as I was for this example.
My example data is an array of hours studied for an exam and, as the outcome, the points scored on that exam. You can see in the scatter plot that the data roughly follows a line, and this line is what we will now find with the calculations above.
First we create the matrices as shown above by reshaping X and Y to be column vectors with shape (7, 1), i.e. 7 rows and 1 column. Then we add a column vector with only entries of value 1 to the left of X. Our \beta will be a column vector of two values, \beta_0 and \beta_1 , which together define the equation of a line y = \beta_1 x + \beta_0 .
# create matrices
X = hours_studied.reshape((7, 1))
X = np.hstack((np.ones((7,1)), X))
Y = points_scored.reshape((7,1))
# find beta
from numpy.linalg import inv
beta = inv(X.T @ X) @ X.T @ Y
We get this by applying the formula we derived before: \beta = (X^TX)^{-1} X^Ty . numpy.linalg.inv is a function for inverting a matrix, and @ is used for matrix multiplication.
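As a side note, explicitly inverting X^TX is fine for a small example like this, but it can become numerically unstable for larger or ill-conditioned data. A sketch of two common alternatives, using the same X and Y as above, would be to solve the normal equations directly or to let NumPy solve the least squares problem for us:
from numpy.linalg import solve, lstsq
# solve the normal equations X^T X beta = X^T y without forming the inverse explicitly
beta_solve = solve(X.T @ X, X.T @ Y)
# or minimize ||X beta - y||^2 directly with NumPy's least squares routine
beta_lstsq, residuals, rank, singular_values = lstsq(X, Y, rcond=None)
Both should give the same beta as the closed-form formula, up to floating-point precision.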
Now all that’s left is plotting our new line in between the data points:
# plot beta line in between data points
x_values = np.arange(3, 15)
y_values = beta[0] + beta[1] * x_values
plt.scatter(hours_studied, points_scored);
plt.plot(x_values, y_values, 'r-')
plt.xlabel('Hours studied')
plt.ylabel('Points scored');
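If you are curious how good the fit actually is, you can plug the fitted \beta back into the loss from the theory section. A minimal sketch, assuming the X , Y and beta from the cells above:
# evaluate the loss L(X beta, y): the sum of squared errors of the fitted line
predictions = X @ beta
sum_of_squared_errors = np.sum((predictions - Y) ** 2)
print(sum_of_squared_errors)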
Scikit-learn
Of course you don’t have to code it yourself every time. There are also ready-to-use implementations. One of the most used libraries in machine learning is Scikit-learn, and it has a Linear Regression model. For our example we use the X and Y from above, except that we skip the part where we added the 1-vector to X , because scikit-learn fits the intercept for us.
from sklearn.linear_model import LinearRegression
# rebuild X without the column of 1s; LinearRegression fits the intercept itself
X = hours_studied.reshape((7, 1))
reg = LinearRegression().fit(X, Y)
beta0 = reg.intercept_
beta1 = reg.coef_
And that’s it. We simply fit the model to our data and then we can extract the coefficients of \beta through reg.intercept_ and reg.coef_ . We can even use the model to predict new values with the predict function, which simply evaluates y = \beta_1 x + \beta_0 at the given point, here 5.
reg.predict(np.array([[5]]))
# output: array([[70.97807018]])
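As a quick sanity check, the coefficients from scikit-learn should agree, up to floating-point precision, with the closed-form beta we computed in the NumPy version, and the prediction can be reproduced by hand:
# intercept and slope from scikit-learn; should match beta from the NumPy version
print(beta0, beta1)
# reproduce the prediction for 5 hours of studying by hand
print(beta0 + beta1 * 5)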