Least squares regressions#
Finnish university students are encouraged to use the CSC Noppe platform.
Others can follow the lesson and fill in their student notebooks using Binder.
A linear regression or least squares fit for a line is a common way of fitting a line to a set of point data for two variables, such as on an x-y graph. In essence, we assume the two values are linearly related to one another and find the line that fits the data best. Once done, this provides a means to estimate values from the line where no data may exist in the scatter plot dataset. Not all data should be expected to be fit well by a line, but linear regressions are a powerful method for determining cases when two variables are directly related to one another. A common example might be the temperature at which magma erupts versus the SiO2 content of the magma, as shown below in Figure 2.1.

Figure 2.1. Eruption temperatures of magmas as a function of their SiO2 content with a linear regression line. Source: Figure 16.1 from McKillup and Dyar, 2010.
The equation of a line#
In order to perform a linear regression, we need to find the equation of the best-fit line, such the one that might relate temperature and SiO2 content in the example above. To do this, we first need to recall the equation for a line:
where \(x\) and \(y\) are the coordinates of the data points, \(A\) is the \(y\)-intercept, and \(B\) is the slope of the line.
Thus, in order to calculate a “best fit” line to some data our task is to determine the values of the constants \(A\) and \(B\). Consider the example below in which \(A\) and \(B\) are known. If we make the rather common assumption that the uncertainties for the values on the \(x\) axis are negligible compared to the uncertainties along the \(y\) axis, we can say:
Thus, it is possible to find the value of \(y\) for two linearly related values when \(A\) and \(B\) are known.
Finding the coefficients \(A\) and \(B\)#
Finding the values of \(A\) and \(B\) then for the case of a linear regression to some \(x\)-\(y\) data is fairly straightforward, though it does involve a bit of algebra. For our purposes, I’ll refer you to Chapter 8 of Taylor, 1997 for a complete explanation of how to find \(A\) and \(B\), and simply present the relevant equations below.
The value of the \(y\)-intercept can be found using
where \(x\) is the \(i\)th data point plotted on the \(x\)-axis, \(y\) is the \(i\)th data point plotted on the \(y\)-axis, and \(\Delta\) is defined below.
The line slope can be found using
where \(N\) is the number of values in the regression.
And the value of \(\Delta\) is
With the equations above, you are now able to calculate an unweighted linear regression, the best-fit lines to some \(x\)-\(y\) data in which the uncertainties in the measurements are not considered to influence the fit of the line. It is also possible to fit regression lines that consider the variable uncertainties in the data, referred to as weighted regressions, but will will not consider that type of regression at this stage.
In-class demonstration space#
The cell below can be used for following live demonstrations during the class lesson.
# Coding done during class time goes below
import numpy as np
import matplotlib.pyplot as plt
# make up some data
# define some x values
x=np.random.rand(10)*5
# y-intercept
A=0
# slope
B=2
#define some random error
e=np.random.rand(10)*2
#define some y-values
y=A+B*x+e
# plot the data
plt.plot(x,y,'bo')
[<matplotlib.lines.Line2D at 0x7ef1a7f99a90>]
#best fit line
# Let's define A, B for the regression
#denumerator
delta=len(x)*sum(x**2)-sum(x)**2
#y-intercept
A=(sum(x**2)*sum(y)-sum(x)*sum(x*y))/delta
#slope
B=(len(x)*sum(x*y)-sum(x)*sum(y))/delta
print(f'The y-intercept is {A:.2}, the slope is {B:.2}.')
The y-intercept is 0.72, the slope is 2.1.
# make our regression line
xreg=np.linspace(0,5,10)
yreg=A+B*xreg
plt.plot(x,y,'bo',xreg,yreg,'k--')
[<matplotlib.lines.Line2D at 0x7ef1a7feb230>,
<matplotlib.lines.Line2D at 0x7ef1a7feb380>]
# define Pearson correlation coefficient
xmean=sum(x)/len(x)
ymean=sum(y)/len(y)
#print(xmean)
#print(ymean)
numerator=sum((x-xmean)*(y-ymean))
denominator=((sum((x-xmean)**2))*(sum((y-ymean)**2)))**0.5
r=numerator/denominator
print(r)
0.982694782973545