For more information on any
topic on this page see
BIOMETRY by Sokal
Regression is a method
by which a functional relationship in the real world may be described
by a mathematical model which may then, like all models, be used
to explore, describe or predict the relationship.
Regression vs Correlation
Firstly, the difference
between regression and correlation needs to be emphasised. Both
methods attempt to describe the association between two (or more)
variables, and are often confused by students and professionals alike.
Simple linear regression
Let's begin by looking
at the simplest case, where there are two variables, one explanatory
(X) and one response variable (Y), i.e. a change in X
causes a change in Y.
The data used in this analysis are in the file data1.txt
and are shown in Fig 1. It is always worth plotting your
data (if possible) before performing a regression, to get an idea
of the type of relationship (e.g. whether it is best
described by a straight line or a curve).
Figure 1 A plot of the data.
By looking at this scatter plot, it can be seen that variables
X and Y have a close relationship that may be reasonably
represented by a straight line. This would be represented
mathematically as

Y = a + bX + e

where a describes where the line crosses
the y-axis, b describes the
slope of the line, and e is
an error term that describes the variation of the real data above
and below the line. Simple linear regression attempts to find
the straight line that best 'fits' the data, that is, the line for
which the variation of the real data above and below it (measured
as the sum of the squared vertical distances) is minimised.
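As a minimal sketch of how such a line can be found, the Python
below computes the least-squares estimates of a and b directly from
the data. The function name fit_line and the five data points are
hypothetical and used only for illustration; they are not the
contents of data1.txt, and this is not a description of how Minitab
is implemented.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of X, and of cross-products of X and Y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through the means
    return a, b


# Hypothetical illustrative data, NOT the contents of data1.txt.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(xs, ys)
# A residual is the observed Y minus the fitted Y (the e term for each point).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"a = {a:.3f}, b = {b:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
```

For a least-squares line with an intercept, the residuals sum to
zero, which is a quick check that the fit has been computed
correctly.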
Figure 2.1 Plot showing the fitted regression
line and data points.
Figure 2.2 Detailed section of Fig. 2.1
with residuals and fitted values shown.
Assuming that variation in Y
is explained by variation of X,
we can begin our regression. In Minitab it would look like this.
The command, regr
'Y' 1 'X', instructs Minitab
to regress Y onto just 1 explanatory variable, X.
This output gives us the equation of the fitted line, along with
formal information regarding the association of the variables
and how well the fitted line describes the data.
Minitab output regressing Y
on X, with important sections
highlighted in red.
The fitted line has a=7.76
and b=0.769, so the equation is Y = 7.76 + 0.769X. Now that we
know the equation, we can plot the line onto the data (e
is not needed to plot the line); see Fig 2.1. This
is the mathematical model describing the functional response
of Y to X.
The p (probability)
values are given for the constant (a) and for X, which is actually
the slope of the line (b). Each p value is the probability of
obtaining an estimate at least as far from zero as the one observed
if the true value of a (or b) were zero, i.e. if the estimate had
arisen by chance alone. A small p value is therefore evidence that
the estimate differs from zero.
These p values are not a measure of 'goodness of fit' per
se; rather, they indicate how much confidence one can have that
the estimated values reflect a real effect, given the constraints of
the regression analysis (i.e. linear, with all data points
having equal influence on the fitted line). The p(X) value of 0.000
is a little misleading, as Minitab only reports p values to
3 decimal places, so this should be written as p < 0.001.
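To illustrate where a p value for the slope comes from, the sketch
below computes the standard error of b and the corresponding t
statistic, assuming the usual least-squares formulae. The p value
would then be read from a t distribution with n - 2 degrees of
freedom; that step is left out here because it needs a statistics
library. The function name slope_t_statistic and the data are
hypothetical, not taken from data1.txt.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (b, standard error of b, t statistic) for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x

    # Residual (error) mean square: SS(error) over n - 2 degrees of freedom.
    ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    mean_square_error = ss_error / (n - 2)

    se_b = math.sqrt(mean_square_error / sxx)
    return b, se_b, b / se_b


# Hypothetical illustrative data, NOT the contents of data1.txt.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b, se_b, t = slope_t_statistic(xs, ys)
print(f"b = {b:.3f}, SE(b) = {se_b:.3f}, t = {t:.3f}")
```

The larger the t statistic in absolute value, the smaller the p
value, so a slope that is large relative to its standard error is
unlikely to have arisen by chance.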
The R-squared and adjusted
R-squared values are estimates of the 'goodness of fit' of
the line. They represent the percentage of the variation in Y
explained by the fitted line; the closer the points are to the
line, the better the fit. Adjusted R-squared is corrected for the
number of explanatory variables relative to the number of data
points, so it does not rise simply because more terms are added to
the model. R-squared is derived from
R-squared = 100 * SS(regression) / SS(total)
For linear regression with
one explanatory variable like this analysis, R-squared is the
same as the square of r, the correlation coefficient.
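The equivalence between R-squared and the square of r can be
checked numerically. The sketch below computes both from the same
data, using the formula above for R-squared and the usual
product-moment formula for r. The function name r_squared_and_r and
the data are hypothetical and are not the contents of data1.txt.

```python
import math


def r_squared_and_r(xs, ys):
    """Return (R-squared as a percentage, correlation coefficient r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)  # this is SS(total)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Least-squares line, then the variation among its fitted values.
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_regression = sum((a + b * x - mean_y) ** 2 for x in xs)

    r_squared = 100 * ss_regression / syy
    r = sxy / math.sqrt(sxx * syy)
    return r_squared, r


# Hypothetical illustrative data, NOT the contents of data1.txt.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r_squared, r = r_squared_and_r(xs, ys)
print(f"R-squared = {r_squared:.1f}%, 100 * r^2 = {100 * r * r:.1f}%")
```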
The sum of squares (SS) represents
variation from several sources.
SS(regression) describes the variation within the fitted
values of Y, and is the sum of the squared differences
between each fitted value of Y
and the mean of Y. The squares are taken to 'remove'
the sign (+ or -) from the differences, so that positive and
negative values do not cancel out in the calculation.
SS(error) describes the variation of observed Y from estimated
(fitted) Y. It is the sum of the squares of the residuals,
where a residual is the distance
of a data point above or below the fitted line (see Fig 2.2).
SS(total) describes the variation within the values of
Y, and is the sum of the squared differences between
each value of Y and the mean of Y.
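The three sums of squares defined above can be computed in a few
lines. The sketch below does so for a least-squares line and shows
that SS(total) is the sum of SS(regression) and SS(error), which
holds for any least-squares fit with an intercept. The function name
sums_of_squares and the data are hypothetical, not the contents of
data1.txt.

```python
def sums_of_squares(xs, ys):
    """Return (SS(regression), SS(error), SS(total)) for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    fitted = [a + b * x for x in xs]

    # Fitted values about the mean of Y.
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    # Observed values about their fitted values (squared residuals).
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    # Observed values about the mean of Y.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return ss_regression, ss_error, ss_total


# Hypothetical illustrative data, NOT the contents of data1.txt.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

ss_regression, ss_error, ss_total = sums_of_squares(xs, ys)
print(f"SS(regression) = {ss_regression:.2f}")
print(f"SS(error)      = {ss_error:.2f}")
print(f"SS(total)      = {ss_total:.2f}")
```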
This is the same as the p(X)
value in highlight 2.
Data points that are unusually
far from the fitted line (compared to the other points) are pointed
out to the user in Minitab and Genstat. Such data points are
worthy of special attention, as they may be spurious, due to
recording error, for example, and could cause a dodgy regression
line to be fitted. There is some justification for removing such
points from the data before attempting regression analysis, but
there must be very strong evidence that the data is unreliable!
Which data point is
observation 6? (see Fig 2.1)
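One common way to identify such points is to scale each residual by
its estimated standard deviation and flag those that are large. The
sketch below assumes a threshold of 2 in absolute value; this is a
widely used rule of thumb, and the exact rule applied by Minitab or
Genstat is an assumption here rather than something taken from the
output above. The function name standardized_residuals and the data
are hypothetical, not the contents of data1.txt.

```python
import math


def standardized_residuals(xs, ys):
    """Return the standardized residual of each point about Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x

    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard deviation, with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))

    result = []
    for x, e in zip(xs, residuals):
        # Leverage: points far from the mean of X pull the line harder.
        leverage = 1 / n + (x - mean_x) ** 2 / sxx
        result.append(e / (s * math.sqrt(1 - leverage)))
    return result


# Hypothetical data: Y = 2X exactly, except the 4th observation.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.0, 4.0, 6.0, 16.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

std_res = standardized_residuals(xs, ys)
# Observation numbers (counting from 1) exceeding the assumed threshold of 2.
flagged = [i + 1 for i, r in enumerate(std_res) if abs(r) > 2]
print("unusual observations:", flagged)
```

A flagged point is a prompt to go back and check the record, not an
automatic reason to delete it.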