## On-line statistics

### Linear Regression

Regression is a simple statistical tool used to model the dependence of a variable on one (or more) explanatory variables. This functional relationship may then be formally stated as an equation, with associated statistical values that describe how well this equation fits the data.

Topics included on this page are:

- Regression vs. correlation
- Simple linear regression, with example

For more information on any topic on this page, see BIOMETRY by Sokal & Rohlf.

Regression is a method by which a functional relationship in the real world may be described by a mathematical model which may then, like all models, be used to explore, describe or predict the relationship.

## Regression vs Correlation

Firstly, the difference between regression and correlation needs to be emphasised. Both methods attempt to describe the association between two (or more) variables, and are often confused by students and professional scientists alike!


## Simple linear regression

Let's begin by looking at the simplest case, where there are two variables, one explanatory (X) and one response variable (Y), i.e. a change in X causes a change in Y. The data used in this analysis is in the file data1.txt and is shown in Fig 1. It is always worth viewing your data (if possible) before performing regressions, to get an idea of the type of relationship (e.g. whether it is best described by a straight line or a curve).

Figure 1 A plot of the data.

By looking at this scatter plot, it can be seen that variables X and Y have a close relationship that may be reasonably represented by a straight line. This would be represented mathematically as

Y = a + b X + e

where a describes where the line crosses the y-axis (the intercept), b describes the slope of the line, and e is an error term that describes the variation of the real data above and below the line. Simple linear regression attempts to find the straight line that best 'fits' the data, that is, the line for which the variation of the real data above and below it (measured as the sum of the squared vertical distances from the line) is minimised.
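The least-squares idea can be sketched in a few lines of Python. This is a minimal illustration, not the routine that Minitab uses: the function name `fit_line` and the five data points are hypothetical (they are not the contents of data1.txt), and the formulas b = Sxy / Sxx and a = mean(Y) - b * mean(X) are the standard closed-form least-squares estimates for one explanatory variable.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b


# Hypothetical values for illustration only (NOT the contents of data1.txt).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(xs, ys)

# A residual is the observed Y minus the fitted Y (the e term for each point).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"a = {a:.3f}, b = {b:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
```

For these illustrative values the fitted line is Y = 2.2 + 0.6 X, and the residuals sum to zero, as they always do for a least-squares line that includes an intercept.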

Figure 2.1 Plot showing the fitted regression line and data points.

Figure 2.2 Detailed section of Fig. 2.1 with residuals and fitted values shown.

Assuming that variation in Y is explained by variation of X, we can begin our regression. In Minitab, the command regr 'Y' 1 'X' instructs Minitab to regress Y onto just 1 explanatory variable, X. The output gives the equation of the fitted line, together with important formal information regarding the association of the variables and how well the fitted line describes the data.

Minitab output regressing Y on X, with important sections highlighted in red.

1. The fitted line has a = 7.76 and b = 0.769. Now that we know the equation, we can plot the line onto the data (e is not needed to plot the line); see Fig 2.1. This is the mathematical model describing the functional response of Y to X.

2. The p (probability) values for the constant (a) and for X, which is actually the slope of the line (b). Each p value is the probability of obtaining an estimate at least as far from zero as the one observed if the true value of a (or b) were zero, so a small p value indicates that the estimate is unlikely to have arisen by chance. These p values are not a measure of 'goodness of fit' per se; rather, they indicate the confidence that one can have that the estimated values differ from zero, given the constraints of the regression analysis (i.e. linear, with all data points having equal influence on the fitted line). The p(X) value of 0.000 is a little misleading, as Minitab only reports p values to 3 decimal places, so this should be written as p(X) < 0.001.

3. The R-squared and adjusted R-squared values are estimates of the 'goodness of fit' of the line. They represent the % variation of the data explained by the fitted line; the closer the points to the line, the better the fit. Adjusted R-squared takes the degrees of freedom into account, correcting R-squared for the number of explanatory variables relative to the number of data points. R-squared is derived from

   R-squared = 100 * SS(regression) / SS(total)

   For linear regression with one explanatory variable, as in this analysis, R-squared is the same as the square of r, the correlation coefficient (expressed as a percentage).

4. The sum of squares (SS) represents variation from several sources. SS(regression) describes the variation within the fitted values of Y, and is the sum of the squared differences between each fitted value of Y and the mean of Y. SS(error) describes the variation of observed Y from estimated (fitted) Y. It is derived from the cumulative addition of the square of each residual, where a residual is the distance of a data point above or below the fitted line (see Fig 2.2). SS(total) describes the variation within the values of Y, and is the sum of the squared differences between each value of Y and the mean of Y, so that SS(total) = SS(regression) + SS(error). In each case the differences are squared to 'remove' the sign (+ or -) and so make the calculation easier.

5. This is the same as the p(X) value in highlight 2.

6. Data points that are unusually far from the fitted line (compared to the other points) are pointed out to the user in Minitab and Genstat. Such data points are worthy of special attention, as they may be spurious (due to recording error, for example) and could cause a dodgy regression line to be fitted. There is some justification for removing such points from the data before attempting regression analysis, but there must be very strong evidence that the data is unreliable! Which data point is observation 6? (see Fig 2.1)
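The sums of squares and R-squared values described for the Minitab output can be reproduced by hand. The sketch below is an illustration under stated assumptions rather than a reproduction of that output: the function name `regression_summary` and the data are hypothetical (not data1.txt), and the adjusted R-squared formula and the t and F statistics are the standard textbook definitions for a single explanatory variable, not taken from this page.

```python
import math


def regression_summary(xs, ys):
    """Sums of squares and fit statistics for Y = a + b X (one explanatory variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Least-squares estimates of slope and intercept.
    b = sxy / sxx
    a = mean_y - b * mean_x
    fitted = [a + b * x for x in xs]

    # The three sums of squares described in the text.
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_total = sum((y - mean_y) ** 2 for y in ys)

    r_squared = 100 * ss_regression / ss_total

    # Assumption: standard adjustment using degrees of freedom
    # (n - 2 for error with one explanatory variable, n - 1 for total).
    ms_error = ss_error / (n - 2)
    adj_r_squared = 100 * (1 - ms_error / (ss_total / (n - 1)))

    # Correlation coefficient r, for comparison with R-squared.
    r = sxy / math.sqrt(sxx * ss_total)

    # Assumption: textbook t statistic for the slope and F ratio for the regression.
    t_slope = b / math.sqrt(ms_error / sxx)
    f_ratio = ss_regression / ms_error  # MS(regression) = SS(regression) / 1

    return {
        "a": a, "b": b,
        "ss_regression": ss_regression,
        "ss_error": ss_error,
        "ss_total": ss_total,
        "r_squared": r_squared,
        "adj_r_squared": adj_r_squared,
        "r": r,
        "t_slope": t_slope,
        "f_ratio": f_ratio,
    }


# Hypothetical values for illustration only (NOT the contents of data1.txt).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

summary = regression_summary(xs, ys)
for name, value in summary.items():
    print(f"{name:14s} {value:8.3f}")
```

With these values SS(regression) = 3.6, SS(error) = 2.4 and SS(total) = 6.0, giving an R-squared of 60% and an adjusted R-squared of about 46.7%. The F ratio (4.5) equals the square of the slope's t statistic, which is why, with one explanatory variable, a p value based on the F ratio coincides with the p value for the slope. Turning t or F into a p value needs the t or F distribution, which is left to a statistics package.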


This page written by Dr Jon Read, April 1998.


Ted Gaten  Department of Biology  gat@le.ac.uk
Entry approved by the Head of Department. Last Updated: May 2000