
 On-line statistics


Normal distributions


 

The term 'normal distribution' refers to a particular way in which observations tend to pile up around a particular value rather than being spread evenly across a range of values. The Central Limit Theorem explains why this pattern is so common: a measurement that is the sum of many small, independent influences tends towards a normal distribution. The normal distribution is generally most applicable to continuous data and is intrinsically associated with parametric statistics (e.g. ANOVA, t tests, regression analysis). Graphically, the normal distribution is best described by a 'bell-shaped' curve. This curve is described in terms of the point at which its height is maximum (its 'mean') and how wide it is (its 'standard deviation').

In the above example, the most common measurement (i.e. 9) is the same in curves A and B, but there is a greater range of values for A than for B. Curve C has the same spread as A, but its most common measurement (i.e. 18) is twice that of curve A. All of these distributions are normal and can be described by:

$$z = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(Y-\mu)^2}{2\sigma^2}}$$

where z is the relative height of the curve (the proportion of observations expected) at measurement Y, µ is the mean and σ is the standard deviation of the curve. With this equation you can do a graphic check of a sample to see whether the data are normally distributed. The mean is defined as:

$$\mu = \frac{\sum x}{n}$$

That is, the sum of all the measurements (x) divided by the number of measurements made (n). The spread of the normal distribution (its variance) is based on how much the measurements differ from the mean:

$$\sigma^2 = \frac{\sum (x-\mu)^2}{n}$$

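That is, the sum of the squared differences between each of the measurements and the mean, divided by the number of measurements (when the variance of a population is estimated from a sample, n - 1 is commonly used as the divisor instead of n). From this equation it can be seen that if many measurements are much greater or smaller than the mean then the variance will be large. A mean is generally quoted with its standard deviation, which is simply the square root of the variance.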
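The definitions of the mean, variance and standard deviation can be tried out with a short Python sketch. It assumes the observed frequencies from the woodlice example further down this page; the function name `describe` and the `ddof` switch (0 to divide by n, 1 to divide by n - 1) are illustrative choices, not part of the original page.

```python
import math

# Observed frequencies from the woodlice example on this page:
# antenna length (mm) -> number of woodlice with that length.
observed = {0: 2, 1: 3, 2: 4, 3: 3, 4: 4, 5: 7, 6: 3, 7: 3, 8: 2, 9: 1}


def describe(freq, ddof=0):
    """Return (n, mean, variance, standard deviation) for a frequency table.

    ddof=0 divides the sum of squares by n; ddof=1 divides by n - 1.
    """
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    # Each squared difference is weighted by how often that value occurred.
    sum_sq = sum(f * (x - mean) ** 2 for x, f in freq.items())
    variance = sum_sq / (n - ddof)
    return n, mean, variance, math.sqrt(variance)


n, mean, variance, sd = describe(observed)
print(f"n = {n}, mean = {mean:.2f}, variance = {variance:.2f}, sd = {sd:.2f}")
```

With these frequencies the sketch prints a mean of 4.22 and a standard deviation of 2.37, the latter agreeing with the value quoted in the example when the divisor is n. Calling `describe(observed, ddof=1)` gives the slightly larger n - 1 version.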

By multiplying z by the total number of observations you can calculate the number of observations you would expect to see for a particular measurement. As long as the data are normally distributed, if you then plot the cumulative observed frequency against the cumulative expected frequency the resulting plot should be a straight line.

Example

The antennae lengths of a sample of 32 woodlice were measured and found to have a mean of 4 mm and standard deviation of 2.37 mm. Using these parameters and the equation above, the expected frequency at each of the lengths encountered was calculated.

Measurement   Observed    Cumulative observed   Estimated    Estimated cumulative
(mm)          frequency   frequency             frequency    frequency

0             2           2                     1.3          1.3
1             3           5                     2.4          3.7
2             4           9                     3.8          7.5
3             3           12                    4.9          12.4
4             4           16                    5.4          17.8
5             7           23                    4.9          22.7
6             3           26                    3.8          26.5
7             3           29                    2.4          28.9
8             2           31                    1.3          30.2
9             1           32                    0.6          30.8

When the observed frequencies (bars) are plotted against the predicted normal distribution (red line), there is rough agreement between the two. When the cumulative frequencies are plotted against each other, the points fall close to a straight line, which suggests that this sample may have a distribution close enough to normal to allow the use of parametric statistics. To test for normality properly you would have to use something like a Kolmogorov-Smirnov test (see below).
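The estimated frequencies in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes the values quoted in the example (32 woodlice, mean 4 mm, standard deviation 2.37 mm) and measurement classes 1 mm wide, so that multiplying the curve height by the number of observations gives an expected count. The names `normal_height`, `expected` and `cum_expected` are illustrative.

```python
import math
from itertools import accumulate


def normal_height(y, mu, sigma):
    """Height z of the normal curve at measurement y."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


# Values quoted in the woodlice example.
n_obs, mu, sigma = 32, 4.0, 2.37
lengths = list(range(10))                      # antenna lengths 0..9 mm
observed = [2, 3, 4, 3, 4, 7, 3, 3, 2, 1]      # observed frequencies

# Expected count at each length = curve height x number of observations.
expected = [n_obs * normal_height(y, mu, sigma) for y in lengths]
cum_observed = list(accumulate(observed))
cum_expected = list(accumulate(expected))

for y, o, co, e, ce in zip(lengths, observed, cum_observed, expected, cum_expected):
    print(f"{y:>2} {o:>3} {co:>3} {e:>5.1f} {ce:>5.1f}")
```

Plotting `cum_observed` against `cum_expected` gives the cumulative frequency plot described above; points lying close to a straight line suggest an approximately normal sample.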


Deviations from Normality

The above describes normal distributions, which are found only occasionally in real data.

Tests for Normality

The simplest method of assessing normality is to look at the frequency distribution histogram. The most important things to look at are the symmetry and peakedness of the curve. In addition, be aware of curves with two or more peaks: these indicate a bimodal (or multimodal) distribution, which is awkward to handle statistically.

Visual appraisals should only be used as a first indication of the distribution; better methods must then be used. Values of skew and kurtosis, as found in Excel's Function Wizard (SKEW and KURT respectively), are another good indicator, but can be over-optimistic regarding the data's match with normality. Before the advent of good computers and statistical programs, users could be forgiven for trying to avoid any surplus calculations. Now that both are available and much easier to use, tests for normality (and homogeneity of variance) should always be carried out as best practice in statistics. SPSS and Minitab contain the Kolmogorov-Smirnov test, which is the principal goodness of fit test for normal and uniform data sets. Alternatively, if you are a whizz on the calculator or in Excel and have a day or two spare, or have access to UNISTAT, you may wish to use the Shapiro-Wilk test, which is more reliable when n < 50.
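For readers without Excel, skew and kurtosis can be computed directly. The sketch below assumes the bias-corrected sample formulas that Excel's SKEW and KURT functions are documented to use (KURT reporting excess kurtosis, so a normal distribution scores about 0), applied to the woodlice frequencies from the example above. The function names are illustrative.

```python
import math


def _mean_and_sd(values):
    """Mean and sample standard deviation (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd


def sample_skew(values):
    """Bias-corrected sample skewness (assumed to match Excel's SKEW)."""
    n = len(values)
    mean, sd = _mean_and_sd(values)
    total = sum(((v - mean) / sd) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * total


def sample_kurtosis(values):
    """Bias-corrected excess kurtosis (assumed to match Excel's KURT)."""
    n = len(values)
    mean, sd = _mean_and_sd(values)
    total = sum(((v - mean) / sd) ** 4 for v in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * total
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


# Expand the woodlice frequency table into one value per animal.
observed = {0: 2, 1: 3, 2: 4, 3: 3, 4: 4, 5: 7, 6: 3, 7: 3, 8: 2, 9: 1}
lengths = [x for x, f in observed.items() for _ in range(f)]

print(f"skew = {sample_skew(lengths):.3f}, kurtosis = {sample_kurtosis(lengths):.3f}")
```

Values of both statistics near zero are consistent with normality, but as noted above they should be treated as indicators rather than as a formal test.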

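Both of the above tests use the same hypotheses:

H0: the sample comes from a normally distributed population.
H1: the sample does not come from a normally distributed population.

The P-value will be provided by SPSS or Minitab; if it is below 0.05, reject H0 and treat the data as not normally distributed.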
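To show what the Kolmogorov-Smirnov test measures, the following sketch computes only its test statistic D, the largest gap between the cumulative proportion observed in the sample and the cumulative proportion expected under a normal curve. It assumes the mean and standard deviation quoted in the woodlice example and does not attempt the P-value, which is best left to a statistics package. The function names are illustrative, and whole-millimetre data with many tied values are used here purely for illustration.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative proportion of a normal distribution lying at or below x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(values, mu, sigma):
    """Largest gap D between the observed and normal cumulative proportions."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf(x, mu, sigma)
        # Compare just after (i / n) and just before ((i - 1) / n) each step
        # of the observed cumulative curve.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Woodlice antenna lengths (mm), expanded from the frequency table.
observed = {0: 2, 1: 3, 2: 4, 3: 3, 4: 4, 5: 7, 6: 3, 7: 3, 8: 2, 9: 1}
lengths = [x for x, f in observed.items() for _ in range(f)]

d = ks_statistic(lengths, mu=4.0, sigma=2.37)
print(f"D = {d:.3f}")
```

A small D means the observed and expected cumulative curves stay close together; the package then converts D and the sample size into the P-value used in the decision rule above.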



 Ted Gaten  Department of Biology  gat@le.ac.uk
Entry approved by the Head of Department. Last Updated: May 2000