## On-line statistics

Descriptive statistics

Descriptive statistics are used simply to describe the sample you are concerned with. They are used first to get a feel for the data, second in the statistical tests themselves, and third to indicate the error associated with results and graphical output. Many of the descriptions or "parameters", such as the mean, will be familiar to you already, and you probably use them far more than you are aware. For instance, when have you taken a trip to see a friend without a quick estimate of the time it will take you to get there (= mean)? Very often you will give your friend a time period within which you expect to arrive, say "between 7.30 and 8.00, traffic depending". This is an estimate of the standard deviation, or perhaps the standard error, of the times taken in previous trips. The more often you have taken the same journey, the better the estimate will be. It is the same when measuring the length of the forelegs of a sample of donkeys in a biological experiment.

This page is divided up into two main sections but you must also refer to the pages on normal, binomial, negative binomial and Poisson distributions as these are also descriptive of data sets.

All examples on this page refer to samples and not the population as a whole.

All the following can be calculated using the "descriptive statistics" or "summary statistics" function in Excel or any of the statistics software.

Measures of central tendency

Most data sets have many values that cluster about a central value (the mean). The number of data points with a given value declines the farther that value is from the mean. This phenomenon can clearly be seen in the following frequency distribution graphs.

Do ensure that your data does not follow the pattern displayed in bold "bimodal distribution". This suggests that you have sampled two populations (such as male and female where sexual dimorphism is apparent) and such data cannot be analysed easily.

Mean

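The most common description of the central tendency is the mean ($\bar{x}$), found using $\bar{x} = \frac{\sum x}{n}$, i.e. the sum of the observations divided by the number of observations. For example, take the following sample:

28.5 18.75 22.9 25.4 24.55 23.7 23.9

By examination of the data the mean can be estimated at around 24. Using the above equation, it is $\bar{x} = 167.7 / 7 \approx 23.96$.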
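As a minimal sketch of the same calculation, the mean of the seven sample values can be found in Python. The names `sample` and `sample_mean` are illustrative only and are not part of the original page.

```python
# The seven sample values listed in the text
sample = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]

def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

mean = sample_mean(sample)
print(round(mean, 2))  # 23.96
```

The result agrees with the rough estimate of "around 24" made by eye.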

Median

However, the mean can distort the picture if there are a few extreme but legitimate values (not affected by inaccurate measurements). The median can help with this scenario and is found by locating the "middle" value.

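18.75 22.9 23.7 23.9 24.55 25.4 28.5

With the seven values sorted into ascending order, the middle (fourth) value is 23.9, so the median is 23.9.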

If n is an even number the median is the mean of the two middle values.
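The two rules for the median (odd and even n) can be sketched as follows. The function name `median` is an illustrative choice; Python's standard `statistics.median` gives the same results.

```python
# The seven sample values listed in the text
sample = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]

def median(values):
    """Middle value of the sorted data, or the mean of the two
    middle values when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median(sample)  # n = 7, so the 4th sorted value is used

# Purely to illustrate the even case, drop the smallest value (n = 6)
even_median = median(sorted(sample)[1:])

print(odd_median)  # 23.9
```

For the six-value illustration the two middle values are 23.9 and 24.55, so the median is their mean.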

Mode

This is the value that occurs most often. It does not exist in many data sets, including the one above. It is of use where the above two parameters cannot be found, most often in categorized data sets, e.g. the following pitfall trap data:

| Coleoptera | Molluscs | Annelids | Mammals | Dipterans | Homoptera | Hemiptera |
|---|---|---|---|---|---|---|
| 35 | 12 | 14 | 2 | 25 | 17 | 20 |

The mean for each category is already displayed and the median is irrelevant. The mode is the Coleoptera group with 35 hits.
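A short sketch of finding the modal category from the pitfall trap counts is given below. The variable names are illustrative. The second part checks that the earlier numeric sample has no repeated value, and therefore no mode.

```python
from collections import Counter

# Pitfall trap counts from the text
trap_counts = {
    "Coleoptera": 35, "Molluscs": 12, "Annelids": 14, "Mammals": 2,
    "Dipterans": 25, "Homoptera": 17, "Hemiptera": 20,
}

# The mode is the category with the highest count
modal_group = max(trap_counts, key=trap_counts.get)
print(modal_group, trap_counts[modal_group])  # Coleoptera 35

# The earlier numeric sample: every value occurs once, so there is no mode
sample = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]
has_repeat = max(Counter(sample).values()) > 1
print(has_repeat)  # False
```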

Measures of Dispersion and Variability

We can describe data more fully using other parameters that are also used in the hypothesis tests.

Range - the lowest and highest values in a data set: 18.75 to 28.5 and 2 to 35 respectively in the two data sets above.
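The ranges quoted for the two data sets can be checked with a few lines of Python. The helper name `data_range` is an illustrative assumption.

```python
# The two data sets used on this page
sample = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]
trap_counts = [35, 12, 14, 2, 25, 17, 20]

def data_range(values):
    """Return the (lowest, highest) pair of values."""
    return min(values), max(values)

print(data_range(sample))       # (18.75, 28.5)
print(data_range(trap_counts))  # (2, 35)
```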

Standard Deviation (s) - Useful for assessing how variable a sample is, although the coefficient of variation is easier to use.
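The page does not give the formula for s. The sketch below assumes the usual sample formula, with n - 1 as the divisor, since all examples here refer to samples rather than populations. The function names are illustrative.

```python
import math

# The seven sample values listed in the text
sample = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Sample standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))

s = sample_sd(sample)
print(round(s, 2))  # 2.93
```

Python's standard `statistics.stdev` uses the same n - 1 divisor and can be used as a cross-check.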

Coefficient of Variation - Useful to see how much variation occurs within your data set. The higher it is, the more data points you need to collect to be confident that the sample is representative of the population. It can also be used to compare variation between data sets. Calculated using $CV = \frac{s}{\bar{x}} \times 100$, where $s$ is the standard deviation and $\bar{x}$ is the mean.
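As a hedged illustration, the coefficient of variation for the seven sample values can be computed as below. It assumes the common definition: standard deviation divided by the mean, expressed as a percentage. The function name is an illustrative choice.

```python
import statistics

# The seven sample values listed in the text
sample = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

cv = coefficient_of_variation(sample)
print(round(cv, 1))  # 12.2
```

Because the result is a percentage of the mean, it can be compared between data sets measured on different scales.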

Variance ($s^2$) - This is the most difficult value to use and need only be considered when using t-tests or ANOVA. Two or more $s^2$ values can be compared statistically using the F-test or homogeneity of variance tests.

Standard Error (SE) - This is essential to assess how closely your sample relates to the population. By calculating the 95% confidence intervals ($\bar{x} \pm t \times SE$, where $t$ is the tabulated value for $n - 1$ degrees of freedom) you can say that the population mean has a 95% chance of being within this range. Such information should be included in graphical output.
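The following sketch computes the standard error and a 95% confidence interval for the seven sample values. It assumes the usual definitions, SE = s / √n and interval = mean ± t × SE. The value 2.447 is the two-tailed 5% critical value of t for 6 degrees of freedom, taken from standard t tables. The function names and the constant name are illustrative.

```python
import math
import statistics

# The seven sample values listed in the text
sample = [28.5, 18.75, 22.9, 25.4, 24.55, 23.7, 23.9]

# Assumption: two-tailed 5% critical value of t for n - 1 = 6 degrees
# of freedom, read from standard t tables
T_CRIT_95_DF6 = 2.447

def standard_error(values):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(values) / math.sqrt(len(values))

def confidence_interval(values, t_crit):
    """Return (lower, upper) limits: mean -/+ t_crit * SE."""
    mean = statistics.mean(values)
    half_width = t_crit * standard_error(values)
    return mean - half_width, mean + half_width

se = standard_error(sample)
low, high = confidence_interval(sample, T_CRIT_95_DF6)
print(round(se, 2), round(low, 2), round(high, 2))
```

For a different sample size the critical value of t must be looked up again, since it depends on the degrees of freedom.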


Ted Gaten  Department of Biology  gat@le.ac.uk
Entry approved by the Head of Department. Last Updated: May 2000