Summary Statistics


Mean

Denote the n observations in the data set X1, X2, X3, . . . , Xn.

The terms "mean" and "average" are interchangeable. To compute the mean, sum the observations and divide by the number of observations. The symbol for the mean is typically an X with a bar over it--we read this as "X-bar".*

Example

Consider the data listed in the column below. There are five of them: n = 5.

Obsn    Value
X1          4
X2         14
X3          9
X4          2
X5          6
Total      35

The mean is X-bar = (4 + 14 + 9 + 2 + 6)/5 = 35/5 = 7.
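The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the sum-and-divide rule using the five example values; the function name mean is my own choice.

```python
def mean(values):
    """Sum the observations and divide by the number of observations."""
    return sum(values) / len(values)

data = [4, 14, 9, 2, 6]   # X1 ... X5 from the example
x_bar = mean(data)
print(x_bar)              # 35 / 5 = 7.0
```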

* Actually, this symbol is used only when the data set is a sample (measurements taken for a subset of all units/cases). A different symbol, the Greek letter μ ("mu"), is used when referring to the mean of a population (a data set consisting of measurements on all units/cases).

Note
Most calculators costing over $10 have built-in statistical functions. Entering the data and pressing the proper sequence of keys will produce the mean without requiring any arithmetic. Using a calculator is preferred to computing by hand; for all but the smallest data sets a calculator increases speed and decreases the likelihood of error. Consult your calculator's instruction manual for the details. Or, see your friendly local stats guru---(s)he may be able to show you how to do this.

Median

The median M is found as follows.

  1. Sort the observations smallest to largest.
  2. Compute (n + 1)/2. This gives the position of the median (not the median itself) in the ordered data set.

Example

Consider the data listed in the row below. There are five of them: n = 5. I've sorted them.

Position 1 2 3 4 5
Value 2 4 6 9 14

(n + 1)/2 = (5 + 1)/2 = 6/2 = 3; the median is found in the 3rd position. The value in the 3rd position is 6. This value is the median: M = 6.

Example

Consider the starting salary data for 1995 psychology graduates of SUNY-Oswego. The data is shown, in sorted form, below.

08820 10800 12000 12500 13000 14000 15000 16000 16500 16600 16700 16900 16900 17000 17000 17600 17880 18000 18000 18000 18000 18000 18000 18000 18000 18000 18000 18500 18680 19100 20000 20000 20000 20000 20000 20300 20900 22000 23000 23000 23000 23000 23400 24000 25000 25000 26000 26000 27000 30000 30000 32500 37000 48000

There are 54 observations; the median is found in position 55/2 = 27.5. (The 27th and 28th observations are 18000 and 18500.) By convention we take the median to be halfway between these two observations---the average of the two.

M = (18000+18500)/2 = 18250

The median is 18250.
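Here is a minimal Python sketch of the same two steps: sort, then go to position (n + 1)/2. When that position falls halfway between two observations, it averages them, as in the salary example. The function name and the variable names are mine; the salary list is just the data above typed in (with the leading zero dropped from 08820, which Python would not accept).

```python
def median(values):
    """Find the median using the (n + 1)/2 position rule."""
    ordered = sorted(values)              # step 1: sort smallest to largest
    n = len(ordered)
    position = (n + 1) / 2                # step 2: 1-based position of the median
    if position == int(position):         # odd n: the position is a whole number
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]    # observation just below the half position
    upper = ordered[int(position)]        # observation just above it
    return (lower + upper) / 2            # halfway between the two

salaries = [
    8820, 10800, 12000, 12500, 13000, 14000, 15000, 16000, 16500,
    16600, 16700, 16900, 16900, 17000, 17000, 17600, 17880, 18000,
    18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000,
    18500, 18680, 19100, 20000, 20000, 20000, 20000, 20000, 20300,
    20900, 22000, 23000, 23000, 23000, 23000, 23400, 24000, 25000,
    25000, 26000, 26000, 27000, 30000, 30000, 32500, 37000, 48000,
]

print(median([4, 14, 9, 2, 6]))   # 6
print(median(salaries))           # 18250.0
```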

Note
Only very expensive calculators (costing near $100) can find a median. (A median is not computed--it is found!) The median is conceptually easier, and anyone who can order numbers can find it. Computing the mean requires addition and division. In terms of what kids learn in school, the ability to find the median precedes the ability to find the mean. In this respect, it's surprising that most people are inclined to use the mean rather than the median, since the median is more basic.

However, given that we can add and divide, from a computational standpoint the median is really more difficult to find this way because it requires a sorted set of numbers. Sorting is inherently more time-consuming than averaging: averaging takes a single pass through the data, while any sort that works by comparing pairs of values needs on the order of n log n comparisons. (This is a theoretical result in computer science.) This may not seem obvious until you consider the task of sorting a very large set of unordered values. Sorting also requires storing the entire data set, which is a real constraint for calculators with limited memory. Statistical and spreadsheet programs on computers can sort very large sets of values almost instantaneously.

Percentiles

The Pth percentile is the value Q for which P% of the data are less than Q. (P stands for "percentile," Q stands for quantity--a measured value.)

For instance, the 90th percentile of combined SAT scores is 1240 (P = 90, Q = 1240). This means that 90% of all scores are below 1240; equivalently, 10% are above 1240. If your score falls at the 90th percentile that's a good thing.

Higher percentiles are not necessarily a good thing. Your golf average might fall at the 96th percentile. Therefore, 96% of all golfers have an average below yours, 4% above yours. Because higher scores result from poorer golfing, this places your golfing in the worst 4%; only 4% of all golfers are worse than you are.

Some percentiles can be found merely by examining a histogram. Take, for instance, the starting salaries of 1995 SUNY-Oswego Psychology graduates; the histogram is displayed below.

Approximately 6% of the data fall between 7500 and 12500; that is, below 12500 (since this is the lowermost class). Therefore 12500 is approximately the 6th percentile. About 22.5% fall in the 12500 to 17500 class, and 42.5% in the 17500 to 22500 class; therefore, close to 71% (= 6% + 22.5% + 42.5%) of the data fall below 22500. This makes 22500 (approximately) the 71st percentile.

Not all percentiles can be estimated well from the histogram. For instance, consider the 15th percentile. From the histogram it's possible to identify the 6th percentile; further, the 28.5th percentile (6% + 22.5%) is about 17500. However, the histogram provides no information about the distribution of values within the 12500 to 17500 class---which is where the 15th percentile must be. This is one drawback of histograms.
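The histogram readings can be checked against the raw salaries. The sketch below simply counts how many observations fall strictly below each class boundary; the function name percent_below is hypothetical, and "strictly below" is my reading of the definition of a percentile given earlier.

```python
def percent_below(values, q):
    """Percentage of the observations that are strictly less than q."""
    count = sum(1 for v in values if v < q)
    return 100 * count / len(values)

salaries = [
    8820, 10800, 12000, 12500, 13000, 14000, 15000, 16000, 16500,
    16600, 16700, 16900, 16900, 17000, 17000, 17600, 17880, 18000,
    18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000,
    18500, 18680, 19100, 20000, 20000, 20000, 20000, 20000, 20300,
    20900, 22000, 23000, 23000, 23000, 23000, 23400, 24000, 25000,
    25000, 26000, 26000, 27000, 30000, 30000, 32500, 37000, 48000,
]

# Class boundaries read from the histogram discussion.
for boundary in (12500, 17500, 22500):
    print(boundary, round(percent_below(salaries, boundary), 1))
```

With the 54 salaries this prints roughly 5.6, 27.8 and 70.4, close to the 6%, 28.5% and 71% read off the histogram.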

The Median

The Median is the 50th percentile. 50% of the data are below the median, 50% above.

Computing the Pth Percentile

  1. Sort the data.
  2. The Pth percentile is found in position P(n + 1)/100.
  3. If this position is between observations, round to the nearest position (or, if you know how, interpolate between the two).

Example

Let's find the 15th percentile of the starting salary data. Here's the sorted version.

08820 10800 12000 12500 13000 14000 15000 16000 16500 16600 16700 16900 16900 17000 17000 17600 17880 18000 18000 18000 18000 18000 18000 18000 18000 18000 18000 18500 18680 19100 20000 20000 20000 20000 20000 20300 20900 22000 23000 23000 23000 23000 23400 24000 25000 25000 26000 26000 27000 30000 30000 32500 37000 48000

Our position is 15(54 + 1)/100 = 15(55)/100 = 8.25. Round to position 8; the corresponding observation is 16000. The 15th percentile is approximately 16000.

A better approach uses interpolation. The 8.25 position is really between the 8th and 9th positions, which hold 16000 and 16500 respectively. One-quarter of the way from 16000 to 16500 is

0.75(16000)+0.25(16500) = 16125.

So 16125 is a (better) approximation to the 15th percentile. However, it's not that different from 16000, and either would suffice.
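The position rule with interpolation can be sketched in Python as follows. This is one possible implementation of the P(n + 1)/100 recipe, not the only way software computes percentiles (other programs use slightly different position rules). The function name is mine, and clamping positions that run off either end of the data is an assumption the text does not address.

```python
import math

def percentile(values, p):
    """Pth percentile via the P(n + 1)/100 position rule, with interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100            # 1-based, may be fractional
    # Assumption: keep the position inside 1..n for very small or large P.
    position = min(max(position, 1), n)
    below = math.floor(position)            # whole-number position at or below
    fraction = position - below             # how far toward the next position
    if fraction == 0:
        return ordered[below - 1]
    return (1 - fraction) * ordered[below - 1] + fraction * ordered[below]

salaries = [
    8820, 10800, 12000, 12500, 13000, 14000, 15000, 16000, 16500,
    16600, 16700, 16900, 16900, 17000, 17000, 17600, 17880, 18000,
    18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000,
    18500, 18680, 19100, 20000, 20000, 20000, 20000, 20000, 20300,
    20900, 22000, 23000, 23000, 23000, 23000, 23400, 24000, 25000,
    25000, 26000, 26000, 27000, 30000, 30000, 32500, 37000, 48000,
]

print(percentile(salaries, 15))   # 0.75(16000) + 0.25(16500) = 16125.0
print(percentile(salaries, 50))   # the median, 18250.0
```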


Quartiles & IQR

Lower Quartile Q1 (Often called the first quartile.)
The lower quartile Q1 is the 25th percentile. 25% (1/4) of the data are below Q1, 75% of the data are above Q1.
To find Q1, merely find the median of those observations below the median M.
 
Upper Quartile Q3 (Often called the third quartile.)
The upper quartile Q3 is the 75th percentile. 75% of the data are below Q3, 25% (1/4) of the data are above Q3.
To find Q3, merely find the median of those observations above the median M.

Example

Let's find the two quartiles for the starting salary data. Again, the data must be sorted. Begin by computing the median; we did that above and found M = 18250.

08820 10800 12000 12500 13000 14000 15000 16000 16500 16600 16700 16900 16900 17000 17000 17600 17880 18000 18000 18000 18000 18000 18000 18000 18000 18000 18000 18500 18680 19100 20000 20000 20000 20000 20000 20300 20900 22000 23000 23000 23000 23000 23400 24000 25000 25000 26000 26000 27000 30000 30000 32500 37000 48000

The observations below M are the first 27 values in the list (08820 through the last 18000). Their median is found in position (27 + 1)/2 = 14 (counting only those values below M). The value in that position is 17000: the lower quartile is Q1 = 17000.

The observations above M are the remaining 27 values (18500 through 48000). The median of these 27 values is found in position 14 as well (you need only compute the position number for one of the two quartiles; that position will work for both). The 14th of these values is 23000: the upper quartile is Q3 = 23000.

Note: The median might also be referred to as the "middle" or second quartile.

Interquartile Range

Interquartile Range IQR
The interquartile range IQR is the distance between the two quartiles. IQR = Q3 - Q1. The IQR is one measure of a distribution's spread.

The interquartile range should accompany the median as an appropriate measure of spread.

Example

For the psychology graduates we found an upper quartile of 23000 and a lower quartile of 17000. Then IQR = 23000 - 17000 = 6000.
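Here is a minimal Python sketch of the quartile recipe: split the sorted data at the median, take the median of each half, and subtract to get the IQR. The function names are mine. Leaving the middle observation out of both halves when n is odd is also my assumption, since the worked example has an even number of observations.

```python
def median(values):
    """Median by the (n + 1)/2 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                       # odd n: the middle value
    return (ordered[middle - 1] + ordered[middle]) / 2

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2                  # assumption: middle value excluded when n is odd
    lower = ordered[:half]         # observations below the median position
    upper = ordered[n - half:]     # observations above the median position
    return median(lower), median(upper)

salaries = [
    8820, 10800, 12000, 12500, 13000, 14000, 15000, 16000, 16500,
    16600, 16700, 16900, 16900, 17000, 17000, 17600, 17880, 18000,
    18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000,
    18500, 18680, 19100, 20000, 20000, 20000, 20000, 20000, 20300,
    20900, 22000, 23000, 23000, 23000, 23000, 23400, 24000, 25000,
    25000, 26000, 26000, 27000, 30000, 30000, 32500, 37000, 48000,
]

q1, q3 = quartiles(salaries)
iqr = q3 - q1
print(q1, q3, iqr)
```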


Standard Deviation

Denote the n observations in the data set X1, X2, X3, . . . , Xn.

Before computing the standard deviation you must compute the mean. (The data need not be sorted.)

To find the variance of the n observations:

  1. Find the mean.
  2. Find the deviation of each observation from the mean. There will be n of these.
  3. Square each of these.
  4. Sum the squared deviations.
  5. Divide the sum by (n - 1).

S^2 is the symbol used for the variance.

The standard deviation S is the square root of the variance. We will always use the standard deviation; computing the variance is at most an intermediate step.
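Written as formulas, the five steps for the variance and the square root that follows amount to the expressions below. The index i, standing for "the ith observation," is my notation; X-bar is the mean.

```latex
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2,
\qquad
S = \sqrt{S^2}
```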

Example

If forced to compute a standard deviation by hand, this is how I go about it. The data are listed in the Value column of the table.

Value   Deviation   Deviation^2
    4          -3             9
   14           7            49
    9           2             4
    2          -5            25
    6          -1             1
Total           0            88

Find the mean. The mean is 7 (see above).

For each observation, compute deviation = value - mean. I've placed these values in the Deviation column of the table. For instance, the 1st value in the data set is 4: deviation = 4 - 7 = -3. A negative deviation indicates an observation below the mean; a positive deviation indicates an observation above the mean. Note that the deviations add to 0. This is true for any set of numbers and is a general property of the mean. The deviations below the mean and those above the mean have sums of equal size (here -9 and 9).

Square each deviation. I've done this in the last column of the table. For example, the 1st value in the data set is 4, the corresponding deviation is -3. Squaring -3 results in 9. This column can never contain a negative value.

Sum the squared deviations. Add the values in the last column. Their sum is 88.

Divide this sum by one fewer than the number of observations; the resulting value is the variance. There are 5 observations, less one gives 4. 88/4 = 22. The variance is S^2 = 22.

The square root of the variance is the standard deviation. The square root of 22 is 4.69, so the standard deviation is S = 4.69 (rounding to 4.7 would be appropriate).
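The hand computation above can be sketched in Python, one line per step. The function names are my own; the division by n - 1 follows the steps listed earlier.

```python
import math

def variance(values):
    """Variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n                     # step 1: the mean
    deviations = [x - mean for x in values]    # step 2: value - mean
    squared = [d ** 2 for d in deviations]     # step 3: square each deviation
    total = sum(squared)                       # step 4: sum the squares
    return total / (n - 1)                     # step 5: divide by n - 1

def standard_deviation(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(values))

data = [4, 14, 9, 2, 6]
print(variance(data))                        # 88 / 4 = 22.0
print(round(standard_deviation(data), 2))    # 4.69
```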

Note
Most calculators costing over $10 have built-in statistical functions. Entering the data and pressing the proper sequence of keys will produce the standard deviation without requiring any computing by hand. Consult your calculator's instruction manual for the details. Or, see your friendly local stats guru---(s)he may be able to show you how to do this.