Summary Statistics
Mean
Denote the n observations in the data set X_{1}, X_{2}, X_{3}, . . . , X_{n}. The terms "mean" and "average" are interchangeable. To compute the mean, sum the observations and divide by the number of observations. The symbol for the mean is typically an X with a bar over it*; we read this as "Xbar".
Example
Consider the five observations 4, 14, 9, 2, 6 (n = 5). The mean is Xbar = (4 + 14 + 9 + 2 + 6) ÷ 5 = 35 ÷ 5 = 7.
* Strictly, this symbol is used only when the data set is a sample (measurements taken for a subset of all units/cases). A different symbol, the Greek letter μ ("mu"), is used when referring to the mean of a population (a data set consisting of measurements on all units/cases).
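The arithmetic in the mean example can be checked with a few lines of Python. This is only a minimal sketch; the function name `mean` is chosen here for illustration and is not part of the original notes.

```python
def mean(values):
    """Sum the observations and divide by the number of observations."""
    if not values:
        raise ValueError("mean requires at least one observation")
    return sum(values) / len(values)


data = [4, 14, 9, 2, 6]   # the five observations from the example
print(mean(data))          # 35 / 5 = 7.0
```

Python's standard library offers `statistics.mean`, which should give the same result for this data.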
Median
The median M is found as follows. Sort the data and compute the position (n + 1)/2. If the position is a whole number, M is the observation in that position; otherwise, M is the average of the two observations on either side of that position.
Example
Consider a data set of five observations (n = 5), already sorted. (n + 1)/2 = (5 + 1)/2 = 6/2 = 3; the median is found in the 3rd position. The value in the 3rd position is 6, so the median is M = 6.
Example
Consider the starting salary data for 1995 psychology graduates of SUNY Oswego. The data are shown, in sorted form, below.
08820 10800 12000 12500 13000 14000 15000 16000 16500
16600 16700 16900 16900 17000 17000 17600 17880 18000
18000 18000 18000 18000 18000 18000 18000 18000 18000
18500 18680 19100 20000 20000 20000 20000 20000 20300
20900 22000 23000 23000 23000 23000 23400 24000 25000
25000 26000 26000 27000 30000 30000 32500 37000 48000
There are 54 observations; the median is found in position 55/2 = 27.5, that is, between the 27th and 28th observations (18000 and 18500). By convention we take the median to be halfway between these two observations, the average of the two: M = (18000 + 18500)/2 = 18250. The median is 18250.
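The position rule used in the two median examples can be sketched in Python as follows. The function name `median` and the variable names are illustrative assumptions; the salary list is copied from the sorted data above.

```python
def median(values):
    """Median via the (n + 1)/2 position rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one observation")
    position = (n + 1) / 2               # 1-based position, may end in .5
    lower = ordered[int(position) - 1]   # observation at the whole-number part
    if position == int(position):
        return lower                     # odd n: a single middle observation
    upper = ordered[int(position)]       # even n: the next observation up
    return (lower + upper) / 2           # halfway between the two


salaries = [
    8820, 10800, 12000, 12500, 13000, 14000, 15000, 16000, 16500,
    16600, 16700, 16900, 16900, 17000, 17000, 17600, 17880, 18000,
    18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000,
    18500, 18680, 19100, 20000, 20000, 20000, 20000, 20000, 20300,
    20900, 22000, 23000, 23000, 23000, 23000, 23400, 24000, 25000,
    25000, 26000, 26000, 27000, 30000, 30000, 32500, 37000, 48000,
]

print(median([4, 14, 9, 2, 6]))   # n = 5, position 3
print(median(salaries))           # n = 54, position 27.5
```

For the salary data the position is 27.5, so the function averages the 27th and 28th sorted values, matching the hand calculation.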
Percentiles
The Pth percentile is the value Q such that P% of the data fall below Q. For instance, the 90th percentile of combined SAT scores is 1240 (P = 90, Q = 1240). This means that 90% of all scores are below 1240; equivalently, 10% are above 1240. If your score falls at the 90th percentile, that's a good thing.
Higher percentiles are not necessarily a good thing. Your golf average might fall at the 96th percentile. Therefore, 96% of all golfers have an average below yours, and 4% above yours. Because higher scores result from poorer golfing, this places your golfing in the worst 4%; only 4% of all golfers are worse than you are.
Some percentiles can be found merely by examining a histogram. Take, for instance, a histogram of the starting salaries of 1995 SUNY Oswego psychology graduates, with classes 7500 to 12500, 12500 to 17500, 17500 to 22500, and so on. Approximately 6% of the data fall between 7500 and 12500; that is, below 12500 (since this is the lowermost class). Therefore 12500 is approximately the 6th percentile. About 22.5% fall in the 12500 to 17500 class, and 42.5% in the 17500 to 22500 class; therefore, close to 71% (= 6% + 22.5% + 42.5%) of the data fall below 22500. This makes 22500 (approximately) the 71st percentile.
Not all percentiles can be estimated well from the histogram. For instance, consider the 15th percentile. From the histogram it's possible to identify the 6th percentile; further, the 28.5th percentile is about 17500. However, the histogram provides no information about the distribution of values within the 12500 to 17500 class, which is where the 15th percentile must be. This is one drawback of histograms.
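The running totals in the histogram discussion can be reproduced with a short Python sketch. The class percentages are the approximate values read from the histogram in the text; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Upper boundary of each histogram class and the approximate
# percentage of observations falling in that class (from the text).
class_upper_bounds = [12500, 17500, 22500]
class_percents = [6.0, 22.5, 42.5]

# Running total: percentage of the data below each upper boundary.
cumulative = list(accumulate(class_percents))

for bound, pct in zip(class_upper_bounds, cumulative):
    print(f"{bound} is approximately percentile {pct}")
```

Only class boundaries get a percentile this way, which is why the 15th percentile cannot be read off the histogram.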
Computing the Pth Percentile
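The example that follows places the Pth percentile at position P(n + 1)/100 in the sorted data and interpolates between neighboring observations when the position is not a whole number. A minimal Python sketch of that rule is given here. The function name `percentile` and the clamping of positions that fall outside 1 to n are assumptions added for illustration, not part of the original notes. Other software may use different conventions and give slightly different answers.

```python
def percentile(values, p):
    """Pth percentile via position p(n + 1)/100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile requires at least one observation")
    position = p * (n + 1) / 100           # 1-based, possibly fractional
    position = min(max(position, 1), n)    # assumption: clamp to the data range
    whole = int(position)                  # e.g. 8 when position is 8.25
    fraction = position - whole            # e.g. 0.25
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    upper = ordered[whole]
    return lower + fraction * (upper - lower)


salaries = [
    8820, 10800, 12000, 12500, 13000, 14000, 15000, 16000, 16500,
    16600, 16700, 16900, 16900, 17000, 17000, 17600, 17880, 18000,
    18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000, 18000,
    18500, 18680, 19100, 20000, 20000, 20000, 20000, 20000, 20300,
    20900, 22000, 23000, 23000, 23000, 23000, 23400, 24000, 25000,
    25000, 26000, 26000, 27000, 30000, 30000, 32500, 37000, 48000,
]

print(percentile(salaries, 15))   # position 8.25, between 16000 and 16500
```

With P = 50 the same rule gives position 27.5 and reproduces the median found earlier.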
Example
Let's find the 15th percentile of the starting salary data. Here's the sorted version.
08820 10800 12000 12500 13000 14000 15000 16000 16500
16600 16700 16900 16900 17000 17000 17600 17880 18000
18000 18000 18000 18000 18000 18000 18000 18000 18000
18500 18680 19100 20000 20000 20000 20000 20000 20300
20900 22000 23000 23000 23000 23000 23400 24000 25000
25000 26000 26000 27000 30000 30000 32500 37000 48000
Our position is 15(54 + 1)/100 = 15(55)/100 = 8.25. Round to position 8; the corresponding observation is 16000. The 15th percentile is approximately 16000.
A better approach uses interpolation. The 8.25 position is really between positions 8 and 9, which hold 16000 and 16500 respectively. One quarter of the way from 16000 to 16500 is 16000 + 0.25(16500 − 16000) = 16125.
So 16125 is a (better) approximation to the 15th percentile. However, it's not that different from 16000, and either would suffice.
Quartiles & IQR
Example
Let's find the two quartiles for the starting salary data. Again, the data must be sorted. Begin by computing the median; we did that above and found M = 18250.
08820 10800 12000 12500 13000 14000 15000 16000 16500
16600 16700 16900 16900 17000 17000 17600 17880 18000
18000 18000 18000 18000 18000 18000 18000 18000 18000
18500 18680 19100 20000 20000 20000 20000 20000 20300
20900 22000 23000 23000 23000 23000 23400 24000 25000
25000 26000 26000 27000 30000 30000 32500 37000 48000
There are 27 observations below M (the first three rows). Their median is therefore found in position (27 + 1)/2 = 14, counting only the values below M. The lower quartile is Q_{1} = 17000.
The observations above M are the remaining 27 values (the last three rows). The median of these 27 values is found in position 14 as well (you need only compute the position number for one of the two quartiles; that position will work for both). The upper quartile is Q_{3} = 23000.
Note: The median might also be referred to as the "middle" or second quartile.
Interquartile Range
The interquartile range is the difference between the upper and lower quartiles, IQR = Q_{3} − Q_{1}. It should accompany the median as an appropriate measure of spread.
Example
For the psychology graduates we found an upper quartile of 23000 and a lower quartile of 17000. Then IQR = 23000 − 17000 = 6000.
Standard Deviation
Denote the n observations in the data set X_{1}, X_{2}, X_{3}, . . . , X_{n}. Before computing the standard deviation you must compute the mean. (The data need not be sorted.) To find the variance of the n observations, sum the squared deviations from the mean and divide by n − 1:
S^{2} = [(X_{1} − Xbar)^{2} + (X_{2} − Xbar)^{2} + . . . + (X_{n} − Xbar)^{2}] / (n − 1)
S^{2} is the symbol used for the variance. The standard deviation S is the square root of the variance. We will always use the standard deviation; computing the variance is at most an intermediate step.
Example
If forced to compute a standard deviation by hand, this is how I go about it. The data are listed in the first column of the table.

Value   Deviation   Squared deviation
  4        −3              9
 14         7             49
  9         2              4
  2        −5             25
  6        −1              1
Sum         0             88

1. Find the mean. The mean is 7 (see above).
2. For each observation, compute deviation = value − mean. I've placed these values in the second column of the table. For instance, the 1st value in the data set is 4: deviation = 4 − 7 = −3. A negative deviation indicates an observation below the mean; a positive deviation indicates an observation above the mean. Note that the deviations add to 0. This is true for any set of numbers and is a general property of the mean: the deviations below the mean and the deviations above the mean have sums of identical size.
3. Square each deviation. I've done this in the third column. For example, the 1st value in the data set is 4 and the corresponding deviation is −3. Squaring −3 results in 9. This column should contain no negative values.
4. Sum the squared deviations. Add the values in the third column. Their sum is 88.
5. Divide this sum by one fewer than the number of observations; the resulting value is the variance. There are 5 observations; less one gives 4. 88/4 = 22. The variance is S^{2} = 22.
6. The square root of the variance is the standard deviation. The square root of 22 is 4.69, so the standard deviation is S = 4.69 (rounding to 4.7 would be appropriate).
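The hand calculation above translates directly into a short Python sketch. The function names `sample_variance` and `sample_std_dev` are illustrative assumptions; the steps follow the n − 1 divisor used in the example.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("variance requires at least two observations")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_std_dev(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))


data = [4, 14, 9, 2, 6]
print(sample_variance(data))            # 88 / 4 = 22.0
print(round(sample_std_dev(data), 2))   # 4.69
```

Python's standard library function `statistics.stdev` also divides by n − 1, so it should agree with this result.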
