Prediction Intervals


Consider a class of people, all of different ages (when measured to the nearest day). If I randomly sample 19 people, how do I predict the next (20th) person's age?

  1. Consider first all 20 people. The 20th person is equally likely to be the youngest, second youngest, third youngest, etc., third oldest, second oldest, oldest. That is, in the ordered list of all 20 people, the 20th person selected is equally likely to occupy any of the positions 1, 2, 3, . . . , 20.
  2. Examine the picture below.

             
    In this picture the first 19 people have been isolated from the remaining 20th person. If the 20th person is the youngest, then the 20th person "fits in" in the gap to the left of the smallest value. If the 20th person is second youngest, then the 20th person fits in the 2nd gap, and so on. Rephrasing the point made above: The 20th person is equally likely to fall into each of the 20 gaps formed by the first 19 people.

  3. Since there are 20 gaps, each gap carries a probability of 1/20 or 0.05 (5% if you like).
  4. The chance is 0.05 + 0.05 = 0.10 (or 10%) that person 20 falls outside the entire range of the first 19 people. The chance is 0.90 (90%) that person 20 falls inside the range. As a result, the range from the smallest to the largest of the first 19 people is a 90% prediction interval (PI) for the next (subsequent) observation. That's it! That's what a prediction interval is; specifically, a 90% PI. Note that the probabilities are obtained from the number of gaps, which is 1 greater than the number of observations. Each gap has probability 1/(n + 1), where n is the sample size.
  5. Here's data for a random sample of 19 people drawn from the class.

    6516 7565 6684 7974 7067 6648 6657 7214 7597 7088 7300 7898 7246 8546 7783 7704 6752 8064 7266

    Begin by sorting the data.

    6516 6648 6657 6684 6752 7067 7088 7214 7246 7266 7300 7565 7597 7704 7783 7898 7974 8064 8546

    The 90% PI is then (6516, 8546). We write intervals like this with the small value first. Read it: "Between 6516 and 8546." The values that define this interval are called the bounds of the interval. 6516 is the lower bound and 8546 is the upper bound. (Some people use the term endpoint in place of bound.) The percentage (here 90%) is called the confidence level or procedural reliability for the procedure.
  6. Of course, maybe you don't need to be 90% confident in your result. If we move in one observation from each end, giving up two more gaps at 5% each, we obtain an 80% PI. For the data above this 80% PI is (6648, 8064). Below you see a dotplot that marks off a number of prediction intervals. Make sure you grasp the relationship between the confidence (or reliability) level of the procedure (the %) and the width of the interval.
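
The construction in steps 5 and 6 can be sketched in a few lines of Python. This is only an illustration, not part of the original procedure: the function name `prediction_interval` and its parameter `k` (how many observations to move in from each end) are assumptions made for this example. With n observations there are n + 1 gaps, and the interval gives up 2(k + 1) of them, so the level is (n - 1 - 2k)/(n + 1).

```python
# Sample of 19 ages (in days) from the text.
ages = [6516, 7565, 6684, 7974, 7067, 6648, 6657, 7214, 7597, 7088,
        7300, 7898, 7246, 8546, 7783, 7704, 6752, 8064, 7266]


def prediction_interval(sample, k=0):
    """Return (lower, upper, level) after moving in k observations per end.

    Hypothetical helper for illustration. k=0 uses the smallest and
    largest observations; k=1 uses the second smallest and second
    largest; and so on.
    """
    ordered = sorted(sample)
    n = len(ordered)
    if k < 0 or 2 * k >= n - 1:
        raise ValueError("k leaves no interval between two observations")
    lower = ordered[k]
    upper = ordered[n - 1 - k]
    # n + 1 gaps in total; 2 * (k + 1) of them lie outside the interval.
    level = (n - 1 - 2 * k) / (n + 1)
    return lower, upper, level


for k in range(3):
    lower, upper, level = prediction_interval(ages, k)
    print(f"{level:.0%} PI: ({lower}, {upper})")
```

Each step inward lowers the level by 2/(n + 1) (10 percentage points when n = 19) and narrows the interval.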

         

Interpretation

Like almost all statistical intervals, this one can be a little tricky to interpret -- it requires some thought. For example, the 90% PI of (6516, 8546) given above is intended to predict the age of the next randomly selected person in the class. It turns out that, of the remaining students in the class, 55 of 56 (that's 0.9821 or 98.21%) have ages between 6516 and 8546. This is the conundrum: The reasoning used to develop the interval only works when talking about "random data." Once a sample is selected it becomes non-random. This may be easier to see by looking at the graph above. Before seeing any data, each gap had a probability of 5%. Now that we have data, compare the gap between the second and third smallest values (6648 and 6657 -- only 9 days apart) to that between the fifth and sixth smallest values (6752 and 7067 -- that's 315 days apart). Common sense tells us that it's far more likely that the twentieth observation will fall in the larger gap.
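
To see how uneven the gaps are once the data are in hand, the widths between neighboring sorted values can be computed directly. The snippet below is a small sketch; the variable names are chosen for this example only.

```python
# Sorted sample of 19 ages (in days) from the text.
sorted_ages = [6516, 6648, 6657, 6684, 6752, 7067, 7088, 7214, 7246, 7266,
               7300, 7565, 7597, 7704, 7783, 7898, 7974, 8064, 8546]

# Width of each interior gap: difference between neighboring values.
# The two outer gaps (below the minimum, above the maximum) have no
# finite width and are not included.
gaps = [b - a for a, b in zip(sorted_ages, sorted_ages[1:])]

print("gap between 2nd and 3rd smallest:", gaps[1], "days")
print("gap between 5th and 6th smallest:", gaps[4], "days")
```

Before the sample was drawn, each of these gaps carried the same 5% probability; after it is drawn, their widths differ widely, which is why the 5% figure cannot be attached to any one observed gap.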

The 90% refers to the average predictive success of the entire procedure. That is, if I repeated the following

  • sample 19 people at random,
  • form the 90% PI,
  • sample a twentieth person at random

then 90% of the time the twentieth person falls within the bounds of the interval. Another way of thinking about it is that if repeated over and over, on average 90% of the remaining students would fall within the prediction bounds. Not in any one case, but on average.
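
The repeated procedure can be imitated with a short simulation. This is a rough sketch under stated assumptions: the class ages below are made up (75 distinct values, a size inferred from the 19 sampled plus the 56 remaining students mentioned above), and the function name `simulate_coverage`, the seeds, and the number of trials are arbitrary choices for illustration.

```python
import random


def simulate_coverage(population, n=19, trials=20000, seed=0):
    """Estimate how often the (n+1)th draw lands inside the range of the first n.

    Hypothetical helper: each trial draws n + 1 people without
    replacement, forms the min-to-max interval from the first n, and
    checks whether the last person falls strictly inside it.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        draw = rng.sample(population, n + 1)
        sample, next_person = draw[:n], draw[n]
        if min(sample) < next_person < max(sample):
            hits += 1
    return hits / trials


# Assumption: a synthetic class of 75 people with distinct ages in days.
# These are NOT the real class data, which the text does not list in full.
population = random.Random(1).sample(range(6500, 8600), 75)

coverage = simulate_coverage(population)
print(f"estimated coverage: {coverage:.3f}")
```

Under these assumptions the estimate should settle close to 0.90, even though any single interval may cover much more or much less than 90% of the remaining students.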


Follow the path to exercises involving prediction.