# Metapoll

Results of "Final" polls -- as of 11/06/00 (the election was held the following day).

| Pollster | Bush % | Gore % | ± % | Sample Size |
|---|---|---|---|---|
| CNN | 47 | 45 | 2 | 2386 |
| ABC | 48 | 45 | 3 | 2165 |
| CBS | 47 | 42 | 3 | 1356 |
| Newsweek | 45 | 43 | 4 | 1001 |
| MSNBC | 46 | 48 | 3 | 1200 |

A "metapoll" polls polls and pools results. This document illustrates one simple methodology for doing this. (Because it's simple, it's probably not the preferred method for a precise analysis. Some comments on this conclude the document.)

### Differences have different error margins!

Before starting, be clear that what matters is not Bush's %, nor Gore's %, but the difference between the two. In a two-candidate race, or one in which the third and later candidates carry only a small percentage of the vote, the polling error for this difference is twice the polling error for either candidate alone. This makes sense: if, for example, our random sample produces a result that is 2% too high for Gore (2% over the actual figure for the population), then it has produced a result about 2% too low for Bush, so the difference between them is off by 4%. Consequently, in analysing poll results it's always good to look at the difference between the two candidates and the appropriate (doubled) sampling error.

| Pollster | Difference (Bush - Gore) % | ± Error (Diff) % |
|---|---|---|
| CNN | +2 | 4 |
| ABC | +3 | 6 |
| CBS | +5 | 6 |
| Newsweek | +2 | 8 |
| MSNBC | -2 | 6 |
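The doubling rule is easy to check mechanically. The following is a minimal Python sketch; the tuple layout and the helper name `difference_with_margin` are illustrative assumptions, while the figures are the final poll numbers quoted at the top of this document.

```python
# Final poll figures: (pollster, Bush %, Gore %, reported margin %).
POLLS = [
    ("CNN", 47, 45, 2),
    ("ABC", 48, 45, 3),
    ("CBS", 47, 42, 3),
    ("Newsweek", 45, 43, 4),
    ("MSNBC", 46, 48, 3),
]


def difference_with_margin(bush, gore, margin):
    """Return (Bush - Gore, doubled margin) for an essentially two-candidate race."""
    return bush - gore, 2 * margin


# One (pollster, difference, doubled margin) tuple per poll.
DIFFERENCES = [(name, *difference_with_margin(b, g, m)) for name, b, g, m in POLLS]

for name, diff, err in DIFFERENCES:
    print(f"{name:<9} {diff:+d} ± {err}")
```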

### Pooling polls

The simplest way of pooling polls to arrive at a metapoll is to average the results of the constituent polls. Sum first, then divide by the number of polls. The average of the differences is

$$\frac{(+2) + (+3) + (+5) + (+2) + (-2)}{5} = \frac{10}{5} = 2.0$$

The average of these poll results puts Bush in front by 2.0%.

It does not make sense to average the error margins! Doing so would give

$$\frac{4 + 6 + 6 + 8 + 6}{5} = \frac{30}{5} = 6$$

This would be akin to saying that combining the information from all five polls actually results in a margin of error that is WIDER than the margin of error for one of the constituent polls (the CNN poll, at ±4%). That's clearly absurd. This happens because it doesn't make sense to "average" things of the ± variety. Are we averaging + with +? - with -? + with -? - with +?
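As a quick check of the arithmetic, here is a small Python sketch. The variable names are illustrative; the numbers are the five differences and their doubled margins.

```python
# Differences (Bush - Gore) and doubled error margins for the five polls.
differences = [2, 3, 5, 2, -2]
margins = [4, 6, 6, 8, 6]

# Pooled estimate: sum first, then divide by the number of polls.
metapoll_difference = sum(differences) / len(differences)

# Naive (incorrect) treatment of the margins: a plain average.
naive_margin = sum(margins) / len(margins)

print(metapoll_difference)          # 2.0
print(naive_margin)                 # 6.0
print(naive_margin > min(margins))  # True: wider than the tightest single poll
```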

What's remarkable, and beautiful, is that high school geometry applies here! To find the margin of error for our metapoll we use (an extended version of) the Pythagorean Theorem: for a right triangle with legs a and b, and hypotenuse c, $a^2 + b^2 = c^2$. The key is that instead of summing error margins, we pretend that they are legs of a right triangle, and we obtain the resulting error margin from the hypotenuse.

For example, consider combining the ABC and Newsweek polls above, with margins ±6% and ±8%. When we add the differences for these two polls (3 + 2 = 5), we use right triangle theory to combine the error margins!

$$\sqrt{6^2 + 8^2} = \sqrt{36 + 64} = \sqrt{100} = 10$$

The margin of error when adding two results is found by the Pythagorean Theorem.

That is: the appropriate error margin for the combined difference of 3 + 2 = 5 is ±10. Now we divide both results (which involve totals, not averages) by two: 2.5% ± 5% (which is better than either of the polls alone).
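The two-poll calculation can be sketched in Python as follows. The function name `combine_two` is a hypothetical helper introduced for illustration; it assumes the two polls are independent, as the method requires.

```python
import math


def combine_two(diff_a, margin_a, diff_b, margin_b):
    """Average two independent poll differences.

    The margin of the sum is the hypotenuse of a right triangle whose
    legs are the two margins; both sum and margin are then halved.
    """
    total = diff_a + diff_b
    total_margin = math.hypot(margin_a, margin_b)  # sqrt(a^2 + b^2)
    return total / 2, total_margin / 2


# The two polls from the example: +3 ± 6 and +2 ± 8.
estimate, margin = combine_two(3, 6, 2, 8)
print(f"{estimate} ± {margin}")
```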

### Generalizing to multiple polls

The result is extended with a generalized Pythagorean Theorem (operating on "hypertriangles" in multidimensional spaces).

To find the error margin for our metapoll estimate of 2%, involving five polls, we treat each of the constituent error margins as a leg of a triangle and, to find the error margin for the sum of the polls' estimates, we use this extended Pythagorean Theorem:

$$c^2 = a_1^2 + a_2^2 + \cdots + a_n^2$$

For the sum of the differences (before averaging them) we have

$$c = \sqrt{4^2 + 6^2 + 6^2 + 8^2 + 6^2} = \sqrt{188} \approx 13.71$$

Then divide by 5 (the number of polls) to get the metapoll error margin of 13.71/5 = 2.74.

Our metapoll result has Bush ("W") ahead by an estimated 2.00% with an error margin of ±2.74%.
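The whole procedure fits in one short function. This is a sketch under the document's own assumptions (independent polls, simple averaging); the name `metapoll` is illustrative.

```python
import math


def metapoll(differences, margins):
    """Simple-average metapoll of independent polls.

    Returns (mean difference, margin of the mean). The margin of the sum
    is the square root of the sum of squared margins; dividing by the
    number of polls gives the margin of the average.
    """
    n = len(differences)
    if n == 0 or n != len(margins):
        raise ValueError("need one margin per difference, and at least one poll")
    mean_difference = sum(differences) / n
    sum_margin = math.sqrt(sum(m * m for m in margins))
    return mean_difference, sum_margin / n


estimate, margin = metapoll([2, 3, 5, 2, -2], [4, 6, 6, 8, 6])
print(f"{estimate:.2f} ± {margin:.2f}")  # 2.00 ± 2.74
```

Note that the margin shrinks as polls are added only because the errors are assumed independent; polls that share a systematic bias would not combine this way.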

## Summary

You've learned two things:

1. When considering the difference between results for two candidates in an essentially two candidate race, double the reported error margin.
2. Use the Pythagorean Theorem to combine margins of error from separate (independent) polls.

What you haven't learned is that

1. Media polls are conventionally published at a confidence level of 95%. Essentially, when media polls are properly done, there's only a 1 in 20 chance that the reported interval fails to cover the true result (the result for the entire population). Or... 95% of all polls are good ones -- 5% are "misleading."
2. The published error margin accounts only for errors due to random sampling. Nonresponse, response bias, wording of questions, and so forth, may introduce bias (a systematic tendency of a poll to miss the true target result) into results. Most biases are impossible to quantify (until after the fact). Consequently, all election polls should be taken with a grain of salt. Note that current response rates to telephone surveys run at around 15%. Those who will respond may well have different preferences than those who won't -- and this cannot be quantified (largely because, Catch-22, the nonresponders won't respond!). It's well known that the wealthy, those with less than 8th grade education, non-English speakers, and the elderly, elude most pollsters. These subgroups have substantially different political views than the population as a whole. Consequently, most polls are flawed.
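To make the link between the 95% level and random sampling concrete, the sketch below computes the textbook sampling margin for a single candidate's share. It is not taken from the pollsters' own methodology: it assumes a simple random sample, the worst case of a 50% share, and the normal critical value 1.96. The function name `sampling_margin` is illustrative.

```python
import math

Z_95 = 1.96  # normal critical value for a two-sided 95% interval


def sampling_margin(sample_size, share=0.5):
    """Approximate 95% margin of error, in percentage points, for one share.

    Assumes a simple random sample; share=0.5 is the worst case.
    """
    return 100 * Z_95 * math.sqrt(share * (1 - share) / sample_size)


SAMPLE_SIZES = [
    ("CNN", 2386),
    ("ABC", 2165),
    ("CBS", 1356),
    ("Newsweek", 1001),
    ("MSNBC", 1200),
]

for name, n in SAMPLE_SIZES:
    print(f"{name:<9} n={n:<5} ±{sampling_margin(n):.1f}")
```

For the CNN sample of 2386 this gives about ±2.0, in line with the published figure. Published margins need not match the formula exactly, since pollsters may round up or adjust for their sampling design, and none of this captures the nonsampling biases described above.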

### Remarks

Another statistical procedure, significance testing, can be used to answer the following question: If the two candidates are in a true exact tie (for the population), how likely is a difference as large as the observed poll difference (of +2.0% favoring W)? The answer to this question is 15.3% (a link to the appropriate document supporting this calculation is supplied below). That is: if the race really is exactly tied, then a metapoll result of 2.0% (or more) in favor of either candidate has roughly a 2 in 13 chance of occurring. So, our actual result is "sort of" unlikely for a truly tied race -- leading to the inference that the race is not tied (that W's perceived lead is somewhat statistically significant, though 15.3% falls well short of the conventional 5% threshold). (Of course no race is ever exactly tied in a large nationwide election. The 15.3% merely says "The observed poll results are somewhat unlikely if the race is truly tied.")
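One way to arrive at a figure of this size is a two-sided normal test, sketched below. This is an assumption about the calculation, not a reproduction of the linked document: it treats the ±2.74% metapoll margin as a 95% interval (1.96 standard errors) and uses the normal approximation. The name `two_sided_p_value` is illustrative.

```python
import math


def two_sided_p_value(difference, margin_95):
    """Chance of an absolute difference at least this large if the race is tied.

    Assumes margin_95 is 1.96 standard errors and the estimate is
    approximately normally distributed.
    """
    standard_error = margin_95 / 1.96
    z = abs(difference) / standard_error
    # Two-sided normal tail area: 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


p = two_sided_p_value(2.0, 2.74)
print(f"{100 * p:.1f}%")
```

Under these assumptions the result comes out at roughly 15%, consistent with the figure quoted above; small differences can arise from rounding of the margin.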

Also, averaging these 5 polls is probably not preferred: They have different sample sizes and use different plans to obtain "random" samples. Knowing the "best" way to pool the results (which would likely result in a different error margin) would require substantially more information on each of the constituent polls. This method is the "quick-and-dirty" way to pool by simple averaging.
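As one illustration of how a different pooling rule changes the answer, the sketch below weights each poll by its sample size. This is a swapped-in technique for comparison only, not the method used in this document and not necessarily the "best" one: it still assumes independent polls and ignores differences in sampling plans. The name `weighted_metapoll` is illustrative.

```python
import math


def weighted_metapoll(differences, margins, sample_sizes):
    """Pool independent polls with weights proportional to sample size.

    The margin of a weighted sum of independent estimates is the square
    root of the sum of squared (weight * margin) terms.
    """
    total = sum(sample_sizes)
    weights = [n / total for n in sample_sizes]
    estimate = sum(w * d for w, d in zip(weights, differences))
    margin = math.sqrt(sum((w * m) ** 2 for w, m in zip(weights, margins)))
    return estimate, margin


estimate, margin = weighted_metapoll(
    differences=[2, 3, 5, 2, -2],
    margins=[4, 6, 6, 8, 6],
    sample_sizes=[2386, 2165, 1356, 1001, 1200],
)
print(f"{estimate:.2f} ± {margin:.2f}")
```

With these inputs the weighted estimate and margin land close to, but not exactly on, the simple-average result, which is the point of the remark: the pooled error margin depends on how the polls are combined.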

### Resources

Many of these issues are addressed in The Myth of the Volatile Voter.

I've written a more technical treatment of the issue of Statistical Ties; this, interestingly enough, brings forth the issue of how to handle this doubling of the margin of error -- even when there are more than two candidates with a sizeable fraction of the vote.