Chapter 2: Observations
[last revised 20 September 2005]
Observation -- the direct use of our senses to determine some properties of some object in the world -- is our most reliable source of claims and of evidence for claims. Someone tells us "It is raining here now." The best way for us to determine whether we should believe the claim that this sentence expresses is to look outside and check. Furthermore, if someone tells us that all crows are black, the fact that we have seen some crows and all those crows that we saw were black provides some evidence that this is true. Finally, if someone tells us that pearls are rare, our observation that few of the oysters we have eaten had pearls is some evidence that this is likely true.
But observation is not an infallible guide. There are subtle human predispositions in interpreting observations that can result in errors of judgment, leading us to believe claims that we should not believe. Worse, these faults in our reason can be, and frequently are, exploited by those interested in manipulating us into believing false claims. Also, there are some clear mathematical principles that we can use to determine whether some kinds of generalizations from observations are warranted. Our goal in this chapter is to identify these biases and principles.
When we have learned to identify these, we will have prepared ourselves to reduce the likelihood that we will believe false claims based on flawed observations. That is, although observations are likely our most reliable source of claims, we know that our use of observations can lead to mistakes; knowing the limitations and occasional failures of observations will help us avoid those mistakes.
There are at least four kinds of observations.
We can observe a particular thing before us at this moment. We can have such direct evidence, for example, that this page is white. Such an observation seems a direct expression of what our senses tell us right when we are making the claim, and thus our experience at this moment provides the best possible reason to believe the claim. Indeed, it may be difficult to even doubt a claim that correctly describes what we immediately perceive. We can call such an observation "occurrent," because it is occurring right now.
We also depend frequently on our memory of particular observations. Let us suppose you have a gray car. If someone asks you what color your car is, and you are not looking at the car as you speak, you can still confidently answer, "It is gray." You have made many observations of the color of your particular car. You remember this clearly. It is an excellent and reliable help when you try to find your car in a parking lot. In such a case, you use past particular observations as (very strong) evidence for your belief that your car is gray. This is an example of what we can call a remembered particular observation.
Particular observations are essential tools for navigating the world and for corroborating some of our beliefs. However, we would be crippled if we were limited to claims about particulars alone. As a practical necessity, we also sometimes need to make general claims based on observations. Such claims include: all rattlesnakes are dangerous; all pots of boiling water are very hot; any glass of clean cool water can quench my thirst. These are useful things to know, and might even save your life. But note that they are claims about all rattlesnakes, all pots of boiling water, all glasses of clean cool water. We cannot observe the properties and effects of all rattlesnakes, all pots of boiling water, and all glasses of clean cool water. Instead, what we must do is derive a general claim (a claim about all or many things of a kind) from some observations of some things of those kinds.
Thus, we may also conclude on the basis of a number of particular observations of rattlesnakes that all rattlesnakes are dangerous. But of course we did not observe all rattlesnakes, but rather only some. We generalize from some observations (the 10 rattlesnakes you have seen) to all potential observations of that kind (all the rattlesnakes you could see). This is a general claim, and implicitly then a claim about any future particular observations, inferred from some particular observations.
Finally, it is reasonable to use generalizations to draw conclusions about probability. If 1% of the oysters you have opened have pearls, you might conclude of the closed oyster before you that there is a 1% chance that there is a pearl in it. We do not observe probability, but rather we observe a frequency of (past) occurrences. If we must make some claims about overall frequency, and thus about probabilities, we can do so taking past observations and generalizations as evidence. This is a kind of generalization from observation.
This leaves us with the following, possibly incomplete, list of kinds of observation-based claims: particular occurrent observations, particular remembered observations, generalizations from observations, and generalizations about probabilities.
Particular observations are observations of a single event or individual thing. Occurrent observations are observations made right here, right now. Particular occurrent observations provide our best evidence for the relevant claims.
We know from experience that our particular occurrent observations can sometimes be wrong when the conditions of observation are poor. If you see a dim shape on a hill while it is foggy or dark, you may take this as evidence that there is a horse on the hill. But if doubt lingers, one excellent check on a particular occurrent observation is to get corroboration from other observers. You could ask your friend, "Is that a horse?" If she agrees, you have two particular occurrent observations where before you had only one. Being able to repeat an observation, and to have other people make the observation, is a strong form of additional evidence.
But we sometimes make mistakes in particular occurrent observations that are not the result of poor conditions of observation, but rather of biases that appear to be very common in human beings. Being aware of these biases can help you try to compensate for them, or at least inspire skepticism where skepticism is merited. There is one such bias that is most important: the effect of our expectations upon our observations.
Our observations can be influenced by our expectations. It appears that an observer who expects to see something is more likely to see that very thing. An important early set of experiments documenting this effect was conducted by Robert Rosenthal and others. In one experiment, Rosenthal asked subjects to rate photographs of people on a scale of -10 to +10, where -10 meant they appeared to be failures and +10 meant that they appeared to be successes. One group of subjects was told that their average should be about +5, while the other group was told their average should be about -5. Both groups were shown the same set of photographs. But the first group, told to expect more successful-looking people, averaged +0.4 in their rankings, and the second group, told to expect less successful-looking people, averaged -0.08 (Rosenthal and Fode 1963). It appears that subjects judged the photographs based not only on what they saw but also on what they were told to expect.
A more dramatic case tested the effect of expectations on teachers. A group of elementary school children were given a test that their teachers were told would indicate the students' "blooming." The test did not do this; it was in fact an intelligence test. Afterwards, a group of the children was randomly chosen by Rosenthal, who told the teachers that the tests showed these students were going to show significant academic achievement. Some of these students showed little immediate academic promise, and so it was natural to suppose that they were going to be late bloomers. Year-end test results showed that the designated students did in fact score higher on IQ tests at the end of the year (Rosenthal and Jacobson 1968). That is, they acted just like late bloomers. What these experiments demonstrate is that expectations can shape our behavior in ways that may in turn directly affect our observations, or indirectly affect them by shaping our preparation for our observations.
These kinds of results have been replicated numerous times, demonstrating that they are genuine natural phenomena. Whenever we make some observation and have an expectation about the result, there is a potential that our observation will be influenced by our expectation. Such influences are likely to be strongest for observational judgments that require significant interpretation. Consider the test above, in which one must decide which individuals in photographs look successful. This is an observation requiring a great deal of interpretation. It is quite different from, say, determining whether a light bulb had gone on or whether a liquid had caught fire. Judgments that require a great deal of interpretation are also a notorious problem in animal behavior studies, where researchers may be asked to decide whether a non-human animal is, say, grooming itself or scratching. If your research hypothesis says the animal will groom under these conditions, the influence of bias may be formidable.
Particular remembered observations are observations that we made in the past and now rely upon our memory to draw upon. Claims based on remembered observations can fall prey to at least two kinds of problems: more emotionally salient observations are more likely to be remembered, and memory is unreliable.
We attend to things in our environment that are emotionally salient. A loud noise will be noticed, whereas a common quiet one will not. A fast-moving car catches our attention, whereas a slowly walking person may not. A car accident will turn our head, but a tree will not. There is now clear scientific evidence not only that we are more likely to form memories of such events, but also that we are more likely to recall them. The mechanisms of this memory-formation process are also coming to be understood.
This disposition to attend to and remember the more emotionally-salient events in our past can result in a bias. We should expect to be less reliable when evaluating claims based on recalled particular observations in cases that, for example, require us to remember whether something rather uninteresting happened. Given that we must judge some past situations based upon our particular remembered observations, there could be a tendency to judge those situations with some neglect of uninteresting data and a special focus upon emotionally-salient data. Your recollection of the one hockey game you attended, for example, may be solely of the fight there. If asked about some less interesting feature of the game, such as the skating of some particular player, you may be unable to recall the relevant observations accurately.
More important, however, is the simple fact that memory can be unreliable. There is now a large body of evidence that people's memories can be inaccurate, and can change over time. This is true even of emotionally-salient memories. A simple and dramatic illustration of this was provided experimentally by Neisser and Harsch (1992), who were interested in "flashbulb memories": the idea that we form very vivid memories of specific, very important events. Examples include the common idea that we all remember where we were when Kennedy was shot or when the Twin Towers fell. Neisser and Harsch looked at the space shuttle Challenger explosion. Their test was very simple. They asked students on the day after the explosion to briefly describe where they were when they learned about the disaster. Then, many years later, they asked some of the same former students the same question and compared the answers. What they found is that very commonly the subjects gave different accounts after several years had passed, even though the "memories" they now reported were sometimes very vivid. Assuming that the first report is the more accurate one, this suggests that even our most vivid memories can become inaccurate over time.
Generally, we must recognize that particular remembered observations are strong but fallible evidence for a claim.
It would be helpful if we could somehow control for the bias that arises from expectation in our particular observations. A common scientific method seeks to do just this, with what are called "blind" experiment preparations. In such a preparation, you control for the expectation bias of researchers by not telling the observer which observations have been subjected to the potential cause you are studying, and which have not. Thus, the bias that the observer may have that the cause will have certain expected effects cannot in any direct way shape their observations, because they don't know which observations are relevant.
For example, suppose that we wanted to test the effects of substance X on some rats. We are testing the hypothesis that substance X is a stimulant that will make rats more active. The potential bias here is that if the observing researcher expects X to make rats more active, she may be inclined to see this in their behavior. To control for this, we first separate two researchers, A and B. Researcher A keeps track of the rats and feeds them. Researcher A divides the rats (not physically, but in records) into two randomly chosen groups: one group gets substance X in their food, and the other does not. Only researcher A knows which rats have received which. Researcher A ideally will not talk with researcher B, who is doing the measurements. Next, lacking this knowledge, researcher B is asked to observe each rat in turn and judge its rate of activity, perhaps in terms of such measurable actions as circumnavigations of the cage, turns on a wheel, or minutes of motion before resting. Researcher B observes all the rats, those that received substance X and those that did not, without knowing which are which. Thus, her expectation that some will be more active and others not cannot in any obvious way influence any particular observation, since she will not know of which rat to expect which behaviors.
Here expectation bias cannot directly affect observations because the observer (researcher B) does not know which of the things she is observing have the cause about which she has some expectations. All scientific experiments should ideally be done under such a blind condition when they require any kind of judgment or interpretation of the data. One should therefore be wary of claims to have shown some effect in a study until one discerns whether the experiment was blind in this sense.
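The record-keeping side of such a blind preparation can be sketched in a few lines of code. This is only an illustrative sketch (the function name, rat identifiers, and group sizes are our own invention, not part of any actual laboratory protocol): researcher A would run something like this and keep the resulting mapping to herself.

```python
import random

def blind_assignment(rat_ids, seed=None):
    """Randomly split subjects into a treatment group (substance X in
    the food) and a control group. Only researcher A sees this mapping;
    researcher B observes every rat without knowing its group."""
    rng = random.Random(seed)
    shuffled = list(rat_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": sorted(shuffled[:half]),
            "control": sorted(shuffled[half:])}

groups = blind_assignment(range(1, 21), seed=42)
print(len(groups["treatment"]), len(groups["control"]))  # 10 10
```

Because every rat has an equal chance of landing in either group, the assignment itself cannot smuggle in the observer's expectations.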
Often we need to make general claims based on a finite number of observations. This kind of reasoning -- generalizing from some sample (subset of a whole group) of observations to conclude something about an entire population (that is, the whole group) of events or things -- is sometimes called inductive reasoning.
When we generalize from particular to general in this way, we need to ensure, to the best of our ability, that our sample observations are representative of the population. By "representative," we mean that the sample is like the population in the regard that we hope to measure. If we are interested in the height of people, we want our sample of people to be roughly as tall as the population of all people. If we are interested in the density of quartz, we would like to get quartz samples that are like all potential quartz samples in terms of their density.
But if we hope to discover, say, the average height of people, how can we know before we start whether our sample is representative? We cannot, obviously, and so we should aim to ensure that we have not allowed any bias to come into selection of our sample. Our best method to do this is to gather our sample in a way that ensures it is random. By random, we mean that any possible measurement of the relevant kind had an equal chance of being in our sample. If that is so, we have ensured that there is no bias in how we chose our sample. Thus, for example, a random sample of people in our height study would help ensure that we are very unlikely to have chosen a group of unusually tall or unusually short people for our measurements. We are unlikely to get these people since by definition they are more rare, and so if each person had an equal likelihood of being in our sample, such unusual people would be far less common and so less likely to end up in the sample.
It is very difficult to get a random sample. The source of our sample is often influenced by subtle biases. Consider phone polls, used to judge public opinion. These polls sample people with telephones, perhaps people with listed telephone numbers, and -- probably far more important -- people who are willing to take the time to answer questions from strangers. This may introduce a bias of socioeconomic condition (leaving out those without telephones, for example) or perhaps a bias toward certain views (getting responses only from those with a gripe who are eager to answer questions). Such potential biases may be insignificant, but the point is that it can be difficult to know whether this is so.
Many kinds of populations have natural variation. That is, there is some difference within the population in terms of the feature we aim to measure. For example, among people there are significant differences in height. This means that a small sample will be susceptible to inaccuracy because of the potential that the natural variation occurring in the sample will differ substantially from the mean of the population. For this reason, in general, a larger sample is better. This is sometimes referred to as "the law of large numbers": the larger the sample, the more representative of the population it will be. Alternatively, we can say that the odds that the sample measurement is inaccurate with regard to the population go down as the size of the sample goes up.
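We can illustrate the law of large numbers with a short simulation. This is only a sketch (the function name and the trial counts are our own choices): each observation is a coin-flip-like draw with a known population proportion, and we compare how far small and large samples tend to stray from that true value.

```python
import random

def mean_absolute_error(p, sample_size, trials, rng):
    """Average distance between the sample proportion and the true
    population proportion p, over many repeated samples."""
    total = 0.0
    for _ in range(trials):
        hits = sum(rng.random() < p for _ in range(sample_size))
        total += abs(hits / sample_size - p)
    return total / trials

rng = random.Random(0)
small_sample_error = mean_absolute_error(0.5, 15, 2000, rng)
large_sample_error = mean_absolute_error(0.5, 150, 2000, rng)
print(small_sample_error > large_sample_error)  # True: big samples stray less
```

Running this shows that samples of 150 stay much closer, on average, to the true proportion than samples of 15 do.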
The size of a sufficiently large sample is a function of how accurate we hope to be, the size of the overall population, and the expected variability of the population. That calculation is beyond our goals here, but in general we should at least be aware that the larger the sample size, the more accurate our generalization is likely to be. Statistics also provides a measure of the degree to which a generalization is likely to be in error. This is called the margin of error, and it is a function of the size of the sample and the variation in the population. When you see a generalization from a sample, you should always ask to see the margin of error also. This will give you a sense of the potential range of likely error. For example, if a study found that 60% of Americans approve of Senator Jones, with a margin of error of +/- 3%, then most likely the true proportion of the population approving of Jones falls somewhere between 57% and 63%.
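For a simple proportion, a commonly used approximation of the margin of error at roughly 95% confidence is 1.96 times the square root of p(1-p)/n. The sketch below (the function name is ours, and we are assuming the poll in the example surveyed about 1,000 people at the 95% confidence level) shows why a poll like the Senator Jones example reports roughly plus or minus 3%:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion p from a
    sample of size n, at roughly 95% confidence (z = 1.96)."""
    return z * math.sqrt(p * (1 - p) / n)

# A hypothetical poll of 1,000 respondents finding 60% approval:
moe = margin_of_error(0.60, 1000)
print(round(moe * 100, 1))  # 3.0 -- i.e., about +/- 3 percentage points
```

Note how the margin shrinks as n grows: quadrupling the sample roughly halves the margin of error, another face of the law of large numbers.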
Generalizations from sample observations fall prey to a number of common mistakes that make us more likely to make errors in our judgments and believe claims that do not deserve our belief. Four such mistakes are particularly important to guard against: neglecting the importance of sample size, over-valuing anecdotes, failing to recognize when a sample is not random, and resisting revision of first generalizations.
People are frequently inclined to forget the law of large numbers, and the resulting principle that a larger sample size means a sample that is more representative of the population as a whole. A simple test by Daniel Kahneman and Amos Tversky illustrates this well. They presented a simple problem to individuals. There are two hospitals: one has about 45 babies born each day, and the other has about 15 babies born each day. Assume that among all births, 50% are male and 50% female. Over a period of a year, each hospital made a note of those days when more than 60% of the babies born that day were male. Which hospital noted more such days? Interestingly, most people answer that both hospitals noted about the same number of such days (Tversky and Kahneman 1982). But we explicitly said that the population average is 50% male. Thus, a larger sample is more likely to be closer to this mean. The smaller hospital, where each day's births form a smaller sample, is more likely to have observations far from the mean. To overlook this is to overlook one important feature of larger sample size: the larger the sample, the more likely its mean is to be near the population mean.
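A short simulation makes the hospital result concrete. This is a sketch (the function name and random seed are our own choices, and we simulate ten years rather than one to smooth out noise): it counts, for each hospital, the days on which more than 60% of births were male.

```python
import random

def days_over_60_percent_male(births_per_day, days, rng):
    """Count days on which more than 60% of births were male, where
    each birth is independently male with probability 0.5."""
    count = 0
    for _ in range(days):
        males = sum(rng.random() < 0.5 for _ in range(births_per_day))
        if males / births_per_day > 0.60:
            count += 1
    return count

rng = random.Random(1)
days = 3650  # ten simulated years
small_hospital = days_over_60_percent_male(15, days, rng)
large_hospital = days_over_60_percent_male(45, days, rng)
# The small hospital records far more such days than the large one.
print(small_hospital, large_hospital)
```

The small hospital's daily tally, being a smaller sample, strays above 60% male far more often than the large hospital's does.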
Related to forgetting that larger samples are better (and smaller samples more inaccurate) is the fact that we strongly overvalue our own small samples or anecdotes. We often overvalue our own experience, and generalize from the observations that we have made even though our observations are not likely to be random. Thus, a person who is employed and who spends all her time with people who are employed might be inclined to come to the general conclusion that unemployment is low. But her sample might be biased by the fact that she knows people primarily through work. A related bias arises because we are inclined to highly value vivid anecdotes. In general, a dry and uninteresting description of a finding from a large sample will not influence people's beliefs as strongly as a vivid and emotionally-salient individual anecdote does. An example experiment was done by Hamill and colleagues. One set of subjects was told a number of dry statistics about welfare, including that only 10% of recipients remained on welfare for four years or longer, and that the median stay on welfare for middle-aged recipients was two years. Another set of subjects was told this data but also a vivid story about a woman on welfare who was Puerto Rican, obese, had a succession of lovers in the house, had many children by these different men, and lived in a filthy apartment infested with cockroaches. Hamill and colleagues found that the dry statistics had no effect on subjects' perception of welfare, but those who heard the vivid story were inclined to develop a harsher and more negative view of welfare (Hamill, Wilson & Nisbett 1979).
The point here is not that welfare is a benefit or a harm. Rather, the dry statistics, which as generalizations from large random samples should (if accurate) have had a substantial influence on people's perceptions, appeared to have none. The vivid anecdote, however, which, because it was a sample of 1, should have had no significant influence on people's perception of welfare recipients, had a very strong effect.
This is a powerful and pernicious effect, and one that is frequently exploited by individuals seeking to manipulate us. Being aware of this effect and seeking to counter it will be a substantial benefit to our reasoning.
Related to the power of the anecdote is the failure of people to recognize when a sample is not random. This applies also to the case of the anecdote: the anecdote drawn from experience is a problem not only because it is a small sample, but also because one's own experience will often have biases resulting from one's social standing, income, social and economic background, and so on. But people often also simply fail to recognize when samples are clearly biased. Consider the following experiment by Ross, Amabile, and Steinmetz (1977). Two subjects play a kind of general-knowledge question game. One subject is the questioner, the other the answerer or contestant. The questioner is asked to prepare 10 difficult but not impossible general-knowledge questions. She then asks these one at a time of the answerer, waits for an answer, and then says "correct," or says "incorrect" and provides the correct answer. After the questioning game, the questioner, the answerer, and some observers were all asked to rank the general-knowledge facility of the two game players on a scale from 0 to 100, with 50 as average. Questioners, answerers, and observers all tended to rate the general knowledge of the questioner substantially higher than that of the answerer. Observers rated the questioner as high as 80 while rating the answerer at less than 50.
What is striking is that there could not be a more blatant example of a non-random sample than the one used by the observers to judge the general knowledge of the contestants. Suppose we think of general knowledge as a population of general claims. One's facility with general knowledge would then be the ability to answer questions about such claims. We would best measure this by picking questions at random and asking them of the person being measured. But in this experiment, the questioner picked the questions. There is no evidence whatsoever that the questioner's pick of questions is representative of the questioner's overall familiarity with general claims. In fact, we might naturally presume that they are sure to be the very questions with which the questioner is most familiar. We can conclude nothing accurate about the questioner's general knowledge from such a sample.
Finally, there is substantial evidence that we resist revising our first generalizations. We tend to take only incremental steps away from them. For example, suppose that you meet one German and she is very tall. You conclude, based on a too-small and likely biased sample, that Germans are tall. Later, you are exposed to a random sample of 200 Germans. They are all about the average height of Americans. Being American, you conclude that this group of Germans is of average height. Now, given that your first sample was small and likely biased, it should carry little or no weight compared to the later finding. However, many are inclined to revise their initial generalization only incrementally, and so conclude something like: Germans are of average height or taller. The evidence does not support such a claim, and we should drop the earlier generalization in favor of the more substantially supported one.
One form of generalization from observations is the derivation of a probability from past events. In most cases this is a simple derivation: if we observe that n% of our sample has some property P, then assuming our sample is random and sufficiently large, we assume that observations in the future will have property P n% of the time.
For example, suppose we observe in a random sample of oysters that 1% of them have a pearl. For any given oyster that we may later encounter, we conclude there is about a 1% chance that there is a pearl in it. This is like the case of basic generalization because our sample corroborates the conclusion that 1% of the population of oysters have a pearl, and now if we are to take at random any oyster from that population, we have about a 1% chance of getting one of the ones with a pearl.
Three problems commonly occur for this kind of generalization; these echo concerns about other forms of generalizations from observations discussed above.
First, just as we overvalue our own experience and overvalue anecdotes, individuals tend to confuse the availability of something to them with its overall frequency. That is, if something is readily available to them, they may assume that it is more frequent than it actually is. This is akin to generalizing from one's own non-random sample. For example, even as crime rates have fallen, crime on local news has become more and more popular as a kind of entertainment news. If someone concludes from this that crime is on the rise, their perception is biased: reports about crime are increasingly available not necessarily because crime has increased but because interest in crime has.
Second, vividness and emotional salience can influence our perceptions of probability as well. People tend to notice and to recall more vivid experiences. This can lead people to overestimate the probability of some event of which they have a small but very vivid sample. This may explain why some perceive planes as very dangerous and cars as not very dangerous. There is something quite spectacular about a plane crash, with smoking ruins and often a large number of people dead. Such an event is more interesting, more salient, more emotionally affecting than statistics about car crashes. And yet, getting into one's car is on average far more dangerous than getting on a plane.
Third, a special mistake that concerns probabilistic reasoning alone is the gambler's fallacy. This common mistake arises when someone confuses the probability of two or more independent outcomes, taken as a whole, with the probability of a single outcome after some other outcomes of that kind have preceded it.
An example will make this clear. Suppose that we flip a coin. It is a balanced coin, so there is a 50% chance it will land heads and a 50% chance it will land tails. Each flip of the coin is an independent event, so each has a 50% chance of landing one way or the other. What are the odds that four flips will produce four heads in a row? Probability theory tells us that the probability of a series of independent events is the product of their individual probabilities. So the odds are .5 x .5 x .5 x .5, or .0625. This is a little more than 6%. Now, suppose that we flip the coin three times, and it comes up heads all three times. It is very unlikely to flip four heads in a row, the gambler reasons, so the next flip will be tails -- indeed, it is about 94% likely to come up tails. But this reasoning is flawed. It confuses the probability of a sequence with the probability of a single toss of the coin. Each toss has a 50% chance of being heads. The next toss, in this case, has a 50% chance of coming up heads again.
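Because four fair-coin flips have only sixteen equally likely outcomes, we can check both probabilities by brute-force enumeration. The sketch below computes the probability of four heads from scratch, and then the probability of a fourth head given that three heads have already been flipped:

```python
from itertools import product

# All 16 equally likely sequences of four fair-coin flips.
sequences = list(product("HT", repeat=4))

# Probability of four heads in a row, computed from scratch:
p_four_heads = sum(seq == ("H", "H", "H", "H") for seq in sequences) / len(sequences)
print(p_four_heads)  # 0.0625

# Probability the fourth flip is heads GIVEN three heads already:
after_three_heads = [seq for seq in sequences if seq[:3] == ("H", "H", "H")]
p_next_heads = sum(seq[3] == "H" for seq in after_three_heads) / len(after_three_heads)
print(p_next_heads)  # 0.5
```

The enumeration shows exactly where the gambler goes wrong: once three heads have occurred, only two sequences remain possible, and heads and tails each account for one of them.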
Here is another example. Above, we assumed that 1% of oysters had pearls. This is an independent case (that is, finding an oyster does not make me more or less likely to find another oyster). Consider the following two cases:
1. Jones has two oysters, and wants to know what the odds are of finding a pearl in each.
2. Jones has two oysters, opens one and finds a pearl inside, and now wants to know what the odds are of finding a pearl in the second oyster.
The odds of finding a pearl in each of two oysters, if 1% of oysters were to have pearls, would be 0.01%. However, the odds of finding a pearl in an oyster, after we open another oyster and find a pearl, is 1%.
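The arithmetic for the two cases is simply the product rule for independent events, as the following sketch (assuming the 1% pearl rate from the example above) shows:

```python
p_pearl = 0.01  # assumed rate of pearls in oysters, from the example above

# Case 1: the chance that BOTH of two independent oysters hold a pearl
# is the product of the individual probabilities.
p_both = p_pearl * p_pearl  # about 0.0001, i.e., 0.01%

# Case 2: a pearl already found in the first oyster changes nothing
# about the second, independent oyster.
p_second = p_pearl  # still 0.01, i.e., 1%

print(p_both, p_second)
```

The opened oyster's pearl tells us nothing about the unopened one, which is exactly why the gambler's "due for a win" reasoning fails.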
The gambler's fallacy leads many a gambler to make bad decisions, and lines the pockets of many casino owners. Gamblers often continue gambling with the erroneous belief that a losing streak means that they are due for a win. No such luck.
Not only can someone sway public opinion by reiterating an irrelevant anecdote; this is also the principle that underlies the power of vivid lies. In the lead-up to the first Gulf War, a young woman appeared weeping before a House committee, describing how babies were turned out of their incubators in Kuwaiti hospitals and left lying on the floor. The young woman was later found to be the daughter of a Kuwaiti official; no evidence was ever found for her claims, nor could she have witnessed the events she described. The anecdote lived on, however, even presented as true in the "based on a true story" HBO film about CNN reporters in Iraq.