The basic question here is: Is there a relationship between two categorical variables? Consider, for example, data collected on the relapse of cocaine addicts after treatment with one of three different drugs.
Here's an appropriate plot.
We see that 41.7% of the people using desipramine relapsed, 75.0% of those using lithium relapsed and 83.3% of those taking the placebo relapsed. Certainly it appears that desipramine is preferred to the other two -- additionally, lithium worked a little better than the placebo. However, there are bound to be some differences in any case, due to chance alone. Remember, patients were randomly allocated to the three treatment groups -- by chance alone the desipramine group could have included a number of addicts who, for whatever reasons, were less likely to relapse. So, we would like to decide whether the relationship between treatment and relapse status is statistically significant, in the sense that it is too strong to plausibly happen just by chance if all treatments are equally effective. "Not likely to happen just by chance if H0 is true" is the usual meaning of statistical significance.
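The percentages quoted above can be checked with a short Python sketch. The counts come from the cross-classification table in the Minitab report for this example; pairing the rows with desipramine, lithium and placebo, and the first column with "relapsed," is an inference from the quoted percentages, and the names `observed` and `relapse_percentages` are made up for illustration.

```python
# Observed counts per treatment: (relapsed, did not relapse).
# The row/column labels are inferred from the percentages in the text.
observed = {
    "desipramine": (10, 14),
    "lithium": (18, 6),
    "placebo": (20, 4),
}


def relapse_percentages(counts):
    """Return the percentage of each treatment group that relapsed."""
    return {
        treatment: 100.0 * relapsed / (relapsed + not_relapsed)
        for treatment, (relapsed, not_relapsed) in counts.items()
    }


percentages = relapse_percentages(observed)
for treatment, pct in percentages.items():
    print(f"{treatment}: {pct:.1f}% relapsed")
```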
We use the χ² test (that's "chi-square," with "ch" sounding like "k") to assess the evidence in favor of an association between the two variables. In general, the null and alternative hypotheses are
H0: "There is no association between [spell out the two categorical variables]"
HA: "There is some association between [spell out the two categorical variables]" OR (more simply) "H0 is false"
In our example, write
H0: There is no association between the drug used for treatment and whether the addict relapses
HA: H0 is false -- there is an association
(You may use "relationship" in place of association.)
Use a statistical software package to obtain the P-value for testing these hypotheses. In Minitab you enter the cross-classification table. Then choose
> Stat > Tables > Chi-Square Test select the appropriate columns then click OK
In this example you'll get the following "report."
Chi-Square Test

Expected counts are printed below observed counts

             C1       C2    Total
    1        10       14       24
          16.00     8.00

    2        18        6       24
          16.00     8.00

    3        20        4       24
          16.00     8.00

Total        48       24       72

Chi-Sq =  2.250 + 4.500 +
          0.250 + 0.500 +
          1.000 + 2.000 = 10.500
DF = 2, P-Value = 0.005
The P-value is 0.005, which means that if there were no association, differences at least as large as those observed would occur by chance alone only about 1 time in 200. At the 5% significance level (and at the 1% level) we reject the null hypothesis. There is a statistically significant association between treatment and likelihood of relapse (and clearly, desipramine is the preferred treatment).
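The "by chance alone" idea can be illustrated by simulation. The sketch below is a randomization (permutation) check, which is a different technique from the chi-square approximation Minitab uses: it shuffles the 48 relapses and 24 non-relapses among the three groups of 24 many times and records how often the shuffled table is at least as far from "no association" as the real one. The function names, the number of repetitions and the seed are all assumptions made for illustration.

```python
import random


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - exp) ** 2 / exp
    return stat


def randomization_p_value(table, reps=10_000, seed=1):
    """Estimate how often random allocation alone gives a statistic
    at least as large as the one actually observed."""
    rng = random.Random(seed)
    group_sizes = [sum(row) for row in table]
    # One label per patient: 1 = relapsed, 0 = did not relapse.
    outcomes = [1] * sum(row[0] for row in table) + [0] * sum(row[1] for row in table)
    observed_stat = chi_square_statistic(table)
    at_least_as_extreme = 0
    for _ in range(reps):
        rng.shuffle(outcomes)
        shuffled, start = [], 0
        for size in group_sizes:
            relapsed = sum(outcomes[start:start + size])
            shuffled.append([relapsed, size - relapsed])
            start += size
        if chi_square_statistic(shuffled) >= observed_stat - 1e-9:
            at_least_as_extreme += 1
    return at_least_as_extreme / reps


observed = [[10, 14], [18, 6], [20, 4]]  # rows: treatments; columns: relapsed, not
estimate = randomization_p_value(observed)
print(f"Estimated P-value: {estimate:.4f}")
```

The estimate varies a little with the seed and the number of repetitions, but it should land in the same neighborhood as the reported P-value of 0.005.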
See the "Expected counts" printed below the observed counts? Be careful -- these counts are NOT the data. I've tabled them and produced a segmented bar chart.
TABLE OF EXPECTED COUNTS

               Relapsed   Did not relapse   Total
Desipramine       16             8            24
Lithium           16             8            24
Placebo           16             8            24
Total             48            24            72
The expected counts are the counts that would have occurred if there were no association at all between the two variables. You can see that the marginal totals remain the same; it's just that, with the expected counts, changing from one treatment to another has no effect on whether or not a person relapses. The chart reflects what would be the case if the marginal distribution of relapse (2/3 relapsed, 1/3 did not) held for each treatment.
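As a minimal sketch of where those numbers come from: each expected count is (row total × column total) / grand total, which is the standard rule and reproduces the 16.00 and 8.00 in the Minitab report. The function name `expected_counts` is invented for illustration.

```python
def expected_counts(table):
    """Expected count for each cell under 'no association':
    (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts from the report: rows are treatments 1-3, columns are C1 and C2.
observed = [[10, 14], [18, 6], [20, 4]]
expected = expected_counts(observed)
for row in expected:
    print(row)
```

Every group has 24 patients, so every row gets the same expected counts: 24 × 48 / 72 = 16 and 24 × 24 / 72 = 8.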
The test statistic is the χ² statistic, which measures how far -- in aggregate over all of the cells -- the observed counts are from the expected counts.
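Written out in its standard form, the statistic adds up one term per cell. The symbols O (observed count) and E (expected count) are notation introduced here, not taken from the notes.

```latex
\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}
```

For the table in this example the six terms are 2.250, 4.500, 0.250, 0.500, 1.000 and 2.000, which add up to the 10.500 shown in the Minitab report.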
Larger values of the test statistic correspond to a greater total discrepancy between the observed (actual) counts and the expected (theoretical, assuming there is no association) counts. Larger values of the test statistic result in smaller P-values -- that is, it is less likely that chance alone would produce an association as strong as the one observed.
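The link between the size of the statistic and the P-value can be sketched in a few lines of Python. One assumption to flag: the closed form exp(-x/2) for the upper-tail area holds only when DF = 2, as in this example; other degrees of freedom need a statistics package. The function names are invented for illustration.

```python
import math


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - exp) ** 2 / exp
    return stat


def p_value_df2(statistic):
    """Upper-tail area of the chi-square distribution with 2 degrees
    of freedom. This shortcut is valid ONLY for DF = 2."""
    return math.exp(-statistic / 2.0)


observed = [[10, 14], [18, 6], [20, 4]]  # 3 rows, 2 columns: DF = (3-1)*(2-1) = 2
statistic = chi_square_statistic(observed)
p_value = p_value_df2(statistic)
print(f"Chi-Sq = {statistic:.3f}, P-Value = {p_value:.3f}")

# Larger statistics give smaller P-values.
for s in (2.0, 6.0, 10.5):
    print(f"statistic {s:5.1f} -> P-value {p_value_df2(s):.4f}")
```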
Here's another application.