Chapter 11 - Goodness-of-Fit and Contingency Tables

11-1 Goodness-of-Fit

A goodness-of-fit test is used to test the hypothesis that an observed frequency distribution fits (or conforms to) some claimed distribution
Notation for testing for goodness-of-fit:
- O: observed frequency of an outcome
- E: expected frequency of an outcome
- k: number of different categories or cells
- n: total number of trials (or total of observed sample values)
- p: probability that a sample value falls within a particular category
Requirements for testing for goodness-of-fit:
- The data have been randomly selected
- The sample data consist of frequency counts for each of the different categories
- For each category, the expected frequency is at least 5
If the expected frequencies are all equal: Calculate E = n/k
If the expected frequencies are NOT all equal: Calculate E = np for each individual category
The observed frequencies are all whole numbers because they represent actual counts, but the expected frequencies need not be whole numbers
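As a minimal sketch of the two cases above, the following Python snippet computes expected frequencies both ways. The sample size, number of categories, and claimed probabilities are hypothetical illustration values, and the function names are made up for this example.

```python
def expected_equal(n, k):
    """Expected frequency per category when all k categories are equally likely (E = n/k)."""
    return [n / k] * k


def expected_from_probs(n, probs):
    """Expected frequency per category when the claimed probabilities differ (E = n*p)."""
    return [n * p for p in probs]


# Hypothetical case 1: 100 trials spread over 6 equally likely categories
equal_E = expected_equal(100, 6)        # each E is 100/6, which is not a whole number

# Hypothetical case 2: 100 trials with claimed probabilities 0.5, 0.3, 0.2
unequal_E = expected_from_probs(100, [0.5, 0.3, 0.2])

print(equal_E)
print(unequal_E)
```

Note that the expected frequencies in the first case are fractional even though every observed count must be a whole number.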
"If the P is low, the null must go" (If the p-value is small, reject the null hypothesis that the distribution is as claimed)
The X^2 test statistic, X^2 = sum[(O - E)^2 / E], is a measure of the discrepancy between observed and expected frequencies
The theoretical distribution of X^2 is discrete because the number of possible values is finite. It can be approximated by a chi-square distribution, which is continuous. This approximation is generally considered acceptable provided that every expected frequency satisfies E >= 5
The number of degrees of freedom is k - 1, which reflects the fact that we can freely assign frequencies to k-1 categories before the frequency of the last remaining category is determined
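The pieces above (test statistic, k - 1 degrees of freedom, and a right-tail p-value) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the die-roll counts are hypothetical, and chi2_sf is a helper written for this example that evaluates the chi-square right-tail area for whole-number degrees of freedom using a standard recurrence for the incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Right-tail area P(X^2 >= x) for a chi-square distribution with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))   # base case df = 1
    else:
        a, q = 1.0, math.exp(-z)              # base case df = 2
    # Recurrence: Q(a + 1, z) = Q(a, z) + z^a * e^(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def goodness_of_fit(observed, expected):
    """Return (test statistic, degrees of freedom, p-value) for a goodness-of-fit test."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df, chi2_sf(stat, df)


# Hypothetical data: a die rolled 60 times, claimed to be fair
observed = [8, 12, 9, 11, 10, 10]
n, k = sum(observed), len(observed)
expected = [n / k] * k                 # E = n/k = 10, so the E >= 5 requirement holds

stat, df, p_value = goodness_of_fit(observed, expected)
print(stat, df, p_value)
if p_value <= 0.05:
    print("Reject the claim that the distribution is as stated")
else:
    print("Fail to reject the claim that the distribution is as stated")
```

With these made-up counts the observed frequencies sit close to the expected ones, so the statistic is small and the p-value is large.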

11-2 Contingency Tables

A contingency table (or two-way frequency table) is a table consisting of frequency counts of categorical data corresponding to two different variables (one variable used to categorize rows, the second variable used to categorize columns)
In a test of independence, we test the null hypothesis that in a contingency table, the row and column variables are independent
Notation for contingency table:
- O: observed frequency in a cell
- E: expected frequency in a cell
- r: number of rows in a contingency table
- c: number of columns in a contingency table
Requirements for contingency table:
- Sample data are randomly selected
- Sample data are represented as frequency counts in a two-way table
- For every cell in the contingency table, the expected frequency E is at least 5
Degrees of freedom = (r-1)(c-1)
Tests of independence with a contingency table are always right-tailed
The distribution of the test statistic X^2 can be approximated by the chi-square distribution
E = (row total * column total) / (grand total)
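To make the expected-frequency and degrees-of-freedom formulas concrete, here is a small Python sketch for a test of independence. The 2 x 2 table is hypothetical, the function name is made up for this example, and 3.841 is the usual table critical value for a chi-square distribution with 1 degree of freedom at a 0.05 significance level.

```python
def independence_test(table):
    """Return (expected frequencies, X^2 statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # E = (row total * column total) / (grand total) for every cell
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, stat, df


# Hypothetical 2 x 2 table of observed counts (rows: groups, columns: outcomes)
table = [[12, 28],
         [18, 22]]

expected, stat, df = independence_test(table)
print(expected)     # every expected frequency should be at least 5
print(stat, df)

CRITICAL_VALUE = 3.841   # chi-square table value for df = 1, alpha = 0.05
if stat >= CRITICAL_VALUE:
    print("Reject independence of the row and column variables")
else:
    print("Fail to reject independence of the row and column variables")
```

Because the test is right-tailed, only large values of the statistic count as evidence against independence.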
In a chi-square test of homogeneity, samples are randomly selected from different populations, and we test the claim that those populations have the same proportions of some characteristic being considered
Fisher's exact test is often used for a 2 x 2 contingency table in which one or more expected frequencies are below 5. It provides an exact p-value and does not require an approximation technique
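One way to see where the exact p-value comes from is the sketch below, which treats the top-left cell as a hypergeometric count with all row and column totals held fixed. The table is hypothetical and the function name is made up for this example. The two-sided rule used here (add up the probabilities of every table that is no more likely than the observed one) is one common convention and is an assumption of this sketch, since the notes do not say which convention is intended.

```python
from math import comb


def fisher_exact_2x2(table):
    """Two-sided exact p-value for a 2 x 2 table with fixed row and column totals."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)

    # Sum the probabilities of all tables no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-7))


# Hypothetical 2 x 2 table; every expected frequency is 2, below the usual minimum of 5
table = [[3, 1],
         [1, 3]]

p_value = fisher_exact_2x2(table)
print(p_value)
```

No chi-square approximation is involved: the p-value is a finite sum of exact probabilities.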