Chapter 11 - Goodness-of-Fit and Contingency Tables
11-1 Goodness-of-Fit
- A goodness-of-fit test is used to test the hypothesis that an observed frequency distribution fits (or conforms to) some claimed distribution
- Notation for testing for goodness-of-fit:
  - O: observed frequency of an outcome
  - E: expected frequency of an outcome
  - k: number of different categories or cells
  - n: total number of trials (or total of observed sample values)
  - p: probability that a sample value falls within a particular category
- Requirements for testing for goodness-of-fit:
  - The data have been randomly selected
  - The sample data consist of frequency counts for each of the different categories
  - For each category, the expected frequency is at least 5
- If the expected frequencies are all equal: Calculate E = n/k
- If the expected frequencies are NOT all equal: Calculate E = np for each individual category
- The observed frequencies are all whole numbers because they represent actual counts, but the expected frequencies need not be whole numbers
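The two rules for expected frequencies can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names and the sample sizes and probabilities are hypothetical, chosen only to show E = n/k and E = np side by side.

```python
def expected_equal(n, k):
    """Expected frequency E = n/k for each of k equally likely categories."""
    return [n / k] * k


def expected_from_probabilities(n, probs):
    """Expected frequency E = n*p for each category with its own probability p."""
    return [n * p for p in probs]


# Hypothetical example 1: 100 trials spread over 6 equally likely categories.
equal = expected_equal(100, 6)

# Hypothetical example 2: 80 trials with claimed probabilities 0.5, 0.3, 0.2.
unequal = expected_from_probabilities(80, [0.5, 0.3, 0.2])

# Check the requirement that every expected frequency is at least 5.
requirement_met = all(e >= 5 for e in equal + unequal)
```

In the first example each expected frequency is 100/6, which is not a whole number, even though every observed count must be.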
- "If the P is low, the null must go" (If the p-value is small, reject the null hypothesis that the distribution is as claimed)
- The X^2 test statistic, X^2 = sum((O-E)^2/E), measures the discrepancy between observed and expected frequencies. Close agreement gives a small X^2 and large discrepancies give a large X^2, so goodness-of-fit tests are right-tailed
- The theoretical distribution of sum((O-E)^2/E) is discrete because the number of possible values is finite. It can be approximated by a chi-square distribution, which is continuous. This approximation is generally considered acceptable provided that every expected frequency satisfies E >= 5
- The number of degrees of freedom is k-1, which reflects the fact that we can freely assign frequencies to k-1 categories before the frequency of the last category is determined by the total n
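A goodness-of-fit test can be sketched end to end with the Python standard library. The observed counts below are hypothetical (60 rolls of a die claimed to be fair), and the helper names are illustrative assumptions. The right-tail area uses the standard closed-form series for a chi-square distribution with a whole number of degrees of freedom; a statistics library would normally supply this value instead.

```python
import math


def chi_square_statistic(observed, expected):
    """X^2 = sum((O - E)^2 / E) over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_right_tail(x, df):
    """Right-tail area P(X^2 >= x) for a chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    root = math.sqrt(x)
    normal_tail = math.erfc(root / math.sqrt(2))        # 2 * P(Z >= sqrt(x))
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # standard normal pdf at sqrt(x)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return normal_tail + 2 * density * total


# Hypothetical data: 60 rolls of a die, testing the claim that all faces are equally likely.
observed = [8, 12, 9, 11, 6, 14]
n, k = sum(observed), len(observed)
expected = [n / k] * k                    # E = n/k = 10 for every face
statistic = chi_square_statistic(observed, expected)
df = k - 1
p_value = chi_square_right_tail(statistic, df)
```

For these made-up counts the statistic is 4.2 with 5 degrees of freedom and the p-value is roughly 0.52, so the claim of a fair die would not be rejected at the 0.05 level.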
11-2 Contingency Tables
- A contingency table (or two-way frequency table) is a table consisting of frequency counts of categorical data corresponding to two different variables (one variable used to categorize rows, the second variable used to categorize columns)
- In a test of independence, we test the null hypothesis that in a contingency table, the row and column variables are independent
- Notation for contingency table:
  - O: observed frequency in a cell
  - E: expected frequency in a cell
  - r: number of rows in a contingency table
  - c: number of columns in a contingency table
- Requirements for contingency table:
  - Sample data are randomly selected
  - Sample data are represented as frequency counts in a two-way table
  - For every cell in the contingency table, the expected frequency E is at least 5
- Degrees of freedom = (r-1)(c-1)
- Tests of independence with a contingency table are always right-tailed
- The distribution of the test statistic X^2 can be approximated by the chi-square distribution
- E = (row total * column total) / (grand total)
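The expected-frequency formula and the resulting test statistic can be sketched as follows. The 2 x 3 table is hypothetical and the function names are illustrative assumptions. The table size was chosen so that the degrees of freedom equal 2, because in that special case the right-tail area of the chi-square distribution is simply exp(-X^2/2); other table sizes would need a statistics library or a table of critical values.

```python
import math


def expected_counts(table):
    """E = (row total * column total) / (grand total) for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def independence_statistic(table):
    """Return (X^2, degrees of freedom) for a test of independence."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(table, expected)
        for o, e in zip(observed_row, expected_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 3 contingency table of observed counts.
table = [
    [20, 30, 50],
    [30, 30, 40],
]
expected = expected_counts(table)
statistic, df = independence_statistic(table)

# Special case df == 2 only: right-tail area is exp(-X^2 / 2).
p_value = math.exp(-statistic / 2)
```

Here every expected frequency is at least 5 (25, 30, and 45 in each row), the statistic is about 3.11, and the p-value is about 0.21, so independence would not be rejected at the 0.05 level for these made-up counts.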
- In a chi-square test of homogeneity, samples are randomly selected from different populations and we want to determine whether those populations have the same proportions of some characteristic being considered
- The null hypothesis in a chi-square test of homogeneity is that the different populations have the same proportions of the characteristic
- Fisher's exact test is often used for a 2 x 2 contingency table in which one or more expected frequencies are below 5. It provides an exact p-value and does not require an approximation technique
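As a rough sketch of how Fisher's exact test avoids an approximation, the p-value for a 2 x 2 table can be computed by enumerating every table with the same row and column totals and weighting each by its hypergeometric probability. The function name and the counts are hypothetical. The two-sided rule used here (add up all tables no more probable than the observed one) is one common convention and is an assumption of this sketch, since the notes do not specify one. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided exact p-value for the 2 x 2 table [[a, b], [c, d]].

    Assumed convention: sum the probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Number of ways to get x in the top-left cell with the margins fixed.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    weights = [weight(x) for x in range(lowest, highest + 1)]
    observed_weight = weight(a)
    extreme = sum(w for w in weights if w <= observed_weight)
    return extreme / sum(weights)


# Hypothetical table in which every expected frequency is 2, below the
# minimum of 5 required for the chi-square approximation.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

For this made-up table the exact p-value is 34/70, about 0.486.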