Chapter 11 - Goodness-of-Fit and Contingency Tables

11-1 Goodness-of-Fit

  • A goodness-of-fit test is used to test the hypothesis that an observed frequency distribution fits (or conforms to) some claimed distribution
  • Notation for testing for goodness-of-fit:
    • O: observed frequency of an outcome
    • E: expected frequency of an outcome
    • k: number of different categories or cells
    • n: total number of trials (or total of observed sample values)
    • p: probability that a sample value falls within a particular category
  • Requirements for testing for goodness-of-fit:
    • The data have been randomly selected
    • The sample data consist of frequency counts for each of the different categories
    • For each category, the expected frequency is at least 5
  • If the expected frequencies are all equal: Calculate E = n/k
  • If the expected frequencies are NOT all equal: Calculate E = np for each individual category
  • The observed frequencies are all whole numbers because they represent actual counts, but the expected frequencies need not be whole numbers
  • "If the P is low, the null must go" (If the p-value is small, reject the null hypothesis that the distribution is as claimed)
  • The test statistic X^2 = sum[(O-E)^2 / E] measures the discrepancy between observed and expected frequencies; close agreement gives a small X^2, and large discrepancies give a large X^2
  • The theoretical distribution of sum[(O-E)^2 / E] is discrete because the number of possible values is finite. It can be approximated by a chi-square distribution, which is continuous. This approximation is generally considered acceptable provided that every expected frequency satisfies E >= 5
  • Degrees of freedom = k-1. This reflects the fact that we can freely assign frequencies to k-1 categories; the frequency of the last category is then determined, because the frequencies must sum to n
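
The goodness-of-fit calculation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function name `goodness_of_fit` and the die-roll counts are hypothetical, chosen only to show how E, X^2, and the degrees of freedom fit together. The p-value would still come from a chi-square table or a statistics package.

```python
def goodness_of_fit(observed, probabilities=None):
    """Return (X^2, degrees of freedom, expected frequencies).

    observed      -- list of observed counts O, one per category
    probabilities -- claimed probability p for each category, or None
                     when all categories are claimed to be equally likely
    """
    k = len(observed)   # number of categories
    n = sum(observed)   # total number of trials

    if probabilities is None:
        expected = [n / k] * k                      # E = n/k
    else:
        expected = [n * p for p in probabilities]   # E = np

    # Requirement from the notes: every expected frequency is at least 5.
    if min(expected) < 5:
        raise ValueError("every expected frequency should be at least 5")

    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi_square, k - 1, expected


# Hypothetical data: 200 rolls of a die claimed to be fair.
observed = [27, 31, 42, 40, 28, 32]
chi_square, df, expected = goodness_of_fit(observed)
print(f"X^2 = {chi_square:.3f}, df = {df}")
```

Note that the expected frequency here is 200/6, which is not a whole number even though every observed count is. Passing a list of claimed probabilities as the second argument covers the case where the expected frequencies are not all equal.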

11-2 Contingency Tables

  • A contingency table (or two-way frequency table) is a table of frequency counts of categorical data corresponding to two different variables (one variable is used to categorize rows, the other to categorize columns)
  • In a test of independence, we test the null hypothesis that in a contingency table, the row and column variables are independent
  • Notation for contingency table:
    • O: observed frequency in a cell
    • E: expected frequency in a cell
    • r: number of rows in a contingency table
    • c: number of columns in a contingency table
  • Requirements for contingency table:
    • Sample data are randomly selected
    • Sample data are represented as frequency counts in a two-way table
    • For every cell in the contingency table, the expected frequency E is at least 5
  • Degrees of freedom = (r-1)(c-1)
  • Tests of independence with a contingency table are always right-tailed
  • The distribution of the test statistic X^2 can be approximated by the chi-square distribution
  • E = (row total * column total) / (grand total)
  • In a chi-square test of homogeneity, samples are randomly selected from different populations and we want to determine whether those populations have the same proportions of some characteristic being considered
  • A chi-square test of homogeneity tests the claim that different populations have the same proportions of some characteristic; it uses the same test statistic, expected frequencies, and degrees of freedom as the test of independence
  • Fisher's exact test is often used for a 2 x 2 contingency table in which one or more expected frequencies are below 5. It provides an exact p-value and does not require an approximation technique
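
As a rough sketch of the test of independence described above, the following Python computes the expected frequency for each cell, the X^2 statistic, and the degrees of freedom. The function name `independence_test` and the 2 x 3 table of counts are hypothetical, made up for illustration; only the standard library is used, so the p-value is left to a chi-square table or a statistics package.

```python
def independence_test(table):
    """Return (X^2, degrees of freedom, expected frequencies).

    table -- list of rows, each a list of observed counts O
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # E = (row total * column total) / (grand total), for every cell
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Requirement from the notes: every expected frequency is at least 5.
    if min(min(row) for row in expected) < 5:
        raise ValueError("every expected frequency should be at least 5")

    chi_square = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)   # (r-1)(c-1)
    return chi_square, df, expected


# Hypothetical 2 x 3 table: rows are two groups, columns are three responses.
table = [
    [12, 18, 30],
    [28, 22, 10],
]
chi_square, df, expected = independence_test(table)
print(f"X^2 = {chi_square:.3f}, df = {df}")
```

Because the test is right-tailed, only a large X^2 (observed counts far from the counts expected under independence) counts as evidence against the null hypothesis. The same calculation applies to a test of homogeneity.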
