AP Statistics: Unit 5 - Sampling Distributions

Introduction to Sampling Distributions

Fundamentals of Sampling Distributions

Parameter vs. Statistic

In statistics, it is crucial to distinguish between the numbers that describe a whole population and the numbers that describe a sample drawn from that population.

  • Parameter: A number that describes some characteristic of the population. In statistical practice, the value of a parameter is usually unknown.
    • Notation: $\mu$ (population mean), $p$ (population proportion), $\sigma$ (population standard deviation).
  • Statistic: A number that describes some characteristic of a sample. The value of a statistic can be computed directly from the sample data. We use statistics to estimate parameters.
    • Notation: $\bar{x}$ (sample mean), $\hat{p}$ (sample proportion), $s$ (sample standard deviation).

What is a Sampling Distribution?

If we take repeated random samples of the same size $n$ from the same population, the value of the statistic (like the mean $\bar{x}$ or proportion $\hat{p}$) will vary from sample to sample. This concept is typically visualized as:

Process of creating a sampling distribution from a population

Definition: The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
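
For a very small population, the definition can be carried out literally by listing every possible sample. The sketch below uses a made-up population of five values (an assumption for illustration, not data from this guide) and samples of size 2 drawn without replacement.

```python
from itertools import combinations
from statistics import mean

# Hypothetical tiny population (illustrative values only).
population = [2, 4, 6, 8, 10]
n = 2  # sample size

# Every possible sample of size n, drawn without replacement.
all_samples = list(combinations(population, n))

# The sampling distribution of x-bar: one sample mean per possible sample.
sample_means = [mean(sample) for sample in all_samples]

print(len(all_samples))                      # how many samples are possible
print(sorted(sample_means))                  # values the statistic can take
print(mean(sample_means), mean(population))  # center of the distribution vs. mu
```

In this toy case the mean of all the sample means matches the population mean. Real populations are far too large to enumerate, which is why the formulas later in this unit matter.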


Bias and Variability

When evaluating a statistic as an estimator of a parameter, we look at two main characteristics: bias (center) and variability (spread).

Biased vs. Unbiased Estimators

  • Unbiased Estimator: A statistic is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated.
    • $\hat{p}$ and $\bar{x}$ are unbiased estimators of $p$ and $\mu$, respectively.
  • Bias: Failure of the sampling distribution to center on the population parameter.

Variability

  • Variability: Describes how spread out the values of the sample statistic are.
  • Larger samples ($n$) result in smaller variability. This is a key rule: Averages based on larger samples vary less than averages based on smaller samples.

Targets demonstrating high/low bias and high/low variability

➥ Example 5.1: Analyzing Estimators

Imagine manufacturing baseballs with a target weight of 146g. You test different estimators (A, B, C, D) using various sample sizes.

  • Unbiased: If the estimator's distribution centers at 146g.
  • Low Variability: If the range of the estimator's values is small.
  • Note: As sample size $n$ increases, the variability of an unbiased estimator should decrease (the distribution gets narrower/taller).
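
The effect of sample size on an unbiased estimator can be checked with a quick simulation. This is only a sketch: it assumes baseball weights follow a Normal distribution centered at the 146g target with a population SD of 2g. That SD is a made-up value, since the example does not give one.

```python
import random
from statistics import fmean, stdev

random.seed(1)  # fixed seed so the run is repeatable

TARGET = 146.0    # target weight in grams, from the example
ASSUMED_SD = 2.0  # hypothetical population SD, not given in the example

def simulate_sample_means(n, reps=2000):
    """Return `reps` sample means, each from a simulated sample of size n."""
    return [
        fmean([random.gauss(TARGET, ASSUMED_SD) for _ in range(n)])
        for _ in range(reps)
    ]

means_small = simulate_sample_means(5)   # small samples
means_large = simulate_sample_means(50)  # large samples

for label, means in (("n=5", means_small), ("n=50", means_large)):
    # Center should sit near 146 in both cases; spread should shrink with n.
    print(label, round(fmean(means), 2), round(stdev(means), 3))
```

Both sets of sample means should center near 146, while the spread for $n = 50$ should be noticeably smaller than for $n = 5$.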

Sampling Distribution for Sample Proportions

We use the sample proportion $\hat{p}$ to estimate the population proportion $p$.

The Formulas

If we choose a Simple Random Sample (SRS) of size $n$ from a large population with proportion of successes $p$:

  • Shape: Approximately Normal if the Large Counts Condition is met.
  • Center (Mean): $\mu_{\hat{p}} = p$
  • Spread (Standard Deviation): $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

The Three Conditions

To perform calculations, three conditions must be checked:

  1. Randomness: The data must come from a random sample or randomized experiment.
  2. 10% Condition (Independence): $n \le 0.10N$. The sample size must be no more than 10% of the population size. When this holds, sampling without replacement changes the probabilities so little that the observations can be treated as approximately independent, as if we were sampling with replacement.
  3. Large Counts Condition (Normality): $np \ge 10$ AND $n(1-p) \ge 10$. The expected number of successes and failures must both be at least 10.
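
The two numerical conditions and the formulas above can be bundled into a small helper. This is a sketch, not a standard function: the name `phat_sampling_distribution` and the input values are hypothetical, and the Randomness condition still has to be judged from how the data were collected.

```python
import math

def phat_sampling_distribution(n, p, N=None):
    """Summarize the sampling distribution of p-hat for sample size n.

    N is the population size; pass None if it is unknown and the
    10% condition has to be argued in words instead.
    """
    return {
        "mean": p,
        "sd": math.sqrt(p * (1 - p) / n),
        "ten_percent_ok": None if N is None else n <= 0.10 * N,
        "large_counts_ok": n * p >= 10 and n * (1 - p) >= 10,
    }

# Hypothetical inputs chosen only to show one passing and one failing case.
ok = phat_sampling_distribution(n=200, p=0.30, N=10_000)
too_small = phat_sampling_distribution(n=50, p=0.10, N=10_000)

print(ok)         # both numerical conditions hold
print(too_small)  # expected successes = 5, so Large Counts fails
```

In the second case the expected number of successes is only 5, so a Normal model for $\hat{p}$ would not be justified.
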

➥ Example 5.2: Math Anxiety

It is estimated that $p = 0.80$ of people with high math anxiety experience physical pain responses. In a random sample of $n = 110$ people:

1. Check Conditions:

  • Random: Stated in problem.
  • 10% Condition: $110$ is likely $< 10\%$ of all people with math anxiety.
  • Large Counts: $110(0.80) = 88 \ge 10$ and $110(0.20) = 22 \ge 10$. The distribution is approx. Normal.

2. Calculate Parameters:

  • Mean $\mu_{\hat{p}} = 0.80$
  • Standard Deviation $\sigma_{\hat{p}} = \sqrt{\frac{0.80(0.20)}{110}} \approx 0.0381$

3. Probability Calculation:
What is the probability that fewer than 75% of the people in the sample ($\hat{p} < 0.75$) experience the pain response?

  • z-score: $z = \frac{0.75 - 0.80}{0.0381} = -1.31$
  • Area: Using TI-84 normalcdf(lower: -1E99, upper: 0.75, mu: 0.80, sigma: 0.0381) or normalcdf(lower: -10, upper: -1.31, mu: 0, sigma: 1).
  • Result: $P(\hat{p} < 0.75) \approx 0.0951$
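
The same calculation can be reproduced off the calculator. The sketch below uses Python's standard-library `NormalDist` in place of the TI-84; the variable names are illustrative.

```python
import math
from statistics import NormalDist

p, n = 0.80, 110
sd_phat = math.sqrt(p * (1 - p) / n)  # standard deviation of p-hat

# Normal model for p-hat; cdf(x) is the area to the left of x.
prob = NormalDist(mu=p, sigma=sd_phat).cdf(0.75)
z = (0.75 - p) / sd_phat

print(round(sd_phat, 4), round(z, 2), round(prob, 4))
```

The probability comes out close to 0.095. Small differences in the last decimal place depend on whether the z-score and standard deviation are rounded before the area is found.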

Sampling Distribution for Differences in Proportions

When comparing two populations, we analyze the sampling distribution of the difference $\hat{p}_1 - \hat{p}_2$.

Formulas

  • Center: $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$
  • Spread: $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
    • Note: Variances add, standard deviations do not. We sum the variances and take the square root.

Conditions

Must be checked for BOTH samples independently:

  1. Randomness: Independent random samples.
  2. 10% Condition: $n_1 \le 0.10N_1$ and $n_2 \le 0.10N_2$.
  3. Large Counts: $n_1p_1$, $n_1(1-p_1)$, $n_2p_2$, $n_2(1-p_2)$ must all be $\ge 10$.
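
A short sketch of the center and spread formulas for a difference in proportions, including the "variances add" rule. The proportions and sample sizes are made-up values for illustration only.

```python
import math

# Hypothetical population proportions and sample sizes.
p1, n1 = 0.60, 100
p2, n2 = 0.50, 120

mean_diff = p1 - p2

# Variances of each sample proportion.
var1 = p1 * (1 - p1) / n1
var2 = p2 * (1 - p2) / n2

# Correct: add the variances, then take the square root.
sd_diff = math.sqrt(var1 + var2)

# Common error: adding the standard deviations directly overstates the spread.
wrong_sd = math.sqrt(var1) + math.sqrt(var2)

# Large Counts must hold for all four expected counts.
large_counts_ok = all(
    count >= 10 for count in (n1 * p1, n1 * (1 - p1), n2 * p2, n2 * (1 - p2))
)

print(round(mean_diff, 2), round(sd_diff, 3), round(wrong_sd, 3), large_counts_ok)
```

With these inputs the correct standard deviation is about 0.067, while adding the two standard deviations would give a value near 0.095.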

Sampling Distribution for Sample Means

We use the sample mean $\bar{x}$ to estimate the population mean $\mu$.

The Formulas

For an SRS of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$:

  • Center (Mean): $\mu_{\bar{x}} = \mu$
  • Spread (Standard Deviation): $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

Important Concept: Behavior of Means

  • Averages are less variable than individual observations.
  • As $n$ increases, $\sigma_{\bar{x}}$ decreases (the curve becomes narrower and taller).
  • $\bar{x}$ remains an unbiased estimator of $\mu$ regardless of sample size: $\mu_{\bar{x}} = \mu$ for every $n$.
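
Because $n$ sits under a square root, the spread shrinks more slowly than the sample size grows. A minimal sketch, assuming a population SD of 3 (an arbitrary illustrative value):

```python
import math

def sd_of_sample_mean(sigma, n):
    """Standard deviation of x-bar for samples of size n."""
    return sigma / math.sqrt(n)

sigma = 3.0  # hypothetical population SD

# Each step multiplies n by 4, which cuts the SD of x-bar in half.
for n in (1, 4, 16, 64):
    print(n, sd_of_sample_mean(sigma, n))
```

To cut the standard deviation of $\bar{x}$ in half, the sample size has to be multiplied by four.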

Conditions for Normality (Crucial Distinction)

How do we know if the sampling distribution of $\bar{x}$ is Normal? We check one of two cases:

Case 1: The Population is Normal
If the original population distribution is Normal, the sampling distribution of $\bar{x}$ is Normal for any sample size $n$.

Case 2: The Central Limit Theorem (CLT)
If the population shape is unknown or skewed, the sampling distribution of $\bar{x}$ will be approximately Normal if the sample size is large ($n \ge 30$).

Diagram showing skewed population transforming into Normal sampling distribution as n increases
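
The Central Limit Theorem can also be seen in a simulation. The sketch below assumes a strongly right-skewed population (an exponential distribution with mean 1, chosen only for illustration) and measures the skewness of the simulated sample means for several sample sizes. A skewness near 0 indicates a roughly symmetric shape.

```python
import random
from statistics import fmean, pstdev

random.seed(2)  # fixed seed so the run is repeatable

def skewness(values):
    """Average cubed z-score: about 0 for a symmetric distribution."""
    m, s = fmean(values), pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

def sample_means(n, reps=3000):
    """Means of `reps` samples of size n from a right-skewed population."""
    # Exponential population with mean 1 (an assumed, illustrative choice).
    return [
        fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

skew_by_n = {n: skewness(sample_means(n)) for n in (2, 10, 40)}

for n, skew in skew_by_n.items():
    print(n, round(skew, 2))  # skewness should move toward 0 as n grows
```

The skewness of the sample means should fall toward 0 as $n$ increases, even though the population itself stays skewed.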

The Three Conditions

  1. Random: SRS.
  2. 10% Condition: $n \le 0.10N$ (Used to justify the formula $\sigma/\sqrt{n}$).
  3. Normal/Large Sample: Population is Normal OR $n \ge 30$ (CLT).

➥ Example 5.3: Naked Mole Rats

Life expectancy of mole rats: $\mu = 21$ years, $\sigma = 3$ years. Distribution is unknown. Sample size $n=40$.

  1. Check: $n=40 \ge 30$, so by the CLT, the sampling distribution is approx. Normal.
  2. Parameters: $\mu_{\bar{x}} = 21$, $\sigma_{\bar{x}} = \frac{3}{\sqrt{40}} \approx 0.474$.
  3. Problem: Probability that mean life expectancy is between 20 and 22 years.
    • normalcdf(lower: 20, upper: 22, mu: 21, sigma: 0.474) = 0.965.
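
As a cross-check on the calculator work, the same steps can be sketched with Python's standard-library `NormalDist`. The variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 21, 3, 40
sd_xbar = sigma / math.sqrt(n)  # standard deviation of x-bar

# Normal model for x-bar, justified by the CLT since n >= 30.
xbar_dist = NormalDist(mu=mu, sigma=sd_xbar)

# Area between 20 and 22 = (area left of 22) - (area left of 20).
prob_between = xbar_dist.cdf(22) - xbar_dist.cdf(20)

print(round(sd_xbar, 3), round(prob_between, 3))
```

The result should agree with the normalcdf value of about 0.965.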

Sampling Distribution for Differences in Means

When comparing two independent means, we analyze $\bar{x}_1 - \bar{x}_2$.

Formulas

  • Center: $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$
  • Spread: $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

Conditions

  1. Random: Independent random samples.
  2. 10% Condition: Checked for both populations.
  3. Normal/Large Sample: Both populations must be Normal OR both sample sizes ($n_1$, $n_2$) must be $\ge 30$.

➥ Example 5.4: Genetic Mutations

  • Group 1 (40yo): $\mu_1 = 65, \sigma_1 = 15, n_1 = 35$.
  • Group 2 (20yo): $\mu_2 = 25, \sigma_2 = 5, n_2 = 40$.
  • Task: Probability that the difference in means (Group 1 - Group 2) is between 35 and 45.

Solution:

  • Mean Diff: $65 - 25 = 40$
  • SD Diff: $\sqrt{\frac{15^2}{35} + \frac{5^2}{40}} = \sqrt{6.429 + 0.625} \approx 2.656$
  • Calculation: normalcdf(35, 45, 40, 2.656) = 0.940.
  • Check: Both $n$ are $\ge 30$, so Normal approximation is valid via CLT.
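
The solution above can be reproduced with a few lines of Python using the standard-library `NormalDist`. This is a sketch with illustrative variable names.

```python
import math
from statistics import NormalDist

mu1, sigma1, n1 = 65, 15, 35  # Group 1 (40yo)
mu2, sigma2, n2 = 25, 5, 40   # Group 2 (20yo)

mean_diff = mu1 - mu2

# Add the variances of the two sample means, then take the square root.
sd_diff = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# Normal model for the difference in sample means (both n >= 30).
diff_dist = NormalDist(mu=mean_diff, sigma=sd_diff)
prob_between = diff_dist.cdf(45) - diff_dist.cdf(35)

print(mean_diff, round(sd_diff, 3), round(prob_between, 3))
```

The output should match the values in the solution: a mean difference of 40, a standard deviation near 2.656, and a probability of about 0.940.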

Calculator Reference (TI-84)

1. Finding Probability (Area)

Use when you have the cut-off value ($x$, $\bar{x}$, or $\hat{p}$) and want the probability (percentage).

  • Command: 2nd $\to$ VARS $\to$ 2:normalcdf
  • Inputs: (lower_bound, upper_bound, mean, SD)
  • Note: Use $-1E99$ for negative infinity and $1E99$ for positive infinity.

2. Finding Cut-off Values (Percentiles)

Use when you have the probability/area (e.g., "top 10%", "90th percentile") and want the cut-off value.

  • Command: 2nd $\to$ VARS $\to$ 3:invNorm
  • Inputs: (area_to_the_left, mean, SD)
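
For practice away from the calculator, both commands can be imitated in Python. The function names `normalcdf` and `inv_norm` below are hypothetical wrappers written to mirror the TI-84 inputs; they are built on the standard-library `NormalDist` class.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the Normal(mean, sd) curve between lower and upper."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area_left, mean=0.0, sd=1.0):
    """Cut-off value with the given area to its left."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(area_left)

# Area to the left of z = 1.96, using -1E99 as "negative infinity".
print(round(normalcdf(-1e99, 1.96), 3))

# 90th percentile of the standard Normal distribution.
print(round(inv_norm(0.90), 3))
```

As on the calculator, `inv_norm` expects the area to the left of the cut-off, so a "top 10%" question uses an area of 0.90.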

TI-84 screen showing the distribution menu and syntax for normalcdf


Common Mistakes & Pitfalls

  1. Confusing the Law of Large Numbers (LLN) with the Central Limit Theorem (CLT)

    • Mistake: Saying "The sample is large, so by the Law of Large Numbers it's normal."
    • Correction: The LLN says that $\bar{x}$ gets closer to $\mu$ as $n$ grows. The CLT says that the shape of the sampling distribution of $\bar{x}$ becomes approximately Normal as $n$ grows.
  2. Misapplying the Large Counts Condition

    • Mistake: Using $n \ge 30$ to check normality for proportions.
    • Correction: $n \ge 30$ is for means. For proportions, you MUST use $np \ge 10$ and $n(1-p) \ge 10$.
  3. Forgetting to divide by $\sqrt{n}$

    • Mistake: Using $\sigma$ instead of $\frac{\sigma}{\sqrt{n}}$ in normalcdf.
    • Correction: If the problem asks about the probability of a sample mean, you must use the standard deviation of the sampling distribution, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, which is smaller than the population SD whenever $n > 1$.
  4. Notation Errors

    • Mistake: Writing $\mu = 0.80$ for a proportion problem.
    • Correction: Use $p$ for population proportion and $\mu$ for population mean. Use $\hat{p}$ for sample proportion and $\bar{x}$ for sample mean.
  5. Standard Deviation vs. Standard Error

    • If you know the population $\sigma$, use it in $\frac{\sigma}{\sqrt{n}}$. If you have to estimate $\sigma$ with the sample standard deviation $s$, the resulting estimate of the sampling distribution's standard deviation, $\frac{s}{\sqrt{n}}$, is called the Standard Error (a Unit 6 concept, but good to know).