AP Statistics: Unit 5 - Sampling Distributions

Introduction to Sampling Distributions

Fundamentals of Sampling Distributions

Parameter vs. Statistic

In statistics, it is crucial to distinguish between the numbers that describe a whole population and the numbers that describe a sample drawn from that population.

Parameter: A number that describes some characteristic of the population. In statistical practice, the value of a parameter is usually unknown.
- Notation: $\mu$ (population mean), $p$ (population proportion), $\sigma$ (population standard deviation).
Statistic: A number that describes some characteristic of a sample. The value of a statistic can be computed directly from the sample data. We use statistics to estimate parameters.
- Notation: $\bar{x}$ (sample mean), $\hat{p}$ (sample proportion), $s$ (sample standard deviation).

What is a Sampling Distribution?

If we take repeated random samples of the same size $n$ from the same population, the value of the statistic (like the mean $\bar{x}$ or proportion $\hat{p}$) will vary from sample to sample. This concept is typically visualized as:

Process of creating a sampling distribution from a population

Definition: The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
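
For a very small population, the definition can be carried out literally by listing every possible sample. The sketch below is only an illustration: the five-value population and the sample size of 2 are made-up numbers, not data from this unit.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of five values (assumed for illustration only).
population = [2, 4, 6, 8, 10]
n = 2  # sample size

# Every possible sample of size n, drawn without replacement.
all_samples = list(combinations(population, n))

# The sampling distribution of x-bar: one sample mean per possible sample.
sample_means = sorted(mean(sample) for sample in all_samples)

print(len(all_samples))  # number of possible samples of size 2
print(sample_means)      # the sample means, from smallest to largest
```

With real populations the list of all possible samples is far too long to write out, which is why the formulas later in this unit describe the sampling distribution instead.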

Bias and Variability

When evaluating a statistic as an estimator of a parameter, we look at two main characteristics: bias (center) and variability (spread).

Biased vs. Unbiased Estimators

Unbiased Estimator: A statistic is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated.
- $\hat{p}$ and $\bar{x}$ are unbiased estimators of $p$ and $\mu$, respectively.
Bias: Failure of the sampling distribution to center on the population parameter.
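
One way to see the difference is to compare two estimators on a tiny population where every possible sample can be listed. In this sketch the population values are made up, and the sample maximum is used as a deliberately poor estimator of the population maximum; neither choice comes from the course material.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population (assumed for illustration only).
population = [2, 4, 6, 8, 10]
samples = list(combinations(population, 2))  # all samples of size 2

# Unbiased: the mean of all sample means equals the population mean.
mean_of_sample_means = mean(mean(s) for s in samples)

# Biased: the mean of all sample maxima falls below the population maximum.
mean_of_sample_maxes = mean(max(s) for s in samples)

print(mean_of_sample_means, mean(population))
print(mean_of_sample_maxes, max(population))
```

The sample means center exactly on the population mean, while the sample maxima center below the population maximum, so the maximum is a biased (too low) estimator here.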

Variability

Variability: Describes how spread out the values of the sample statistic are.
Larger samples ($n$) result in smaller variability. This is a key rule: Averages based on larger samples vary less than averages based on smaller samples.
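
A quick simulation can make this rule concrete. The sketch below assumes a Normal population with mean 50 and standard deviation 10 (arbitrary illustrative values) and compares the spread of sample means for $n = 10$ and $n = 100$.

```python
import random
from statistics import pstdev

def sd_of_sample_means(n, num_samples=2000, seed=1):
    """Simulate num_samples samples of size n; return the SD of their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        # Assumed population: Normal with mean 50 and SD 10.
        sample = [rng.gauss(50, 10) for _ in range(n)]
        means.append(sum(sample) / n)
    return pstdev(means)

sd_small = sd_of_sample_means(n=10)
sd_large = sd_of_sample_means(n=100)
print(round(sd_small, 2), round(sd_large, 2))
```

The simulated spread for $n = 100$ should come out clearly smaller than for $n = 10$, in line with the rule above.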

Targets demonstrating high/low bias and high/low variability

➥ Example 5.1: Analyzing Estimators

Imagine manufacturing baseballs with a target weight of 146g. You test different estimators (A, B, C, D) using various sample sizes.

Unbiased: If the estimator's distribution centers at 146g.
Low Variability: If the range of the estimator's values is small.
Note: As sample size $n$ increases, the variability of an unbiased estimator should decrease (the distribution gets narrower/taller).

Sampling Distribution for Sample Proportions

We use the sample proportion $\hat{p}$ to estimate the population proportion $p$.

The Formulas

If we choose a Simple Random Sample (SRS) of size $n$ from a large population with proportion of successes $p$:

Shape: Approximately Normal if the Large Counts Condition is met.
Center (Mean): $\mu_{\hat{p}} = p$
Spread (Standard Deviation): $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

The Three Conditions

To perform calculations, three conditions must be checked:

Randomness: The data must come from a random sample or randomized experiment.
10% Condition (Independence): $n \le 0.10N$. The sample size must be at most 10% of the population size. When this holds, sampling without replacement behaves approximately like sampling with replacement (the probabilities stay nearly constant), so the observations can be treated as independent and the standard deviation formula stays accurate.
Large Counts Condition (Normality): $np \ge 10$ AND $n(1-p) \ge 10$. The expected number of successes and failures must both be at least 10.
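
The two numerical conditions and the formulas can be bundled into a small helper. This is a sketch, not a standard routine: the function name and the values $p = 0.30$, $n = 200$, $N = 10{,}000$ are assumptions chosen for illustration, and the Random condition still has to be judged from how the data were collected.

```python
from math import sqrt

def proportion_sampling_distribution(p, n, population_size):
    """Check the 10% and Large Counts conditions; return the mean and SD of p-hat."""
    ten_percent_ok = n <= 0.10 * population_size
    large_counts_ok = n * p >= 10 and n * (1 - p) >= 10
    return {
        "ten_percent_ok": ten_percent_ok,    # justifies the SD formula
        "large_counts_ok": large_counts_ok,  # justifies the Normal shape
        "mean": p,
        "sd": sqrt(p * (1 - p) / n),
    }

# Hypothetical inputs (not from the text).
result = proportion_sampling_distribution(p=0.30, n=200, population_size=10_000)
print(result)
```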

➥ Example 5.2: Math Anxiety

It is estimated that $p = 0.80$ of people with high math anxiety experience physical pain responses. In a random sample of $n = 110$ people:

1. Check Conditions:

Random: Stated in problem.
10% Condition: $110$ is likely $< 10\%$ of all people with math anxiety.
Large Counts: $110(0.80) = 88 \ge 10$ and $110(0.20) = 22 \ge 10$. The distribution is approx. Normal.

2. Calculate Parameters:

Mean $\mu_{\hat{p}} = 0.80$
Standard Deviation $\sigma_{\hat{p}} = \sqrt{\frac{0.80(0.20)}{110}} \approx 0.0381$

3. Probability Calculation:
What is the probability that less than 75% ($0.75$) experience the pain response?

z-score: $z = \frac{0.75 - 0.80}{0.0381} \approx -1.31$
Area: Using TI-84 normalcdf(lower: -1E99, upper: 0.75, mu: 0.80, sigma: 0.0381) or normalcdf(lower: -10, upper: -1.31, mu: 0, sigma: 1).
Result: $P(\hat{p} < 0.75) \approx 0.0951$
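
For readers without a TI-84, the same calculation can be sketched with Python's standard library (`statistics.NormalDist`). Because the standard deviation is not rounded here, the answer may differ from the result above in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.80, 110             # values from Example 5.2
sd = sqrt(p * (1 - p) / n)   # standard deviation of p-hat
z = (0.75 - p) / sd          # z-score for p-hat = 0.75

# P(p-hat < 0.75) under the Normal approximation.
prob = NormalDist(mu=p, sigma=sd).cdf(0.75)

print(round(sd, 4), round(z, 2), round(prob, 3))
```

Rounded to three decimals, the probability is about 0.095, matching the calculator result.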

Sampling Distribution for Differences in Proportions

When comparing two populations, we analyze the sampling distribution of the difference $\hat{p}_1 - \hat{p}_2$.

Formulas

Center: $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$
Spread: $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
- Note: Variances add, standard deviations do not. We sum the variances and take the square root.

Conditions

Must be checked for BOTH samples independently:

Randomness: Independent random samples.
10% Condition: $n_1 \le 0.10N_1$ and $n_2 \le 0.10N_2$.
Large Counts: $n_1p_1$, $n_1(1-p_1)$, $n_2p_2$, and $n_2(1-p_2)$ must all be $\ge 10$.
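
The formulas and the Large Counts check for two proportions can be sketched as follows. The function name and the inputs ($p_1 = 0.60$, $n_1 = 100$, $p_2 = 0.50$, $n_2 = 120$) are hypothetical, chosen only to show that adding variances gives a different answer from adding standard deviations.

```python
from math import sqrt

def diff_proportions_distribution(p1, n1, p2, n2):
    """Return the center, spread, and Large Counts check for p-hat1 - p-hat2."""
    center = p1 - p2
    # Variances add; take the square root of the sum.
    spread = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    counts = (n1 * p1, n1 * (1 - p1), n2 * p2, n2 * (1 - p2))
    large_counts_ok = all(count >= 10 for count in counts)
    return center, spread, large_counts_ok

# Hypothetical inputs (not from the text).
center, spread, large_counts_ok = diff_proportions_distribution(0.60, 100, 0.50, 120)

# Common mistake for comparison: adding the two standard deviations directly.
wrong_spread = sqrt(0.60 * 0.40 / 100) + sqrt(0.50 * 0.50 / 120)

print(round(center, 2), round(spread, 4), large_counts_ok)
print(round(wrong_spread, 4))  # larger than the correct spread
```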

Sampling Distribution for Sample Means

We use the sample mean $\bar{x}$ to estimate the population mean $\mu$.

The Formulas

For an SRS of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$:

Center (Mean): $\mu_{\bar{x}} = \mu$
Spread (Standard Deviation): $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

Important Concept: Behavior of Means

Averages are less variable than individual observations.
As $n$ increases, $\sigma_{\bar{x}}$ decreases (the curve becomes narrower and taller).
The sample mean $\bar{x}$ is an unbiased estimator of $\mu$ at every sample size: $\mu_{\bar{x}} = \mu$ regardless of $n$.

Conditions for Normality (Crucial Distinction)

How do we know if the sampling distribution of $\bar{x}$ is Normal? We check one of two cases:

Case 1: The Population is Normal
If the original population distribution is Normal, the sampling distribution of $\bar{x}$ is Normal for any sample size $n$.

Case 2: The Central Limit Theorem (CLT)
If the population shape is unknown or non-Normal (for example, skewed), the sampling distribution of $\bar{x}$ will be approximately Normal if the sample size is large. The usual guideline is $n \ge 30$.

Diagram showing skewed population transforming into Normal sampling distribution as n increases
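
The effect in the diagram can be imitated with a short simulation. This sketch assumes an exponential population (strongly right-skewed) as a stand-in for "a skewed population" and measures the skewness of the simulated sample means; a value near 0 indicates a roughly symmetric shape. The helper names and the choices $n = 2$ and $n = 40$ are illustrative.

```python
import random

def skewness(values):
    """Moment-based skewness: about 0 for a symmetric distribution."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def simulate_sample_means(n, num_samples=4000, seed=2):
    """Draw num_samples samples of size n from an exponential population."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(num_samples)
    ]

skew_small = skewness(simulate_sample_means(n=2))
skew_large = skewness(simulate_sample_means(n=40))
print(round(skew_small, 2), round(skew_large, 2))
```

The skewness for $n = 40$ should be much closer to 0 than for $n = 2$, which is the CLT at work: the population stays skewed, but the distribution of $\bar{x}$ becomes more symmetric as $n$ grows.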

The Three Conditions

Random: SRS.
10% Condition: $n \le 0.10N$ (Used to justify the formula $\sigma/\sqrt{n}$).
Normal/Large Sample: Population is Normal OR $n \ge 30$ (CLT).

➥ Example 5.3: Naked Mole Rats

Life expectancy of mole rats: $\mu = 21$ years, $\sigma = 3$ years. Distribution is unknown. Sample size $n=40$.

Check: $n=40 \ge 30$, so by the CLT, the sampling distribution is approx. Normal.
Parameters: $\mu_{\bar{x}} = 21$, $\sigma_{\bar{x}} = \frac{3}{\sqrt{40}} \approx 0.474$.
Problem: Probability that mean life expectancy is between 20 and 22 years.
- normalcdf(lower: 20, upper: 22, mu: 21, sigma: 0.474) = 0.965.
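
The same probability can be checked without a calculator. This is a sketch using Python's standard-library `statistics.NormalDist`, with the values from Example 5.3.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 21, 3, 40      # values from Example 5.3
sd_xbar = sigma / sqrt(n)     # standard deviation of x-bar

# Sampling distribution of x-bar, approximately Normal by the CLT.
xbar_dist = NormalDist(mu=mu, sigma=sd_xbar)

# P(20 < x-bar < 22) = area to the left of 22 minus area to the left of 20.
prob = xbar_dist.cdf(22) - xbar_dist.cdf(20)

print(round(sd_xbar, 3), round(prob, 3))
```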

Sampling Distribution for Differences in Means

When comparing two independent means, we analyze $\bar{x}_1 - \bar{x}_2$.

Formulas

Center: $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$
Spread: $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

Conditions

Random: Independent random samples.
10% Condition: Checked for both populations.
Normal/Large Sample: Both populations must be Normal OR both sample sizes ($n_1$, $n_2$) must be $\ge 30$.

➥ Example 5.4: Genetic Mutations

Group 1 (40yo): $\mu_1 = 65, \sigma_1 = 15, n_1 = 35$.
Group 2 (20yo): $\mu_2 = 25, \sigma_2 = 5, n_2 = 40$.
Task: Probability that the difference in means (Group 1 - Group 2) is between 35 and 45.

Solution:

Mean Diff: $65 - 25 = 40$
SD Diff: $\sqrt{\frac{15^2}{35} + \frac{5^2}{40}} = \sqrt{6.429 + 0.625} \approx 2.656$
Calculation: normalcdf(35, 45, 40, 2.656) = 0.940.
Check: Both $n$ are $\ge 30$, so Normal approximation is valid via CLT.
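
As a cross-check, the solution can be reproduced with Python's standard library. This is a sketch using the values from Example 5.4; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu1, sigma1, n1 = 65, 15, 35   # Group 1 (40yo)
mu2, sigma2, n2 = 25, 5, 40    # Group 2 (20yo)

mean_diff = mu1 - mu2
# Variances add for independent samples; take the square root of the sum.
sd_diff = sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# P(35 < x-bar1 - x-bar2 < 45) under the Normal approximation.
diff_dist = NormalDist(mu=mean_diff, sigma=sd_diff)
prob = diff_dist.cdf(45) - diff_dist.cdf(35)

print(mean_diff, round(sd_diff, 3), round(prob, 3))
```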

Calculator Reference (TI-84)

1. Finding Probability (Area)

Use when you have the cut-off value ($x$, $\bar{x}$, or $\hat{p}$) and want the probability (percentage).

Command: 2nd $\to$ VARS $\to$ 2:normalcdf
Inputs: (lower_bound, upper_bound, mean, SD)
Note: Use $-1E99$ for negative infinity and $1E99$ for positive infinity.

2. Finding Cut-off Values (Percentiles)

Use when you have the probability/area (e.g., "top 10%", "90th percentile") and want the cut-off value.

Command: 2nd $\to$ VARS $\to$ 3:invNorm
Inputs: (area_to_the_left, mean, SD)
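
For practice away from the calculator, both commands can be approximated in Python. The helper names `normalcdf` and `inv_norm` below are made up to mirror the TI-84 inputs; they are thin wrappers around the standard-library `statistics.NormalDist`, not calculator functions.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the Normal curve between lower and upper."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area_to_the_left, mean=0.0, sd=1.0):
    """Cut-off value with the given area to its left."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(area_to_the_left)

# Area to the left of z = -1.31, using -1e99 as a stand-in for negative infinity.
left_tail = normalcdf(-1e99, -1.31)

# 90th percentile of the standard Normal distribution.
z_90 = inv_norm(0.90)

print(round(left_tail, 4), round(z_90, 2))
```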

TI-84 screen showing the distribution menu and syntax for normalcdf

Common Mistakes & Pitfalls

Confusing the Law of Large Numbers (LLN) with the Central Limit Theorem (CLT)
- Mistake: Saying "The sample is large, so by the Law of Large Numbers it's normal."
- Correction: LLN says the value of $\bar{x}$ gets closer to $\mu$ as $n$ grows. CLT says the shape of the sampling distribution of $\bar{x}$ becomes approximately Normal as $n$ grows.
Misapplying the Large Counts Condition
- Mistake: Using $n \ge 30$ to check normality for proportions.
- Correction: $n \ge 30$ is for means. For proportions, you MUST use $np \ge 10$ and $n(1-p) \ge 10$.
Forgetting to divide by $\sqrt{n}$
- Mistake: Using $\sigma$ instead of $\frac{\sigma}{\sqrt{n}}$ in normalcdf.
- Correction: If the problem asks about the probability of a sample mean, you must use the standard deviation of the sampling distribution, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, which is smaller than the population SD.
Notation Errors
- Mistake: Writing $\mu = 0.80$ for a proportion problem.
- Correction: Use $p$ for population proportion and $\mu$ for population mean. Use $\hat{p}$ for sample proportion and $\bar{x}$ for sample mean.
Standard Deviation vs. Standard Error
- If you know the population $\sigma$, use it: $\frac{\sigma}{\sqrt{n}}$ is the standard deviation of the sampling distribution. If you estimate $\sigma$ with the sample standard deviation $s$, the resulting estimate $\frac{s}{\sqrt{n}}$ is technically called the Standard Error (Unit 6 concept, but good to know).