Inference for Proportions: Hypothesis Tests, Errors, and Two-Proportion Inference
Introduction to Significance Tests and Setting Up Hypotheses
When you do statistical inference, you’re using data from a sample to learn about a population. With proportions, the population parameter you care about is usually the true fraction of individuals in a population with some characteristic—called the population proportion.
A significance test (also called a hypothesis test) is a formal way to decide whether sample evidence is strong enough to support a claim about a population parameter. The core idea is simple:
- Start by assuming a “status quo” claim about the parameter.
- Ask: If that claim were true, how surprising is the sample result I got?
- If it would be very surprising, you have evidence against the status quo.
Parameters vs. statistics (what you know vs. what you estimate)
In inference, it’s crucial to separate:
- Parameter: a fixed (but usually unknown) value describing the population, like the population proportion p.
- Statistic: a number computed from the sample, like the sample proportion p̂.
If you observe X “successes” (people with the characteristic) in a sample of size n, then:
p̂ = X / n
The logic of hypothesis testing
A test is built around two competing statements:
- Null hypothesis: the default assumption; typically “no change” or “no difference.”
- Alternative hypothesis: what you’re looking for evidence to support.
You then compute a p-value, which is:
- p-value: the probability, assuming the null hypothesis is true, of getting a result at least as extreme as the one you observed.
Small p-values mean your observed result would be unlikely if the null were true, so the data provide evidence against the null.
Setting up hypotheses for a population proportion
Suppose a company claims that 40% of customers prefer Product A. Let p be the true proportion of all customers who prefer Product A.
A standard setup is:
- Null hypothesis: includes an equals sign and states a specific value.
- Alternative hypothesis: reflects the research question and uses one of three forms.
Two-sided (detect any difference): H₀: p = 0.40 versus Hₐ: p ≠ 0.40
Right-tailed (detect an increase): H₀: p = 0.40 versus Hₐ: p > 0.40
Left-tailed (detect a decrease): H₀: p = 0.40 versus Hₐ: p < 0.40
The alternative hypothesis should match the question before you look at the data. Choosing a one-sided alternative after seeing the sample result is a common (and serious) mistake.
Significance level and decisions
The significance level α is the cutoff you choose for how strong the evidence must be to reject the null. Common choices are α = 0.05 or α = 0.01.
Decision rule:
- If p-value ≤ α: **reject** H₀ (evidence supports Hₐ).
- If p-value > α: **fail to reject** H₀ (not enough evidence for Hₐ).
“Fail to reject” is not the same as “accept.” You are not proving the null is true—you’re saying the sample didn’t provide strong evidence against it.
Notation snapshot (common in AP Statistics)
| Idea | Common notation | Meaning |
|---|---|---|
| Population proportion | p | True (unknown) proportion in the population |
| Sample proportion | p̂ | Proportion in the sample |
| Null value | p₀ | Hypothesized proportion in H₀ |
| Significance level | α | Threshold for rejecting H₀ |
Example (hypotheses only)
A school believes more than 30% of students get at least 8 hours of sleep.
Let p = true proportion of all students at the school who get at least 8 hours of sleep.
H₀: p = 0.30 versus Hₐ: p > 0.30
That “greater than” comes directly from “more than 30%.”
Exam Focus
- Typical question patterns:
- “A company claims that … Do the data provide convincing evidence that the true proportion is (greater/less/different)?”
- “Write appropriate hypotheses for the parameter described.”
- “Interpret the p-value in context.”
- Common mistakes:
- Writing hypotheses about the statistic, such as H₀: p̂ = 0.40 (null hypotheses are about p, not p̂).
- Using a two-sided alternative (Hₐ: p ≠ p₀) when the question implies a one-sided alternative (or vice versa).
- Saying “there is a 5% chance H₀ is true” (p-values do not give probabilities that hypotheses are true).
Carrying Out a Significance Test for a Population Proportion
For AP Statistics, the standard significance test for a single population proportion is the one-proportion z test. It uses the idea that when n is large enough, the sampling distribution of p̂ is approximately Normal.
When a Normal model is reasonable (conditions)
A hypothesis test is only as trustworthy as its conditions. For a one-proportion z test, you typically check:
- Randomness: Data come from a random sample or a randomized experiment.
- Independence (10% condition): If sampling without replacement, the sample size should be less than 10% of the population size.
- Large counts (Normal approximation): Under the null hypothesis, both expected counts are at least 10:
  np₀ ≥ 10 and n(1 − p₀) ≥ 10
Notice the test uses p₀ (the null value) in the condition, because the null hypothesis is what you assume when calculating how unusual the sample result is.
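The large-counts check is mechanical and easy to script. A minimal Python sketch (the helper name `large_counts_ok` is ours, not a standard function):

```python
def large_counts_ok(n, p0, threshold=10):
    """Large-counts condition for a one-proportion z test.

    Under H0: p = p0, both expected counts -- n*p0 successes and
    n*(1 - p0) failures -- should be at least `threshold` (usually 10).
    """
    return n * p0 >= threshold and n * (1 - p0) >= threshold

print(large_counts_ok(200, 0.60))  # True: expected counts 120 and 80
print(large_counts_ok(30, 0.10))   # False: only 3 expected successes
```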
Test statistic (how far from the null, in standard errors)
If H₀: p = p₀ and your sample gives p̂ from sample size n, the test statistic is:
z = (p̂ − p₀) / √( p₀(1 − p₀) / n )
Interpretation:
- The numerator is the difference between what you saw and what the null claims.
- The denominator is the standard deviation of p̂ under the null, often called the standard error under the null.
- The resulting z tells you how many standard deviations your sample proportion is from the null value.
From z to a p-value
The p-value depends on the alternative hypothesis:
- If Hₐ: p > p₀ (right-tailed), the p-value is the area to the right of your z.
- If Hₐ: p < p₀ (left-tailed), the p-value is the area to the left of your z.
- If Hₐ: p ≠ p₀ (two-sided), the p-value is **twice** the tail area beyond |z|.
You can find this using technology (calculator function, applet, or Normal CDF).
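Putting the test statistic and p-value steps together, here is one possible Python sketch (function names are ours; `math.erf` gives the standard Normal CDF without any extra libraries):

```python
import math

def normal_cdf(x):
    """Standard Normal CDF, P(Z <= x), via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def one_prop_z_test(x, n, p0, alternative="two-sided"):
    """One-proportion z test: returns (p_hat, z, p_value).

    `alternative` is "greater", "less", or "two-sided".
    """
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (p_hat - p0) / se
    if alternative == "greater":
        p_value = 1 - normal_cdf(z)               # right tail
    elif alternative == "less":
        p_value = normal_cdf(z)                   # left tail
    else:
        p_value = 2 * (1 - normal_cdf(abs(z)))    # both tails
    return p_hat, z, p_value

# 55 successes in 100 trials, testing p > 0.50:
_, z, p = one_prop_z_test(55, 100, 0.50, "greater")
print(round(z, 2), round(p, 3))  # 1.0 0.159
```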
The AP Statistics “State, Plan, Do, Conclude” structure
Free-response questions often reward clear communication. A strong solution usually includes:
- State: Parameter, hypotheses, and significance level.
- Plan: Name the test and check conditions.
- Do: Compute p̂, z, and the p-value.
- Conclude: Decision (reject/fail to reject) and a contextual conclusion about p.
Worked example: one-proportion z test
A city website claims that 60% of residents support a new public transit plan. A random sample of 200 residents finds 132 support it.
Let p = true proportion of all city residents who support the plan.
State (hypotheses)
H₀: p = 0.60 versus Hₐ: p ≠ 0.60
(We’re checking whether the claim is wrong in either direction.)
Plan (conditions)
- Random: the problem states a random sample.
- 10% condition: 200 is presumably less than 10% of all residents (you’d state this assumption if population size isn’t given).
- Large counts using p₀ = 0.60: np₀ = 200(0.60) = 120 ≥ 10 and n(1 − p₀) = 200(0.40) = 80 ≥ 10.
So a one-proportion z test is appropriate.
Do (compute)
Sample proportion:
p̂ = 132/200 = 0.66
Test statistic: first compute the standard error under the null:
√(0.60 × 0.40 / 200) = √0.0012 ≈ 0.0346
Then:
z = (0.66 − 0.60) / 0.0346 ≈ 1.73
Because the alternative is two-sided, the p-value is:
p-value = 2 · P(Z ≥ 1.73)
Using Normal probabilities, P(Z ≥ 1.73) is about 0.0418, so the p-value is about 0.0836.
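The arithmetic above can be double-checked with a few lines of Python (plain stdlib; `math.erf` supplies the Normal tail area):

```python
import math

# Worked example: 132 of 200 residents, H0: p = 0.60, two-sided test.
p_hat = 132 / 200
se = math.sqrt(0.60 * 0.40 / 200)                   # SE under the null
z = (p_hat - 0.60) / se
tail = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))   # P(Z >= z)
p_value = 2 * tail                                  # two-sided
print(round(p_hat, 2), round(z, 2), round(p_value, 3))  # 0.66 1.73 0.083
```

(The 0.0836 above comes from rounding z to 1.73 before looking up the tail; exact arithmetic gives about 0.0835.)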
Conclude
At α = 0.05, the p-value (≈ 0.0836) is greater than α, so we fail to reject H₀. The sample does not provide convincing evidence that the true proportion of residents who support the plan differs from 60%.
Notice the wording: you’re not saying “60% is true,” only that you don’t have strong evidence it’s different.
What can go wrong (common conceptual pitfalls)
- Using p̂ instead of p₀ in the test statistic’s standard error: in a test, you assume the null is true, so the standard error is based on p₀.
- Misinterpreting the p-value: a p-value is about the probability of the data (or more extreme) under H₀, not the probability H₀ is true.
- Ignoring direction: for a one-sided test, results in the “wrong” direction produce large p-values even if |z| looks big.
Exam Focus
- Typical question patterns:
- “Do a significance test at α = 0.05 and interpret the p-value.”
- “Check conditions and identify the correct inference procedure.”
- “Given a computer output with z and the p-value, write the conclusion in context.”
- Common mistakes:
- Checking large counts with p̂ instead of p₀ for a test.
- Concluding “reject H₀, so H₀ is false” without stating what you have evidence for (the alternative, in context).
- Treating “not significant” as “no effect” rather than “not enough evidence with this sample.”
Type I and Type II Errors and Power
Whenever you make a decision from sample data, you risk being wrong. Hypothesis testing organizes these risks into two types of errors.
The two error types (defined in plain language)
A hypothesis test ends in one of two decisions: reject H₀ or fail to reject H₀. Reality also has two possibilities: H₀ is true or H₀ is false. That creates four outcomes.
Type I error: You reject H₀ even though H₀ is true.
- In words: a “false alarm.”
- Probability: approximately α (the significance level), assuming conditions hold.
Type II error: You fail to reject H₀ even though H₀ is false.
- In words: you “miss” a real difference.
- Probability: often called β (it depends on the true value of p, the sample size, and α).
Power: the probability that you correctly reject H₀ when H₀ is false. Power = 1 − β.
Power matters because a test that almost never rejects isn’t very useful—even if it rarely makes Type I errors.
Why Type I and II errors depend on context
The labels “Type I” and “Type II” don’t automatically mean “worse” or “better.” The consequences depend on the setting.
Example context: testing whether a restaurant’s claim “60% of customers are satisfied” is accurate.
- Type I error (rejecting a true claim): you might publicly accuse the restaurant of misrepresenting satisfaction when it isn’t.
- Type II error (missing a false claim): you might fail to detect that satisfaction is actually lower, and customers keep getting misled.
In medical screening, the stakes can flip—false positives vs. false negatives have very different costs.
The tradeoff between and
If you make it harder to reject H₀ (smaller α), you reduce Type I error risk—but you often increase Type II errors (lower power), because you require stronger evidence to reject.
If you make it easier to reject H₀ (larger α), you increase false alarms but reduce misses.
So you can’t usually minimize both error types at once without changing something else.
How to increase power (without “cheating”)
You increase power when it becomes easier for the test to detect real differences. Common ways:
- Increase the sample size n: this reduces the standard error, so real differences create larger |z| values.
- Use a larger significance level α: easier to reject H₀, but increases Type I error risk.
- Have a true parameter farther from the null: if the true p is very different from p₀, the test will detect it more often.
- Reduce variability: for proportions, variability is tied to p(1 − p), though you typically don’t control this directly.
Example: describing Type I and Type II errors in context
A manufacturer tests:
H₀: p = 0.02 versus Hₐ: p > 0.02
where p is the proportion of products that are defective. They reject H₀ if the sample provides strong evidence the defect rate is higher than 2%.
- Type I error (in context): Conclude the defect rate is higher than 2% when in reality it is 2%.
- Type II error (in context): Fail to detect that the defect rate is higher than 2% when in reality it is higher than 2%.
This “put it in words” skill is heavily tested.
A simple power calculation idea (conceptual, with one numeric illustration)
Power calculations can be done using Normal approximations. The underlying idea is:
- Under H₀, p̂ is centered at p₀.
- Under a specific alternative value (say p = pₐ), p̂ is centered at pₐ.
- A rejection region (based on α) cuts off extreme values of p̂.
- Power is the probability that p̂ falls in the rejection region when p = pₐ.
Illustration (one-sided): Suppose you test
H₀: p = 0.50 versus Hₐ: p > 0.50
with n = 100 and α = 0.05. For a right-tailed z test, the critical z value is about 1.645. That corresponds to rejecting H₀ when
p̂ > p₀ + 1.645 · √( p₀(1 − p₀) / n )
Compute the cutoff:
0.50 + 1.645 · √(0.50 × 0.50 / 100) = 0.50 + 1.645 × 0.05 ≈ 0.582
So reject H₀ when:
p̂ > 0.582
If the true proportion were actually p = 0.62, then power is approximately:
P(p̂ > 0.582 when p = 0.62)
Using a Normal model under the alternative:
The standard deviation under p = 0.62 is:
√(0.62 × 0.38 / 100) ≈ 0.0485
Compute the z-value for the cutoff under the alternative:
z = (0.582 − 0.62) / 0.0485 ≈ −0.78
So power is approximately P(Z > −0.78), which is about 0.78. Interpreting that: if the true proportion is 0.62, this test would reject about 78% of the time.
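That calculation generalizes directly. The Python sketch below assumes, for concreteness, a right-tailed test of a null proportion 0.50 with n = 100 and critical z ≈ 1.645 (values consistent with the 0.78 figure above); the helper name is ours:

```python
import math

def normal_cdf(x):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_right_tailed(p0, p_true, n, crit_z=1.645):
    """Approximate power of a right-tailed one-proportion z test."""
    sd_null = math.sqrt(p0 * (1 - p0) / n)
    cutoff = p0 + crit_z * sd_null           # reject H0 when p_hat > cutoff
    sd_alt = math.sqrt(p_true * (1 - p_true) / n)
    z_alt = (cutoff - p_true) / sd_alt       # cutoff on the alternative's scale
    return 1 - normal_cdf(z_alt)             # P(p_hat > cutoff | p = p_true)

print(round(power_right_tailed(0.50, 0.62, 100), 2))  # 0.78
print(round(power_right_tailed(0.50, 0.62, 200), 2))  # 0.96: bigger n, more power
```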
You’re not always asked to compute power numerically on AP, but you are expected to understand what affects it and how to describe errors.
Exam Focus
- Typical question patterns:
- “Describe a Type I error and a Type II error in context for this test.”
- “If the significance level is lowered, what happens to Type I error risk and power?”
- “How could you increase the power of this test?”
- Common mistakes:
- Describing errors without context (you must say what wrong conclusion you reached about p).
- Thinking α is the probability of a Type I error no matter what (it’s the long-run Type I error rate when H₀ is true).
- Claiming that increasing n reduces both Type I and Type II errors automatically (Type I error risk is controlled by α, but power typically increases with n).
Confidence Intervals and Tests for the Difference of Two Proportions
Inference for proportions often comes in two closely related forms:
- Confidence intervals estimate a parameter.
- Significance tests assess evidence for a claim.
They are connected: a two-sided test at significance level α lines up with a confidence interval at confidence level 1 − α (for example, α = 0.05 pairs with a 95% interval).
How confidence intervals connect to significance tests (one proportion)
A confidence interval gives a range of plausible values for p. For a one-proportion z interval, the standard form is:
p̂ ± z* · √( p̂(1 − p̂) / n )
Key point: the interval uses p̂ in the standard error, not p₀. That’s because intervals estimate p based on what the sample suggests.
Connection to a two-sided test:
- If a confidence interval for p **does not include** p₀, then a two-sided test at level α would reject H₀: p = p₀.
- If it does include p₀, you would fail to reject at that α.
This connection is extremely useful for interpretation, but be careful: it aligns most cleanly for two-sided tests.
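The duality is easy to see numerically. A short Python sketch (helper name ours) using the earlier one-sample data, 132 of 200 with a claimed 60%:

```python
import math

def one_prop_ci(x, n, z_star=1.96):
    """One-proportion z interval (95% by default); SE uses p_hat, not p0."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_star * se, p_hat + z_star * se

lo, hi = one_prop_ci(132, 200)
print(round(lo, 3), round(hi, 3))  # 0.594 0.726
print(lo <= 0.60 <= hi)  # True: 0.60 is a plausible value, so a two-sided
                         # test at alpha = 0.05 fails to reject it
```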
Moving to two proportions: what changes?
Often you want to compare two groups: new vs. old method, treatment vs. control, Group A vs. Group B.
Define parameters:
- p₁ = true proportion of successes in population 1
- p₂ = true proportion of successes in population 2
The parameter of interest is usually:
p₁ − p₂
From two independent samples:
- Sample 1: X₁ successes out of n₁, so p̂₁ = X₁/n₁
- Sample 2: X₂ successes out of n₂, so p̂₂ = X₂/n₂
Conditions for two-proportion inference
You typically check:
- Random: two random samples, or a randomized experiment with two groups.
- Independent groups: the two samples/groups are independent (no matching pairs here).
- 10% condition: each sample is less than 10% of its population if sampling without replacement.
- Large counts: expected success/failure counts are at least 10 in each group.
For a confidence interval, you check counts using p̂₁ and p̂₂:
n₁p̂₁ ≥ 10, n₁(1 − p̂₁) ≥ 10, n₂p̂₂ ≥ 10, n₂(1 − p̂₂) ≥ 10
For a significance test of H₀: p₁ = p₂, you often check large counts using a pooled estimate (explained next).
Confidence interval for p₁ − p₂ (two-proportion z interval)
The standard error for an interval uses separate sample proportions:
SE = √( p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂ )
Then the interval is:
(p̂₁ − p̂₂) ± z* · SE
Interpretation tip: A confidence interval for p₁ − p₂ that is entirely positive suggests p₁ > p₂; entirely negative suggests p₁ < p₂.
Significance test for p₁ − p₂ (two-proportion z test)
A common null hypothesis is:
H₀: p₁ = p₂
which is equivalent to H₀: p₁ − p₂ = 0.
Under the null, it makes sense to assume both groups share a common true proportion, so you estimate that common value using the pooled proportion:
p̂_c = (X₁ + X₂) / (n₁ + n₂)
Then the standard error under H₀ is:
SE = √( p̂_c(1 − p̂_c) · (1/n₁ + 1/n₂) )
And the test statistic is:
z = (p̂₁ − p̂₂) / SE
This “pooled vs. unpooled” distinction is one of the most tested technical details:
- Intervals: unpooled standard error (uses p̂₁ and p̂₂ separately).
- Tests with H₀: p₁ = p₂: pooled standard error (uses p̂_c).
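The pooled/unpooled distinction fits in two small functions. A Python sketch (function names ours):

```python
import math

def se_unpooled(x1, n1, x2, n2):
    """Standard error for a two-proportion CI: separate sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_pooled(x1, n1, x2, n2):
    """Standard error for a test of H0: p1 = p2: pooled proportion."""
    pc = (x1 + x2) / (n1 + n2)
    return math.sqrt(pc * (1 - pc) * (1 / n1 + 1 / n2))

# Data from the worked example below: 40/200 vs 65/210.
print(round(se_unpooled(40, 200, 65, 210), 4))  # 0.0426
print(round(se_pooled(40, 200, 65, 210), 4))    # 0.0431
```

The two values are close here; they can differ more noticeably when the sample proportions are far apart.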
Worked example: two-proportion z interval and test
A school tries a new email reminder system to reduce late homework.
- Group 1 (new system): 40 of 200 students turned in late homework at least once.
- Group 2 (old system): 65 of 210 students turned in late homework at least once.
Let p₁ be the true proportion of students who would be late with the new system, and p₂ the true proportion with the old system.
(A) Confidence interval for p₁ − p₂
Compute sample proportions:
p̂₁ = 40/200 = 0.20 and p̂₂ = 65/210 ≈ 0.3095
Difference:
p̂₁ − p̂₂ ≈ 0.20 − 0.3095 = −0.1095
Standard error (unpooled):
SE = √( 0.20(0.80)/200 + 0.3095(0.6905)/210 )
Compute pieces:
0.20(0.80)/200 = 0.0008 and 0.3095(0.6905)/210 ≈ 0.001018
So:
SE ≈ √0.001818 ≈ 0.0426
For a 95% confidence interval, use z* = 1.96.
Margin of error:
1.96 × 0.0426 ≈ 0.0836
Interval:
−0.1095 ± 0.0836
Approximately:
(−0.193, −0.026)
Interpretation: You are 95% confident the true proportion late under the new system is between 2.6 and 19.3 percentage points lower than under the old system.
(B) Significance test for p₁ − p₂
Suppose you test whether the new system reduces late homework:
H₀: p₁ = p₂ versus Hₐ: p₁ < p₂
Pooled proportion:
p̂_c = (40 + 65) / (200 + 210) = 105/410 ≈ 0.2561
Standard error under null:
SE = √( 0.2561(0.7439)(1/200 + 1/210) )
Compute:
0.2561(0.7439) ≈ 0.1905 and 1/200 + 1/210 ≈ 0.009762
So:
SE ≈ √0.001860 ≈ 0.0431
Test statistic:
z = (0.20 − 0.3095) / 0.0431 ≈ −2.54
Left-tailed p-value is P(Z ≤ −2.54), which is about 0.0055.
Conclusion at α = 0.05: the p-value is much smaller than 0.05, so reject H₀. There is convincing evidence that the new email reminder system reduces the proportion of students who turn in late homework.
Notice how the confidence interval and test agree: the interval for p₁ − p₂ was entirely negative, consistent with rejecting H₀.
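Both calculations can be verified end to end in a few lines of Python (plain arithmetic plus `math.erf` for the Normal tail):

```python
import math

# Group 1 (new system): 40 of 200; Group 2 (old system): 65 of 210.
x1, n1, x2, n2 = 40, 200, 65, 210
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2                                    # about -0.1095

# (A) 95% interval: unpooled standard error.
se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = diff - 1.96 * se_ci, diff + 1.96 * se_ci
print(round(lo, 3), round(hi, 3))                 # -0.193 -0.026

# (B) Left-tailed test of H0: p1 = p2: pooled standard error.
pc = (x1 + x2) / (n1 + n2)                        # 105/410
se_test = math.sqrt(pc * (1 - pc) * (1 / n1 + 1 / n2))
z = diff / se_test
p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)
print(round(z, 2), round(p_value, 4))             # -2.54 0.0055
```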
What can go wrong in two-proportion problems
- Mixing up pooled and unpooled standard errors: pooled for tests (when null says equal), unpooled for intervals.
- Confusing independence: two-proportion z methods assume independent groups. If the same individuals are measured twice or matched, that’s a different procedure (matched pairs).
- Interpreting a CI backwards: a 95% CI does not mean “95% of individuals are in this range.” It’s about plausible values of the parameter.
Exam Focus
- Typical question patterns:
- “Construct and interpret a confidence interval for p₁ − p₂.”
- “Test whether the proportions differ (or whether one is larger) using a two-proportion z test.”
- “Use the confidence interval to assess a claim about a difference in proportions.”
- Common mistakes:
- Using the pooled proportion in a confidence interval standard error.
- Forgetting to define p₁ and p₂ in context (AP scoring often requires parameter definition).
- Concluding causation from two samples when the design is observational (causal language is best reserved for randomized experiments).