Unit 3: Collecting Data
Populations, Samples, and the Logic of Data Collection
Statistics is about learning something about a big group without measuring every single member. The way you collect data determines what you’re allowed to conclude later. In Unit 3, you learn to think like a designer: you plan how data will be produced so that conclusions are trustworthy.
Population vs. sample (and why the difference matters)
A population is the entire group you want information about. A sample is the subset of the population that you actually measure. This distinction matters because most statistical methods assume your sample is a fair representation of the population. If your sampling process favors certain types of individuals (even unintentionally), your results can look precise while being systematically wrong.
A helpful analogy is tasting soup. The population is the whole pot; the sample is one spoonful. A spoonful can tell you about the pot if it’s taken in a way that mixes the soup fairly. If ingredients settle and you only sample from the top, your taste will be biased.
Census
A census is collecting data from every individual in a population. Censuses can be expensive, time-consuming, and sometimes impractical.
Parameter vs. statistic
A parameter is a numerical value that describes a population (usually unknown). A statistic is a numerical value computed from a sample (known once you take the sample). The core goal of sampling is to use statistics to estimate parameters with as little bias and variability as possible.
Example: If you care about the proportion of all seniors at your school who have a part-time job, that true population proportion is a parameter. If you survey 80 seniors and compute the proportion who have jobs, that number is a statistic.
Sampling frame and who your study is really about
A sampling frame is the list (or method) from which you actually select your sample: a roster, database, directory, or list of addresses. A common real-world problem is that the sampling frame does not perfectly match the population.
Example:
- Population: “All registered voters in a county.”
- Sampling frame: “People with landline numbers in a phone directory.”
Even if you randomly sample from the phone directory, you may systematically miss groups (like younger voters). That isn’t “bad randomness”; it’s a mismatch between frame and target population (undercoverage).
Observational study vs. experiment (preview)
Unit 3 includes two big ways to collect data.
- Sampling and observational studies measure what already exists (no assigning treatments).
- Experiments deliberately impose treatments on individuals and compare results.
This difference determines the strongest kind of conclusion you can draw.
- Random sampling supports generalizing to the population.
- Random assignment supports cause-and-effect conclusions.
Exam Focus
- Typical question patterns:
- Identify the population, sample, parameter, statistic, and sampling frame from a scenario.
- Decide whether a design allows generalization, causation, both, or neither.
- Explain why a sample might be biased even if it is large.
- Common mistakes:
- Confusing “population” with “sample,” especially when the sample is described first.
- Treating a statistic as if it were a parameter (“the survey result is the true proportion”).
- Assuming randomness fixes everything: randomness inside a flawed sampling frame doesn’t remove undercoverage.
Random Sampling and How to Get a Representative Sample
To learn about a population, you want a sample chosen using a chance process and designed to avoid systematic favoritism. In statistics, random does not mean haphazard; it means a known chance mechanism.
What “random” accomplishes
Random sampling aims to produce a representative sample by giving individuals a fair chance to be selected. Randomness doesn’t guarantee a perfect mirror of the population, but it helps prevent consistent patterns of over- or under-representation. It also supports probability-based reasoning later: if the method is truly random, you can quantify how likely it is to see sampling differences just by chance.
A “cards-in-a-box” intuition (and its limitation)
A common intuition is to write each population member’s name on a card, mix thoroughly, and draw a set number of cards. This can give everyone an equal chance, but it’s usually too time-consuming for large populations, and bias can creep in if the mixing is not truly thorough.
Simple Random Sample (SRS)
A simple random sample (SRS) of size n is a sample chosen so that **every possible group** of n individuals in the population has an equal chance of being selected. This is stronger than “each individual has an equal chance”; in an SRS, each _group_ of size n is equally likely.
How to select an SRS (core procedure)
- Label each individual in the population from 1 to N.
- Use a random number generator (calculator/computer) or a random digit table.
- Select n distinct labels, ignoring repeats (sampling without replacement).
You must be able to describe the chance process clearly enough that someone else could replicate it.
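As a sketch, the three-step procedure above can be carried out with Python's standard library (the values N = 80 and n = 10 are illustrative, matching the class example below):

```python
import random

def select_srs(N, n, seed=None):
    """Select a simple random sample of n labels from 1..N.

    random.sample draws without replacement, so repeats never occur
    and every possible group of n labels is equally likely -- the
    defining property of an SRS.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, N + 1), n))

# Illustrative: an SRS of 10 students from a class of 80
print(select_srs(N=80, n=10, seed=42))
```

Sorting the result is cosmetic; the chance process is the draw itself.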
Example 3.3: SRS for an AP Statistics class
Suppose 80 students are taking an AP Statistics course and the teacher wants a random sample of 10 students to try out a practice exam.
Computer method: Assign students numbers 1 through 80. Use a computer to generate 10 random integers between 1 and 80 without replacement (throw out repeats). The sample is the students whose assigned numbers were generated.
Random digit table method: Assign students numbers 01, 02, 03, …, 80. Read two digits at a time from a random digit table, ignoring 00, numbers over 80, and repeats, until you have 10 unique numbers. If the table began:
75425 56573 90420 48642 27537 61036 15074 84675
then the selected students would be numbered 75, 42, 55, 65, 73, 04, 27, 53, 76, and 10. Here, 90 and 86 are ignored because they are over 80, and later repeats (like additional 42s) are ignored.
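The random digit table procedure can also be expressed in code. This sketch reads two digits at a time, skips out-of-range values and repeats, and reproduces the selection above:

```python
def read_digit_table(digits, N, n, width=2):
    """Read `width` digits at a time from a random digit table,
    keeping labels 1..N and skipping out-of-range values (including
    all zeros) and repeats, until n unique labels are found."""
    digits = digits.replace(" ", "")
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        label = int(digits[i:i + width])
        if 1 <= label <= N and label not in chosen:
            chosen.append(label)
        if len(chosen) == n:
            break
    return chosen

table = "75425 56573 90420 48642 27537 61036 15074 84675"
print(read_digit_table(table, N=80, n=10))
# -> [75, 42, 55, 65, 73, 4, 27, 53, 76, 10]
```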
Advantages and disadvantages of SRS
An SRS is easy to interpret and requires minimal advance knowledge of the population besides having a complete sampling frame. It supports statistical inference from the sample to the population, and when repeated many times it tends to produce sample statistics centered around the true parameter value (unbiased in the long run).
However, SRS can be hard to execute for large populations because you need a complete list of all potential subjects. Repeatedly contacting nonrespondents can be time-consuming. Also, by chance, important groups may be underrepresented in a particular SRS.
Other probability sampling methods (often more practical than SRS)
In real studies, SRS can be expensive or inconvenient. AP Statistics emphasizes several designs that still use randomness.
Stratified random sampling
A stratified random sample divides the population into non-overlapping strata that are homogeneous within strata (similar on a characteristic related to the variable of interest). Then you take an SRS within each stratum and combine the results.
This can reduce variability and produce more precise estimates than an SRS of the same size when the variable of interest differs meaningfully across strata.
A common approach is proportional stratified sampling, sampling from each stratum in proportion to its population size. Sometimes you may intentionally oversample a small stratum to get enough data about that subgroup; interpretation must then be careful (often requiring weighting, which AP typically does not ask you to compute in detail).
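As a sketch of proportional allocation, the code below divides a total sample size across strata in proportion to stratum size. The grade-level sizes are hypothetical, not from the text:

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Allocate a total sample size across strata in proportion to
    stratum population sizes (round, then fix any rounding remainder)."""
    N = sum(stratum_sizes.values())
    alloc = {k: round(total_sample * size / N)
             for k, size in stratum_sizes.items()}
    # Rounding can leave the total off by a unit or two; adjust the
    # largest stratum so allocations sum exactly to total_sample
    diff = total_sample - sum(alloc.values())
    if diff != 0:
        largest = max(stratum_sizes, key=stratum_sizes.get)
        alloc[largest] += diff
    return alloc

# Hypothetical strata: grade levels at a school of 2000, sample of 100
sizes = {"freshmen": 600, "sophomores": 550, "juniors": 450, "seniors": 400}
print(proportional_allocation(sizes, 100))
```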
Advantages include reduced variability within strata and clearer comparisons among groups. Disadvantages include implementation difficulty for large populations and the fact that forcing subdivisions when no meaningful strata exist is not helpful.
Cluster sampling
A cluster sample divides the population into clusters that are ideally heterogeneous (each cluster is like a mini-version of the whole population). Then you randomly select clusters and include all individuals (or everything) in the selected clusters.
Cluster sampling is often cheaper and faster, especially for geographically spread-out populations, and with limited fixed funds it can allow a larger sample size than other methods.
The tradeoff is precision: for a given sample size, cluster sampling usually provides less precision than an SRS or stratified sample. If the population doesn’t have natural clusters—or the chosen clusters are not representative—cluster sampling can easily produce bias.
Key stratified vs. cluster distinction:
- Stratified: sample some people from every group.
- Cluster: sample all (or many) people from some groups.
Students often mix these up because both start by forming groups, but the purpose of the groups is opposite.
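The "some from every group" vs. "all from some groups" contrast can be made concrete in code. The four homerooms below are hypothetical:

```python
import random

rng = random.Random(0)

# Hypothetical population: 4 homerooms ("groups") of 5 students each
groups = {
    "A": ["A1", "A2", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
    "C": ["C1", "C2", "C3", "C4", "C5"],
    "D": ["D1", "D2", "D3", "D4", "D5"],
}

# Stratified: take an SRS of 2 students from EVERY group
stratified = [s for members in groups.values() for s in rng.sample(members, 2)]

# Cluster: randomly pick 2 groups, then take ALL students in them
picked = rng.sample(list(groups), 2)
cluster = [s for g in picked for s in groups[g]]

print(sorted(stratified))  # some students from every homeroom
print(sorted(cluster))     # every student from some homerooms
```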
Systematic sampling
In systematic sampling, you list the population in some order, choose a random starting point, and then select every kth individual. This is easy to implement and, when the list has no hidden pattern, often behaves much like a random sample.
- Advantage: If similar members are grouped together on the list, systematic sampling can function a bit like stratified sampling, but more easily implemented.
- Main danger (periodicity): If the list has a repeating pattern that matches the sampling period, the sample can be biased.
Example: If a factory line alternates experienced and new workers and you sample every 2nd worker, you might select only one type.
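The factory-line danger is easy to demonstrate. In this sketch, sampling every 2nd worker from an alternating list captures only one worker type, whichever one the random start lands on:

```python
# Hypothetical assembly line alternating experienced and new workers
line = ["experienced", "new"] * 10  # 20 workers in listed order

def systematic_sample(population, k, start):
    """Select every kth individual beginning at a start in 0..k-1
    (in practice the start is chosen at random)."""
    return population[start::k]

# k = 2 matches the list's period, so each start yields one type only
print(systematic_sample(line, k=2, start=0))  # all "experienced"
print(systematic_sample(line, k=2, start=1))  # all "new"
```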
Multistage sampling
Multistage sampling combines methods in stages, often selecting clusters and then sampling within selected clusters.
Example: A national health survey might:
- Randomly select counties.
- Randomly select neighborhoods within counties.
- Randomly select households within neighborhoods.
- Randomly select an adult within each household.
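A minimal sketch of the first three stages, with made-up county, neighborhood, and household labels:

```python
import random

rng = random.Random(7)

# Hypothetical nested frame: counties -> neighborhoods -> households
counties = {
    f"county{c}": {
        f"nbhd{c}{n}": [f"house{c}{n}{h}" for h in range(4)]
        for n in range(3)
    }
    for c in range(5)
}

# Stage 1: randomly select 2 counties
stage1 = rng.sample(list(counties), 2)
# Stage 2: within each selected county, randomly select 1 neighborhood
stage2 = {c: rng.choice(list(counties[c])) for c in stage1}
# Stage 3: within each selected neighborhood, randomly select 2 households
stage3 = [h for c, n in stage2.items() for h in rng.sample(counties[c][n], 2)]

print(stage3)
```

Randomness is used at every stage, which is what keeps multistage sampling a probability sample.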
Worked example: choosing an appropriate sampling method
Scenario: A school district wants to estimate the average time students spend on homework per night across all middle school students. The district has 8 schools.
An SRS of all middle school students could be ideal but requires a complete district-wide list and can be time-consuming. A cluster sample might randomly select 2 schools and survey all students in those schools; this is cheaper but may be less precise if schools differ a lot. A stratified sample could stratify by school and sample within each school to ensure representation from every school. A strong recommendation (if feasible) is stratifying by school because “school” likely relates to homework load.
Example 3.4: 100 students from a Chicago school of 5000 (Cubs question)
Suppose a sample of 100 high school students from a Chicago school of size 5000 is to be chosen to determine their views on whether the Cubs will win another World Series this century.
Cards-in-a-box method: Have each student write a name on a card, mix, and have the principal draw 100 cards. A concern is whether the cards are mixed well; for instance, if an entire PE class tosses cards in at the same time, cards may clump and the draw may overrepresent that class.
Random number generator SRS method: Number students 1 through 5000 and randomly generate 100 unique integers between 1 and 5000, throwing out repeats.
Random digit table SRS method: Number students 0001 through 5000; read four digits at a time; ignore 0000, repeats, and numbers over 5000 until you have 100 unique numbers.
Alternative procedures (also valid probability samples, if done correctly):
- Systematic: Choose a random starting name on a student list and then select every 50th name.
- Stratified: Use separate lists of freshmen, sophomores, juniors, and seniors and randomly pick 25 from each.
- Cluster: If each homeroom is a random mix of 20 students across all grade levels, randomly pick five homerooms and sample all students in those rooms.
Exam Focus
- Typical question patterns:
- Identify whether a described method is SRS, stratified, cluster, systematic, or multistage.
- Explain why stratified sampling can reduce variability for certain variables.
- Describe an actual random procedure (labels, random generator, handling repeats).
- Common mistakes:
- Calling any “random-looking” method an SRS without checking the definition (equal chance for every group of size n).
- Mixing up strata and clusters (forming groups is not the key; the purpose of the groups is).
- Forgetting a random start in systematic sampling (without it, it’s not a probability sample).
Bias in Sampling: What Can Go Wrong (and How to Explain It)
Bias is the quickest way to invalidate a sample and make meaningful conclusions impossible. A sampling method is biased if it consistently produces samples that do not represent the population, typically favoring certain responses over others.
A crucial point: sampling bias is a property of the method, not of any one particular sample produced by that method.
The big idea: variability vs. bias
- Variability is random sample-to-sample fluctuation.
- Bias is systematic error that consistently pushes results in one direction.
A bigger random sample reduces variability, but it does not fix bias caused by a flawed method or a flawed sampling frame.
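A small simulation makes this concrete. The population below is hypothetical: overall support is exactly 50%, but the sampling frame omits younger voters, whose support differs. Increasing n tightens the estimates around the frame's value, not the population's:

```python
import random

rng = random.Random(1)

# Hypothetical population of 10,000 voters: 50% support overall, but
# younger voters (3000 people, 70% support) are missing from the frame
young = [1] * 2100 + [0] * 900       # 3000 young voters, 70% support
older = [1] * 2900 + [0] * 4100      # 7000 older voters, ~41% support
population = young + older           # overall support = 5000/10000 = 50%
frame = older                        # flawed frame: undercovers the young

def mean_phat(frame, n, reps, rng):
    """Average sample proportion over many random samples of size n."""
    return sum(sum(rng.sample(frame, n)) / n for _ in range(reps)) / reps

# Larger n shrinks variability, but estimates center near the frame's
# proportion (about 0.41), not the population's 0.50: bias remains
for n in (50, 500):
    print(n, round(mean_phat(frame, n, 200, rng), 3))
```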
Common sources of bias (name it and explain it in context)
Undercoverage
Undercoverage happens when some groups in the population are left out of the sampling frame or are less likely to be included.
Example: A survey about student stress sent only to students in honors classes undercovers non-honors students.
Nonresponse bias
Nonresponse occurs when selected individuals can’t be contacted or refuse to participate. Nonresponse bias becomes a problem when nonresponders differ systematically from responders (for example, people working multiple jobs may be harder to reach and may also have different views on economic policy).
Response bias
Response bias happens when respondents give inaccurate answers, often due to social desirability, sensitive topics, interviewer influence, or poorly designed questions.
Voluntary response bias
A voluntary response sample is one where individuals choose themselves to participate (call-in polls, online polls). These overrepresent people with strong opinions and underrepresent people who don’t care much.
Convenience sampling
A convenience sample chooses individuals who are easiest to reach. It can be quick and inexpensive, but generalizing from the sample to the population is usually unjustified.
Quota sampling bias
Quota sampling bias occurs when interviewers are given free choice in picking people to fill pre-set quotas for certain categories, without randomization. This is risky because interviewer choice can systematically favor certain types of respondents.
Question wording bias
Question wording bias occurs when nonneutral, leading, loaded, ambiguous, or poorly worded questions (or even question order) push respondents toward unrepresentative answers. (This overlaps with response bias but is important enough to name explicitly.)
Explaining bias direction (a skill AP likes)
Often, you’re asked whether a result is likely an overestimate or underestimate.
A strong explanation:
- Identify the flaw (undercoverage, voluntary response, nonresponse, etc.).
- Identify which group is overrepresented or underrepresented.
- Explain why that group likely differs in the variable being measured.
- State the direction (overestimate/underestimate) only when you can justify it.
Example: Estimating average sleep by surveying students at 7:00 a.m. in the library may oversample highly motivated students or those with early schedules, potentially biasing results.
Worked example: diagnosing bias (bike lanes)
Scenario: A city wants to estimate support for building more bike lanes. They post an online poll on the city cycling club’s website.
- Sampling flaw: voluntary response (and also undercoverage).
- Who is overrepresented: people already interested in cycling.
- Likely direction: support will likely be overestimated compared with the full city population.
Example 3.2: Military Times online survey
Military Times, in collaboration with the Institute for Veterans and Military Families at Syracuse University, conducted a voluntary and confidential online survey of U.S. service members who were readers of Military Times, with military status verified through Defense Department email addresses.
Possible sources of bias include voluntary response (strongly opinionated people may be more likely to respond) and undercoverage (only Military Times readers are included). Response bias is less likely here because the survey was confidential.
Exam Focus
- Typical question patterns:
- Identify a type of bias from a study description and explain it in context.
- Predict and justify the direction of bias.
- Propose a better sampling method that uses randomness.
- Common mistakes:
- Saying “it’s biased” without naming the mechanism (undercoverage vs. nonresponse vs. response bias).
- Assuming large n removes bias.
- Claiming a direction of bias without a believable contextual reason.
Designing Surveys and Measurements: Asking Questions That Produce Trustworthy Data
Even with a perfect random sample, you can still collect bad data if the measurement process is flawed. A survey doesn’t directly measure attitudes; it measures responses to questions under particular conditions.
What a survey is really measuring
Asking “How many hours do you study per week?” can trigger memory errors, social desirability (inflation), or inconsistent definitions (does “studying” include homework?). Good survey design aims to make the recorded response as close as possible to the underlying truth you care about.
Question wording and structure
Leading questions push respondents toward an answer.
- Example: “Do you agree that responsible students complete at least two hours of homework nightly?”
- Better: “On a typical school night, about how many hours of homework do you complete?”
Loaded questions use emotionally charged language.
- Example: “Should the school waste money on new athletic facilities?”
Double-barreled questions ask about two things at once.
- Example: “Should the cafeteria improve food quality and lower prices?”
- Fix: split into two questions.
Ambiguous wording uses terms like “often,” “regularly,” “enough,” and “many,” which mean different things to different people. A stronger alternative is asking for a specific number (for example, days per week).
Response choices and scales
Answer options can shape results. Watch for unbalanced scales (more positive options than negative), missing options (no neutral or not applicable), and overlapping intervals for numeric ranges. If you ask for income ranges, intervals must not overlap and should cover all reasonable values.
Mode of administration (how the survey is conducted)
- In-person interviews can raise response rates but may increase social desirability bias.
- Phone surveys may have declining response rates and can undercover people without certain phone access.
- Online surveys are fast but risk undercoverage and voluntary response problems unless the sample is carefully recruited.
- Mail surveys can have high nonresponse unless follow-ups are used.
Pilot testing and revision
A pilot study is a small trial run used to detect confusing wording, unexpected answer patterns, and logistical issues. Many survey problems aren’t obvious until real people attempt to answer.
Observational data and measurement error
Not all data comes from surveys. Even direct measurements (height, reaction time, test scores) can suffer from inconsistent procedures, poorly calibrated instruments, or changing conditions (time of day, environment). Measurement error adds noise and can also cause bias if it consistently pushes values up or down.
Worked example: improving a survey question
Original question: “Do you support the school’s plan to increase security by installing surveillance cameras to keep students safe?”
Problems include leading/loaded wording (“to keep students safe” frames opposition as anti-safety).
Improved version: “Do you support or oppose installing additional surveillance cameras in school hallways?” Then use balanced response options (strongly support, somewhat support, neither, somewhat oppose, strongly oppose).
Exam Focus
- Typical question patterns:
- Identify a flaw in survey wording and propose a better revision.
- Describe how a survey mode could create undercoverage, nonresponse, or response bias.
- Explain why a pilot study is useful.
- Common mistakes:
- Calling every flawed question “leading” without specifying whether it’s leading, loaded, ambiguous, double-barreled, or an issue with response options.
- Suggesting a “fix” that introduces a new problem (for example, removing bias but making the question ambiguous).
- Ignoring that a random sample can still yield biased data if measurement is poor.
Sampling Variability: Random Error, Accuracy, and Precision
Even with a good random method, samples naturally vary. Sampling variability (also called sampling error) is the random fluctuation from sample to sample. This variability can be described with probability: we can talk about how likely different-sized errors are. In general, sampling variability decreases as sample size increases.
Accuracy vs. precision (and what bias looks like)
When thinking about repeated sampling results:
- Accuracy means the sampling method tends to center around the true value (low bias).
- Precision means the method produces low spread from sample to sample (low variability).
Low accuracy (a center consistently away from the true value) suggests bias in the selection method.
Shape and variability in the distributions are irrelevant to diagnosing sampling bias. Sampling bias is about where the distribution is centered relative to the true value.
Example 3.5: Interpreting repeated-sample results
Suppose we estimate the mean age of high school teachers using four different sampling methods. We take 10 samples using each method, and later learn the true mean is μ = 42.
- Method A exhibits high accuracy and high precision.
- Method B exhibits high accuracy and low precision.
- Method C exhibits low accuracy and high precision.
- Method D exhibits low accuracy and low precision.
Interpreting these descriptions: methods with low accuracy are likely biased; methods with low precision have high sampling variability.
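The four patterns can be simulated. This sketch is illustrative only: each method's repeated sample means are modeled as draws from a normal distribution whose center (bias) and spread (sampling variability) are made-up values, with 42 as the true mean from Example 3.5:

```python
import random

rng = random.Random(3)
TRUE_MEAN = 42  # true mean teacher age (Example 3.5)

# Illustrative (center, spread) for each method's repeated sample means:
# a shifted center models bias; a large spread models high variability
methods = {
    "A (high accuracy, high precision)": (42, 0.5),
    "B (high accuracy, low precision)":  (42, 4.0),
    "C (low accuracy, high precision)":  (48, 0.5),
    "D (low accuracy, low precision)":   (48, 4.0),
}

for name, (center, spread) in methods.items():
    estimates = [rng.gauss(center, spread) for _ in range(10)]
    avg = sum(estimates) / len(estimates)
    print(f"{name}: mean of 10 sample means = {avg:.1f}")
```

Methods C and D center near 48 no matter how many samples are taken, which is what "low accuracy" (bias) means.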
Exam Focus
- Typical question patterns:
- Distinguish sampling variability (random error) from bias (systematic error).
- Interpret “accuracy vs. precision” language in context of repeated sampling.
- Explain why larger random samples reduce variability but do not automatically fix bias.
- Common mistakes:
- Calling natural sample-to-sample variation “bias.”
- Thinking a tight cluster of results must be good even if it’s centered at the wrong value.
- Using distribution shape to argue about bias instead of focusing on the center.
Observational Studies: Retrospective vs. Prospective, Confounding, and Why Association Is Not Causation
A major theme is matching your conclusion to your design.
Observational study vs. experiment (deep distinction)
In an observational study, researchers observe individuals and measure variables but do not assign treatments; individuals self-select into groups naturally. In an experiment, researchers deliberately assign individuals to treatments and measure responses.
A concise comparison:
| Observational Studies | Experiments |
|---|---|
| Observe and measure without influencing | Impose treatments and measure responses |
| Can only show associations | Can suggest cause-and-effect relationships |
| Use random sampling to generalize to a population | Use random assignment to minimize confounding |
| Usually less expensive and less time-consuming | Can be expensive and time-consuming |
| Use strata (and randomization within strata) to reduce variability | Use blocks (and randomization within blocks) to control variables |
| Fewer ethical concerns, since no treatments are imposed | Possible ethical concerns over imposing treatments |
| | Use of blinding and double-blinding |
Retrospective vs. prospective observational studies
Observational studies aim to gather information without disturbing the population.
- Retrospective studies look backward, examining existing data or past records.
- Prospective studies follow individuals into the future and watch for outcomes.
Advantages and disadvantages
- Retrospective studies tend to be smaller scale, quicker, and less expensive. They are especially useful for rare outcomes because you can start with already-identified cases. However, researchers have less control, often relying on past record keeping done for other purposes, and they may face inaccurate memories and biases.
- Prospective studies usually have greater accuracy in data collection and are less susceptible to recall error, since researchers do their own record keeping and track variables of interest directly. However, they can be expensive and time-consuming because they often follow many subjects for a long time.
Example 3.1: Ebola epidemic (retrospective and prospective)
Retrospective studies of the 2014–2016 Ebola epidemic in West Africa examined the timing, numbers, and locations of reported cases to understand how the outbreak spread. This led to better understanding of transmission through contact with bodily fluids of infected people. Prospective studies include ongoing surveillance to see how improved experience and tools for rapid identification might limit future epidemics.
Explanatory and response variables
- The explanatory variable is the variable you think might influence another.
- The response variable is the outcome you measure.
Example: In a study of caffeine intake and sleep, caffeine is explanatory and sleep hours is response.
Confounding and lurking variables
A confounding variable is related to both the explanatory variable and the response variable and can create a misleading association. Confounding is the main reason observational studies typically cannot support cause-and-effect conclusions.
Example: If students who play more video games have lower grades, a plausible confounder is time management or available free time, which could influence both gaming hours (explanatory) and grades (response).
A lurking variable is a variable not included in the study that influences interpretation of relationships among variables. On AP, “lurking” and “confounding” are often used similarly, but “confounding” is especially central when comparing groups and thinking about causation.
Why association is not causation
An observed association between two variables can occur because:
- The explanatory variable causes changes in the response.
- The response causes changes in the explanatory variable.
- A confounding/lurking variable influences both.
- The association is due to chance (especially with small samples).
Examples: what you can and cannot conclude
Scenario A (observational): Survey students about exercise frequency and stress level.
- You may conclude an association (and generalize if sampling is random).
- You may not conclude exercise reduces stress (causation).
Scenario B (experimental): Randomly assign students to a 4-week exercise program or no program and measure stress.
- A causal conclusion is plausible because random assignment tends to balance other variables.
Example 3.6: SAT review course—observational vs. experimental design
A study is designed to determine whether a commercial review course helps raise SAT scores at a high school.
- Observational study: Interview students who have taken the SAT, ask whether they took the review course, and compare SAT scores of those who did and did not take it.
- Experiment: On students planning to take the SAT, randomly assign some to take the course and others not to, then compare scores.
- The experiment is more appropriate for causation. In the observational study, students who choose to take the course may be more serious students; score differences could be due to seriousness rather than the course.
Exam Focus
- Typical question patterns:
- Decide whether a scenario describes an observational study or an experiment and justify.
- Identify whether an observational study is retrospective or prospective, and give a design-appropriate strength/limitation.
- Identify a plausible confounding variable and explain how it relates to both variables.
- State what conclusions are valid: association only vs. cause-and-effect.
- Common mistakes:
- Using causal language (“leads to,” “improves,” “reduces”) for an observational study.
- Naming a confounder that is not related to both variables.
- Forgetting that random sampling affects generalization, while random assignment affects causation.
Designing Experiments: Treatments, Random Assignment, Control, and Common Designs
Experiments are powerful because they are built to answer causal questions, but that power comes from careful design.
The language of experiments
- Experimental units are the individuals or objects on which treatments are imposed.
- If the units are people, they’re often called subjects.
- A treatment is a specific condition applied to units.
- A factor is an explanatory variable in an experiment.
- A level is a specific value of a factor.
- The response variable is the outcome measured.
Experiments impose levels of factors (creating treatments) and measure responses.
Example 3.7: Factors, levels, treatments
In an experiment to test exercise and blood pressure reduction, volunteers are randomly assigned to 0, 1, or 2 hours of exercise per day, 5 days per week, over the next 6 months.
- Explanatory variable (factor): hours of exercise.
- Levels: 0, 1, and 2 hours/day.
- Response variable: not specified, but could be the measured change in systolic or diastolic blood pressure after 6 months.
If volunteers were also randomly assigned to follow either the DASH or TLC diet for 6 months, there would be two factors (exercise with three levels and diet plan with two levels) and six treatments total:
- DASH with 0 hours, DASH with 1 hour, DASH with 2 hours
- TLC with 0 hours, TLC with 1 hour, TLC with 2 hours
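Crossing factor levels to list treatments is a small computation. This sketch uses the two factors from Example 3.7:

```python
from itertools import product

# Two factors: diet plan (2 levels) and exercise (3 levels)
diets = ["DASH", "TLC"]
exercise_hours = [0, 1, 2]

# Crossing the levels of the factors yields the 2 x 3 = 6 treatments
treatments = [f"{d} with {h} hour(s)" for d, h in product(diets, exercise_hours)]
print(treatments)
print(len(treatments))  # -> 6
```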
Random assignment (not the same as random sampling)
Random assignment uses chance to decide which experimental units get which treatments. It tends to create treatment groups that are similar on both known and unknown variables, strengthening causal conclusions.
Random assignment does not guarantee groups are identical, and it does not guarantee results generalize to a larger population. Generalization requires random sampling.
Control groups (and what counts as a “treatment”)
Many experiments use a control group to serve as a baseline comparison. Control groups can be:
- units given no treatment,
- units given the current standard treatment,
- units given a placebo (inactive treatment made to look like an active one).
Important detail: if you’re asked to list the treatments, a control condition that uses the current standard treatment or a placebo still counts as a treatment.
The three principles of experimental design
- Control: Keep other variables the same across groups so the treatment is the main systematic difference. This can include a control group, placebo, and blinding.
- Randomization: Random assignment, to balance confounders across groups on average.
- Replication: Use enough experimental units per group to reduce chance variation.
Replication refers to having more than one experimental unit in each treatment group, not running the same experiment multiple times.
Placebo effect, blinding, and double-blinding
The placebo effect is the real tendency for people to respond to any perceived treatment (for example, a sugar pill described as a pain reliever).
- Blinding occurs when subjects don’t know which treatment they are receiving.
- Double-blinding occurs when neither the subjects nor the response evaluators (or the researchers interacting with subjects) know who is receiving which treatment.
Completely randomized design
A completely randomized design assigns all units to treatments purely by chance, with no additional structure.
Example 3.8: Warts—cryotherapy vs. duct tape (completely randomized)
Sixty patients ages 5–12 with common warts are enrolled to compare cryotherapy versus duct tape occlusion.
A completely randomized design:
- Label patients 1 to 60.
- Randomly select 30 distinct numbers with a random integer generator, ignoring any repeats (or label the patients 01–60 and use a random digit table, ignoring 00, numbers over 60, and repeats).
- Those 30 receive cryotherapy; the remaining 30 receive duct tape.
A third equivalent randomization option is to put names on identical slips, mix well, and draw 30 slips without replacement.
At the end, compare the proportion in each group with complete resolution of the warts.
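The random integer generator version of this assignment could be carried out in code as follows (a sketch, not part of the original study):

```python
import random

random.seed(60)  # fixed seed for a reproducible illustration

patients = list(range(1, 61))              # label the patients 1 to 60
cryotherapy = random.sample(patients, 30)  # 30 distinct labels, no repeats
duct_tape = [p for p in patients if p not in cryotherapy]

# Every patient lands in exactly one group, 30 per treatment.
```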
Randomized block design
A randomized block design first groups units into blocks based on a variable expected to affect the response, then randomizes treatments within each block. Blocking reduces variability by making comparisons within more similar groups.
A block is not the same as a stratum in sampling, but the idea is similar: group based on a relevant variable. The difference is that blocking is used in experiments (random assignment within blocks), while stratifying is used in sampling.
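Randomizing within blocks can be sketched as follows (the block labels and unit names here are hypothetical):

```python
import random

random.seed(7)  # fixed seed for a reproducible illustration

# Units grouped into blocks by a variable expected to affect the response.
blocks = {
    "low pre-test score":  ["A", "B", "C", "D"],
    "high pre-test score": ["E", "F", "G", "H"],
}

# Random assignment happens separately WITHIN each block.
assignment = {}
for block_name, units in blocks.items():
    shuffled = units[:]
    random.shuffle(shuffled)
    half = len(shuffled) // 2
    for unit in shuffled[:half]:
        assignment[unit] = "treatment 1"
    for unit in shuffled[half:]:
        assignment[unit] = "treatment 2"
```

Each block contributes the same number of units to each treatment, so treatment comparisons are made among similar units.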
Matched pairs design
A matched pairs design (also called a paired comparison design) is a special block design with blocks of size 2. It is done by:
- Giving each subject both treatments in random order (within-subject comparison), or
- Pairing similar subjects and then randomly assigning one treatment to each person within each pair.
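The second version amounts to a coin flip within each pair (a minimal sketch; the subject names are made up):

```python
import random

random.seed(3)  # fixed seed for a reproducible illustration

# Each pair contains two similar subjects; a coin flip decides which
# member of the pair receives treatment A and which receives B.
pairs = [("S1", "S2"), ("S3", "S4"), ("S5", "S6")]

assignment = []
for first, second in pairs:
    if random.random() < 0.5:
        assignment.append({first: "A", second: "B"})
    else:
        assignment.append({first: "B", second: "A"})
```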
Example 3.9: Double-blind wristband experiment
A pressure point on the wrist may help control nausea after medical procedures. To run a double-blind experiment on 50 postoperative patients:
- Assign patients numbers 1 to 50.
- Randomly select 25 distinct numbers, ignoring any repeats, for the “marble over the pressure point” group.
- Put wristbands with marbles on all patients, but for the other 25 patients place the marble not over the pressure point.
- Have a researcher check by telephone at designated intervals to measure nausea.
To make it double-blind, neither the patients nor the researcher evaluating responses by phone should know who has the marble correctly placed.
Worked example: identifying design flaws and improving them
Scenario: A researcher recruits 40 volunteers to test a memory app. The first 20 volunteers use the app for 2 weeks; the next 20 do not. After 2 weeks, all take a memory test.
Problems:
- Assignment isn’t randomized (first 20 vs next 20 can differ systematically).
- Volunteers limit generalization.
Improvements:
- Randomly assign the 40 volunteers to app vs no-app using a random number generator.
- Consider a placebo app if expectations might affect performance.
- If initial memory ability varies, block by a pre-test score and randomize within blocks.
Exam Focus
- Typical question patterns:
- Identify experimental units, treatments, factors/levels, and response from a description.
- Choose an appropriate design (completely randomized vs. block vs. matched pairs) and justify.
- Explain how random assignment, control, and replication support causation.
- Describe how to carry out random assignment (including handling repeats).
- Common mistakes:
- Confusing random assignment with random sampling.
- Calling any comparison study an “experiment” even when treatments weren’t imposed.
- Blocking on a variable that is itself affected by the treatment (blocking must use pre-treatment information).
Interpreting Scope: Generalization, Causation, Practical/Ethical Constraints, and Statistical Significance
Design choices determine what you can legitimately claim. Many AP free-response questions score heavily on whether your conclusion matches the design.
The two big questions
- Can I generalize to the population?
- Can I make a cause-and-effect claim?
When you can generalize
Generalization is strongest when the sample is selected using a random sampling method from a well-defined population with a sampling frame that matches the population.
If the sample is a convenience sample or volunteers, you may learn something about the respondents, but generalizing to the full population is not well supported.
When you can infer causation
Cause-and-effect conclusions are justified when treatments are imposed and assigned using random assignment in a well-designed experiment.
Random assignment is the key ingredient for causation in AP Statistics, although real-world implementation details (noncompliance, dropouts, measurement issues) still matter.
The four common combinations (how to talk about them)
- Random sample + randomized experiment: You can often generalize to the population and make causal conclusions.
- Random sample + observational: You can generalize, but only association (no causation).
- Not random sample + randomized experiment: You can make causal conclusions for the subjects studied, but generalization is limited.
- Not random sample + observational: Neither generalization nor causation is well supported.
Practical issues that affect validity
- Noncompliance: Subjects do not follow assigned treatments. This weakens the clean comparison created by random assignment and can make effects harder to detect.
- Dropouts: If subjects leave and dropout is related to the response (for example, those with side effects drop out), results can become biased (similar in spirit to nonresponse).
- Placebo and experimenter effects: Expectations can change outcomes; blinding helps reduce these.
Replication and generalizability of experimental results
Replication means using enough experimental units per treatment group so that real differences are easier to detect above the background of random variation. Larger group sizes generally make results more convincing because chance variation is smaller.
However, to generalize experimental results to a larger population, it is not enough to randomize assignment; you also need the subjects to be randomly selected from that population.
Ethics in data collection
Ethical constraints limit what can be done, especially in experiments. Key principles include informed consent, confidentiality and privacy, and minimizing harm. Some causal questions can’t be studied by random assignment (for example, you can’t ethically assign people to smoke for decades), so observational evidence is often used instead.
Inference and experiments: statistical significance
Even if treatments make no difference, results will vary by chance. A result is statistically significant when the probability of seeing such a difference just by chance is so small that you’re convinced there must be another explanation.
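This chance-alone reasoning can be made concrete with a re-randomization simulation (the data below are made up for illustration): shuffle the observed responses many times as if treatment made no difference, and see how often a difference as large as the observed one appears.

```python
import random

random.seed(0)  # fixed seed for a reproducible illustration

# Made-up pulse-rate changes for two treatment groups of 8 subjects each.
group1 = [2, 3, 5, 4, 6, 3, 5, 4]
group2 = [1, 0, 2, 1, 1, 2, 0, 1]
observed = sum(group1) / 8 - sum(group2) / 8  # observed difference in means

# If treatment made NO difference, any split of these 16 responses into
# two groups of 8 would be equally likely.
combined = group1 + group2
reps = 10_000
count_as_extreme = 0
for _ in range(reps):
    random.shuffle(combined)
    diff = sum(combined[:8]) / 8 - sum(combined[8:]) / 8
    if diff >= observed:
        count_as_extreme += 1

# A tiny proportion means a difference this large would rarely occur by
# chance alone -- evidence the observed difference is statistically significant.
proportion = count_as_extreme / reps
```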
Example 3.10: Coffee vs. cola vs. herbal tea and pulse rates
Sixty students volunteer for an experiment comparing coffee, caffeinated cola, and herbal tea on pulse rates. Twenty are randomly assigned to each treatment, and the change in pulse rate is measured after consuming eight ounces.
Reasonable conclusions:
- The median change in pulse for cola drinkers is higher than for coffee drinkers, but given the overall spreads, that difference does not appear significant; it’s likely due to random chance.
- Coffee and caffeinated cola show higher pulse-rate changes than herbal tea, with relatively little overlap compared to the overall differences. It is reasonable to conclude this difference is statistically significant: drinking coffee or caffeinated cola results in a greater rise in pulse rate than drinking herbal tea.
Worked example: deciding what can be concluded (sampling + volunteering + random assignment)
Scenario: A state education department randomly samples 60 schools in the state and then asks each school to volunteer two teachers for a study. The volunteers are randomly assigned to use a new curriculum or continue with the old one. Student test scores are compared.
- Causation: Random assignment supports cause-and-effect conclusions about the curriculum’s effect for the participating teachers/students in the study.
- Generalization: Schools were randomly sampled, but teachers were volunteers within schools, weakening generalization to all teachers in all schools.
A strong conclusion mentions both: strong internal validity for causation among participants, weaker external validity for statewide generalization.
Exam Focus
- Typical question patterns:
- State whether results can be generalized to a population and justify using the sampling method.
- State whether cause-and-effect is justified and justify using random assignment/control/replication.
- Identify how nonresponse, noncompliance, dropouts, or placebo/experimenter effects threaten validity.
- Use “statistically significant” in context: explain what it means about chance vs. real effects.
- Common mistakes:
- Saying “randomized experiment” when the study only had random sampling (or vice versa).
- Claiming causation from an observational study because it “seems convincing.”
- Ignoring volunteer/undercoverage issues when asked about generalization.
- Treating any visible difference in sample summaries as automatically statistically significant without considering variability.