AP Statistics: Unit 3 Data Collection Mastery

Fundamentals of Data Collection

Statistics works like a bridge. On one side, we have a massive group we want to know about (the population). On the other, we have the small group we actually talk to (the sample). The structural integrity of that bridge depends entirely on how we choose that sample. If the bridge is built poorly (bad sampling), any conclusions drawn from the data will collapse under scrutiny.

Planning a Study: Populations and Samples

Before calculating a single mean or standard deviation, you must define who or what you are studying.

Key Definitions

Population: The entire group of individuals we want information about. We want to know the truth about this group.
Census: A study that attempts to collect data from every individual in the population. While ideal, this is usually too expensive, time-consuming, or impossible.
Sample: A subset of individuals in the population from which we actually collect data.
Inference: The process of drawing conclusions about a population based on data from a sample.

Diagram showing the relationship between Population, Sample, and Inference

Generalizability

The golden rule of inference is Generalizability.

Rule: You can only generalize your results to the population from which the sample was randomly selected.

Example: If you want to know the average height of all high school seniors in the US, but you only take a random sample of seniors from one specific school, you cannot generalize the results to the whole country. You can only generalize to that specific school.

Randomized Sampling Methods

To minimize bias and allow for valid inference, we must use randomization. Random sampling lets chance, rather than human choice, select the individuals. This guards against selection bias and tends to produce a sample that is representative of the population.

Simple Random Sample (SRS)

This is the "Gold Standard" of sampling against which others are compared.

Definition: A sample of size $n$ is chosen in such a way that every group of $n$ individuals in the population has an equal chance to be selected as the sample.
How to execute on the AP Exam: You must be specific.
1. Label: Assign a unique number to every individual in the population from 1 to $N$.
2. Randomize: Use a random number generator (RNG) or a table of random digits to select $n$ unique numbers.
3. Select: Identify the individuals corresponding to the generated numbers.
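
The three steps above can be sketched in Python. This is only an illustration, not something the AP Exam requires: the roster of 500 students, the function name simple_random_sample, and the fixed seed are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Label individuals 1..N, draw n unique labels, return the matching individuals."""
    rng = random.Random(seed)           # seed is optional; fixed here only for repeatability
    N = len(population)
    labels = range(1, N + 1)            # Step 1 (Label): one unique number per individual
    chosen = rng.sample(labels, n)      # Step 2 (Randomize): n unique labels, no repeats
    return [population[label - 1] for label in sorted(chosen)]  # Step 3 (Select)

# Hypothetical roster of N = 500 students
students = [f"Student {i}" for i in range(1, 501)]
sample = simple_random_sample(students, n=20, seed=1)
print(sample)
```

Because random.sample draws without replacement, no individual can be chosen twice, which matches the "unique numbers" requirement in step 2.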

Stratified Random Sampling

Use this when the population contains distinct groups that are likely to differ on the variable you are measuring.

Concept: Divide the population into homogeneous groups called strata (individuals within a stratum are similar to each other regarding the variable of interest). Then, perform an SRS within each stratum and combine them.
Why use it?: It reduces sampling variability (making estimates more precise) and ensures all subgroups are represented.
Example: To survey student opinion on a new lunch policy, you divide the school by grade level (Freshmen, Sophomores, Juniors, Seniors) because seniors might feel very differently than freshmen. You randomly select 25 students from each grade.
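
The lunch policy example can be sketched as follows. The code assumes a hypothetical school with 300 students per grade; the names stratified_sample and school are made up for illustration.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a separate SRS of n_per_stratum individuals from every stratum, then combine."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))  # SRS within this stratum
    return sample

# Hypothetical school: 300 students in each grade level (the strata)
grades = ["Freshmen", "Sophomores", "Juniors", "Seniors"]
school = {g: [f"{g} #{i}" for i in range(1, 301)] for g in grades}

lunch_sample = stratified_sample(school, n_per_stratum=25, seed=2)
print(len(lunch_sample))  # 25 students from each of the 4 grades
```

Every grade contributes exactly 25 students, so no grade can be left out by chance.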

Cluster Sampling

Use this primarily for logistical efficiency/convenience.

Concept: Divide the population into heterogeneous groups called clusters. Ideally, each cluster is a "mini-population" that mirrors the diversity of the whole. Randomly select a few clusters, and then perform a census on everyone within those selected clusters.
Why use it?: It creates efficiency. It's easier to survey all homes on 5 randomly selected city blocks than 5 random homes scattered across the whole city.
Example: A stadium has 50 sections. You randomly select 3 sections and survey every single person sitting in those 3 sections.
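
A minimal sketch of the stadium example, assuming (hypothetically) that each of the 50 sections holds 40 people. The function name cluster_sample and the seat labels are illustrative only.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters, then include everyone inside the chosen clusters."""
    rng = random.Random(seed)
    chosen_ids = rng.sample(sorted(clusters), n_clusters)  # random selection of clusters
    sample = [person for cid in chosen_ids for person in clusters[cid]]  # census within each
    return chosen_ids, sample

# Hypothetical stadium: 50 sections with 40 people each
stadium = {s: [f"Section {s}, seat {k}" for k in range(1, 41)] for s in range(1, 51)}

chosen_sections, surveyed = cluster_sample(stadium, n_clusters=3, seed=3)
print(chosen_sections, len(surveyed))
```

Note that randomness enters only when the sections are chosen. Inside a chosen section, nobody is skipped.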

Systematic Random Sampling

Concept: Select a sample from an ordered arrangement of the population by randomly selecting one of the first $k$ individuals and choosing every $k$-th individual thereafter.
Formula: If population size is $N$ and desired sample size is $n$, then $k \approx N/n$.
Condition: You must ensure there is no repeating pattern in the population list that matches your interval $k$, or bias will occur.
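
One way to sketch the procedure in Python is shown below. It assumes the interval is rounded down to a whole number ($k = \lfloor N/n \rfloor$) and uses a hypothetical ordered list of 1000 customers; both choices are for illustration only.

```python
import random

def systematic_sample(ordered_population, n, seed=None):
    """Pick a random start among the first k individuals, then take every k-th one."""
    rng = random.Random(seed)
    N = len(ordered_population)
    k = N // n                    # interval, assumed here to be N/n rounded down
    start = rng.randrange(k)      # random 0-based position among the first k individuals
    return ordered_population[start::k][:n]  # trim to n in case N is not a multiple of n

# Hypothetical ordered list: customers numbered 1..1000 in order of arrival
customers = list(range(1, 1001))
sys_sample = systematic_sample(customers, n=50, seed=4)
print(sys_sample[:5])
```

With $N = 1000$ and $n = 50$, the interval is $k = 20$, so consecutive selected customers are always 20 positions apart.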

Visual comparison of SRS, Stratified, Cluster, and Systematic sampling methods

Comparison Table: Stratified vs. Cluster

Feature	Stratified Sampling	Cluster Sampling
Groups	Homogeneous (Alike within)	Heterogeneous (Diverse within)
Selection	Random selection from every group	All individuals from random groups
Slogan	"Some from All"	"All from Some"
Goal	Precision (lower variability)	Efficiency (time/money)

Sources of Bias in Sampling

Bias is not just random error; it is a systematic tendency to favor certain outcomes. If a study is biased, increasing the sample size does not fix it. It just repeats the mistake on a larger scale.

1. Sampling Method Bias (Bad Design)

These occur because of how the sample was chosen (usually non-random).

Voluntary Response Bias: Occurs when individuals can choose whether to participate. Usually, only those with strong (often negative) opinions feel compelled to respond.
- Example: A radio host asks listeners to call in. The results will not represent the general public.
Convenience Sampling: Choosing individuals who are easiest to reach.
- Example: Asking the first 20 people who walk into the library about their reading habits (over-represents readers).

2. Implementation Bias (Bad Execution)

These can occur even if you planned to use a random sample.

Undercoverage: When some members of the population cannot be chosen because they are not in the sampling frame (the list from which the sample is drawn).
- Example: Conducting a phone survey using landlines only (misses people who only have cell phones).
Nonresponse Bias: When individuals chosen for the sample cannot be contacted or refuse to participate, and those who do respond differ systematically from those who do not.
- Note: This is different from voluntary response. In nonresponse, you tried to reach a specific person and failed.
Response Bias: A systematic pattern of incorrect responses. This can be caused by:
- Wording of questions: Leading questions ("Don't you agree that…").
- Interviewer influence: A principal asking students if they have ever cheated.
- Lying: Subjects lying about sensitive topics (drug use, income).

Common Mistakes & Examination Pitfalls

Confusing Stratified and Cluster Sampling:
- Correction: Remember the mnemonics. Stratified: The groups are likely to differ on the variable of interest (e.g., grade levels with different opinions), so you take some from all to make sure every group is represented. Cluster: You want to save time, so you take all from some (e.g., all students in 3 random homerooms).
"Random Sample" vs. "Random Assignment":
- Correction: Random Sampling allows us to generalize to a population (Unit 3). Random Assignment allows us to infer cause-and-effect in experiments (discussed later in Unit 3). Do not mix these up.
Vague Descriptions:
- Correction: On Free Response Questions (FRQs), never just say "put them in a hat." You must say: "Write names on identical slips of paper, put them in a hat, mix well, and draw $n$ slips without replacement."
Bias vs. Variability:
- Correction: Bias is about accuracy (am I aiming at the bullseye?). Variability is about precision (are my shots close together?).
- Large samples have low variability (precise) but can still have high bias if the sampling method is flawed (precisely missing the bullseye).
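
The difference between bias and variability can be seen in a small simulation. Everything in it is an assumption made for illustration: the population is invented (half of it is missing from the sampling frame, as in undercoverage), and "spread" is measured as the standard deviation of 200 repeated sample means.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed only so the run is repeatable

# Hypothetical population: half the individuals are in the sampling frame, half are not.
in_frame = [rng.gauss(70, 5) for _ in range(5000)]
not_in_frame = [rng.gauss(60, 5) for _ in range(5000)]
population = in_frame + not_in_frame
true_mean = statistics.mean(population)

def sample_means(source, n, reps=200):
    """Draw reps random samples of size n from source and return their means."""
    return [statistics.mean(rng.sample(source, n)) for _ in range(reps)]

results = {}
for n in (25, 400):
    for label, source in (("SRS", population), ("undercoverage", in_frame)):
        means = sample_means(source, n)
        bias = statistics.mean(means) - true_mean  # how far off the estimates are on average
        spread = statistics.stdev(means)           # how much the estimates vary between samples
        results[(label, n)] = (bias, spread)
        print(f"{label:>13}, n={n:>3}: bias = {bias:+.2f}, spread = {spread:.2f}")
```

In a run of this sketch, the spread shrinks for both methods as $n$ grows, but the undercoverage estimates stay centered well above the true mean. A larger sample buys precision, not accuracy.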