AP Statistics: Unit 2 Overview - Exploring Two-Variable Data

Introduction to Two-Variable Data and Scatterplots

When we move from analyzing a single variable (univariate data) to analyzing two variables simultaneously (bivariate data), our goal shifts from describing distributions to describing relationships. In AP Statistics, we primarily examine the relationship between two quantitative variables.

Explanatory and Response Variables

Before plotting data, you must distinguish between the roles of the two variables:

Explanatory Variable ($x$): The variable that attempts to explain or predict changes in the other variable. Sometimes called the independent variable. It is plotted on the horizontal axis ($x$-axis).
Response Variable ($y$): The variable that measures the outcome of a study. Sometimes called the dependent variable. It is plotted on the vertical axis ($y$-axis).

Memory Aid: The Explanatory variable is the x-planatory variable (x-axis). The Response variable is the Respon-y-sable variable ($y$-axis).

Constructing Scatterplots

A scatterplot shows the relationship between two quantitative variables measured on the same individuals.

Rules for AP Exam Success:

Label Axes: Clearly label both axes with variable names and units.
Scale: Use consistent numerical scales starting from reasonable values (breaks in axes must be marked if used).
Plot Points: Each individual in the dataset appears as a single point $(x, y)$.

Describing Relationships (DUFS)

When asked to "describe the relationship" or "describe the association" between two variables on the AP exam, you must address four specific characteristics. Use the mnemonic DUFS (Direction, Unusual features, Form, Strength), also written DOFS with O for Outliers.

[Figure: Scatterplot Examples]

Direction:
- Positive: As $x$ increases, $y$ tends to increase.
- Negative: As $x$ increases, $y$ tends to decrease.
- No Association: Changes in $x$ do not predict changes in $y$.
Unusual Features:
- Outliers: Points that fall outside the overall pattern of the rest of the data (in the $x$ direction, $y$ direction, or both).
- Clusters: Distinct groups of points separated by gaps.
Form:
- Linear: The points generally follow a straight line.
- Curved/Non-linear: The points follow a curved pattern (e.g., parabolic, exponential).
Strength:
- How closely the points follow the form.
- Descriptors: Strong, Moderate, Weak.

Standard Sentence Template:

"There is a [Strength], [Direction], [Form] relationship between [Explanatory Variable] and [Response Variable]. There appear to be [Unusual Features]."

Correlation and its Properties

While a scatterplot gives us a visual impression of a relationship, the correlation coefficient gives us a precise numerical measurement.

The Correlation Coefficient ($r$)

The correlation coefficient, denoted by $r$, measures the direction and strength of the linear relationship between two quantitative variables.

Formula:
While you will usually calculate this using technology (TI-84/Nspire), understanding the formula helps conceptually:
$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

Notice that $r$ is essentially the average product of the $z$-scores for the $x$ and $y$ variables (an "average" that divides by $n - 1$ rather than $n$).
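To make the formula concrete, here is a minimal Python sketch that computes $r$ directly from the $z$-scores using only the standard library. The function name `correlation` and the five data points are hypothetical, chosen only for illustration; on the exam you would still use your calculator.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of z-score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    z_products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (len(xs) - 1)

# Hypothetical data: hours studied (x) and quiz score (y) for five students.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = correlation(hours, scores)
print(round(r, 3))  # 0.775
```

Each term in `z_products` is positive when a point is above (or below) average in both variables and negative when it is above average in one and below in the other, which is why the sign of $r$ matches the direction of the association.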

Interpreting $r$

[Figure: Correlation Coefficient Spectrum]

Range: $-1 \leq r \leq 1$
Sign:
- $r > 0$: Positive association.
- $r < 0$: Negative association.
Magnitude (Strength):
- $r = 1$ or $r = -1$: Perfect linear relationship.
- $r = 0$: No linear relationship.

Value of $|r|$	Strength Descriptor
$0.8 \le |r| \le 1.0$	Strong
$0.5 \le |r| < 0.8$	Moderate
$0.0 \le |r| < 0.5$	Weak
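The cutoffs in the table above can be read as a lookup on the absolute value of $r$, with the sign handled separately. The sketch below assumes exactly those cutoffs (0.8 and 0.5); the function names `describe_strength` and `describe_direction` are hypothetical helpers, and the descriptors remain rough guidelines rather than strict rules.

```python
def describe_strength(r):
    """Map a correlation coefficient to the strength descriptor in the table."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)  # strength depends on |r|, not on the sign
    if size >= 0.8:
        return "Strong"
    if size >= 0.5:
        return "Moderate"
    return "Weak"

def describe_direction(r):
    """The sign of r gives the direction of the linear association."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "no linear association"

print(describe_strength(-0.85), describe_direction(-0.85))  # Strong negative
```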

Key Properties of Correlation

Linearity Only: $r$ only measures linear relationships. You can calculate $r$ for curved data, but the number will be misleading. Always describe the Form from the graph first.
Symmetry: Correlation makes no distinction between explanatory and response variables. Swapping $x$ and $y$ axes does not change $r$.
Unit Independence: Because $r$ is calculated using standardized scores ($z$-scores), it has no units. Changing units (e.g., feet to meters) does not change $r$.
Non-Resistant: Correlation is not resistant to outliers. A single outlier can drastically weaken a strong correlation, or make a weak relationship look strong.
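Three of these properties (symmetry, unit independence, and non-resistance) can be checked numerically. The following is a minimal sketch using only the Python standard library; the heights and arm spans are hypothetical values invented for illustration, and `correlation` is a helper defined here, not a library function.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical data: height (x) and arm span (y), both in feet.
height_ft = [5.0, 5.5, 6.0, 6.5, 7.0]
span_ft = [5.1, 5.4, 6.1, 6.4, 7.1]

r_original = correlation(height_ft, span_ft)

# Symmetry: swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(span_ft, height_ft)

# Unit independence: converting heights from feet to meters leaves r unchanged.
height_m = [h * 0.3048 for h in height_ft]
r_meters = correlation(height_m, span_ft)

# Non-resistance: one unusual point (tall person, short arm span) changes r a lot.
r_with_outlier = correlation(height_ft + [7.5], span_ft + [4.0])

print(round(r_original, 2), round(r_swapped, 2), round(r_meters, 2))
print(round(r_with_outlier, 2))
```

In this made-up data set, the first three values agree (about 0.99), while adding the single outlier pulls $r$ down to roughly 0.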

Correlation vs. Causation

This is the most critical conceptual rule in Unit 2.

Observation: Even a very strong correlation ($r$ close to $1$ or $-1$) does not imply that changes in $x$ cause changes in $y$.
Explanation: Association does not imply causation due to the potential presence of lurking variables (variables not included in the study that influence the relationship).
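As an illustration of a lurking variable, the sketch below uses hypothetical weekly data (all numbers invented for this example) in which temperature drives both ice cream sales and drownings. The two outcome variables end up strongly correlated with each other even though neither causes the other. The helper `correlation` is defined here and is not a library function.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical weekly data, invented purely for illustration.
temperature_f = [60, 65, 70, 75, 80, 85, 90]    # lurking variable
ice_cream_sales = [23, 28, 41, 47, 62, 69, 80]  # rises with temperature
drownings = [2, 3, 3, 4, 4, 5, 5]               # also rises with temperature

r_sales_drownings = correlation(ice_cream_sales, drownings)
r_temp_sales = correlation(temperature_f, ice_cream_sales)
r_temp_drownings = correlation(temperature_f, drownings)

print(round(r_sales_drownings, 2))  # strong, yet ice cream does not cause drowning
print(round(r_temp_sales, 2), round(r_temp_drownings, 2))
```

All three correlations come out above 0.9 for this data, but only a study that controls temperature could separate its effect from any direct link between the other two variables.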

Common Mistakes & Pitfalls

Assuming Linearity based on $r$:
- Mistake: "Since $r = 0.05$, there is no relationship."
- Correction: There is no linear relationship. The data could be a perfect parabola (curved).
Confusing "Correlation" with "Association":
- Mistake: Using the word "correlation" to describe categorical data or curved data.
- Correction: Use "correlation" strictly for linear, quantitative relationships. Use "association" for everything else.
Ignoring Context:
- Mistake: "The relationship is strong and positive."
- Correction: "The relationship between height and arm span is strong and positive."
Claiming Causation:
- Mistake: "Because $r=0.98$, studying more causes higher grades."
- Correction: "There is a strong association between studying and grades, but we cannot claim causation without a controlled experiment."
Forgetting Axis Labels:
- Describing a scatterplot or drawing one without units/labels is the fastest way to lose credit on an FRQ.
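The first pitfall above (reading $r \approx 0$ as "no relationship") is easy to verify numerically. This minimal sketch builds a perfect parabola, $y = x^2$, on hypothetical $x$-values symmetric about zero, and shows that $r$ is 0 even though $y$ is completely determined by $x$. The helper `correlation` is defined here for the example.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# A perfect curved relationship: every y is exactly x squared.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_parabola = correlation(xs, ys)
print(abs(round(r_parabola, 6)))  # 0.0: no linear relationship, despite a perfect pattern
```

The positive and negative $z$-score products cancel because the left and right halves of the parabola mirror each other, which is exactly why the form must be judged from the scatterplot rather than from $r$ alone.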