Unit 2: Exploring Two-Variable Data

Asking Two-Variable Questions and Choosing the Right Display

In Unit 1 you learned how to describe a single variable. In Unit 2, the focus shifts to relationships between two variables: how one variable behaves as another changes, or whether two categories tend to occur together. A two-variable (bivariate) data set records two pieces of information for each individual (person, object, event). For example, for each student you might record hours studied and test score, or for each car you might record age and resale value.

Types of variables determine the tools you use

The first skill is identifying the type of each variable.

A categorical variable places individuals into groups (brand, gender, yes/no, region). A quantitative variable is numerical with meaningful arithmetic (height, time, income). The type combination determines appropriate displays and summaries.

Variable 1   | Variable 2   | Common displays                                  | Main numerical summaries
Categorical  | Categorical  | Two-way table, segmented bar chart               | Conditional relative frequencies, differences in proportions
Quantitative | Quantitative | Scatterplot                                      | Correlation, regression line, residuals, r^2
Categorical  | Quantitative | Side-by-side boxplots (often), dotplots by group | Differences in centers/spreads by group

AP Statistics Unit 2 emphasizes:
1) Categorical vs categorical relationships using two-way tables and conditional distributions.
2) Quantitative vs quantitative relationships using scatterplots, correlation, and linear regression.

Explanatory vs response variables (the “direction” of the relationship)

Often, you want to see whether one variable helps explain or predict another. The explanatory variable is the variable you think might influence or predict (often labeled x). The response variable is the outcome you care about (often labeled y). Choosing explanatory vs response is guided by the question; swapping them changes the interpretation of a regression equation.

Example: if you’re studying how outside temperature relates to electricity usage, temperature is a natural explanatory variable and electricity usage is the response.

Association is not the same as causation

A key theme throughout the unit is association: patterns of relationship in data. A strong association does not automatically mean one variable causes the other.

A lurking variable is a variable not included in your analysis that helps explain the relationship you see. Confounding occurs when two variables’ effects are mixed together so you can’t separate them.

Exam Focus

Typical question patterns include identifying variable types and picking an appropriate display (scatterplot vs two-way table), identifying explanatory and response variables from a context, and explaining why an observed association does or does not imply causation.

Common mistakes include treating “associated” as meaning “causes,” automatically calling the horizontal-axis variable “explanatory” without reading the context, and using quantitative tools (correlation/regression) on numerically coded categories (like 1 = yes, 0 = no) without considering whether the arithmetic is meaningful.

Categorical vs Categorical: Two-Way Tables, Joint and Marginal Information, and Conditional Distributions

When both variables are categorical, you’re looking for whether the category for one variable is related to the category for the other. The central tool is a two-way table (also called a contingency table).

Two-way tables: counts, table total, and cell questions

A two-way table organizes counts for combinations of two categorical variables. The grand total of all cell values is called the table total.

It’s also common to name one variable the row variable and the other the column variable. This matters because many questions specify whether you should condition on a row or on a column.

Example 2.1: “The Cuteness Factor” (structure and what you can compute)

A Japanese study had 250 volunteers look at pictures of cute baby animals, adult animals, or tasty-looking foods, and then measured their level of focus while solving puzzles. This produces a two-way table where “pictures viewed” could be treated as the row variable and “level of focus” as the column variable.

A question like “What percent of the people in the survey viewed tasty foods and had a medium level of focus?” is asking for a joint relative frequency:

\text{joint percent} = \frac{\text{count in the (tasty foods, medium focus) cell}}{\text{table total}} \times 100\%

Marginal frequencies and marginal distributions

The standard method of analyzing two-way table data often begins by calculating totals for each row and each column. These totals appear in the right and bottom margins of the table and are called marginal frequencies (or marginal totals).

If you convert the marginal totals into proportions or percentages, you get marginal distributions. For instance, you can find the marginal distribution of the column variable (level of focus) by dividing each column total by the table total. Similarly, you can find the marginal distribution of the row variable (pictures viewed) by dividing each row total by the table total.

Marginal distributions can be displayed with bar graphs. These bar graphs summarize one variable alone and do not, by themselves, describe the relationship between the two variables.

Conditional relative frequencies (the core skill for association)

Raw counts can be misleading when groups have different sizes. The idea of “relationship” here is captured by conditional distributions: how the distribution of one variable changes when you condition on a category of the other.

A conditional relative frequency is a proportion computed within a subgroup, such as:

  • “Among students who play a sport, what proportion have a job?”
  • “Among students who do not play a sport, what proportion have a job?”

If those conditional proportions are meaningfully different, that suggests an association between the variables.

Example: computing conditional distributions

Suppose your two-way table is:

           | Job: Yes | Job: No | Total
Sport: Yes |       30 |      70 |   100
Sport: No  |       45 |      55 |   100
Total      |       75 |     125 |   200

Conditional distribution of Job status among Sport: Yes:

  • Job: Yes proportion = 30/100 = 0.30
  • Job: No proportion = 70/100 = 0.70

Conditional distribution of Job status among Sport: No:

  • Job: Yes proportion = 45/100 = 0.45
  • Job: No proportion = 55/100 = 0.55

Because 0.30 and 0.45 differ, Job status appears associated with Sport participation.
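These conditional distributions are simple enough to compute by hand, but a short script makes the bookkeeping explicit. A minimal sketch in Python, with the counts hard-coded from the table above:

```python
# Two-way table counts from the example: rows = Sport (Yes/No), columns = Job (Yes/No).
table = {
    "Sport: Yes": {"Job: Yes": 30, "Job: No": 70},
    "Sport: No":  {"Job: Yes": 45, "Job: No": 55},
}

def conditional_distribution(row):
    """Distribution of Job status conditional on one Sport category."""
    counts = table[row]
    total = sum(counts.values())          # row total, not the table total
    return {job: count / total for job, count in counts.items()}

cond_yes = conditional_distribution("Sport: Yes")  # {'Job: Yes': 0.30, 'Job: No': 0.70}
cond_no = conditional_distribution("Sport: No")    # {'Job: Yes': 0.45, 'Job: No': 0.55}
```

Dividing within each row (rather than by the table total of 200) is exactly what makes these conditional rather than joint proportions.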

Segmented bar charts: seeing conditional distributions

A segmented bar chart (a 100% stacked bar chart) visually compares conditional distributions. Each bar represents a group (like Sport: Yes vs Sport: No), and the segments show the conditional percentages for the other variable. The key idea is that conditional distributions are comparisons within each group, so your bars should be scaled to percentages, not counts.

Independence vs association

Two categorical variables are independent if knowing the category of one does not change the conditional distribution of the other. In practice, you rarely conclude perfect independence; instead, you argue whether the conditional distributions are “about the same” or “meaningfully different,” using context.

Simpson’s paradox (why aggregation can mislead)

Simpson’s paradox occurs when a trend appears in several groups of data but disappears or reverses when the groups are combined. If a lurking variable changes group sizes or baseline rates, the overall (marginal) association can be misleading.

A classic situation is comparing success rates across two treatments while ignoring a third variable like severity of illness. Within each severity group, Treatment A may outperform Treatment B, but overall Treatment B appears better because it was used more often on easier cases. The habit to build is: when a conclusion seems surprising, consider whether a third variable might be driving it.
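The reversal is easy to verify numerically. A sketch in Python, using hypothetical (successes, patients) counts chosen to produce the paradox; the treatment labels and numbers are illustrative, not from the text:

```python
# Hypothetical (successes, patients) counts for two treatments, split by severity.
# The numbers are chosen so the within-group and overall comparisons disagree.
data = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Within each severity group, Treatment A has the higher success rate.
within_group_winner = {
    group: max(("A", "B"), key=lambda t: rate(*data[group][t]))
    for group in data
}

# Aggregated over both groups, Treatment B looks better, because A was
# used mostly on the harder (severe) cases.
a_successes = sum(data[g]["A"][0] for g in data)
a_patients  = sum(data[g]["A"][1] for g in data)
b_successes = sum(data[g]["B"][0] for g in data)
b_patients  = sum(data[g]["B"][1] for g in data)

overall_a = rate(a_successes, a_patients)  # 273/350 = 0.78
overall_b = rate(b_successes, b_patients)  # 289/350, about 0.83
```

The driver of the reversal is visible in the group sizes: Treatment A is applied mostly to severe cases, dragging down its overall rate even though it wins within every group.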

Exam Focus

Typical question patterns include computing and comparing conditional relative frequencies to judge association, interpreting a segmented bar chart in context, finding and interpreting marginal totals/distributions, and explaining how Simpson’s paradox could occur and what variable might be lurking.

Common mistakes include comparing marginal totals when the question asks for conditional comparisons, using counts instead of proportions when group sizes differ, and claiming independence because proportions are “not exactly equal” rather than arguing “about the same” vs “meaningfully different.”

Quantitative vs Quantitative: Scatterplots and Describing Association

When both variables are quantitative, the go-to display is a scatterplot, which shows how paired numerical values move together. These are also called bivariate quantitative data sets.

Scatterplots: what they show

A scatterplot places one variable on the horizontal axis and the other on the vertical axis, plotting one point per individual. You typically put the explanatory variable x on the horizontal axis and the response variable y on the vertical axis.

Scatterplots give an immediate visual impression of a possible relationship and can reveal patterns that single-number summaries can hide: curved relationships, clusters, outliers, and changes in variability.

Example 2.2: speed vs strength (comic book characters)

Speed (measured on a 20-point scale) versus strength (measured on a 50-point scale) for 17 comic book heroes and villains can be plotted in a scatterplot to judge whether there appears to be a linear association.

How to describe a scatterplot (form, direction, strength, unusual features, context)

A strong AP Stats description is organized and specific. You should discuss:

  • Direction: positive or negative.
    • Positively associated: larger values of one variable tend to be associated with larger values of the other.
    • Negatively associated: larger values of one variable tend to be associated with smaller values of the other.
  • Form: linear or nonlinear (curved), and note any clusters.
  • Strength: weak, moderate, or strong, based on how close the points are to the form (especially to a straight line if the form is linear).
  • Unusual features: outliers and clusters.

All descriptions must mention context, not just “positive” or “negative.”

Why form matters: linear vs nonlinear

Many tools later in the unit (correlation, least-squares regression, r^2) are designed for linear relationships. If the form is curved, those tools can give misleading impressions. Before computing anything, ask whether a line looks like a reasonable summary.

Outliers and influential-looking points

An outlier in a scatterplot is a point far from the overall pattern. Outliers matter because they can dramatically change correlation and regression results, may indicate data errors, or may represent a meaningful special case. A good practice is to verify the point, consider a contextual explanation, and analyze with and without it when appropriate.

Example: describing a scatterplot in words

If you plot car age (years) vs resale value (dollars) and see points trending downward roughly along a line with one very expensive outlier (a collectible car), a strong description would be: “There is a moderately strong negative linear association between age and resale value: older cars tend to sell for less. The pattern is roughly linear with some scatter, and there is one outlier corresponding to a much higher resale value than expected for its age, likely a collectible model.”

Exam Focus

Typical question patterns include describing a scatterplot using form, direction, strength, and unusual features in context, and deciding whether a linear model is appropriate.

Common mistakes include describing only “positive/negative” and forgetting the other features, calling a relationship “strong” because the slope is steep (strength is about scatter, not steepness), and ignoring curvature and proceeding with linear methods anyway.

Correlation: Measuring Linear Association

A scatterplot is visual; correlation provides a numerical measure of how strong a linear relationship is. In either case, evidence of a relationship is not evidence of causation.

The correlation coefficient r

The correlation coefficient r measures the direction and strength of the linear association between two quantitative variables. It is always between -1 and 1.

  • r near 1: strong positive linear association.
  • r near -1: strong negative linear association.
  • r near 0: weak linear association, though there may still be a strong nonlinear relationship.

A commonly used formula (in terms of means and standard deviations) is:

r = \frac{1}{n-1}\sum\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)

An equivalent form is:

r = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y}

Because it is built from standardized scores (z-scores), correlation is unit-free.
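The z-score formula above translates directly into code. A minimal sketch in Python (the data values are made up for illustration):

```python
import math

def correlation(xs, ys):
    """r = (1/(n-1)) * sum of z_x * z_y, using sample standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)

# Unit-free: converting x from inches to centimeters leaves r unchanged.
r_rescaled = correlation([2.54 * x for x in xs], ys)
```

Because each observation enters only through its z-scores, any positive linear rescaling of either variable cancels out, which is why r has no units.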

Key properties of r

These properties are heavily tested.

1) The sign matches direction: positive association gives positive r; negative association gives negative r.

2) No units: changing units (inches to centimeters) does not change r.

3) Insensitive to linear transformations: adding a constant to all x values or multiplying by a positive constant does not change r. Multiplying by a negative constant flips the sign because it reverses direction.

4) Symmetry: the formula does not distinguish between which variable is called x and which is called y, so interchanging the variables does not change r.

5) Not resistant: r is strongly affected by outliers and influential points.

6) Only measures linear strength: a strong curve can still have r near 0, and a correlation close to -1 or 1 does not automatically mean a linear model is the most appropriate model.

Correlation and unusual points (vertical outliers and leverage)

A point with an unusual y value compared to the pattern (a vertical outlier) often weakens r. A point with an extreme x value can have high leverage and can pull the regression line, which can also change correlation. If r seems surprising, look back at the scatterplot for outliers or clusters.
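A quick numerical experiment shows how a single vertical outlier weakens r. A sketch in Python with made-up data (a perfectly linear pattern plus one unusual point):

```python
import math

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Ten points that lie exactly on the line y = 2x + 1, so r = 1.
xs = list(range(1, 11))
ys = [2 * x + 1 for x in xs]
r_clean = correlation(xs, ys)

# One vertical outlier: x = 5 is typical, but y = 40 is far above the pattern.
r_outlier = correlation(xs + [5], ys + [40])  # noticeably smaller than r_clean
```

One added point drops the correlation from 1 to roughly 0.5, which is why checking the scatterplot before trusting r is standard practice.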

Worked example: interpreting r

If calculator output gives r = -0.82 for absences vs course grade, then the negative sign indicates that as absences increase, grades tend to decrease, and the magnitude suggests a strong linear association. A good sentence is: “There is a strong negative linear association between absences and course grade; students with more absences tend to have lower grades.”

Exam Focus

Typical question patterns include interpreting a given r in context (direction and strength), explaining why r might be small even when variables are strongly related (nonlinear pattern), and discussing how an outlier would affect r.

Common mistakes include saying “r = 0 means no relationship” (it means no linear relationship), treating r as a slope, and ignoring the effect of outliers and leverage.

Linear Regression and Least Squares: Building a Predictive Model with a Line

When a scatterplot shows an approximately linear pattern, you can summarize the relationship with a regression line that models how the response variable tends to change as the explanatory variable changes.

The regression equation and prediction

A linear regression model predicts y from x. The predicted value is written \hat{y}.

\hat{y} = a + bx

Here, a is the y-intercept and b is the slope.

Interpreting slope and intercept in context

Interpreting the parameters is more important than doing algebra.

  • Slope b: the predicted change in y for each 1-unit increase in x.
  • Intercept a: the predicted y when x = 0, which is only meaningful if x = 0 is within the data’s range and makes sense in context.

The least-squares regression line (LSRL)

There are many possible lines. The least-squares regression line is the one that minimizes the sum of the squared vertical differences between observed values and predicted values.

Define the residuals as vertical errors (formal definition appears in the residual section). Least squares chooses the line that minimizes:

\sum (y_i-\hat{y}_i)^2

A useful and provable fact is that the least-squares line always passes through the point of averages:

\left(\bar{x},\bar{y}\right)

A useful point-slope form centered at the means is:

\hat{y}-\bar{y} = b(x-\bar{x})

How the LSRL connects to summary statistics

Regression is closely tied to means, standard deviations, and correlation. Using s_x and s_y as standard deviations and r as correlation:

b = r\frac{s_y}{s_x}

a = \bar{y} - b\bar{x}

These relationships imply:

  • The sign of b matches the sign of r.
  • The line always passes through \left(\bar{x},\bar{y}\right).
  • Slope has units of “y-units per x-unit.”

A powerful interpretation in standard deviation units is: each 1-standard-deviation increase in x corresponds to a predicted change of r standard deviations in y (an increase when r is positive, a decrease when r is negative).

If you graph z-scores for y against z-scores for x, the regression line has slope exactly r, and the equation becomes:

\hat{z}_y = r z_x

This also explains special cases mentioned often in conceptual questions:

  • If r = 1, then for each s_x increase in x, predicted y increases by s_y.
  • If r = 0.4, then for each s_x increase in x, predicted y increases by 0.4s_y.

Predicting x from y uses a different regression line

Regression depends on which variable is treated as explanatory. The regression line for predicting x from y has slope:

b_{x|y} = r\frac{s_x}{s_y}

Swapping explanatory and response changes the equation and how you interpret it.

Worked example: interpreting and using a regression equation

If the regression line predicting test score from hours studied is:

\hat{y} = 52 + 6.5x

Slope: for each additional hour studied, predicted test score increases by about 6.5 points.

Intercept: a student who studies 0 hours is predicted to score about 52 points (only meaningful if 0 hours is plausible and within the data range).

Prediction for 4 hours:

\hat{y} = 52 + 6.5(4) = 78
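The prediction step is just substitution. A one-function sketch in Python for this hypothetical study-hours model (the score of 85 below is an invented data point for illustration):

```python
def predict_score(hours):
    """Predicted test score from hours studied, using y-hat = 52 + 6.5x."""
    return 52 + 6.5 * hours

predicted = predict_score(4)  # 52 + 6.5 * 4 = 78.0

# A residual compares an actual value with the prediction:
# a hypothetical student who studied 4 hours and scored 85 beat the model by 7.
residual = 85 - predicted
```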

Example 2.4: close friends and evening Facebook checks (complete worked analysis)

A sociologist surveys 15 teens. Let X be number of “close friends” and Y be number of times Facebook is checked every evening.

X | 25 | 23 | 30 | 25 | 20 | 33 | 18 | 21 | 22 | 30 | 26 | 26 | 27 | 29 | 20
Y | 10 | 11 | 14 | 12 |  8 | 18 |  9 | 10 | 10 | 15 | 11 | 15 | 12 | 14 | 11

1) Variables: explanatory X is close friends; response Y is evening Facebook checks.

2) A scatterplot of the 15 points gives a visual impression.

3) Description: the relationship appears linear, positive, and strong.

4) Calculator regression line:

\hat{y} = -1.73 + 0.5492x

Interpret slope: each additional close friend is associated with an average increase of about 0.5492 evening Facebook checks.

5) With r = 0.8836, we have r^2 = 0.78. Interpretation: 78% of the variation in evening Facebook checks is accounted for (explained, predicted) by the linear relationship with number of close friends.

6) Predict for 24 close friends:

\hat{y} = -1.73 + 0.5492(24) = 11.45

According to the model, students with 24 close friends are predicted to average about 11.45 evening Facebook checks.
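The calculator results in Example 2.4 can be reproduced from the formulas in this section. A sketch in Python using the data above:

```python
import math

# Data from Example 2.4: close friends (x) and evening Facebook checks (y).
x = [25, 23, 30, 25, 20, 33, 18, 21, 22, 30, 26, 26, 27, 29, 20]
y = [10, 11, 14, 12, 8, 18, 9, 10, 10, 15, 11, 15, 12, 14, 11]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n            # (25.0, 12.0)
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx                    # slope, about 0.5492
a = mean_y - b * mean_x          # intercept, about -1.73
r = sxy / math.sqrt(sxx * syy)   # about 0.8836

y_hat_24 = a + b * 24            # prediction for 24 close friends, about 11.45
```

The computed slope, intercept, correlation, and prediction match the reported values (0.5492, -1.73, 0.8836, 11.45) up to rounding, and the line passes through the point of averages (25, 12).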

Example 2.8 and Example 2.9: predictions from summary statistics (process)

In a context where x is movie theater attendance and y is boxes of popcorn sold, you may be given summary statistics (such as \bar{x}, \bar{y}, s_x, s_y, and r) and told the relationship is roughly linear.

  • Example 2.8 asks for predicted boxes of popcorn sold when attendance is 250 and when attendance is 295. The method is to compute:

b = r\frac{s_y}{s_x}

a = \bar{y}-b\bar{x}

Then use \hat{y} = a + bx.

  • Example 2.8 also highlights the slope for predicting x from y:

b_{x|y} = r\frac{s_x}{s_y}

  • Example 2.9 uses the same summary statistics to predict attendance when 160 boxes of popcorn are sold and when 184 boxes are sold, using the regression of x on y.
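The asymmetry between the two regressions is easy to check numerically. A sketch in Python with made-up summary statistics (not the actual values from Examples 2.8 and 2.9, which are not reproduced here):

```python
# Hypothetical summary statistics (not the textbook's actual values).
r, sx, sy = 0.9, 40.0, 25.0

b_y_given_x = r * sy / sx   # slope for predicting y from x: 0.5625
b_x_given_y = r * sx / sy   # slope for predicting x from y: 1.44

# The product of the two slopes is r^2, not 1, so the two lines
# genuinely differ unless |r| = 1.
slope_product = b_y_given_x * b_x_given_y   # 0.81 = r^2
```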

Exam Focus

Typical question patterns include interpreting slope and intercept in context, using a regression equation to make a prediction, explaining what “least squares” means (minimizes squared residuals), using the mean/SD/correlation relationships (including standard-units ideas), and recognizing that predicting x from y requires a different line.

Common mistakes include interpreting slope backwards (mixing up which variable changes), forgetting units, treating the intercept as meaningful when x = 0 is outside the data, and assuming you can swap x and y without changing the regression model.

Residuals: Measuring and Diagnosing Prediction Errors

A regression line summarizes a trend, but individual points usually don’t lie exactly on the line. The gap between what the line predicts and what actually happened is the residual.

Residuals: definition and meaning

A residual is:

\text{residual} = y - \hat{y}

A positive residual means the model underestimated the actual response value (the point is above the line). A negative residual means the model overestimated the actual response value (the point is below the line). When the regression line is graphed on the scatterplot, the residual is the vertical distance from the point to the regression line.

A key fact about the LSRL: residuals sum to zero

For the least-squares regression line, the sum of the residuals (and thus the mean residual) is always zero.

Example: residual table and the “sum to zero” property

Consider the following observed and predicted values and residuals.

x           |    30 |    90 |    90 |    75 |    60 |    50
y           |   185 |   630 |   585 |   500 |   430 |   400
\hat{y}     | 220.3 | 613.3 | 613.3 | 515.0 | 416.8 | 351.3
y - \hat{y} | -35.3 |  16.7 | -28.3 | -15.0 |  13.2 |  48.7

Adding residuals gives:

-35.3 + 16.7 - 28.3 - 15.0 + 13.2 + 48.7 = 0

This is true in general for least squares.
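A short check in Python, using the observed and predicted values from the table above:

```python
# Observed and predicted values from the residual table above.
y     = [185, 630, 585, 500, 430, 400]
y_hat = [220.3, 613.3, 613.3, 515.0, 416.8, 351.3]

# residual = y - y-hat for each point.
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]
# [-35.3, 16.7, -28.3, -15.0, 13.2, 48.7], up to floating-point rounding

# For an LSRL the residuals sum to zero (here, to within rounding in y-hat).
total = sum(residuals)
```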

Residual plots: checking whether linear regression is appropriate

A residual plot graphs residuals (vertical axis) against the explanatory variable x (horizontal axis). For a good linear model, residuals show random scatter around 0 with roughly equal spread across x.

Warning signs include a curved residual pattern (nonlinear relationship), fanning out (non-constant variability), clusters or gaps (possible subgroups or missing categorical variable), and an extreme point (potential outlier or influential point).

Worked example: interpreting a residual in context

If a model predicts a house price of \hat{y} = 310 (in thousands of dollars) but the actual price is y = 340, then:

\text{residual} = 340 - 310 = 30

Interpretation: the house sold for 30 thousand dollars more than predicted by the model.

Why residual plots beat correlation for diagnostics

It’s possible to have a moderately high correlation and a plausible-looking regression line, yet still have a clear curved pattern in a residual plot showing the linear model is inappropriate.

Exam Focus

Typical question patterns include computing and interpreting residuals, identifying outliers using residuals, using residual plots to judge linearity and constant spread, and explaining what patterns mean.

Common mistakes include using \hat{y}-y instead of y-\hat{y}, interpreting residuals as percent error when not asked, and claiming that residuals near 0 mean the model is “perfect” (patterns matter more than individual points).

Strength of a Linear Model: Coefficient of Determination r^2

Correlation r tells you how strong the linear association is, but r^2 often gives a more interpretable “percent of variability explained.”

What r^2 means

The coefficient of determination is r^2, the square of the correlation coefficient.

Interpretation: r^2 is the proportion of the variability in the response variable y that is explained by the linear relationship with the explanatory variable x (using the regression model).

Example: if r = 0.80, then:

r^2 = 0.64

So about 64% of the variation in y is explained by the linear relationship with x.

Variance partition viewpoint (what r^2 is “a ratio” of)

One way to describe r^2 is as the ratio of the variance of the predicted values to the variance of the observed response values: it is the proportion of the y-variance that is predictable from knowledge of x via the linear regression model.

Another common identity expresses r^2 as 1 minus the proportion of unexplained variance (using sums of squares):

r^2 = 1-\frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2}
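Both descriptions give the same number for a least-squares fit, which a short Python check confirms on a small made-up data set:

```python
import math

# Small made-up data set to check the two descriptions of r^2 agree.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 4, 6, 7]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b = sxy / sxx            # LSRL slope
a = my - b * mx          # LSRL intercept
y_hat = [a + b * xi for xi in x]

# r^2 as 1 minus the unexplained fraction of the variability in y.
ss_residual = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
ss_total = sum((yi - my) ** 2 for yi in y)
r_squared_from_ss = 1 - ss_residual / ss_total

# r^2 as the squared correlation.
r = sxy / math.sqrt(sxx * ss_total)
r_squared_from_r = r ** 2
```

The two quantities agree (up to floating-point error) precisely because the line is the least-squares line; for any other line the sum-of-squares version would be smaller.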

“Explained” does not mean “caused”

“Explained” means the model reduces typical prediction error compared with a baseline like predicting \bar{y}. It does not justify causal conclusions.

Also, r^2 is always between 0 and 1 even when the slope is negative, because it is a squared correlation.

Worked examples

If a regression has r^2 = 0.72 for fuel efficiency y (mpg) on vehicle weight x (pounds), then “About 72% of the variability in fuel efficiency among these vehicles is explained by the linear relationship with vehicle weight.” The remaining 28% is due to other factors and natural variability.

Example 2.3: college football points vs yards

If the correlation between Total Points Scored and Total Yards Gained is r = 0.84, then:

r^2 = (0.84)^2 = 0.7056

Interpretation: 70.56% of the variation in Total Points Scored can be accounted for by (predicted by, explained by) the linear relationship with Total Yards Gained, and 29.44% remains unexplained.

A common reminder when working backward is that if you compute r from r^2, r may be positive or negative and will always take the same sign as the regression slope.

Exam Focus

Typical question patterns include computing r^2 from r, interpreting r^2 in context (explicitly referencing variability in the response variable), and explaining what the “unexplained” percent means.

Common mistakes include interpreting r^2 as “percent of points on the line,” using causal language (“caused by”) from regression output, and forgetting that r can be positive or negative even though r^2 is nonnegative.

Outliers, Leverage, Influential Points, and the Dangers of Extrapolation

Linear regression is powerful but can be fragile. Certain points can dominate the fitted line, and predictions can become unreliable when you go beyond the observed data.

Regression outliers (large residuals)

In a scatterplot, regression outliers are points falling far away from the overall pattern, meaning they have relatively large discrepancies between the observed response and the predicted response. In residual terms, a point is a regression outlier if its residual is an outlier among the residuals.

Example 2.5: GPA vs weekly TV time (outliers vs “not an outlier in regression context”)

In a scatterplot of GPA versus weekly television time, two regression outliers are identified by direct observation: one student who watches 5 hours weekly yet has only a 1.5 GPA, and another who watches 25 hours weekly yet has a 3.0 GPA.

The point (30, 0.5) illustrates an important nuance: while 30 may be an outlier in the television-hours variable and 0.5 may be an outlier in the GPA variable considered separately, the point (30, 0.5) is not an outlier in the regression context if it follows the straight-line pattern.

Leverage (extreme x values)

A point has high leverage if its x value is far from the mean \bar{x}. Such a point has strong potential to change the regression line. If it lines up with the overall pattern, it may not change the line much, but it can strengthen correlation and r^2.

Influence (impact on the fitted model)

An influential point is one whose removal would noticeably change the regression line (slope/intercept) or correlation. High leverage points are often influential, but not always. A point may have a small residual yet still be highly influential if its x value is extreme.

Example: point A vs point B (influence is about change)

Consider a scatterplot with six points and a regression line. Removing point A greatly changes the regression line, while removing point B does not. Point A is influential and point B is not, even if point A was closer to the original regression line than point B.

Example 2.7: one separated point in four scenarios

Consider four scatterplots each with a cluster plus one separated point.

  • In A, the additional point has high leverage (its x is much greater than \bar{x}), a small residual, and does not appear influential.
  • In B, the additional point has high leverage, likely a small residual, and is very influential; removing it would change the slope dramatically (to close to 0).
  • In C, the additional point has some leverage, has a large residual (a regression outlier), and is somewhat influential; removing it would change the slope to more negative.
  • In D, the additional point has no leverage (its x is close to \bar{x}), has a large residual (a regression outlier), and is not influential; removing it would have little effect on the slope.

Extrapolation: predicting outside the data range

Extrapolation means using a regression model to predict for x values outside the range of observed data. It’s dangerous because the linear trend may not continue outside the data range. Good statistical language is: “This is an extrapolation, so the prediction may not be reliable.”

Worked example: recognizing extrapolation

If your data include students who studied between 1 and 6 hours, predicting a score for 12 hours studied is extrapolation. Even if the computed \hat{y} looks reasonable, the relationship may level off (scores cap at 100) or change (diminishing returns).

Exam Focus

Typical question patterns include identifying interpolation vs extrapolation, explaining how high leverage can affect the line and r, comparing regression output with and without a point to judge influence, and distinguishing outlier vs leverage vs influence.

Common mistakes include assuming extrapolation is safe because “the equation works for any x,” calling any outlier “influential” without considering leverage and impact, and confusing large residual (outlier) with influence (impact on the fitted model).

Interpreting Regression Output and Communicating Conclusions

Regression output is often provided and you’re asked to interpret it in context. Typical output includes the regression equation \hat{y} = a + bx, correlation r, coefficient of determination r^2, and sometimes a residual plot.

Writing a complete regression interpretation

A complete interpretation often includes:
1) Direction and form (from the scatterplot): is linear reasonable?
2) Slope meaning (rate of change, with units)
3) Strength (using r and/or r^2, interpreted correctly)
4) Limitations (outliers, extrapolation, lurking variables)

Example: putting it all together

Suppose monthly water use y (millions of gallons) is modeled from average monthly temperature x (degrees) for 50 cities:

\hat{y} = 12.4 + 0.85x

r = 0.78

r^2 = 0.61

A strong write-up: “There is a moderately strong positive linear association between average temperature and water use. The model predicts that for each 1-degree increase in average temperature, monthly water use increases by about 0.85 million gallons on average. About 61% of the variability in water use among these cities is explained by the linear relationship with temperature. Predictions should be used cautiously for temperatures outside the observed range, and other factors such as population and water restrictions may also affect water use.”

Correlation and regression do not prove causation

Even with a strong regression model, observational data can show association due to confounding. Even in experiments, regression describes a pattern, but causal wording should match the design.

A safe default in AP writing is to use “is associated with,” “tends to,” “predicts,” or “is related to,” and to use “causes” only when the situation is a well-designed randomized experiment and the question invites causal language.

Exam Focus

Typical question patterns include interpreting slope, r, and r^2 from output, writing a context-based conclusion that includes limitations, and deciding whether causation is justified based on study design.

Common mistakes include writing an r^2 interpretation that doesn’t mention variability in the response variable, using causal language for observational data, and interpreting the intercept when it has no contextual meaning.

Nonlinear Relationships and Transformations (Including Logs)

Not all relationships are linear. Many natural and social processes are curved, and forcing a line can be misleading.

Recognizing nonlinearity

Clues from a scatterplot include a clear curve (U-shape, inverted U, exponential-looking rise, leveling off) and changing spread. Clues from a residual plot include a curved pattern around 0, indicating systematic over- and under-prediction.

When nonlinearity is present, correlation and linear regression can understate or misrepresent the relationship.

Transformations: the idea

A transformation changes the scale of one or both variables to make the relationship more nearly linear. Useful transformations often use the log or ln buttons on a calculator to create new variables. Sometimes powers (like square root) are used, and transforming the response can help stabilize spread.

The goal is to find a model whose assumptions match the pattern so predictions and interpretations are more trustworthy.

Log transformations (common and useful)

Log transformations are especially useful when:

  • the response grows or decays by multiplicative (percent) changes,
  • the scatterplot looks exponential,
  • variability increases with the level of the response (fanning out), and logging y stabilizes it.

Conceptually, a linear model on \log(y) corresponds to a multiplicative model on y, so slope interpretations change.

Example: why a log transformation can help

If you plot years since a product launch x vs number of users y and see rapid early growth that suggests multiplicative change, plotting \log(y) vs x may look more linear. You might conclude: “After taking the log of the number of users, the relationship with time appears roughly linear, suggesting an exponential growth pattern on the original scale.”

Example 2.10: population growth over time (high r^2 can still hide nonlinearity)

Consider the data:

Year, x               | 1980 | 1990 | 2000 | 2010 | 2020
Population (1000s), y |   44 |   65 |  101 |  150 |  230

A linear model can produce a very large coefficient of determination (for instance, 94.3% of variability in population accounted for by the linear model). However, the scatterplot and residual plot can still indicate that a nonlinear relationship would be an even stronger model, motivating a transformation (often a log transformation) to achieve better linearity.
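A quick check in Python on the Example 2.10 data shows the effect of logging y (base-10 logs are used here; natural logs would work the same way):

```python
import math

# Data from Example 2.10: year and population (thousands).
year = [1980, 1990, 2000, 2010, 2020]
pop = [44, 65, 101, 150, 230]

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Linear fit on the original scale: r is already high (r^2 near 0.943)...
r_linear = correlation(year, pop)

# ...but log(y) vs x is even closer to a perfect line, consistent with
# roughly exponential (multiplicative) growth on the original scale.
r_log = correlation(year, [math.log10(p) for p in pop])
```

Even though the linear model already "explains" about 94% of the variability, the log-scale correlation is closer still to 1, which is the numerical signature of the curvature that the scatterplot and residual plot reveal.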

Cautions with transformations

Transformations help but add responsibilities: always state the scale being used, interpret slope and residuals on the transformed scale correctly, and discuss predictions on the original scale when the context requires it.

Exam Focus

Typical question patterns include using scatterplots and residual plots to argue that a linear model is not appropriate, describing how a transformation (often log or ln) could improve linearity or equalize spread, and interpreting the general meaning of a transformation in context (for example, multiplicative growth).

Common mistakes include using correlation to claim a relationship is weak when the pattern is strongly curved, applying a transformation without explaining why (connect it to curvature or changing spread), and forgetting that transformations change the meaning of slope and residuals.