Unit 1: Exploring One-Variable Data - Summary Statistics

Measuring Center: The Typical Value

In AP Statistics, the goal of summarizing data is to simplify a complex set of numbers into values that describe the distribution's key features. The first feature we look for is the center—a single number that represents the "typical" value of the dataset.

The Mean (Arithmetic Average)

The mean is the arithmetic average of a distribution. Because it takes the specific value of every observation into account, it is the fundamental balancing point of the distribution (like a fulcrum on a seesaw).

Notation:

Sample Mean: $\bar{x}$ (pronounced "x-bar")
Population Mean: $\mu$ (the Greek letter "mu")

Formula:
For a sample of size $n$ with observations $x_1, x_2, \ldots, x_n$:

$\bar{x} = \frac{\sum x_i}{n}$

Where $\sum$ (sigma) means "sum of."
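As a minimal sketch of the formula above, the Python snippet below adds up the observations and divides by $n$. The function name `sample_mean` and the quiz scores are illustrative assumptions, not part of the course materials.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the observations divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean is undefined for an empty dataset")
    return sum(values) / n


# Hypothetical quiz scores, used only for illustration.
quiz_scores = [4, 8, 6, 5, 7]
x_bar = sample_mean(quiz_scores)
print(x_bar)  # sum is 30 and n is 5
```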

The Median (Midpoint)

The median is the midpoint of the distribution. It is the value such that half the observations are smaller and half are larger. To find the location of the median in an ordered list of $n$ numbers, calculate $\frac{n+1}{2}$.

If $n$ is odd, the median is the value at the exact center.
If $n$ is even, the median is the average of the two center observations.
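The two cases above can be sketched in Python as follows. The helper name `find_median` and the sample data are assumptions made for illustration. Note that the position $\frac{n+1}{2}$ counts from 1, while Python lists count from 0, so the code uses `n // 2` as the index.

```python
def find_median(values):
    """Return the median of a dataset using the odd/even rule."""
    ordered = sorted(values)  # the data must be ordered first
    n = len(ordered)
    if n == 0:
        raise ValueError("the median is undefined for an empty dataset")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole number, i.e. index mid.
        return ordered[mid]
    # Even n: position (n + 1) / 2 ends in .5, so average the two center values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = find_median([7, 2, 9, 4, 6, 5, 20])  # n = 7, position 4
even_median = find_median([3, 1, 7, 5])           # n = 4, position 2.5
print(odd_median, even_median)
```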

Robustness (Resistance) and Skewness

A critical concept in Unit 1 is understanding how extreme values (outliers) affect these measures.

Non-Resistant (Non-Robust): The Mean is sensitive to extreme values. One massive outlier pulls the mean toward it.
Resistant (Robust): The Median is resistant to extreme values. A billionaire walking into a room of teachers changes the mean income drastically but barely touches the median income.
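To make the billionaire example concrete, here is a small sketch using Python's standard `statistics` module. The teacher salaries and the billionaire's income are made-up values chosen only to show the effect.

```python
import statistics

# Hypothetical annual incomes for five teachers (illustrative values).
teachers = [52_000, 55_000, 58_000, 60_000, 63_000]
mean_before = statistics.mean(teachers)
median_before = statistics.median(teachers)

# A hypothetical billionaire walks into the room.
room = teachers + [1_000_000_000]
mean_after = statistics.mean(room)
median_after = statistics.median(room)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

With these values the mean jumps from tens of thousands to well over one hundred million, while the median moves by only one thousand.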

Relationship between Shape and Center:

Distribution Shape	Relationship
Symmetric	$\text{Mean} \approx \text{Median}$
Skewed Right (Tail right)	$\text{Mean} > \text{Median}$
Skewed Left (Tail left)	$\text{Mean} < \text{Median}$

Figure: Comparison of mean and median on skewed distributions
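The relationships in the table are rules of thumb rather than guarantees, but they are easy to check on small datasets. The sketch below uses three made-up datasets (one symmetric, one with a long right tail, one with a long left tail) and the standard `statistics` module.

```python
import statistics

# Illustrative datasets; the values are assumptions chosen to show each shape.
symmetric = [1, 2, 3, 4, 5]
skewed_right = [1, 2, 2, 3, 3, 3, 4, 15]        # long tail to the right
skewed_left = [1, 12, 13, 13, 13, 14, 14, 15]   # long tail to the left

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    print(name, "mean =", statistics.mean(data),
          "median =", statistics.median(data))
```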

Measuring Variability: The Spread

Describing the center is not enough; we must also describe how spread out the data is. Is the data consistent (low variability) or volatile (high variability)?

The Range

The most basic measure of spread. It is a single number, not an interval.

$\text{Range} = \text{Max} - \text{Min}$

Note: Like the mean, the Range is non-resistant. A single outlier increases the range significantly.
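A short sketch of this behavior, assuming a hypothetical helper `data_range` and made-up data: adding one extreme value more than doubles the range.

```python
def data_range(values):
    """Return the range as a single number: maximum minus minimum."""
    return max(values) - min(values)


# Illustrative data, without and with one extreme value.
without_outlier = [2, 4, 5, 6, 7, 9]
with_outlier = without_outlier + [20]

range_without = data_range(without_outlier)
range_with = data_range(with_outlier)
print(range_without, range_with)
```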

The Interquartile Range (IQR)

The IQR measures the range of the middle 50% of the data. Because it ignores the upper and lower 25% of data points (which is where outliers live), the IQR is a resistant measure of spread.

$IQR = Q_3 - Q_1$

$Q_1$ (First Quartile): The median of the lower half of the data.
$Q_3$ (Third Quartile): The median of the upper half of the data.
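The "median of each half" definition can be sketched as below. The function names are assumptions, and the code follows the convention of leaving the overall median out of both halves when $n$ is odd. Calculators and software packages sometimes use other quartile conventions, so their results may differ slightly from this hand method.

```python
def _median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return _median_of_sorted(lower), _median_of_sorted(upper)


q1, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])  # illustrative data
iqr = q3 - q1
print(q1, q3, iqr)
```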

Standard Deviation and Variance

The standard deviation is the most common measure of spread when using the mean as the center. It measures the typical distance of the values from the mean.

Formulas:

Variance ($s^2_x$): The average squared distance from the mean.

$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$

Standard Deviation ($s_x$): The square root of the variance (returns the unit to the original scale).

$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$

Why divide by $n-1$?
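Dividing by $n-1$ (the degrees of freedom) instead of $n$ makes the sample variance $s_x^2$ an unbiased estimator of the population variance $\sigma^2$. Deviations are measured from $\bar{x}$, which is computed from the same sample and therefore sits closer to the data than $\mu$ does, so dividing by $n$ would tend to underestimate the true population spread.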
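The formulas for variance and standard deviation can be sketched in Python as follows. The function names and the dataset are illustrative assumptions. The standard library's `statistics.variance` also divides by $n-1$, so it is used here as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance, back in the original units."""
    return math.sqrt(sample_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data with mean 5
var_n_minus_1 = sample_variance(data)
var_n = sum((x - 5) ** 2 for x in data) / len(data)  # dividing by n instead

print(var_n_minus_1, sample_sd(data))
print(var_n)  # smaller than the n - 1 version
```

For this dataset the squared deviations add up to 32, so dividing by $n-1 = 7$ gives a slightly larger variance than dividing by $n = 8$.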

Properties of Standard Deviation:

$s_x \ge 0$. It is only 0 if all numbers in the dataset are identical.
Best used for roughly symmetric distributions, where the mean is an appropriate measure of center.
Non-resistant. Outliers squared create massive contributions to variance, inflating the standard deviation.
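Two of these properties can be checked quickly with `statistics.stdev` from the standard library. The datasets below are made up for illustration.

```python
import statistics

constant = [5, 5, 5, 5]               # every value identical
typical = [2, 4, 5, 6, 7, 9]          # no extreme values
with_outlier = typical + [20]         # same data plus one extreme value

sd_constant = statistics.stdev(constant)
sd_typical = statistics.stdev(typical)
sd_with_outlier = statistics.stdev(with_outlier)

print(sd_constant)                    # no spread at all
print(sd_typical, sd_with_outlier)    # one outlier inflates the SD
```

In this example a single extreme value more than doubles the standard deviation.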

Boxplots and the Five-Number Summary

The Five-Number Summary

The five-number summary marks the boundaries that split the ordered dataset into four parts, each containing roughly 25% of the observations.

Minimum
First Quartile ($Q_1$)
Median
Third Quartile ($Q_3$)
Maximum
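A minimal sketch of computing the five-number summary by hand-style rules is shown below. The function name `five_number_summary` is an assumption, and quartiles are taken as the medians of the lower and upper halves, excluding the overall median when $n$ is odd.

```python
def _median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0],
            _median_of_sorted(lower),
            _median_of_sorted(ordered),
            _median_of_sorted(upper),
            ordered[-1])


summary = five_number_summary([2, 4, 5, 6, 7, 9, 20])
print(summary)
```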

The 1.5 $\times$ IQR Rule for Outliers

In AP Statistics, you cannot simply say a point "looks" like an outlier. You must justify it mathematically.

Outlier Boundaries (Fences):

Lower Fence: $Q_1 - 1.5(IQR)$
Upper Fence: $Q_3 + 1.5(IQR)$

Any data point falling outside these fences is considered an outlier.
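The fence calculation can be sketched as below. The function names are hypothetical, and $Q_1$ and $Q_3$ are passed in directly so the example stays focused on the rule itself.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values that fall strictly outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


# Illustrative data with Q1 = 4 and Q3 = 9.
data = [2, 4, 5, 6, 7, 9, 20]
fences = outlier_fences(4, 9)
flagged = find_outliers(data, 4, 9)
print(fences, flagged)
```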

Boxplots (Box-and-Whisker Plots)

A boxplot is the graphical representation of the five-number summary.

The Box spans from $Q_1$ to $Q_3$ (the IQR).
The line inside the box is the Median.
The Whiskers extend to the lowest and highest observations that are not outliers.
Outliers are plotted as individual dots beyond the whiskers.

Figure: Anatomy of a boxplot with outliers
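To see where the whiskers end, the sketch below keeps only the observations inside the fences and takes their minimum and maximum. The function name `whisker_ends` is an assumption, and the quartiles are supplied by hand.

```python
def whisker_ends(values, q1, q3):
    """Return the lowest and highest observations that are not outliers."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach actual data points, never the fences themselves.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return min(inside), max(inside)


# Illustrative data with Q1 = 4 and Q3 = 9; the fences are -3.5 and 16.5.
low_end, high_end = whisker_ends([2, 4, 5, 6, 7, 9, 20], 4, 9)
print(low_end, high_end)
```

The upper fence here is 16.5, but the upper whisker ends at 9 because that is the largest observation inside the fence.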

Worked Example

Dataset: $2, 4, 5, 6, 7, 9, 20$

Calculate Stats:
- $n = 7$
- Median: 6 (4th number)
- $Q_1$: 4 (Median of lower half $2, 4, 5$)
- $Q_3$: 9 (Median of upper half $7, 9, 20$)
- $IQR = 9 - 4 = 5$
Check for Outliers:
- Lower Limit: $4 - 1.5(5) = -3.5$. No values below -3.5.
- Upper Limit: $9 + 1.5(5) = 16.5$. The value 20 is above 16.5.
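Conclusion: 20 is an outlier by the $1.5 \times IQR$ rule. The right whisker would stop at 9, the largest observation that is not an outlier (here it coincides with $Q_3$, so the right whisker has zero length), and 20 would be plotted as a separate point.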
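The worked example can be checked step by step with a short script. This is only a sketch: the variable names are arbitrary, and the indexing shortcuts rely on this particular dataset having an odd $n$ and halves of odd length.

```python
data = [2, 4, 5, 6, 7, 9, 20]
ordered = sorted(data)
n = len(ordered)

# n = 7 is odd, so the median is the single middle value.
median = ordered[n // 2]

# Halves exclude the median; each has 3 values, so each has a single middle.
lower_half = ordered[: n // 2]
upper_half = ordered[n // 2 + 1:]
q1 = lower_half[len(lower_half) // 2]
q3 = upper_half[len(upper_half) // 2]
iqr = q3 - q1

# 1.5 x IQR fences and the points outside them.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in ordered if v < lower_fence or v > upper_fence]

# The right whisker ends at the largest observation that is not an outlier.
right_whisker = max(v for v in ordered if v <= upper_fence)

print(median, q1, q3, iqr)
print(lower_fence, upper_fence, outliers, right_whisker)
```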

Summary: Choosing the Right Statistics

On the AP exam, you are often asked to compare distributions. Your choice of summary statistics should follow from the shape of the data.

Distribution Shape	Measure of Center	Measure of Spread
Symmetric / Normal	Mean ($\bar{x}$)	Standard Deviation ($s_x$)
Skewed / Has Outliers	Median	IQR

Why? Because Mean and SD are sensitive to skew/outliers, while Median and IQR are resistant.

Common Mistakes & Pitfalls

Confusing Statistic vs. Parameter:
- Remember: Statistics come from Samples ($\bar{x}, s_x$). Parameters come from Populations ($\mu, \sigma$).
Misinterpreting the Standard Deviation:
- Bad answer: "The standard deviation is 5.2."
- Good answer: "The values in this sample typically vary by about 5.2 units from the mean."
- Don't forget the context and units!
Incorrect Boxplot Whiskers:
- Students often draw whiskers to the "fences" calculated by the 1.5 IQR rule. This is wrong. Whiskers go to the last actual data point inside the fence.
Vague Comparisons:
- When comparing two distributions, never just list the stats (e.g., "Dataset A has mean 5, Dataset B has mean 10"). You must use comparative language: "The center of Dataset B (10) is higher than the center of Dataset A (5)."

Mnemonic: CUSS and BS

When asked to describe or compare distributions, remember to address:

Center (Mean/Median)
Unusual Features (Outliers/Gaps)
Shape (Skewness/Modes)
Spread (Range/SD/IQR)
Be Specific (Always include context/units!)