Unit 1: Exploring One-Variable Data
Thinking Statistically: Individuals, Variables, and Distributions
Statistics is about learning from data while being honest about variability. In Unit 1, the focus is one-variable data, where each individual contributes one measurement (or one category label). Even with one variable, you can learn a lot by defining the variable clearly, organizing the data well, and describing the distribution in context.
Individuals, variables, and data
An individual is the “who” the data describe (a person, a school, a game, a day—whatever you’re observing). A variable is a characteristic measured on each individual (height, major, number of siblings). Data are the actual recorded values of the variable. A common early mistake is mixing up individuals and variables: “students” are individuals; “GPA” is a variable.
Categorical vs. quantitative variables (and a common trap)
A categorical (qualitative) variable takes values that are category names or group labels (blood type, brand of phone, eye color). A quantitative variable takes numerical values for a measured or counted quantity (age in years, commute time in minutes). The distinction matters because it determines which graphs and numerical summaries make sense.
A subtle but important point is that “numbers” are not automatically quantitative. ZIP codes are numeric-looking labels, so ZIP code is categorical, not quantitative.
Discrete vs. continuous quantitative variables
Quantitative variables are often classified as:
- Discrete quantitative variables, which take a finite or countable number of values with noticeable “gaps” between possible values (for example, number of AP classes).
- Continuous quantitative variables, which can take infinitely many values with no gaps (for example, heights and weights).
What is a distribution?
A distribution shows what values a variable takes and how often it takes them.
- For categorical variables, the distribution is the set of categories and their frequencies/proportions.
- For quantitative variables, the distribution includes the overall pattern (shape), where values cluster, whether it’s symmetric or skewed, and whether there are unusual values.
When you describe a distribution, always tie your description to context. Saying “skewed right” is incomplete unless you interpret what that tail means in the real situation (for example, “a few students have very long commutes”).
Exam Focus
- Typical question patterns
- Identify individuals, variable, and context from a scenario.
- Classify a variable as categorical or quantitative and justify.
- Explain what “distribution” means in words, not just by naming a graph.
- Common mistakes
- Treating numeric labels (ZIP codes, jersey numbers) as quantitative.
- Describing a graph without mentioning what the values represent.
- Confusing “each bar” in a bar chart (categories) with “bins” in a histogram (intervals of numbers).
Organizing and Displaying Categorical Data
Categorical data are about group membership, so the key idea is to summarize counts and proportions.
Frequency tables and relative frequency tables
A frequency table lists each category and its count (frequency). A relative frequency table lists each category and its proportion (or percent) of the total. Relative frequencies are often more informative than raw counts because they allow comparisons across groups of different sizes.
If the total number of individuals is n and a category has count c, then the relative frequency is:
\text{relative frequency} = \frac{c}{n}
Bar charts
A bar chart displays categories on one axis and frequencies (or relative frequencies) on the other. Each category gets its own bar.
Two key differences from histograms are that (1) bars in a bar chart are separated because categories are distinct, and (2) category order is usually arbitrary unless there’s a natural order (like education level). When describing a bar chart, focus on the most common and least common categories and any notable differences.
Pie charts
A pie chart shows relative frequencies as slices of a circle. Pie charts emphasize “parts of a whole,” but they are usually less precise than bar charts for comparing categories—especially when slices are similar.
Dot plots for categorical data
Categorical counts can also be displayed with dot-based displays (for example, dots stacked above category names). The underlying goal is the same: communicate counts or proportions by category.
Example 1.1: frequency and relative frequency table (parents’ preferred school-year length)
During the first week of 2022, a survey of 2000 parents found:
| Desired School Length | Number of Parents (frequency) | Relative Frequency | Percent of Parents |
|---|---|---|---|
| 180 days | 1100 | 1100/2000 = 0.55 | 55% |
| 160 days | 300 | 300/2000 = 0.15 | 15% |
| 200 days | 500 | 500/2000 = 0.25 | 25% |
| No opinion | 100 | 100/2000 = 0.05 | 5% |
This table makes it easy to compare categories by percent, not just by count.
Example: building and interpreting a relative frequency table (study location)
Suppose a class surveys students’ preferred study location:
- Library: 12
- Dorm/room: 18
- Cafe: 6
- Other: 4
Total n = 40. Relative frequency for Library:
\frac{12}{40} = 0.30
So about 30 percent of students prefer the library. In context, that suggests the library is popular but not the majority choice.
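As a quick numerical check, the same table can be built with a few lines of Python (the dictionary name and print format here are ours, not part of the survey):

```python
# Counts from the study-location example above
counts = {"Library": 12, "Dorm/room": 18, "Cafe": 6, "Other": 4}

n = sum(counts.values())  # total number of students surveyed: 40
rel_freq = {cat: c / n for cat, c in counts.items()}

for cat, rf in rel_freq.items():
    print(f"{cat}: {rf:.2f} ({rf:.0%})")  # e.g. Library: 0.30 (30%)
```

Note that the relative frequencies sum to 1, which is a useful sanity check on any hand-built table.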
Exam Focus
- Typical question patterns
- Create a frequency or relative frequency table from a description.
- Choose an appropriate display (bar chart vs. pie chart) and justify.
- Interpret a relative frequency as a probability-like statement (for example, “about 30 percent”).
- Common mistakes
- Using a histogram for categorical data.
- Forgetting to compute the total before calculating relative frequencies.
- Over-interpreting tiny differences in pie chart slices.
Representing Quantitative Data with Tables and Graphs
Quantitative displays are designed to reveal the shape of a distribution and make patterns (clusters, gaps, skewness, outliers) visible. Quantitative values can be organized into frequency/relative frequency tables or represented with dotplots, stemplots, histograms, boxplots, time plots, or cumulative relative frequency plots.
Dotplots
A dotplot places a dot for each data value along a number line (stacking dots for repeated values). Dotplots are best for small-to-moderate data sets because you can still see individual values. They help you spot clusters, gaps, and potential outliers.
Stemplots (stem-and-leaf plots)
A stemplot splits each number into a stem (leading digits) and a leaf (final digit). It preserves exact data values while showing distribution shape.
Example values: 12, 14, 15, 21, 22, 27 produce:
1 | 2 4 5
2 | 1 2 7
Stemplots work best when values have a consistent number of digits and the range isn’t huge. A common pitfall is not including a key (for example, “2|7 means 27”), which makes the plot ambiguous.
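The stem/leaf split is mechanical enough to sketch in code. The helper below (our own illustration, assuming two-digit values so stems are tens digits) reproduces the plot above:

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: tens digit as stem, ones digits as leaves."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(rows.items())]

for line in stemplot([12, 14, 15, 21, 22, 27]):
    print(line)
# Key: 2 | 7 means 27
```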
Histograms (frequency and relative frequency)
A histogram groups quantitative data into intervals called bins and uses bars to show how many values fall in each bin. Histogram bars touch because the number line is continuous.
Histograms are powerful for large data sets, but bin choices (width and starting boundaries) can change the appearance. Very small bins can look noisy; very large bins can hide structure like two clusters.
Sometimes the vertical axis shows relative frequency (frequency divided by the total) instead of raw counts. The key fact is that the shape stays the same whether you use frequencies or relative frequencies—only the vertical scale changes.
Example 1.2: seniors’ AP classes (relative frequency histogram idea)
Suppose there are 2200 seniors in a city’s six high schools. The number of AP classes taken is:
| Number of AP classes | Frequency | Relative frequency |
|---|---|---|
| 0 | 400 | 400/2200 ≈ 0.18 |
| 1 | 500 | 500/2200 ≈ 0.23 |
| 2 | 900 | 900/2200 ≈ 0.41 |
| 3 | 300 | 300/2200 ≈ 0.14 |
| 4 | 100 | 100/2200 ≈ 0.05 |
If you drew a histogram using frequencies or relative frequencies on the vertical axis, the distribution’s shape would look the same.
Time plots (time series plots)
A time plot graphs a quantitative variable measured over time, with time on the horizontal axis. Time plots are different from histograms/dotplots because the order is meaningful. You look for:
- overall trend (upward/downward)
- seasonality (regular cycles)
- unusual spikes or drops
A common mistake is to use a histogram for time-ordered data when the real question is how the variable changes over time.
Cumulative relative frequency plots (ogives)
A cumulative frequency or cumulative relative frequency plot shows how counts or proportions accumulate as you move from smaller to larger values. These graphs are especially useful for reading medians and quartiles by finding where the cumulative proportion hits 0.50, 0.25, and 0.75.
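The idea can be checked numerically with the AP-classes counts from Example 1.2; a short Python sketch accumulates the proportions and reads off where they first reach 0.50:

```python
from itertools import accumulate

# Number of AP classes (0-4) and counts, from Example 1.2
values = [0, 1, 2, 3, 4]
counts = [400, 500, 900, 300, 100]

n = sum(counts)
cum_prop = [c / n for c in accumulate(counts)]

# The median is the first value whose cumulative proportion reaches 0.50
median_value = next(v for v, p in zip(values, cum_prop) if p >= 0.5)
print([round(p, 2) for p in cum_prop])  # [0.18, 0.41, 0.82, 0.95, 1.0]
print(median_value)                     # 2
```

Reading the plot the same way: the cumulative curve crosses 0.50 inside the "2 AP classes" group, so the median senior takes 2 AP classes.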
Choosing the right graph (quick guide)
- Test scores for 25 students: dotplot or stemplot (individual values matter).
- Commute times for 500 workers: histogram (large data set).
- Daily high temperature over 90 days: time plot (order matters).
Exam Focus
- Typical question patterns
- Select an appropriate graph for a quantitative data set and explain why.
- Read and interpret a histogram’s shape (skew, modality, clusters, gaps) in context.
- Explain how changing bin width/boundaries can change a histogram’s appearance.
- Use cumulative relative frequency plots to estimate medians and IQRs.
- Common mistakes
- Calling a histogram a bar chart and treating bins like categories.
- Ignoring how bin choices affect comparisons.
- Using a dotplot for a very large data set where it becomes unreadable.
Describing Quantitative Distributions (SOCS, Shape Language, and Context)
A strong description of a quantitative distribution is more than “the average is about …”. It should capture the overall pattern and any unusual features, always tied to context.
SOCS: a complete description
A common checklist is SOCS:
- Shape
- Outliers (and other unusual features like gaps)
- Center
- Spread
This encourages a complete story: what it looks like, what stands out, what is typical, and how much values vary.
Center and spread as core features
From a graph, two fundamental features are:
- Center, which roughly separates the values (or the area under a histogram) in half.
- Spread, how far the values extend from smallest to largest.
Clusters and gaps
Two other important aspects of the overall pattern are:
- Clusters, which suggest natural subgroups. For example, teacher salaries in a college town might form overlapping clusters for different institutions.
- Gaps, which show holes where no values fall. For example, if a dean only writes letters to students with very high GPAs or very low GPAs, the GPA distribution of letter recipients could have a large gap in the middle.
Shape vocabulary (including bell-shaped and uniform)
Distributions come in many shapes, but several common patterns are worth knowing.
- Unimodal: one clear peak.
- Bimodal: two clear peaks (often indicates two subgroups mixed together).
- Symmetric: left and right sides look roughly like mirror images.
- Skewed right: spreads far and thinly toward higher values (long right tail).
- Skewed left: spreads far and thinly toward lower values (long left tail).
- Bell-shaped: symmetric with a central mound and two sloping tails.
- Uniform: the histogram is approximately a horizontal line (roughly equal frequencies across bins).
Example 1.3: bimodality matters (Hodgkin’s lymphoma age at diagnosis)
For female cases of Hodgkin’s lymphoma, simply reporting “the average age is around 50” can miss the most important feature. The histogram is bimodal with two distinct clusters centered around about 25 and 75. This suggests two different age groups are experiencing diagnosis at higher rates.
Mean vs. median and skewness (a useful diagnostic)
Skewness affects measures of center:
- In a right-skewed distribution, the mean is usually greater than the median because the long right tail pulls the mean upward.
- In a left-skewed distribution, the mean is usually less than the median.
Example 1.7: using mean vs. median to infer shape (faculty salaries)
Suppose faculty salaries at a college have a median of 82,500 dollars and a mean of 88,700 dollars. Because the mean is greater than the median, the distribution is probably skewed to the right: a few highly paid professors pull the mean upward while most salaries are lower.
Example 1.8: histograms of z-scores and why area matters
Suppose a histogram is constructed from z-score intervals using the following percentile information:
| z-score | −2 | −1 | 0 | 1 | 2 |
|---|---|---|---|---|---|
| Percentile ranking | 0 | 20 | 60 | 70 | 100 |
This implies 20% of the area lies between z-scores −2 and −1, 40% between −1 and 0, 10% between 0 and 1, and 30% between 1 and 2. Even if the histogram is drawn with many more z-score cutpoints (making bars narrower), the key idea remains: the height at any point is not meaningful by itself; what matters is relative areas.
From the table:
- The percent of area between z-scores +1 and +2 is still 30%.
- The percent to the left of 0 is still 60%.
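The interval areas above come from simple differences of consecutive percentile rankings; a tiny Python sketch makes the bookkeeping explicit:

```python
# Percentile rankings at z-score cutpoints, from the table in Example 1.8
cutpoints   = [-2, -1, 0, 1, 2]
percentiles = [0, 20, 60, 70, 100]

# The area (percent) in each interval is the difference of
# consecutive percentile rankings
areas = [hi - lo for lo, hi in zip(percentiles, percentiles[1:])]
print(areas)  # [20, 40, 10, 30]
```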
Outliers: the 1.5·IQR rule
A common AP Statistics rule for flagging outliers uses quartiles.
- Compute:
\text{IQR} = Q_3 - Q_1
- Compute fences:
\text{Lower fence} = Q_1 - 1.5(\text{IQR})
\text{Upper fence} = Q_3 + 1.5(\text{IQR})
Values below the lower fence or above the upper fence are flagged as outliers. This is a screening tool, not proof that a value is an error.
Example: SOCS description from a histogram (homework time)
Suppose a histogram of “minutes spent on homework per night” is unimodal and right-skewed, with most students between 20 and 80 minutes and a few students above 180 minutes. A strong SOCS description might say the distribution is unimodal and skewed right, there are a few unusually large values above about 180 minutes, a typical student spends around 60 minutes, and most students are between about 20 and 80 minutes even though the right tail makes the overall spread larger.
Exam Focus
- Typical question patterns
- Describe a distribution from a graph using SOCS in context.
- Identify clusters, gaps, skewness, modality, and possible outliers.
- Use mean vs. median information to infer skew direction.
- Use the 1.5·IQR rule to flag outliers and interpret what that means.
- Common mistakes
- Giving center/spread without describing shape or unusual features.
- Saying “there are outliers” without stating where (high end/low end) and what they represent.
- Thinking histogram bar heights matter more than the areas they represent.
Numerical Summaries for Quantitative Data (Center, Spread, and Position)
Graphs show patterns; numerical summaries give precise, comparable measures. In this unit you should know what each summary means, how it behaves with skew/outliers, and how transformations affect it.
Descriptive vs. inferential statistics
Descriptive statistics refers to presenting and summarizing data: representative values (center), variability (spread), positions (percentiles, z-scores), and shape. Inferential statistics is the process of drawing conclusions from limited data; it becomes central in later units.
Notation: population vs. sample
A population is the entire group of interest; a sample is a subset.
| Concept | Population parameter | Sample statistic |
|---|---|---|
| Mean | \mu | \bar{x} |
| Standard deviation | \sigma | s |
| Variance | \sigma^2 | s^2 |
| Size | N (sometimes used) | n |
Mean
For values x_1, x_2, \dots, x_n, the sample mean is:
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
The mean is the arithmetic average and can be interpreted as the distribution’s “balance point.” It uses every value, so it is sensitive to outliers.
Median
To find the median, sort the data.
- If n is odd, the median is the middle value.
- If n is even, the median is the average of the two middle values.
The median is resistant to outliers because it depends on order rather than distances.
Quartiles, percentiles, and IQR
Quartiles split ordered data into quarters:
- Q_1 is the 25th percentile.
- Q_2 is the median (50th percentile).
- Q_3 is the 75th percentile.
Then:
\text{IQR} = Q_3 - Q_1
A percentile gives the percent of observations at or below a value (relative standing). For example, “a score of 82 is at the 90th percentile” means about 90% of scores are 82 or lower, not “90% correct.”
AP note: There are slightly different hand-calculation conventions for quartiles (especially when n is odd). On the AP exam, technology is often used; graders prioritize correct interpretation and consistent use.
Variability (dispersion): range, IQR, variance, standard deviation
Variability is a fundamental concept in statistics.
- Range: max minus min (very sensitive to outliers).
- IQR: spread of the middle 50%.
- Variance: average of squared differences from the mean.
- Standard deviation: square root of variance; a typical distance from the mean (in original units).
The sample standard deviation formula used in AP Statistics is:
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
Standard deviation is sensitive to outliers because it squares deviations.
Five-number summary
The five-number summary is:
- minimum
- Q_1
- median
- Q_3
- maximum
Position measures: ranking, percentile rank, and z-score
Three recognized ways of designating position are:
- Simple ranking, noting where a value falls after ordering.
- Percentile ranking, the percent of values at or below a given value.
- z-score, the number of standard deviations a value is above or below the mean (developed further in the transformation/Normal sections).
Example 1.4: mean vs. median (home run distances)
Consider these home run distances (feet) to center field in 13 ballparks:
{387, 400, 400, 410, 410, 410, 414, 415, 420, 420, 421, 457, 461}.
The median is 414 (six values below and six above). The mean is:
\bar{x} = \frac{5425}{13} \approx 417.31
The mean is larger because the two very large values near 460 pull it upward.
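Python's standard `statistics` module confirms the calculation:

```python
from statistics import mean, median

distances = [387, 400, 400, 410, 410, 410, 414,
             415, 420, 420, 421, 457, 461]

print(median(distances))          # 414 (the 7th of 13 ordered values)
print(round(mean(distances), 2))  # 5425 / 13 ≈ 417.31
```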
Example 1.5: how transformations affect the mean (salaries)
Salaries of six employees are 3000, 7000, 15000, 22000, 23000, and 38000 dollars.
a. Mean salary:
\bar{x} = \frac{108000}{6} = 18000
b. If everyone receives a 3000-dollar increase, the new mean is:
18000 + 3000 = 21000
c. If instead everyone receives a 10% raise, the new mean is:
1.10(18000) = 19800
This illustrates that adding the same constant adds that constant to the mean, and multiplying each value by a constant multiplies the mean by that constant.
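Both transformation rules can be verified directly on the salary data:

```python
from statistics import mean

salaries = [3000, 7000, 15000, 22000, 23000, 38000]

print(mean(salaries))                      # 18000
print(mean([s + 3000 for s in salaries]))  # 21000: the constant adds on
print(mean([1.10 * s for s in salaries]))  # ≈ 19800: the mean scales too
```

No need to recompute from scratch: transforming every value transforms the mean the same way.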
Example 1.6: measures of variability (ages of 12 mathematics teachers)
Ages are:
{24, 25, 25, 29, 34, 37, 41, 42, 48, 48, 54, 61}.
The mean age is:
\bar{x} = \frac{468}{12} = 39
Measures of variability:
- Range:
61 - 24 = 37
- IQR, Method 1 (trim outer quarters and take range of remaining middle half): remove the lowest quarter {24, 25, 25} and highest quarter {48, 54, 61}, leaving {29, 34, 37, 41, 42, 48}. Then:
48 - 29 = 19
- IQR, Method 2 (quartiles as medians of halves): lower half {24, 25, 25, 29, 34, 37} has median:
Q_1 = \frac{25 + 29}{2} = 27
Upper half {41, 42, 48, 48, 54, 61} has median:
Q_3 = \frac{48 + 48}{2} = 48
So:
\text{IQR} = 48 - 27 = 21
The two methods give slightly different values here (19 versus 21); as noted earlier, quartile conventions vary, so pick one method and apply it consistently.
- Variance and standard deviation (treating these 12 teachers as the full population): the sum of squared deviations from 39 is 1630, so:
\sigma^2 = \frac{1630}{12} \approx 135.83
\sigma = \sqrt{135.83} \approx 11.655
Interpretation: the teachers’ ages typically vary by about 11.655 years from the mean of 39 years.
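Because the example treats the 12 teachers as the full population, Python's `pvariance` and `pstdev` (which divide by n rather than n − 1) reproduce the results:

```python
from statistics import mean, pvariance, pstdev

ages = [24, 25, 25, 29, 34, 37, 41, 42, 48, 48, 54, 61]

# pvariance/pstdev divide by n, matching the "full population" treatment
print(mean(ages))                 # 39
print(round(pvariance(ages), 2))  # 1630 / 12 ≈ 135.83
print(round(pstdev(ages), 3))     # ≈ 11.655
```

Had the teachers been a sample from a larger population, `variance` and `stdev` (dividing by n − 1) would be the appropriate calls.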
Worked example: mean, median, IQR, and outlier fences
Data (minutes): 12, 14, 15, 18, 20, 22, 25, 28, 60
1) Mean:
\bar{x} = \frac{12+14+15+18+20+22+25+28+60}{9} = \frac{214}{9} \approx 23.78
2) Median: the 5th value (since n=9) is 20.
3) Quartiles (one common method: split excluding the median):
- Lower half: 12, 14, 15, 18 so:
Q_1 = \frac{14+15}{2} = 14.5
- Upper half: 22, 25, 28, 60 so:
Q_3 = \frac{25+28}{2} = 26.5
4) IQR:
\text{IQR} = 26.5 - 14.5 = 12
5) Outlier fences:
\text{Lower fence} = 14.5 - 1.5(12) = -3.5
\text{Upper fence} = 26.5 + 1.5(12) = 44.5
So 60 is flagged as an outlier (it exceeds 44.5). This also shows how an outlier can pull the mean up even when most values are much smaller.
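The whole computation fits in a few lines of Python. One caveat to hedge: `statistics.quantiles` supports more than one quartile convention; its default "exclusive" method happens to agree with the split-excluding-the-median approach used in this example, but other methods (or a calculator) can give slightly different quartiles:

```python
from statistics import quantiles

data = [12, 14, 15, 18, 20, 22, 25, 28, 60]

# With n=4, statistics.quantiles returns [Q1, median, Q3]
q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, med, q3)               # 14.5 20.0 26.5
print(lower_fence, upper_fence)  # -3.5 44.5
print(outliers)                  # [60]
```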
Exam Focus
- Typical question patterns
- Compute and interpret \bar{x}, median, IQR, and standard deviation from data or calculator output.
- Decide whether mean/SD or median/IQR is more appropriate and justify using skew/outliers.
- Interpret percentiles and distinguish them from “percent correct.”
- Compute fences and decide whether a value is flagged as an outlier.
- Common mistakes
- Interpreting standard deviation as an “average value” rather than a typical distance from the mean.
- Mixing up Q_1 and Q_3 or reporting IQR as a point rather than a spread.
- Treating the 1.5·IQR rule as proof a value is an error rather than an unusually large/small (possibly real) observation.
Boxplots (and What They Reveal or Hide)
A boxplot is built from the five-number summary and is especially useful for comparing distributions and highlighting outliers.
How a boxplot is constructed
A standard (modified) boxplot includes:
- a box from Q_1 to Q_3
- a line at the median
- whiskers extending to the smallest and largest non-outlier values
- individual points for outliers (typically flagged by the 1.5·IQR rule)
Boxplots compress data into a compact summary, making side-by-side comparisons efficient.
Interpreting a boxplot
A boxplot helps you compare:
- center (median)
- spread (IQR and overall range)
- skew (unequal whisker lengths and median not centered)
- outliers (plotted points)
What boxplots do not show well is multimodality (two peaks) and detailed clusters/gaps. If you suspect bimodality, a histogram or dotplot is often necessary.
Example: inferring skew from a boxplot
If the upper whisker is much longer than the lower whisker and the median is closer to Q_1 than Q_3, that suggests right skew: larger values are more spread out, creating a longer high-end tail.
Example 1.11: parallel boxplots over time (daily stock price fluctuations)
Parallel boxplots showing daily price fluctuations of a particular stock over five years can reveal trends even without raw data. In this case, the boxplots show that from year to year the median daily stock price steadily rose about 20 points (from about 58 to about 78), the third quartile stayed roughly stable around 84, the yearly low never decreased from the previous year, and the IQR never increased from one year to the next. The lowest median occurred in 2017 and the highest in 2021. The smallest spread (by range) was in 2021 and the largest was in 2018. None of the yearly distributions showed an outlier.
Exam Focus
- Typical question patterns
- Match a boxplot to a description of skew/outliers.
- Compare two groups’ medians and IQRs using side-by-side boxplots.
- Interpret time trends when boxplots are presented year-by-year.
- Common mistakes
- Saying the “mean” is shown on a boxplot (it is not).
- Assuming a larger IQR means “higher values” instead of “more variability.”
- Over-claiming detailed shape (like bimodality) from boxplots alone.
Comparing Distributions (Same Variable, Different Groups)
Often you measure the same variable for two groups (two classes, two neighborhoods, two conferences). This is still one-variable thinking, but you must compare two distributions clearly and with evidence.
How to write a strong comparison
A good comparison addresses:
1) Center: which group tends to have larger typical values (compare medians or means, depending on appropriateness).
2) Spread: which group is more variable (IQR/SD and sometimes range).
3) Shape and unusual features: skewness, clusters, gaps, outliers.
4) Context: interpret differences in real-world terms.
Side-by-side displays (and keeping scales consistent)
Common comparison graphs include:
- back-to-back stemplots
- side-by-side (comparative) histograms
- parallel (side-by-side) boxplots
- cumulative frequency/relative frequency plots
When comparing histograms, always verify that horizontal scales match. Different scales can mislead.
Comparing categorical distributions: compare proportions, not counts
For categorical variables, compare relative frequencies. A “larger” count can come from a larger group size rather than a real difference in preference/behavior.
Example: School A has 60 students and 30 play a sport (50%), while School B has 200 students and 70 play a sport (35%). School B has more athletes by count, but School A has a higher proportion.
Example 1.9: comparing wins in two NBA conferences (back-to-back stemplot)
When comparing wins for Eastern Conference (EC) and Western Conference (WC) teams:
- Shape: EC is roughly bell-shaped; WC is roughly uniform with a low outlier.
- Center: medians (8th out of 15) are m_{EC} = 41 and m_{WC} = 49, so WC has the greater center.
- Spread: ranges are 60 − 17 = 43 (EC) and 57 − 19 = 38 (WC), so EC has the greater spread.
- Unusual features: WC shows an apparent outlier at 19 and a gap between 19 and 33; EC shows no apparent outliers or gaps.
Example 1.10: comparing sleep hours (two histograms)
Two surveys (one of high school students, one of college students) asked for hours of sleep per night.
- Shape: high school distribution is skewed right; college distribution is unimodal and roughly symmetric.
- Center: the high school median (between 6.5 and 7) is less than the college median (between 7 and 7.5).
- Spread: the range for college students is greater than the range for high school students.
- Unusual features: the college distribution shows two distinct gaps (5.5 to 6 and 8 to 8.5) and possible low and high outliers; the high school distribution does not clearly show gaps or outliers.
Example 1.12: comparing populations using cumulative frequency plots (U.S. ages in 1860 vs. 1980)
A cumulative frequency plot of age allows direct reading of median and quartiles.
- Medians: at cumulative proportion 0.5, half the 1860 population was under age 20, while in 1980 half the population was under about age 32.
- IQRs: at 0.25 and 0.75,
- for 1860, Q_1 = 9 and Q_3 = 35, so:
\text{IQR}_{1860} = 35 - 9 = 26
- for 1980, Q_1 = 16 and Q_3 = 50, so:
\text{IQR}_{1980} = 50 - 16 = 34
Both the median and IQR are greater in 1980 than in 1860.
Example: comparing two boxplots (conceptual)
Suppose side-by-side boxplots show:
- Group 1 median around 72, IQR around 10
- Group 2 median around 68, IQR around 18, with two high outliers
A strong comparison: Group 1 tends to score higher (median about 4 points higher), Group 2 is more variable (larger IQR), and Group 2 has a couple of unusually high scores.
Exam Focus
- Typical question patterns
- Write a comparison paragraph using center, spread, and shape with numerical evidence.
- Decide which group is more variable and justify using IQR/SD/range.
- Compare categorical groups using relative frequencies.
- Read medians and IQRs from cumulative frequency/relative frequency plots.
- Common mistakes
- Comparing counts instead of proportions for categorical data.
- Comparing graphs without checking scales.
- Using mean/SD language when skew/outliers suggest median/IQR.
How Changes to Data Affect Summaries (Shifts, Rescaling, and Standardizing)
In real problems, data are often transformed (unit conversions, adding bonuses, scaling). You need to predict how these transformations affect numerical summaries.
Adding or subtracting a constant (shifting)
If you add a constant c to every value (new values y = x + c), then measures of center increase by c, but measures of spread stay the same:
\bar{y} = \bar{x} + c
Intuition: shifting moves the distribution left or right without changing its shape.
Multiplying by a constant (rescaling)
If you multiply every value by a constant a (new values y = ax), then measures of center are multiplied by a and measures of spread are multiplied by |a|:
\bar{y} = a\bar{x}
s_y = |a|s_x
IQR is also multiplied by |a|. If a is negative, the distribution is reflected on the number line, reversing the direction of skew.
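Both rules are easy to verify on a small data set (the values and constants below are arbitrary illustrations):

```python
from statistics import mean, stdev

x = [2, 4, 4, 6, 9]
c, a = 10, -3

shifted = [v + c for v in x]  # shift: center moves, spread unchanged
scaled  = [a * v for v in x]  # rescale: center times a, spread times |a|

print(mean(shifted), stdev(shifted))       # mean + 10, same stdev
print(mean(scaled))                        # -3 * mean
print(round(stdev(scaled) / stdev(x), 6))  # 3.0, i.e. |a|
```

Note the spread of the negated-and-scaled data grows by |−3| = 3, even though every value became negative.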
z-scores (standardizing)
A z-score tells how many standard deviations a value is from the mean.
For a population model:
z = \frac{x - \mu}{\sigma}
Using sample summaries:
z = \frac{x - \bar{x}}{s}
Interpretation:
- z = 0 means the value equals the mean.
- z = 2 means 2 standard deviations above the mean.
- A negative z-score means below the mean.
z-scores are powerful because they allow comparisons across different scales and connect directly to Normal distribution calculations.
Example: comparing performance using z-scores
Student A scores 86 on a test with mean 80 and standard deviation 3:
z_A = \frac{86 - 80}{3} = 2
Student B scores 92 on a test with mean 88 and standard deviation 2:
z_B = \frac{92 - 88}{2} = 2
Both students performed equally far above their class mean in standard deviation units.
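The comparison above can be sketched as (the function name is ours):

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_a = z_score(86, 80, 3)  # Student A
z_b = z_score(92, 88, 2)  # Student B
print(z_a, z_b)  # 2.0 2.0 -- equal relative standing
```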
Exam Focus
- Typical question patterns
- Predict how mean/median/IQR/SD change after adding a constant or multiplying by a constant.
- Compute and interpret a z-score in context.
- Use z-scores to compare relative standing across two distributions.
- Common mistakes
- Thinking adding c changes standard deviation (it does not).
- Forgetting the absolute value effect for spread when multiplying by a negative number.
- Treating a z-score as a raw score or percentile without justification.
Density Curves and the Normal Distribution
Sometimes it’s useful to model a distribution with a smooth curve rather than raw data bars or dots.
Density curves
A density curve is a smooth curve describing a distribution where:
- the curve is always on or above the horizontal axis
- the total area under the curve is 1
Area under the curve corresponds to proportion of observations. This is a model: it captures the overall pattern but does not have to match the data perfectly.
For a density curve:
- the median is the point with half the area to the left
- the mean is the balance point
As with histograms, skewness pulls the mean toward the long tail.
The Normal distribution
A Normal distribution is a bell-shaped, symmetric density curve described by two parameters:
- \mu: the mean (center)
- \sigma: the standard deviation (spread)
It is written:
N(\mu,\sigma)
Additional key properties:
- The normal curve is bell-shaped and symmetric with an infinite base (tails extend indefinitely).
- In a Normal distribution, the mean equals the median and is located at the center.
- The curve’s slope is steepest at two points of inflection, one on each side of the mean. The distance from the mean to either inflection point is exactly one standard deviation, which is why measuring horizontal distance in z-scores is so natural.
The standard Normal distribution
The standard Normal has:
\mu = 0
\sigma = 1
This is N(0,1). Any Normal value can be standardized:
z = \frac{x - \mu}{\sigma}
The 68–95–99.7 rule (Empirical Rule)
For a Normal distribution:
- about 68% of observations lie within 1 standard deviation of the mean
- about 95% lie within 2 standard deviations
- about 99.7% lie within 3 standard deviations
Example 1.13: using the empirical rule (taxicab miles)
Taxicabs in New York City are driven an average of 75,000 miles per year with a standard deviation of 12,000 miles. Assuming the distribution is roughly Normal:
- about 68% are between 63,000 and 87,000 miles
- about 95% are between 51,000 and 99,000 miles
- virtually all are between 39,000 and 111,000 miles
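Python's `statistics.NormalDist` (Python 3.8+) can confirm that these rule-of-thumb percentages match the exact Normal areas:

```python
from statistics import NormalDist

miles = NormalDist(mu=75_000, sigma=12_000)

within_1 = miles.cdf(87_000) - miles.cdf(63_000)
within_2 = miles.cdf(99_000) - miles.cdf(51_000)
within_3 = miles.cdf(111_000) - miles.cdf(39_000)

print(round(within_1, 4))  # 0.6827
print(round(within_2, 4))  # 0.9545
print(round(within_3, 4))  # 0.9973
```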
Finding Normal probabilities (areas)
To find a proportion such as P(X \le a) when X is Normal:
1) Standardize using z = \frac{a - \mu}{\sigma}.
2) Use technology (normalcdf) or a standard Normal table to find the area to the left of z.
To find P(a \le X \le b), compute the difference of two “left of” areas or use a calculator with lower and upper bounds.
Worked example: Normal probability (hamster weights)
Adult hamster weights are modeled as N(120,10) grams. Find the proportion weighing more than 135 grams.
1) Standardize 135:
z = \frac{135 - 120}{10} = 1.5
2) Convert to probability:
P(X > 135) = 1 - P(Z \le 1.5)
Using a standard Normal table, P(Z \le 1.5) \approx 0.9332, so:
P(X > 135) \approx 1 - 0.9332 = 0.0668
Interpretation: about 6.68% of adult hamsters weigh more than 135 grams under this model.
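The same two steps, standardize then find the tail area, can be reproduced with `statistics.NormalDist` (the `zscore` method requires Python 3.9+):

```python
from statistics import NormalDist

weights = NormalDist(mu=120, sigma=10)

z = weights.zscore(135)        # (135 - 120) / 10
p_more = 1 - weights.cdf(135)  # area to the right of 135

print(z)                 # 1.5
print(round(p_more, 4))  # 0.0668
```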
Finding Normal percentiles (inverse Normal)
Sometimes you’re given a percentile and asked for the corresponding value.
Example: Find the 90th percentile of N(100,15). The 90th percentile corresponds to a z-score near 1.28, so:
x = \mu + z\sigma
x = 100 + (1.28)(15) = 119.2
Interpretation: about 90% of observations are below about 119.2.
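The inverse calculation is one call with `statistics.NormalDist.inv_cdf`, which plays the role of a calculator's invNorm:

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=15)
p90 = scores.inv_cdf(0.90)  # value with 90% of the area to its left

print(round(p90, 1))  # 119.2
```

The slight difference from the hand calculation (119.2233 vs. 119.2) comes from rounding the z-score to 1.28.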
Assessing Normality: when is a Normal model reasonable?
A Normal model is a choice that must be justified with evidence such as:
- a histogram that looks roughly symmetric and bell-shaped
- a boxplot with no strong skew and no extreme outliers
- a Normal probability plot (Normal quantile plot)
A Normal probability plot graphs ordered data against expected Normal quantiles. Points close to a straight line support a Normal model; strong curvature suggests skewness; an S-shape often indicates heavy tails.
Exam Focus
- Typical question patterns
- Use N(\mu,\sigma) with z-scores to compute proportions above/below/between values.
- Use invNorm-style reasoning to find a percentile value and interpret it.
- Apply the 68–95–99.7 rule for quick, approximate reasoning.
- Decide whether a Normal model is appropriate based on a histogram, boxplot, or Normal probability plot.
- Common mistakes
- Using \sigma^2 (variance) instead of \sigma in the z-score formula.
- Forgetting to subtract from 1 for “greater than” probabilities.
- Treating a Normal model as automatically valid without checking skew/outliers or providing justification.
- Focusing on curve height rather than area when interpreting Normal probabilities.