Two-variable (bivariate) data set
A data set that records two pieces of information (two variables) for each individual (person, object, or event) to study a relationship.
Categorical variable
A variable that places individuals into groups or categories (e.g., brand, gender, yes/no, region).
Quantitative variable
A numerical variable for which arithmetic operations are meaningful (e.g., height, time, income).
Categorical vs categorical analysis
Studying the relationship between two categorical variables, typically using two-way tables and conditional distributions.
Quantitative vs quantitative analysis
Studying the relationship between two quantitative variables, typically using scatterplots, correlation, and linear regression.
Categorical vs quantitative analysis
Comparing a quantitative variable across categories of a categorical variable, often using side-by-side boxplots or dotplots by group.
Explanatory variable
The variable used to help explain or predict another variable; often labeled x in regression and placed on the horizontal axis.
Response variable
The outcome variable being predicted or explained; often labeled y in regression and placed on the vertical axis.
Association
A pattern or relationship between two variables (without automatically implying that one causes the other).
Causation
A cause-and-effect relationship where changes in one variable produce changes in another; not guaranteed by association alone.
Lurking variable
A variable not included in the analysis that may help explain the relationship observed between two variables.
Confounding
When the effects of two variables are mixed together so their individual effects on a response cannot be separated.
Two-way table
A table of counts for combinations of categories from two categorical variables, used to examine possible relationships.
Contingency table
Another name for a two-way table organizing counts for two categorical variables.
Row variable
In a two-way table, the categorical variable whose categories label the rows; conditioning on it means dividing each cell count by its row total.
Column variable
In a two-way table, the categorical variable whose categories label the columns; conditioning on it means dividing each cell count by its column total.
Table total (grand total)
The sum of all cell counts in a two-way table.
Joint relative frequency
A proportion for a specific cell in a two-way table: (cell count) ÷ (table total).
Marginal frequency
A row total or column total in the margins of a two-way table (the “totals” for one variable).
Marginal distribution
The distribution of one variable alone from a two-way table, found by converting marginal totals to proportions/percentages.
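A minimal sketch of the joint and marginal calculations, using a made-up two-way table (class year by part-time job). The counts and the names `table`, `joint`, and `marginal_job` are illustrative assumptions, not part of the definitions.

```python
# Hypothetical counts: rows = class year, columns = part-time job status.
table = {
    "junior": {"job": 15, "no job": 35},
    "senior": {"job": 30, "no job": 20},
}

# Table total (grand total): sum of every cell count.
grand_total = sum(sum(cols.values()) for cols in table.values())

# Joint relative frequency: (cell count) / (table total).
joint = {
    (row, col): count / grand_total
    for row, cols in table.items()
    for col, count in cols.items()
}

# Marginal frequencies for the column variable: the column totals.
col_totals = {}
for cols in table.values():
    for col, count in cols.items():
        col_totals[col] = col_totals.get(col, 0) + count

# Marginal distribution: column totals converted to proportions.
marginal_job = {col: total / grand_total for col, total in col_totals.items()}

print(joint[("junior", "job")])  # 15 out of 100
print(marginal_job)              # proportions for "job" and "no job"
```

The marginal distribution uses only the totals in the margins, so it says nothing yet about whether the two variables are related.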
Conditional distribution
The distribution of one variable restricted to a specific category of the other variable (i.e., “given that…”).
Conditional relative frequency
A proportion computed within a subgroup (row or column), used to compare groups and judge association.
Difference in proportions
A numerical comparison of two conditional proportions (often used to describe the size of an association for categorical variables).
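As a rough illustration of conditional relative frequencies and a difference in proportions, the sketch below conditions on the row variable of a made-up table. The counts and the helper name `conditional_given_row` are assumptions for this example only.

```python
# Hypothetical counts: rows = class year, columns = part-time job status.
table = {
    "junior": {"job": 15, "no job": 35},
    "senior": {"job": 30, "no job": 20},
}


def conditional_given_row(table, row):
    """Conditional distribution of the column variable within one row."""
    row_total = sum(table[row].values())
    return {col: count / row_total for col, count in table[row].items()}


juniors = conditional_given_row(table, "junior")
seniors = conditional_given_row(table, "senior")

# Difference in proportions: compare the same category across two groups.
diff_job = seniors["job"] - juniors["job"]

print(juniors)   # distribution of job status, given junior
print(seniors)   # distribution of job status, given senior
print(diff_job)  # about 0.30 for these made-up counts
```

Each conditional distribution sums to 1 within its own row, which is what makes groups of different sizes comparable.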
Segmented bar chart (100% stacked bar chart)
A graph that compares conditional distributions by using bars scaled to 100% so segment lengths represent conditional percentages.
Independence (categorical variables)
Two categorical variables are independent if knowing one variable’s category does not change the conditional distribution of the other.
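One way to check this definition descriptively is to compare the conditional distributions row by row. The sketch below assumes made-up counts and a hypothetical helper `looks_independent`; the tolerance `tol` is also an assumption, since real sample data rarely match exactly.

```python
def conditional_rows(table):
    """Conditional distribution of the column variable within each row."""
    result = {}
    for row, cols in table.items():
        row_total = sum(cols.values())
        result[row] = {col: count / row_total for col, count in cols.items()}
    return result


def looks_independent(table, tol=1e-9):
    """True if every row has the same conditional distribution (within tol)."""
    dists = list(conditional_rows(table).values())
    first = dists[0]
    return all(abs(d[col] - first[col]) <= tol for d in dists[1:] for col in first)


# Made-up tables: the first has identical row distributions, the second does not.
same_rows = {"a": {"x": 10, "y": 30}, "b": {"x": 20, "y": 60}}
different_rows = {
    "junior": {"job": 15, "no job": 35},
    "senior": {"job": 30, "no job": 20},
}

print(looks_independent(same_rows))
print(looks_independent(different_rows))
```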
Simpson’s paradox
A situation where a trend present in several groups disappears or reverses when the groups are combined, often due to a lurking variable.
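A small worked example can make the reversal concrete. All numbers below are invented for illustration: treatment A has the higher success rate within each case type, yet B has the higher rate once the groups are combined, because case type (the lurking variable) is unevenly split between treatments.

```python
# Hypothetical (successes, trials) for two treatments within two case types.
groups = {
    "easy": {"A": (90, 100), "B": (800, 1000)},
    "hard": {"A": (300, 1000), "B": (20, 100)},
}

# Success rate within each group.
within = {
    group: {t: s / n for t, (s, n) in treatments.items()}
    for group, treatments in groups.items()
}

# Success rate after pooling the groups together.
combined = {}
for t in ("A", "B"):
    successes = sum(groups[g][t][0] for g in groups)
    trials = sum(groups[g][t][1] for g in groups)
    combined[t] = successes / trials

print(within)    # A is ahead in both "easy" and "hard"
print(combined)  # B is ahead overall
```

Here A handled mostly hard cases and B mostly easy ones, so the pooled rates mainly reflect case difficulty rather than the treatments.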
Scatterplot
A graph of paired quantitative data with one point per individual, used to assess direction, form, strength, and unusual features.
Direction (scatterplot)
Whether y tends to increase as x increases (positive) or decrease as x increases (negative).
Positive association
An association where larger values of one quantitative variable tend to be paired with larger values of the other.
Negative association
An association where larger values of one quantitative variable tend to be paired with smaller values of the other.
Form (scatterplot)
The overall shape of the relationship in a scatterplot (e.g., linear, curved) and whether clusters appear.
Linear relationship
A relationship that is well summarized by a straight line pattern in a scatterplot.
Nonlinear (curved) relationship
A relationship with clear curvature; linear tools such as correlation and the least-squares regression line (LSRL) can be misleading when the form is curved.
Strength (scatterplot)
How closely points follow the overall form (especially a line if linear); not the same as having a steep slope.
Cluster (scatterplot)
A grouping of points in a scatterplot that may suggest subgroups or a missing categorical variable.
Outlier (scatterplot)
A point far from the overall pattern that can affect correlation and regression and may indicate an error or special case.
Correlation coefficient (r)
A unit-free number measuring the direction and strength of the linear association between two quantitative variables.
Range of r
The correlation r always satisfies −1 ≤ r ≤ 1; r = 1 or r = −1 occurs only when the points fall exactly on a line.
Unit-free property of r
Correlation has no units and does not change when measurement units are changed (e.g., inches to centimeters).
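The sketch below computes r from its definition and checks the two properties above on made-up height and weight values. The data and the function name `correlation` are assumptions for illustration.

```python
import math


def correlation(xs, ys):
    """Pearson correlation r for paired quantitative data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data: heights in inches, weights in pounds.
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 160]

# Same heights expressed in centimeters.
heights_cm = [h * 2.54 for h in heights_in]

r_inches = correlation(heights_in, weights_lb)
r_cm = correlation(heights_cm, weights_lb)

print(r_inches, r_cm)  # the two values agree up to rounding error
```

Rescaling x multiplies both the numerator and the denominator by the same factor, which is why r is unchanged.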
Non-resistance of r
Correlation is not resistant; a single outlier or influential point can strongly change r.
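To see how little it takes, the sketch below adds one unusual point to a made-up data set with a strong positive association. The values and the function name `correlation` are assumptions for illustration.

```python
import math


def correlation(xs, ys):
    """Pearson correlation r for paired quantitative data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a strong positive linear pattern.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 4, 6, 7]
r_without = correlation(xs, ys)

# Add one point far to the right and well below the pattern.
r_with = correlation(xs + [20], ys + [1])

print(r_without)  # strongly positive
print(r_with)     # negative: a single point reversed the sign
```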
Leverage
A property of a point with an extreme x-value (far from x̄) that gives it strong potential to affect the regression line.
Least-squares regression line (LSRL)
The regression line that minimizes the sum of squared residuals, Σ(yᵢ − ŷᵢ)², and passes through (x̄, ȳ).
Regression equation
A linear prediction model written as ŷ = a + bx that predicts the response y from the explanatory variable x.
Predicted value (ŷ)
The value of the response variable predicted by the regression equation for a given x.
Slope (b) in regression
The predicted change in y for each 1-unit increase in x (with units of “y-units per x-unit”).
Intercept (a) in regression
The predicted y-value when x = 0; meaningful only if x = 0 is within the data range and sensible in context.
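A minimal sketch of fitting ŷ = a + bx by least squares, using the standard formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The five data points are made up for illustration.

```python
# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope: predicted change in y per 1-unit increase in x.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx

# Intercept: chosen so the line passes through (x_bar, y_bar).
a = y_bar - b * x_bar

print(a, b)            # about 2.2 and 0.6 for these values
print(a + b * x_bar)   # equals y_bar, up to rounding
```

Because a is defined as ȳ − b·x̄, plugging in x = x̄ always returns ȳ.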
Residual
The prediction error for a point: residual = y − ŷ; positive means the model underpredicted, negative means it overpredicted.
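The sketch below computes residuals for a made-up data set, assuming the least-squares line has already been found to be ŷ = 2.2 + 0.6x for these five points. The sign of each residual shows whether the line underpredicted or overpredicted that point.

```python
# Hypothetical data and its least-squares line (assumed: y_hat = 2.2 + 0.6x).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, y, res in zip(xs, ys, residuals):
    label = "underpredicted" if res > 0 else "overpredicted"
    print(x, y, round(res, 2), label)

# For a least-squares line with an intercept, the residuals sum to 0
# (up to floating-point rounding).
print(sum(residuals))
```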
Residual plot
A graph of residuals versus x used to check model appropriateness; good models show random scatter around 0 with roughly constant spread.
Coefficient of determination (r²)
The proportion of the variability in the response variable y that is explained by the least-squares regression on x; always between 0 and 1.
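One way to compute r² is as 1 − SSE/SST, where SSE is the sum of squared residuals and SST is the total squared deviation of y from ȳ. The sketch below uses made-up data and also checks that the result matches the square of the correlation r.

```python
import math

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
sst = sum((y - y_bar) ** 2 for y in ys)  # total variability in y

# Least-squares line y_hat = a + b*x.
b = sxy / sxx
a = y_bar - b * x_bar

# Variability left unexplained by the line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / sst
r = sxy / math.sqrt(sxx * sst)

print(r_squared)  # about 0.6 for these values
print(r ** 2)     # same number, up to rounding
```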
Extrapolation
Using a regression model to predict y for x-values outside the observed data range; risky because the linear trend may not continue.
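A simple guard against accidental extrapolation is to flag predictions whose x-value falls outside the observed range. The function `predict`, the line ŷ = 2.2 + 0.6x, and the observed range 1 to 5 are all assumptions made for this sketch.

```python
def predict(a, b, x, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line y_hat = a + b*x."""
    y_hat = a + b * x
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation


# Hypothetical fitted line and observed x-range.
a, b = 2.2, 0.6
x_min, x_max = 1, 5

inside = predict(a, b, 3, x_min, x_max)
outside = predict(a, b, 12, x_min, x_max)

print(inside)   # prediction within the data range
print(outside)  # flagged: x = 12 is beyond the observed range
```

The flagged prediction is still computed; the flag only signals that the linear trend has not been observed that far out.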