Multiple Regression, Standardized/Unstandardized Coefficients

Today, we’ll go over the left side of the SEM Pyramid of Success, from the correlation to multiple regression to path analysis, up to the brink of SEM. An important distinction applicable to all of these techniques is between standardized and unstandardized relationships.

The distinction is probably best illustrated, at this point, with multiple regression. Just to remind everyone, in multiple regression we test how well a number of predictor (independent) variables relate to an outcome (dependent) variable. For example, we could use (a) educational attainment, (b) experience on the job, and (c) performance evaluation as predictors of past-year earnings (outcome). The relationship between each predictor and earnings is computed holding constant the effect of the other predictors (e.g., assuming all respondents were equal in their educational attainment and experience on the job, are higher performance evaluations associated with higher earnings?).

[ADDED December 28, 2007: The following PowerPoint slide show provides an extensive review of multiple regression. I noticed an apparent error on the slide entitled, "The Overall Test...," occurring with slides numbered in the high teens to 20, so for discussion of null hypotheses, you should focus on the slide, with numbering in the 40's, that's titled "Test for Individual Terms."]

For each predictor variable in a multiple-regression analysis, the output will provide an unstandardized regression coefficient (usually depicted with the letter B) and a standardized coefficient (usually depicted with the Greek letter Beta, β). Unstandardized results are probably more straightforward to understand, so let’s discuss them first.

Unstandardized relationships are expressed in terms of the variables' original, raw units. Educational attainment would probably be measured in years of education, whereas earnings would probably be measured in dollars. Thus, the unstandardized (B) coefficient for educational attainment could be something like 2000. This would tell us that, for each increment of one raw unit (year) of education, projected earnings would increase by 2000 raw units of income (dollars).

Standardized results represent what happens after all of the variables (predictors and outcome) have first been converted into z scores, where z = (raw score - mean) / SD. As you'll recall from your earlier stat classes, z scores convey information in standard-deviation (SD) units; for example, someone who has a z score of +1 on a variable is one SD above the sample mean on that variable. If we were measuring respondents' number of miles run per week in an athlete sample, the mean might be, say, 50 miles/week, with an SD of 10. Therefore, an athlete who ran 60 miles/week in training would be at z = (60 - 50) / 10 = +1, or 1 SD above the mean.
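For anyone who likes to see the arithmetic spelled out, here is a minimal Python sketch of the z-score conversion, using the running example's numbers (mean of 50 miles/week, SD of 10). The function name z_score is just an illustrative choice.

```python
def z_score(value, mean, sd):
    """Return how many standard deviations `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# The athlete from the example: 60 miles/week, sample mean 50, SD 10
z = z_score(60, 50, 10)
print(z)  # 1.0
```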

Another nice feature of z scores is that, if the data are distributed normally, you can relate them to a person's percentile ranking in the distribution. For example, someone with a z score of +1 on a given variable (84th percentile) is 34 percentile points ahead of someone who has a z score of 0 (50th percentile).
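The percentile figures can be checked with the standard normal cumulative distribution, which Python's math module lets us build from the error function. This is only a sketch, and it assumes the variable really is normally distributed; the helper name normal_percentile is my own.

```python
import math

def normal_percentile(z):
    """Percentage of a standard normal distribution that falls below z."""
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_at_mean = normal_percentile(0.0)   # z = 0
p_one_sd = normal_percentile(1.0)    # z = +1
print(round(p_at_mean), round(p_one_sd), round(p_one_sd - p_at_mean))  # 50 84 34
```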

Going back to our example of predicting people's earnings, years of experience may have a standardized regression coefficient (β) of .40. This finding would tell us that, for each increment of one SD in years of experience, projected earnings would increase by .40 SDs of income.

To recap to this point:

Unstandardized relationships say that for a one-raw-unit increment on a predictor, the outcome variable increases (or if B is negative, decreases) by B of its own raw units.

Standardized relationships say that for a one-standard-deviation increment on a predictor, the outcome variable increases (or if β is negative, decreases) by β of its own SDs.
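To make the recap concrete, here is a rough Python sketch that fits the same two-predictor regression twice: once on raw units (giving B) and once on z scores (giving β). The data are simulated purely for illustration (they are not real earnings figures), and the helper names are my own. For simplicity the sketch uses only two predictors, which lets the normal equations be solved by hand.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the simulated sample is reproducible

# Simulated (not real) data: years of education, years of experience, earnings in dollars
n = 200
education = [rng.gauss(14, 2.5) for _ in range(n)]
experience = [rng.gauss(10, 5) for _ in range(n)]
earnings = [20000 + 2000 * ed + 800 * ex + rng.gauss(0, 8000)
            for ed, ex in zip(education, experience)]

def centered(values):
    m = statistics.mean(values)
    return [v - m for v in values]

def standardized(values):
    """Convert raw scores to z scores."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (centering takes care of the intercept)."""
    x1, x2, y = centered(x1), centered(x2), centered(y)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Unstandardized solution: raw units in, B coefficients out
b_edu, b_exp = two_predictor_slopes(education, experience, earnings)

# Standardized solution: z scores in, beta coefficients out
beta_edu, beta_exp = two_predictor_slopes(
    standardized(education), standardized(experience), standardized(earnings))

print(f"B:    education = {b_edu:.1f} dollars/year, experience = {b_exp:.1f} dollars/year")
print(f"beta: education = {beta_edu:.3f}, experience = {beta_exp:.3f}")

# Each beta is its B rescaled by SD(predictor) / SD(outcome)
sd_earnings = statistics.stdev(earnings)
rescaled_edu = b_edu * statistics.stdev(education) / sd_earnings
rescaled_exp = b_exp * statistics.stdev(experience) / sd_earnings
print(f"B rescaled: education = {rescaled_edu:.3f}, experience = {rescaled_exp:.3f}")
```

The last line shows that the two solutions describe the same fitted relationship: multiplying a B by the ratio of its predictor's SD to the outcome's SD reproduces the corresponding β.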

When should you use the unstandardized solution and when should you use the standardized one? My own view is as follows: If the raw units are generally familiar (e.g., years, dollars, inches, miles, pounds), I'd go with the unstandardized solution. However, if the variables' raw units are not well-known in everyday usage (e.g., on a marital-satisfaction inventory with a maximum score of 50, what does one point really convey?), then I'd use the standardized solution.

This framework for unstandardized and standardized solutions applies not only to multiple regression, but also to path analysis and SEM. What is not widely known is that the Pearson r, itself, is a statistic based on standardized variables. The correlation has an unstandardized "cousin," the covariance. The formula for converting between correlations and covariances is pretty simple: the correlation equals the covariance divided by the product of the two variables' SDs.
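Here is a small Python sketch of the correlation-covariance relationship. The five data points are made up for illustration, and the helper names are my own. It computes the covariance on raw scores, divides by the product of the SDs to get r, and then confirms that the covariance of the z scores gives the same r.

```python
import statistics

# Made-up scores on two variables, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def covariance(a, b):
    """Sample covariance (n - 1 in the denominator)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def standardized(values):
    """Convert raw scores to z scores."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)

cov_xy = covariance(x, y)          # unstandardized: depends on the raw units
r = cov_xy / (sd_x * sd_y)         # covariance -> correlation
cov_back = r * sd_x * sd_y         # correlation -> covariance
r_from_z = covariance(standardized(x), standardized(y))  # covariance of z scores

print(f"covariance = {cov_xy:.3f}, r = {r:.3f}, r from z scores = {r_from_z:.3f}")
```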

Update (1/19/07): Discussion during our previous class brought out an additional point that I didn't mention in my above write-up (thanks to Kristina).

Within the same regression equation, the different predictor variables' unstandardized B coefficients are not directly comparable to each other, because the raw units for each are (usually) different. In other words, the largest B coefficient will not necessarily be the most significant, as it must be judged in connection with its standard error (B/SE = t, which is used to test for statistical significance).
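A quick numerical illustration, with coefficients and standard errors that are entirely hypothetical (only the B of 2000 for education echoes the earlier example):

```python
# Hypothetical B coefficients and standard errors, for illustration only
coefficients = {
    "education (years)": {"B": 2000.0, "SE": 1500.0},
    "evaluation (points)": {"B": 50.0, "SE": 10.0},
}

# t = B / SE for each predictor
t_values = {name: c["B"] / c["SE"] for name, c in coefficients.items()}

for name, t in t_values.items():
    print(f"{name}: B = {coefficients[name]['B']:.0f}, t = {t:.2f}")
```

In this made-up case the predictor with the far smaller B has the larger t, because its coefficient is estimated much more precisely relative to its size.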

On the other hand, with standardized analyses, all variables have been converted to a common metric, namely standard-deviation (z-score) units, so the β coefficients can meaningfully be compared in magnitude. In this case, whichever predictor variable has the largest β (in absolute value) can be said to have the most potent relationship to the dependent variable. This predictor will often, though not necessarily, also have the greatest significance (smallest p value): significance still depends on each coefficient's standard error, which grows when a predictor is highly correlated with the other predictors.

Added 4/12/15: Phil Ender has a concise overview of key issues in multiple regression.