Maximum Likelihood Estimation

(Updated April 10, 2018)

Today, let's take some time to talk about Maximum Likelihood Estimation (MLE), which is the default estimation procedure in AMOS and is considered the standard for the field. In my view, MLE is not as intuitively graspable as Ordinary Least Squares (OLS) estimation, which simply seeks to locate the best-fitting line in a scatter plot of data, so that the line is as close as possible to the full set of data points. In other words, OLS minimizes the sum of the squared deviations between each actual data point and where an individual with a given score on the X-axis would fall on the best-fitting line, hence "least squares." However, Maximum Likelihood is considered to be statistically advantageous.
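To make the "least squares" idea concrete, here is a minimal Python sketch. The five data points are made up for illustration and the function names are my own; the code uses the textbook closed-form solution for a single predictor, and then checks that nudging the slope away from the fitted value makes the sum of squared deviations larger.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Cross-products and squared deviations around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented purely for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = ols_fit(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)
# Any other line, e.g. one with a slightly steeper slope, fits worse
worse = sum_squared_residuals(xs, ys, slope + 0.1, intercept)
print(slope, intercept, best, worse)
```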

This website maintained by S. Purcell provides what I think is a very clear, straightforward introduction to MLE. In particular, we'll want to look at the second major heading on the page that comes up, Model-Fitting.

Purcell describes the mission of MLE as being to "find the parameter values that make the observed data most likely." Here's an analogy I came up with, fitting Purcell's definition. Suppose we observed a group of people laughing uproariously (the "data"). One could then ask which generating model would make the laughter most likely: a television comedy show or a drama about someone dying of cancer?
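Here is a rough Python sketch of the laughter analogy. Everything numerical in it is an assumption of mine for illustration: 9 of 10 observed people laughing, and laugh probabilities of .80 for the comedy and .05 for the drama. The first part asks which of the two candidate models makes the data more likely; the second part searches a grid of candidate probabilities for the single value that makes the data most likely, which is the maximum likelihood estimate in this toy setting.

```python
import math


def log_likelihood(p_laugh, n_laughing, n_people):
    """Log-likelihood of seeing n_laughing out of n_people laugh, if each
    person laughs independently with probability p_laugh. The binomial
    coefficient is omitted because it does not depend on p_laugh."""
    return (n_laughing * math.log(p_laugh)
            + (n_people - n_laughing) * math.log(1 - p_laugh))


# Hypothetical data: 9 of the 10 people we observed are laughing
n_people, n_laughing = 10, 9

# Two candidate generating models with assumed laugh probabilities
models = {"comedy": 0.80, "drama": 0.05}
most_likely_model = max(
    models, key=lambda name: log_likelihood(models[name], n_laughing, n_people)
)

# Rather than two candidates, try every probability from .01 to .99
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, n_laughing, n_people))

print(most_likely_model, p_hat)
```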

Another site lists some of the advantages of MLE, vis-a-vis OLS.

Lindsay Reed, our former computer lab director, once loaned me a book on the history of statistics, the unusually titled The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (by David Salsburg, published in 2001).

This book discusses the many statistical contributions of Sir Ronald A. Fisher, among which is MLE. Writes Salsburg:

In spite of Fisher's ingenuity, the majority of situations presented intractable mathematics to the potential user of the MLE (p. 68).

Practically speaking, obtaining MLE solutions required repeated iterations, which were very difficult to carry out until the computer revolution. Citing the 16th-century mathematician Robert Recorde, Salsburg writes: first guess the answer and apply it to the problem. There will be a discrepancy between the result of using this guess and the result you want. You take that discrepancy and use it to produce a better guess... For Fisher's maximum likelihood, it might take thousands or even millions of iterations before you get a good answer... What are a mere million iterations to a patient computer? (p. 70).

UPDATE I: The 2013 textbook by Texas Tech Business Administration professor Peter Westfall and Kevin Henning, Understanding Advanced Statistical Methods, includes additional description of MLE. The above-referenced Purcell page provides an example with a relatively simple equation for the likelihood function. Westfall and Henning, while providing a more mathematically intense discussion of MLE, have several good explanatory quotes:

In cases of complex advanced statistical models such as regressions, structural equation models, and neural networks, there are often dozens or perhaps even hundreds of parameters in the likelihood function (p. 317).

In practice, likelihood functions tend to be much more complicated [than the book's examples], and you won't be able to solve the calculus problem even if you excel at math. Instead, you'll have to use numerical methods, a fancy term for "letting the computer do the calculus for you." ... Numerical methods for finding MLEs work by iterative approximation. They start with an initial guess... then update the guess to some value... by climbing up the likelihood function... The iteration continues until the successive values... are so close to one another that the computer is willing to assume that the peak has been achieved. When this happens, the algorithm is said to converge (p. 325; emphasis in original).
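The iterate-until-convergence idea in that quote can be sketched in a few lines of Python. This is only a toy, under assumptions of my own: the data are 7 "successes" in 10 trials, the method is plain gradient ascent with a fixed step size, and the function names, starting value, and tolerance are all hypothetical choices. It is not the algorithm AMOS uses; it just shows a guess being updated by climbing the log-likelihood until successive values barely change.

```python
def slope_of_log_likelihood(p, successes, trials):
    """Derivative of the binomial log-likelihood with respect to p."""
    return successes / p - (trials - successes) / (1 - p)


def climb_likelihood(successes, trials, start=0.5, step=0.01,
                     tol=1e-8, max_iter=10_000):
    """Gradient ascent on the log-likelihood.

    Returns (estimate, iterations_used, converged)."""
    p = start
    for iteration in range(1, max_iter + 1):
        # Move uphill in proportion to the current slope
        new_p = p + step * slope_of_log_likelihood(p, successes, trials)
        # Keep the guess strictly between 0 and 1
        new_p = min(max(new_p, 1e-6), 1 - 1e-6)
        # Converged: successive guesses are nearly identical
        if abs(new_p - p) < tol:
            return new_p, iteration, True
        p = new_p
    # Ran out of iterations: the analogue of a "failed to converge" message
    return p, max_iter, False


p_hat, iterations, converged = climb_likelihood(successes=7, trials=10)
print(p_hat, iterations, converged)
```

In this simple case the answer could be found by hand (7/10), which makes it easy to check that the climbing procedure ends up in the right place.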

This is what the Minimization History portion of the AMOS output refers to, along with the possible error message that one's model has failed to converge.

UPDATE II: The reference given by our 2014 guest speaker on MLE is:

Ferron, J. M., & Hess, M. R. (2007). Estimation in SEM: A concrete example. Journal of Educational and Behavioral Statistics, 32, 110-120.

Deriving Degrees of Freedom for Love Style Model (Plus Discussion of Free vs. Fixed Parameters)

(Updated February 18, 2017)

The advice below applies when one is running models using the AMOS program. Suggestions when using ONYX are shown in red.

A key element of this discussion involves freely estimated (or free) parameters vs. fixed parameters. The term "freely estimated" refers to the program determining the value for a path or variance in accordance with the data and the mathematical estimation procedure. A freely estimated path might come out as .23 or .56 or -.33, for example. Freely estimated parameters are what we're used to thinking about. However, for technical reasons, we sometimes must "fix" a value, usually to 1. This means that a given path or variance will take on a value of 1 in the model, simply because we tell it to. Fixed values apply only to unstandardized solutions; a value fixed to 1 will appear as 1 in an unstandardized solution, but will usually appear as something different in a standardized solution. These examples should become clearer as we work through models.

Here is an initial example with a hypothetical one-factor, three-indicator model (thanks to Andrea P. for the photograph). Without fixing the unstandardized factor loading for indicator "a" to 1 (in AMOS), the model would be seeking to freely estimate 7 unknown parameters from only 6 known pieces of information. The model would thus be under-identified (also referred to as "unidentified"), which metaphorically is like being in "debt."
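The counting for this small model can be sketched in Python as follows. The breakdown of the 7 unknowns (3 factor loadings, 3 indicator residual variances, 1 factor variance) is my reading of the usual way such a model is parameterized, and the function name is hypothetical. The "knowns" are the unique variances and covariances among the indicators, p(p + 1)/2 for p indicators.

```python
def count_knowns(n_indicators):
    """Unique variances and covariances among the observed indicators."""
    return n_indicators * (n_indicators + 1) // 2


n_indicators = 3
knowns = count_knowns(n_indicators)  # 3 variances + 3 covariances

# Everything freely estimated (assumed breakdown of the 7 unknowns)
unknowns_all_free = {
    "factor loadings": 3,
    "residual variances": 3,
    "factor variance": 1,
}

# Loading for indicator "a" fixed to 1, so one fewer free loading
unknowns_one_loading_fixed = {
    "factor loadings": 2,
    "residual variances": 3,
    "factor variance": 1,
}

df_all_free = knowns - sum(unknowns_all_free.values())
df_one_loading_fixed = knowns - sum(unknowns_one_loading_fixed.values())

# Negative df means under-identified ("in debt"); zero means just-identified
print(knowns, df_all_free, df_one_loading_fixed)
```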

Keiley et al. (2005, in Sprenkle & Piercy, eds., Research Methods in Family Therapy) discuss the metric-setting rationale for fixing a single loading per factor to 1:

One of the problems we face in SEM is that the latent constructs are unobserved; therefore, we do not know their natural metric. One of the ways that we define the true score metric is by setting one scaling factor loading to 1.00 from each group of items (pp. 446-447).

In ONYX, it seems to make more sense to me to let all the factor loadings be freely estimated (none of them fixed to 1), but instead fix the factor variance to 1.

Below is the photograph Kristina took of the board in 2008, with the derivation of degrees of freedom for the Hendrick & Hendrick Love Styles model. (This photo has been annotated over the years.)

In ONYX, there are also 63 unknown, freely estimated parameters, but I would allocate them differently than I would in AMOS. In ONYX, there would be 24 free factor loadings; 15 non-directional correlations; and 24 indicator residuals. (I would fix the 6 construct variances to 1 in ONYX.)
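As a check on the arithmetic, here is a short Python sketch of the two allocations. The ONYX numbers come straight from the paragraph above. The AMOS breakdown is my reconstruction, assuming one loading per construct is fixed to 1 and the 6 construct variances are freely estimated. The degrees of freedom are computed from the usual p(p + 1)/2 formula for the number of knowns with 24 indicators.

```python
n_constructs = 6
n_indicators = 24

# Knowns: unique variances and covariances among the 24 indicators
knowns = n_indicators * (n_indicators + 1) // 2

# Non-directional paths among the 6 constructs
n_construct_pairs = n_constructs * (n_constructs - 1) // 2

# AMOS-style allocation (assumed): one loading per construct fixed to 1
amos_unknowns = {
    "free factor loadings": n_indicators - n_constructs,
    "construct variances": n_constructs,
    "construct covariances": n_construct_pairs,
    "indicator residuals": n_indicators,
}

# ONYX-style allocation: all loadings free, construct variances fixed to 1
onyx_unknowns = {
    "free factor loadings": n_indicators,
    "construct correlations": n_construct_pairs,
    "indicator residuals": n_indicators,
}

amos_total = sum(amos_unknowns.values())
onyx_total = sum(onyx_unknowns.values())
degrees_of_freedom = knowns - onyx_total

print(knowns, amos_total, onyx_total, degrees_of_freedom)
```

Either way of setting the metric uses up the same number of unknowns, so the degrees of freedom do not depend on which program's convention is followed.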

This slideshow (especially slides 29-31) provides more information on making sure your model is identified.

One of the students in the class, noting the repeated references to "knowns" and "unknowns" in running the model, sent me this video link to provide some levity.