Likelihood

Edward A. Roualdes

Contents

Introduction
Intuition
Maximization

Introduction

The likelihood method estimates parameters using random variables from an assumed distribution. To estimate the parameters, standard methods of calculus are employed (Apex Calculus I is a great Open Educational Resource, i.e. free, if you need to review derivatives, the calculus behind maximization and minimization), although derivatives are not taken with respect to standard variable names, e.g. $x$. The goal is to find the most likely values of the parameter(s) given a set of random variables. As such, the best guess of the parameter(s) derived from this method is called the maximum likelihood estimate.

The logic underlying the likelihood method goes like this. To estimate an unknown parameter $\theta$ given a set of $N$ random variables $X_1, X_2, \ldots, X_N = \mathbf{X}$, first set up the likelihood function. Then, by treating the parameter $\theta$ as the argument to the likelihood function, find the value that maximizes the likelihood function. The solution, denoted $\hat{\theta}$, will be a function of the random variables and is called the maximum likelihood estimator. Often, this is written as $\hat{\theta}(\mathbf{X}) = \text{argmax}_{\theta} L(\theta | \mathbf{X})$ to denote that the estimate is the maximal argument to the likelihood function for the random variables $\mathbf{X}$. (Don't let the new notation $\text{argmax}_x f(x)$ stand in your way. Consider an example that you can reason about somewhat easily: what is the argument $x$ that maximizes the function $f(x) = -x^2$?) The calculus is then left to the practitioner.
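To answer that warm-up question with the same recipe used throughout this page, differentiate, set the derivative equal to zero, and solve: $$\frac{d}{dx}\left(-x^2\right) = -2x = 0 \quad \Rightarrow \quad x = 0, \qquad \text{so} \qquad \text{argmax}_x \, (-x^2) = 0$$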

This page aims to provide a short introduction to the intuition behind the likelihood function and to show a common analytical strategy for finding the maximum likelihood estimator. (Numerical methods on a computer are often employed for more complex problems, where analytical solutions are too difficult.)

Intuition

The likelihood function is defined relative to the density function $f$ of the distribution that is assumed to have generated the random variables. The likelihood is defined to be the product of the density function evaluated at each random variable. We think of the likelihood function as a function of the parameter(s) of interest, here generalized to $\theta$, given the random variables $\mathbf{X}$: $$L(\theta | \mathbf{X}) = \prod_{n = 1}^N f(X_n | \theta)$$

The intuition behind the product of the density function comes from the (assumed) independence of the random variables. Imagine you have 4 observations from a fair coin: H, H, T, H. The probability associated with this outcome is $\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2}$.

Now, imagine that you don't know that the coin is fair; instead, the probability of heads is just $p$. The probability is rewritten as $$p \cdot p \cdot (1 - p) \cdot p$$ (You'd be on the right track if you're imagining that, in four flips, three heads and one tail might suggest $p = 0.75$.)

Next, write this probability using the density function of a Bernoulli distribution (see the notes on the Bernoulli distribution if you need a quick refresher). Since we map heads to $1$ and tails to $0$, we have $$f(1 | p) \cdot f(1 | p) \cdot f(0 | p) \cdot f(1 | p)$$

The last step in understanding the setup of the likelihood function is to recognize that until we observe, say, H, H, T, H, we might as well treat these observations as random variables, $X_1, X_2, X_3, X_4$. In this case the functional form is $$f(X_1 | p) \cdot f(X_2 | p) \cdot f(X_3 | p) \cdot f(X_4 | p)$$

The discussion above captures the intuition behind the setup of the likelihood function. From here, what remains is mostly notation, together with the conceptual step of treating this product as a function of the unknown parameter $p$.

To get from $f(X_1 | p) \cdot f(X_2 | p) \cdot f(X_3 | p) \cdot f(X_4 | p)$ to the formal definition of the likelihood function, we generalize the unknown parameter $p$ to $\theta$, thinking that this method should apply to any distribution's density function, and we use product notation, which is analogous to summation notation, to expand the sample to any number $N$ of random variables: $$\prod_{n = 1}^N f(X_n | \theta)$$

Once we observe the outcomes of these random variables, their values are bound to specific numbers. Even so, the parameter $\theta$ is still unknown. The conceptual jump of the likelihood function is to treat the form $$L(\theta | \mathbf{X}) = \prod_{n = 1}^N f(X_n | \theta)$$ as a function of the unknown parameter $\theta$.

By putting $\theta$ first, the notation $L(\theta | \mathbf{X})$ implies that the argument to the likelihood function is the unknown parameter $\theta$ and the random variables $\mathbf{X}$ are held fixed at whatever values they might eventually be bound to. (If a likelihood function maps one sample $\mathbf{X}$ to more than one value of $\theta$, we call the parameters $\theta$ unidentifiable.) The specific value of $\theta$ that maximizes the likelihood function is the best guess of the unknown parameter. The value $\hat{\theta}$ is called the maximum likelihood estimate of $\theta$, where the hat $\hat{}$ over $\theta$ is used to denote a best guess of the unknown value of $\theta$.

In general, $\hat{\theta}$ will be some function of random variables. Once the random variables are observed and thus bound to some values, we can plug these data into the function and get out an actual number that represents a best guess of the unknown parameter $\theta$.

To bring the general likelihood function back down to earth, consider the following plot depicting the scenario introduced above: the observations H, H, T, H from a Bernoulli distribution with unknown population parameter $p$. From exactly these four observations, (H, H, T, H) represented as (1, 1, 0, 1), the argument that maximizes the likelihood function for random variables from the Bernoulli distribution is $\hat{p} = \frac{1 + 1 + 0 + 1}{4} = 3/4 = 0.75$.

[Plot: the likelihood function $L(p | \mathbf{X})$ for the observations (1, 1, 0, 1) as a function of $p$, with its maximum at $\hat{p} = 0.75$.]
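In place of the plot, here is a minimal sketch in Python (names such as bernoulli_likelihood and grid are mine, chosen only for illustration) that evaluates the likelihood of the observations (1, 1, 0, 1) over a grid of candidate values of $p$ and reports the maximizing argument.

```python
import numpy as np

X = np.array([1, 1, 0, 1])  # H, H, T, H coded as 1s and 0s

def bernoulli_likelihood(p, X):
    # L(p | X): product of Bernoulli densities f(x | p) = p^x (1 - p)^(1 - x)
    return np.prod(p**X * (1 - p)**(1 - X))

# Evaluate the likelihood on a grid of candidate values of p and take the argmax.
grid = np.linspace(0, 1, 1001)
L = np.array([bernoulli_likelihood(p, X) for p in grid])
print(grid[np.argmax(L)])  # 0.75, matching (1 + 1 + 0 + 1) / 4
```

The grid search recovers the same answer as the calculus-based strategy shown in the next section.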

Maximization


Suppose you have a sample of $N$ random variables $X_1, \ldots, X_N$ from the Rayleigh distribution. The Rayleigh distribution depends on the parameter $\sigma$, which can be estimated from some data (observed random variables). The density function of the Rayleigh distribution is $$f(x | \sigma) = \frac{x}{\sigma^2} e^{-x^2 / (2 \sigma^2)}$$ for $x \in [0, \infty)$ and $\sigma > 0$.
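Before any calculus, it can help to see the density as code. The sketch below (the name rayleigh_pdf is mine, not part of the original text) evaluates $f(x | \sigma)$ with NumPy and checks numerically that it integrates to roughly one.

```python
import numpy as np

def rayleigh_pdf(x, sigma):
    # Rayleigh density f(x | sigma) = (x / sigma^2) * exp(-x^2 / (2 sigma^2)), x >= 0
    return (x / sigma**2) * np.exp(-x**2 / (2 * sigma**2))

# The density should integrate to (approximately) 1 over [0, infinity);
# with sigma = 2 the density is numerically negligible well before x = 50.
x = np.linspace(0, 50, 100_000)
print(np.trapz(rayleigh_pdf(x, sigma=2.0), x))  # ~1.0
```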

To find the maximum likelihood estimate, start by writing out the likelihood function: $$L(\sigma | \mathbf{X}) = \prod_{n = 1}^N f(X_n | \sigma) = \prod_{n = 1}^N \frac{X_n}{\sigma^2} e^{-X_n^2 / (2 \sigma^2)}$$

The goal is to find the value of $\sigma$ that maximizes the likelihood function $L(\sigma | \mathbf{X})$. Both humans and computers have difficulty working with products and exponents of functions. Therefore, it is common to take the natural log of the likelihood function. This is so common that the log of the likelihood function is often just referred to as the log-likelihood function. We'll denote this function $\ell(\sigma | \mathbf{X}) = \log{L(\sigma | \mathbf{X})}$.

Recall from calculus that we can find local maxima/minima by differentiating a function, setting the derivative equal to zero, and solving for the variable of interest. In this scenario, the variable of interest is the unknown parameter $\sigma$.

It is helpful to simplify the log-likelihood function before differentiating. The log-likelihood simplifies as $$\begin{aligned} \ell(\sigma | \mathbf{X}) & = \sum_{n = 1}^N \log \left\{ \frac{X_n}{\sigma^2} e^{-X_n^2 / (2 \sigma^2)} \right\} \\ & = \sum_{n = 1}^N \left\{ \log{X_n} - 2 \log{\sigma} - X_n^2 / (2\sigma^2) \right\} \\ & = \sum_{n = 1}^N \log{X_n} - 2N\log{\sigma} - \frac{1}{2\sigma^2}\sum_{n = 1}^N X_n^2 \end{aligned}$$
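Algebraic simplifications like this are easy to get wrong, so it can be reassuring to check them numerically. The sketch below uses a made-up sample and an arbitrary positive value of $\sigma$, purely for the check, and compares the simplified expression against summing the log of the density directly.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.rayleigh(scale=1.5, size=20)  # a made-up sample, only for the check
sigma = 2.0                           # any positive value of sigma works here
N = len(X)

# Log-likelihood computed directly from the density ...
direct = np.sum(np.log((X / sigma**2) * np.exp(-X**2 / (2 * sigma**2))))

# ... and from the simplified expression above.
simplified = np.sum(np.log(X)) - 2 * N * np.log(sigma) - np.sum(X**2) / (2 * sigma**2)

print(np.isclose(direct, simplified))  # True
```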

Proceed by taking the derivative of the simplified log-likelihood with respect to the unknown parameter: $$\frac{d \ell}{d \sigma} = - \frac{2N}{\sigma} + \frac{1}{\sigma^3} \sum_{n = 1}^N X_n^2$$
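If the differentiation step is not obvious, each term of the simplified log-likelihood contributes as follows; the first term does not involve $\sigma$, so it drops out. $$\frac{d}{d\sigma} \sum_{n=1}^N \log{X_n} = 0, \qquad \frac{d}{d\sigma} \left( -2N \log{\sigma} \right) = -\frac{2N}{\sigma}, \qquad \frac{d}{d\sigma} \left( -\frac{1}{2\sigma^2} \sum_{n=1}^N X_n^2 \right) = \frac{1}{\sigma^3} \sum_{n=1}^N X_n^2$$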

Set the derivative equal to zero and solve for the unknown parameter: $$\frac{2N}{\sigma} = \frac{1}{\sigma^3} \sum_{n=1}^N X_n^2$$

Collecting $\sigma$s on the left-hand side yields $$2N\sigma^2 = \sum_{n=1}^N X_n^2$$

Manipulate the expression until you find a solution for the parameter of interest. At this point, we put a hat over the parameter to recognize that it is our best guess of the parameter $\sigma$ given the set of $N$ random variables $X_1, X_2, \ldots, X_N$: $$\hat{\sigma}(\mathbf{X}) = \sqrt{\frac{1}{2N} \sum_{n = 1}^N X_n^2}$$

The maximum likelihood estimate $\hat{\sigma}$ is the final solution. With data from a Rayleigh distribution, we could use the function defined by $\hat{\sigma}(\mathbf{X})$ to plug in the observed values of the random variables $X_1, X_2, \ldots, X_N$ and find an estimate for the parameter $\sigma$.
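As a closing sketch (the true value of $\sigma$, the sample size, and the seed below are arbitrary choices, not part of the original text), we can simulate Rayleigh data with NumPy and confirm that plugging the data into $\hat{\sigma}(\mathbf{X})$ returns something close to the value used to generate them.

```python
import numpy as np

rng = np.random.default_rng(7)
sigma_true = 2.0
N = 5_000

# Simulate N observations from a Rayleigh distribution with parameter sigma_true.
X = rng.rayleigh(scale=sigma_true, size=N)

# Plug the data into the maximum likelihood estimator derived above.
sigma_hat = np.sqrt(np.sum(X**2) / (2 * N))
print(sigma_hat)  # close to 2.0
```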


This work is licensed under the Creative Commons Attribution 4.0 International License.