# Likelihood

Edward A. Roualdes

## Introduction

The likelihood method estimates the parameters of an assumed distribution using random variables from that distribution. To estimate the parameters, standard methods of calculus are employed (Apex Calculus I is a great, free Open Educational Resource if you need to review derivatives, the calculus behind maximization and minimization), although derivatives are taken with respect to the parameter(s) rather than standard variable names, e.g. $x$. The goal is to find the most likely values of the parameter(s) given a set of random variables. As such, the best guess of the parameter(s) derived from this method is called the maximum likelihood estimate.

The logic underlying the likelihood method goes like this. To estimate an unknown parameter $\theta$ given a set of $N$ random variables $\mathbf{X} = (X_1, X_2, \ldots, X_N)$, first set up the likelihood function. Then, treating the parameter $\theta$ as the argument to the likelihood function, find the value that maximizes it. The solution, denoted $\hat{\theta}$, will be a function of the random variables and is called the maximum likelihood estimator. Often, this is written as $\hat{\theta}(\mathbf{X}) = \text{argmax}_{\theta} L( \theta | \mathbf{X})$ to denote that the estimate is the argument that maximizes the likelihood function for the random variables $\mathbf{X}$. (Don't let the new notation $\text{argmax}_x f(x)$ stand in your way. Consider an example that you can reason about somewhat easily: what is the argument $x$ that maximizes the function $f(x) = -x^2$?) The calculus is then left to the practitioner.
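To make the argmax notation concrete, here is a minimal sketch that searches a grid of candidate arguments for the one that maximizes $f(x) = -x^2$. The grid range and spacing are arbitrary choices made for illustration.

```python
# Grid search for the argument x that maximizes f(x) = -x^2.
def f(x):
    return -(x ** 2)

# Candidate arguments: -2.00, -1.99, ..., 2.00 (an arbitrary grid).
candidates = [i / 100 for i in range(-200, 201)]

# max(f) would be the largest value of f;
# argmax is the argument x at which that largest value is attained.
x_best = max(candidates, key=f)

print("argmax:", x_best)
print("max value:", f(x_best))
```

The distinction to notice is that the maximum is a value of the function, while the argmax is the input that produces it.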

This page aims to provide a short introduction to the intuition behind the likelihood function and to show a common analytical strategy for finding the maximum likelihood estimator. Numerical methods on a computer are often employed for more complex problems, where analytical solutions are too difficult.

## Intuition

The likelihood function is defined relative to the density function $f$ of the distribution that is assumed to have generated the random variables. The likelihood is defined to be the product of the density function evaluated at each random variable. We think of the likelihood function as a function of the parameter(s) of interest, here generalized to $\theta$, given the random variables $\mathbf{X}$. $L(\theta | \mathbf{X}) = \prod_{n = 1}^N f(X_n | \theta)$

The intuition behind the product of the density function comes from (the assumed) independence of the random variables. Imagine you have $4$ observations from a fair coin, H, H, T, H. The probability associated with this outcome is $\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2}$.

Now, imagine that you don't know that the coin is fair; instead, the probability of heads is just $p$. (You'd be on the right track if you're imagining that in four flips, three heads and one tail might suggest that $p = 0.75$.) The probability is rewritten as $p \cdot p \cdot (1 - p) \cdot p$

Next, write this probability using the density function of a Bernoulli distribution (see the notes on the Bernoulli distribution if you need a quick refresher). Since we map heads to $1$ and tails to $0$, we have $f(1 | p) \cdot f(1 | p) \cdot f(0 | p) \cdot f(1 | p)$

The last step in understanding the setup of the likelihood function is to recognize that until we observe, say, H, H, T, H, we might as well treat these observations as random variables, $X_1, X_2, X_3, X_4$. In this case the functional form is $f(X_1 | p) \cdot f(X_2 | p) \cdot f(X_3 | p) \cdot f(X_4 | p)$

The discussion above captures the intuition behind the setup of the likelihood function. What remains is notation and a conceptual understanding of how we can treat this product as a function of the unknown parameter $p$.

To get from $f(X_1 | p) \cdot f(X_2 | p) \cdot f(X_3 | p) \cdot f(X_4 | p)$ to the formal definition of the likelihood function, we generalize the unknown parameter $p$ to $\theta$, thinking that this method should apply to any distribution's density function, and we use product notation, which is analogous to summation notation, to expand the sample to any number $N$ of random variables $\prod_{n = 1}^N f(X_n | \theta)$

Once we observe the outcomes of these random variables, their values are bound to specific numbers. Even so, the parameter $\theta$ is still unknown. The conceptual jump of the likelihood function is to treat the form $L(\theta | \mathbf{X}) = \prod_{n = 1}^N f(X_n | \theta )$ as a function of the unknown parameter $\theta$.
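The idea of holding the data fixed and varying only the parameter can be sketched in a few lines of Python. The function names `bernoulli_density` and `likelihood` are hypothetical, chosen here for illustration, and the data are the coin flips H, H, T, H coded as 1, 1, 0, 1.

```python
import math

def bernoulli_density(x, p):
    """Bernoulli density: p when x == 1, and 1 - p when x == 0."""
    return p ** x * (1 - p) ** (1 - x)

def likelihood(theta, data, density):
    """Product of the density evaluated at each observation."""
    return math.prod(density(x, theta) for x in data)

data = [1, 1, 0, 1]  # H, H, T, H

# The data stay fixed; only the parameter changes between calls.
for p in (0.25, 0.5, 0.75):
    print(p, likelihood(p, data, bernoulli_density))
```

Passing the density as an argument mirrors the generalization from $p$ to $\theta$: the same product applies to any distribution's density function.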

By putting $\theta$ first, the notation $L(\theta | \mathbf{X})$ implies that the argument to the likelihood function is the unknown parameter $\theta$ and the random variables $\mathbf{X}$ are held fixed at whatever values they might eventually be bound to. If, for one sample $\mathbf{X}$, more than one value of $\theta$ maximizes the likelihood function, we call the parameter $\theta$ unidentifiable. The specific value of $\theta$ that maximizes the likelihood function is the best guess of the unknown parameter. This value, $\hat{\theta}$, is called the maximum likelihood estimate of $\theta$, where the hat over $\theta$ is used to denote a best guess of the unknown value of $\theta$.

In general, $\hat{\theta}$ will be some function of random variables. Once the random variables are observed and thus bound to some values, we can plug these data into the function and get out an actual number that represents a best guess of the unknown parameter $\theta$.

To bring the general likelihood function back down to earth, consider the following plot depicting the scenario introduced above: the observations H, H, T, H from a Bernoulli distribution with unknown population parameter $p$. From exactly these four observations, (H, H, T, H) represented as (1, 1, 0, 1), the argument that maximizes the likelihood function for random variables from the Bernoulli distribution is $\hat{p} = \frac{1 + 1 + 0 + 1}{4} = 3/4 = 0.75$.
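As a sketch of where the value $0.75$ comes from for these four observations, write the likelihood as a function of $p$, take its natural log (which does not move the location of the maximum), differentiate, and set the derivative equal to zero:

$$
\begin{aligned}
L(p | \mathbf{X}) & = p \cdot p \cdot (1 - p) \cdot p = p^3 (1 - p) \\
\log{L(p | \mathbf{X})} & = 3 \log{p} + \log{(1 - p)} \\
\frac{d}{dp} \log{L(p | \mathbf{X})} & = \frac{3}{p} - \frac{1}{1 - p} = 0 \\
3 (1 - p) & = p \quad \Rightarrow \quad \hat{p} = \frac{3}{4}
\end{aligned}
$$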

TODO remake plot

## Maximization

TODO give user ability to generate data, see the likelihood function from the data, and see a new number dependent on those data.

Suppose you have a sample of $N$ random variables $X_1, \ldots, X_N$ from the Rayleigh distribution. The Rayleigh distribution depends on the parameter $\sigma$, which can be estimated from some data (observed random variables). The density function of the Rayleigh distribution is $f(x | \sigma ) = \frac{x}{\sigma^2} e^{-x^2 / (2 \sigma^2)}$ for $x \in [0, \infty)$ and $\sigma > 0$.

To find the maximum likelihood estimate, the first step is to write out the likelihood function. $L( \sigma | \mathbf{X} ) = \prod_{n = 1}^N f(X_n | \sigma) = \prod_{n = 1}^N \frac{X_n}{\sigma^2} e^{-X_n^2 / (2 \sigma^2)}$

The goal is to find the value of $\sigma$ that maximizes the likelihood function $L( \sigma | \mathbf{X} )$. Both humans and computers have difficulty working with products and exponents of functions. Therefore, the next step is commonly to take the natural log of the likelihood function. Because the log is a strictly increasing function, the value of $\sigma$ that maximizes the log of the likelihood also maximizes the likelihood itself. This step is so common that the log of the likelihood function is often just referred to as the log-likelihood function. We'll denote this function $\ell(\sigma | \mathbf{X}) = \log{L(\sigma | \mathbf{X})}$.
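One reason computers struggle with the product form is floating-point underflow. The sketch below is an illustration under assumptions made here, not part of the derivation: it simulates 2000 Rayleigh observations with $\sigma = 1$ by inverting the distribution function, using an arbitrary seed and the hypothetical helper names `rayleigh_density` and `rayleigh_sample`. It then compares the product of densities with the sum of log densities.

```python
import math
import random

def rayleigh_density(x, sigma):
    """Rayleigh density f(x | sigma) = x / sigma^2 * exp(-x^2 / (2 sigma^2))."""
    return x / sigma ** 2 * math.exp(-(x ** 2) / (2 * sigma ** 2))

def rayleigh_sample(sigma, rng):
    """Draw one Rayleigh value by inverting its distribution function."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
    return sigma * math.sqrt(-2.0 * math.log(u))

rng = random.Random(0)  # arbitrary seed, for reproducibility only
data = [rayleigh_sample(1.0, rng) for _ in range(2000)]

densities = [rayleigh_density(x, 1.0) for x in data]

# Each density is below 1 here, so the product shrinks toward zero
# and eventually underflows in double precision.
likelihood = math.prod(densities)

# The sum of logs carries the same information and stays representable.
log_likelihood = sum(math.log(d) for d in densities)

print("likelihood:", likelihood)
print("log-likelihood:", log_likelihood)
```

With $\sigma = 1$, no Rayleigh density value exceeds $e^{-1/2}$, so a product of 2000 of them is smaller than any positive double and is stored as zero, while the log-likelihood remains an ordinary finite number.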

Recall from calculus that we can find local maxima/minima by differentiating a function, setting the derivative equal to zero, and solving for the variable of interest. In this scenario, the variable of interest is the unknown parameter $\sigma$.

It is helpful to simplify the log-likelihood function before differentiating. The log-likelihood simplifies as \begin{aligned} \ell(\sigma | \mathbf{X}) & = \sum_{n = 1}^N \log \left\{ \frac{X_n}{\sigma^2} e^{-X_n^2 / (2 \sigma^2)} \right\} \\ & = \sum_{n = 1}^N \left\{ \log{X_n} - 2 \log{\sigma} - X_n^2 / (2\sigma^2) \right\} \\ & = \sum_{n = 1}^N \log{X_n} - 2N\log{\sigma} - \frac{1}{2\sigma^2}\sum_{n = 1}^N X_n^2 \\ \end{aligned}

Proceed by taking the derivative of the simplified log-likelihood with respect to the unknown parameter. $\frac{d \ell}{d \sigma} = - \frac{2N}{\sigma} + \frac{1}{\sigma^3} \sum_{n = 1}^N X_n^2$

Set the derivative equal to zero and solve for the unknown parameter. $\frac{2N}{\sigma} = \frac{1}{\sigma^3} \sum_{n=1}^N X_n^2$

Collecting the $\sigma$ terms on the left-hand side yields $2N\sigma^2 = \sum_{n=1}^N X_n^2$

Manipulate the expression until you find a solution for the parameter of interest. At this point, we put a hat over the parameter to recognize that it is our best guess of the parameter $\sigma$ given the set of $N$ random variables $X_1, X_2, \ldots, X_N$. $\hat{\sigma}(\mathbf{X}) = \sqrt{\frac{1}{2N} \sum_{n = 1}^N X_n^2}$
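Setting the derivative to zero locates maxima and minima alike, so it is worth checking which one this solution is. A brief sketch of that check, using the second derivative of the log-likelihood and the fact that $\sum_{n = 1}^N X_n^2 = 2N\hat{\sigma}^2$ at the solution:

$$
\begin{aligned}
\frac{d^2 \ell}{d \sigma^2} & = \frac{2N}{\sigma^2} - \frac{3}{\sigma^4} \sum_{n = 1}^N X_n^2 \\
\left. \frac{d^2 \ell}{d \sigma^2} \right|_{\sigma = \hat{\sigma}} & = \frac{2N}{\hat{\sigma}^2} - \frac{3 \cdot 2N \hat{\sigma}^2}{\hat{\sigma}^4} = - \frac{4N}{\hat{\sigma}^2} < 0
\end{aligned}
$$

The second derivative is negative at $\hat{\sigma}$, so the solution is a maximum of the log-likelihood.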

The maximum likelihood estimate $\hat{\sigma}$ is the final solution; the hat formally notes that this is your best guess based on the data. With data from a Rayleigh distribution, we could plug the observed values of the random variables $X_1, X_2, \ldots, X_N$ into the function $\hat{\sigma}(\mathbf{X})$ to find an estimate for the parameter $\sigma$.
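As a rough check of this result, the sketch below simulates Rayleigh data with a known $\sigma$, computes the closed-form estimate, and compares it with a simple grid search over the simplified log-likelihood. The true value $\sigma = 2$, the sample size, the seed, the grid, and the helper names are all assumptions made for illustration.

```python
import math
import random

def rayleigh_sample(sigma, rng):
    """Draw one Rayleigh value by inverting its distribution function."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
    return sigma * math.sqrt(-2.0 * math.log(u))

def log_likelihood(sigma, data):
    """Simplified Rayleigh log-likelihood, as a function of sigma."""
    n = len(data)
    sum_log_x = sum(math.log(x) for x in data)
    sum_x_sq = sum(x * x for x in data)
    return sum_log_x - 2 * n * math.log(sigma) - sum_x_sq / (2 * sigma ** 2)

def sigma_hat(data):
    """Closed-form estimate: sqrt of the sum of squares over 2N."""
    return math.sqrt(sum(x * x for x in data) / (2 * len(data)))

rng = random.Random(1)  # arbitrary seed, for reproducibility only
true_sigma = 2.0        # assumed "unknown" parameter used to simulate data
data = [rayleigh_sample(true_sigma, rng) for _ in range(1000)]

closed_form = sigma_hat(data)

# Numerical alternative: evaluate the log-likelihood on a grid
# of candidate values 1.00, 1.01, ..., 3.00 and keep the best one.
grid = [1.0 + i / 100 for i in range(201)]
numerical = max(grid, key=lambda s: log_likelihood(s, data))

print("closed form:", closed_form)
print("grid search:", numerical)
```

The two estimates should agree to within the grid spacing, and both should land near the value of $\sigma$ used to simulate the data. A grid search is the crudest numerical method; it is used here only because it needs nothing beyond the standard library.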