# Segment 20 Sanmit Narvekar


## Segment 20

#### To Calculate

1. (See lecture slide 3.) For one-dimensional $\displaystyle x$ , the model $\displaystyle y(x | \mathbf b)$ is called "linear" if $\displaystyle y(x | \mathbf b) = \sum_k b_k X_k(x)$ , where $\displaystyle X_k(x)$ are arbitrary known functions of $\displaystyle x$ . Show that minimizing $\displaystyle \chi^2$ produces a set of linear equations (called the "normal equations") for the parameters $\displaystyle b_k$ .

First we write down $\displaystyle \chi^2$ , the quantity we wish to minimize:

$\displaystyle \chi^2 = \sum_i \left( \frac{y_i - y(x_i|b)}{\sigma_i}\right)^2$

where the sum runs over the data points, indexed by i. Now we minimize with respect to each parameter b_k:

$\displaystyle \frac{\partial \chi^2}{\partial b_k} = -2 \sum_i \left( \frac{y_i - y(x_i|b)}{\sigma_i}\right) \frac{X_k(x_i)}{\sigma_i} = 0$

We drop the constant prefactor since it doesn't affect where the minimum occurs. Expanding the inner term and rearranging gives:

$\displaystyle \sum_i \frac{1}{\sigma_i^2}y_i X_k(x_i) = \sum_i \frac{1}{\sigma_i^2} y(x_i|b) X_k(x_i)$

Thus we obtain one linear equation for each parameter b_k; collectively, these are the normal equations.
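As a concrete check, the normal equations can be assembled and solved with a few lines of linear algebra. The sketch below uses NumPy; the polynomial basis choice $\displaystyle X_k(x) = x^k$ and the function name are illustrative, not from the lecture:

```python
import numpy as np

def normal_equations_fit(x, y, sigma, degree=2):
    """Solve the normal equations for a weighted linear least-squares fit
    with basis X_k(x) = x**k, k = 0..degree (an illustrative choice)."""
    # Design matrix: column k holds X_k(x_i) / sigma_i
    A = np.vander(x, degree + 1, increasing=True) / sigma[:, None]
    beta = y / sigma
    # Normal equations: (A^T A) b = A^T beta
    return np.linalg.solve(A.T @ A, A.T @ beta)

# Noiseless synthetic data drawn from y = 1 + 2x + 3x^2
x = np.linspace(0, 1, 20)
y = 1 + 2 * x + 3 * x**2
sigma = np.full_like(x, 0.1)
b = normal_equations_fit(x, y, sigma)   # recovers [1, 2, 3]
```

Because the data lie exactly on the model curve, the solver recovers the generating coefficients.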

2. A simple example of a linear model is $\displaystyle y(x | \mathbf b) = b_0 + b_1 x$ , which corresponds to fitting a straight line to data. What are the MLE estimates of $\displaystyle b_0$ and $\displaystyle b_1$ in terms of the data: $\displaystyle x_i$ 's, $\displaystyle y_i$ 's, and $\displaystyle \sigma_i$ 's?

Using the above derivation with $\displaystyle X_0(x) = 1$ and $\displaystyle X_1(x) = x$, the first normal equation (for b_0) gives:

$\displaystyle \sum_i \frac{1}{\sigma_i^2}y_i = \sum_i \frac{1}{\sigma_i^2} (b_0 + b_1x_i)$

By some simple rewriting:

$\displaystyle b_0 = \frac{\sum_i \frac{1}{\sigma_i^2} y_i - \sum_i \frac{1}{\sigma_i^2} b_1 x_i}{\sum_i \frac{1}{\sigma_i^2}}$

We can do the same for b_1. First the normal equation:

$\displaystyle \sum_i \frac{1}{\sigma_i^2}y_ix_i = \sum_i \frac{1}{\sigma_i^2}(b_0 + b_1x_i) x_i$

$\displaystyle b_1 = \frac{\sum_i \frac{x_iy_i}{\sigma_i^2} - \sum_i \frac{b_0x_i}{\sigma_i^2}}{\sum_i \frac{x_i^2}{\sigma_i^2}}$

Note that in the above expressions b_0 appears in the equation for b_1 and vice versa. Substituting one expression into the other eliminates the coupling and yields closed-form estimates for both parameters.
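Carrying out that substitution leads to the standard weighted sums S, S_x, S_y, S_xx, S_xy and a closed-form solution of the two coupled equations. A minimal NumPy sketch (function name is illustrative):

```python
import numpy as np

def fit_line(x, y, sigma):
    """Weighted straight-line fit: solve the two normal equations jointly."""
    w = 1.0 / sigma**2
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x**2).sum(), (w * x * y).sum()
    delta = S * Sxx - Sx**2
    b0 = (Sxx * Sy - Sx * Sxy) / delta
    b1 = (S * Sxy - Sx * Sy) / delta
    return b0, b1

x = np.array([0., 1., 2., 3.])
y = 2.0 + 0.5 * x          # exact line: b0 = 2, b1 = 0.5
sigma = np.ones_like(x)
b0, b1 = fit_line(x, y, sigma)
```

On data that fall exactly on a line, the formulas return the generating intercept and slope.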

#### To Think About

1. We often rather casually assume a uniform prior $\displaystyle P(\mathbf b)= \text{constant}$ on the parameters $\displaystyle \mathbf b$ . If the prior is not uniform, then is minimizing $\displaystyle \chi^2$ the right thing to do? If not, then what should you do instead? Can you think of a situation where the difference would be important?

A non-uniform prior over the parameters acts like regularization: instead of minimizing $\displaystyle \chi^2$ alone, we should minimize the negative log posterior, $\displaystyle \chi^2 - 2\ln P(\mathbf b)$ (up to a constant), which gives the MAP estimate rather than the MLE. This difference matters, for example, when there are many parameters relative to the data, where a suitable prior helps prevent overfitting.
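As a sketch of the difference, assume a Gaussian prior $\displaystyle \mathbf b \sim N(0, \tau^2 I)$ (an illustrative choice, not from the lecture): the MAP estimate then solves ridge-regularized normal equations, with the prior adding $\displaystyle 1/\tau^2$ to the diagonal:

```python
import numpy as np

def map_line_fit(x, y, sigma, tau):
    """MAP straight-line fit under an assumed Gaussian prior b ~ N(0, tau^2 I).
    Minimizing chi^2 + |b|^2 / tau^2 just augments the normal equations."""
    A = np.column_stack([np.ones_like(x), x]) / sigma[:, None]
    beta = y / sigma
    # Prior contributes 1/tau^2 on the diagonal (ridge regularization)
    lhs = A.T @ A + np.eye(2) / tau**2
    return np.linalg.solve(lhs, A.T @ beta)

x = np.array([0., 1., 2., 3.])
y = 2.0 + 0.5 * x
sigma = np.ones_like(x)
b_weak = map_line_fit(x, y, sigma, tau=1e6)    # nearly uniform prior: plain fit
b_strong = map_line_fit(x, y, sigma, tau=0.1)  # tight prior shrinks b toward 0
```

As $\displaystyle \tau \to \infty$ the prior becomes uniform and the MAP estimate reduces to the ordinary $\displaystyle \chi^2$ minimum; a tight prior pulls the parameters toward zero.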

2. What if, in lecture slide 2, the measurement errors were $\displaystyle e_i \sim \text{Cauchy}(0,\sigma_i)$ instead of $\displaystyle e_i \sim N(0,\sigma_i)$ ? How would you find MLE estimates for the parameters $\displaystyle \mathbf b$ ?
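One approach, sketched below: write down the Cauchy negative log-likelihood, $\displaystyle \sum_i \ln\!\left(1 + \left(\frac{y_i - b_0 - b_1 x_i}{\sigma_i}\right)^2\right)$ up to constants, and minimize it numerically, since no closed form exists. A coarse grid search stands in here for a real optimizer (e.g. scipy.optimize.minimize); the data, with one gross outlier, are illustrative:

```python
import numpy as np

def cauchy_nll(b0, b1, x, y, sigma):
    """Negative log-likelihood for a line with Cauchy errors (up to constants)."""
    r = (y - b0 - b1 * x) / sigma
    return np.sum(np.log1p(r**2))

x = np.array([0., 1., 2., 3., 4.])
y = 2.0 + 0.5 * x
y[4] = 20.0                    # one gross outlier
sigma = np.ones_like(x)

# Coarse grid search over (b0, b1); a real optimizer would polish this
b0_grid = np.linspace(-5, 5, 101)
b1_grid = np.linspace(-5, 5, 101)
nll = np.array([[cauchy_nll(b0, b1, x, y, sigma) for b1 in b1_grid]
                for b0 in b0_grid])
i, j = np.unravel_index(np.argmin(nll), nll.shape)
b0, b1 = b0_grid[i], b1_grid[j]
```

The heavy Cauchy tails make the fit robust: the recovered line stays near the true b0 = 2, b1 = 0.5 despite the outlier, whereas a Gaussian ($\displaystyle \chi^2$) fit would be dragged toward it.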