# Segment 20 Sanmit Narvekar


#### To Calculate

1. (See lecture slide 3.) For one-dimensional $\displaystyle x$ , the model $\displaystyle y(x | \mathbf b)$ is called "linear" if $\displaystyle y(x | \mathbf b) = \sum_k b_k X_k(x)$ , where $\displaystyle X_k(x)$ are arbitrary known functions of $\displaystyle x$ . Show that minimizing $\displaystyle \chi^2$ produces a set of linear equations (called the "normal equations") for the parameters $\displaystyle b_k$ .

First we write down $\displaystyle \chi^2$ , the quantity we wish to minimize:

$\displaystyle \chi^2 = \sum_i \left( \frac{y_i - y(x_i|b)}{\sigma_i}\right)^2$

where i indexes the data points. Now we minimize with respect to each parameter b_k:

$\displaystyle \frac{\partial \chi^2}{\partial b_k} = -2 \sum_i \left( \frac{y_i - y(x_i|\mathbf b)}{\sigma_i}\right) \frac{X_k(x_i)}{\sigma_i} = 0$

We can drop the constant factor -2, since it doesn't affect the roots. Expanding the inner term and rearranging:

$\displaystyle \sum_i \frac{1}{\sigma_i^2}y_i X_k(x_i) = \sum_i \frac{1}{\sigma_i^2} y(x_i|\mathbf b) X_k(x_i) = \sum_j b_j \sum_i \frac{1}{\sigma_i^2} X_j(x_i) X_k(x_i)$

The right-hand side is linear in the parameters b_j, so this gives one linear equation for each parameter b_k — the normal equations.
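As a concrete sketch (all names and data here are illustrative, not from the lecture), the normal equations can be assembled as a k-by-k linear system A b = c, with A_{jk} = sum_i X_j(x_i) X_k(x_i) / sigma_i^2 and c_k = sum_i y_i X_k(x_i) / sigma_i^2, and solved directly:

```python
# Sketch: assemble and solve the normal equations for a linear model
# y(x|b) = sum_k b_k * X_k(x).  Names and data are illustrative.

def normal_equations(xs, ys, sigmas, basis):
    """Build A b = c with A[j][k] = sum_i X_j(x_i) X_k(x_i) / sigma_i^2
    and c[k] = sum_i y_i X_k(x_i) / sigma_i^2."""
    m = len(basis)
    A = [[sum(basis[j](x) * basis[k](x) / s**2
              for x, s in zip(xs, sigmas)) for k in range(m)]
         for j in range(m)]
    c = [sum(y * basis[k](x) / s**2
             for x, y, s in zip(xs, ys, sigmas)) for k in range(m)]
    return A, c

def solve(A, c):
    """Tiny Gaussian elimination with partial pivoting (fine for small k)."""
    n = len(c)
    M = [row[:] + [c[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        b[r] = (M[r][n] - sum(M[r][k] * b[k] for k in range(r + 1, n))) / M[r][r]
    return b

# Fit noise-free data on y = 1 + 2x with basis {1, x} and unit errors
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
sigmas = [1.0] * 4
A, c = normal_equations(xs, ys, sigmas, [lambda x: 1.0, lambda x: x])
b = solve(A, c)   # recovers b ≈ [1.0, 2.0]
```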

2. A simple example of a linear model is $\displaystyle y(x | \mathbf b) = b_0 + b_1 x$ , which corresponds to fitting a straight line to data. What are the MLE estimates of $\displaystyle b_0$ and $\displaystyle b_1$ in terms of the data: $\displaystyle x_i$ 's, $\displaystyle y_i$ 's, and $\displaystyle \sigma_i$ 's?

Using the above derivation with basis functions X_0(x) = 1 and X_1(x) = x, the first normal equation (k = 0) is:

$\displaystyle \sum_i \frac{1}{\sigma_i^2}y_i = \sum_i \frac{1}{\sigma_i^2} (b_0 + b_1x_i)$

By some simple rewriting:

$\displaystyle b_0 = \frac{\sum_i \frac{1}{\sigma_i^2} y_i - \sum_i \frac{1}{\sigma_i^2} b_1 x_i}{\sum_i \frac{1}{\sigma_i^2}}$

We can do the same for b_1. First the normal equation:

$\displaystyle \sum_i \frac{1}{\sigma_i^2}y_ix_i = \sum_i \frac{1}{\sigma_i^2}(b_0 + b_1x_i) x_i$

Solving for b_1:

$\displaystyle b_1 = \frac{\sum_i \frac{x_iy_i}{\sigma_i^2} - \sum_i \frac{b_0x_i}{\sigma_i^2}}{\sum_i \frac{x_i^2}{\sigma_i^2}}$

Note that b_0 appears in the expression for b_1 and vice versa. Substituting one expression into the other yields a closed-form solution for each parameter; equivalently, the two normal equations form a 2x2 linear system that can be solved directly.
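Carrying out that substitution gives the standard closed form in terms of the weighted sums S = Σ 1/σ², S_x, S_y, S_xx, S_xy. A minimal sketch (function and variable names are illustrative):

```python
# Sketch: closed-form weighted straight-line fit, obtained by solving the
# two normal equations above simultaneously.  Names are illustrative.

def fit_line(xs, ys, sigmas):
    w = [1.0 / s**2 for s in sigmas]           # weights 1/sigma_i^2
    S   = sum(w)
    Sx  = sum(wi * x for wi, x in zip(w, xs))
    Sy  = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    delta = S * Sxx - Sx**2                    # determinant of the 2x2 system
    b1 = (S * Sxy - Sx * Sy) / delta           # slope
    b0 = (Sy - b1 * Sx) / S                    # intercept, from the first equation
    return b0, b1

# Noise-free data on y = 2 + 3x recovers the parameters exactly
b0, b1 = fit_line([0.0, 1.0, 2.0], [2.0, 5.0, 8.0], [1.0, 1.0, 1.0])
```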

1. We often rather casually assume a uniform prior $\displaystyle P(\mathbf b)= \text{constant}$ on the parameters $\displaystyle \mathbf b$ . If the prior is not uniform, then is minimizing $\displaystyle \chi^2$ the right thing to do? If not, then what should you do instead? Can you think of a situation where the difference would be important?

With a non-uniform prior, minimizing $\chi^2$ (maximizing the likelihood) is no longer the right thing to do: we should instead maximize the posterior $P(\mathbf b | \text{data}) \propto P(\text{data} | \mathbf b) P(\mathbf b)$, i.e., minimize $\chi^2/2 - \ln P(\mathbf b)$ (the MAP estimate). A non-uniform prior over the parameters is equivalent to regularization — for example, a Gaussian prior centered at zero adds an L2 (ridge) penalty. The difference is important when you have many parameters relative to the data, where the prior prevents overfitting.
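As a concrete illustration of that equivalence (a sketch under assumed unit errors; the data and penalty strength are made up): a Gaussian prior on (b_0, b_1) with precision lam turns minimizing chi^2 into minimizing chi^2 + lam*(b_0^2 + b_1^2), which simply adds lam to the diagonal of the normal equations:

```python
# Sketch: MAP estimate with a Gaussian prior on (b0, b1) = ridge regression.
# Maximizing P(b|data) ∝ exp(-chi^2/2) * P(b), with P(b) ∝ exp(-lam*|b|^2/2),
# means minimizing chi^2 + lam*(b0^2 + b1^2): add lam to the diagonal of the
# 2x2 normal equations.  Unit errors assumed; lam and data are illustrative.

def map_line_fit(xs, ys, lam):
    n = len(xs)
    Sx  = sum(xs)
    Sxx = sum(x * x for x in xs)
    Sy  = sum(ys)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    # (n + lam) b0 + Sx b1 = Sy ;  Sx b0 + (Sxx + lam) b1 = Sxy
    delta = (n + lam) * (Sxx + lam) - Sx**2
    b0 = (Sy * (Sxx + lam) - Sx * Sxy) / delta
    b1 = ((n + lam) * Sxy - Sx * Sy) / delta
    return b0, b1

xs = [0.0, 1.0, 2.0]
ys = [2.0, 5.0, 8.0]
b_flat = map_line_fit(xs, ys, 0.0)   # uniform prior: exact fit y = 2 + 3x
b_reg  = map_line_fit(xs, ys, 1.0)   # prior shrinks the parameters toward 0
```

With lam = 0 the uniform-prior (pure chi-square) solution is recovered; with lam > 0 both parameters are pulled toward zero, which is exactly the regularization effect described above.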

2. What if, in lecture slide 2, the measurement errors were $\displaystyle e_i \sim \text{Cauchy}(0,\sigma_i)$ instead of $\displaystyle e_i \sim N(0,\sigma_i)$ ? How would you find MLE estimates for the parameters $\displaystyle \mathbf b$ ?
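One way to approach this (a sketch, not the lecture's solution): for Cauchy errors the likelihood of each point is proportional to $1 / (1 + z_i^2)$ with $z_i = (y_i - y(x_i|\mathbf b))/\sigma_i$, so the MLE minimizes $\sum_i \ln(1 + z_i^2)$ instead of $\sum_i z_i^2$. That objective is not quadratic in b, so there are no normal equations and it must be minimized numerically. A crude grid search illustrates the idea (the data, with one gross outlier, and the grid are made up):

```python
import math

# Sketch: MLE for a straight line with Cauchy(0, sigma_i) errors.
# Instead of chi^2 = sum z_i^2, minimize sum log(1 + z_i^2),
# z_i = (y_i - b0 - b1*x_i)/sigma_i.  No closed form, so search numerically.
# Data (with one gross outlier) and the search grid are illustrative.

def neg_log_like(b0, b1, xs, ys, sigmas):
    return sum(math.log(1.0 + ((y - b0 - b1 * x) / s)**2)
               for x, y, s in zip(xs, ys, sigmas))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 50.0]   # y = 2 + 3x, last point is an outlier
sigmas = [1.0] * 5

# Crude grid search; a real implementation would use a proper optimizer.
best = min(((b0, b1) for b0 in [i * 0.1 for i in range(0, 50)]
                     for b1 in [j * 0.1 for j in range(0, 60)]),
           key=lambda b: neg_log_like(b[0], b[1], xs, ys, sigmas))
# The heavy-tailed likelihood largely ignores the outlier,
# so `best` lands near (2, 3).
```

This also shows why one might prefer a Cauchy error model: the log of its heavy-tailed density grows only logarithmically in the residual, so a single outlier barely moves the fit, whereas under a Gaussian model it would dominate $\chi^2$.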