(Rene) Segment 19: The ChiSquare statistic
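1. Prove the assertion on lecture slide 5, namely that, for a multivariate normal distribution, the quantity $(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})$, where $\mathbf{x}$ is a random draw from the multivariate normal, is $\chi^2$ distributed.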
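Using the Cholesky factorization of the covariance matrix, $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T$, and applying the change of variable $\mathbf{y} = \mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})$, we have that,
$$(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) = \mathbf{y}^T\mathbf{y} = \sum_{i=1}^{n} y_i^2,$$
where $\mathbf{y} \sim N(\mathbf{0}, \mathbf{I})$, so the $y_i$ are independent standard normal variables. Then, as in the lecture, we can define a new variable, $z_i = y_i^2$, such that $y_i = \pm\sqrt{z_i}$ and $|dy_i/dz_i| = \tfrac{1}{2} z_i^{-1/2}$, for $z_i > 0$. Then, the probability
$$p_Z(z_i)\,dz_i = 2\,p_Y(y_i)\,|dy_i|,$$
where the factor 2 accounts for the two roots $\pm\sqrt{z_i}$.
Hence, $p_Z(z_i) = z_i^{-1/2}\,p_Y(\sqrt{z_i}) = \frac{1}{\sqrt{2\pi z_i}}\,e^{-z_i/2}$, so $z_i \sim \chi^2_1$.
Now, in the case of $\nu$ degrees of freedom, the chi-square distribution is,
$$p(z \mid \nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, z^{\nu/2-1}\, e^{-z/2}, \qquad z > 0.$$
Its corresponding characteristic function is computed using Mathematica as,
$$\phi_\nu(t) = (1-2it)^{-\nu/2} = \left[(1-2it)^{-1/2}\right]^{\nu}.$$
The characteristic function of a sum of independent random variables is the product of their characteristic functions. Hence, this is also the characteristic function corresponding to the sum of $\nu$ independent random variables that are each $\chi^2_1$ distributed. With $\nu = n$, this proves the assertion that,
$$(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) = \sum_{i=1}^{n} z_i \sim \chi^2_n.$$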
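The whitening argument above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the dimension, seed, mean and covariance are arbitrary illustrative choices, not values from the lecture. It verifies that the quadratic form equals the sum of squared whitened components, and that its sample mean and variance are close to $n$ and $2n$, the mean and variance of a $\chi^2_n$ distribution.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 3              # dimension (illustrative choice)
n_draws = 200_000  # number of random draws (illustrative choice)

# Illustrative mean and a random symmetric positive-definite covariance.
mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
sigma = A @ A.T + n * np.eye(n)

# Cholesky factor: sigma = L @ L.T with L lower triangular.
L = np.linalg.cholesky(sigma)

# Draw x ~ N(mu, sigma) and whiten: y = L^{-1} (x - mu).
x = rng.multivariate_normal(mu, sigma, size=n_draws)
y = np.linalg.solve(L, (x - mu).T).T

# Quadratic form computed directly and as the sum of squared whitened components.
q_direct = np.einsum("ij,jk,ik->i", x - mu, np.linalg.inv(sigma), x - mu)
q_whitened = np.sum(y**2, axis=1)

print("forms agree:", np.allclose(q_direct, q_whitened))
print("sample mean:", q_whitened.mean(), "expected:", n)
print("sample variance:", q_whitened.var(), "expected:", 2 * n)
```

Matching the first two moments is only a sanity check, not a proof; comparing a histogram of `q_whitened` against the $\chi^2_n$ density would be a stronger visual check.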
To Think About
1. Why are we so interested in t-values? Why do we square them?
t-values are dimensionless numbers, so they do not depend on the units or scale of the problem. Furthermore, they measure the fluctuation with respect to the mean value in units of the standard deviation. There are two reasons to square t-values. One is to obtain a positive value, since a positive and a negative deviation from the mean are statistically equally important (the normal distribution is symmetric). The other is that, for multivariate normal distributions, summing the signed fluctuations from the mean would lead to cancellation. After squaring, the terms can no longer cancel.
2. Suppose you measure a bunch of quantities $x_i$, each of which is measured with a measurement accuracy $\sigma_i$ and has a theoretically expected value $\mu_i$. Describe in detail how you might use a chi-square test statistic as a p-value test to see if your theory is viable? Should your test be 1 or 2 tailed?
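Suppose that the dimension of the data is n. We compute the statistic,
$$\chi^2 = \sum_{i=1}^{n} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2,$$
which is $\chi^2_n$ distributed if the theory is correct and the measurement errors are independent and normal.
Compute a p-value according to,
$$p = P(\chi^2_n \ge \chi^2) = 1 - F_n(\chi^2),$$
where $F_n$ is the cdf of the chi-square distribution with n degrees of freedom (for the left tail, $p = F_n(\chi^2)$),
which can be easily computed once we have the cdf of the chi-square distribution.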
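A one-sided test is required, for which there are two cases:
- right tail test: reject the hypothesis when the chi-square value is improbably large, i.e. the deviations from the theory are larger than the stated measurement errors can explain (the theory is wrong, or the measurement errors are underestimated).
- left tail test: reject the hypothesis when the chi-square value is improbably small, i.e. the agreement is better than the stated measurement errors allow (the measurement errors are overestimated).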
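The procedure above can be sketched in a few lines. This is a minimal sketch, assuming SciPy is available, that the measurements are independent, and that no parameters of the theory were fitted to the data (so the number of degrees of freedom equals the number of measurements). The function name `chi_square_test` and the sample numbers are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def chi_square_test(x, mu, sigma):
    """Return (chi2_value, p_right, p_left) for independent measurements.

    x: measured values, mu: theoretical values, sigma: measurement accuracies.
    Assumes no fitted parameters, so dof = number of measurements.
    """
    x, mu, sigma = (np.asarray(a, dtype=float) for a in (x, mu, sigma))
    t = (x - mu) / sigma               # dimensionless t-values
    chi2_value = float(np.sum(t**2))   # chi-square statistic
    dof = x.size
    p_right = float(chi2.sf(chi2_value, dof))   # P(chi2_dof >= observed)
    p_left = float(chi2.cdf(chi2_value, dof))   # P(chi2_dof <= observed)
    return chi2_value, p_right, p_left


# Hypothetical measurements, theoretical values and accuracies.
x = [10.3, 4.8, 7.9, 2.2]
mu = [10.0, 5.0, 8.0, 2.0]
sigma = [0.2, 0.2, 0.1, 0.1]

chi2_value, p_right, p_left = chi_square_test(x, mu, sigma)
print("chi-square:", chi2_value)
print("right-tail p-value:", p_right)
print("left-tail p-value:", p_left)
```

A small `p_right` suggests that the deviations are too large for the stated accuracies; a small `p_left` suggests that the agreement is suspiciously good.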
We have generated a data set of np points from a d-variate normal distribution with randomly chosen mean and covariance. From this data set we have approximated the mean and covariance. In the next part we compare the chi-square statistics and p-values of the data set obtained with the real parameters and with the approximated parameters.
- Plot of the $\chi^2_\nu$ distribution for several values of the degrees of freedom $\nu$. The shape of the chi-square distribution can be understood intuitively from the chi-square statistic. For one degree of freedom, a small t-value has high probability, and squaring a small value makes it smaller still. Therefore the density goes to infinity at x=0. For more degrees of freedom, all t-values must be small at the same time for the sum of squares to be small, which becomes less probable as the number of degrees of freedom increases, so the density near x=0 decreases.
- Results with d=2 and np = 200. Plot 1 shows the data points in the x-y plane, giving an idea of the correlation between the two coordinates. In the second plot we compare the chi-square values of all np data points, computed with the real and with the approximated mean and covariance, with the exact chi-square distribution with 2 degrees of freedom (d=2). The last plot shows the overall chi-square value, which is close to the mean of the chi-square distribution with np x d degrees of freedom.
- Same as above, however now with d=3.
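The experiment described above could be reproduced along the following lines. This is a minimal sketch, assuming NumPy; the seed and the way the random mean and covariance are generated are illustrative choices and not necessarily those used for the plots. The name `n_points` stands in for np to avoid clashing with the usual NumPy alias.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed (illustrative choice)
d, n_points = 2, 200            # d=2 and np=200 as in the first experiment

# Randomly chosen "real" mean and symmetric positive-definite covariance.
mu_true = rng.normal(size=d)
A = rng.normal(size=(d, d))
cov_true = A @ A.T + np.eye(d)

data = rng.multivariate_normal(mu_true, cov_true, size=n_points)

# Parameters approximated from the data.
mu_est = data.mean(axis=0)
cov_est = np.cov(data, rowvar=False)  # unbiased estimate, divides by n_points - 1


def chi2_per_point(points, mean, cov):
    """Chi-square value (x - mean)^T cov^{-1} (x - mean) for every row x."""
    diff = points - mean
    return np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)


chi2_true = chi2_per_point(data, mu_true, cov_true)
chi2_est = chi2_per_point(data, mu_est, cov_est)

print("total with real parameters:        ", chi2_true.sum())
print("total with approximated parameters:", chi2_est.sum())
print("np x d:                            ", n_points * d)
```

One caveat on the overall value: with the sample mean and the unbiased sample covariance, the total is exactly (np - 1) x d by a trace identity, so it cannot signal a bad fit. With the real parameters the total is a $\chi^2$ variable with np x d degrees of freedom, which fluctuates around np x d with standard deviation $\sqrt{2\,np\,d}$.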