(Rene) Segment 18: The Correlation Matrix


Problems

To Calculate

1. Random points $i$ are chosen uniformly on a circle of radius 1, and their coordinates $(x_i, y_i)$ in the plane are recorded. What is the 2x2 covariance matrix of the random variables $X$ and $Y$? (Hint: Transform probabilities from $\theta$ to $x$. Second hint: Is there a symmetry argument that some components must be zero, or must be equal?)


Random points $i$ are chosen uniformly on a unit circle, so we can define a uniform probability distribution for the angle $\theta$,

 $p(\theta)\,d\theta = \dfrac{d\theta}{2\pi}, \qquad \theta \in [0, 2\pi)$   (Note we define probability as a volume form - exterior calculus)

Applying the change of variables to $x = \cos\theta$ and the relation $x^2 + y^2 = 1$ we obtain,

 $dx = -\sin\theta\,d\theta = \mp\sqrt{1 - x^2}\,d\theta$

Then the probability distribution can be rewritten in terms of $x$ or $y$ as,

 $p(x)\,dx = \dfrac{dx}{\pi\sqrt{1 - x^2}}, \qquad p(y)\,dy = \dfrac{dy}{\pi\sqrt{1 - y^2}}, \qquad x, y \in [-1, 1]$

(the factor of two relative to $1/2\pi$ arises because each value of $x$ corresponds to two values of $\theta$).

From symmetry arguments we can conclude that $\operatorname{Var}(X) = \operatorname{Var}(Y)$. Furthermore, the symmetry with respect to the angle (the distribution is unchanged under $\theta \to -\theta$, which flips the sign of $y$) implies that the covariance of $X$ and $Y$ is zero as well. We have confirmed these numbers using Mathematica, which gives

 $\operatorname{E}[X] = \operatorname{E}[Y] = 0, \qquad \operatorname{Var}(X) = \operatorname{Var}(Y) = \dfrac{1}{2}$

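For reference, the integrals behind these results are elementary (a worked check added here, not part of the original Mathematica computation):

 $\operatorname{E}[X^2] = \displaystyle\int_{-1}^{1} \frac{x^2\,dx}{\pi\sqrt{1 - x^2}} = \frac{1}{2}, \qquad \operatorname{Cov}(X, Y) = \operatorname{E}[XY] = \frac{1}{2\pi}\displaystyle\int_{0}^{2\pi} \cos\theta\,\sin\theta\,d\theta = 0$
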
So the 2 by 2 covariance matrix is given by

 $C = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}$

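As a quick numerical sanity check (a sketch using NumPy, not part of the original solution), one can sample points uniformly on the unit circle and look at the sample covariance matrix, which should be close to diag(1/2, 1/2):

 import numpy as np
 rng = np.random.default_rng(0)
 theta = rng.uniform(0.0, 2.0 * np.pi, size=100_000)  # uniform angle on the circle
 x, y = np.cos(theta), np.sin(theta)                   # coordinates of the points
 # sample covariance matrix of (X, Y); should approach [[1/2, 0], [0, 1/2]]
 print(np.cov(np.vstack([x, y])))
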
2. Points are generated in 3 dimensions by this prescription: Choose $\lambda$ uniformly random in $(0, 1)$. Then a point's coordinates are $(x, y, z) = (\alpha\lambda, \beta\lambda, \gamma\lambda)$. What is the covariance matrix of the random variables in terms of $\alpha$, $\beta$, $\gamma$? What is the linear correlation matrix of the same random variables?

The probability distribution of $\lambda$ is:

 $p(\lambda)\,d\lambda = d\lambda, \qquad \lambda \in (0, 1)$   (Again we define probabilities as volume forms - exterior calculus)

Its first and second moments are, respectively,

 $\operatorname{E}[\lambda] = \displaystyle\int_0^1 \lambda\,d\lambda = \frac{1}{2}, \qquad \operatorname{E}[\lambda^2] = \displaystyle\int_0^1 \lambda^2\,d\lambda = \frac{1}{3}$

Hence, the mean is $1/2$ and the variance is $\operatorname{Var}(\lambda) = \operatorname{E}[\lambda^2] - \operatorname{E}[\lambda]^2 = 1/12$.

Let $x = \alpha\lambda$, $y = \beta\lambda$ and $z = \gamma\lambda$. Then the covariance matrix is given by

 $C = \operatorname{Var}(\lambda)\begin{pmatrix} \alpha^2 & \alpha\beta & \alpha\gamma \\ \alpha\beta & \beta^2 & \beta\gamma \\ \alpha\gamma & \beta\gamma & \gamma^2 \end{pmatrix} = \dfrac{1}{12}\begin{pmatrix} \alpha^2 & \alpha\beta & \alpha\gamma \\ \alpha\beta & \beta^2 & \beta\gamma \\ \alpha\gamma & \beta\gamma & \gamma^2 \end{pmatrix}$
Furthermore, the linear correlation matrix, with coefficients $r_{ij} = C_{ij}/\sqrt{C_{ii}C_{jj}}$, is simply the matrix of all ones (for $\alpha, \beta, \gamma > 0$; in general each entry is $\pm 1$ according to the signs), since all three coordinates are exact linear functions of the single random variable $\lambda$:

 $r = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$

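A quick numerical check of this result (with illustrative values $\alpha = 1$, $\beta = 2$, $\gamma = 3$, which are not part of the original problem) takes only a few lines:

 import numpy as np
 rng = np.random.default_rng(0)
 alpha, beta, gamma = 1.0, 2.0, 3.0          # illustrative choices, not specified in the problem
 lam = rng.uniform(0.0, 1.0, size=200_000)   # lambda uniform in (0, 1)
 pts = np.vstack([alpha * lam, beta * lam, gamma * lam])
 print(np.cov(pts))       # should approach (1/12) * outer([alpha, beta, gamma], [alpha, beta, gamma])
 print(np.corrcoef(pts))  # should approach the matrix of all ones
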
To Think About

1. Suppose you want to get a feel for what a linear correlation $r = 0.3$ (say) looks like. How would you generate a bunch of points in the plane with this value of $r$? Try it. Then try for different values of $r$. As $r$ increases from zero, what is the smallest value where you would subjectively say "if I know one of the variables, I pretty much know the value of the other"?


We define a bivariate normal distribution with mean $\mu = (0, 0)$ and covariance matrix $\Sigma = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$. The following figures show 1000 deviates generated with this distribution for cross correlations of r = 0.30, r = 0.90 and r = 0.99.

[Figures: scatter plots of the 1000 deviates for r = 0.30 (Corr030.jpg), r = 0.90 (Corr090.jpg) and r = 0.99 (Corr099.jpg)]

We see that at a cross correlation of r = 0.99 we can essentially deduce the value of the second variable once we know the first.

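One simple way to generate such points (a NumPy sketch; the original solution used a bivariate normal as above, but the exact code is not given on this page) is:

 import numpy as np
 import matplotlib.pyplot as plt
 rng = np.random.default_rng(0)
 def correlated_points(r, n=1000):
     """Draw n points from a bivariate normal with unit variances and correlation r."""
     cov = [[1.0, r], [r, 1.0]]
     return rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
 for r in (0.30, 0.90, 0.99):
     pts = correlated_points(r)
     plt.figure()
     plt.scatter(pts[:, 0], pts[:, 1], s=4)
     plt.title(f"r = {r}")
 plt.show()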

2. Suppose that points in the plane fall roughly on a 45-degree line between the points (0,0) and (10,10), but in a band of about width w (in these same units). What, roughly, is the linear correlation coefficient $r$?

Again, comparing with the scatter plots from question 1, for a narrow band the cross correlation coefficient r is roughly 0.99; a more quantitative estimate in terms of the band width w is worked out under Class activities below.


Class activities

Work problems in teams:

1. Suppose you want to get a feel for what a linear correlation $r = 0.3$ (say) looks like. How would you generate a bunch of points in the plane with this value of $r$? Try it. Then try for different values of $r$. As $r$ increases from zero, what is the smallest value where you would subjectively say "if I know one of the variables, I pretty much know the value of the other"?

This question has been worked out above.

2. Suppose that points in the plane fall roughly on a 45-degree line between the points (0,0) and (10,10), but in a band of about width w (in these same units). What, roughly, is the linear correlation coefficient $r$?


Since all points fall roughly on a 45-degree line, we know by symmetry that var[x] = var[y]. Furthermore, since almost all points fall between (0,0) and (10,10), we also know that the mean is (5,5) and that 3 standard deviations in the x-direction or y-direction is of length about 5.

Denoting with C the covariance matrix,

 $C = \begin{pmatrix} a & b \\ b & a \end{pmatrix}, \qquad a = \operatorname{var}[x] = \operatorname{var}[y], \quad b = \operatorname{cov}[x, y]$

we can determine $a$ directly, since three standard deviations is approximately equal to 5. Hence, $a \approx (5/3)^2 = 25/9$. Furthermore, $b$ is positive, since the points lie on a +45 degree line w.r.t. the coordinate axes.

By applying a change of coordinates from x-y to principal coordinates (i.e. do an eigenvector decomposition of the matrix C and determine the eigenvalues) we can find a relation between the band width w and b, or equivalently r in the linear correlation matrix. The eigenvectors are, as can be expected, the vectors that make a (+/-) 45-degree angle with the xy-coordinate axes. The corresponding eigenvalues are $\lambda_1 = a + b$ and $\lambda_2 = a - b$. Now the standard deviation in the direction of the second eigenvector, corresponding to $\lambda_2$, is given by $\sqrt{a - b}$. Since most points lie in a band of width w, three of these standard deviations span about half the band, so $a - b$ is given by

 $3\sqrt{a - b} \approx \dfrac{w}{2} \quad\Longrightarrow\quad a - b \approx \left(\dfrac{w}{6}\right)^2$
Since $r = b/a$, we have the equivalent description in terms of $r$,

 $r = \dfrac{b}{a} = 1 - \dfrac{a - b}{a} \approx 1 - \dfrac{(w/6)^2}{(5/3)^2} = 1 - \dfrac{w^2}{100}$

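A small simulation can be used to check the estimate $r \approx 1 - w^2/100$ (a sketch; the uniform band model and point count are illustrative assumptions of mine, not part of the original solution):

 import numpy as np
 rng = np.random.default_rng(0)
 def band_correlation(w, n=100_000):
     """Sample points along the 45-degree line from (0,0) to (10,10) inside a band of width w."""
     t = rng.uniform(0.0, 10.0, size=n)          # position along the line
     d = rng.uniform(-w / 2.0, w / 2.0, size=n)  # perpendicular offset within the band
     x = t - d / np.sqrt(2.0)
     y = t + d / np.sqrt(2.0)
     return np.corrcoef(x, y)[0, 1]
 for w in (0.5, 1.0, 2.0):
     print(w, band_correlation(w), 1.0 - w**2 / 100.0)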

3. Prove that $r$ satisfies $-1 \le r \le 1$. (There are several ways to do this, so, for extra credit, think of two different ways.)

Since we already performed an eigenvalue decomposition, this is easy to see. All eigenvalues are required to be non-negative, because C is a positive semi-definite matrix (standard deviations are required to be real non-negative numbers). The eigenvalues found above are $a + b$ and $a - b$, so, setting $r = b/a$ (with $a > 0$), we see that $1 + r \ge 0$ and $1 - r \ge 0$, i.e. $-1 \le r \le 1$.

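Spelling this step out for the 2x2 linear correlation matrix itself (a short worked calculation added for completeness):

 $\det\begin{pmatrix} 1 - \lambda & r \\ r & 1 - \lambda \end{pmatrix} = (1 - \lambda)^2 - r^2 = 0 \quad\Longrightarrow\quad \lambda = 1 \pm r, \qquad \lambda \ge 0 \;\Longleftrightarrow\; -1 \le r \le 1$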

4. Is it possible to have X positively correlated with Y (that is, $r_{XY} > 0$) and Y positively correlated with Z, but Z negatively correlated with X (that is, $r_{XZ} < 0$)? If so, what is the largest value of $r$ such that we could have $r_{XY} = r_{YZ} = r$ and $r_{XZ} = -r$? What are the limitations on constructing a set of R.V.s with arbitrarily specified positive and negative pairwise values of $r_{ij}$ in $[-1, 1]$?

The linear correlation matrix is thus,

 $C = \begin{pmatrix} 1 & r & -r \\ r & 1 & r \\ -r & r & 1 \end{pmatrix}$

The corresponding eigenvalues are: $1 + r$, $1 + r$ and $1 - 2r$. Hence, $r \le 1/2$, such that $C$ is positive semi-definite, and the largest allowed value is $r = 1/2$. More generally, an arbitrary set of pairwise values $r_{ij}$ can be realized only if the resulting correlation matrix has no negative eigenvalues.

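As a numerical sanity check (the values of r used here are chosen only for illustration), the eigenvalues can be computed directly:

 import numpy as np
 def corr_matrix(r):
     """Correlation matrix with r_XY = r_YZ = r and r_XZ = -r."""
     return np.array([[1.0, r, -r],
                      [r, 1.0, r],
                      [-r, r, 1.0]])
 for r in (0.3, 0.5, 0.6):
     print(r, np.linalg.eigvalsh(corr_matrix(r)))  # smallest eigenvalue goes negative for r > 1/2
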
Or, if you want to work on something completely different (from last year's midterm exam, but here you get to use computers):

5. You hypothesize that the following list of 20 numbers is drawn from a uniform distribution on the interval (0, 1):

0.6816, 0.4633, 0.1646, 0.0985, 0.8236, 0.1750, 0.1636, 0.6660, 0.1640, 0.5166, 0.1638, 0.1536, 0.9535, 0.5409, 0.1637, 0.0366, 0.8092, 0.7486, 0.1202, 0.1639

Test your hypothesis.
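
One way to carry out such a test (a sketch; the choice of a Kolmogorov-Smirnov test is mine, the problem does not prescribe a particular test) is to compare the empirical CDF of the 20 numbers with the CDF of the uniform distribution on (0, 1):

 import numpy as np
 from scipy import stats
 data = np.array([0.6816, 0.4633, 0.1646, 0.0985, 0.8236, 0.1750, 0.1636, 0.6660,
                  0.1640, 0.5166, 0.1638, 0.1536, 0.9535, 0.5409, 0.1637, 0.0366,
                  0.8092, 0.7486, 0.1202, 0.1639])
 # one-sample Kolmogorov-Smirnov test against Uniform(0, 1)
 stat, p_value = stats.kstest(data, 'uniform')
 print(stat, p_value)

Note that several of the listed values cluster very tightly around 0.164 (0.1636-0.1646), so the hypothesis of uniformity already deserves some suspicion on inspection alone.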