Segment 29: GMMs in N-Dimensions - 4/10/2013

From Computational Statistics (CSE383M and CS395T)

My code is here.

Problem 1

In your favorite computer language, write a code for K-means clustering, and cluster the given data using (a) 3 components and (b) 8 components. Don't use anybody's K-means clustering package for this part: Code it yourself. Hint: Don't try to do it as limiting case of GMMs, just code it from the definition of K-means clustering, using an E-M iteration. Plot your results by coloring the data points according to which cluster they are in. How sensitive is your answer to the starting guesses?
It is very sensitive to the starting position. I did several runs of each trial, one for 3 components and one for 8. Depending on which starting points I use, some of the Gaussian functions end up coloring lots of points. Also, as you said in the lecture, 3 components is not a very good representation of the data. With 8 components, I think it makes too many clusters in areas that don't particularly need them, but perhaps that's how the data is formed. Here are some of the pictures I created. Also, at first I was alarmed when I saw a color show up in different places, but since clusters could overlap with one another, it would make sense why the colors are split in some areas.
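For reference, the E-M iteration that the problem statement asks for can be sketched as below. This is a minimal standard-library sketch, not the code linked above: the function name `kmeans`, the `seed` parameter, the choice of random data points as starting centers, and the synthetic `data` are all assumptions made for illustration.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means on a list of (x, y) tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Starting guess: k data points chosen at random (an assumption;
    # other starting guesses will generally give different clusterings).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # E-step: assign each point to its nearest center (squared distance).
        new_labels = [
            min(range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the iteration has converged
        labels = new_labels
        # M-step: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels

# Synthetic stand-in for the class data: three made-up blobs of 50 points each.
demo_rng = random.Random(1)
blobs = [(2.0, 2.0), (4.0, 4.0), (4.5, 2.0)]
data = [(cx + demo_rng.gauss(0, 0.2), cy + demo_rng.gauss(0, 0.2))
        for cx, cy in blobs for _ in range(50)]

# Re-running with different seeds shows how much the centers depend on the start.
for seed in (0, 1, 2):
    centers, labels = kmeans(data, 3, seed=seed)
    print(seed, sorted((round(x, 2), round(y, 2)) for x, y in centers))
```

Coloring each point by its entry in `labels` gives the kind of plot the problem asks for, and comparing the printed centers across seeds is a quick way to gauge sensitivity to the starting guesses.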

Fitted Data to 3 Components

[2.5682, 2.738]
[4.1636, 2.0755]
[2.9722, 2.0755]

[2.9768, 4.1763]
[2.4871, 2.3304]
[4.4443, 3.6352]

Fitted Data to 8 Components

Center Points
[2.1072, 2.0253]
[3.1593, 3.2125]
[4.0327, 4.6214]
[3.2801, 3.3464]
[2.786, 2.4249]
[3.2335, 3.4982]
[4.662, 3.0111]
[1.9243, 2.9101]
[3.7924, 3.5395]
[5.0157, 4.5659]
[4.1628, 2.9499]
[3.426, 4.313]
[3.1139, 3.7126]
[2.9917, 3.4242]
[4.7688, 4.3994]
[4.17, 4.1693]

Problem 2

In your favorite computer language, and either writing your own GMM program or using any code you can find elsewhere (e.g., Numerical Recipes for C++, or scikit-learn, which is installed on the class server, for Python), construct mixture models like those shown in slide 8 (for 3 components) and slide 9 (for 8 components). You should plot 2-sigma error ellipses for the individual components, as shown in those slides.
For this problem, all I would have to do is run the functions that color the points, asking the user to input their data, how many Gaussians they want to fit, and the number of graphs they want to generate. I wasn't able to create the error ellipses in Python, but they would give a clearer indication of where the clusters are, without having to see colors in two different places.
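The 2-sigma error ellipse of a single 2-D component can be computed directly from its mean and covariance matrix, without any plotting library. The following is a minimal sketch under stated assumptions: the helper name `sigma_ellipse` is hypothetical, the covariance is assumed symmetric and positive definite, and the example mean and covariance are made-up values rather than fitted results.

```python
import math

def sigma_ellipse(mean, cov, n_sigma=2.0, n_points=100):
    """Return (x, y) points on the n-sigma ellipse of a 2-D Gaussian."""
    (a, b), (_, c) = cov  # symmetric covariance [[a, b], [b, c]]
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    mid = 0.5 * (a + c)
    half_gap = math.hypot(0.5 * (a - c), b)
    lam_major = mid + half_gap
    lam_minor = max(mid - half_gap, 0.0)  # guard against tiny negative round-off
    # Orientation of the major axis, measured from the x-axis.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    r1 = n_sigma * math.sqrt(lam_major)  # semi-major axis length
    r2 = n_sigma * math.sqrt(lam_minor)  # semi-minor axis length
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    pts = []
    for i in range(n_points):
        t = 2.0 * math.pi * i / n_points
        u, v = r1 * math.cos(t), r2 * math.sin(t)  # axis-aligned ellipse
        # Rotate by theta and shift to the component mean.
        pts.append((mean[0] + u * cos_t - v * sin_t,
                    mean[1] + u * sin_t + v * cos_t))
    return pts

# Hypothetical component: mean (3, 3) with correlated x and y.
pts = sigma_ellipse((3.0, 3.0), [[0.5, 0.2], [0.2, 0.3]])
print(pts[:3])
```

With matplotlib available, `plt.plot(*zip(*pts))` draws the outline; calling the helper once per fitted component, with that component's mean and covariance, would overlay the ellipses on the colored scatter plot.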

To Think About 1

The segment (or the previous one) mentioned that the log-likelihood can sometimes get stuck on plateaus, barely increasing, for long periods of time, and then can suddenly increase by a lot. What do you think is happening from iteration to iteration during these times on a plateau?
Well, the logarithm is a very slowly growing function. It just evaluates as it is supposed to, for arguments anywhere from 1 to infinity. The only region where its value increases or decreases suddenly is for arguments between 0 and 1. Most likely, on the plateau it is just iterating through values that are greater than 1.
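One way to check a guess like this is to record the log-likelihood together with the component parameters after every EM step, and look at what is moving while the log-likelihood is nearly flat. The sketch below does that for a 1-D mixture using only the standard library; the function name `em_trace`, the synthetic data, and the deliberately poor starting means are assumptions for illustration, not the setup used in the segment.

```python
import math
import random

def em_trace(xs, mus, sigmas, weights, n_iter=50):
    """Run EM for a 1-D Gaussian mixture; return [(loglike, means), ...] per step."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    trace = []
    for _ in range(n_iter):
        # E-step: responsibilities resp[i][j] = P(component j | x_i).
        resp, loglike = [], 0.0
        for x in xs:
            dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                    for m, s, w in zip(mus, sigmas, weights)]
            total = sum(dens)
            loglike += math.log(total)
            resp.append([d / total for d in dens])
        # Record the log-likelihood of the parameters used in this E-step.
        trace.append((loglike, list(mus)))
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(var)
    return trace

# Synthetic data: two made-up clusters, with both starting means placed between them.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(5.0, 1.0) for _ in range(100)]
trace = em_trace(xs, mus=[2.4, 2.6], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
for i, (ll, means) in enumerate(trace):
    print(i, round(ll, 3), [round(m, 3) for m in means])
```

Reading the printed means next to the log-likelihood column shows whether the parameters are still drifting during stretches where the log-likelihood barely changes.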