GMM activity

From Computational Statistics Course Wiki

1. Draw a sample of 100 points from the uniform distribution $U(0,1)$. This is your data set. Fit GMM models to your sample (now considered as being on the interval $-\infty < x < \infty$) with increasing numbers of components $K$, at least $K=1,\ldots,5$. Plot your models. Do they get better as $K$ increases? Did you try multiple starting values to find the best (hopefully globally best) solutions for each $K$?
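One possible starting point for question 1 is sketched below: a plain-Python E-M fit of a one-dimensional GMM, restarted from several random initial guesses for each number of components. The function name `fit_gmm_1d`, the variance floor `var_floor`, the iteration count, and the number of restarts are all illustrative assumptions, not part of the assignment. The floor is there because the GMM likelihood is unbounded when a component collapses onto a single point.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(data, k, rng, n_iter=100, var_floor=1e-6):
    """One E-M run from a random start (hypothetical helper).

    Returns (log_likelihood, weights, means, variances).
    """
    n = len(data)
    means = rng.sample(data, k)              # start at k randomly chosen data points
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            if nj < 1e-12:
                continue                     # empty component: keep old mean and variance
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)   # assumed floor against collapse
    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )
    return loglik, weights, means, variances

rng = random.Random(0)                       # fixed seed so the run is repeatable
data = [rng.random() for _ in range(100)]    # sample from U(0,1)

best = {}
for k in range(1, 6):
    runs = [fit_gmm_1d(data, k, rng) for _ in range(5)]   # 5 random restarts
    best[k] = max(runs, key=lambda run: run[0])           # keep highest log-likelihood
    print(k, round(best[k][0], 2))
```

Comparing the log-likelihoods of the individual restarts for a given $K$, not just the best one, gives a quick feel for how many distinct local maxima the fit has.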

Competition: Who can create the best visualization of the convergence of the iterative fitting process in question 1?

2. Using a pre-existing package (gmdistribution for Matlab, or scikit-learn, which is installed on the class server, for Python), construct mixture models like those shown in Segment slide 8 (for 3 components) and slide 9 (for 8 components). You should plot 2-sigma error ellipses for the individual components, as shown in those slides.

The data is at Twoexondata.txt or on the IPython server.
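For the 2-sigma error ellipses in question 2, the following is a minimal sketch of how one component's mean and 2x2 covariance matrix could be turned into an ellipse outline. It takes "2-sigma" to mean the contour at Mahalanobis distance 2 from the mean, which is an assumption about the slides' convention. The function name `sigma_ellipse` and the example numbers are hypothetical. In current scikit-learn, the fitted means and full covariances would come from the `means_` and `covariances_` attributes of a `GaussianMixture` fitted with `covariance_type="full"`.

```python
import math

def sigma_ellipse(mean, cov, n_sigma=2.0, n_points=100):
    """Outline of the n-sigma ellipse for a 2x2 symmetric covariance.

    Returns (semi_major, semi_minor, angle_rad, outline_points).
    """
    (a, b), (_, d) = cov                     # cov = [[a, b], [b, d]]
    half_trace = 0.5 * (a + d)
    root = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    lam1 = half_trace + root                 # larger eigenvalue
    lam2 = max(half_trace - root, 0.0)       # smaller eigenvalue, clipped at 0
    angle = 0.5 * math.atan2(2.0 * b, a - d) # direction of the major axis
    r1 = n_sigma * math.sqrt(lam1)
    r2 = n_sigma * math.sqrt(lam2)
    ca, sa = math.cos(angle), math.sin(angle)
    pts = []
    for i in range(n_points + 1):
        t = 2.0 * math.pi * i / n_points
        x, y = r1 * math.cos(t), r2 * math.sin(t)
        # rotate by the major-axis angle, then shift to the component mean
        pts.append((mean[0] + ca * x - sa * y, mean[1] + sa * x + ca * y))
    return r1, r2, angle, pts

# Hypothetical component, for illustration only
r1, r2, angle, outline = sigma_ellipse((1.0, -2.0), [[2.0, 0.8], [0.8, 1.0]])
print(round(r1, 3), round(r2, 3), round(math.degrees(angle), 1))
```

The returned outline points can be passed directly to a line-plotting call, one ellipse per component, on top of a scatter plot of the data.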

3. In your favorite computer language, write code for K-means clustering, and cluster the same data using (a) 3 components and (b) 8 components. Don't use anybody's K-means clustering package for this part: code it yourself. Hint: Don't try to do it as a limiting case of GMMs; just code it from the definition of K-means clustering, using an E-M iteration. Plot your results by coloring the data points according to which cluster they are in. How sensitive is your answer to the starting guesses?
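As a rough illustration of the E-M structure of K-means, here is a minimal plain-Python sketch. The function name `kmeans`, the random-data-point initialization, and the synthetic three-blob data are assumptions made for the example; the real exercise uses the class data set instead.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, max_iter=100):
    """K-means from the definition (hypothetical helper); returns (centers, labels)."""
    centers = rng.sample(points, k)          # starting guess: k random data points
    labels = None
    for _ in range(max_iter):
        # E-step: assign each point to its nearest center
        new_labels = [min(range(k), key=lambda j: sq_dist(pt, centers[j]))
                      for pt in points]
        if new_labels == labels:
            break                            # assignments stopped changing
        labels = new_labels
        # M-step: move each center to the mean of its assigned points
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:                      # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Synthetic stand-in data: three Gaussian blobs (not the class data set)
rng = random.Random(1)
blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in blob_centers for _ in range(30)]

centers, labels = kmeans(points, 3, rng)
inertia = sum(sq_dist(pt, centers[lab]) for pt, lab in zip(points, labels))
print(centers, round(inertia, 2))
```

Rerunning with different seeds and comparing the final sum of squared distances (`inertia` above) is one simple way to probe the sensitivity to starting guesses, since K-means can settle into different local optima.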

Competition: Who can create the best visualization of the convergence of the iterative fitting processes in questions 2 and 3?

Honorable mention

Histfin.png

Award winners

Best in show, Best use of animation: Todd

Animation2 ts.gif

Best use of Matlab: Daniel

Daniel resized.gif

Best use of Python: Nick

Scroll down to see animations.

Notebook: http://nbviewer.ipython.org/github/CS395T/2014/blob/master/Nick%20Wilson%2004-02-14%20Segment%2029%20In%20Class%20Beauty%20Contest.ipynb

Best use of still image: Anonymous

Convergence transparency.png