Segment 28. Gaussian Mixture Models in 1-D

From Computational Statistics Course Wiki

Watch this segment

The direct YouTube link is http://youtu.be/n7u_tq0I6jM

Links to the slides: PDF file or PowerPoint file

Problems

To Calculate

1. Draw a sample of 100 points from the uniform distribution $U(0,1)$. This is your data set. Fit GMM models to your sample (now considered as being on the interval $-\infty < x < \infty$) with increasing numbers of components $K$, at least $K=1,\ldots,5$. Plot your models. Do they get better as $K$ increases? Did you try multiple starting values to find the best (hopefully globally best) solutions for each $K$?

2. Multiplying a lot of individual likelihoods will often underflow. (a) On average, how many values drawn from $U(0,1)$ can you multiply before the product underflows to zero? (b) What, analytically, is the distribution of the sum of $N$ independent values $\log(U)$, where $U\sim U(0,1)$? (c) Is your answer to (a) consistent with your answer to (b)?
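
For problem 1 above, one possible starting point is a small EM fitter with random restarts. This is only a sketch under stated assumptions, not the course's reference solution: the names `fit_gmm_1d`, `log_components`, and `gmm_pdf` are made up for illustration, starting means are picked at random data points, and the variance floor `var_floor` is an arbitrary guard against components collapsing onto a single point.

```python
import numpy as np

def log_components(x, w, mu, var):
    """Log of w_k * N(x_n | mu_k, var_k) for every point n and component k."""
    return (np.log(w) - 0.5 * np.log(2.0 * np.pi * var)
            - 0.5 * (x[:, None] - mu) ** 2 / var)

def fit_gmm_1d(x, K, n_starts=20, n_iter=300, var_floor=1e-4, seed=0):
    """EM for a K-component 1-D mixture; returns the best of n_starts random starts."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    best = None
    for start in range(n_starts):
        mu = rng.choice(x, size=K, replace=False)  # start means at random data points
        var = np.full(K, x.var())                  # common starting variance
        w = np.full(K, 1.0 / K)                    # equal starting weights
        for it in range(n_iter):
            # E-step, done in log space so small likelihoods do not underflow
            lp = log_components(x, w, mu, var)
            r = np.exp(lp - np.logaddexp.reduce(lp, axis=1)[:, None])
            # M-step: weighted counts, means, and variances
            nk = r.sum(axis=0) + 1e-12
            w = nk / nk.sum()
            mu = (r * x[:, None]).sum(axis=0) / nk
            var = np.maximum((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk, var_floor)
        loglike = np.logaddexp.reduce(log_components(x, w, mu, var), axis=1).sum()
        if best is None or loglike > best["loglike"]:
            best = {"loglike": loglike, "w": w, "mu": mu, "var": var}
    return best

def gmm_pdf(grid, fit):
    """Mixture density on a grid, e.g. for plotting over a histogram of the data."""
    grid = np.asarray(grid, dtype=float)
    lp = log_components(grid, fit["w"], fit["mu"], fit["var"])
    return np.exp(np.logaddexp.reduce(lp, axis=1))

if __name__ == "__main__":
    data = np.random.default_rng(1).uniform(0.0, 1.0, size=100)
    for K in range(1, 6):
        fit = fit_gmm_1d(data, K)
        print(K, round(fit["loglike"], 2), np.sort(fit["mu"]).round(3))
```

Comparing the log-likelihoods across restarts for each $K$ is one way to see whether different starting values really end up at different solutions. Note that the "best" likelihood can come from a component sitting at the variance floor, which is worth checking before calling a fit better.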
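
For problem 2(a) above, a quick simulation can serve as a sanity check on whatever analytic answer you get. The sketch below assumes IEEE double precision (Python floats) and uses made-up helper names `multiplications_until_zero` and `mean_count`. As a hint for comparing with (b): $-\log(U)$ is exponentially distributed with unit mean, so the product reaches zero roughly when the running sum of such terms passes the negative log of the smallest positive double.

```python
import math
import random

def multiplications_until_zero(rng):
    """Multiply U(0,1) draws together until the double-precision product is exactly 0.0."""
    product, count = 1.0, 0
    while product > 0.0:
        product *= rng.random()
        count += 1
    return count

def mean_count(trials=200, seed=0):
    """Average number of factors needed to underflow, over independent trials."""
    rng = random.Random(seed)
    return sum(multiplications_until_zero(rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("simulated mean count:", mean_count())
    # 5e-324 parses to the smallest positive (subnormal) double
    print("-log(smallest positive double):", -math.log(5e-324))
```

The two printed numbers should come out close to each other, which is the consistency that part (c) asks about. Working with sums of logs instead of products avoids the underflow in the first place.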

To Think About

1. Suppose you want to approximate some analytically known function $f(x)$ (whose integral is finite), as a sum of $K$ Gaussians with different centers and widths. You could pretend that $f(x)$ (or some scaling of it) was a probability distribution, draw $N$ points from it and do the GMM thing to find the approximating Gaussians. Now take the limit $N\rightarrow \infty$, figure out how sums become integrals, and write down an iterative method for fitting Gaussians to a given $f(x)$. Does it work?
(You can assume that well-defined definite integrals can be done numerically.)
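
One way the limit could be sketched in code: every sum over data points in the EM updates is replaced by an integral weighted by $f(x)$, done here as a plain Riemann sum on a uniform grid. This is an illustrative guess at the intended method, not a worked solution. It assumes $f(x) \ge 0$ everywhere and that the grid covers essentially all of the mass of $f$; the names `gauss` and `fit_gaussians_to_function` are hypothetical.

```python
import numpy as np

def gauss(x, mu, var):
    """Normal density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_gaussians_to_function(f, grid, mu, var, w, n_iter=500):
    """Continuum-limit EM: f(x) is approximated by total * sum_k w_k N(x | mu_k, var_k)."""
    fx = f(grid)
    dx = grid[1] - grid[0]                 # uniform grid assumed
    total = fx.sum() * dx                  # integral of f, used to scale f to a density
    mu, var, w = (np.array(a, dtype=float) for a in (mu, var, w))
    x = grid[:, None]
    for _ in range(n_iter):
        # "E-step": responsibility r_k(x) of each component at each grid point
        comp = w * gauss(x, mu, var)
        r = comp / (comp.sum(axis=1, keepdims=True) + 1e-300)
        # "M-step": integrals of f(x) r_k(x) times 1, x, and (x - mu_k)^2
        mass = (fx[:, None] * r).sum(axis=0) * dx
        w = mass / total
        mu = (fx[:, None] * r * x).sum(axis=0) * dx / mass
        var = (fx[:, None] * r * (x - mu) ** 2).sum(axis=0) * dx / mass
    return total, w, mu, var

if __name__ == "__main__":
    # Toy target that is itself (twice) a two-Gaussian mixture, so the answer is known.
    target = lambda t: 2.0 * (0.3 * gauss(t, -2.0, 0.25) + 0.7 * gauss(t, 3.0, 1.0))
    grid = np.linspace(-10.0, 12.0, 4401)
    print(fit_gaussians_to_function(target, grid, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]))
```

Trying this on functions that are not mixtures of Gaussians, or that go negative somewhere, is a good way to probe the "Does it work?" question.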

Class Activity

Let's explore a data set and try to make sensible statements about it.

netflixishdata.txt

Rows are 200 movie watchers, columns are 100 movies, entries are their ratings on a scale of 1 (I hated it!) to 5 (I loved it!). This is not real data, of course, so it is only Netflixish, not Netflix.

Questions to explore

How much are people alike?
How much are movies alike?
Distribution of the data in various ways?

By summing each column and dividing by the number of entries in it, we got the average rating for each movie. Something surprising was that the max of all the mean ratings was 3.46 (so no movie had a mean rating above 3.46 stars!), the min was 2.3650, the mean of all the mean movie ratings was 2.9998, and the median of these mean ratings was 2.9925.
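
The per-movie averages described above could be computed along these lines. This is a sketch only: it assumes `netflixishdata.txt` is a whitespace-delimited numeric table with watchers as rows and movies as columns, and the helper names `load_ratings` and `movie_mean_summary` are made up for illustration.

```python
import numpy as np

def load_ratings(path="netflixishdata.txt"):
    """Read the ratings table (file layout assumed: whitespace-delimited numbers)."""
    return np.loadtxt(path)

def movie_mean_summary(ratings):
    """Summary statistics of the per-movie mean ratings (rows = watchers, columns = movies)."""
    movie_means = np.asarray(ratings, dtype=float).mean(axis=0)  # one mean per column
    return {
        "max": movie_means.max(),
        "min": movie_means.min(),
        "mean": movie_means.mean(),
        "median": np.median(movie_means),
    }

# Example use, once the file is available locally:
#   print(movie_mean_summary(load_ratings()))
```

Switching to `axis=1` gives the mean rating per watcher instead, which is a natural next step for the "How much are people alike?" question.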

Insight #1:
Looking at the actual data set, we see a lot of "haters", i.e., people who gave many ratings of 1.

Insight #2:
Overall there are almost exactly 4000 of each rating (4000 plus or minus 1), i.e., one fifth of the 200 x 100 = 20,000 entries.
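
Counts like the ones behind Insights #1 and #2 can be tallied with a few lines. This is a sketch assuming the ratings are integers from 1 to 5 held in a 2-D array with one row per watcher; `rating_counts` and `ones_per_watcher` are hypothetical helper names.

```python
import numpy as np

def rating_counts(ratings, lo=1, hi=5):
    """How many times each rating value lo..hi occurs in the whole table."""
    values = np.asarray(ratings).astype(int).ravel()
    counts = np.bincount(values, minlength=hi + 1)[lo:hi + 1]
    return dict(zip(range(lo, hi + 1), counts.tolist()))

def ones_per_watcher(ratings):
    """Number of 1 ratings given by each watcher (one entry per row)."""
    return (np.asarray(ratings) == 1).sum(axis=1)

if __name__ == "__main__":
    toy = np.array([[1, 1, 5], [2, 3, 1]])   # tiny stand-in table, not the class data
    print(rating_counts(toy))
    print(ones_per_watcher(toy))
```

A histogram of `ones_per_watcher` over the real table would show whether the "haters" form a distinct group or just the tail of a smooth distribution.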

Insight #3:
There seem to be exactly 4 kinds of movies.

Why? Movie ratings were generated from the sides of a regular tetrahedron.