Segment 29 Sanmit Narvekar
1. In your favorite computer language, write a code for K-means clustering, and cluster the given data using (a) 3 components and (b) 8 components. Don't use anybody's K-means clustering package for this part: Code it yourself. Hint: Don't try to do it as limiting case of GMMs, just code it from the definition of K-means clustering, using an E-M iteration. Plot your results by coloring the data points according to which cluster they are in. How sensitive is your answer to the starting guesses?
Here is the MATLAB code for k-means clustering:
K = 3; % Number of clusters data = load('Twoexondata.txt'); rIndices = randi(length(data), K, 1); centers = data(rIndices,:); for iter=1:10 fprintf('Iteration %d\n', iter); % Assign data to centers (E-step) assignments = zeros(length(data), 1); for d=1:length(data) % Calculate distances to each center cDistances = zeros(K, 1); for c = 1:length(centers) cDistances(c) = norm(data(d,:) - centers(c,:)); end [~, index] = min(cDistances); assignments(d) = index; end % Recompute centers (M-step) for c=1:K cIndices = find(assignments == c); centers(c,:) = mean(data(cIndices,:)); end end % Visualize colors = 'rgbymcwk'; for c=1:K cIndices = find(assignments == c); hold on; scatter(data(cIndices,1),data(cIndices,2), 16, [colors(c), 'x']) h = plot(centers(c,1), centers(c,2), ['k', 'o'], 'markers', 8); set(h, 'MarkerFaceColor', get(h, 'Color')); end title(sprintf('K = %d', K))
The resulting clusters for K= 3 and K = 8 are shown below. The centers are shown by black circles. Unfortunately, Matlab only has 8 colors, and one of them is white, so one of the clusters may be hard to see. In general, the clustering with 3 components was not that sensitive to initial parameters. Clustering with 8 components was much more sensitive.
2. In your favorite computer language, and either writing your own GMM program or using any code you can find elsewhere (e.g., Numerical Recipes for C++, or scikit-learn, which is installed on the class server, for Python), construct mixture models like those shown in slide 8 (for 3 components) and slide 9 (for 8 components). You should plot 2-sigma error ellipses for the individual components, as shown in those slides.
Here is the code in MATLAB, using the built in gmm fitting functions:
K = 3; % number of mixtures stdev = 2; % plot 2 sigma error ellipsoids data = load('Twoexondata.txt'); % Plot the dataset hold on; scatter(data(:,1), data(:,2), 'rx') gmm = gmdistribution.fit(data, K); mu = gmm.mu; sigma = gmm.Sigma; for k=1:K n = 100; L = chol(sigma(:,:,k), 'lower'); circle = [cos(2*pi*(0:n)/n); sin(2*pi*(0:n)/n)].*stdev; ellipse = L*circle + repmat(mu(k,:)',[1,n+1]); x = ellipse(1,:); y = ellipse(2,:); plot(x, y, 'b') end title(sprintf('K = %d', K));
And here are the mixture graphs, with 2 sigma error ellipsoids:
To Think About
1. The segment (or the previous one) mentioned that the log-likelihood can sometimes get stuck on plateaus, barely increasing, for long periods of time, and then can suddenly increase by a lot. What do you think is happening from iteration to iteration during these times on a plateau?