Segment 29 (Sanmit Narvekar)

To Calculate

1. In your favorite computer language, write a code for K-means clustering, and cluster the given data using (a) 3 components and (b) 8 components. Don't use anybody's K-means clustering package for this part: Code it yourself. Hint: Don't try to do it as a limiting case of GMMs; just code it from the definition of K-means clustering, using an E-M iteration. Plot your results by coloring the data points according to which cluster they are in. How sensitive is your answer to the starting guesses?

Here is the MATLAB code for K-means clustering:


K = 3;  % Number of clusters
data = load('Twoexondata.txt');
N = size(data, 1);

% Initialize the centers to K distinct random data points
% (randperm avoids picking the same point twice)
rIndices = randperm(N, K);
centers = data(rIndices,:);

% A fixed iteration count is used here; stopping once the assignments
% no longer change would also work
for iter=1:10
    fprintf('Iteration %d\n', iter);

    % Assign each point to its nearest center (E-step)
    assignments = zeros(N, 1);
    for d=1:N

        % Calculate distances to each center
        cDistances = zeros(K, 1);
        for c = 1:K
            cDistances(c) = norm(data(d,:) - centers(c,:));
        end
        [~, index] = min(cDistances);
        assignments(d) = index;

    end

    % Recompute each center as the mean of its points (M-step)
    for c=1:K
        cIndices = find(assignments == c);
        if ~isempty(cIndices)  % a cluster can go empty; keep its old center
            centers(c,:) = mean(data(cIndices,:), 1);
        end
    end

end

% Visualize: color the points by cluster, mark centers with filled black circles
colors = 'rgbymcwk';  % MATLAB's eight single-letter color codes

for c=1:K
    cIndices = find(assignments == c);
    hold on;
    scatter(data(cIndices,1), data(cIndices,2), 16, colors(c), 'x')

    plot(centers(c,1), centers(c,2), 'ko', 'MarkerSize', 8, ...
        'MarkerFaceColor', 'k');

end

title(sprintf('K = %d', K))

The resulting clusters for K = 3 and K = 8 are shown below; the centers are the filled black circles. (MATLAB has only eight single-letter color codes, and one of them is white, so one of the eight clusters can be hard to see.) With 3 components, the clustering was not very sensitive to the starting guesses; with 8 components it was much more sensitive, with different initializations often producing different final clusterings. A sketch for quantifying this follows the figures.

SanmitSeg29K3.png

SanmitSeg29K8.png
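
To make the sensitivity comparison concrete, one option is to rerun the clustering from many random starts and compare the within-cluster sum of squares (WCSS) across runs: a wide spread indicates sensitivity to initialization. This is a minimal sketch, not part of the solution above, and it assumes the E-M loop has been wrapped in a hypothetical helper runKmeans(data, K) that returns the final assignments and centers.

% Sketch: quantify sensitivity to starting guesses by comparing the
% within-cluster sum of squares over several random restarts.
% runKmeans(data, K) is a hypothetical wrapper around the loop above.
data = load('Twoexondata.txt');
K = 8;
nRestarts = 20;
wcss = zeros(nRestarts, 1);
for r = 1:nRestarts
    [assignments, centers] = runKmeans(data, K);
    for c = 1:K
        cIndices = find(assignments == c);
        diffs = data(cIndices,:) - repmat(centers(c,:), length(cIndices), 1);
        wcss(r) = wcss(r) + sum(sum(diffs.^2));
    end
end
fprintf('WCSS over %d restarts: min %g, max %g\n', nRestarts, min(wcss), max(wcss));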

2. In your favorite computer language, and either writing your own GMM program or using any code you can find elsewhere (e.g., Numerical Recipes for C++, or scikit-learn, which is installed on the class server, for Python), construct mixture models like those shown in slide 8 (for 3 components) and slide 9 (for 8 components). You should plot 2-sigma error ellipses for the individual components, as shown in those slides.

Here is the code in MATLAB, using the built-in GMM fitting function gmdistribution.fit. Each component's 2-sigma error ellipse is drawn by mapping points on a circle of radius 2 through the lower Cholesky factor L of the component covariance (so that L*L' = Sigma) and shifting by the component mean:

K = 3;  % number of mixture components
stdev = 2;  % plot 2-sigma error ellipses

data = load('Twoexondata.txt');

% Plot the dataset
hold on;
scatter(data(:,1), data(:,2), 16, 'r', 'x')

% Fit the mixture model by E-M (in newer releases, fitgmdist does the same)
gmm = gmdistribution.fit(data, K);
mu = gmm.mu;
sigma = gmm.Sigma;

% Map a radius-stdev circle through the Cholesky factor of each
% component covariance to trace that component's error ellipse
for k=1:K
    n = 100;
    L = chol(sigma(:,:,k), 'lower');
    circle = [cos(2*pi*(0:n)/n); sin(2*pi*(0:n)/n)] .* stdev;
    ellipse = L*circle + repmat(mu(k,:)', [1, n+1]);
    plot(ellipse(1,:), ellipse(2,:), 'b')
end
title(sprintf('K = %d', K));
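
As a quick sanity check on the Cholesky construction, every point on an ellipse should lie at Mahalanobis distance stdev from its component mean, since (L*c)' * inv(Sigma) * (L*c) = c'*c for any point c on the radius-stdev circle. A minimal sketch, reusing the loop variables above for the last component drawn:

% Sanity check (sketch): points on the ellipse should all sit at
% Mahalanobis distance stdev from the component mean
diffs = ellipse - repmat(mu(k,:)', [1, n+1]);
mahal = sqrt(sum(diffs .* (sigma(:,:,k) \ diffs), 1));
fprintf('max |distance - %g| = %g\n', stdev, max(abs(mahal - stdev)));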

And here are the mixture graphs, with 2-sigma error ellipses:

SanmitSeg29GMM3.png

SanmitSeg29GMM8.png


To Think About

1. The segment (or the previous one) mentioned that the log-likelihood can sometimes get stuck on plateaus, barely increasing, for long periods of time, and then can suddenly increase by a lot. What do you think is happening from iteration to iteration during these times on a plateau?
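
Although the question asks only for reasoning, the plateau behavior is easy to observe directly. A minimal sketch, assuming the same Statistics Toolbox fitter used above: ask it to print the log-likelihood at every E-M iteration and look for long stretches where it barely moves before jumping.

% Sketch: display the log-likelihood at each E-M iteration so plateaus
% (many iterations of almost no improvement, then a jump) are visible
data = load('Twoexondata.txt');
opts = statset('Display', 'iter', 'MaxIter', 500);
gmm = gmdistribution.fit(data, 8, 'Options', opts);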