Segment 29 Sanmit Narvekar


Segment 29

To Calculate

1. In your favorite computer language, write a code for K-means clustering, and cluster the given data using (a) 3 components and (b) 8 components. Don't use anybody's K-means clustering package for this part: Code it yourself. Hint: Don't try to do it as limiting case of GMMs, just code it from the definition of K-means clustering, using an E-M iteration. Plot your results by coloring the data points according to which cluster they are in. How sensitive is your answer to the starting guesses?

Here is the MATLAB code for k-means clustering:

K = 3;  % Number of clusters
data = load('Twoexondata.txt');

% Initialize the centers to K randomly chosen data points
rIndices = randi(length(data), K, 1);
centers = data(rIndices,:);

for iter = 1:10
    fprintf('Iteration %d\n', iter);

    % Assign each data point to its nearest center (E-step)
    assignments = zeros(length(data), 1);
    for d = 1:length(data)
        % Calculate distances to each center
        cDistances = zeros(K, 1);
        for c = 1:K
            cDistances(c) = norm(data(d,:) - centers(c,:));
        end
        [~, index] = min(cDistances);
        assignments(d) = index;
    end

    % Recompute each center as the mean of its assigned points (M-step)
    for c = 1:K
        cIndices = find(assignments == c);
        centers(c,:) = mean(data(cIndices,:));
    end
end

% Visualize: color the points by cluster and mark the centers
colors = 'rgbymcwk';
for c = 1:K
    cIndices = find(assignments == c);
    hold on;
    scatter(data(cIndices,1), data(cIndices,2), 16, colors(c), 'x')
    h = plot(centers(c,1), centers(c,2), 'ko', 'MarkerSize', 8);
    set(h, 'MarkerFaceColor', get(h, 'Color'));
end
title(sprintf('K = %d', K))

The resulting clusters for K = 3 and K = 8 are shown below, with the cluster centers marked by black circles. Unfortunately, MATLAB has only 8 single-letter color codes, and one of them is white, so one of the K = 8 clusters may be hard to see. In general, the clustering with 3 components was not very sensitive to the starting guesses; the clustering with 8 components was much more sensitive.
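One way to check this sensitivity (not part of the assignment) is to re-run the clustering from several random initializations and compare the total within-cluster sum of squared distances. Here is a minimal sketch, assuming the K-means code above has been wrapped in a hypothetical function runKMeans(data, K) that returns the final assignments and centers:

% Sensitivity check (sketch): several random restarts, compared by their
% total within-cluster sum of squared distances. runKMeans is a hypothetical
% wrapper around the K-means code above.
data = load('Twoexondata.txt');
K = 8;
nRestarts = 10;
wcss = zeros(nRestarts, 1);
for r = 1:nRestarts
    [assignments, centers] = runKMeans(data, K);
    for c = 1:K
        cIndices = find(assignments == c);
        diffs = data(cIndices,:) - repmat(centers(c,:), numel(cIndices), 1);
        wcss(r) = wcss(r) + sum(sum(diffs.^2));
    end
end
fprintf('Within-cluster SSE over %d restarts: min %.3f, max %.3f\n', ...
    nRestarts, min(wcss), max(wcss));

A large spread between the best and worst restart indicates that the result depends strongly on the initial centers, which is what the K = 8 runs showed here.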



2. In your favorite computer language, and either writing your own GMM program or using any code you can find elsewhere (e.g., Numerical Recipes for C++, or scikit-learn, which is installed on the class server, for Python), construct mixture models like those shown in slide 8 (for 3 components) and slide 9 (for 8 components). You should plot 2-sigma error ellipses for the individual components, as shown in those slides.

Here is the code in MATLAB, using the built-in GMM fitting functions:

K = 3;      % Number of mixture components
stdev = 2;  % Plot 2-sigma error ellipses

data = load('Twoexondata.txt');

% Plot the dataset
hold on;
scatter(data(:,1), data(:,2), 'rx')

% Fit a K-component Gaussian mixture model to the data
gmm = fitgmdist(data, K);
mu =;
sigma = gmm.Sigma;

% Draw a 2-sigma ellipse for each component by mapping a circle of
% radius stdev through the Cholesky factor of that component's covariance
for k = 1:K
    n = 100;
    L = chol(sigma(:,:,k), 'lower');
    circle = [cos(2*pi*(0:n)/n); sin(2*pi*(0:n)/n)].*stdev;
    ellipse = L*circle + repmat(mu(k,:)', [1, n+1]);
    x = ellipse(1,:);
    y = ellipse(2,:);
    plot(x, y, 'b')
end
title(sprintf('K = %d', K));
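As an optional follow-up, the fitted model can also be used to hard-assign each point to its most likely component and color the points accordingly, in the spirit of the K-means plots above. A minimal sketch, assuming gmm, data, and K from the code above and using the gmdistribution cluster method:

% Sketch: color each point by its most probable mixture component.
% Assumes gmm, data, and K from the code above; cluster() is the
% gmdistribution method that returns the most probable component index.
colors = 'rgbymcwk';
idx = cluster(gmm, data);
figure; hold on;
for k = 1:K
    kIndices = find(idx == k);
    scatter(data(kIndices,1), data(kIndices,2), 16, colors(k), 'x')
end
title(sprintf('GMM hard assignments, K = %d', K))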

And here are the mixture graphs, with 2-sigma error ellipses:



To Think About

1. The segment (or the previous one) mentioned that the log-likelihood can sometimes get stuck on plateaus, barely increasing, for long periods of time, and then can suddenly increase by a lot. What do you think is happening from iteration to iteration during these times on a plateau?
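One way to watch this behavior (just a sketch, and assuming fitgmdist accepts a statset options structure with Display set to 'iter', as in the Statistics and Machine Learning Toolbox) is to print the fitting progress at every EM iteration and look for long stretches where the log-likelihood barely moves before jumping:

% Sketch: print per-iteration EM progress while fitting an 8-component GMM.
% Assumption: fitgmdist takes 'Options', statset(...), and Display 'iter'
% reports the fit's log-likelihood at each iteration.
data = load('Twoexondata.txt');
opts = statset('Display', 'iter', 'MaxIter', 500);
gmm = fitgmdist(data, 8, 'Options', opts);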