# Segment 29 Sanmit Narvekar


## Segment 29

#### To Calculate

1. In your favorite computer language, write code for K-means clustering, and cluster the given data using (a) 3 components and (b) 8 components. Don't use anybody's K-means clustering package for this part: code it yourself. Hint: Don't try to do it as a limiting case of GMMs; just code it from the definition of K-means clustering, using an E-M iteration. Plot your results by coloring the data points according to which cluster they are in. How sensitive is your answer to the starting guesses?

Here is the MATLAB code for K-means clustering:

```
K = 3;  % Number of clusters
data = load('Twoexondata.txt');

% Initialize the centers to K distinct data points chosen at random
% (randperm avoids picking the same point twice, which randi can do)
rIndices = randperm(size(data,1), K);
centers = data(rIndices,:);

for iter = 1:10
    fprintf('Iteration %d\n', iter);

    % Assign each point to its nearest center (E-step)
    assignments = zeros(size(data,1), 1);
    for d = 1:size(data,1)
        % Calculate distances to each center
        cDistances = zeros(K, 1);
        for c = 1:K
            cDistances(c) = norm(data(d,:) - centers(c,:));
        end
        [~, index] = min(cDistances);
        assignments(d) = index;
    end

    % Recompute each center as the mean of its points (M-step)
    for c = 1:K
        cIndices = find(assignments == c);
        if ~isempty(cIndices)  % an empty cluster keeps its old center
            centers(c,:) = mean(data(cIndices,:), 1);
        end
    end
end

% Visualize: color points by cluster; centers are filled black circles
colors = 'rgbymcwk';
hold on;
for c = 1:K
    cIndices = find(assignments == c);
    scatter(data(cIndices,1), data(cIndices,2), 16, colors(c), 'x');
    plot(centers(c,1), centers(c,2), 'ko', 'MarkerSize', 8, 'MarkerFaceColor', 'k');
end
title(sprintf('K = %d', K))
```

The resulting clusters for K = 3 and K = 8 are shown below; the centers are marked by filled black circles. Unfortunately, MATLAB's single-letter color codes give only eight colors, and one of them is white, so one of the eight clusters may be hard to see against the background. In general, the clustering with 3 components was not very sensitive to the starting guesses, while the clustering with 8 components was much more sensitive: different starting guesses often led to noticeably different final clusterings.
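To quantify that sensitivity, one can re-run the fit from several random seeds and compare the final within-cluster sum of squared errors (SSE): runs that land in different local optima report clearly different SSE values. Below is a minimal sketch of this check, restating the same E-M loop in vectorized form; it assumes the Statistics Toolbox function `pdist2` is available, and the seed range and iteration count are arbitrary choices.

```
data = load('Twoexondata.txt');
K = 8;       % the more sensitive case
nIter = 10;

for seed = 1:5
    rng(seed);  % reproducible starting guess
    centers = data(randperm(size(data,1), K), :);
    for iter = 1:nIter
        % E-step: index of the nearest center for every point
        [~, assignments] = min(pdist2(data, centers), [], 2);
        % M-step: move each center to the mean of its assigned points
        for c = 1:K
            members = data(assignments == c, :);
            if ~isempty(members)
                centers(c,:) = mean(members, 1);
            end
        end
    end
    sse = sum(min(pdist2(data, centers), [], 2).^2);
    fprintf('Seed %d: within-cluster SSE = %.2f\n', seed, sse);
end
```

Seeds that report (nearly) identical SSE found the same optimum; a spread of SSE values across seeds is direct evidence of sensitivity to the starting guesses.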

2. In your favorite computer language, and either writing your own GMM program or using any code you can find elsewhere (e.g., Numerical Recipes for C++, or scikit-learn, which is installed on the class server, for Python), construct mixture models like those shown in slide 8 (for 3 components) and slide 9 (for 8 components). You should plot 2-sigma error ellipses for the individual components, as shown in those slides.

Here is the code in MATLAB, using the built-in GMM fitting function `gmdistribution.fit`:

```
K = 3;      % number of mixture components
stdev = 2;  % plot 2-sigma error ellipses

data = load('Twoexondata.txt');

% Plot the dataset
hold on;
scatter(data(:,1), data(:,2), 16, 'r', 'x')

% Fit a K-component Gaussian mixture (newer MATLAB releases use fitgmdist)
gmm = gmdistribution.fit(data, K);
mu = gmm.mu;
sigma = gmm.Sigma;

% Map the unit circle through the lower Cholesky factor of each covariance:
% if L*L' = Sigma, then L times the unit circle traces the 1-sigma ellipse,
% and scaling by stdev gives the 2-sigma ellipse.
for k = 1:K
    n = 100;
    L = chol(sigma(:,:,k), 'lower');
    circle = stdev * [cos(2*pi*(0:n)/n); sin(2*pi*(0:n)/n)];
    ellipse = L*circle + repmat(mu(k,:)', [1, n+1]);
    plot(ellipse(1,:), ellipse(2,:), 'b')
end
title(sprintf('K = %d', K));
```
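As a quick consistency check against Part 1, the fitted model's built-in `cluster` method can hard-assign each point to its most probable component, which can then be colored the same way as the K-means plot (a minimal sketch; the color list and marker size are arbitrary choices):

```
idx = cluster(gmm, data);  % most probable component for each point
colors = 'rgbymcwk';

figure; hold on;
for k = 1:K
    scatter(data(idx==k,1), data(idx==k,2), 16, colors(k), 'x');
end
title(sprintf('GMM hard assignments, K = %d', K));
```

Unlike K-means, the underlying assignment here is soft: `posterior(gmm, data)` returns the full set of component responsibilities if the membership probabilities themselves are wanted.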

And here are the resulting mixture plots, with 2-sigma error ellipses:

#### To Think About

1. The segment (or the previous one) mentioned that the log-likelihood can sometimes get stuck on plateaus, barely increasing, for long periods of time, and then can suddenly increase by a lot. What do you think is happening from iteration to iteration during these times on a plateau?
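One way to watch this behavior directly is to have the EM fitter print the log-likelihood at every iteration. Here is a sketch using the same built-in fitter as above (the `statset` display option assumes the Statistics Toolbox; the choice of 8 components and the iteration cap are arbitrary):

```
data = load('Twoexondata.txt');

% Print the iteration number and log-likelihood at each EM step,
% making any plateaus and sudden jumps visible in the output
opts = statset('Display', 'iter', 'MaxIter', 500);
gmm = gmdistribution.fit(data, 8, 'Options', opts);
```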