Segment 28 Sanmit Narvekar

From Computational Statistics Course Wiki
Jump to navigation Jump to search

Segment 28

To Calculate

1. Draw a sample of 100 points from the uniform distribution Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle U(0,1)} . This is your data set. Fit GMM models to your sample (now considered as being on the interval Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle -\infty < x < \infty} ) with increasing numbers of components Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle K} , at least Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle K=1,\ldots,5} . Plot your models. Do they get better as Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle K} increases? Did you try multiple starting values to find the best (hopefully globally best) solutions for each Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle K} ?

Here is the MATLAB code to fit GMMs to 100 samples drawn from the uniform distribution:


clear;
clc;
close all;

data = rand(100,1);

% Chose number of mixtures
K = 2;

% Initialize models "randomly"
mu = randsample(data, K)';
sigma = ones(1, K) * 0.3;

for iter=1:10

    pr = @(x) exp(-0.5*((x-mu)./sigma).^2)./(2.506 * sigma);
    prn = @(x) pr(x) ./ sum(pr(x));

    % E-step (compute assignments of all points)
    prns = zeros([numel(data),K]);
    for j=1:numel(data); prns(j,:)=prn(data(j)); end;



    % M-step

    % Re-estimate mean
    prnsx = zeros([numel(data), K]);
    for j=1:numel(data)
        prnsx(j,:) = data(j) * prns(j,:);
    end
    mu = sum(prnsx) ./ sum(prns);


    % Re-estimate cov
    xmmu = zeros([numel(data), K]);
    for j=1:numel(data)
        xmmu(j,:) = data(j) - mu;
    end
    sigma = sqrt(sum(prns .* (xmmu.^2)) ./ sum(prns));

    % Re-estimate P(k)
    Phat = sum(prns, 1) / numel(data);

    % Plot mixture pdf
    mixFunc = @(x) sum(Phat .* pr(x), 2);
    x = -1:.01:2;
    px = arrayfun(mixFunc,x);
    
    plot(x, px)
    

end

hold on;

[f x] = ksdensity(data);
plot(x,f,'r')
hold off;
title(sprintf('K = %d', K))
legend('GMM', 'KS Density')


Here are the resulting mixture pdfs. In general, adding more mixtures does help. However, since the underlying distribution is NOT a mixture of Gaussians, it fails to capture the underlying distribution correctly.

SanmitSeg28K1.png

SanmitSeg28K2.png

SanmitSeg28K3.png

SanmitSeg28K4.png

SanmitSeg28K5.png


2. Multiplying a lot of individual likelihoods will often underflow. (a) On average, how many values drawn from Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle U(0,1)} can you multiply before the product underflows to zero? (b) What, analytically, is the distribution of the sum of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N} independent values Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \log(U)} , where Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle U\sim U(0,1)} ? (c) Is your answer to (a) consistent with your answer to (b)?


Part A

Here is the MATLAB code to compute it:

dist = zeros(10000,1);

for d=1:length(dist)
    a = 1;
    count = 0;
    while a ~= 0

        a = a * rand();
        count = count + 1;
    end
    dist(d) = count;
end

mu = mean(dist)
std = std(dist)

On average, it takes 746.6751 +/- 27.2855 numbers before it underflows.


Part B

First we calculate the distribution of a single sample from log(U). The mean is as follows:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mu = \mathbb{E}[\log(U)]}

Recall the following property of expectation, where X is distributed according to f, and g is a function of X:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \operatorname{E}[g(X)] = \int_{-\infty}^\infty g(x) f(x)\, \mathrm{d}x .}

If we allow g to be log, and f to be the uniform distribution, then the mean becomes:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mu = \int_0^1 \log x dx = -1}

Then to calculate the variance, we first compute (using the same property as before):

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb{E}[\log^2(U)] = \int_0^1 \log^2(x) dx = 2}

Then the variance is:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \text{Var}(x) = \mathbb{E}[\log^2(U)] - (\mathbb{E}[\log(U)])^2 = 2 - 1 = 1}


Now, if we take the sum of N independent samples from log(U), they will be distributed as:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mu = \mathbb{E}[X_1+X_2+\ldots X_N] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \ldots \mathbb{E}[X_N] = N \times -1 = -N}

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \text{Var}(x) = Var(X_1 + X_2 + \ldots + X_N) = Var(X_1) + Var(X_2) + \ldots + Var(X_N) = N \times 1 = N}

By the Central Limit Theorem, we can say they are distributed Normally with mean -N and variance N.


Part C

In part A, we found we could multiply about 746 numbers before reaching an underflow. Since this is equivalent to taking the log of the exp of these numbers, we can use our result from part B, which states that the mean of 746 numbers distributed as log(U) will be -746. Thus, we just check if the exp of -746 is the point at which we underflow.

MATLAB code:

for N=-740:-1:-750
    fprintf('%d %e\n', N, exp(N))
end

And the output, which is what we expect:

-740 4.199558e-322
-741 1.531604e-322
-742 5.434722e-323
-743 1.976263e-323
-744 9.881313e-324
-745 4.940656e-324
-746 0.000000e+00
-747 0.000000e+00
-748 0.000000e+00
-749 0.000000e+00
-750 0.000000e+00


To Think About

1. Suppose you want to approximate some analytically known function Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f(x)} (whose integral is finite), as a sum of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle K} Gaussians with different centers and widths. You could pretend that Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f(x)} (or some scaling of it) was a probability distribution, draw Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N} points from it and do the GMM thing to find the approximating Gaussians. Now take the limit Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N\rightarrow \infty} , figure out how sums become integrals, and write down an iterative method for fitting Gaussians to a given Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f(x)} . Does it work? (You can assume that well-defined definite integrals can be done numerically.)