(Rene) Segment 41: Markov Chain Monte Carlo, Example 2

From Computational Statistics Course Wiki
Jump to navigation Jump to search

To Calculate

1. Show that the waiting times (times between events) in a Poisson process are Exponentially distributed. (I think we've done this before.)

The geometric distribution, given below, has the interpretation of the number of failures in a sequence of Bernoulli trials until the first success.

.

Here is the probability of success.

Partitioning the real line into n steps such that and defining the average rate , we have in the limit of

.

Hence, the waiting time between two successive events is exponentially distributed.

2. Plot the pdf's of the waiting times between (a) every other Poisson event, and (b) every Poisson event at half the rate.

TbetweenArr.jpg TbetweenArr3.jpg TbetweenArr2.jpg


3. Show, using characteristic functions, that the waiting times between every Nth event in a Poisson process is Gamma distributed. (I think we've also done one before, but it is newly relevant in this segment.)

The characteristic function of an exponential distribution is:


Let . Then, we want to know the distribution of . We have that,


Which is precisely the characteristic function of the Gamma distribution with shape parameter $k$ and rate parameter $\lambda$.

To Think About

1. In slide 5, showing the results of the MCMC, how can we be sure (or, how can we gather quantitative evidence) that there won't be another discrete change in or if we keep running the model longer. That is, how can we measure convergence of the model?

2. Suppose you have two hypotheses: H1 is that a set of times are being generated as every 26th event from a Poisson process with rate 26. H2 is that they are every 27th event from a Poisson process with rate 27. (The mean rate is thus the same in both cases.) How would you estimate the number of data points that you need to clearly distinguish between these hypotheses?

Class activity

An urn that contains five different types of balls that are drawn without replacement. Each time a ball is drawn, the probability of drawing a specific type of ball is proportional to the number of that type remaining and to a type-specific weight.

The experiment conducted was: draw from the urn until all balls have been drawn and record the order of colors produced. This experiment was carried out 50 independent times, producing urn_data.txt.

The initial counts in the urn are:

  • Red - 2
  • Blue - 4
  • Orange - 7
  • Purple - 3
  • Green - 5

The data consists of 50 lines that look like:

1 2 1 2 3 4 2 3 3 2 4 4 1 2 1 2 2 4 4 0 0

which says "For this trial, the draws went blue, orange, blue, orange, purple, ...."

Your job is to infer the values of the type-specific weights from the data by MCMC.

Each team must divide into three sub-teams, each of which has a specific task.

Sub-team 1: Produce a function that takes a set of weights and computes the likelihood of the entire data set given the weights.

function logp = likelyhoodfn(data,weight)

    dim = size(data);       % size data
    count = [2 4 7 3 5];    % initial counts per color
    p = zeros(dim(1),dim(2));
    
    % tile arrays for fast acces
    weight = repmat(weight,dim(1),1);
    count = repmat(count,dim(1),1);

    for k=1:dim(2); 

        % draw ball from box
        ball = data(:,k); 

        % probability of draw
        ind = sub2ind([dim(1) 5], [1:dim(1)]', ball+1);
        pr = weight.* count;
        
        p(:,k) = pr(ind) ./ sum(pr,2);

        % initialize next loop
        count(ind) = count(ind) - 1;

    end

    % return log likelhood
    logp = sum(log(p(:)));

end

Sub-team 2: Produce a function that takes the current set of weights and proposes a new set, returning the proposal and the corresponding ratio of q's.

function [wnew,pnew] = mcmcstep(wold,data)

    % proposal distribution: lognormal
    wnew = wold.*lognrnd(0,0.05,1,numel(wold));
    
    % compute likelyhood at old and new point
    pnew = likelyhoodfn(data,wnew);
    pold = likelyhoodfn(data,wold);
    
    % compute acceptance probability
    alpha = min(1, exp(pnew-pold) * prod(wnew./wold));
    
    % decide if to accept proposal
    if rand>alpha
        wnew = wold;
    end
    wnew = wnew/wnew(1);
    
end

Sub-team 3: Produce the logic that calls the two functions that will be supplied by the other sub-teams to carry out the MCMC. This will look like: For some number of steps,

  • propose a new set of proportions given the current ones
  • evaluate the likelihood of the data given the new proposal and combine with the q-ratio to produce alpha, and accept the proposal with probability alpha.
  • record the new values of the proportions used and the new likelihood.

Plot the trajectories of the proportions and the likelihood over the course of the steps.

%% MCMC
clear all; close all; clc;

% add statistics functions 
addpath('../functions');

% initialize
load Urndata;   % load data
 
% number of mcmc steps 
n = 1000;
 
% do mcmc simulation
chain = zeros(n,5); ploglike = zeros(n,1);
chain(1,:) = ones(1,5);
for k=2:n
    [chain(k,:),ploglike(k)] = mcmcstep(chain(k-1,:),Urndata);
end

figure;
plot(ploglike)
figure;
plot(chain);
string = {'red','blue','orange','purple','green'};
for i = 1:5
    figure; hist(chain(200:end,i),25)
    title(string{i})
end


Below are the results. From figure 1, the log likelyhood, we can see that the burn in-period is about one hundred steps. Figure 2 gives the evolution of the weights scaled with weight 1. Histograms of the data is depicted in Figure 3 to 7.

Loglikelyhood.jpg Mcmc.jpg

Red.jpg Blue.jpg Orange.jpg Purple.jpg Green.jpg