Extra Credit Segments

From Computational Statistics (CSE383M and CS395T)

There were no problems for these segments, so I wasn't sure what to put here. I decided to go to the concept page for each one, write some notes on it, and try to find a real-life application of the segment I watched.

Segment 25: Fitting Models to Counts

Similarity of Histogram to Contingency Table: A histogram has one bin for each possible outcome of the experiment, and each bin holds the number of times that outcome occurred. A contingency table does the same thing with categorical outcomes: each cell holds how many times that particular category occurred. However, since the counts are not independent of one another (they have to add up to the total), the histogram actually follows a multinomial distribution.
Poisson as approximation to Binomial: By approximating and reducing terms in the binomial distribution, we can let a Poisson distribution represent the count in a bin. Which of the two applies depends on whether the total number of events was a constraint fixed in advance or just happened to be what it was.
Mean-Square Error (relation to chi-square): For this kind of chi-square, the mean-square error is the chi-square divided by N. Thus, for a good fit, its distribution should peak near 1.
Pearson vs. modified Neyman chi-square: The two equations differ only slightly: Pearson divides by the expected count, while modified Neyman divides by the observed count plus some pseudo-count. Despite that, Neyman will produce a better model than Pearson. This is because Pearson uses the parameter <math>p_i</math> in its denominator (skewing the model), whereas Neyman uses only the data there. Neyman can also recover from a bin whose expected probability <math>p_i</math> is very small, where the Pearson term would become very large.
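Pseudo-Count: an amount added to the denominator of the Neyman chi-square, coming from the lambda of the Poisson distribution, so that a bin with zero counts does not cause a division by zero.
Chi-By-Eye: accepting a fit because it looks good by eye, even though it is not really a good representation of the data according to the chi-square test.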
Application of this Segment: Since this approach only works on categorical data, it could probably be used to relate different categories of human physical appearance, such as eye color, skin color, gender, and perhaps even diseases, and see how they correspond with each other.
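
To make the Pearson vs. modified Neyman item concrete for myself, here is a minimal Python sketch. The counts and expected values are made up for illustration, and the denominator "count + pseudo-count" follows my notes above; the exact form used in the lecture may differ, so treat this as an assumption rather than the official definition.

```python
import numpy as np

def pearson_chi2(counts, expected):
    """Pearson chi-square: squared residuals divided by the model expectation."""
    counts = np.asarray(counts, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((counts - expected) ** 2 / expected))

def modified_neyman_chi2(counts, expected, pseudo_count=1.0):
    """Modified Neyman chi-square: squared residuals divided by the data.

    ASSUMPTION: the denominator is (observed count + pseudo_count), as in
    the notes above; the lecture's exact form may differ.
    """
    counts = np.asarray(counts, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((counts - expected) ** 2 / (counts + pseudo_count)))

# Hypothetical histogram: observed counts and the model's expected counts.
counts = np.array([12, 7, 0, 1])
expected = np.array([10.0, 8.0, 1.5, 0.5])

pearson = pearson_chi2(counts, expected)
neyman = modified_neyman_chi2(counts, expected)
print("Pearson:", pearson)
print("modified Neyman:", neyman)
```

The zero-count bin is the reason for the pseudo-count: without it the Neyman denominator would be zero for that bin.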

Segment 26: The Poisson Count Pitfall

What makes a statistic accurately chi-square: For Pearson's chi-square, we can either sum any number of terms that are each accurately the square of a t-value, or sum a large number of terms with the same mean and standard deviation and rely on the Central Limit Theorem.
Normal Approximation to Chi-square distribution: Each term in Pearson's statistic is a random variable with mean 1 and variance 2. Thus, when we sum many of these random variables, the Central Limit Theorem says the normal approximation should be N(<math>\nu\text{, } \sqrt{2\nu}</math>). However, once we account for bins with small expected counts, such as outlier bins that hold only a few counts, the variance is actually <math>2\nu + \sum_{i=1}^{N} \mu_i^{-1}</math>, which is larger.
Corrected Chi-Square Statistic for Poisson data: I could be wrong about this, but to correct for it we could eliminate the small bins from our model so that they do not distort our distribution.

Application: This segment was about some of the things that papers have done wrong when using Pearson's chi-square test. They believe that as we increase the number of random variables in our distribution, it should approach N(<math>\nu\text{, } \sqrt{2\nu}</math>). However, since these people do not account for the small counts that may appear in outlier bins, the true variance is larger than the <math>2\nu</math> they assume. A deviation then looks more significant than it really is, so they draw conclusions from the test that they should not.
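
As a sanity check on the variance formula, here is a small simulation sketch in Python. The bin means are hypothetical values I picked so that a few bins have small expected counts; no parameters are fitted, so I take <math>\nu</math> to be simply the number of bins (an assumption of this sketch).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expected counts per bin; the first few are deliberately small.
mu = np.array([0.5, 1.0, 2.0, 5.0, 20.0, 50.0])
nu = mu.size  # ASSUMPTION: no fitted parameters, so nu = number of bins

# Draw many Poisson histograms and compute Pearson's chi-square for each.
n_trials = 200_000
counts = rng.poisson(mu, size=(n_trials, nu))
chi2 = np.sum((counts - mu) ** 2 / mu, axis=1)

naive_var = 2 * nu                          # what N(nu, sqrt(2 nu)) assumes
corrected_var = 2 * nu + np.sum(1.0 / mu)   # variance with the small-count term

print("simulated mean:    ", chi2.mean())
print("simulated variance:", chi2.var())
print("naive 2*nu:        ", naive_var)
print("corrected variance:", corrected_var)
```

The simulated variance should land near the corrected value and clearly above <math>2\nu</math>, with the small bins contributing most of the difference.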

Segment 35: Ordinal vs. Nominal Tables

Ordinal vs. Nominal Data: Ordinal tables have columns with a meaningful order, and that order does affect the outcome. Nominal means the labels are only names: regardless of what the name is or where it sits, we get the same outcome.
Advantages of Ordinal Data (re contingency tables): It is very easy to collect and categorize so that we may apply statistics to it.
False pos vs. false neg in contingency table permutation test: In permutation tests, false positives come from the p-values we get once we have calculated the statistic using either Pearson's or Wald's test. False negatives come up when we do a lot of bootstrapping of the data, and they can occur on either side of the test.
Fragility of 2-tailed Fisher Exact Test: The Fisher exact test is fragile in that its result depends on which statistic you use, especially for small amounts of data.

Application: This segment helped me understand Segment 34 a little better. When I was asked in Segment 34 to use a method that did not destroy my computer when computing permutations, I fell under the misconception that the permutation test was the same as bootstrapping. It is similar, but not the same. Yes, you are resampling the data and generating many contingency tables, but the permutation test breaks the contingency table apart by shuffling the labels, which destroys any association between rows and columns, as the null hypothesis assumes. In addition, the two methods are interpreted differently. Another thing I learned from this segment is why my p-values were so different when I calculated the Pearson and the Wald statistics for my contingency table: when the data are small and the marginals are fixed, the Fisher exact test can give drastically different results depending on which statistic you choose. On top of that, fixing the marginals in the Fisher exact test is very unrealistic, since usually we have no idea what the marginals will be (well, at least for one of the factors).
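
To keep the permutation test straight in my head, here is a minimal Python sketch of it for a contingency table, using the Pearson statistic. The function names and the example tables are my own, not from the course, and the sketch assumes every row and column total is nonzero.

```python
import numpy as np

def pearson_stat(table):
    """Pearson chi-square statistic of a contingency table."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(np.sum((table - expected) ** 2 / expected))

def permutation_pvalue(table, n_perm=10_000, seed=0):
    """Shuffle column labels against row labels; both marginals stay fixed."""
    table = np.asarray(table, dtype=int)
    n_rows, n_cols = table.shape
    # Expand the table into one (row label, column label) pair per observation.
    rows = np.repeat(np.arange(n_rows), table.sum(axis=1))
    cols = np.concatenate(
        [np.repeat(np.arange(n_cols), table[i]) for i in range(n_rows)]
    )
    rng = np.random.default_rng(seed)
    observed = pearson_stat(table)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(cols)  # sampling WITHOUT replacement
        permuted = np.zeros_like(table)
        np.add.at(permuted, (rows, shuffled), 1)  # rebuild the table
        if pearson_stat(permuted) >= observed - 1e-12:
            hits += 1
    return hits / n_perm

# Hypothetical 2x2 table with a visible association.
print(permutation_pvalue([[12, 3], [4, 11]], n_perm=2000))
```

The difference from bootstrapping shows up in one line: the labels are permuted without replacement, so every shuffled table has exactly the same row and column totals as the original.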

Segment 36: Contingency Tables Have Nuisance Parameters

Dirichlet distribution as conjugate to multinomial: It is essentially a version of the multinomial distribution in which, rather than taking the combinatorial factor of the observed counts, we use gamma functions and multiply by the respective probabilities raised to powers. Because the two have the same form, the probability of the observed counts occurring times the Dirichlet prior for our parameters is again a Dirichlet.
How to generate Dirichlet deviates: First, for a particular count we draw a value from a gamma distribution whose shape is set by that count. We do that for each of the n values we have. Then we divide each value by the sum of all of them. The normalized values form a probability vector, which we then use as the parameter of a multinomial distribution in order to generate a contingency table.
p or q as nuisance parameters in experimental protocols (contingency tables): When dealing with p or q, we hold one set of marginals constant and let the other one vary.
Application: This is a more practical use of marginalizing over the columns/rows. It is more practical because normally we know one thing about our data (for example, the counts of women), but we do not know the marginals of the factor whose association we are testing under our null hypothesis.
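
The recipe for Dirichlet deviates in the notes above can be sketched in Python as follows. The counts, the "+1" uniform prior, and the sample size are all hypothetical choices of mine for illustration, not values from the segment.

```python
import numpy as np

def dirichlet_deviate(alpha, rng):
    """Draw one Dirichlet deviate: gamma draws normalized by their sum."""
    alpha = np.asarray(alpha, dtype=float)
    g = rng.gamma(shape=alpha, scale=1.0)  # one gamma draw per category
    return g / g.sum()

rng = np.random.default_rng(0)

# Hypothetical observed marginal counts for three categories.
observed_counts = np.array([8, 3, 5])
# ASSUMPTION: a uniform prior, so the Dirichlet parameters are counts + 1.
alpha = observed_counts + 1

# Step 1: draw the nuisance probabilities from the Dirichlet.
p = dirichlet_deviate(alpha, rng)

# Step 2: use them as multinomial parameters to generate new counts.
n_total = 50  # hypothetical sample size
sample = rng.multinomial(n_total, p)

print("probabilities:", p)
print("sampled counts:", sample)
```

NumPy also provides `rng.dirichlet(alpha)`, which does the same draw directly; I wrote out the gamma step only because that is how the notes describe it.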