# Concept Studying Guide For Oral Exam

I did not want to change the stuff Lori wrote but here is my list of study guide answers.

### Noah's lecture notes

I took notes for each segment. I just scanned and uploaded them, in case you could find a use for them. They may be slightly helpful for looking up things Professor Press mentioned in the segment, but didn't put on the slides. I make no guarantees of correctness or completeness.

### Segment 1: Let's talk about probability

What is computational statistics: Simulations done on data sets in order to find a statistic or distribution that may describe the set of data

Probability: The chance or estimation of an event happening.
Calculus of Inference: Takes data and turns it into a probability.
Probability axioms: In Probability there are 3 Axioms:

• Axiom 1: 1 $\geq$ P(E) $\geq$ 0 - meaning the probability of an event is greater than or equal to zero, but less than or equal to 1.
• Axiom 2: For the set of all possible outcomes of an experiment, the sum of their probabilities must add up to 1
• Axiom 3: If the intersection of 2 events is empty (they are mutually exclusive), then P(A $\cup$ B) = P(A) + P(B)

Law of Or-ing: P(A $\cup$ B) = P(A) + P(B) - P(AB)
Law of And-ing: P(AB) = P(B|A)*P(A) - a conditional probability times the probability of an event
Law of Exhaustion: A set of events that covers all of the possible outcomes; the probabilities of an exhaustive set of events sum to 1.
Law of De-Anding: Given a set of events H that are EME (Exhaustive and Mutually Exclusive), we can find the probability of an event A by summing, over the events $H_i$, the conditional probability of A given $H_i$ times the probability of $H_i$: P(A) = $\sum_i$ P(A|$H_i$)P($H_i$).
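
A tiny Python sketch (my own toy example with a fair die, not from the slides) checking the Law of Or-ing and the Law of De-Anding by brute-force enumeration:

```python
from fractions import Fraction

# Toy sample space: one roll of a fair six-sided die.
outcomes = range(1, 7)
p = {o: Fraction(1, 6) for o in outcomes}

A = {2, 4, 6}   # "even"
B = {4, 5, 6}   # "greater than 3"

def prob(event):
    return sum(p[o] for o in event)

# Law of Or-ing: P(A or B) = P(A) + P(B) - P(AB)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)

# Law of De-Anding with the EME partition H_i = {the roll is i}:
# P(A) = sum_i P(A|H_i) P(H_i), where here P(A|H_i) is just 1 or 0.
deand = sum((Fraction(1) if i in A else Fraction(0)) * p[i] for i in outcomes)
```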

### Segment 2: Bayes

Bayes Theorem: Suppose we have a set of EME hypotheses $H_i$ with known probabilities P($H_i$), and an event B. If we want the probability of a hypothesis $H_i$ given that B occurred, we compute P($H_i$|B) = P(B|$H_i$)P($H_i$) divided by the sum over all j of the conditional probabilities P(B|$H_j$) times P($H_j$).
EME hypotheses: In order for Bayes Theorem to work, we must have a set of mutually exclusive and exhaustive events for us to apply this theorem without any kind of correction.
Contrast Bayesians and Frequentists: Bayesians like to find the probability of a hypothesis given the data (an odd, but powerful, form of conditional probability), while Frequentists hold that a hypothesis is either true or false, so it makes no sense to assign it a probability; they only consider probabilities of data under a fixed hypothesis (ask for more info on this).
Probabilities Modified by Data: The idea that the probability of an event can change based on what data is observed. For example, if we have 2 coins where one is fair and one is unfair, we have a 50% chance of having picked either coin. However, once we see data that is heavily weighted toward one outcome, the unfair coin becomes the more probable explanation. Thus, given the observations we make, the probability of the event we are trying to evaluate may change.
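
A quick Python sketch of this two-coin example; the 0.8 bias of the unfair coin is a number I made up for illustration:

```python
from math import comb

# Two coins: one fair (P(heads)=0.5), one biased (assumed P(heads)=0.8).
# Pick a coin at random (prior 0.5 each), flip it n times, see k heads,
# and update P(biased | data) with Bayes' theorem.
def posterior_biased(k, n, p_biased=0.8):
    like_fair = comb(n, k) * 0.5**k * 0.5**(n - k)
    like_biased = comb(n, k) * p_biased**k * (1 - p_biased)**(n - k)
    # Bayes denominator: sum over the EME hypotheses
    denom = 0.5 * like_fair + 0.5 * like_biased
    return 0.5 * like_biased / denom

p0 = posterior_biased(0, 0)   # no data: posterior = prior = 0.5
p8 = posterior_biased(8, 10)  # heads-heavy data favours the biased coin
```
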
Prior Probability: The probability assigned to a hypothesis before seeing the data.
Posterior Probability: The probability of a hypothesis after seeing the data (i.e., given the event).
Evidence Factor: The conditional probability of the data B given a hypothesis whose prior probability we know; the likelihood P(B|$H_i$).
Bayes Denominator: The sum over all hypotheses of the likelihood times its prior, $\sum_j$ P(B|$H_j$)P($H_j$). Used to normalize when finding the reversed conditional probability.
Background information: Normally whenever we have a prior probability, that prior probability is based on information we have seen before. Thus priors are usually written as P(H|I), where I represents the background information in order to get the prior probability for H.
Commutativity and Associativity of Evidence: The order and grouping in which you incorporate pieces of evidence does not matter; you are guaranteed the same posterior either way.
Hempel's paradox: The puzzle that observing a non-black non-raven (e.g., a white shoe) formally adds support to the hypothesis "all ravens are black", since that hypothesis is logically equivalent to "all non-black things are non-ravens".

### Segment 3: Monty Hall

Monty Hall Problem: This problem illustrates how probabilities change once we know more information. We are given 3 doors, one door having a brand new car behind it. We pick a door, and the host then opens one of the other doors that he knows does not have the car behind it. The question is: do we change our pick? Working it out (analytically or by simulation), we should always switch from our original choice: the original door still wins with probability 1/3, so after the host's information the switched door wins with probability 2/3.
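
A quick Monte Carlo sketch (my own simulation, not from the slides) confirming that switching wins about 2/3 of the time:

```python
import random

# Monte Carlo check of the Monty Hall result.
def play(switch, rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Host opens a door that is neither our pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
n = 20000
switch_wins = sum(play(True, rng) for _ in range(n)) / n   # ~2/3
stay_wins = sum(play(False, rng) for _ in range(n)) / n    # ~1/3
```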

### Segment 4: The Jailer's Tip

Marginalization: Whenever we are trying to find the conditional probability of an event but there are other factors that must be taken into consideration even though they are not part of the question, we marginalize over them: we integrate the conditional probability we want, multiplied by the distribution of the variable we are marginalizing over.
Uninteresting Parameters in a Model: Parameters that are not important to the probability that we are trying to find, but we must take them into consideration. We do this by marginalizing over the uninteresting parameters in our probability.
Probability Density Function (PDF): The distribution of possible values of a particular variable where the area underneath the curve (or sum of the probabilities) of the PDF is equal to 1.
Dirac Delta Function: Used when we want a massed prior on a particular value. Written $\delta(x - x_0)$, it has a sharp spike where $x - x_0 = 0$ and is 0 everywhere else, so integrating a function against it just evaluates that function at $x_0$; no actual integration is needed.
Massed Prior: A prior that puts a lump of probability on one particular value of a hypothesis; we write it with a Dirac delta function.
Uniform Prior: A prior where every allowed value of the parameter is equally probable, so the prior is a constant (e.g., 1 on the unit interval).
Uninformative Prior: A prior that is indifferent between hypotheses or models, used when we have no background information favoring one value over another.

### Segment 5: Bernoulli Trials

i.i.d.: Independent and Identically Distributed (applies to Random Variables)
Bernoulli trials: Trials that either have a failure or success outcome and they are i.i.d
Sufficient Statistic: A summary of the data, such as the number of successes and the number of trials, that contains all the information the data carries about the parameters we are trying to find.
Conjugate Prior: A prior probability that, combined with the likelihood, yields a posterior of the same analytical form.
Beta Distribution: A distribution with parameters alpha and beta, proportional to $x^{\alpha - 1}(1-x)^{\beta - 1}$ on [0, 1].
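
A numerical sketch of conjugacy (the prior values a, b and data n, k below are made up): updating a Beta prior on Bernoulli data on a grid should reproduce the closed-form Beta(a+k, b+n-k) posterior:

```python
from math import gamma

# Conjugacy check: a Beta(a, b) prior on the Bernoulli success probability x,
# updated on (k successes in n trials), stays a Beta -- namely Beta(a+k, b+n-k).
def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

a, b = 2.0, 2.0   # prior hyperparameters (made up for the sketch)
n, k = 10, 7      # the sufficient statistic of the Bernoulli trials

xs = [i / 100 for i in range(1, 100)]
post_unnorm = [beta_pdf(x, a, b) * x**k * (1 - x)**(n - k) for x in xs]
s1 = sum(post_unnorm)
post = [u / s1 for u in post_unnorm]        # grid-normalized posterior

closed_unnorm = [beta_pdf(x, a + k, b + n - k) for x in xs]
s2 = sum(closed_unnorm)
closed = [c / s2 for c in closed_unnorm]    # closed-form answer, same grid

max_err = max(abs(p - c) for p, c in zip(post, closed))
```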

### Segment 6: The Towne Family Tree

Variable Length Short Tandem Repeat (VLSTR): Where we see a short string of characters repeat several times in a row in DNA.
Binomial Distribution: The distribution of the number of successes in N Bernoulli trials (i.i.d.), with parameters N (total number of trials), k (number of successful trials), and p (probability of success on each trial).
Conditional Independence: Events b and c are conditionally independent given a if and only if P(bc|a) = P(b|a)P(c|a), i.e., the conditional probability of the joint event factors into the product of the individual conditional probabilities.
Naive Bayes Models: Models that treat multiple pieces of evidence as conditionally independent given the hypothesis, so their conditional probabilities are simply multiplied together, even when that independence is only an approximation.
Improper Prior: A prior, such as a constant on $(0, \infty)$, whose integral diverges, so it is not a proper probability distribution; it can still be usable if the resulting posterior is proper.
Log-Uniform Prior: An equal probability for each order of magnitude, where we use an improper prior $\frac{1}{r}$ as our prior probability.
Statistical Model: A representation of the data.
Data Trimming: Removing data in the hope of improving results.

### Segment 7: Central Tendency and Moments

Measures of Central Tendency: This is a measure of where most of the distribution falls. This can be measured by either the mean or the median.
Mean Minimizes Mean Square Deviation: By minimizing $\sum_i (x_i-a)^2$ over a, setting the derivative to 0 gives a = the mean.
Median Minimizes Mean Absolute Deviation: This is due to the fact that we are comparing each data point to a center. The center is at the median of the data where we have equal number of data points (approximately) on both sides of it.
Centered Moments: Moments that evaluate the expected value of $(x - \langle x\rangle)^i$, where i is the order of the moment.
Skewness and Kurtosis: The third centered moment leads to skewness, which measures asymmetry: the data piles up on one side and trickles off significantly toward the other end. The fourth centered moment gives kurtosis, which measures how sharply peaked (and how heavy-tailed) the distribution is.
Standard Deviation: A measure of how far the data typically falls from the mean; the square root of the variance.
Additivity of Mean and Variance: If random variables are independent from one another, their means and variances can be added.
Semi-invariants: Polynomial combinations of the moments (the cumulants); the semi-invariants of a sum of independent random variables simply add.
Semi-invariants of Gaussian and Poisson: For a Gaussian, all semi-invariants above the second are zero; for a Poisson, all semi-invariants are equal to its mean.
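
A quick numeric check of the Poisson claim (my own sketch; lambda is arbitrary): the first three cumulants, which here equal the mean, the variance, and the third centered moment, should all come out to lambda:

```python
from math import exp, factorial

# Numeric check of the Poisson semi-invariants: mean, variance, and third
# centered moment (= third cumulant) all equal lambda.
lam = 3.0
pmf = [exp(-lam) * lam**k / factorial(k) for k in range(60)]  # tail is negligible

mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean)**2 * p for k, p in enumerate(pmf))
third = sum((k - mean)**3 * p for k, p in enumerate(pmf))
```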

### Segment 8: Some Standard Distributions

Normal (Gaussian) distribution: A distribution that is composed of a mean and a standard deviation, fast falling tails
Student distribution: Has parameters mean, standard deviation, and nu (degrees of freedom). When nu = 1 it is the Cauchy Distribution; as nu goes to infinity the distribution approaches a Gaussian with the corresponding mean and standard deviation. Power-law falling tails.
Cauchy distribution: Does not have a first moment (or any higher moments); symmetric, with slow falling tails. Its parameters are a location $\mu$ and a scale $\sigma$ (not a true mean and standard deviation, since those don't exist).
Heavy-tailed distributions: Converge to zero more slowly in the tails, so fewer moments exist.
William Sealy Gosset: Discovered Student Distribution
Exponential distribution: Has parameter Lambda
Lognormal distribution: Has parameters mu and sigma; it is a normal distribution in log(x), so its formula replaces x by log(x) and carries an extra 1/x factor out front.
Gamma distribution: Takes parameters alpha and beta, and is similar in form to a Beta Distribution except the exponents are applied to x and $e^{-x}$ rather than x and (1-x). Formula: $\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha - 1}e^{-\beta x}$, where the Gamma Function satisfies $\Gamma(\alpha) = (\alpha - 1)!$ for integer alpha.
Chi-square distribution: The distribution of the sum of the squares of unit-normal random variables, with parameter nu, the number of random variables we are summing over.
Probability Density Function (PDF): Represents the probability for x in the distribution. The integral or sum of the PDF should be 1
Cumulative Distribution Function (CDF): Represents the probability that x is less than some particular value. The CDF's y-values range between 0 and 1, and it levels off at 1 as x goes to infinity.
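
A small sketch comparing tails (standard Cauchy vs. standard normal, using their closed-form survival functions):

```python
from math import atan, pi, erfc, sqrt

# Tail comparison: P(X > 5) for a standard Cauchy (heavy, power-law tail)
# versus a standard normal (fast-falling tail).
def cauchy_sf(x):
    return 0.5 - atan(x) / pi       # 1 - CDF of the standard Cauchy

def normal_sf(x):
    return 0.5 * erfc(x / sqrt(2))  # 1 - CDF of the standard normal

tail_cauchy = cauchy_sf(5.0)   # ~0.063: heavy tail, still lots of probability
tail_normal = normal_sf(5.0)   # ~3e-7: essentially nothing out this far
```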

### Segment 9: Characteristic Functions

Characteristic Function of a Distribution: The Fourier transform of the pdf, $\langle e^{itx}\rangle$; in this form some calculations become much easier, and it is what lets us understand the Central Limit Theorem. The sum of 2 independent random variables has a characteristic function that is the product of their individual characteristic functions.
Fourier convolution theorem: The pdf of the sum of 2 independent random variables is the convolution of their pdfs; equivalently, the characteristic function of the sum is the product of their individual characteristic functions.
Characteristic function of a Gaussian: $e^{i\mu t - \frac{1}{2}\sigma^2t^2}$
Characteristic function of Cauchy distribution: $e^{i\mu t - \sigma|t|}$
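
A Monte Carlo sanity check of the Gaussian formula (my own sketch; the values of mu, sigma, and t are arbitrary):

```python
import cmath
import random

# Estimate E[exp(i t X)] for X ~ N(mu, sigma^2) by simple Monte Carlo and
# compare to the closed form exp(i mu t - sigma^2 t^2 / 2).
rng = random.Random(0)
mu, sigma, t = 1.0, 2.0, 0.7
n = 200_000
est = sum(cmath.exp(1j * t * rng.gauss(mu, sigma)) for _ in range(n)) / n
exact = cmath.exp(1j * mu * t - 0.5 * sigma**2 * t**2)
```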

### Segment 10: The Central Limit Theorem

Central Limit Theorem (CLT): The sum of many independent random variables with finite variances will be approximately normal.
Taylor series around zero can fail: The expansion of the characteristic function around zero fails when the moments don't exist, e.g., for the Cauchy Distribution, which has no first moment.
Maximum a Posteriori (MAP): By Bayesians, the most likely value of our parameters
Maximum likelihood (MLE): By Frequentists, the most likely value of our parameters
Sample Mean and Variance: The mean and variances calculated from the sample (not the same as population mean and variance)
Estimate parameters of a Gaussian: Take the derivatives of the (log) likelihood of the statistical model with respect to each parameter, set them equal to zero, and solve for the parameters; for a Gaussian this gives the sample mean and variance.

### Segment 11: Random Deviates

Random Deviate: A random draw from a distribution
U(0,1): A uniform distribution where we can draw a value between 0 and 1 with equal probability
Transformation Method (random deviates): We have a pdf of some distribution and we want to draw numbers from that distribution.

• Step 1: We choose a U(0,1)
• Step 2: Put the number we just got into the CDF inverse of the distribution and we will get a random draw.
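
The steps above can be sketched for an exponential distribution, whose inverse CDF is $x = -\ln(1-u)/\lambda$:

```python
import random
from math import log

# Transformation method sketch: exponential deviates with rate lam.
# Step 1: draw u ~ U(0,1).  Step 2: push u through the inverse CDF.
rng = random.Random(1)
lam = 2.0
draws = [-log(1.0 - rng.random()) / lam for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)   # should be near 1/lam = 0.5
```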

Rejection Method (random deviates): We have a pdf whose CDF is too hard to find or invert.

• Step 1: Choose a comparison function that bounds our pdf everywhere and whose CDF we can easily compute and invert. Since it bounds the pdf, its total area will be some number slightly above 1; call that number A.
• Step 2: Draw a U(0, A), call it T
• Step 3: Put T into the inverse CDF of the comparison function and get an X value.
• Step 4: Find p(X) from our pdf, call it R.
• Step 5: Draw a uniform between 0 and the comparison function's value at X, call it S
• Step 6: If S is less than or equal to R, then X is our random draw. Otherwise go back to Step 2 and continue until we accept an X value under the pdf.
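
A minimal sketch of the steps above, with a made-up target pdf p(x) = 0.75(1 - x^2) on [-1, 1] and a constant comparison function:

```python
import random

# Rejection-method sketch.  Target: p(x) = 0.75 * (1 - x^2) on [-1, 1].
# Comparison function: the constant 0.75 on [-1, 1] (total area A = 1.5),
# whose "inverse CDF" is just a uniform draw on [-1, 1].
rng = random.Random(2)

def draw(rng):
    while True:
        x = rng.uniform(-1.0, 1.0)     # Steps 2-3: draw from the comparison fn
        s = rng.uniform(0.0, 0.75)     # Step 5: height under the comparison fn
        if s <= 0.75 * (1.0 - x * x):  # Step 6: accept if under the target pdf
            return x

draws = [draw(rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)                 # symmetric pdf: ~0
second = sum(x * x for x in draws) / len(draws)  # E[x^2] = 1/5 for this pdf
```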

Ratio of uniforms method (random deviates): Pick (u, v) uniformly from a box enclosing a teardrop-shaped region; if the point lands inside the teardrop we accept, otherwise we reject. The teardrop is the region $0 < u \leq \sqrt{p(v/u)}$ with u and v on the axes, and for an accepted point the deviate is x = v/u. The boundary curve where we switch between accepting and rejecting is the thin line that creates the teardrop shape.
Squeeze (random deviates): Bound the teardrop with 2 simple curves, outer and inner. If a point falls outside the outer curve, we reject; if it is inside the inner curve, we accept. Only in the squeeze area between them do we have to evaluate the expensive acceptance condition, as in the Rejection method.
Leva's algorithm for normal deviates: An efficient coded implementation of the squeeze technique for normal deviates; its conditional tests represent the area in between the two bounding curves.
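
A sketch of the ratio-of-uniforms method for a standard normal (the simplified acceptance condition below is a standard algebraic rearrangement, not something from the slides):

```python
import random
from math import log, sqrt, e

# Ratio-of-uniforms sketch for a standard normal.  With unnormalized
# p(x) = exp(-x^2/2), the teardrop condition u <= sqrt(p(v/u)) rearranges
# to v^2 <= -4 u^2 ln(u).  The enclosing box is u in (0,1], |v| <= sqrt(2/e).
rng = random.Random(3)
b = sqrt(2.0 / e)

def normal_deviate(rng):
    while True:
        u = rng.uniform(0.0, 1.0)
        v = rng.uniform(-b, b)
        if u > 0.0 and v * v <= -4.0 * u * u * log(u):
            return v / u            # the accepted deviate is the ratio v/u

draws = [normal_deviate(rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)                  # ~0
var = sum(x * x for x in draws) / len(draws)    # ~1
```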

### Segment 12: P-Value Tests

P-Value Test: We have a population mean and standard deviation as well as sample data. We take the mean of that sample and normalize it, $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, to get a Z score. Then, based on whether we are doing a 1-tailed or 2-tailed test, we compare the resulting probability to a chosen alpha.
Null Hypothesis: What we assume the population statistic to be; the default that the test tries to reject.
Test Statistic: The statistic that we compute from the sample data.
Advantage of Tail Tests over Bayesian methods: A tail test measures your statistic against the one particular hypothesis you are interested in, whereas Bayesian methods require enumerating all the possible hypotheses and tell you the probability of each given the data. Usually studies focus on one particular behavior, and Frequentists use these tail tests to say how significant the data is under the supposed parameters.
Distribution of P-Values under the Null Hypothesis: P-values are uniformly distributed on [0, 1] under the null hypothesis.
T-values: Number of standard deviations away from the mean
p-test critical region: The area we are concerned with as we compare it to alpha. If our probability is within our critical region, we reject the Null Hypothesis
One-Sided vs. Two-Sided P-Value Tests: It is the area we are concerned when we carry out the test. We should always use two-sided tests unless we are certain which direction the data should fall
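
The whole test can be sketched as code (the numbers below are made up):

```python
from math import erfc, sqrt

# Z-test sketch: sample mean x_bar of n observations, known population
# mu and sigma; two-tailed p-value from the normal survival function.
def z_test_two_tailed(x_bar, mu, sigma, n):
    z = (x_bar - mu) / (sigma / sqrt(n))
    p = erfc(abs(z) / sqrt(2))      # = 2 * P(Z > |z|) for a standard normal
    return z, p

z, p = z_test_two_tailed(x_bar=103.0, mu=100.0, sigma=10.0, n=50)
reject = p < 0.05                   # compare to alpha = 0.05
```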

### Segment 13: The Yeast Genome

Saccharomyces Cerevisiae: Baker's Yeast; the segment analyzed a strand of DNA from one of its chromosomes.
Multinomial Distribution: A distribution composed of N trials with k distinct outcomes (each of which can occur any number of times), where outcome k has probability $p_k$ and the probabilities of the k outcomes add up to 1.

### Segment 14: Bayesian Criticism of P-Values

Stopping Rule Paradox: The same data can give different p-values for the same hypothesis depending on the experimenter's stopping rule, i.e., on how they decided when to stop collecting data.
Bayes Odds Ratio: We compute the probability of each hypothesis given the data; the ratio between 2 of these posteriors says by what odds one hypothesis is favored over the other.
Normal approximation to binomial distribution: When we have a large number of trials for our Binomial Distribution, we can approximate it by a Normal Distribution with parameters N(Np, $\sqrt{Np(1-p)}$).
Ronald Aylmer Fisher: Championed the p-value test, where we ask how often data as extreme or more extreme than ours would occur under the null hypothesis, and popularized the threshold alpha = 0.05.
p=0.05 pros and cons: Con: the value is arbitrary (why that alpha and not another?), and results just above or below the threshold get treated completely differently. Pro: it does enforce a simple, widely accepted rule.

### Segment 15: The Towne Family -- Again

Posterior Predictive P-value: Using the data twice: once to find the posterior probability, and a second time to test the model against itself. If we get roughly the same answer regardless of whether we keep or leave out pieces of the data, we can conclude the statistic is a reasonable p-value.
Empirical Bayes: Using the data itself to set the prior distribution, as in $\int_0^\infty P(r)\,p(r|\mathrm{data})\,dr$, where r is our parameter.

### Segment 16: Multiple Hypotheses

Multiple Hypothesis Correction: Whenever we compare multiple hypotheses against the same data, we can use the Bonferroni Correction or FDR to reduce the significance level of any one test, since with many tests some will land in the critical region by chance.
Bonferroni correction: Use a per-test significance level of approximately alpha/N, so that the probability that 1 or more of the N tests falls in the critical region by chance stays about alpha.
False Discovery Rate (FDR): Sort the set of p-values from our N tests, then compare each to alpha times its index divided by the number of tests. Find the largest index whose p-value falls below its threshold; all tests up to that one are rejected (counted as discoveries), and the rest are accepted.
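
A sketch of the FDR (Benjamini-Hochberg) procedure as I understand it; the p-values below are made up:

```python
# Benjamini-Hochberg FDR sketch: sort the N p-values, find the largest rank k
# with p_(k) <= alpha * k / N, and reject (call significant) tests 1..k.
def benjamini_hochberg(pvalues, alpha=0.05):
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / n:
            k_max = rank
    significant = [False] * n
    for i in order[:k_max]:
        significant[i] = True
    return significant

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]   # toy p-values
flags = benjamini_hochberg(pvals, alpha=0.05)
```
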
Bayesian approach to multiple hypotheses: Given that the hypotheses are EME, we can simply calculate P(data | $H_i$) for each and normalize, so no corrections are needed.

### Segment 17: The Multivariate Normal Distribution

Multivariate Normal Distribution: A distribution that is composed of a vector $\mu$ and a Covariance Matrix $\Sigma$, where the number of components in $\mu$ are the number of normal distributions we have.
Covariance Matrix: Represents the Variances of different distributions in relation to each other. The diagonals are the Variances of each distribution. If each Gaussian is independent from one another, then their Covariances are 0.
Estimate mean, Covariance from multivariate data: To estimate the mean vector, we add up all the data vectors in our sample and divide by the number of data points. For the covariance matrix, we average the outer products of the deviations from the mean, $(x - \bar{x})(x - \bar{x})^T$, over the data points; the diagonal entries are the usual variances. (If we believed more than one distribution were at work, we would separate the data into groups and estimate a mean and covariance for each.) In practice we would use a function in our favorite programming language.
Fitting data by a multivariate normal distribution: Fit by estimating the mean vector and covariance matrix from the data as described above (this is the maximum likelihood fit).
Slice or Projection of a multivariate normal r.v.: A projection is a marginalization over the variables we are projecting away; a slice is conditioning on fixed values of the variables we slice through. Both give another (lower-dimensional) multivariate normal.
Cholesky decomposition: A kind of matrix square root: it factors the covariance matrix as $\Sigma = LL^T$ with L lower triangular, which makes proofs and computations with multivariate normals easier. It can only be done when the matrix is positive definite.
How to generate multivariate normal deviates: Assign a mean vector of length M and a covariance Matrix M x M. In your favorite programming language we can call of function that can derive deviates with our specified mean vector and covariance matrix.
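
A sketch of generating the deviates "by hand" via the Cholesky factor (numpy assumed; the mu and sigma values are made up), then checking the sample mean and covariance:

```python
import numpy as np

# Multivariate normal deviates: x = mu + L z, where Sigma = L L^T is the
# Cholesky factorization and z is a vector of i.i.d. standard normals.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

L = np.linalg.cholesky(sigma)
z = rng.standard_normal((100_000, 2))
x = mu + z @ L.T

est_mean = x.mean(axis=0)                 # should be close to mu
est_cov = np.cov(x, rowvar=False)         # should be close to sigma
```
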
How to compute and draw error ellipses: Conceptually, we find the center of the data and the correlation of the data. Then we draw an ellipse with axes given by the standard deviations of x and y, slanted according to the correlation (the line best fitting the data). This is used to draw confidence regions.

### Segment 18: The Correlation Matrix

Covariance Matrix: Holds the Covariances of the distributions in relation to other distributions. Diagonals should be the variances of each distribution. If they are independent of one another their covariance should be 0.
Linear Correlation Matrix: Entries $r_{ij} = \mathrm{Cov}(X_i, X_j)/\sqrt{\mathrm{Var}[X_i]\,\mathrm{Var}[X_j]}$. Diagonals should be 1.
Test for Correlation: If the absolute value of a correlation is near 1, the correlation is strong; if it is near 0, it is weak.

### Segment 19: The Chi-Square Statistic

Chi-square Statistic: The sum of the squares of unit-normal random variables (squared t-values).
Chi-square distribution: Has a mean equal to its number of degrees of freedom N and a variance of 2N.
Transformation Law of Probabilities: Given some variable X with a known distribution (e.g. uniform, normal, etc.), apply some function to it to get Y. We can find the density of y by p(y) = $\left|\frac{dx}{dy}\right|$p(x), where we take the derivative of x with respect to y of our transformation.
Characteristic function of chi-square distribution: $(1-2it)^{-\nu/2}$. This is derived by multiplying together the characteristic functions of the individual squared t-values, each of which has $\nu = 1$.
Generalization of chi-square to non-independent data: It is still chi-square distributed: apply a Cholesky decomposition $\Sigma = LL^T$ and let Ly = x - $\mu$; then y is unit-normal and $y^Ty = (x-\mu)^T\Sigma^{-1}(x-\mu)$ is chi-square.
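
A quick numeric check of that last point (numpy assumed; the numbers are made up): whitening with the Cholesky factor gives the same chi-square as the direct quadratic form:

```python
import numpy as np

# With Sigma = L L^T, solving L y = x - mu "whitens" the data, and
# y . y equals the quadratic form (x - mu)^T Sigma^{-1} (x - mu).
mu = np.array([0.0, 1.0, -1.0])
sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
x = np.array([0.5, 1.4, -0.2])

L = np.linalg.cholesky(sigma)
y = np.linalg.solve(L, x - mu)
chi2_whitened = float(y @ y)
chi2_direct = float((x - mu) @ np.linalg.solve(sigma, x - mu))
```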

### Segment 20: Nonlinear Least Squares Fitting

Normal error model: We have a mean of 0 and some standard deviation that represents the error we may have in our data
Correlated Normal error model: We have a mean of 0, but the errors are related to the different uni-variables in some way, so we use a covariance matrix instead.
Maximum Likelihood Estimation of Parameters: We want the parameters given the data. We build a model that calculates the measured quantity analytically, compare the model values to the data with a chi-squared statistic, and then minimize the chi-square (equivalently, maximize the likelihood) over the parameters to find the best fit.
Relation of chi-square to posterior probability: The posterior is P(b|data); with a normal error model (and uniform prior) it is proportional to $e^{-\chi^2(b)/2}$, so minimizing chi-square maximizes the posterior.
Nonlinear Least Squares Fitting: Fitting the parameter vector b of a (generally nonlinear) model to the data by minimizing the sum of squared, normalized residuals.
Chi-square Fitting: Same as Nonlinear Least Squares Fitting
Accuracy of Fitted Parameters: Applying p-value tests can help us determine the accuracy of fitted parameters, or getting a chi-squared value that is close to the mean of the chi-squared distribution.
Hessian Matrix and Relation to Covariance Matrix: Inverses of each other (up to a factor): the covariance matrix of the fitted parameters is the inverse of the Hessian of $\frac{1}{2}\chi^2$ at the minimum.
Posterior Distribution of Fitted Parameters: Approximately a multivariate normal around the best-fit parameters, with density proportional to $e^{-\Delta\chi^2/2}$.
Calculation of Hessian matrix: The matrix of second derivatives of $\frac{1}{2}\chi^2$ with respect to the parameters; its inverse is the covariance matrix.

### Segment 21: Marginalize vs. Condition Uninteresting Fitted Parameters

How to marginalize over uninteresting parameters: You sum/integrate the posterior over the uninteresting parameters.
How to condition on known parameter values: I think you fix those parameters at their known values and only look at how the remaining unknown parameters vary given those fixed values (check this)
Covariance matrix of fitted parameters vs. of data: Shows how the variables relate or vary from one another based on the data
Consistency (property of MLE): Converges to the true values of the parameters
Asymptotic Efficiency (property of MLE): In the limit of lots of data, the MLE attains the smallest possible variance; it also has consistency, equivariance (the estimate of a function of a parameter equals the function of the estimate of the parameter), and asymptotic normality.
Fisher Information Matrix: Another name for the expected Hessian of the negative log-likelihood, i.e., the second derivatives of the log of the model with respect to the parameters.
Asymptotic Normality (property of MLE): As the amount of data grows, the distribution of the MLE approaches a normal centered on the true parameter values.
How to get confidence intervals from chi-square values: By finding contours of the model by creating error ellipses.

### Segment 22: Uncertainty of Derived Parameters

Linearized Propagation of Errors: We take a function f(b), where b represents the true parameters of a model. We know that:
b1 = b - b0, where b1 is the error we can get (a multivariate normal r.v. with mean 0 and covariance $\Sigma_b$), b0 is our estimate, and b is the true parameter vector (we do not know this).
Remember what Linearization is from Calculus? It is simply L(x) = f(a) + f'(a)(x-a), where a is our known point. The same thing applies here: f(b) $\approx$ f(b0) + $\nabla$f(b0)$\cdot$b1 (b0 is our known, due to the fact that we chose those estimates).
The expected value of f(b) is simply f(b0). This is because:
<f> = <f(b0) + $\nabla$f$\cdot$b1> = <f(b0)> + <$\nabla$f$\cdot$b1>. We know f(b0) is a constant, so its expected value is simply f(b0); and since b1 is a multivariate normal with mean 0, the term $\nabla$f$\cdot$<b1> vanishes.
The variance is calculated in a similar way, as <f^2> - <f>^2:
f^2 = f(b0)^2 + 2f(b0)$\nabla$f$\cdot$b1 + ($\nabla$f$\cdot$b1)^2
<f^2> - <f>^2 = f(b0)^2 + 2f(b0)$\nabla$f$\cdot$<b1> + <($\nabla$f$\cdot$b1)^2> - f(b0)^2
The f(b0)^2 terms cancel out since they are the same term, and the middle term vanishes because <b1> = 0. Thus we are left with <($\nabla$f$\cdot$b1)^2>.
This remaining term equals $\nabla$f <b1 b1^T> $\nabla f^T$ = $\nabla f\,\Sigma_b\,\nabla f^T$.
Sampling the posterior distribution (in least squares fitting): We draw a large number of parameter vectors b from the (multivariate normal) posterior, apply the function f to each draw, and plot the results as a histogram. This gives the uncertainty of the derived parameter without linearizing.
Ratio of Two Normals as Example of something: An example where linearized propagation can fail; the ratio becomes a Cauchy Distribution, which has no mean or variance.
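
A sketch comparing the two approaches on a made-up smooth function f(b) = b0^2 + 3 b1 (numpy assumed): the linearized variance $\nabla f\,\Sigma_b\,\nabla f^T$ against the variance from sampling the posterior:

```python
import numpy as np

# Linearized propagation of errors vs. sampling the posterior, for
# f(b) = b0^2 + 3 b1 with b ~ N(b_hat, Sigma_b).
rng = np.random.default_rng(1)
b_hat = np.array([2.0, -1.0])
sigma_b = np.array([[0.04, 0.01],
                    [0.01, 0.09]])

grad = np.array([2.0 * b_hat[0], 3.0])     # gradient of f at b_hat
var_lin = float(grad @ sigma_b @ grad)     # grad f . Sigma_b . grad f

samples = rng.multivariate_normal(b_hat, sigma_b, size=200_000)
f_vals = samples[:, 0]**2 + 3.0 * samples[:, 1]
var_mc = float(f_vals.var())               # variance from the histogram
```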

### Segment 23: Bootstrap Estimation of Uncertainty

Bootstrap Resampling: We have one sample drawn from the population; we then repeatedly draw, with replacement, new data sets of the same size from that sample, treating the sample as a stand-in for the population.
population distribution vs. sample distribution: The sample distribution usually has smaller counts and is sparser; it takes on the shape of the population distribution as we increase the number of draws from it.
Drawing With and Without Replacement: If we were to draw a full-size resample without replacement, we would get the same data set over and over again. By drawing with replacement, we can get a wide variety of possible data sets from the same sample.
Bootstrap Theorem: The distribution of any resampled quantity around its full-data-set value estimates the distribution of the data set value around the population value
Compare bootstrap with sampling the posterior: The bootstrap distribution is less smooth than sampling the posterior, since it only reuses the observed data values.
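
A minimal bootstrap sketch (the data set is made up): the spread of the resampled means should come out near the textbook $\sigma/\sqrt{n}$:

```python
import random

# Bootstrap sketch: resample the data with replacement many times and use
# the spread of the resampled means as the uncertainty of the sample mean.
rng = random.Random(4)
data = [rng.gauss(10.0, 2.0) for _ in range(200)]   # stand-in "sample"

def bootstrap_std_of_mean(data, n_boot, rng):
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]  # same size, with replacement
        means.append(sum(resample) / len(resample))
    mu = sum(means) / len(means)
    return (sum((m - mu)**2 for m in means) / len(means)) ** 0.5

boot_se = bootstrap_std_of_mean(data, 1000, rng)
theory_se = 2.0 / 200**0.5          # sigma / sqrt(n), about 0.14
```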

### Segment 24: Goodness of Fit

Precision improves as square root of data quantity: The uncertainties of the fitted parameters shrink like $1/\sqrt{N}$ as the amount of data N grows.
What Chi-square Value indicates a Good Fit?: A chi-square value close to the number of degrees of freedom.
Degrees of Freedom in chi-square fit: The number of data points minus the number of fitted parameters.
goodness-of-fit p-value (in least squares fitting): The probability, under the chi-square distribution with the right degrees of freedom, of a chi-square value as large or larger than the one we got; a tiny value signals a bad fit.
linear constraints (chi-square): Coefficients placed on the Random Variables
nonlinear constraints (chi-square): The t-values in the chi-square summation

### Segment 27: Mixture Models

Forward Statistical Model: A model that calculates the probability of the data given the parameters; here it has a mean vector and a covariance matrix for each component, as well as the proportions of data coming from the different models.
Mixture Model: A model that has data that is drawn from different distributions.
Assignment Vector (mixture model): A vector that tells us, by index, which distribution each piece of data came from.
Marginalization in mixture models: We can marginalize over 2 types of parameters.
If we marginalize over s (the assignments telling which component each point came from), each point's probability becomes the sum of the component distributions weighted by their fractions; we then multiply by the prior probability of choosing those fractions.
Hierarchical Bayesian Models: The fraction parameters of the model come not from a uniform distribution but from other distributions with their own parameters, and you also have to marginalize over those nuisance parameters.

### Segment 28: Gaussian Mixture Models in 1-D

Gaussian mixture model: A model in which the data is drawn from several different Gaussians, each with its own parameters and mixing fraction.
Expectation-maximization (EM) methods: A method used to figure out where the data came from by finding the expectation (where we expect each data point came from) and then maximizing the parameters to give us the best result. We iterate over these two steps until we converge to an answer.
Probabilistic Assignment to Components (GMMs): Different probabilities assigned by the fraction of data coming from a particular model. Found in the E-step
Expectation or E-step: We know the data and the current model parameters, but not the assignments. We find the probabilistic assignments by evaluating, for each point, each Gaussian's probability of having produced it, normalized over the components.
Maximization or M-step: We know the (probabilistic) assignments of the data, but not the model. Using the E-step's output, an I x N array where I is the number of points and N is the number of components we are fitting, we re-estimate each component's mean vector, covariance matrix, and the fraction of the data that comes from it.
Overall Likelihood of a GMM: We maximize the likelihood of the data under the mixture by iterating the E and M steps. Once the parameters stop changing much, we have found (locally) the parameters that best represent the data.
log-sum-exp formula: A way of taking the log of a sum of exponentials (as in the E-step) that factors out the largest term first, to help prevent under/overflow during the operations.
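
A sketch of the formula (my own implementation):

```python
from math import log, exp, inf

# log(sum_i exp(a_i)) computed stably by factoring out the largest term:
# log-sum-exp(a) = max(a) + log(sum_i exp(a_i - max(a))).
def log_sum_exp(a):
    m = max(a)
    if m == -inf:
        return -inf
    return m + log(sum(exp(x - m) for x in a))

small = log_sum_exp([log(0.2), log(0.5), log(0.3)])  # log(1.0) = 0
big = log_sum_exp([1000.0, 1000.0])                  # naive exp() would overflow
```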

### Segment 29: GMMs in N-Dimensions

Starting values for GMM iteration: The iteration is very sensitive to the starting values we choose, so it is best to pick starting values that are somewhat representative of what the components should be.
Number of Components in a GMM (pros and cons): It is hard to know how many components a GMM should have, since we never know in advance how many are really needed. Depending on where we put the starting values and how many components we guess, the results may not be as accurate as we want. Adding more components requires more computation (and risks overfitting), but too few components won't fit the data well either. The E-M step is a good way to fit models, but what if the data are not Gaussian? Perhaps the fit would be better if we didn't use Gaussians. Thus many factors are involved in trying to find the best fit.
K-means clustering: Group the data by the general areas where they come together. My clustering algorithm: choose as many starting points as I want (randomly, of course), assign each point to the nearest starting point, move each starting point to the mean of its assigned points, and repeat. Based on what starting points I gave it, it would create either good or bad clustering.
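A minimal k-means sketch in Python (my own code; the two-blob toy data and the seeds are just for illustration):

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal k-means: alternate hard assignment and center update.
    Results depend strongly on the randomly chosen starting centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# two well-separated blobs: k-means recovers centers near 0 and near 5
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centers, labels = kmeans(pts, 2)
```

With overlapping blobs or unlucky starting centers the same code can settle into a bad clustering, which is exactly the sensitivity to starting values mentioned above.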

### Segment 30: Expectation Maximization (EM) Methods

Jensen's inequality: When a function is concave (downward), function(interpolation) >= interpolation(function); that is, $f(E[x]) \geq E[f(x)]$, where the interpolation is any weighted average of points.
Concave Function (EM methods): The log-likelihood we are trying to maximize is hard to work with directly, so we find a simpler concave function that bounds it from below everywhere and touches it at the current parameter estimate. Maximizing this bounding function gives the next estimate of the parameters for our data.
EM theorem (e.g., geometrical interpretation): We have 3 variables: X = data, Z = missing data/nuisance variables, $\Theta$ = the parameters we are trying to maximize. The log-likelihood L($\theta$) = ln P(x|$\theta$) can be found by marginalizing over the nuisance variables; EM maximizes it by repeatedly constructing and maximizing a concave lower bound.
Missing data (EM methods): The missing data are like the assignment variables in our GMM method: they describe which model each data point came from, which tells us the fraction of data coming from a particular model. In EM methods generally, they are simply the nuisance variables describing where the data came from.
GMM as an EM: what is the missing data, what are the parameters? : z (missing) is the assignment of data points to components (our v vector), and $\theta$, the parameters, are the mean vectors, covariance matrices, and component fractions.
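Here's a tiny 1-D, two-component GMM fit by E-M (my own sketch; real code would also track the log-likelihood to test convergence instead of using a fixed iteration count):

```python
import numpy as np

def em_gmm_1d(x, n_iter=200):
    """E-M for a two-component 1-D Gaussian mixture.
    The 'missing data' z are the soft assignments (responsibilities)
    computed in the E-step; theta = (means, sigmas, fractions)."""
    mu = np.array([x.min(), x.max()])       # crude starting values
    sigma = np.array([x.std(), x.std()])
    frac = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: r[i, k] = P(component k | x_i)
        d = (x[:, None] - mu[None, :]) / sigma[None, :]
        p = frac * np.exp(-0.5 * d**2) / (sigma * np.sqrt(2 * np.pi))
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu)**2).sum(axis=0) / nk)
        frac = nk / len(x)
    return mu, sigma, frac

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(8, 1, 300)])
mu, sigma, frac = em_gmm_1d(x)   # means come out near 0 and 8
```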

### Segment 31: A Tale of Model Selection

Use of Student Distributions vs. Normal Distribution: Student distributions are used for heavy-tailed data whose tails do not fall off very quickly; Normals are used when the tails fall off fast
Heavy-tailed Models in MLE: Fitting models with Student distributions rather than Gaussians when the tails of the data do not drop off fast enough for a Gaussian
Model Selection: Picking models based on what the data look like.
Akaike information criterion (AIC): Used to decide whether we should fit with more parameters: add a parameter only if it improves the log-likelihood by more than 1.
Bayes information criterion (BIC): Same idea with a stiffer penalty: add a parameter only if it improves the log-likelihood by more than $\frac{1}{2}$ln(N), where N is the number of data points.
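As a sketch of the bookkeeping (my own convention: scores on the -2 ln L, i.e. chi-squared, scale, where lower is better):

```python
from math import log

def aic(neg2lnL, k):
    """AIC: an extra parameter must improve -2 ln L by more than 2
    (i.e., ln L by more than 1) to lower the score."""
    return neg2lnL + 2 * k

def bic(neg2lnL, k, N):
    """BIC: the per-parameter penalty is ln N on this scale
    (1/2 ln N on the ln L scale), so for N bigger than about 7
    it punishes extra parameters harder than AIC."""
    return neg2lnL + k * log(N)
```

For example, a 3-parameter fit with -2 ln L = 10 beats a 4-parameter fit with -2 ln L = 9 under both criteria at N = 100, because the extra parameter didn't buy enough likelihood.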

### Segment 32: Contingency Tables, A First Look

Contingency Table: A table that holds counts for combinations of categorical factors
Cross-Tabulation: Same thing as a contingency table
Row or Column Marginals: The totals obtained by summing across a row or down a column. Row marginals fixed means the row totals are set by the protocol; column marginals fixed means the column totals are set.
Chi-square or Pearson statistic for contingency table: Sum (observed - expected)^2/expected over the cells to find the chi-squared statistic, where the expected counts come from the marginals assuming independence. This is the version of chi-squared for data that are counts.
Conditions vs. Factors: The two categorical variables of the table, each possibly with several sub-categories. We use Pearson's chi-squared test to see if the conditions and factors are related somehow.
Hypergeometric Distribution: hyper(r; N, m, n) = $\frac{\binom{m}{r}\binom{N-m}{n-r}}{\binom{N}{n}}$ — the probability of r successes in n draws without replacement from a population of N containing m successes. We normally only know the counts, not the probabilities, in this case.
Multinomial Distribution: The probability of seeing exactly this set of counts (drawing with replacement) when the categories have particular probabilities of occurring: $\frac{N!}{n_1!\,n_2!\cdots n_k!}\,p_1^{n_1}p_2^{n_2}\cdots p_k^{n_k}$, where $\sum_{i}^{k} n_i = N$ and $\sum_{i}^{k} p_i = 1$
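The statistics and distributions above, sketched in plain Python (my own code, using only `math.comb`/`math.factorial`):

```python
from math import comb, factorial

def pearson_chi2(table):
    """Pearson chi-square for a table of counts; expected counts
    assume independence: E_ij = (row i total)(col j total) / N."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    N = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / N) ** 2
               / (rows[i] * cols[j] / N)
               for i in range(len(rows)) for j in range(len(cols)))

def hypergeom_pmf(r, N, m, n):
    """P(r successes in n draws without replacement from a
    population of N containing m successes)."""
    return comb(m, r) * comb(N - m, n - r) / comb(N, n)

def multinomial_pmf(counts, probs):
    """P of seeing exactly these counts in sum(counts) draws
    with replacement, category i having probability probs[i]."""
    coef = factorial(sum(counts))
    for n in counts:
        coef //= factorial(n)
    p = 1.0
    for n, q in zip(counts, probs):
        p *= q ** n
    return coef * p
```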

### Segment 33: Contingency Table Protocols and Fisher Exact Test

Retrospective analysis or case/control study: The condition totals (rows) are fixed — we choose how many cases and how many controls to study — but we do not know the factor counts, so we marginalize over the rows.
Prospective Experiment or Longitudinal Study: We fix the factor totals for the study, but we do not know how many people will get the condition. This is done by marginalizing over the columns.
Nuisance Parameter: The parameter we are marginalizing over.
Cross-sectional or Snapshot study: Only the total count is fixed; the probability of seeing the table is a multinomial distribution
Example of protocol with all marginals fixed: Very rare, but it happens when we know both sets of totals exactly, such as counting U.S. senators by party (Democrat/Republican) and by sex (female/male): the number of senators is always a set number.
Fisher's Exact Test: Take the observed contingency table and loop over all possible contingency tables with the same marginals, finding each table's probability from the hypergeometric distribution. The p-value is the sum of the probabilities of the tables at least as extreme as the one you are concerned with.
Sufficient Statistic (re contingency tables): when no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter. (check this)
Wald statistic (re contingency tables): $t = \frac{p_1-p_2}{\sqrt{p(1-p)(M^{-1} + N^{-1})}}\text{ where } p_1 = \frac{m}{M},\ p_2 = \frac{n}{N},\ p = \frac{m+n}{M+N}$. Used when the marginals are fixed
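A one-sided Fisher exact test for a 2x2 table by direct enumeration (my own sketch; `scipy.stats.fisher_exact` does this for real):

```python
from math import comb

def fisher_exact_onesided(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].
    With all marginals fixed, the top-left count is hypergeometric;
    the p-value sums the probabilities of the observed table and every
    table at least as extreme (top-left >= a) with the same marginals."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += comb(row1, x) * comb(row2, col1 - x) / total
    return p
```

For the classic tea-tasting table [[3, 1], [1, 3]] this gives (16 + 1)/70 ≈ 0.243, so getting 3 of 4 right is not significant.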

### Segment 34: Permutation Tests

Permutation Test (re contingency tables): Like Fisher's Exact Test, we keep our marginals the same and consider all permutations of the data consistent with those marginals. The problem this runs into is that the number of permutations gets really big, so enumerating them all is impractical; instead we draw a large number of random permutations, trusting that enough independent draws will give a sample distribution that represents the full permutation distribution.
Monte Carlo calculation: Rather than enumerating every permutation, draw random permutations, compute the statistic (e.g., the Wald statistic) on each, and take the fraction at least as extreme as the observed value as the p-value.
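A Monte Carlo permutation test for a difference of means (my own sketch; any statistic, e.g. the Wald statistic, could replace the mean difference):

```python
import numpy as np

def perm_test_means(x, y, n_perm=10000, seed=0):
    """Instead of enumerating every permutation (combinatorially huge),
    draw n_perm random relabelings and report the fraction whose
    statistic is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = abs(np.mean(x) - np.mean(y))
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)           # a random relabeling of the pooled data
        stat = abs(pooled[:len(x)].mean() - pooled[len(x):].mean())
        if stat >= observed:
            count += 1
    return count / n_perm
```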

### Segment 39: MCMC and Gibbs Sampling

Bayes denominator (re MCMC): The multidimensional normalizing integral that makes the posterior hard to compute directly. MCMC (including the Gibbs sampler) avoids computing it, since only ratios of the posterior, or one-dimensional conditionals along a single parameter, are ever needed.
Sampling the posterior distribution (re MCMC): In the Gibbs sampler, we sample one coordinate at a time:

• Cycle through the different coordinate directions of the point, one at a time
• Sample along that direction's one-dimensional conditional distribution, fixing the other components of the vector point
• Fix the new coordinate we get and rinse and repeat.

Markov chain: A sequence of points where each new point is drawn from a chosen distribution that depends only on the previous point
Detailed balance: The condition on P(x2 | x1) that guarantees the sequence will be ergodic, i.e., that we will visit all the points in the probability distribution in proportion to their probability: $\pi(x_2)p(x_1|x_2) = \pi(x_1)p(x_2|x_1)$
Ergodic Sequence: Over time, the chain fills in all values of x in proportion to the probability distribution of x
Metropolis-Hastings algorithm: First we pick a proposal distribution q(x2 | x1). It can be almost anything we want (a multivariate distribution centered around x1 can work!)

• Draw a candidate value x2c from the proposal distribution
• Compute an acceptance probability $\alpha(x_1, x_{2c}) = \min\!\left(1, \frac{\pi(x_{2c})\,q(x_1|x_{2c})}{\pi(x_1)\,q(x_{2c}|x_1)}\right)$, the ratio coming from detailed balance
• Accept x2c with that probability, so that p(x2 | x1) = q(x2 | x1)$\alpha$(x1, x2)
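The steps above, sketched for a symmetric Gaussian proposal (so the q ratio cancels and the algorithm reduces to plain Metropolis; my own code):

```python
import numpy as np

def metropolis(log_pi, x0, step=1.0, n_samples=20000, seed=0):
    """Metropolis sampler: Gaussian proposal centered on the current
    point; log_pi is the log of the (unnormalized) target density."""
    rng = np.random.default_rng(seed)
    x = x0
    chain = np.empty(n_samples)
    for i in range(n_samples):
        x_cand = x + rng.normal(0.0, step)       # draw a candidate x2
        log_alpha = log_pi(x_cand) - log_pi(x)   # log of pi(x2)/pi(x1)
        if np.log(rng.random()) < log_alpha:     # accept with prob min(1, ratio)
            x = x_cand
        chain[i] = x                             # a rejection repeats the old x
    return chain

# sample a standard normal starting far away, then discard the burn-in
chain = metropolis(lambda x: -0.5 * x**2, x0=5.0)
post = chain[2000:]
```

Working with log densities avoids under/overflow, and only the ratio of target densities appears, so the Bayes denominator never has to be computed.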

Proposal distribution (re MCMC): The distribution q that generates candidate steps of the Markov chain; multivariate normal distributions work well. If chosen right, the algorithm will always accept proposals that increase the target probability, and sometimes accept proposals that do not
Gibbs Sampler: Special case of the Metropolis-Hastings algorithm: we sample each parameter in turn from its conditional distribution with all the other parameters held fixed, and such proposals are always accepted. Can be expensive if the conditional distributions are hard to sample
Burn-in: Once you start your Markov chain, the early points may be completely off (the chain is still finding the high-probability region)! You ignore them; those discarded points are your burn-in points

### Segments 40 and 41: MCMC Examples

Waiting time in a Poisson process: The waiting time to the kth event is Gamma distributed with rate lambda; the individual inter-event wait times are exponentially distributed.
Good vs. Bad Proposal Generators in MCMC: When defining the steps for our proposal distribution:
Bad Way: Make our steps either 0 or +/- 1 at random. The step size can be too big, causing many rejections and results that are way off.
Good Way: Make the step size a relatively small number between 0 and 1, and keep it constant throughout the algorithm. Each proposal then changes the clumping statistics only by a small amount.

### Segment 47: Low Rank Approximation of Data

Data Matrix or Design Matrix: A matrix of data with many parts to it, where the rows are the data points and the columns are the quantities describing each data point.
Singular Value Decomposition (SVD): Decomposes the matrix as X = U S V^T: U is column-orthonormal (each column dotted with itself is 1 and with every other column is 0), S is a diagonal matrix of singular values (non-negative scalars), and V^T is another orthonormal basis.
Orthogonal Matrix: A matrix whose columns are orthonormal: each column dotted with itself = 1 and dotted with any other column = 0
Optimal Decomposition into rank 1 matrices: Decompose the matrix as $\sum_{i=1}^M s_i\,\mathbf{u}_i\mathbf{v}_i^T$, where each $\mathbf{u}_i\mathbf{v}_i^T$ is a rank-1 matrix; keeping only the terms with the largest $s_i$ gives the best low-rank approximation.
Singular Values: The diagonal entries of S; they are the square roots of the eigenvalues of $X^TX$
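Sketch with numpy (my own code): decompose a matrix, check orthonormality, and rebuild it from rank-1 pieces.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                      # 6 data points, 4 columns

# X = U @ diag(s) @ Vt; the singular values s come back sorted largest-first
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# rebuild X as a sum of rank-1 matrices s_i * u_i v_i^T
X_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))

# keeping only the largest singular values gives the best low-rank approximation
X_rank2 = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(2))
```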

### Segment 48: Principal Component Analysis

Principal Component Analysis (PCA): The principal components are new axes along which the data are best represented; each row's components tell us where that data point lies along those axes. We decompose our (centered) data into U, S, and V; multiplying (XV)^T by XV gives the diagonalized S^2 matrix. Then we plot the values in S^2 and try to figure out which components are signal (the important structure of the data) and which are noise.
Diagonalizing the Covariance Matrix: The covariance matrix of the centered data is $\frac{1}{N-1}X^TX = \frac{1}{N-1}VS^2V^T$, so V diagonalizes it and the S^2 entries (up to the 1/(N-1)) are the variances along the principal axes.
How much Total Variance is Explained by Principal Components?: Plot the fractional variance $s_i^2/\sum_j s_j^2$ against component number. Sometimes just the first 20 or first 100 components capture most of the overall structure of the data, while the remaining components add only slightly to the fit and can be considered noise.
Dimensional Reduction: Represent each data point by its coordinates along only the first few (strongest) principal components, discarding the rest.
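PCA via the SVD in a few lines (my own sketch; the 3-column toy data are just to show one dominant direction):

```python
import numpy as np

def pca(X, n_components):
    """Center the data, take the SVD, and project onto the top
    right-singular vectors. Returns the projected coordinates and
    the fraction of total variance each kept component explains."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)      # fractional variance per component
    scores = Xc @ Vt[:n_components].T    # coordinates along the top components
    return scores, explained[:n_components]

# two strongly correlated columns plus one tiny-noise column:
# almost all the variance lies along a single direction
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.01 * rng.normal(size=200),
                     0.01 * rng.normal(size=200)])
scores, explained = pca(X, 1)
```

Keeping only `scores` reduces each 3-dimensional point to 1 number while losing almost none of the variance.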