Inference and Modeling

Parameters and Estimates

The task of statistical inference is to estimate an unknown population parameter using observed data from a sample. In a sampling model, the collection of elements in the urn is called the population. A parameter is a number that summarizes data for an entire population.

We want to predict the proportion of blue beads in the urn, the parameter \(p\). The proportion of red beads in the urn is \(1-p\), and the spread, the difference between the two proportions, is \(p-(1-p)=2p-1\).
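As a minimal sketch of the urn model, the snippet below draws a random sample and estimates both \(p\) and the spread. The true proportion `p`, the sample size `N`, and the seed are assumptions chosen only for illustration; in a real poll `p` is unknown.

```r
# Hypothetical urn: in practice p is unknown; we fix it here only to simulate.
set.seed(1)
p <- 0.45   # assumed true proportion of blue beads
N <- 1000   # assumed sample size

# Draw N beads with replacement: 1 = blue, 0 = red.
draws <- sample(c(1, 0), size = N, replace = TRUE, prob = c(p, 1 - p))

x_hat <- mean(draws)         # sample proportion, the estimate of p
spread_hat <- 2 * x_hat - 1  # estimate of the spread 2p - 1
```

Rerunning with a different seed gives a different `x_hat`, which is why the estimate is treated as a random variable.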

The Central Limit Theorem in Practice

Because \(\bar{X}\) is the sum of independent random draws divided by a constant, the Central Limit Theorem applies and the distribution of \(\bar{X}\) is approximately normal when the sample size N is large.

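We can convert \(\bar{X}\) to a standard normal random variable Z by subtracting its expected value and dividing by its standard error: \[Z=\frac{\bar{X}-E(\bar{X})}{SE(\bar{X})}\] where \(E(\bar{X})=p\) and \(SE(\bar{X})=\sqrt{p(1-p)/N}\).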

The probability that \(\bar{X}\) is within .01 of the actual value of p is: \[Pr\left(Z\leq 0.01/\sqrt{p(1-p)/N}\right)-Pr\left(Z\leq -0.01/\sqrt{p(1-p)/N}\right)\]

The Central Limit Theorem (CLT) still works if \(\bar{X}\) is used in place of p . This is called a plug-in estimate. Hats over values denote estimates. Therefore:

\[\hat{SE}(\bar{X})=\sqrt{\bar{X}(1-\bar{X})/N}\]

Using the CLT, the probability that \(\bar{X}\) is within .01 of the actual value of p is:

\[Pr\left(Z\leq 0.01/\sqrt{\bar{X}(1-\bar{X})/N}\right)-Pr\left(Z\leq -0.01/\sqrt{\bar{X}(1-\bar{X})/N}\right)\]
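The plug-in calculation can be sketched in R with `pnorm()`. The observed proportion `x_hat` and the sample size `N` below are hypothetical values chosen for illustration, not results from a real poll.

```r
x_hat <- 0.48  # hypothetical observed sample proportion
N <- 25        # hypothetical sample size

# Plug-in estimate of the standard error of the sample proportion.
se_hat <- sqrt(x_hat * (1 - x_hat) / N)

# Approximate probability that the estimate is within 0.01 of p.
prob_within <- pnorm(0.01 / se_hat) - pnorm(-0.01 / se_hat)
prob_within
```

With a sample this small the probability is low, roughly 0.08, which suggests that a poll of 25 people says little about p at this level of precision.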

Confidence Intervals and p-Values

We can use statistical theory to compute the probability that a given interval contains the true parameter p.

95% confidence intervals are intervals constructed to have a 95% chance of including p. The interval \(\bar{X}\) plus or minus the margin of error (about 2 standard errors) is approximately a 95% confidence interval.
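A Monte Carlo simulation is one way to check the 95% claim. The sketch below assumes a true proportion `p`, a sample size `N`, and a number of replicates `B`, all hypothetical, and counts how often the interval built from each simulated sample contains `p`.

```r
set.seed(1)
p <- 0.45   # assumed true proportion (unknown in practice)
N <- 1000   # assumed sample size
B <- 10000  # number of simulated samples

# One simulated sample proportion per replicate.
x_hat <- rbinom(B, size = N, prob = p) / N
se_hat <- sqrt(x_hat * (1 - x_hat) / N)

# Does each interval x_hat +/- z * se_hat contain p?
z <- qnorm(0.975)
inside <- (x_hat - z * se_hat <= p) & (p <= x_hat + z * se_hat)
coverage <- mean(inside)
coverage
```

The proportion of intervals containing `p` should be close to 0.95. The intervals vary from sample to sample while `p` stays fixed.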

The start and end of these confidence intervals are random variables. To calculate a confidence interval of any size, we need to calculate the value z for which \(Pr(-z\leq Z\leq z)\) equals the desired confidence. For example, a 99% confidence interval requires calculating z for \(Pr(-z\leq Z\leq z)=0.99\).

For a confidence interval of size q, we solve for the z that satisfies \(Pr(Z\leq z)=1-\frac{1-q}{2}\), that is, z <- qnorm(1 - (1 - q)/2).

To determine a 95% confidence interval, use z <- qnorm(0.975). This value, about 1.96, is slightly smaller than 2, so the interval extends slightly less than 2 standard errors on each side of \(\bar{X}\).
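As a short sketch, the interval can be computed in R as follows. The confidence level `q`, the observed proportion `x_hat`, and the sample size `N` are hypothetical inputs.

```r
q <- 0.95                    # desired confidence level
z <- qnorm(1 - (1 - q) / 2)  # same as qnorm(0.975) when q = 0.95

x_hat <- 0.48  # hypothetical observed sample proportion
N <- 1000      # hypothetical sample size
se_hat <- sqrt(x_hat * (1 - x_hat) / N)

# Confidence interval for p at level q.
ci <- c(lower = x_hat - z * se_hat, upper = x_hat + z * se_hat)

# Sanity check: the central probability recovered from z equals q.
central_prob <- pnorm(z) - pnorm(-z)
```

Changing `q` to 0.99 gives a larger `z` and therefore a wider interval.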

Statistical Models

Poll aggregators combine the results of many polls to simulate polls with a large sample size and therefore generate more precise estimates than individual polls. Polls can be simulated with a Monte Carlo simulation and used to construct an estimate of the spread and confidence intervals.
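The idea can be sketched with a small Monte Carlo simulation. The true spread `d`, the poll sample sizes `Ns`, and the weighting by sample size are assumptions made for illustration. Real aggregators use more elaborate models.

```r
set.seed(1)
d <- 0.039       # assumed true spread (hypothetical)
p <- (d + 1) / 2 # implied proportion for the leading candidate
Ns <- c(500, 800, 1200, 650, 900, 1500)  # hypothetical poll sample sizes

# Simulate one sample proportion per poll.
x_hat <- rbinom(length(Ns), size = Ns, prob = p) / Ns
z <- qnorm(0.975)

# Per-poll estimate of the spread and half-width of its 95% interval.
estimate <- 2 * x_hat - 1
half_width <- z * 2 * sqrt(x_hat * (1 - x_hat) / Ns)
polls <- data.frame(N = Ns, estimate = estimate,
                    lower = estimate - half_width,
                    upper = estimate + half_width)

# Aggregate: weight each poll's estimate by its sample size.
d_hat <- sum(polls$estimate * polls$N) / sum(polls$N)
p_hat <- (1 + d_hat) / 2
moe <- z * 2 * sqrt(p_hat * (1 - p_hat) / sum(polls$N))
agg_ci <- c(lower = d_hat - moe, upper = d_hat + moe)
```

Because the aggregate behaves like a single poll with `sum(Ns)` respondents, its margin of error `moe` is smaller than the half-width of any individual poll.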

The actual data science exercise of forecasting elections involves more complex statistical modeling, but these underlying ideas still apply.

Bayesian Statistics

In the urn model, it does not make sense to talk about the probability of p being greater than a certain value because p is a fixed value. With Bayesian statistics, we assume that p is in fact random, which allows us to calculate probabilities related to p . Hierarchical models describe variability at different levels and incorporate all these levels into a model for estimating p .

Election Forecasting

Pollsters tend to make probabilistic statements about the results of the election. For example, “The chance that Obama wins the electoral college is 91%” is a probabilistic statement about a parameter, which in previous sections we have denoted with d. We showed that for the 2016 election, FiveThirtyEight gave Clinton an 81.4% chance of winning the popular vote. To do this, they used the Bayesian approach we described.

We assume a hierarchical model similar to what we did to predict the performance of a baseball player.
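One simple form such a hierarchical model can take is a normal prior on the spread d combined with a normal model for the observed poll average. This is an assumption made for this sketch, not necessarily the model any forecaster uses. The prior mean `mu`, prior standard deviation `tau`, observed average `y_bar`, and its standard error `sigma` are all hypothetical values.

```r
mu <- 0          # assumed prior mean for the spread d
tau <- 0.035     # assumed prior standard deviation
y_bar <- 0.029   # hypothetical observed average spread across polls
sigma <- 0.0058  # hypothetical standard error of y_bar

# Posterior for d under the normal-normal model.
B <- sigma^2 / (sigma^2 + tau^2)        # weight given to the prior mean
post_mean <- B * mu + (1 - B) * y_bar   # shrinks y_bar toward mu
post_se <- sqrt(1 / (1 / sigma^2 + 1 / tau^2))

# 95% credible interval and posterior probability that d is positive.
credible <- post_mean + c(-1, 1) * qnorm(0.975) * post_se
prob_d_positive <- 1 - pnorm(0, mean = post_mean, sd = post_se)
```

Because d is treated as random here, a statement such as "the probability that d is greater than 0" has a well-defined value, which is what a forecast probability reports.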

Association Tests

Fisher's exact test determines the p-value as the probability of observing an outcome as extreme or more extreme than the observed outcome given the null distribution.

Data from a binary experiment are often summarized in two-by-two tables.

The p-value can be calculated from a two-by-two table using Fisher's exact test with the function fisher.test().
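A minimal example of the call is shown below. The counts in the two-by-two table are hypothetical and stand for a binary experiment with four trials of each type.

```r
# Hypothetical two-by-two table: rows are the guesses, columns the truth.
tab <- matrix(c(3, 1, 1, 3), nrow = 2,
              dimnames = list(Guess = c("A", "B"),
                              Truth = c("A", "B")))

# Fisher's exact test; the default alternative is two-sided.
res <- fisher.test(tab)
res$p.value
```

Passing `alternative = "greater"` to `fisher.test()` gives a one-sided p-value instead.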

Page last modified on March 08, 2021, at 07:54 AM