Parameters and Estimates

The task of statistical inference is to estimate an unknown population parameter using observed data from a sample. In a sampling model, the collection of elements in the urn is called the population. A parameter is a number that summarizes data for an entire population.

We want to estimate the proportion of blue beads in the urn, the parameter p. The proportion of red beads in the urn is \(1-p\), and the spread is \(2p-1\).

The Sample Average

Many common data science tasks can be framed as estimating a parameter from a sample. We illustrate statistical inference by walking through the process to estimate p. From the estimate of p, we can easily calculate an estimate of the spread, \(2p-1\).

Consider the random variable X that is 1 if a blue bead is chosen and 0 if a red bead is chosen. The proportion of blue beads in N draws is the average of the draws \(X_1, ... , X_N\) .

\(\overline{X}\) is the sample average. It is a random variable because it is the average of random draws: each time we take a new sample, \(\overline{X}\) is different.

\[\overline{X}=\frac{X_1+X_2+\cdots+X_N}{N}\]

The number of blue beads drawn in N draws is \(N\overline{X}\), that is, N times the sample proportion. However, we do not know the true proportion of blue beads in the urn: that is the parameter p we are trying to estimate.
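As a quick illustration, the following sketch simulates N draws from such an urn and computes the sample average; the true proportion p = 0.53 used below is an arbitrary assumption, since p is unknown in practice.

# Simulate N draws from an urn with an assumed true proportion p of blue beads
set.seed(1)
p <- 0.53     # assumed true proportion (unknown in a real poll)
N <- 25
x <- sample(c(1, 0), size = N, replace = TRUE, prob = c(p, 1 - p))
mean(x)       # the sample average X-bar, our estimate of p
sum(x)        # number of blue beads drawn, equal to N * X-bar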

Polling versus Forecasting

A poll taken in advance of an election estimates p for that moment, not for election day.

In order to predict election results, forecasters try to use early estimates of p to predict p on election day.

Properties of Our Estimate

When interpreting values of \(\overline{X}\), it is important to remember that \(\overline{X}\) is a random variable representing the sample proportion of positive events (blue beads), and that it therefore has an expected value and a standard error.

The expected value of \(\overline{X}\) is the parameter of interest p. This follows from the fact that \(\overline{X}\) is the sum of independent draws of a random variable times the constant 1/N.

\[E(\overline{X})=p\]

As the number of draws N increases, the standard error of our estimate \(\overline{X}\) decreases. The standard error of \(\overline{X}\), the average of N draws, is:

\[SE(\overline{X})=\sqrt{p\cdot (1-p)/N}\]
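As a rough check of these two properties, a Monte Carlo simulation (with illustrative values of p and N, not taken from the text) should give an average of the \(\overline{X}\) values close to p and a standard deviation close to \(\sqrt{p(1-p)/N}\):

# Monte Carlo check of E(X-bar) and SE(X-bar) with illustrative p and N
set.seed(1)
p <- 0.45
N <- 1000
B <- 10000
x_bar <- replicate(B, mean(sample(c(1, 0), N, replace = TRUE, prob = c(p, 1 - p))))
mean(x_bar)              # should be close to p
sd(x_bar)                # should be close to the theoretical standard error
sqrt(p * (1 - p) / N)    # theoretical SE(X-bar) for comparison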

In theory, we can get more accurate estimates of p by increasing N. In practice, there are limits on the size of N due to cost and other factors.

We can also use the properties of random variables to determine the expected value of the sum of draws, E(S), and the standard error of the sum of draws, SE(S):

\[E(S)=Np\]

\[SE(S)=\sqrt{Np(1-p)}\]
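For example, with the illustrative values p = 0.45 and N = 1000 (chosen only to show the formulas in use), these quantities can be computed directly:

# Expected value and standard error of the sum of draws S (illustrative values)
p <- 0.45
N <- 1000
N * p                    # E(S)
sqrt(N * p * (1 - p))    # SE(S)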

Example 1

# `N` represents the number of people polled
N <- 25

# Create a variable `p` that contains 100 proportions ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length.out=100)

# Create a variable `se` that contains the standard error of each sample average
se <- sqrt(p*(1-p)/N)

# Plot `p` on the x-axis and `se` on the y-axis
plot(p, se)

Example 2

# The vector `p` contains 100 proportions of Democrats ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length = 100)

# The vector `sample_sizes` contains the three sample sizes
sample_sizes <- c(25, 100, 1000)

# Write a for-loop that calculates the standard error `se` for every value of `p` for each of the three samples sizes `N` in the vector `sample_sizes`. Plot the three graphs, using the `ylim` argument to standardize the y-axis across all three plots.

for (N in sample_sizes) {
  se <- sqrt(p * (1 - p) / N)
  # ylim = c(0, 0.1) fixes the y-axis at the largest possible SE, sqrt(0.5 * 0.5 / 25)
  plot(p, se, ylim = c(0, 0.1))
}

Example 3

# `N` represents the number of people polled
N <- 25

# `p` represents the proportion of Democratic voters
p <- 0.45

# Calculate the standard error of the spread. Print this value to the console.
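# The spread is 2p - 1, so SE(spread) = 2 * SE(X-bar) = 2 * sqrt(p * (1 - p) / N)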
2 * sqrt(p * (1-p) / N)