In my last post I explored the “Large Population” or “10%” condition for statistical inference using the traditional formulas, specifically as it relates to proportions. After much Twitter conversation, I am coming around to the point of view that this should be explored deeply with students, if possible, as it will generate good conversations about precision, sampling distributions, and other key concepts of statistics.

If this condition is met, then the traditional standard error formula for a proportion is fairly precise (within about 5% of the exact value). If it’s not met, it’s not precise. However, it’s important, I think, that students realize that using a large sample relative to the population actually *makes the estimate better*. The reason we don’t want to use the normal approximation with the traditional formulas in those cases is that we will be *too generous*: our confidence intervals will be too wide and our p-values will be too high. But that could be okay! If this condition is not met, applying the traditional procedure will still result in usable answers; we simply shouldn’t put too much stock in the stated confidence level or p-value in those cases.
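To see where the “about 5%” figure comes from, here is a minimal sketch, assuming the standard finite population correction for a simple random sample drawn without replacement. The function names are my own, made up for illustration, and are not from any library.

```python
import math

def se_traditional(p_hat, n):
    """Traditional standard error of a sample proportion (ignores population size)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

def fpc(n, N):
    """Finite population correction when sampling n of N without replacement.

    Multiplying the traditional standard error by this factor gives the
    standard error that accounts for the population size.
    """
    return math.sqrt((N - n) / (N - 1))

# Right at the 10% boundary: 100 sampled from an illustrative population of 1,000.
boundary = fpc(100, 1000)  # close to sqrt(0.9), so the traditional SE is about 5% too big

# Well past the boundary: 100 sampled from a population of 200.
half = fpc(100, 200)       # roughly 0.71, so the traditional SE is far too big
```

The closer the factor is to 1, the less the traditional formula overstates the variability, which is why samples under 10% of the population are treated as safe.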

In that respect, this condition has something in common with the “Randomness” condition, which requires a simple random sample. If the sample is stratified, then the formulas are again “broken,” even though estimates from stratified samples are (hopefully) actually *less variable*.

To help students explore this idea, I wrote this problem. Let me know what you think!

A principal at a high school with 200 students per class picks 100 of the seniors at random and finds that 80 of them play sports outside of school. She wants to create a 90% confidence interval for the proportion of all seniors at her school that play sports outside of school.

- Which condition(s) for proportion inference are *not* met in this scenario?
- The principal decides to *bootstrap* a confidence interval for the proportion of all seniors at her school that play sports outside of school, since the conditions are not satisfied. She uses a tool to bootstrap 10,000 possible samples *without replacement* using her information, and creates the following relative frequency histogram of her simulated samples, with summary statistics shown.
- Based ONLY on this graph, what is a reasonable estimate of a 90% confidence interval for the true proportion of all seniors at her school who play sports outside of school? Briefly explain how you made your estimate.
- Calculate the standard error of the sampling distribution for p̂ the “normal” way. How does this value compare to the bootstrapped standard deviation?
- Calculate the confidence interval the “normal” way. How does it compare to your bootstrapped interval?
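
For anyone who wants to try the simulation themselves, here is a minimal sketch of one way the resampling could be done. I am assuming the tool builds a stand-in population of 200 seniors with the same proportion as the sample (160 who play sports, 40 who don't) and then repeatedly draws 100 of them without replacement. That is my guess at the approach, not a description of the actual tool, and all of the variable names are my own.

```python
import math
import random
import statistics

N, n, successes = 200, 100, 80      # class size, sample size, sample count from the problem
p_hat = successes / n

# ASSUMPTION: scale the sample up to a stand-in population of N seniors
# with the same proportion of sports players (1 = plays sports, 0 = does not).
players = round(p_hat * N)
population = [1] * players + [0] * (N - players)

rng = random.Random(0)              # fixed seed so the run is repeatable
sims = sorted(
    sum(rng.sample(population, n)) / n   # rng.sample draws without replacement
    for _ in range(10_000)
)

# Spread of the simulated proportions vs. the traditional standard error.
boot_sd = statistics.pstdev(sims)
se_normal = math.sqrt(p_hat * (1 - p_hat) / n)

# 90% intervals: middle 90% of the simulated proportions vs. p_hat +/- z * SE.
ci_boot = (sims[int(0.05 * len(sims))], sims[int(0.95 * len(sims)) - 1])
z = 1.645                           # critical value for 90% confidence
ci_normal = (p_hat - z * se_normal, p_hat + z * se_normal)

print(f"simulated SD: {boot_sd:.4f}   traditional SE: {se_normal:.4f}")
print(f"simulated 90% interval:   {ci_boot}")
print(f"traditional 90% interval: {ci_normal}")
```

Comparing `boot_sd` with `se_normal`, and `ci_boot` with `ci_normal`, lines up with the last two questions in the problem.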