Chapter 09 - demo: advantage of stratified sampling

The lecture slides mention that stratified sampling is advantageous when the distribution of the variable of interest varies from stratum to stratum. We will show this through a simple demo.

The population of interest is a fictional group of elementary school children. The data is generated in the following cell.

In [13]:
##############
### generate a population (of elementary school kids)
############
library(tidyverse)
n = 100
B=1e3 # number of repeats
set.seed(100)
mean_height <- c(110, 120, 134, 140, 155, 163, 165)
#plot(mean_height)
prop_grade <- c(0.18, 0.18, 0.14, 0.16, 0.14, 0.1, 0.1)
N <- sum(round(prop_grade*500))
dd <- tibble(ID = 1:N, grade = rep(1:7, times = round(N * prop_grade)),
             `commute_time (min)`=rgamma(N, 6, 0.8))
dd <- mutate(dd, `height (cm)` = mean_height[grade]+rnorm(N,sd=5), grade = as_factor(grade))

The boxplots below visualize the population distribution for two variables of interest, commute time and height, within each grade.

Notice the distribution of height changes across grades, while commute time is more or less the same regardless of grade.

In [3]:
#### plotting population distribution
par(mfrow=c(1,1))
#hist_commute <- hist(dd$`commute_time (min)`)
#hist_height <- hist(dd$`height (cm)`,breaks=seq(85,185, by=8), xlim=c(85, 185))
boxplot_height <- ggplot(dd, aes(y=`height (cm)`, x=(grade)))+ geom_boxplot()
boxplot_commute <- ggplot(dd, aes(y=`commute_time (min)`, x=(grade)))+ geom_boxplot()
boxplot_height
boxplot_commute
[Figure: boxplots of height (cm) by grade]
[Figure: boxplots of commute time (min) by grade]

Simple random sampling

Let's compare the population distribution of height with the sample distribution for a random sample of 100 students obtained via simple random sampling without replacement.

In [4]:
################################
#### simple random sampling
################################
# repeat the following 4 lines of code if you want to visualize the histogram
# of a new simple random sample.
dd_sample <- sample_n(dd, size=sum(round(prop_grade*n)))
par(mfrow=c(2,1))
hist(dd$`height (cm)`, freq=F, breaks=seq(85,185, by=8), xlim=c(85, 185), main="distribution of height in population")
hist(dd_sample$`height (cm)`, freq=F, breaks=seq(85,185, by=8), xlim=c(85, 185), main="distribution of height in sample")
[Figure: histograms of height in the population (top) and in the simple random sample (bottom)]

The two distributions above are somewhat similar. Because the sample is chosen at random, we can re-run the previous code cell to draw another sample and see whether it still resembles the population distribution.

However, it is difficult to judge by eye how different two distributions are, so we'll use a numeric summary instead: let us compare the population mean height with the sample mean height.

Furthermore, we'll repeat the simple random sampling B = 1000 times with the code below to see how far the sample means are from the true population mean height.

In [14]:
# Repeated simple random sampling: record the sample means each time.
# numeric(B) preallocates the result vectors (rep(NULL, B) is just NULL).
ybar_height <- numeric(B)
ybar_commute <- numeric(B)
for(j in 1:B){
  dd_sample <- sample_n(dd, size=sum(round(prop_grade*n)))
  ybar_height[j] <- mean(dd_sample$`height (cm)`)
  ybar_commute[j] <- mean(dd_sample$`commute_time (min)`)
}

We will hold off showing the result for now.

Stratified sampling

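Next, let's perform stratified sampling with proportional allocation: within each grade we draw a simple random sample whose size is proportional to that grade's share of the population. Let's compare a typical sample with the population distribution: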

In [7]:
#########################
### stratified sampling
#########################
# repeat the following 9 lines of code if you want to visualize the histogram
# of a new stratified sample.
dd_sample_strat <- tibble()
for(i in 1:7){
  dd_sample_strat <-rbind(dd_sample_strat, dd |> 
                            filter(grade == i) |> 
                            sample_n(size=round(prop_grade[i]*n)))
}
par(mfrow=c(2,1))
hist(dd$`height (cm)`, freq=F, breaks=seq(85,185, by=8), xlim=c(85, 185), main="distribution of height in population")
hist(dd_sample_strat$`height (cm)`, freq=F, breaks=seq(85,185, by=8), xlim=c(85, 185), main="distribution of height in sample")
[Figure: histograms of height in the population (top) and in the stratified sample (bottom)]

The sample captures most of the features of the population distribution. We will again calculate the sample mean and compare it to the population mean as a simple summary of how different these distributions are.

Again, we'll repeat the stratified sampling B = 1000 times with the code below to see how far the sample means are from the true population mean height.

In [9]:
### Repeated stratified sampling: record the sample means each time.
# numeric(B) preallocates the result vectors (rep(NULL, B) is just NULL).
ybar_height_strat <- numeric(B)
ybar_commute_strat <- numeric(B)

for(j in 1:B){
  dd_sample_strat <- tibble()
  for(i in 1:7){
    dd_sample_strat <-rbind(dd_sample_strat, dd |> 
                              filter(grade == i) |> 
                              sample_n(size=round(prop_grade[i]*n)))
  }
  ybar_height_strat[j] <- mean(dd_sample_strat$`height (cm)`)
  ybar_commute_strat[j]<- mean(dd_sample_strat$`commute_time (min)`)

}
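
The per-grade loop in the cells above can also be wrapped in a small reusable function. The following is a minimal sketch in base R, not the code used for this demo: the name `stratified_sample`, its arguments, and the toy data frame `toy` are all hypothetical and introduced only for illustration.

```r
# Hypothetical helper: draw a simple random sample without replacement
# inside each stratum and stack the selected rows.
# `sizes` must be a vector named by stratum level.
stratified_sample <- function(data, strata, sizes) {
  rows_by_stratum <- split(seq_len(nrow(data)), data[[strata]])
  picked <- lapply(names(rows_by_stratum), function(level) {
    rows <- rows_by_stratum[[level]]
    # sample.int() on the length avoids the surprise of sample(x)
    # when x happens to have length 1
    rows[sample.int(length(rows), size = sizes[[level]])]
  })
  data[unlist(picked), , drop = FALSE]
}

# Toy population (made up): three strata of sizes 50, 30 and 20
set.seed(1)
toy <- data.frame(id = 1:100,
                  stratum = rep(c("a", "b", "c"), times = c(50, 30, 20)))
toy_sample <- stratified_sample(toy, "stratum", sizes = c(a = 5, b = 3, c = 2))
table(toy_sample$stratum)
```

With the demo's data, a call along the lines of `stratified_sample(dd, "grade", sizes = setNames(round(prop_grade * n), 1:7))` should reproduce the same sampling scheme as the loop.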

The results

How does simple random sampling compare with stratified sampling? Here's the grand reveal:

In [11]:
################################
### boxplot of sampling distribution of sample mean (of height)
par(mfrow=c(1,1))
mean_height_repeated <- tibble(sample_mean = c(ybar_height, ybar_height_strat), 
       type=rep(c("Simple random", "Stratified"), each=B))
boxplot(sample_mean~type, mean_height_repeated, main="Boxplot of sample mean (of height variable)", xlab="sampling method")
abline(h= mean(dd$`height (cm)`), col="green", lty=2, lwd=2)#overlay a line indicating population mean height
[Figure: boxplots of the sample mean of height under simple random and stratified sampling, with the population mean shown as a dashed green line]

The boxplot shows that the mean height of samples obtained via stratified sampling tends to be much closer to the true value than the mean of samples obtained via simple random sampling!

When the distribution of the variable of interest varies from stratum to stratum, stratified sampling typically gives more representative samples than simple random sampling. As a result, the mean of a stratified sample has a higher chance of being close to the true population mean.
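
The standard textbook variance formulas make this precise. The notation below is not used elsewhere in this demo and is introduced here as an assumption: stratum $h$ contains $N_h$ of the $N$ population units, $W_h = N_h/N$ is its weight, $\bar{Y}_h$ and $S_h^2$ are its mean and variance, $\bar{Y}$ and $S^2$ are the population mean and variance, and $f = n/N$ is the sampling fraction. For simple random sampling without replacement, and for stratified sampling with proportional allocation ($n_h = n W_h$, which is what the code above does):

```latex
\begin{align*}
\operatorname{Var}(\bar{y}_{\text{SRS}}) &= (1 - f)\,\frac{S^2}{n}, \\
\operatorname{Var}(\bar{y}_{\text{strat}}) &= \frac{1 - f}{n} \sum_{h} W_h S_h^2, \\
S^2 &\approx \sum_{h} W_h S_h^2 + \sum_{h} W_h \left(\bar{Y}_h - \bar{Y}\right)^2 .
\end{align*}
```

The last line is the usual within/between decomposition; it is approximate because it ignores finite-population factors such as $(N_h - 1)/(N - 1)$. Combining the three lines gives $\operatorname{Var}(\bar{y}_{\text{SRS}}) - \operatorname{Var}(\bar{y}_{\text{strat}}) \approx \frac{1 - f}{n} \sum_h W_h (\bar{Y}_h - \bar{Y})^2$, so for the sample mean the gain from stratification comes from the differences between the stratum means.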

On the other hand, if the variable we are investigating is not really associated with the stratification variable (e.g. commute time is not associated with a student's grade), then there is little to be gained by stratified sampling, as shown in the boxplots below.

In [12]:
### boxplot of sampling distribution of sample mean (of commute time)
mean_commute_repeated <- tibble(sample_mean = c(ybar_commute, ybar_commute_strat), 
                                type=rep(c("Simple random", "Stratified"), each=B))

boxplot(sample_mean~type, mean_commute_repeated, main="Boxplot of sample mean (of commute time)", xlab="sampling method")
abline(h= mean(dd$`commute_time (min)`), col="green", lty=2, lwd=2)#overlay a line indicating population mean commute time
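[Figure: boxplots of the sample mean of commute time under simple random and stratified sampling, with the population mean shown as a dashed green line]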
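
The boxplots can be complemented by a numeric summary: the standard deviation of the sample means under each design. The base R sketch below is self-contained and does not reuse the objects created above. It rebuilds a population with the same grade sizes, mean heights and commute-time distribution as the first cell, but with its own arbitrary seed, so its numbers will not match the notebook's exactly. The names `N_h`, `n_h`, `pop`, `one_srs_mean` and `one_strat_mean` are introduced here for illustration.

```r
set.seed(2024)  # arbitrary seed, not the one used in the notebook

# Population mirroring the demo: 7 grades, 500 children in total
mean_height <- c(110, 120, 134, 140, 155, 163, 165)
N_h <- c(90, 90, 70, 80, 70, 50, 50)   # children per grade
n_h <- N_h / 5                         # proportional allocation, total n = 100
grade <- rep(1:7, times = N_h)
pop <- data.frame(
  height  = mean_height[grade] + rnorm(sum(N_h), sd = 5),
  commute = rgamma(sum(N_h), shape = 6, rate = 0.8)
)

# Means of both variables from one simple random sample
one_srs_mean <- function() {
  colMeans(pop[sample.int(nrow(pop), sum(n_h)), ])
}

# Means of both variables from one stratified sample
one_strat_mean <- function() {
  rows <- unlist(lapply(1:7, function(h) {
    in_h <- which(grade == h)
    in_h[sample.int(length(in_h), n_h[h])]
  }))
  colMeans(pop[rows, ])
}

B <- 1000
srs_means   <- replicate(B, one_srs_mean())    # 2 x B matrix
strat_means <- replicate(B, one_strat_mean())  # 2 x B matrix

# Standard deviation of the sample mean, by design (rows) and variable (columns)
sd_table <- rbind(
  simple_random = apply(srs_means, 1, sd),
  stratified    = apply(strat_means, 1, sd)
)
round(sd_table, 3)
```

Under these assumptions, the `height` column should show a clearly smaller standard deviation for the stratified design, while the two `commute` values should be close to each other.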