Chapter 9: Sample surveys¶
UBC-V STAT 200 - Elementary Statistics for Applications
2024 Winter Term 1
© 2024 Vivian Meng & Eugenia Yu – Material Licensed under CC BY-NC-ND 4.0
Answering a study question with data¶
In the process of statistical investigation, a researcher first needs to form a study question.
- Example 1: "What is the average age of students enrolled in STAT 200 this term?"
- Example 2: "What is the distribution of preferred mode of transportation for UBC students?"
Given the study question, the next step is to figure out who (or what) we're studying, and the feasible ways to collect data.
- If the study question is not about "cause and effect" but rather describes the current state of things, it can be answered by obtaining a "snapshot" of the world.
The complete picture: population, parameter, census¶
Target population: refers to the complete collection of individuals that the researcher needs to study in order to answer the study question.
- Also referred to as "population of interest", "study population" or "population" for short.
Parameter: a numerical characteristic in the target population. Also referred to as "population parameter".
- Knowing the values of population parameters should perfectly answer a well-defined study question.
Census: collection of data on the whole target population
- provides complete information; required to find the exact value of population parameters.
- disadvantage: can be costly, time-inefficient, or impossible to carry out.
Sample: a subset of individuals selected from the target population.
Statistic: a numerical characteristic in the sample.
- We use statistics calculated from a sample to estimate parameters.
Sample survey: collection of data on a subset of the target population.
- provides reliable information about a population as long as the subset is representative of the population.
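The distinction between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the population of ages below is invented for illustration, and the names `population_ages` and `sample_ages` are hypothetical.

```python
import random
import statistics

rng = random.Random(200)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 1,000 students (invented for illustration).
population_ages = [rng.randint(17, 30) for _ in range(1000)]

# Census: use every individual, which gives the exact value of the parameter.
parameter = statistics.mean(population_ages)

# Sample survey: use a subset, which gives a statistic that estimates the parameter.
sample_ages = rng.sample(population_ages, 50)
statistic = statistics.mean(sample_ages)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

The two printed values will generally be close but not identical, since the statistic is computed from only 50 of the 1,000 individuals.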
Obtaining a representative sample¶
There are two key ideas in getting a representative sample: randomization and sufficient sample size.
Randomization¶
- Randomness plays an important role in sampling and in the theory that lets us make inferences about a population from results obtained from a sample.
- Randomization tends to give samples that have characteristics comparable to the population, thus reducing the chance of obtaining a nonrepresentative sample.
- Any randomized sampling method results in sampling variability, i.e. difference in characteristics from one sample to another.
It is the sample size that matters¶
- How big a random sample is needed for it to represent the population well? What matters is the actual number of individuals or subjects in the sample, not the size of the population or the fraction of the population that is sampled.
- The larger the sample size, the smaller the sampling variability.
- With smaller between-sample variability, there is a higher chance that a given sample will be reliable.
- Intuition: taste-testing a pot of soup. As long as you stir well, you just need to make sure you test a big-enough spoonful, regardless of the pot size.
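The points above can be checked with a rough simulation. The sketch below assumes two hypothetical populations with the same spread but very different sizes, and compares how much the sample mean varies from one simple random sample to another. The helper name `sd_of_sample_means` and all numbers are illustrative assumptions.

```python
import random
import statistics

def sd_of_sample_means(population, n, reps, rng):
    """Standard deviation of the sample mean across `reps` simple random samples."""
    means = [sum(rng.sample(population, n)) / n for _ in range(reps)]
    return statistics.stdev(means)

rng = random.Random(9)

# Two hypothetical populations with the same spread but very different sizes.
small_pop = [rng.gauss(50, 10) for _ in range(10_000)]
large_pop = [rng.gauss(50, 10) for _ in range(200_000)]

results = {}
for n in (25, 100, 400):
    sd_small = sd_of_sample_means(small_pop, n, 500, rng)
    sd_large = sd_of_sample_means(large_pop, n, 500, rng)
    results[n] = (sd_small, sd_large)
    print(f"n={n:4d}  N=10,000: {sd_small:.2f}   N=200,000: {sd_large:.2f}")
```

Under these assumptions, the variability should shrink as $n$ grows and should be roughly the same for both population sizes at any given $n$.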
First, we require a sampling frame: a list of individuals from which the sample is drawn.
- the sampling frame should include every individual in the population...
Example: A sampling frame for a population of $N$ individuals:
| Unique ID | Name |
|---|---|
| 1 | John |
| 2 | Max |
| 3 | Julie |
| 4 | Alex |
| $\vdots$ | $\vdots$ |
| $N$ | Sam |
Simple random sampling (without replacement)¶
Pick $n$ individuals at random from the population such that any combination of $n$ individuals is equally likely to be selected.
Think: From a well shuffled bag of $N$ distinctly labelled marbles, pick a handful of $n$ marbles at once without looking.
- Equivalently: pick $n$ marbles out of the bag, one at a time, without putting the ones you took back into the bag (hence "without replacement").
- a sample chosen using this method is called a simple random sample.
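A minimal sketch of drawing a simple random sample in Python, assuming the sampling frame is just a list of unique IDs (the values of `N` and `n` are made up for illustration). The standard-library function `random.sample` selects without replacement.

```python
import random

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: unique IDs 1..N.
N = 500
sampling_frame = list(range(1, N + 1))

# random.sample draws n distinct individuals without replacement.
n = 10
simple_random_sample = rng.sample(sampling_frame, n)
print(sorted(simple_random_sample))
```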
Suppose that, in the sampling frame, we also have data on a variable that is informative about the variable we want to learn.
Example:
- We want to learn the average height of children at an elementary school.
- We have data on the grade of each student in the sampling frame.
| Unique ID | Name | Grade |
|---|---|---|
| 1 | John | 1 |
| 2 | Max | 2 |
| 3 | Julie | 1 |
| 4 | Alex | 7 |
| $\vdots$ | $\vdots$ | $\vdots$ |
| $N$ | Sam | 7 |
Stratified sampling: the population is first divided into strata (non-overlapping groups), then each stratum is sampled by simple random sampling.
- The number of individuals sampled from a stratum is proportional to that stratum's share of the population.
- e.g. Suppose there are two strata in the population, with 60% of the population in stratum A and 40% in stratum B. When doing stratified sampling with a final sample size of $n$, we sample $0.6n$ people from stratum A and $0.4n$ from stratum B.
- Advantage: A stratified sample may have smaller variability across samples, and hence give more reliable results, when individuals within a stratum are more alike than individuals from different strata.
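The proportional allocation described above can be sketched as follows. The function `stratified_sample`, the `(id, stratum)` representation of the frame, and the 60/40 split are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, n, rng):
    """Proportional-allocation stratified sample: an SRS within each stratum."""
    strata = defaultdict(list)
    for individual in frame:
        strata[stratum_of(individual)].append(individual)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        # Stratum sample size is proportional to its share of the population.
        # Because of rounding, the total can differ slightly from n in general.
        n_stratum = round(n * len(members) / len(frame))
        sample.extend(rng.sample(members, n_stratum))
    return sample

rng = random.Random(66)
# Hypothetical frame of (id, stratum) pairs: 60% in stratum "A", 40% in "B".
frame = [(i, "A") for i in range(600)] + [(i, "B") for i in range(600, 1000)]
sample = stratified_sample(frame, lambda person: person[1], 50, rng)
counts = {s: sum(1 for _, lab in sample if lab == s) for s in ("A", "B")}
print(counts)  # 0.6 * 50 = 30 from A and 0.4 * 50 = 20 from B
```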
Suppose we also have information on a variable that makes carrying out the survey easy.
Example:
- We want to learn the average commute time of children at an elementary school.
- In the sampling frame, we also know the room location of everyone's 9am class (homeroom).
| Unique ID | Name | Grade | 9am class location |
|---|---|---|---|
| 1 | John | 1 | Room A |
| 2 | Max | 2 | Room A |
| 3 | Julie | 1 | Room B |
| 4 | Alex | 7 | Room J |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $N$ | Sam | 7 | Room J |
Cluster sampling: the population is divided into groups called "clusters" (usually formed by some convenience factor), then a subset of clusters is chosen by simple random sampling.
If all individuals from the selected clusters are included in the sample, the final sample is a one-stage cluster sample.
If only a random subset of individuals from each selected cluster is included in the sample, the final sample is a two-stage cluster sample.
- Question: since we are selecting clusters at random, what information does the sampling frame need to contain at a minimum? A list of...?
Caution:
- Cluster sampling is used for the sake of convenience, practicality and cost.
- It does not always work well... e.g. choosing only a few large clusters when each cluster is not representative of the overall population.
- It is okay when choosing many small clusters of similar sizes.
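One-stage and two-stage cluster sampling can be sketched as below, assuming ten hypothetical homerooms of 20 students each. The room names follow the example table, but the student IDs, cluster sizes and number of clusters chosen are invented.

```python
import random

rng = random.Random(81)  # fixed seed so the sketch is reproducible

# Hypothetical clusters: 10 homerooms ("Room A".."Room J"), 20 student IDs each.
rooms = {f"Room {chr(ord('A') + r)}": list(range(r * 20, (r + 1) * 20))
         for r in range(10)}

# Stage 1: simple random sample of clusters.
chosen_rooms = rng.sample(sorted(rooms), 3)

# One-stage cluster sample: everyone in the chosen clusters.
one_stage = [s for room in chosen_rooms for s in rooms[room]]

# Two-stage cluster sample: an SRS of 5 students within each chosen cluster.
two_stage = [s for room in chosen_rooms for s in rng.sample(rooms[room], 5)]

print(chosen_rooms, len(one_stage), len(two_stage))
```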
Systematic sampling¶
A systematic sample is obtained by selecting every $k$-th individual from the sampling frame, beginning with a randomly chosen row (typically one of the first $k$ rows). This sampling method is justified as long as the list of individuals in the sampling frame does not contain any hidden order or repeating pattern.
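A minimal sketch of systematic sampling, assuming the random starting row is chosen among the first $k$ rows of the frame (a common convention, adopted here as an assumption). The function name `systematic_sample` and the frame of 100 IDs are hypothetical.

```python
import random

def systematic_sample(frame, k, rng):
    """Every k-th row of the frame, starting from a random row among the first k."""
    start = rng.randrange(k)  # assumption: random start within the first k rows
    return frame[start::k]

rng = random.Random(90)
frame = list(range(1, 101))  # hypothetical frame with IDs 1..100
sample = systematic_sample(frame, 10, rng)
print(sample)
```

With a frame of 100 rows and $k = 10$, the sample always contains 10 individuals spaced 10 rows apart.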
Multistage sampling¶
Multistage sampling involves more than one stage or more than one sampling procedure in obtaining a sample. Two-stage cluster sampling is an example of multistage sampling.
- any sampling procedure that results in systematic nonrepresentativeness is biased.
- a sample chosen from a biased sampling procedure/method is called a biased sample.
- a biased sample cannot be redeemed by increasing its sample size.
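The last point can be illustrated with a simulation. The sketch below assumes a made-up population of weekly library visits in which a biased procedure can only reach frequent users. All numbers and names are hypothetical.

```python
import random

rng = random.Random(95)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly library visits for two groups of students.
# Frequent users average about 5 visits, infrequent users about 1.
frequent = [max(0.0, rng.gauss(5, 1)) for _ in range(20_000)]
infrequent = [max(0.0, rng.gauss(1, 0.5)) for _ in range(80_000)]
population = frequent + infrequent
true_mean = sum(population) / len(population)

def mean_of_sample(frame, n):
    """Mean of a simple random sample of size n drawn from `frame`."""
    return sum(rng.sample(frame, n)) / n

results = {}
for n in (100, 1_000, 10_000):
    biased = mean_of_sample(frequent, n)   # procedure reaches only frequent users
    srs = mean_of_sample(population, n)    # SRS from the full population
    results[n] = (biased, srs)
    print(f"n={n:6d}  biased: {biased:.2f}  SRS: {srs:.2f}  truth: {true_mean:.2f}")
```

Under these assumptions, the estimate from the biased procedure stays far from the true mean no matter how large $n$ gets, while the simple random sample settles close to it.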
Watch out for the following problems, which result in biased samples.
Undercoverage¶
When a sampling frame or a sampling procedure completely excludes or underrepresents certain kinds of individuals from the target population, it is said to suffer from undercoverage.
e.g. A librarian wants to find out how often UBC students use library services (borrowing books, say). She only surveys students visiting the Woodward Biomedical Library.
Note: the subset of the target population that you can sample from is typically referred to as the study population. What are the study population and the target population in the example above?
Convenience sampling¶
The selection of individuals from the population based on easy availability and accessibility.
e.g. A market researcher wants to estimate the average price of housing in Vancouver. He collects information on prices by sending out a survey to 50 households in his neighbourhood.
Voluntary response bias¶
If participation in a survey is voluntary, individuals with strong opinions tend to respond more often and thus will be overrepresented.
e.g. call-in polls
Nonresponse bias¶
Individuals who do not respond to a survey might differ from the respondents in certain aspects. Including only the respondents in a sample will lead to nonresponse bias.
e.g. mail-in questionnaires
Voluntary response bias is a form of nonresponse bias, but nonresponse may occur for other reasons. For example, those who are at work during the day won't respond to a telephone survey conducted only during working hours.
Response bias¶
Not the opposite of nonresponse bias!
Response bias occurs when a surveyed subject's response is influenced by how a question is phrased or asked, by a misunderstanding of the question, or by an unwillingness to disclose the truth.
e.g. a question in a drug usage survey that asks "Have you ever smoked marijuana?": marijuana users might lie and respond that they have never smoked marijuana.