Prevalence studies play a crucial role in responding to the COVID-19 pandemic. Estimating the number of individuals who have been infected with SARS-CoV-2 by the number of positive clinical tests is likely to be an underestimate due to shortages of test supplies, other barriers to testing, and the high proportion of subclinical infections that do not prompt testing. Point prevalence and seroprevalence studies can offer better estimates of the true number or proportion of infections within a given population. Additionally, these studies can be used to model transmission dynamics, identify risk factors for infection, and compare disease prevalence across population subgroups and/or over time.
This document summarizes some of the statistical issues that should be considered when designing a SARS-CoV-2 prevalence study. A careful design is needed to allow the results of the study to generalize to the target population. Because of the statistical issues involved, it is best to consult with a statistician or epidemiologist at the design phase. In North Carolina, several universities across the state have statistics, biostatistics, and/or epidemiology departments (e.g., UNC-Chapel Hill, Duke University, NC State University, East Carolina University), and survey research organizations specializing in sampling methods are another potential resource.
Point prevalence studies aim to estimate the number or proportion of active SARS-CoV-2 infections in a population. These studies provide researchers with data to model transmission dynamics and to evaluate risk factors for infection. Active infections are those in which SARS-CoV-2 RNA is detectable from a nasal or throat swab using reverse transcription-polymerase chain reaction (RT-PCR) tests. The sensitivity and specificity of RT-PCR tests vary by the type of test, disease severity, and the timing of testing relative to symptom onset.
Antibodies are proteins that the body makes in response to an infection. In the case of COVID-19, measurement of antibodies against SARS-CoV-2 can indicate whether a person has been previously infected with the virus. Seroprevalence studies aim to estimate the proportion of the target population with detectable SARS-CoV-2 antibodies. Serological (antibody) tests identify the presence of SARS-CoV-2 antibodies in the blood, and positive tests are considered evidence of a prior infection. People infected with SARS-CoV-2 typically develop antibodies 1-3 weeks following infection, though antibodies take longer to develop in some individuals, and some individuals, particularly those with immune compromise or mild disease, may not develop detectable levels of antibodies. Further, it is not yet known how long antibodies remain detectable following infection or how long protection against subsequent infection may last. Serological tests have an important role in surveillance because they can detect evidence of prior infection, including among individuals who were asymptomatic (did not show signs of the disease) or did not seek care (either because their case was less severe or because of barriers to care).
The target population is the population to which the research team wants to make inference. For a SARS-CoV-2 prevalence study, the target population could be the general population, meaning all persons who reside in a given country, state, county, or municipality. Alternatively, the target population could consist of a subset of the general population, such as hospitalized patients, pregnant women, or persons in a school system.
If the goal is to estimate point prevalence or seroprevalence of SARS-CoV-2 within a target population, sampling is often conducted. This is because it is typically too time and resource-prohibitive to test everyone in the target population. If appropriate sampling methods and analytic techniques are followed, unbiased estimators can be used to obtain estimates of point prevalence or seroprevalence from a relatively small sample of the population. The generalizability of the sample results back to the target population depends on the sampling methods, analytic approaches, and the validity of any underlying assumptions on which these methods rely. Regardless of the sampling method used, the sample should be a subset of the target population.
There are two broad categories of sampling methods for prevalence studies. When the sample is selected from the target population using a random process, this is known as probability-based sampling. When selection into the sample is not random, this is known as nonprobability sampling. There are biases that can be introduced in either setting, and the analytic approaches employed must consider these potential biases and adjust for them to the extent possible. The sections below provide more details about each sampling approach and provide examples of biases that can be introduced with these designs.
With probability-based sampling, selection into the sample is random. The researcher develops a sampling frame, which is a list of members of the target population. From the sampling frame, the researcher randomly selects a sample of persons or households and recruits them for participation in the prevalence study. Because selection into the sample is random and not driven by the participant or the researcher, when survey weights are applied (as discussed below), the sample is representative of the target population in expectation.
With probability-based sampling, each unit $i$ has a known probability of selection, $\pi_i$. Researchers have the choice of whether to select units from the sampling frame with equal probability (i.e., $\pi_i=\pi_j$ for all $i,j$) or whether to select some units with higher probability than others (i.e., $\pi_i \ne \pi_j$ for some $i, j$). When some units are selected with higher probability than others, this is known as oversampling. Oversampling is typically implemented using stratification, where units on the sampling frame are divided into mutually exclusive and exhaustive strata, and separate random samples are selected within each stratum. In a prevalence study, stratification can facilitate oversampling of population subgroups (e.g., populations more vulnerable to COVID-19).
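As a rough illustration, the R sketch below selects a stratified simple random sample from a hypothetical sampling frame and oversamples a higher-priority stratum; the frame, stratum labels, and stratum sample sizes are all invented for illustration.

# Sketch: stratified simple random sampling with oversampling (hypothetical frame)
set.seed(2020)
frame <- data.frame(id = 1:10000,
                    stratum = sample(c("general", "priority"), 10000,
                                     replace = TRUE, prob = c(0.8, 0.2)),
                    stringsAsFactors = FALSE)

# Equal sample sizes per stratum oversample the smaller "priority" stratum
n_by_stratum <- c(general = 400, priority = 400)

# Select a simple random sample within each stratum
sampled <- do.call(rbind, lapply(names(n_by_stratum), function(h) {
  units <- frame[frame$stratum == h, ]
  units[sample(nrow(units), n_by_stratum[[h]]), ]
}))

# Selection probabilities differ by stratum: pi_i = n_h / N_h,
# and the corresponding sampling weight is w_i = 1 / pi_i
N_by_stratum <- table(frame$stratum)
sampled$pi <- as.numeric(n_by_stratum[sampled$stratum] / N_by_stratum[sampled$stratum])
sampled$weight <- 1 / sampled$pi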
Cluster sampling is another common tool used with probability-based designs. Researchers again divide the sampling frame into mutually exclusive and exhaustive groups (called clusters). Instead of sampling within all groups, the researcher randomly selects a subset of clusters for inclusion in the study. Cluster sampling helps make a study logistically feasible when (1) data collection will take place in person and thus sampled units need to be more closely grouped geographically, and/or (2) a sampling frame of the entire population is not available, so the researcher selects clusters from which sampling frames can later be obtained. Clusters can be selected with equal probability or selected using probability proportional to size (PPS) sampling. With PPS sampling, clusters are selected with probability proportional to a size measure (e.g., the number of households in the cluster).
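The sketch below illustrates one common way to implement PPS selection of clusters (systematic sampling from the cumulative size measures); the cluster list and size measure are hypothetical.

# Sketch: systematic PPS selection of clusters (hypothetical cluster frame)
set.seed(2020)
clusters <- data.frame(cluster_id = 1:50,
                       size = sample(500:1500, 50, replace = TRUE))  # e.g., households per cluster

n_clusters <- 10                               # number of clusters to select
interval   <- sum(clusters$size) / n_clusters  # sampling interval on the cumulative size scale
start      <- runif(1, 0, interval)            # random start
targets    <- start + interval * (0:(n_clusters - 1))

# Select the cluster whose cumulative size range contains each target point
cum_size <- cumsum(clusters$size)
selected <- clusters[findInterval(targets, cum_size) + 1, ]

# Under PPS, each cluster's selection probability is n_clusters * size_i / total size
selected$pi <- n_clusters * selected$size / sum(clusters$size)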
Probability sampling tends to take more time to implement and be more expensive than nonprobability sampling, because it takes time to design the sample and recruit participants rather than relying on participants to self-select into the study sample. To obtain estimates more quickly with probability-based sampling, researchers can partner with an ongoing representative cohort within the geography of interest and can recruit participants for the seroprevalence study within that cohort. For example, researchers at UNC are recruiting participants for the Chatham County COVID-19 Cohort (C4) seroprevalence study from the Chatham Community Assessment, an ongoing representative panel of Chatham County residents. Partnering with an ongoing study reduces the time and cost associated with selecting a new probability sample and facilitates recruitment because participants in the cohort have already participated in the partner study. Of course, this approach is only viable in areas with existing probability-based studies.
When selecting a sampling frame for a probability-based study, it is critical to consider what the target population is for that study and how recruitment will be conducted. For studies of the general population, address-based sampling (ABS) frames, enumerated lists, and Random Digit Dialing (RDD) frames are commonly used. Custom sampling frames are used for target populations consisting of subsets of the general population. The ideal sampling frame includes as many members of the target population as possible, with few members who are outside the target population. It is necessary to consider when the sampling frame was constructed, and which members of the target population might be excluded from the frame.
Because selection into the sample is random, selection bias due to the sampling process is eliminated for probability-based samples. However, other errors, such as undercoverage of the sampling frame and bias due to nonresponse, can still be introduced.
Probability-based samples are commonly analyzed using methods from the finite-population inferential paradigm. Sampling weights $w_i$ are assigned to each member of the sample, typically equal to the reciprocal of their probability of selection (i.e., $w_i=\pi_i^{-1}$). Sampling weights indicate the number of members of the target population represented by each sampled individual. Sampling weights can be further adjusted to account for sampling frame undercoverage and/or nonresponse using raking techniques or calibration estimators. Common statistical software packages, including R, SAS, Stata, SPSS, and SUDAAN, have built-in procedures to appropriately analyze survey data, accounting for the features of the design such as weighting, stratification, and/or clustering.
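As an illustration, the sketch below computes a design-based prevalence estimate from a stratified, weighted sample using the survey package, one widely used option in R; the data frame and its variables (stratum, weight, test_positive) are hypothetical.

# Sketch: design-based prevalence estimation with the survey package (hypothetical data)
library(survey)

set.seed(2020)
# Each row is a sampled person: stratum, sampling weight (1 / selection probability),
# and test result (1 = positive, 0 = negative)
dat <- data.frame(stratum       = rep(c("A", "B"), each = 250),
                  weight        = rep(c(40, 10), each = 250),
                  test_positive = rbinom(500, 1, 0.08))

# Declare the design: stratified element sample with no clustering (ids = ~1)
des <- svydesign(ids = ~1, strata = ~stratum, weights = ~weight, data = dat)

# Weighted prevalence estimate and a design-based confidence interval
svymean(~test_positive, des)
svyciprop(~test_positive, des, method = "logit")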
With nonprobability sampling, often referred to as convenience sampling, selection from the target population into the sample is not random. Instead, samples are selected from “convenient” populations or using subjective methods. Because selection is not random, the characteristics of nonprobability samples often differ from those of the target population. Researchers using nonprobability designs must rely on analytic adjustments with estimators that are unbiased under a given set of assumptions. When these assumptions do not hold, estimates from nonprobability samples do not generalize to the target population.
Several SARS-CoV-2 seroprevalence studies have been conducted to date using nonprobability sampling. Examples of nonprobability samples include volunteers recruited through social media platforms, patient populations, and individuals approached at shopping centers.
Selection Bias: Selection bias occurs when some parts of the target population are not included in the sample, or when some members of the target population are sampled at different rates than intended by the researcher.
Because selection into a nonprobability sample is not random, there can be characteristics of individuals that are associated both with their chances of being included in the sample and with their risk of SARS-CoV-2 infection. For example, individuals who suspect a prior COVID-19 infection because of recent symptoms might be more likely to participate in a seroprevalence study than asymptomatic individuals. Certain demographic groups (e.g., age groups or racial/ethnic groups) might be more or less likely than other groups to volunteer for a prevalence study due to the recruitment method(s) used.
Because of the likely differences between the nonprobability sample and the target population, analytic adjustments are needed to generalize the results of the nonprobability sample to the target population. Weighting, model-based, and doubly-robust estimators seek to adjust for the characteristics that affect participation and/or infection risk. These methods rely on the assumption that the nonprobability sample is like a stratified random sample from the target population, where the adjustment characteristics define the strata. This assumption does not hold when the data are missing not at random (i.e., when participation in the study is directly driven by COVID-19 infection status). It is also violated when an important factor drives both study participation and infection risk (e.g., age, race/ethnicity) but is not controlled for in the analysis. Unfortunately, this assumption cannot be validated from the observed data. The adjustment factors should therefore be identified based on subject-matter expertise when planning the study, to ensure these characteristics are captured during data collection.
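As a simple illustration of one such adjustment, the sketch below post-stratifies a hypothetical convenience sample to known population age-group counts using the survey package in R; the variable names and population totals are invented, and real studies would typically adjust for several characteristics at once (e.g., via raking, propensity weighting, or doubly-robust estimators).

# Sketch: post-stratification of a convenience sample to known population totals
# (hypothetical data; age group is the only adjustment factor)
library(survey)

set.seed(2020)
# Hypothetical convenience sample in which younger adults are overrepresented
samp <- data.frame(age_group     = sample(c("18-44", "45-64", "65+"), 800,
                                          replace = TRUE, prob = c(0.6, 0.3, 0.1)),
                   test_positive = rbinom(800, 1, 0.08),
                   stringsAsFactors = FALSE)

# Known (hypothetical) population counts for each age group
pop_totals <- data.frame(age_group = c("18-44", "45-64", "65+"),
                         Freq      = c(45000, 35000, 20000))

# Treat the sample as unweighted, then post-stratify to the population counts
des  <- svydesign(ids = ~1, weights = ~1, data = samp)
post <- postStratify(des, strata = ~age_group, population = pop_totals)

# Adjusted prevalence estimate; approximately unbiased only if, within age groups,
# the convenience sample behaves like a random sample from the target population
svyciprop(~test_positive, post, method = "logit")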
Prevalence studies rely on the results of imperfect diagnostic tests. Ignoring the sensitivity and specificity of the test used can lead to biased prevalence estimates. Sensitivity is the probability that an individual tests positive given that s/he is infected, while specificity is the probability that an individual tests negative given that s/he is not infected. Estimates from prevalence studies can be adjusted to account for the sensitivity and specificity of the PCR or serology test used.
Rogan and Gladen proposed the following estimator of the true population prevalence $p$:
\[\hat{p}=\frac{\hat{t}+S_{p}-1}{S_{e}+S_{p}-1}\]
where $\hat{t}$ represents the estimated proportion of persons in the sample who test positive, $S_{e}$ represents the sensitivity of the test, and $S_{p}$ represents the specificity of the test. Consideration needs to be given to whether sensitivity and specificity are known with certainty or estimated. Estimated standard errors and confidence intervals should also account for test sensitivity and specificity and, when applicable, the fact that these quantities were estimated rather than known.
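A minimal R implementation of this adjustment is sketched below; the apparent prevalence, sensitivity, and specificity values are hypothetical, and truncating the estimate to the [0, 1] parameter space is a common practical convention.

# Sketch: Rogan-Gladen adjustment of an apparent prevalence for test
# sensitivity and specificity (hypothetical values)
rogan_gladen <- function(t_hat, se, sp) {
  p_hat <- (t_hat + sp - 1) / (se + sp - 1)
  min(max(p_hat, 0), 1)  # truncate to [0, 1]
}

# Example: 6% of the sample tests positive on an assay with 90% sensitivity
# and 99% specificity
rogan_gladen(t_hat = 0.06, se = 0.90, sp = 0.99)
# [1] 0.05617978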
Careful consideration should be given to the sample size of a prevalence study to ensure the sample will produce estimates with adequate precision. The sample size required depends on features of the design (stratification and/or clustering), the anticipated prevalence in the target population, the finite population size, and the desired level of precision. The R package “PracTools” is freely available to the research community and includes functions for sample size calculations under multiple probability-based designs.
Because prevalence of SARS-CoV-2 infection tends to be fairly low in most populations, confidence intervals based on a normal approximation can undercover the true population prevalence. The Wilson and log-odds methods are alternatives that have better coverage properties for prevalences close to the extremes of the parameter space. The “nWilson” and “nLogOdds” functions in the PracTools package calculate required sample sizes using these methods based on user-specified values of the targeted margin of error and population prevalence. These functions assume a simple random sampling design. Alternatively, the “power.prop.test” function in the stats package in R can be used if the researcher is interested in powering the study based on a test of differences in prevalence between two non-overlapping groups when simple random sampling is used. The examples below demonstrate the use of the “nWilson” and “power.prop.test” functions for two different hypothetical designs.
Another reasonable method for power or sample size calculations, particularly for complex designs, is a simulation study. Researchers can simulate the finite population under a given set of assumptions, select a random sample from that finite population under the proposed sampling design, and calculate estimates based on the sample. By repeating this process many times, the researchers can estimate power or precision empirically.
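The sketch below illustrates the general idea for a simple random sample; the population size, assumed prevalence, proposed sample size, and number of simulations are hypothetical, and a real simulation should mirror the stratification, clustering, weighting, and estimators of the proposed design.

# Sketch: simulation-based precision calculation for a simple random sample
# (hypothetical population size, prevalence, and sample size)
set.seed(2020)
N      <- 50000   # finite population size
p_true <- 0.08    # assumed true prevalence
n      <- 700     # proposed sample size
n_sims <- 1000    # number of simulated samples

ci_halfwidths <- replicate(n_sims, {
  pop   <- rbinom(N, 1, p_true)        # simulate the finite population
  samp  <- sample(pop, n)              # select a simple random sample
  p_hat <- mean(samp)
  1.96 * sqrt(p_hat * (1 - p_hat) / n) * sqrt(1 - n / N)  # CI half-width with fpc
})

# Empirical summary of the precision achieved across simulated samples
mean(ci_halfwidths)
quantile(ci_halfwidths, 0.95)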
Regardless of the method used, it is important to remember that the sample sizes provided in the power or precision calculation reflect the number of participants in the study. When persons who are not eligible to participate are included on the sampling frame and/or nonresponse is anticipated, the sample size should be inflated to account for the anticipated eligibility and response rates for the study. For probability-based samples, it is good practice to select a larger sample than is anticipated to be required and to divide the sample into random replicates that can be released over time if sampling targets are not achieved.
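For example, using hypothetical rates, if 713 completed tests are needed, roughly 90% of records on the frame are expected to be eligible, and a 40% response rate is anticipated, then approximately 713 / (0.90 × 0.40) ≈ 1,981 units should be selected.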
Example Code:
Example 1: Assume that a research team is conducting a seroprevalence study of the general population in a given county. They believe that population seroprevalence is approximately $8\%$, and would like to estimate seroprevalence from a probability-based sample with a corresponding $95\%$ Wilson confidence interval with a margin of error of $\pm 2\%$. The sample will be recruited using address-based sampling, with addresses selected via a simple random sample. As demonstrated in the R code below, the researchers need to obtain $713$ participants to achieve their desired level of precision.
library(PracTools)
nWilson(moe.sw = 1, alpha = 0.05, pU = 0.08, e = 0.02)
$n.sam
[1] 712.077
$`CI lower limit`
[1] 0.06225363
$`CI upper limit`
[1] 0.1022536
$`length of CI`
[1] 0.04
Example 2: A research team is conducting a seroprevalence study among healthcare workers within the local healthcare system. They plan to stratify the sampling frame of healthcare workers by whether or not they work directly with COVID-19 patients and select random samples of $n=500$ within each stratum. They anticipate that seroprevalence among healthcare workers who do not work directly with COVID-19 patients will be $10\%$, and they want to know what the minimum detectable difference is assuming a Type I error rate of $0.05$ and $80\%$ power. Based on this design, the study would have $80\%$ power to detect a significant difference if the true seroprevalence among workers who work directly with COVID-19 patients is $15.9\%$.
library(stats)
power.prop.test(n = 500, p1 = 0.10, p2 = NULL,
                sig.level = 0.05, power = 0.80,
                alternative = "two.sided")
Two-sample comparison of proportions power calculation:
n = 500
p1 = 0.1
p2 = 0.1594911
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group