Review Article

Journal of Minimally Invasive Surgery 2023; 26(1): 9-18

Published online March 15, 2023

https://doi.org/10.7602/jmis.2023.26.1.9

© The Korean Society of Endo-Laparoscopic & Robotic Surgery

Sample size calculation in clinical trial using R

Suyeon Park1,2,3 , Yeong-Haw Kim3 , Hae In Bang4 , Youngho Park5

1Department of Biostatistics, Academic Research Office, Soonchunhyang University Seoul Hospital, Seoul, Korea
2International Development and Cooperation, Graduate School of Multidisciplinary Studies Toward Future, Soonchunhyang University, Asan, Korea
3Department of Applied Statistics, Chung-Ang University, Seoul, Korea
4Department of Laboratory Medicine, Soonchunhyang University Seoul Hospital, Seoul, Korea
5Department of Big Data Application, College of Smart Interdisciplinary Engineering, Hannam University, Daejeon, Korea

Correspondence to : Hae In Bang
Department of Laboratory Medicine, Soonchunhyang University Seoul Hospital, 59 Daesagwan-ro, Yongsan-gu, Seoul 04401, Korea
E-mail: genuine43@schmc.ac.kr
ORCID:
https://orcid.org/0000-0001-7854-3011

Youngho Park
Department of Big Data Application, College of Smart Interdisciplinary Engineering, Hannam University, 70 Hannamro, Daedeok-gu, Daejeon 34430, Korea
E-mail: yhpark@hnu.kr
ORCID:
https://orcid.org/0000-0002-7096-3967

Hae In Bang and Youngho Park contributed equally to this study as co-corresponding authors.

Received: February 20, 2023; Revised: March 12, 2023; Accepted: March 12, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the era of evidence-based medicine, using statistics to generate objective evidence has become a matter of course in clinical research. By extension, it has become essential to calculate the correct sample size, before a study begins, to demonstrate a clinically significant difference. Because sample size calculation methods vary with study design, no single formula applies to all designs, and it is important to understand this. In this review, we introduce sample size calculation methods suited to various study designs using the R program (R Foundation for Statistical Computing). So that clinical researchers can apply these methods directly to their future research, we present practice code, output, and interpretation of the results for each situation.

Keywords Sample size, Effect size, Continuous outcome, Categorical outcome

This article will cover the following topics: (1) Why is sample size calculation important?; (2) Components of sample size calculation; and (3) How to calculate the required sample size?

The main purpose of sample size calculation is to determine the minimum number of subjects required to detect a clinically relevant treatment effect. The fundamental reasons for calculating the number of subjects in a study fall into the following three categories [1,2].

Economic reasons

In clinical studies, if the sample is not large enough, statistical significance may not be reached even when an important relationship or difference exists; that is, the study may fail because it lacks the power to detect the effect. Conversely, when a study is based on a very large sample, trivially small differences may reach statistical significance and lead to clinical misjudgment. Either way, the study may fail to reach a sound conclusion, wasting money, time, and resources.

Ethical reasons

Oversized studies are likely to include more subjects than the study actually needs, exposing unnecessarily many subjects to potentially harmful or futile treatments. Similarly, in undersized studies, ethical issues may arise in that subjects are exposed to unnecessary situations in studies that may have low success rates.

Scientific reasons

If a negative result is obtained after conducting a study, it is necessary to consider whether the sample size was sufficient. If the study was conducted with a sufficient sample size, the result can be interpreted as the absence of a clinically significant effect. If, however, the sample size was insufficient, a clinically meaningful effect that would have shown a statistically significant difference may have been missed. Note that failing to reject the null hypothesis does not mean it is true; it means we do not have enough evidence to reject it.

Additionally, calculating sample size at the study design stage, when receiving ethics committee approval, has become a requirement rather than an option. As a result, calculating the optimal sample size is an important process that must be done at the design stage before a study is conducted in order to ensure the validity, accuracy, reliability, and scientific and ethical integrity of the study.

The appropriate sample size usually depends on the statistical hypotheses formed around the study's primary outcome and on the study design parameters. The six basic statistical concepts essential for estimating the sample size are as follows.

Study design

There are various research designs [3] in clinical research; the most commonly used is the parallel design. A crossover design [4,5] can be used when subjects are difficult to recruit. A crossover design requires fewer subjects than a parallel design but is more complex and must satisfy several conditions, so an appropriate design should be selected according to the purpose of the study.

Parallel design. Group A receives only treatment A and group B receives only treatment B.

Crossover design. It is a study in which one group receives treatment A first, then treatment B, and the other group receives treatment B and then treatment A. Therefore, it is important to have an appropriate wash-out period at the time of treatment change.

Null and alternative hypotheses testing

When establishing statistical hypotheses in research, two hypotheses are always required: the null hypothesis (H0) and the alternative hypothesis (H1 or Hα). The two hypotheses must be mutually exclusive statements. The null hypothesis (H0) usually states the opposite of what the researcher claims, typically 'no difference,' and is set up to be rejected. Conversely, the alternative hypothesis (H1) states the potential outcome the researcher proposes and includes 'there is a difference.' Different types of hypothesis testing problems arise depending on the purpose of the study. As shown in Table 1, hypotheses can be established for an equality, equivalence, superiority, or non-inferiority test. Let μS = mean of the standard treatment, μT = mean of the new treatment, δ = the minimum clinically important difference, and δNI = the non-inferiority margin.

Test for equality. To determine whether a clinically meaningful difference or effect exists (δ = 0).

Test for equivalence. To demonstrate the difference between the new treatment and standard treatment has no clinical importance.

Test for superiority. To demonstrate that the new treatment is superior to the standard treatment.

Test for non-inferiority. To demonstrate that the new treatment is not inferior to the standard treatment, that is, not worse by more than the margin δNI (δNI > 0).

One-sided and two-sided tests

A one-sided test examines whether a value is greater than (or less than) a reference value; in Table 1, the superiority and non-inferiority trials are one-sided. A two-sided test examines a difference in either direction, regardless of the hypothesized direction of the relationship; in Table 1, the equality and equivalence trials are two-sided.

Type I error and type II error

The hypothesis testing process is as follows: (1) assuming the null hypothesis is true, calculate the test statistic from the sample data, and (2) decide whether or not to reject the null hypothesis according to the result. That is, we always reach one of the four decisions shown in Table 2, and two types of error can occur: type I error (α) and type II error (β).

Type I error and significance level. The probability of rejecting the null hypothesis when it is actually true is called the type I error; this amounts to claiming that the alternative hypothesis is true when it is not. The type I error should therefore be kept as small as possible. The type I error rate is known as the significance level and is usually set to 0.05 (5%) or less [6,7]. For a given power and effect size, the smaller the allowed type I error, the larger the required sample size.

Type II error and power. The probability of failing to reject the null hypothesis when it is false is called the type II error, denoted β. Power, 1 – β, is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. Conventionally, the power is set to 80% or 90% when calculating the sample size [6,7]; the higher the desired power, the larger the required sample size.

Primary outcome

The variables in which clinically significant differences are sought may vary, but the most important one should be selected. This is called the primary outcome; the other measurements are referred to as secondary outcomes. The sample size is calculated using the primary outcome, and the parameter information needed for the calculation can be obtained from prior studies or pilot studies. Both continuous and categorical data can be used as primary outcomes, and the parameters used in the calculation, such as the minimal meaningful detectable difference (MD) and the standard deviation (SD), depend on the type of variable: for continuous data, the mean and SD are used as parameters; for categorical data, a proportion is used.

Minimal meaningful detectable difference (MD). The smallest difference considered clinically meaningful in the primary outcome.

Standard deviation (SD). It tells you how spread out the data is from the mean.

Dropout rate

The sample size estimation formula yields the minimum number of subjects required to reach statistical significance for a given hypothesis. However, subjects may drop out during an actual study, so to end up with the number of subjects the researcher needs, the total enrollment must be inflated by the expected dropout rate. If 'n' is the sample size calculated from the formula and 'dr' is the dropout rate, the adjusted sample size 'N' is given by: N = n / (1 – dr).
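As a quick arithmetic check, the dropout adjustment takes one line of R. The values below (n = 64 per group, 20% dropout, matching the two-sample t test example later in this article) are illustrative only.

```r
# Adjust a calculated sample size for an anticipated dropout rate.
adjust_for_dropout <- function(n, dr) {
  ceiling(n / (1 - dr))  # round up so the target n is still met after dropout
}

adjust_for_dropout(n = 64, dr = 0.2)  # 64 subjects needed, 20% dropout -> enroll 80
```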

Others

Depending on the study design, there are many more considerations in addition to the six concepts mentioned above. Although not considered in the practice below, we would like to mention three points that are frequently mentioned and used in actual clinical research to help researchers.

Adjustment for unequal sample size

In clinical trials, available patients, treatment costs, and treatment resources may influence the allocation ratio (k) decision. According to Lachin [8] and van Belle [9], it can be applied as follows.

(1) Calculate the sample size n per group, assuming equal numbers per group.

(2) Let k = n2/n1; then n1 = (1/2)n(1 + 1/k) and n2 = (1/2)n(1 + k).
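The two steps above can be sketched in R as follows; the values of n and k in the example call are arbitrary illustrations, not taken from the article.

```r
# Split a per-group size n (computed assuming 1:1 allocation) into two
# unequal groups with allocation ratio k = n2/n1 (Lachin; van Belle).
unequal_allocation <- function(n, k) {
  n1 <- ceiling(0.5 * n * (1 + 1 / k))
  n2 <- ceiling(0.5 * n * (1 + k))
  c(n1 = n1, n2 = n2)
}

unequal_allocation(n = 64, k = 2)  # 2:1 allocation -> n1 = 48, n2 = 96
```

With k = 1 the formula reduces to equal groups, as expected.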

Interim analysis

In confirmatory trials, an interim analysis, whether planned at the design stage or not, is sometimes performed. When calculating the number of subjects in this case, note that the false positive rate increases with the number of interim analyses, so the type I error must be adjusted accordingly.

Sample size for survival time

In survival analysis, the outcome is the time until a specific event (such as death) occurs; whether the event occurred for each subject and the time from the start of the trial to the event (or censoring) are used as outcome variables. The power of a survival analysis is a function of the number of events and generally increases with a shorter recruitment period (T0) and a longer total follow-up period (T). Letting λ1 and λ2 be the hazard rates of the two groups, the formula for calculating the number of subjects is [10]:

n = (Z1−α/2 + Z1−β)² [φ(λ1) + φ(λ2)] / (λ1 − λ2)², where φ(λ) = λ² / (1 − e^(−λT)) when all subjects are followed for the full period T, or φ(λ) = λ² / {1 − [e^(−λ(T−T0)) − e^(−λT)] / (λT0)} when subjects are recruited uniformly over the period T0.

However, for studies with relatively low event rates and high censoring, the following sample size formula using only event rates can be used:

n = 2(Z1−α/2 + Z1−β)² / [ln(λ2/λ1)]².
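Lachin's formula above (the version in which every subject is followed for the full period T) is simple enough to implement directly. The hazard rates and follow-up time in the example call are illustrative values, not taken from the article.

```r
# Per-group sample size for comparing two exponential hazard rates
# (Lachin, 1981), assuming every subject is followed for time T.
phi <- function(lambda, T) lambda^2 / (1 - exp(-lambda * T))

n_survival <- function(lambda1, lambda2, T, alpha = 0.05, power = 0.8) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  ceiling(z^2 * (phi(lambda1, T) + phi(lambda2, T)) / (lambda1 - lambda2)^2)
}

n_survival(lambda1 = 0.3, lambda2 = 0.5, T = 3)  # per-group n for these rates
```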

Using the 17 tests in Table 3, which are widely used in research, we present examples in R version 4.1.2 (R Foundation for Statistical Computing), a free program, with the 'pwr', 'exact2x2', and 'WebPower' [11] packages. In R, you first need to install the package that contains the function you want to use; after that, load the package with the library() function and call the function. More details are given in the examples below.

All examples use a parallel group design, a two-sided test with a significance level of 0.05, and a power of 80%. The dropout rate differs by research field, but here we unify it at 20%. For nonparametric tests on continuous variables, a rule of thumb [12] is to calculate the sample size required for the corresponding parametric test and add 15%. Effect size can be defined as 'a standardized measure of the magnitude of the mean difference or relationship between study groups' [13]. In other words, an index that divides the effect by its dispersion (e.g., the standard deviation) is not affected by the unit of measurement and can be used regardless of unit; it is called an 'effect size index' or 'standardized effect size.' Cohen categorized effect sizes as small, medium, and large for intuitive understanding [14]. However, because the values Cohen suggested may vary with the population or the distribution of the variable, there are limits to using them as absolute criteria. When estimating the number of subjects, effect sizes (such as Cohen's d, r, or a relative ratio) should be calculated from parameter information (MD and SD) found in the literature relevant to the primary outcome and entered as arguments to the function. Additionally, whether an effect size should be interpreted as small, medium, or large may depend on the analysis method. We follow the guidelines of Cohen [14] and Sawilowsky [15] and use the medium effect size for each test in the examples below.

When the primary outcome considered in the study is continuous, the sample size can be calculated using the 'pwr' package. You can compare the mean of a single group, two groups, or three or more groups, and Cohen's d and f are used as effect sizes. When applying this to your own study, take the parameters from a previous or pilot study and compute the effect size with the formulas below.

Practice 1

The pwr.t.test() function (Supplementary data 1, Table 1) can be used with the 'type' argument for (1) a one-sample t test (type = "one.sample"), (2) a two-sample t test (type = "two.sample"), or (3) a paired t test (type = "paired"). Cohen's d is used as the effect size, with size definitions [14,15] as follows: very small (d = 0.01), small (d = 0.2), medium (d = 0.5), large (d = 0.8), very large (d = 1.2), and huge (d = 2). In our examples, we use the medium effect size (d = 0.5).

One-sample t test (Table 3, no.1)

Assuming a significance level of 0.05 and a power of 80% in a two-sided test, the minimum number of subjects required to demonstrate statistical significance is 34 when the effect size is d = 0.5. Considering a dropout rate of 20%, a total of 43 subjects is required.
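The article's practice code is in the supplementary data; a sketch of the call that reproduces the numbers in this example might look like:

```r
library(pwr)  # install.packages("pwr") if not yet installed

# One-sample t test: medium effect size d = 0.5, two-sided alpha = 0.05, power = 80%
res <- pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8,
                  type = "one.sample", alternative = "two.sided")
n <- ceiling(res$n)     # minimum n: 34
ceiling(n / (1 - 0.2))  # with 20% dropout: 43
```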

Two-sample t test (Table 3, no. 2)

Assuming a significance level of 0.05 and a power of 80% in a two-sided test, the minimum number of subjects required per group to demonstrate statistical significance is 64 when the effect size is d = 0.5. Considering a dropout rate of 20%, 80 subjects are required per group, for a total of 160 subjects.
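A sketch of the corresponding call, changing only the 'type' argument:

```r
library(pwr)

# Two-sample t test: medium effect size d = 0.5, two-sided alpha = 0.05, power = 80%
res <- pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8,
                  type = "two.sample", alternative = "two.sided")
n <- ceiling(res$n)     # minimum n per group: 64
ceiling(n / (1 - 0.2))  # per group with 20% dropout: 80 (160 in total)
```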

Paired t test (Table 3, no. 3)

In the case of paired samples, if the correlation coefficient (r) between the before and after measurements is available, the pooled SD can be calculated as SDpool = √[(SD1² + SD2² − 2r·SD1·SD2) / (2(1 − r))]. Assuming a significance level of 0.05 and a power of 80% in a two-sided test, the minimum number of pairs required to demonstrate statistical significance is 34 when the effect size is d = 0.5. Considering a dropout rate of 20%, a total of 43 pairs is required.
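A sketch of the paired-test call; in practice d would be built as MD/SDpool from pilot data, but here we plug in the medium effect size d = 0.5 as in the text.

```r
library(pwr)

# Paired t test: medium effect size d = 0.5, two-sided alpha = 0.05, power = 80%
res <- pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8,
                  type = "paired", alternative = "two.sided")
n <- ceiling(res$n)     # minimum number of pairs: 34
ceiling(n / (1 - 0.2))  # with 20% dropout: 43 pairs
```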

One-sample Wilcoxon test (Table 3, no. 5)

By the one-sample t test, a total of 43 subjects was calculated; considering an additional 15% for the nonparametric test, a total of 65 subjects is required.

Mann-Whitney U test (Table 3, no. 6)

By the two-sample t test, 80 subjects per group were calculated; considering an additional 15% for each group, a total of 240 subjects is required.

Paired Wilcoxon test (Table 3, no. 7)

By the paired t test, 43 pairs were calculated; considering an additional 15%, a total of 65 pairs is required.

Practice 2

The pwr.anova.test() function (Supplementary data 1, Table 2) can be used in studies comparing the means of three or more groups. In this function, 'k' is the number of comparison groups and 'f' is the effect size; Cohen's f is used here. Cohen suggests that f values of 0.1, 0.25, and 0.4 indicate small, medium, and large effect sizes, respectively; we use the medium effect size (f = 0.25).

One-way analysis of variance (ANOVA) (Table 3, no. 4)

Assume a significance level of 0.05, a power of 80%, and a two-sided test. With three comparison groups and an effect size of f = 0.25, the calculated number of subjects is 53 per group. Considering a dropout rate of 20%, 66 subjects per group, or a total of 198, are required.
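A sketch of the ANOVA call matching this example (the dropout inflation is applied to the unrounded n, which reproduces the 66 per group stated in the text):

```r
library(pwr)

# One-way ANOVA with k = 3 groups, medium effect size f = 0.25
res <- pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.8)
ceiling(res$n)                # minimum n per group: 53
ceiling(res$n / (1 - 0.2))    # per group with 20% dropout: 66 (198 in total)
```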

Kruskal-Wallis test (Table 3, no. 8)

By one-way ANOVA, 66 subjects per group were calculated; considering an additional 15% for each group, a total of 297 subjects is required.

If the primary outcome considered in your study is categorical data, you can use the ‘pwr’ package for parametric tests and the ‘exact2x2’ package for nonparametric tests to calculate the number of samples.

Practice 3

The pwr.p.test() and pwr.2p.test() functions (Supplementary data 1, Table 3) are used when comparing one-sample and two-sample proportions, respectively. Cohen's h is used as the effect size; the calculation formula is h = 2·arcsin(√p1) − 2·arcsin(√p2). Cohen suggests that h values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes, respectively; we use the medium effect size (h = 0.5).

One-sample proportion test (Table 3, no. 9)

In the one-sample proportion test, p2 is the proportion under the null hypothesis and p1 is the proportion under the alternative hypothesis. Assuming a significance level of 0.05 and a power of 80% in a two-sided test, the minimum number of subjects required to demonstrate statistical significance is 32 when the effect size is h = 0.5. Considering a dropout rate of 20%, a total of 40 subjects is required.
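A sketch of the one-sample call; the pwr package's ES.h() helper can also compute h from two proportions.

```r
library(pwr)

# One-sample proportion test: medium effect size h = 0.5
# (h can be computed from proportions with ES.h(p1, p2))
res <- pwr.p.test(h = 0.5, sig.level = 0.05, power = 0.8,
                  alternative = "two.sided")
n <- ceiling(res$n)     # minimum n: 32
ceiling(n / (1 - 0.2))  # with 20% dropout: 40
```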

Two-sample proportion test (Table 3, no. 10)

Assuming a significance level of 0.05 and a power of 80% in a two-sided test, the minimum number of subjects required per group to demonstrate statistical significance is 63 when the effect size is h = 0.5. Considering a dropout rate of 20%, 79 subjects are required per group, for a total of 158 subjects.
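The two-sample version differs only in the function name:

```r
library(pwr)

# Two-sample proportion test: medium effect size h = 0.5
res <- pwr.2p.test(h = 0.5, sig.level = 0.05, power = 0.8)
n <- ceiling(res$n)     # minimum n per group: 63
ceiling(n / (1 - 0.2))  # per group with 20% dropout: 79 (158 in total)
```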

Practice 4

In the chi-square test, a commonly used method for measuring the association between categorical variables, Cohen's w is used as the effect size. The pwr.chisq.test() function (Supplementary data 1, Table 3) takes 'w' as the effect size argument and 'df' as the degrees of freedom. Assuming the two categorical variables have l and k categories, respectively, the contingency table has m = l × k cells and df = (l − 1) × (k − 1). Cohen suggests that w values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes, respectively; we use the medium effect size (w = 0.3).

Chi-square test (Table 3, no. 11)

Similarly, we assumed a significance level of 0.05 and a power of 80%. Examining the association between a two-category variable and a three-category variable (df = 2), the minimum required number of subjects is 107 when the effect size is w = 0.3. Considering a dropout rate of 20%, a total of 134 subjects is required.
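A sketch of the call for this 2 × 3 example (pwr.chisq.test returns a fractional total N, which we round as in the text):

```r
library(pwr)

# Chi-square test of association, 2 x 3 table: df = (2 - 1) * (3 - 1) = 2
res <- pwr.chisq.test(w = 0.3, df = 2, sig.level = 0.05, power = 0.8)
N <- round(res$N)       # total N: about 107
ceiling(N / (1 - 0.2))  # with 20% dropout: 134
```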

Practice 5

For nonparametric testing of categorical data, sample size calculation can be performed using the ss2x2() function [17] (Supplementary data 2, Table 1). The Fisher exact test and the McNemar test are considered, selected through the function's paired-data argument. In the example below, we set the event rate of the control group to 0.2 (p0 = 0.2), the event rate of the treatment group to 0.8 (p1 = 0.8), and the allocation ratio between groups to 1:1 (n1.over.n0 = 1).

Fisher exact test (Table 3, no. 12)

Assuming an event rate of 0.2 in the control group and 0.8 in the treatment group, with 1:1 allocation, a two-sided test with a significance level of 0.05 and a power of 80% yields 12 subjects per group. Considering a dropout rate of 20%, 15 subjects are required per group, for a total of 30 subjects.
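A sketch of the call; the argument names p0, p1, and n1.over.n0 follow the text, and we assume the unpaired default of ss2x2() gives the Fisher exact test (check the exact2x2 documentation for the precise argument spelling).

```r
library(exact2x2)  # provides ss2x2() for exact-test sample sizes

# Fisher exact test: control event rate 0.2, treatment event rate 0.8,
# 1:1 allocation, two-sided alpha = 0.05, power = 80%
res <- ss2x2(p0 = 0.2, p1 = 0.8, power = 0.8,
             n1.over.n0 = 1, sig.level = 0.05)
res  # prints the required sample size per group (12 before dropout adjustment)
```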

McNemar test (Table 3, no. 13)

Assuming an event rate of 0.2 in the matched control group and 0.8 in the matched case (or treatment) group, with 1:1 allocation, a two-sided test with a significance level of 0.05 and a power of 80% yields 13 subjects per group. Considering a dropout rate of 20%, 16 subjects are required per group, for a total of 32 subjects.

Correlation analysis determines whether there is a linear relationship between two continuous variables. The ‘pwr’ package will be used for this test.

Practice 6

The pwr.r.test() function (Supplementary data 1, Table 5) can be used in correlation analysis. The correlation coefficient (r) is used as a measure of effect size. Cohen suggests that r values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes respectively. We will use a medium effect size of 0.3.

Correlation analysis (Table 3, no. 14)

Assuming a significance level of 0.05 and a power of 80% in a two-sided test, the minimum number of subjects required to demonstrate statistical significance is 84 for an effect size of r = 0.3. Considering a dropout rate of 20%, 105 subjects are required.
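A sketch of the call (the function returns a fractional n, which we round as in the text):

```r
library(pwr)

# Correlation analysis: medium effect size r = 0.3
res <- pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.8)
n <- round(res$n)       # minimum n: about 84
ceiling(n / (1 - 0.2))  # with 20% dropout: 105
```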

Generalized linear models [18] were formulated to encompass a variety of statistical models, including linear regression, logistic regression, and Poisson regression. We will use the 'pwr' package for linear regression and the 'WebPower' package for logistic and Poisson regression.

Practice 7

The pwr.f2.test() function (Supplementary data 1, Table 6) can be used for multiple linear regression analysis. Cohen's f² is used as the effect size, computed from the R² goodness-of-fit value of the regression (Cohen's f² = R² / (1 − R²)). The argument 'u' is the number of predictors (or risk factors) considered in the analysis, and 'v' is n (the total number of subjects) − u − 1. That is, if you supply only u, the function returns v, from which the total number of subjects is obtained (n ≥ v + u + 1). Cohen suggests that f² values of 0.02, 0.15, and 0.35 represent small, medium, and large effect sizes; we use the medium effect size (f² = 0.15) and u = 3.

Linear regression (Table 3, no. 15)

Similarly, we assumed a significance level of 0.05 and a power of 80%. With three risk factors (u = 3) and an effect size of f² = 0.15, v = 73. A total of 77 (73 + 3 + 1) subjects is therefore calculated, and considering a dropout rate of 20%, 96 people should be recruited.
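A sketch of the regression call, assembling the total n from v as described above (the final figure is rounded to match the 96 stated in the text):

```r
library(pwr)

# Multiple linear regression: u = 3 predictors, f2 = R^2/(1 - R^2) = 0.15
res <- pwr.f2.test(u = 3, f2 = 0.15, sig.level = 0.05, power = 0.8)
v <- ceiling(res$v)   # v = 73
n <- v + 3 + 1        # total n = 77 (n >= v + u + 1)
round(n / (1 - 0.2))  # with 20% dropout: 96
```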

Practice 8

The wp.logistic() and wp.poisson() functions (Supplementary data 3, Tables 1 and 2) can be used for logistic and Poisson regression analysis, respectively. The two arguments 'family' and 'parameter' carry information about the distribution of the predictor (or risk factor). Default values are used when this information is unknown; you can change the parameter value if it is known.

Logistic regression [19] (Table 3, no. 16)

If the predictor (X) is a continuous variable, family = "normal" can be used with the default 'parameter' value. The probabilities p0 and p1 can be derived from the 1-SD range of X: set p1 to the event probability when X is within the range and p0 to the probability when it is outside. In this example, p0 = 0.15 and p1 = 0.1 were used. Similarly, we assumed a significance level of 0.05 and a power of 80%. The minimum number of samples satisfying these conditions is 299, and a total of 374 is required considering a dropout rate of 20%.
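A sketch of the call; the argument names follow the WebPower documentation as we understand it, so verify them against the package manual before use.

```r
library(WebPower)

# Logistic regression with a continuous (normally distributed) predictor
res <- wp.logistic(p0 = 0.15, p1 = 0.1, alpha = 0.05, power = 0.8,
                   family = "normal")
n <- ceiling(res$n)     # minimum n: 299
ceiling(n / (1 - 0.2))  # with 20% dropout: 374
```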

Poisson regression [20] (Table 3, no. 17)

If the predictor (X) is a binary variable, family = "bernoulli" can be used with the default 'parameter' value. For exp0, a base rate of 1 under the null hypothesis was used, and for exp1, an expected relative risk of 1.2 was set as the relative increase in the event rate. Similarly, we assumed a significance level of 0.05 and a power of 80%. The minimum number of samples satisfying these conditions is 866, and a total of 1,083 is required considering a dropout rate of 20%.
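A sketch of the Poisson call; note that we spell the family as "Bernoulli" (capitalized) following our reading of the WebPower manual, whereas the text prints it lowercase, so check the package documentation for the exact value.

```r
library(WebPower)

# Poisson regression with a binary predictor:
# exp0 = base rate under H0, exp1 = expected relative risk
res <- wp.poisson(exp0 = 1, exp1 = 1.2, alpha = 0.05, power = 0.8,
                  family = "Bernoulli")
n <- ceiling(res$n)     # minimum n: 866
ceiling(n / (1 - 0.2))  # with 20% dropout: 1,083
```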

In conclusion, sample size calculation plays the most important role in the research design process before starting the study. In particular, since randomized controlled trial studies, which are frequently conducted in clinical settings, are directly related to cost issues, the number of samples must be carefully calculated. However, although there are various references related to sample size calculation, it can be difficult to correctly use a method suitable for your own study. Of course, it would be better to seek expert advice for more complex studies, but we hope that this article will help researchers calculate the right number of subjects for their own research.

Authors’ contributions

Conceptualization: YHK, HIB, YP

Data curation: SP, YHK

Formal analysis: SP, HIB

Investigation: SP, HIB

Methodology: SP, YHK

Project administration: YHK, YP

Visualization: HIB

Writing–Original Draft: SP, HIB

Writing–Review & Editing: All authors

Conflict of interest

All authors have no conflicts of interest to declare.

Funding/support

This work was supported by the Soonchunhyang University Research Fund.

Supplementary materials

Supplementary data 1–3 can be found via https://doi.org/10.7602/jmis.2023.26.1.9.

jmis-26-1-9-supple.pdf
Table 1.

Types of hypothesis testing

Test for        | Null hypothesis (H0) | Alternative hypothesis (H1)
Equality        | μT − μS = 0          | μT − μS ≠ 0
Equivalence     | |μT − μS| ≥ δ        | |μT − μS| < δ
Superiority     | μT − μS ≤ δ          | μT − μS > δ
Non-inferiority | μT − μS ≤ −δNI       | μT − μS > −δNI

Table 2.

Type I and type II error

True status | Decision: accept H0 | Decision: reject H0
H0          | Correct decision    | Type I error (α)
H1          | Type II error (β)   | Correct decision

Table 3.

Tests for calculating sample size

No. | Type                      | No. of groups | Name                       | R package | Function
1   | Continuous/Parametric     | 1             | One-sample t test          | pwr       | pwr.t.test
2   |                           | 2             | Two-sample t test          | pwr       | pwr.t.test
3   |                           | 2             | Paired t test              | pwr       | pwr.t.test
4   |                           | ≥3            | One-way ANOVA              | pwr       | pwr.anova.test
5   | Continuous/Nonparametric  | 1             | One-sample Wilcoxon test   | pwr       | pwr.t.test
6   |                           | 2             | Mann-Whitney U test        | pwr       | pwr.t.test
7   |                           | 2             | Paired Wilcoxon test       | pwr       | pwr.t.test
8   |                           | ≥3            | Kruskal-Wallis test        | pwr       | pwr.anova.test
9   | Categorical/Parametric    | 1             | One-sample proportion test | pwr       | pwr.p.test
10  |                           | 2             | Two-sample proportion test | pwr       | pwr.2p.test
11  |                           | -             | Chi-square test            | pwr       | pwr.chisq.test
12  | Categorical/Nonparametric | 2             | Fisher exact test          | exact2x2  | ss2x2
13  |                           | 2             | McNemar test               | exact2x2  | ss2x2
14  | -                         | -             | Correlation analysis       | pwr       | pwr.r.test
15  | -                         | -             | Linear regression          | pwr       | pwr.f2.test
16  | -                         | -             | Logistic regression        | WebPower  | wp.logistic
17  | -                         | -             | Poisson regression         | WebPower  | wp.poisson

ANOVA, analysis of variance.


  1. Altman DG. Statistics and ethics in medical research: III. How large a sample? Br Med J 1980;281:1336-1338.
  2. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA 1994;272:122-124.
  3. Foulkes M. Study designs, objectives, and hypotheses [Internet]. Johns Hopkins Bloomberg School of Public Health; 2008 [cited 2023 Feb 20]. Available from: https://docplayer.net/38128249-Study-designs-objectives-and-hypotheses-mary-foulkes-phd-johns-hopkins-university.html.
  4. Bose M, Dey A. Optimal crossover designs. World Scientific; 2009.
  5. Johnson DE. Crossover experiments. WIREs Comp Stat 2010;2:620-625.
  6. European Medicines Agency. ICH E9: statistical principles for clinical trials - Step 5 [Internet]. European Medicines Agency; 1998 [cited 2023 Feb 20]. Available from: https://www.ema.europa.eu/en/documents/scientific-guideline/ich-e-9-statistical-principles-clinical-trials-step-5_en.pdf.
  7. Chow SC, Shao J, Wang H, Lokhnygina Y. Sample size calculation in clinical trial. Marcel Dekker Inc; 2003.
  8. Lachin JM. Sample size, power, and efficiency. In: Lachin JM. Biostatistical methods. John Wiley & Sons; 2000. p. 61-86.
  9. van Belle G. Statistical rules of thumb. 2nd ed. John Wiley & Sons; 2011.
  10. Lachin JM. Introduction to sample size determination and power analysis for clinical trials. Control Clin Trials 1981;2:93-113.
  11. Zhang Z, Yuan KH. Practical statistical power analysis using WebPower and R. ISDSA Press; 2018.
  12. Lehmann EL. Nonparametrics: statistical methods based on ranks. Springer; 2006.
  13. McGraw KO, Wong SP. A common language effect size statistic. Psychol Bull 1992;111:361-365.
  14. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Taylor & Francis; 2013.
  15. Sawilowsky SS. New effect size rules of thumb. J Mod Appl Stat Methods 2009;8:26.
  16. Bonett DG. Confidence intervals for standardized linear contrasts of means. Psychol Methods 2008;13:99-109.
  17. Fleiss JL, Levin B, Paik MC. Statistical methods for rates and proportions. 3rd ed. John Wiley & Sons; 2013.
  18. Nelder JA, Wedderburn RW. Generalized linear models. J R Stat Soc Ser A 1972;135:370-384.
  19. Demidenko E. Sample size determination for logistic regression revisited. Stat Med 2007;26:3385-3397.
  20. Cohen J. A power primer. Psychol Bull 1992;112:155-159.

Article

Review Article

Journal of Minimally Invasive Surgery 2023; 26(1): 9-18

Published online March 15, 2023 https://doi.org/10.7602/jmis.2023.26.1.9

Copyright © The Korean Society of Endo-Laparoscopic & Robotic Surgery.

Sample size calculation in clinical trial using R


Abstract

Since the advent of evidence-based medicine, using statistics to generate objective evidence has become standard practice in clinical research. By extension, it has become essential to calculate, before starting a study, the sample size required to demonstrate a clinically significant difference. Moreover, because sample size calculation methods vary with study design, no single formula applies to all designs, and it is important to understand this. In this review, we introduce sample size calculation methods suited to various study designs using the R program (R Foundation for Statistical Computing). So that clinical researchers can apply them directly to future studies, we present practice code, output, and interpretation of the results for each situation.

Keywords: Sample size, Effect size, Continuous outcome, Categorical outcome

INTRODUCTION

This article will cover the following topics: (1) Why is sample size calculation important?; (2) Components of sample size calculation; and (3) How to calculate the required sample size?

WHY IS SAMPLE SIZE CALCULATION IMPORTANT?

The main purpose of sample size calculation is to determine the minimum number of subjects required to detect a clinically relevant treatment effect. The fundamental reasons for calculating the number of subjects in a study fall into the following three categories [1,2].

Economic reasons

In clinical studies, if the sample is not large enough, statistical significance may not be reached even when an important relationship or difference exists; the study may simply lack the power to detect the effect and therefore fail to reach a sound conclusion. Conversely, when a study is based on a very large sample, trivially small differences may become statistically significant and lead to clinical misjudgment. Either way, the study fails to serve its purpose, and the result is a waste of money, time, and resources.

Ethical reasons

Oversized studies are likely to include more subjects than are actually needed, exposing an unnecessarily large number of subjects to potentially harmful or futile treatments. Similarly, undersized studies raise ethical issues by exposing subjects to the burdens of a study that has a low chance of success.

Scientific reasons

If a negative result is obtained after conducting a study, it is necessary to consider whether the sample size was sufficient. If the study was conducted with a sufficient sample size, it can be interpreted that there is no clinically significant effect. However, if the study was conducted with an insufficient sample size, clinically meaningful effects that would have shown statistically significant differences may be missed. Note that failing to reject the null hypothesis does not mean that it is true; it means that we do not have enough evidence to reject it.

Additionally, calculating sample size at the study design stage, when receiving ethics committee approval, has become a requirement rather than an option. As a result, calculating the optimal sample size is an important process that must be done at the design stage before a study is conducted in order to ensure the validity, accuracy, reliability, and scientific and ethical integrity of the study.

COMPONENTS OF SAMPLE SIZE CALCULATION

Appropriate sample size usually depends on the statistical hypotheses made with the study's primary outcome and on the study design parameters. The six basic statistical concepts essential for estimating the sample size are as follows.

Study design

There are various research designs [3] in clinical research, but among them, the most commonly used is the parallel design. A crossover design [4,5] can be used in studies where it is difficult to recruit enough subjects. A crossover design requires fewer subjects than a parallel design but is more complex and must satisfy several conditions, so an appropriate design should be selected according to the purpose of the study.

Parallel design. Group A receives only treatment A and group B receives only treatment B.

Crossover design. It is a study in which one group receives treatment A first, then treatment B, and the other group receives treatment B and then treatment A. Therefore, it is important to have an appropriate wash-out period at the time of treatment change.

Null and alternative hypotheses testing

When establishing statistical hypotheses in research, two hypotheses are always required: the null hypothesis (H0) and the alternative hypothesis (H1 or Hα). The two hypotheses must be mutually exclusive statements. The null hypothesis (H0) usually states the opposite of what the researcher claims and is set up to be rejected; that is, it typically includes a statement of 'no difference.' Conversely, the alternative hypothesis (H1) is the statement in which the researcher proposes a potential outcome, and it includes 'there is a difference.' There are also different types of hypothesis testing problems, depending on the purpose of the study. As shown in Table 1, hypotheses can be established depending on whether the test is for equality, equivalence, superiority, or non-inferiority. Let μS = mean of standard treatment, μT = mean of new treatment, δ = the minimum clinically important difference, and δNI = the non-inferiority margin.

Test for equality. To determine whether a clinically meaningful difference or effect exists (δ = 0).

Test for equivalence. To demonstrate the difference between the new treatment and standard treatment has no clinical importance.

Test for superiority. To demonstrate that the new treatment is superior to the standard treatment.

Test for non-inferiority. To evaluate whether a new treatment, even if somewhat less effective, is not inferior to the standard treatment by more than a prespecified margin (δNI > 0).

One-sided and two-sided tests

A one-sided test examines a difference in only one direction, that is, whether a value is greater than (or less than) a specified value; in Table 1, superiority and non-inferiority trials are one-sided. A two-sided test examines differences in both directions, regardless of the hypothesized direction of the relationship; in Table 1, equality and equivalence trials are two-sided.

Type I error and type II error

The hypothesis testing process is as follows: (1) assume that the null hypothesis is true, calculate the test statistic using sample data and (2) decide whether or not to reject the null hypothesis according to the result. That is, we always choose one of the four decisions shown in Table 2 and two types of errors inevitably occur: type I errors (α) and type II errors (β).

Type I error and significance level. The probability of rejecting the null hypothesis when it is actually true is called a type I error. This essentially means saying that the alternative hypothesis is true when it is not true. Therefore, it is recommended to keep the type I error as small as possible. The type I error rate is known as the significance level and is usually set to 0.05 (5%) or less [6,7]. Type I error is inversely proportional to sample size.

Type II error and power. The probability of not rejecting the null hypothesis when it is false is called a type II error. Conversely, power is the probability that the test will correctly reject the null hypothesis when the alternative hypothesis is true. Type II error can be denoted as β and power as 1 – β. Conventionally, the power is set to 80% or 90% [6,7] when calculating the sample size and it is directly proportional to the sample size.

Primary outcome

Several variables may show clinically significant differences, but the most important among them should be selected. This is called the primary outcome, and the other measurements are referred to as secondary outcomes. The sample size is calculated using the primary outcome, and the parameter information needed for the calculation can be obtained from prior studies or pilot studies. Both continuous and categorical data can be used as primary outcomes, and the parameters used in the calculation, such as the minimal meaningful detectable difference (MD) and standard deviation (SD), depend on the type of variable: for continuous data, a mean and SD are used as parameters; for categorical data, a proportion.

Minimal meaningful detectable difference (MD). The smallest difference considered clinically meaningful in the primary outcome.

Standard deviation (SD). It tells you how spread out the data is from the mean.

Dropout rate

The sample size estimation formula yields the minimum number of subjects required to meet statistical significance for a given hypothesis. However, in an actual study, subject dropout may occur during the study, so in order to satisfy all the number of subjects desired by the researcher, the total number of subjects considering the dropout rate must be calculated so that more subjects can be enrolled. If ‘n’ is the number of samples calculated according to the formula and ‘dr’ is the dropout rate, then the adjusted sample size ‘N’ is given by: N = n / (1 – dr).
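The adjustment above is simple arithmetic; as a minimal R sketch (the helper name adjust_for_dropout is ours, not from any package):

```r
# Inflate a calculated sample size 'n' for an anticipated dropout rate 'dr',
# rounding up so that at least n subjects remain after dropout: N = n / (1 - dr).
adjust_for_dropout <- function(n, dr) {
  ceiling(n / (1 - dr))
}

adjust_for_dropout(64, 0.2)  # 64 subjects per group with 20% dropout -> 80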

Others

Depending on the study design, there are many more considerations in addition to the six concepts mentioned above. Although not considered in the practice below, we would like to mention three points that are frequently mentioned and used in actual clinical research to help researchers.

Adjustment for unequal sample size

In clinical trials, available patients, treatment costs, and treatment resources may influence the allocation ratio (k) decision. According to Lachin [8] and van Belle [9], it can be applied as follows.

(1) Calculate the sample size n per group, assuming equal numbers per group.

(2) Let k = n2/n1 be the allocation ratio. Then n1 = (n/2)(1 + 1/k) and n2 = (n/2)(1 + k).
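The two steps can be sketched in R as follows (the helper name unequal_n is ours):

```r
# Split an equal-allocation per-group size n into n1 and n2 for an
# allocation ratio k = n2/n1 (Lachin [8]; van Belle [9]).
unequal_n <- function(n, k) {
  n1 <- ceiling(0.5 * n * (1 + 1 / k))
  n2 <- ceiling(0.5 * n * (1 + k))
  c(n1 = n1, n2 = n2)
}

unequal_n(64, 2)  # 2:1 allocation of a per-group n of 64: n1 = 48, n2 = 96
```

Note that the total (n1 + n2) is always at least 2n; equal allocation (k = 1) is the most efficient.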

Interim analysis

In confirmatory trials, interim analyses, whether planned at the design stage or unplanned, are sometimes performed. When calculating the number of subjects in such cases, note that the false positive rate increases with the number of interim analyses, so the type I error must be adjusted accordingly.

Sample size for survival time

In survival analysis, the outcome variable is the time until a specific event such as death occurs; both whether the event occurs for each subject and the time from the start of the clinical trial to the event (or censoring) are used as outcome variables. In particular, the power of survival analysis is a function of the number of events and generally increases with a shorter recruitment period (T0) and a longer total follow-up period (T). Let λ1 and λ2 be the hazard rates of the two groups; the formula for calculating the number of subjects per group is [10]:

n = (z(1−α/2) + z(1−β))^2 [φ(λ1) + φ(λ2)] / (λ1 − λ2)^2, where φ(λ) = λ^2 / (1 − e^(−λT)) or φ(λ) = λ^2 / {1 − [e^(−λ(T−T0)) − e^(−λT)] / (λT0)}.

However, for studies with relatively low event rates and heavy censoring, the following simplified formula based only on the hazard ratio can be used:

n = 2(z(1−α/2) + z(1−β))^2 / [ln(λ2/λ1)]^2.
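These two formulas translate directly into R; the function names below are ours, and the standard normal quantiles come from qnorm():

```r
# Variance factor phi(lambda) from Lachin [10]. T.total is the total
# follow-up period; T0 is the recruitment period (NULL means the first
# form of phi, i.e. all subjects recruited at the start).
phi <- function(lambda, T.total, T0 = NULL) {
  if (is.null(T0)) {
    lambda^2 / (1 - exp(-lambda * T.total))
  } else {
    lambda^2 /
      (1 - (exp(-lambda * (T.total - T0)) - exp(-lambda * T.total)) / (lambda * T0))
  }
}

# Per-group sample size for comparing hazard rates lambda1 and lambda2.
n.survival <- function(lambda1, lambda2, T.total, T0 = NULL,
                       alpha = 0.05, power = 0.80) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  ceiling(z^2 * (phi(lambda1, T.total, T0) + phi(lambda2, T.total, T0)) /
            (lambda1 - lambda2)^2)
}

# Simplified formula based only on the hazard ratio lambda2/lambda1.
n.hr <- function(lambda1, lambda2, alpha = 0.05, power = 0.80) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  ceiling(2 * z^2 / log(lambda2 / lambda1)^2)
}
```
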

HOW TO CALCULATE THE SAMPLE SIZE?

For the 17 tests in Table 3, which are widely used in research, we show examples using R version 4.1.2 (R Foundation for Statistical Computing), a free software environment, with the 'pwr', 'exact2x2', and 'WebPower' [11] packages. In R, you first install the package containing the function you need, and then load it with the 'library()' function before calling the function. More details are explained in the examples below.
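For example, the three packages used in this review can be installed once and then loaded in each session:

```r
# Install the packages used below (run once).
install.packages(c("pwr", "exact2x2", "WebPower"))

# Load them before calling their functions.
library(pwr)
library(exact2x2)
library(WebPower)
```
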

All examples use a parallel group design with a two-tailed test, a significance level of 0.05, and a power of 80%. The dropout rate differs by research field, but here we unify it at 20%. For nonparametric tests on continuous variables, a common rule of thumb [12] is to calculate the sample size required for the corresponding parametric test and add 15%. Effect size can be defined as 'a standardized measure of the magnitude of the mean difference or relationship between study groups' [13]. In other words, an index obtained by dividing the difference by its dispersion (standard deviation, etc.) is not affected by the measurement unit and can be used regardless of the unit; it is called an 'effect size index' or 'standardized effect size.' Cohen introduced the intuitive labels small, medium, and large for easy understanding [14]. However, since the values presented by Cohen may vary depending on the population or the distribution of the variable, there are limitations to using them as absolute benchmarks. When estimating the number of subjects for your own study, effect sizes (such as Cohen's d, r, or a relative ratio) should be calculated from parameter information (MD and SD) found in the literature relevant to the primary outcome and entered as arguments to the function. Additionally, whether an effect size is interpreted as small, medium, or large may depend on the analysis method. We follow the guidelines of Cohen [14] and Sawilowsky [15] and use the medium effect size for each test in the examples below.

CONTINUOUS OUTCOME

When the primary outcome considered in the study is continuous data, the number of samples can be calculated using the ‘pwr’ package. At this time, you can consider comparing the mean of a single group, two groups, or more than three groups, and Cohen’s d and f will be used for the effect size. When applied to your study, parameters can be taken from a previous or pilot study and calculated using the effect size calculation formula below.

Practice 1

The pwr.t.test() function (Supplementary data 1, Table 1) can be used with the 'type' argument for (1) a one-sample t test (type = "one.sample"), (2) a two-sample t test (type = "two.sample"), or (3) a paired t test (type = "paired"). Cohen's d is used as the effect size, with the size definitions [14,15] as follows: very small (d = 0.01), small (d = 0.2), medium (d = 0.5), large (d = 0.8), very large (d = 1.2), and huge (d = 2). In our examples, we use the medium effect size (d = 0.5).

One-sample t test (Table 3, no.1)

Assuming a p-value of 0.05 and a power of 80% in a two-tailed test, the minimum number of subjects required to demonstrate statistical significance is 34 when the effect size d = 0.5. Considering the dropout rate of 20%, a total of 43 samples are required.

Two-sample t test (Table 3, no. 2)

Assuming a p-value of 0.05 and a power of 80% in a two-tailed test, the minimum number of subjects required for each group to demonstrate statistical significance is 64 when the effect size d = 0.5. Considering a dropout rate of 20%, 80 subjects are required for each group, for a total of 160 subjects.

Paired t test (Table 3, no. 3)

In the case of paired samples, if the correlation coefficient (r) between the before and after measurements is known, the pooled standard deviation can be calculated as SDpool = sqrt[(SD1^2 + SD2^2 − 2r·SD1·SD2) / (2(1 − r))]. Assuming a p-value of 0.05 and a power of 80% in a two-tailed test, the minimum number of pairs required to demonstrate statistical significance is 34 when the effect size d = 0.5. Considering the dropout rate of 20%, a total of 43 pairs are required.
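The three calls for Practice 1 can be written as follows (d = 0.5, significance level 0.05, power 80%); each call returns the minimum n, which is then adjusted for 20% dropout as described above:

```r
library(pwr)

# One-sample t test: returns n = 33.4, rounded up to 34 subjects.
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "one.sample")

# Two-sample t test: returns n = 63.8 per group, rounded up to 64.
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")

# Paired t test: same calculation as one.sample, i.e. 34 pairs.
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "paired")
```
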

One-sample Wilcoxon test (Table 3, no. 5)

The one-sample t test above required a total of 43 subjects; adding 15% gives a total of 50.

Mann-Whitney U test (Table 3, no. 6)

By the two-sample t test, 80 subjects per group were calculated; considering an additional 15% for each group gives 92 per group, for a total of 184.

Paired Wilcoxon test (Table 3, no. 7)

The paired t test required 43 pairs; taking an additional 15% into account, a total of 50 pairs is required.

Practice 2

The pwr.anova.test() function (Supplementary data 1, Table 2) can be used in studies comparing the means of three or more groups. In this function, 'k' is the number of comparison groups and 'f' is the effect size; Cohen's f is used here. The detailed calculation formula can be found below. Cohen suggests that f values of 0.1, 0.25, and 0.4 indicate small, medium, and large effect sizes, respectively. We will use the medium effect size (f = 0.25).

One-way analysis of variance (ANOVA) (Table 3, no. 4)

Assume that the p-value is 0.05, the power is 80%, and the two-tailed test is performed. When the total comparison group was three groups and the effect size value was 0.25, the number of subjects calculated was 53 in each group. Considering a dropout rate of 20%, a total of 198 samples are required, which is calculated as 66 per group.
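This result can be reproduced with the call described in Practice 2:

```r
library(pwr)

# One-way ANOVA with k = 3 groups and medium effect size f = 0.25:
# returns n = 52.4 per group, rounded up to 53 before dropout adjustment.
pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.8)
```
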

Kruskal-Wallis test (Table 3, no. 8)

By one-way ANOVA, 66 subjects per group were calculated; considering an additional 15% for each group gives 76 per group, for a total of 228.

CATEGORICAL OUTCOME

If the primary outcome considered in your study is categorical data, you can use the ‘pwr’ package for parametric tests and the ‘exact2x2’ package for nonparametric tests to calculate the number of samples.

Practice 3

The pwr.p.test() and pwr.2p.test() functions (Supplementary data 1, Table 3) are used when comparing one-sample and two-sample proportions, respectively. Cohen's h is used here as the effect size. The calculation formula is: h = 2·arcsin(√p1) − 2·arcsin(√p2). Cohen suggests that h values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes, respectively. We will use the medium effect size (h = 0.5).

One-sample proportion test (Table 3, no. 9)

In the one-sample proportion test, p2 is the proportion under the null hypothesis and p1 is the proportion under the alternative hypothesis. Assuming a p-value of 0.05 and a power of 80% in a two-tailed test, the minimum number of subjects required to demonstrate statistical significance is 32 when the effect size h = 0.5. Considering the dropout rate of 20%, a total of 40 samples are required.

Two-sample proportion test (Table 3, no. 10)

Assuming a p-value of 0.05 and a power of 80% in a two-tailed test, the minimum number of subjects required for each group to demonstrate statistical significance is 63 when the effect size h = 0.5. Considering a dropout rate of 20%, 79 subjects are required for each group, for a total of 158 subjects.
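The two calls for Practice 3 are shown below; the pwr package also provides ES.h() to compute Cohen's h from a specific pair of proportions:

```r
library(pwr)

# One-sample proportion test, medium effect size h = 0.5:
# returns n = 31.4, rounded up to 32 subjects.
pwr.p.test(h = 0.5, sig.level = 0.05, power = 0.8)

# Two-sample proportion test, h = 0.5: returns n = 62.8 per group -> 63.
pwr.2p.test(h = 0.5, sig.level = 0.05, power = 0.8)

# Cohen's h for specific proportions, e.g. p1 = 0.6 vs p2 = 0.4:
ES.h(0.6, 0.4)
```
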

Practice 4

In the chi-square test, which is a commonly used method for measuring the association between categorical data, Cohen’s w is used as a measure of effect size. The pwr.chisq.test() function (Supplementary data 1, Table 3) takes ‘w’ as an argument for effect size and ‘df’ as an argument for degrees of freedom. Assuming that two categorical variables have l categories and k categories, respectively, we can create a contingency table consisting of a total of m = l × k cells, and the ‘df’ in this table is (l − 1) × (k − 1). Cohen suggests that w values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes respectively. We will use medium effect size = 0.3.

Chi-square test (Table 3, no. 11)

Similarly, we assumed a p-value of 0.05 and a power of 80%. Looking at the association between a two-category variable and a three-category variable, the minimum required number of subjects is 107 when the effect size is 0.3. Considering the dropout rate of 20%, a total of 134 subjects is needed.
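The corresponding call, with df = (2 − 1) × (3 − 1) = 2:

```r
library(pwr)

# Chi-square test of association between a 2-category and a 3-category
# variable (df = 2), medium effect size w = 0.3:
# returns total N = 107.1, rounded up to 107 before dropout adjustment.
pwr.chisq.test(w = 0.3, df = 2, sig.level = 0.05, power = 0.8)
```
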

Practice 5

For nonparametric testing of categorical data, sample size calculation can be performed using the ss2x2() function [17] (Supplementary data 2, Table 1). The Fisher exact test and McNemar test are both covered, distinguished by the pairing argument. In the example below, we set the event rate for the control group to 0.2 (p0 = 0.2), the event rate for the treatment group to 0.8 (p1 = 0.8), and the allocation ratio between groups to 1:1 (n1.over.n0 = 1).

Fisher exact test (Table 3, no. 12)

Assuming that the event rate of the control group was 0.2 and that of the treatment group was 0.8, the allocation ratio of each group was set at 1:1. If a two-sided test is performed with a significance level of 0.05 and a power of 80%, 12 samples are calculated for each group. Considering a dropout rate of 20%, 15 subjects are required for each group, for a total of 30 subjects.

McNemar test (Table 3, no. 13)

Assuming that the event rate of the matched control group was 0.2 and that of the matched case (or treatment) group was 0.8, the allocation ratio of each group was set at 1:1. If a two-sided test is performed with a significance level of 0.05 and a power of 80%, 13 samples are calculated for each group. Considering a dropout rate of 20%, 16 subjects are required for each group, for a total of 32 subjects.
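A sketch of the two calls is below. The argument names (p0, p1, n1.over.n0, sig.level, power, and a paired-design flag) are taken from the article's description and the exact2x2 package; since signatures can differ across package versions, verify them with ?ss2x2 before use:

```r
library(exact2x2)

# Fisher exact test (independent groups): control rate 0.2, treatment
# rate 0.8, 1:1 allocation, two-sided alpha 0.05, power 80%.
ss2x2(p0 = 0.2, p1 = 0.8, n1.over.n0 = 1, sig.level = 0.05, power = 0.8)

# McNemar test (matched pairs): set the pairing flag (described as
# 'pair' in this article; assumed 'paired' here) for the paired design.
ss2x2(p0 = 0.2, p1 = 0.8, n1.over.n0 = 1, sig.level = 0.05, power = 0.8,
      paired = TRUE)
```
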

CORRELATION ANALYSIS

Correlation analysis determines whether there is a linear relationship between two continuous variables. The ‘pwr’ package will be used for this test.

Practice 6

The pwr.r.test() function (Supplementary data 1, Table 5) can be used in correlation analysis. The correlation coefficient (r) is used as a measure of effect size. Cohen suggests that r values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes respectively. We will use a medium effect size of 0.3.

Correlation analysis (Table 3, no. 14)

Assuming a p-value of 0.05 and a power of 80% in a two-tailed test, the minimum number of subjects required to demonstrate statistical significance is 84 for an effect size of r = 0.3. Considering a dropout rate of 20%, 105 subjects are required.
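The corresponding call for Practice 6:

```r
library(pwr)

# Correlation analysis, medium effect size r = 0.3:
# returns the minimum n (84 subjects in the article's rounding).
pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.8)
```
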

GENERALIZED LINEAR MODEL

Generalized linear models [18] were formulated as a way to unify a variety of statistical models, including linear regression, logistic regression, and Poisson regression. We will use the 'pwr' package for linear regression and the 'WebPower' package for logistic and Poisson regression.

Practice 7

The pwr.f2.test() function (Supplementary data 1, Table 6) can be used for multiple linear regression analysis. We use Cohen's f2 as the effect size, based on the R2 value that measures goodness of fit in regression analysis (Cohen's f2 = R2/(1 − R2)). The 'u' is the number of predictors (or risk factors) considered in the analysis, and the 'v' is n (the total number of subjects) − u − 1. That is, if you supply only the value of u to the function, the value of v is calculated, and this value is used to obtain the total number of subjects (n ≥ v + u + 1). Cohen suggests f2 values of 0.02, 0.15, and 0.35 represent small, medium, and large effect sizes. We will use the medium effect size (f2 = 0.15) and u = 3.

Linear regression (Table 3, no. 15)

Similarly, we assumed a p-value of 0.05 and a power of 80%. Considering three risk factors (u = 3) and an effect size of 0.15, v = 73. Finally, a total of 77 (73 + 3 + 1) subjects is calculated; considering a dropout rate of 20%, 96 people should be recruited.
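The corresponding call for Practice 7:

```r
library(pwr)

# Multiple linear regression with u = 3 predictors, medium f2 = 0.15:
# returns v = 72.7, rounded up to 73, so n = 73 + 3 + 1 = 77 subjects.
pwr.f2.test(u = 3, f2 = 0.15, sig.level = 0.05, power = 0.8)
```
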

Practice 8

The wp.logistic() and wp.poisson() functions (Supplementary data 3, Tables 1 and 2) can be used for logistic and Poisson regression analysis, respectively. The two arguments 'family' and 'parameter' contain information about the distribution of the predictor (or risk factor). Default values are used when this information is unknown; you can change the parameter values if they are known.

Logistic regression [19] (Table 3, no. 16)

If the predictor (X) is a continuous variable, family = "normal" can be used, with 'parameter' left at its default. p0 and p1 can be determined using the 1-SD range of X: set p1 to the probability of the event within the range and p0 to the probability outside it. In this example, p0 = 0.15 and p1 = 0.1 were used. Similarly, we assumed a p-value of 0.05 and a power of 80%. The minimum number of samples satisfying these conditions is 299, and a total of 374 is required considering the dropout rate of 20%.

Poisson regression [20] (Table 3, no. 17)

If predictor (X) is a binary variable, it can be used as family = “bernoulli” and the ‘parameter’ will be used as its default value. For exp0, a base rate of 1 under the null hypothesis was used, and for exp1, expected relative risk = 1.2 was set as the relative increment of the event rate. Similarly, we assumed a p-value of 0.05 and a power of 80%. The minimum number of samples satisfying these conditions is 866, and a total of 1083 is required considering the dropout rate of 20%.
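The two calls for Practice 8 can be sketched as below. Note one assumption: the WebPower documentation lists the binary-predictor family with a capitalized label ("Bernoulli"), although the text writes it lowercase; check ?wp.poisson on your installed version:

```r
library(WebPower)

# Logistic regression with a normally distributed predictor:
# p0 = 0.15, p1 = 0.1, two-sided alpha 0.05, power 80%.
wp.logistic(p0 = 0.15, p1 = 0.1, alpha = 0.05, power = 0.8,
            family = "normal")

# Poisson regression with a binary predictor: base rate exp0 = 1 under
# the null hypothesis, expected relative risk exp1 = 1.2.
wp.poisson(exp0 = 1, exp1 = 1.2, alpha = 0.05, power = 0.8,
           family = "Bernoulli")
```
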

CONCLUSIONS

In conclusion, sample size calculation plays the most important role in the research design process before starting the study. In particular, since randomized controlled trial studies, which are frequently conducted in clinical settings, are directly related to cost issues, the number of samples must be carefully calculated. However, although there are various references related to sample size calculation, it can be difficult to correctly use a method suitable for your own study. Of course, it would be better to seek expert advice for more complex studies, but we hope that this article will help researchers calculate the right number of subjects for their own research.

NOTES

Authors’ contributions

Conceptualization: YHK, HIB, YP

Data curation: SP, YHK

Formal analysis: SP, HIB

Investigation: SP, HIB

Methodology: SP, YHK

Project administration: YHK, YP

Visualization: HIB

Writing–Original Draft: SP, HIB

Writing–Review & Editing: All authors

Conflict of interest

All authors have no conflicts of interest to declare.

Funding/support

This work was supported by the Soonchunhyang University Research Fund.

Supplementary materials

Supplementary data 1–3 can be found via https://doi.org/10.7602/jmis.2023.26.1.9.

jmis-26-1-9-supple.pdf

Table 1. Types of hypothesis testing.

Test for          Null hypothesis (H0)   Alternative hypothesis (H1)
Equality          μT − μS = 0            μT − μS ≠ 0
Equivalence       |μT − μS| ≥ δ          |μT − μS| < δ
Superiority       μT − μS ≤ δ            μT − μS > δ
Non-inferiority   μT − μS ≤ −δNI         μT − μS > −δNI

Table 2. Type I and type II error.

True status   Decision: accept H0   Decision: reject H0
H0            Correct decision      Type I error (α)
H1            Type II error (β)     Correct decision

Table 3. Tests for calculating sample size.

No.  Type                        No. of groups  Name                         R package  Function
1    Continuous/Parametric       1              One-sample t test            pwr        pwr.t.test
2                                2              Two-sample t test            pwr        pwr.t.test
3                                2              Paired t test                pwr        pwr.t.test
4                                ≥3             One-way ANOVA                pwr        pwr.anova.test
5    Continuous/Nonparametric    1              One-sample Wilcoxon test     pwr        pwr.t.test
6                                2              Mann-Whitney U test          pwr        pwr.t.test
7                                2              Paired Wilcoxon test         pwr        pwr.t.test
8                                ≥3             Kruskal-Wallis test          pwr        pwr.anova.test
9    Categorical/Parametric      1              One-sample proportion test   pwr        pwr.p.test
10                               2              Two-sample proportion test   pwr        pwr.2p.test
11                               -              Chi-square test              pwr        pwr.chisq.test
12   Categorical/Nonparametric   2              Fisher exact test            exact2x2   ss2x2
13                               2              McNemar test                 exact2x2   ss2x2
14   -                           -              Correlation analysis         pwr        pwr.r.test
15   -                           -              Linear regression            pwr        pwr.f2.test
16   -                           -              Logistic regression          WebPower   wp.logistic
17   -                           -              Poisson regression           WebPower   wp.poisson

ANOVA, analysis of variance.


References

  1. Altman DG. Statistics and ethics in medical research: III How large a sample? Br Med J 1980;281:1336-1338.
  2. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA 1994;272:122-124.
  3. Foulkes M. Study designs, objectives, and hypotheses [Internet]. Johns Hopkins Bloomberg School of Public Health; 2008 [cited 2023 Feb 20]. Available from: https://docplayer.net/38128249-Study-designs-objectives-and-hypotheses-mary-foulkes-phd-johns-hopkins-university.html.
  4. Bose M, Dey A. Optimal crossover designs. World Scientific; 2009.
  5. Johnson DE. Crossover experiments. WIREs Comp Stat 2010;2:620-625.
  6. European Medicines Agency. ICH: E 9: Statistical principles for clinical trials - Step 5 [Internet]. European Medicines Agency; 1998 [cited 2023 Feb 20]. Available from: https://www.ema.europa.eu/en/documents/scientific-guideline/ich-e-9-statistical-principles-clinical-trials-step-5_en.pdf.
  7. Chow SS, Shao J, Wang H, Lokhnygina Y. Sample size calculation in clinical trial. Marcel Dekker Inc; 2003.
  8. Lachin JM. Chapter 3. Sample size, power, and efficiency. In: Lachin JM. Biostatistical methods. John Wiley & Sons; 2000. p. 61-86.
  9. van Belle G. Statistical rules of thumb. 2nd ed. John Wiley & Sons; 2011.
  10. Lachin JM. Introduction to sample size determination and power analysis for clinical trials. Control Clin Trials 1981;2:93-113.
  11. Zhang Z, Yuan KH. Practical statistical power analysis using WebPower and R. ISDSA Press; 2018.
  12. Lehmann EL. Nonparametrics: statistical methods based on ranks. Springer; 2006.
  13. McGraw KO, Wong SP. A common language effect size statistic. Psychol Bull 1992;111:361-365.
  14. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Taylor & Francis; 2013.
  15. Sawilowsky SS. New effect size rules of thumb. J Mod Appl Stat Methods 2009;8:26.
  16. Bonett DG. Confidence intervals for standardized linear contrasts of means. Psychol Methods 2008;13:99-109.
  17. Fleiss JL, Levin B, Paik MC. Statistical methods for rates and proportions. 3rd ed. John Wiley & Sons; 2013.
  18. Nelder JA, Wedderburn RW. Generalized linear models. J R Stat Soc Ser A 1972;135:370-384.
  19. Demidenko E. Sample size determination for logistic regression revisited. Stat Med 2007;26:3385-3397.
  20. Cohen J. A power primer. Psychol Bull 1992;112:155-159.
