Review Article


Journal of Minimally Invasive Surgery 2024; 27(2): 55-71

Published online June 15, 2024

https://doi.org/10.7602/jmis.2024.27.2.55

© The Korean Society of Endo-Laparoscopic & Robotic Surgery

Propensity score matching for comparative studies: a tutorial with R and Rex

Bora Lee1,2 , Nam-eun Kim3 , Sungho Won2,3,4 , Jungsoo Gim1,5,6

1Institute of Well-Aging Medicare & CSU G-LAMP Project Group, Chosun University, Gwangju, Korea
2Department of Public Health Sciences, Seoul National University, Seoul, Korea
3Institute of Health and Environment, Seoul National University, Seoul, Korea
4RexSoft Inc., Seoul, Korea
5Department of Biomedical Science, Chosun University, Gwangju, Korea
6Gwangju Alzheimer’s & Related Dementia Cohort Research Center, Chosun University, Gwangju, Korea

Correspondence to : Jungsoo Gim
Department of Biomedical Science, Chosun University, 309 Pilmun-daero, Dong-gu, Gwangju 61452, Korea
E-mail: jgim@chosun.ac.kr
https://orcid.org/0000-0002-6454-9751

Received: May 20, 2024; Revised: May 29, 2024; Accepted: June 7, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recently, there has been considerable progress in developing new technologies and equipment for the medical field, including minimally invasive surgeries. Evaluating the effectiveness of these treatments requires study designs like randomized controlled trials. However, due to the nature of certain treatments, randomization is not always feasible, leading to the use of observational studies. The effect size estimated from observational studies is subject to selection bias caused by confounders. One method to reduce this bias is propensity scoring. This study aimed to introduce a propensity score matching process between two groups using a practical example with R. Additionally, Rex, an Excel add-in graphical user interface statistical program, is provided for researchers unfamiliar with R programming. Further techniques, such as matching with three or more groups, propensity score weighting and stratification, and imputation of missing values, are summarized to offer approaches for more complex studies not covered in this tutorial.

Keywords Propensity score, Matching, Selection bias, R, Rex

The demand for minimally invasive surgeries has grown since they minimize the surgical site and bleeding, leading to faster recovery times. Consequently, new equipment and technologies incorporating minimally invasive treatments are being developed for various conditions, such as degenerative brain disorders and spinal diseases in older adults and brain tumor removal in infants [1-4]. Accurate evaluations of these new treatments are crucial for several reasons: (1) they clarify the benefits of the new technologies and the potential side effects; (2) they update the guidelines to offer better strategies in clinical practice; and (3) they ensure efficient allocation of medical resources [5,6]. Additionally, future medical research can focus on effective technologies and encourage investment in promising ones [7].

Selecting appropriate outcome indicators and comparable controls is essential to obtain precise treatment effects. If unrelated outcomes are chosen, or if the control group differs from the treatment group, it becomes challenging to determine whether the effects are due to the treatment or other factors. Selecting appropriate outcomes is relatively straightforward, but ensuring well-balanced characteristics between the treatment and control groups is generally problematic. In observational studies, the characteristics of the treatment and control groups usually differ. Fortunately, there are various statistical approaches to strengthen causal inference by controlling for confounders (Fig. 1) [8]. One such method is the propensity score.

Fig. 1. Research design and analyses in observational studies for causal inference.

Propensity scores can be applied in several ways, such as matching, weighting, and stratification. Propensity score-based analyses enable efficient evaluations even with a small sample size [9,10]. Notably, they support the validity and generalizability of the results in causal inference, whereas conventional analyses infer only association, not causality [11].

This study aimed to provide a practical guide to propensity score analyses, focusing on matching. Unlike previous tutorials on propensity score matching (PSM), this study covers traditional and machine learning-based matching with detailed descriptions using the recently updated R packages. It also offers an easy way of performing PSM with an Excel add-in statistical program called Rex, providing point-and-click analysis for researchers unfamiliar with R programming. To this end, we explained the concept of the propensity score and outlined the step-by-step process of PSM, along with available statistical programs. As a practical example, we used PSM to compare the surgical outcomes between patients who underwent noninvasive high-intensity focused ultrasound (HIFU) brain surgery and conventional brain resection surgery. Finally, we summarized additional considerations not covered in this study.

Definition of propensity score

When comparing the outcomes of treatment and control groups, differing characteristics, aside from the treatment itself, can make it difficult to determine whether the differences in outcomes are due to the treatment or other factors. This leads to a biased estimate of the treatment effect, with the other influencing factors referred to as ‘confounders.’

One way to obtain an unbiased effect size is through randomized controlled trials (RCTs), which randomly assign subjects to the treatment or control groups. However, when random assignment is not feasible due to the nature of the study, observational studies are necessary. In this case, propensity scores are used to control for confounders in a manner similar to random assignment.

To better understand propensity scores, consider a balanced two-arm RCT. The concept of random assignment in RCTs can be expressed mathematically as P(Xi = 1 | C1i, C2i, ···, Cki) = 0.5, where Xi = 1 means that the ith subject is assigned to the treatment group. This equation indicates that the probability of a subject with k characteristics (C1i, C2i, ···, Cki) being assigned to the treatment group is 0.5, resulting in a 0.5 probability of being assigned to the control group. In other words, it indicates equal assignment probabilities between the two groups. It has been shown that, as a result, the k characteristics (C1i, C2i, ···, Cki) of subjects assigned to the two groups are similar.

When adopting this concept to observational studies, the propensity score estimates the probability that the ith subject is assigned to the treatment group using covariates. PSM pairs subjects from both groups with similar scores to emulate the effects of random assignment.

Propensity score analysis process

Propensity score analyses can be broadly classified into the following six steps (Fig. 2).

Fig. 2. Process of propensity score analysis.

1. Set variables: Define the causal variable (treatment), outcome variable, and confounding variables.

2. Estimate propensity scores: Choose the optimal model to estimate the propensity scores based on the defined variables.

3. Perform analyses: Use the estimated propensity scores to conduct analyses, such as matching, weighting, and stratification.

4. Check balance: Assess the balance of propensity scores and covariates between the treatment and control groups.

5. Evaluate treatment effect: From the balanced data, evaluate the treatment effect and conduct a sensitivity analysis on the effect size.

6. Iterate if necessary: If covariate imbalance is detected after propensity score analysis, or if the treatment effect estimates are not robust on the sensitivity analysis, redo the process starting from variable selection or propensity score estimation (PSE).

Propensity score matching

The techniques for propensity score analyses are as follows: (1) PSM, which matches subjects with similar propensity scores from the treatment and control groups; (2) propensity score weighting (PSW), which calculates inverse-treatment probability weights based on the propensity scores to adjust the covariate distributions of the two groups for comparability; and (3) propensity score stratification (PSS), which stratifies subjects from both groups into similar propensity score strata, then calculates and integrates the treatment effects estimated from each stratum [11].

This study focused on PSM, the most frequently used technique among the various propensity score analysis techniques. Generalized linear models (GLMs) and their extensions with regularization terms were introduced for propensity score estimation. Machine learning models were also added, including a generalized boosted model (GBM), random forest, Bayesian additive regression trees (BART), and a neural network model. In total, four algorithms were used for matching: nearest neighbor matching, optimal matching, full matching, and genetic matching.

Matched data analyses for treatment effect evaluation

Treatment effects can be estimated if covariate balance is achieved based on the propensity scores. The fundamental approach in causal inference is to estimate the average treatment effect using weighted means. However, due to a lack of consensus on computing the standard error of the estimates, bootstrapping or non-parametric tests are used to estimate the 95% confidence intervals or determine statistical significance [11,12]. Alternatively, a weighted generalized linear regression model can be fitted to estimate the treatment effect [13].

In practice, the most widely used method is to treat the matched data as a regular dataset and apply conventional statistical analyses. In this case, the matched data can be regarded as pooled data, and the analyses can be conducted under the independence assumption. Alternatively, pairwise or conditional analyses can be performed based on the matched pairs [14]. Since there is still debate about the suitable approach, researchers are advised to follow the methods commonly used in their field. The available methods for each case are shown in Fig. 3.
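As a minimal sketch of the pooled-analysis approach described above, the matched data can be extracted and analyzed as a regular dataset. The object name 'matching1' is assumed to be a fitted 'matchit()' result (as created later in this tutorial), and 'post_sx' and 'treatment' follow the example data; the 'weights' and 'subclass' columns are produced by 'match.data()':

```r
library(MatchIt)

# Extract the matched data set; match.data() appends matching
# weights ('weights') and pair membership ('subclass')
md <- match.data(matching1)

# Pooled analysis under the independence assumption:
# weighted linear regression of the outcome on the treatment
fit <- lm(post_sx ~ treatment, data = md, weights = weights)
summary(fit)

# For pair-aware inference, cluster-robust standard errors by
# matched pair are often recommended, e.g.:
# lmtest::coeftest(fit, vcov. = sandwich::vcovCL, cluster = ~subclass)
```

The commented alternative at the end illustrates the pairwise approach; whether pooled or pair-aware inference is preferred remains field-dependent, as noted above.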

Fig. 3. Matched data analyses based on outcome type and independence assumption.

Statistical programs for propensity score matching

Various statistical programs, including R, STATA, SAS, and Python, can analyze propensity scores. Each program possesses unique features and strengths. The selection of a specific program hinges on the characteristics of the data and the requirements of the analysis. This study introduces practical applications using R and Rex. R is an open-source statistical software freely available and continually enhanced by its users. R packages for propensity score analyses include MatchIt, cobalt, WeightIt, twang, and others [15-18]. This study employed the MatchIt and cobalt packages for PSM. In contrast, Rex is an Excel add-in program that offers various R-based statistical analysis modules in a graphical user interface (GUI) format [19]. Rex includes complimentary modules for descriptive statistics, group comparison, categorical data analysis, and survival analysis. However, certain modules, such as PSM and structural equation modeling, require purchase (a 2-week free trial is available).

This section describes an example of propensity score matching using R. Implementation using Rex is presented in Supplementary Material 1.

Example data

HIFU is a noninvasive surgical technique used to treat target areas in the brain without the need for an incision, offering shorter recovery times and reduced infection risks compared with conventional deep brain surgery (DBS). In this example, PSM was conducted between the HIFU treatment group (treatment = 1) and the DBS control group (treatment = 0). Covariates, including age (in years), sex (1 = male, 2 = female), presence of comorbidities (0 = no, 1 = yes; referred to as “comor”), and the preoperative symptom score (in points; referred to as “pre_sx”), were utilized. The objective was to estimate the effect of HIFU treatment on the postoperative symptom score. Data for this analysis was generated using the R code provided in Supplementary Material 2.

Categorical covariates were converted to factors after importing the data saved as a CSV file. It is important to note that including categorical covariates as dummy variables (one less than the number of levels) in the model can yield different propensity scores or matching outcomes. Therefore, converting them to factors in R is recommended instead of using dummy variables. However, the treatment variable must remain a numeric variable coded as 0 and 1; if it is converted to a factor, it cannot be fitted to machine learning models. Thus, it should not be converted to a factor even though it is categorical.
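A minimal sketch of this preparation step is shown below. The file name 'hifu_data.csv' and the data frame name 'dat' are assumptions for illustration; the column names follow the example data described above:

```r
# Hypothetical file name; adjust to your own dataset
dat <- read.csv("hifu_data.csv")

# Convert categorical covariates to factors (not dummy variables)
dat$sex   <- factor(dat$sex,   levels = c(1, 2), labels = c("male", "female"))
dat$comor <- factor(dat$comor, levels = c(0, 1), labels = c("no", "yes"))

# Leave the treatment variable as numeric 0/1 so that machine
# learning-based distance methods can be fitted
str(dat)
```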

Basic usage of MatchIt::matchit()

The ‘matchit()’ function available in the MatchIt package (version 4.5.5) facilitates PSE and PSM. Table 1 outlines the primary function arguments for PSM and PSE for two treatment groups.

Table 1 . Primary arguments in MatchIt::matchit() for PSM

For use | Argument | Format (with default values) | Details
PSE or PSM | formula^a) | formula = treatment ~ covariate1 + covariate2 + … | The model representing the relationship between the treatment and covariates for propensity score estimation. Only the treatment variable can be entered for matching based on a pre-calculated propensity score vector.
PSE or PSM | data | data = dataframe | A data frame containing the variables specified in the formula
PSE or PSM | distance | distance = ‘glm’ | The method for calculating the propensity score used for matching
PSE or PSM | link | link = ‘logit’ | Specifies the link function when the distance is set to ‘glm’, ‘ridge’, ‘lasso’, ‘elasticnet’, or ‘gbm’
PSE or PSM | distance.options | distance.options = list() | Sets the model-specific options used to compute the propensity score
PSM | method | method = ‘nearest’ | The method for matching
PSM | exact | exact = NULL | Exact matching on specific variables (matches subjects with exactly the same values of the listed covariates)
PSM | ratio | ratio = 1 | The number of control subjects to be matched with one treated subject
PSM | replace | replace = FALSE | Matching without replacement (restricts each control subject to be matched with one treated subject)
PSM | discard | discard = ‘none’ | How to manage subjects that fall outside the common support region of the propensity scores before matching
PSM | reestimate | reestimate = FALSE | Determines whether to reestimate the propensity scores after some subjects have been discarded based on the ‘discard’ option
PSM | caliper | caliper = NULL | The width of the caliper to be used in the matching
PSM | std.caliper | std.caliper = TRUE | How the caliper is measured
PSM | m.order | m.order = NULL | The order in which the matching is performed

PSE, propensity score estimation; PSM, propensity score matching.

a)The argument ‘formula’ must be specified in any case.



The basic structure of the analysis code is as follows. To conduct only PSE, utilize the ‘formula,’ ‘data,’ ‘distance,’ ‘link,’ and ‘distance.options’ parameters. For PSM based on pre-estimated propensity scores, input the relevant propensity score vector in ‘distance’ and specify the treatment variable in ‘formula.’ When performing both PSE and PSM simultaneously, all arguments are applied. Default values are used for arguments that are not explicitly designated.
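The structure described above can be sketched as follows. The data frame name 'dat' and the covariate names are assumptions carried over from the example data:

```r
library(MatchIt)

# (a) Propensity score estimation only (no matching)
ps_only <- matchit(treatment ~ age + sex + comor + pre_sx,
                   data = dat, method = NULL,
                   distance = "glm", link = "logit")

# (b) Matching on a pre-estimated propensity score vector:
#     'my_ps' is assumed to be a numeric vector, one score per subject;
#     the covariates on the right-hand side are used for balance checks
pre_ps <- matchit(treatment ~ age + sex + comor + pre_sx,
                  data = dat, distance = my_ps)

# (c) Estimation and matching in one call; unspecified arguments
#     fall back to their defaults (e.g., method = "nearest")
both <- matchit(treatment ~ age + sex + comor + pre_sx,
                data = dat, distance = "glm", link = "logit",
                method = "nearest")
```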

Step 1: Variable selection

For PSE, the causal variable (treatment), the outcome variable, and the covariates expected to be associated with the treatment and outcome variables need to be set [12]. Among the covariates that precede or co-occur with the treatment, a variable correlated with both the treatment and outcome is called a ‘confounder.’ In contrast, if a treatment affects an intermediate indicator, which affects the outcome, then that intermediate indicator is called a ‘mediator.’ Controlling for mediators in the model can lead to underestimation of the treatment effect. Therefore, researchers should review relevant factors from previous studies on the causal relationship when estimating propensity scores. Mediators should be excluded, and confounders should be included in the model for PSE.

In this example, the surgery type (‘treatment’) is the causal variable, and the postoperative symptom score (‘post_sx’) is the outcome. Confounders related to both the treatment assignment and the outcome are age, sex, presence of comorbidities (‘comor’), and the preoperative symptom score (‘pre_sx’). These were selected as covariates for the PSE.

Step 2: Propensity score estimation

Methods for estimating propensity scores can be divided into parametric and non-parametric methods. The most commonly used parametric method is the GLM, including logistic regression. GLMs, by specifying probability distributions and link functions, can be applied to binary, multinomial, and continuous treatment variables. These models can also be extended to regularized regression models like Ridge, Lasso, and Elastic Net, which add penalty terms to the model’s loss function.

Recently, machine learning techniques have been widely used for estimating propensity scores. These non-parametric methods do not assume a specific distribution and can model complex patterns inherent in data. However, it is necessary to adjust the hyperparameters to prevent overfitting.

In the ‘matchit()’ function, ‘method=NULL’ allows for PSE without matching. Table 2 lists the available models for estimation in ‘matchit().’ To build various PSE models, refer to the model-specific options for each method.

Table 2 . Specifications of each model in matchit() for propensity score estimation

Argument | Details | Utilized function | Specific options
distance = ‘glm’ (default) or distance = ‘ps’ | A generalized linear model (e.g., logistic regression) | glm() | family, link
distance = ‘ridge’ | A generalized linear model with L2 regularization (Ridge model) | glmnet::cv.glmnet() | family, link
distance = ‘lasso’ | A generalized linear model with L1 regularization (Lasso model) | glmnet::cv.glmnet() | family, link
distance = ‘elasticnet’ | A generalized linear model with L1 and L2 regularization (Elastic Net model) | glmnet::cv.glmnet() | family, link, alpha
distance = ‘gbm’ | A generalized boosted model | gbm::gbm() | n.trees, interaction.depth, shrinkage, bag.fraction, n.minobsinnode
distance = ‘randomforest’ | A random forest | randomForest::randomForest() | mtry, ntree, nodesize, sampsize, maxnodes
distance = ‘bart’ | Bayesian additive regression trees (BART) | dbarts::bart2() | num.trees, num.burn, num.iter, alpha, beta, sigma
distance = ‘nnet’ | A single-hidden-layer neural network | nnet::nnet() | size, decay, maxit, linout, trace


Generalized linear model with various link functions

In GLMs, various models can be specified by the link function. The logistic regression model, which is most widely used for a binary treatment group, uses the link function of logit (link = ‘logit’). Features of each link function are listed in Table 3.

Table 3 . Link functions in matchit()

Argument | Function | Model | Suitable case
link = ‘logit’ | log(p / (1 − p)) | Logistic regression | Commonly used for binary outcomes in practice
link = ‘probit’ | Φ⁻¹(p) | Probit regression | Used for binary outcomes with more extreme values close to 0 or 1
link = ‘cloglog’ | log(−log(1 − p)) | Complementary log-log regression | Used for binary outcomes of rare events
link = ‘cauchit’ | tan(π(p − 0.5)) | Cauchit regression | Used for binary outcomes with heavy tails; very sensitive to outliers


Here is example code for fitting a GLM with a logit link function. With the settings ‘distance = ‘glm’’ and ‘link = ‘logit’’, the default probability distribution is binomial (family = ‘binomial’), so no additional ‘distance.options’ were specified.
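A minimal sketch of this call is shown below; the data frame name 'dat' and covariate names carry over from the example data, and 'glm1' matches the object name used later for model comparison:

```r
library(MatchIt)

# Propensity score estimation only (method = NULL) via logistic
# regression; family = binomial is the default for link = 'logit'
glm1 <- matchit(treatment ~ age + sex + comor + pre_sx,
                data = dat, method = NULL,
                distance = "glm", link = "logit")

# Inspect the estimated propensity scores
summary(glm1$distance)
```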

Generalized linear model with regularization

Propensity scores can also be estimated using regularized regression models, which add a regularization term to the loss function of the generalized linear model discussed in the previous section. Specifying ‘ridge,’ ‘lasso,’ or ‘elasticnet’ for the ‘distance’ parameter in the ‘matchit()’ function performs Ridge, Lasso, or Elastic Net regression, respectively. Ridge regression handles multicollinearity, and Lasso regression performs variable selection by shrinking some weights to zero. Elastic Net combines the strengths of both models.

Special options for these models include the parameter lambda, which adjusts the regularization strength. The ‘cv.glmnet()’ function in the glmnet package, used by ‘matchit(),’ automatically calculates the optimal lambda value, so it does not need to be specified separately. However, for Elastic Net, an additional parameter alpha is required to adjust the proportion of L1 regularization, with a default value of 0.5. Below is example code for fitting Ridge and Elastic Net models based on logistic regression. In this code, the ‘glmr2’ model corresponds to an Elastic Net model with 25% L1 (lasso) regularization.
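A minimal sketch of these two fits is shown below; object names ('glmr1', 'glmr2') are assumptions, while the 25% L1 mixing (alpha = 0.25) follows the description above:

```r
# Ridge-penalized logistic regression for the propensity score;
# the optimal lambda is chosen internally by glmnet::cv.glmnet()
glmr1 <- matchit(treatment ~ age + sex + comor + pre_sx,
                 data = dat, method = NULL,
                 distance = "ridge", link = "logit")

# Elastic Net with 25% L1 (lasso) regularization: alpha = 0.25
glmr2 <- matchit(treatment ~ age + sex + comor + pre_sx,
                 data = dat, method = NULL,
                 distance = "elasticnet", link = "logit",
                 distance.options = list(alpha = 0.25))
```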

Generalized boosted model

The GBM is one of the boosting techniques that combine multiple decision trees by iterative updates with weights to reduce the errors in the previous tree. The final estimated probability is the aggregate of the predicted probabilities of all the generated trees, which is used as the propensity score.

The GBM effectively captures nonlinear relationships and interaction effects. However, to avoid overfitting, it is necessary to tune the parameters, such as the number of trees (n.trees), depth (interaction.depth), and learning rate (shrinkage). The hyperparameters available in this model are shown in Table 4.

Table 4 . Distance options for generalized boosted model in matchit()

Argument | Default value | Details | Impact
n.trees | 5,000 | The number of trees to be generated | More trees can improve model fit in the training set but increase the risk of overfitting.
interaction.depth | 3 | The maximum depth of each tree, or allowable interactions | Greater depth can capture complex patterns better but may lead to overfitting.
shrinkage | 0.01 | The learning rate: a scalar coefficient multiplied to each tree | Lower values make the model learn more slowly and can improve generalization, but values that are too low make training very slow.
n.minobsinnode | 10 | The minimum number of observations required in the terminal node of each tree | Larger values limit tree growth, helping to prevent overfitting.


The code below is an example of PSE based on a GBM. Explaining the details of the search strategies for the optimal hyperparameters is beyond the scope of this study. Please refer to the relevant code in ‘(1) Generalized boosted model’ section of Supplementary Material 3 as a reference.
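A minimal sketch of a GBM-based estimation is shown below; the object name 'gbm1' is an assumption, and the hyperparameter values shown are simply the defaults listed in Table 4, not tuned values:

```r
# Propensity score estimation with a generalized boosted model;
# hyperparameters here are the documented defaults, for illustration
gbm1 <- matchit(treatment ~ age + sex + comor + pre_sx,
                data = dat, method = NULL, distance = "gbm",
                distance.options = list(n.trees = 5000,
                                        interaction.depth = 3,
                                        shrinkage = 0.01,
                                        n.minobsinnode = 10))
```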

Random forest

Random forest models generate multiple bootstrap samples and build a decision tree on each, randomly selecting a subset of variables at each split. The model then combines the prediction results of the trees to produce a final estimated probability, which is used as the propensity score. This method is robust against overfitting and is useful for evaluating the relative importance of variables, especially in cases with many covariates where the importance of each covariate is unknown. Relevant options are summarized in Table 5.

Table 5 . Distance options for random forest in matchit()

Argument | Default value | Details | Impact
n.trees | 500 | The number of trees to be generated | More trees can improve model stability and accuracy but increase computation.
mtry | sqrt(p) | The number of covariates to consider at each split, where p is the total number of covariates | Larger values can improve model accuracy but may increase the risk of overfitting and computation.
nodesize | 1 | The minimum number of observations required in the terminal node of each tree | Larger values make the trees less complex, which can help prevent overfitting but may reduce accuracy.
maxnodes | NULL | The maximum number of terminal nodes that trees can have | Limiting the number of nodes can reduce overfitting, but too small a value may result in underfitting.
importance | FALSE | Whether to compute variable importance | Variable importance can provide insights into the model but increases computation time.


The code below provides an example of PSE based on a random forest model. Explaining the details of the search strategies for the optimal hyperparameters is beyond the scope of this study. Please refer to the relevant code in ‘(2) Random forest’ section of Supplementary Material 3 as a reference.
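A minimal sketch of a random forest-based estimation is shown below; 'rf1' matches the object name used later for model comparison, and the option values are the documented defaults rather than tuned choices:

```r
# Propensity score estimation with a random forest
rf1 <- matchit(treatment ~ age + sex + comor + pre_sx,
               data = dat, method = NULL,
               distance = "randomforest",
               distance.options = list(ntree = 500, nodesize = 1))
```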

Bayesian additive regression trees

BART is an ensemble model based on the Bayesian approach combining multiple regression trees. It employs the Markov chain Monte Carlo (MCMC) algorithm to sample parameters from Bayesian prior distributions, aiming to reduce errors from prior trees, thereby iteratively producing new regression trees.

For each generated tree, the final estimated probability is calculated by weighting predicted probabilities according to the contribution from the Bayesian posterior distribution. Relevant options are outlined in Table 6.

Table 6 . Distance options for Bayesian additive regression trees (BART) in matchit()

Argument | Default value | Details | Impact
num_trees | 200 | The number of trees to be generated | More trees can improve model stability and accuracy but increase computation.
k | 2 | A hyperparameter controlling the variance of the tree predictions | Larger values reduce variance but can make the model more conservative, potentially underfitting.
alpha | 0.95 | The base terminal node probability in the tree-building process | Higher values increase the probability of smaller trees, reducing complexity and overfitting.
beta | 2 | A hyperparameter influencing the depth of the trees | Higher values encourage deeper trees, which make the model more complex, potentially overfitting.
n_min | 5 | The minimum number of observations required in the terminal node of each tree | Larger values limit tree growth, helping to prevent overfitting.
n_iter | 1,000 | The number of iterations for the MCMC algorithm | More iterations can improve model convergence and stability but increase computation time.
burn_in | 100 | The number of initial MCMC iterations to be discarded | Larger values ensure more stable models but increase computation time.

MCMC, Markov chain Monte Carlo.



This method effectively estimates propensity scores when there is a significant variance in the covariates between the treatment and control groups. Notably, the prediction from each tree not only provides estimates but also captures the uncertainty of these estimates via MCMC sampling. Consequently, the final estimated probabilities are computed to reflect this uncertainty. However, this comes with a substantial computational burden.

The code below is an example of a PSE based on BART. Explaining the details of the search strategies for the optimal hyperparameters is beyond the scope of this study. Please refer to the relevant code in ‘(3) Bayesian additive regression trees’ section of Supplementary Material 3 as a reference.
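A minimal sketch of a BART-based estimation is shown below; the object name 'bart1' is an assumption, and the defaults described in Table 6 are left in place:

```r
# Propensity score estimation with Bayesian additive regression
# trees (computationally heavier than the other methods)
bart1 <- matchit(treatment ~ age + sex + comor + pre_sx,
                 data = dat, method = NULL, distance = "bart")
```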

Neural network model

A neural network is a network of interconnected neurons organized into layers. A neural network model (NNET) mimics the structure of neural networks, with covariates as the input layer and treatment variables as the output layer. A basic neural network with only these two layers is equivalent to a linear regression model.

Adding a layer with hidden neurons between these two layers creates a multilayer structure capable of capturing nonlinear relationships. Neurons in each hidden layer, often termed nodes, receive inputs from the previous layer, compute a weighted linear combination, apply an activation function, and pass the outputs to the subsequent layer. To prevent overfitting, a regularization parameter (decay) controls the magnitude of the weights, similar to Ridge regression. Increasing the number of layers and nodes per layer allows more complex patterns in the data to be captured.

A single-hidden-layer neural network model structure is applied by setting ‘distance = ‘nnet’ in ‘matchit().’ The number of nodes (size) in this hidden layer and the regularization parameter (decay) optimize the model (refer to Table 7). Due to longer training times, it is recommended to use this approach when computing resources are sufficient.

Table 7 . Distance options for a single-hidden-layer neural network in matchit()

Argument | Default value | Details | Impact
size | 5 | The number of nodes in the hidden layer | More nodes can capture more complex patterns but increase the risk of overfitting and computation time.
decay | 0.1 | The weight decay parameter for regularization | Higher values help prevent overfitting but may lead to underfitting.
maxit | 100 | The maximum number of iterations for training the neural network | More iterations improve convergence but increase the risk of overfitting and computation time.
linout | FALSE | Specifies whether the activation function should be linear | For propensity scores of a binary treatment, a nonlinear activation function is more appropriate.
trace | TRUE | Whether to produce output during training | Outputs provide insight into the training process, which can be useful for debugging and monitoring, but also increase computation.


The code below is an example of PSE based on a neural network model. Explaining the details of the search strategies for the optimal hyperparameters is beyond the scope of this study. Please refer to the relevant code in ‘(4) Neural network model with a single-hidden-layer’ section of Supplementary Material 3.
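A minimal sketch of a neural network-based estimation is shown below; the object name 'nnet1' is an assumption, and the 'size' and 'decay' values are the documented defaults, not tuned values:

```r
# Propensity score estimation with a single-hidden-layer neural
# network; trace = FALSE suppresses per-iteration training output
nnet1 <- matchit(treatment ~ age + sex + comor + pre_sx,
                 data = dat, method = NULL, distance = "nnet",
                 distance.options = list(size = 5, decay = 0.1,
                                         trace = FALSE))
```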

Model selection for propensity score estimation

The criteria for selecting the optimal model for PSE can be divided into model fitness and the distribution of propensity scores. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are goodness-of-fit measures for generalized linear models, but they are not computed for machine learning models. Therefore, the propensity score distribution is used to select the most appropriate model.

The propensity score distribution between treatment and control groups is assessed by the common support region (CSR). When there is no overlap in propensity scores between the two groups, known as ‘separation,’ the groups cannot be compared, making it impossible to calculate treatment effects. In this case, adjusting the model and its parameters or redefining the covariate settings is crucial. Here are the code and results for inspecting the CSR of estimated propensity scores from glm1 and rf1 objects. Since glm1 demonstrates a broader and more uniform CSR compared with rf1, it is a better model (Fig. 4).
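The inspection described above can be sketched with the cobalt package, which is used for balance assessment in this tutorial; 'glm1' and 'rf1' are the estimation objects from the earlier steps:

```r
library(cobalt)

# Overlaid distributions of the estimated propensity scores
# ("distance") by treatment group; the visual overlap indicates
# the common support region (CSR)
bal.plot(glm1, var.name = "distance", which = "unadjusted")
bal.plot(rf1,  var.name = "distance", which = "unadjusted")
```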

Fig. 4. Propensity score distribution from two models.

Step 3: Propensity score matching

Once optimal propensity scores are calculated, different matching methods are applied. Popular algorithms are nearest neighbor matching, optimal matching, full matching, and genetic matching. Table 8 summarizes the features of each algorithm and the scenarios in which they are useful.

Table 8 . Matching algorithms in matchit()

Argument | Matching algorithm | Details | Suitable case
method = NULL | None | Estimate the propensity score without matching | None
method = ‘nearest’ (default) | Nearest neighbor matching | Match each subject to the closest propensity score within the designated caliper | Evenly distributed propensity scores
method = ‘optimal’ | Optimal pair matching | Select the optimal pairing from all possible matching combinations | High-dimensional data with many covariates; high imbalance between treated and control groups
method = ‘full’ | Optimal full matching | Find the overall optimal matching using all subjects | Useful when it is desired to use all subjects
method = ‘genetic’ | Matching by genetic algorithm | Optimize covariate balance across groups based on a genetic algorithm | High-dimensional data with many covariates, high imbalance between treated and control groups, or small sample size


Additionally, options used for matching include exact, ratio, replace, discard, reestimate, caliper, std.caliper, and m.order, as presented in Table 1. Table 9 outlines the detailed descriptions of each option.

Table 9. Options for propensity score matching in matchit()

Argument | Format | Details
exact | exact = ~ covariate1 + covariate2 | Exact matching on specific variables (in this format, subjects are matched only if they have exactly the same values of covariate1 and covariate2)
ratio | 1 (default) | The number of control subjects to be matched with one treated subject
replace | FALSE (default) | Matching without replacement (each control subject can be matched to only one treated subject)
replace | TRUE | Matching with replacement (a control subject may be matched to several treated subjects)
discard | 'none' (default) | No units are discarded
discard | 'control' | Discard only control subjects with propensity scores outside the common support region
discard | 'treated' | Discard only treated subjects with propensity scores outside the common support region
discard | 'both' | Discard both treated and control subjects with propensity scores outside the common support region
reestimate | FALSE (default) | Use the original propensity scores for matching, even if some subjects are discarded
reestimate | TRUE | Reestimate the propensity scores using only the subjects remaining after the discard process
caliper | NULL (default; no caliper) | The width of the caliper to be used in matching
std.caliper | TRUE (default) | The caliper is defined as a fraction or multiple of the standard deviation of the logit of the propensity score
std.caliper | FALSE | The caliper is defined in the raw units of the propensity score
m.order | NULL (default) | Corresponds to 'largest'
m.order | 'random' | Match in a random order
m.order | 'smallest' | Match subjects with the smallest propensity scores first
m.order | 'largest' | Match subjects with the largest propensity scores first


Using the propensity scores estimated from the glm1 model, nearest neighbor matching was conducted within a caliper corresponding to 10% of the standard deviation of the logit of the propensity scores, with identical age and sex ensured. The code was set up for a 1:2 matching ratio without replacement. After matching, the results were examined using the 'summary(matching1)' function. Due to the exact matching on age and sex, the final matching includes 11 subjects in the treatment group and 13 subjects in the control group, totaling 24 subjects, which is relatively small (Fig. 5).
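The matching call described above can be sketched as follows (the data frame name `dat` and the covariate names are assumed from the example data):

```r
library(MatchIt)  # assumes MatchIt is installed and 'dat' is loaded

# 1:2 nearest neighbor matching without replacement, with a 0.1-SD caliper
# on the logit of the propensity score and exact matching on age and sex
matching1 <- matchit(treatment ~ age + sex + comor + pre_sx,
                     data        = dat,
                     distance    = "glm",
                     link        = "logit",
                     method      = "nearest",
                     exact       = ~ age + sex,
                     ratio       = 2,
                     replace     = FALSE,
                     caliper     = 0.1,
                     std.caliper = TRUE)

# Numbers of matched and unmatched subjects, plus balance statistics
summary(matching1)
```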

Fig. 5. Number of matched and unmatched cases.

Step 4: Covariate balance assessment

After PSM, it is essential to verify whether a balance has been achieved between the covariates of the treatment and control groups. This usually involves comparing the standardized covariates between the two groups and evaluating whether the means and variances of the propensity scores are similar. Histograms or other visualizations can be used to compare the propensity score distributions before and after matching.

When comparing group means and variances, since the units of each covariate are different, the propensity scores and covariates are standardized. The group difference in means and the ratio of variances are then calculated. These are respectively referred to as the standardized mean difference (SMD) or absolute SMD (ASMD) and the variance ratio (VR). Typically, if the absolute value of the SMD is less than 0.1 and the VR falls between 0.5 and 2.0, it is considered that there is no imbalance in the covariates and propensity scores [20].

Checking the covariate balance before and after matching can also be done through histograms or density plots. However, it is common to use a Love plot to easily visualize changes in the SMD before and after matching. If there is an imbalance in the covariates, it is recommended that the propensity scores be reestimated.

The following code was used to examine the SMD and VR for the matching1 object created in step 3. Histograms, density plots, and a Love plot were generated to visualize the covariate balance. Since all covariates had an SMD under 0.1 and a VR between 0.5 and 2.0, it was determined that covariate balance was achieved (Fig. 6). The Love plot in Fig. 6C illustrates the changes in the ASMD for each covariate before and after matching. Initially, as depicted by the blue line, each covariate had an ASMD greater than 0.1, sorted from the highest ASMD starting with comorbidity. After matching, represented by the red line, the ASMD for all four covariates fell below 0.1, indicating effective covariate balancing.
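A sketch of this balance assessment with the cobalt package, assuming the matching1 object from step 3 exists:

```r
library(cobalt)  # assumes cobalt is installed and matching1 exists

# SMD and variance ratio with the usual thresholds (|SMD| < 0.1, VR in [0.5, 2])
bal.tab(matching1,
        stats      = c("mean.diffs", "variance.ratios"),
        thresholds = c(m = 0.1, v = 2))

# Propensity score distributions before and after matching
bal.plot(matching1, var.name = "distance",
         which = "both", type = "histogram")

# Love plot of absolute SMDs before and after matching
love.plot(matching1, stats = "mean.diffs", abs = TRUE,
          thresholds = 0.1, var.order = "unadjusted")
```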

Fig. 6. Plot for checking the covariate balance. (A) Histogram of propensity score before and after propensity score matching (PSM). (B) Density plot of mean difference of each covariate before and after PSM. (C) Absolute standardized mean difference plot.

Step 5: Treatment effect evaluation

Once covariate balance has been achieved after PSM, the matched data are used in analyses to evaluate the treatment effect.

Matched data extraction

The process for extracting the matched dataset is outlined below. Since the matching data were obtained from the matching1 object generated in step 3, it was evident that age and sex were identical in the 1:2 matched nine pairs (Fig. 7). Within this dataset, the variable ‘ps’ represents the estimated propensity score, ‘matching weight’ denotes the sample weight resulting from the 1:2 matching, and ‘match_id’ indicates the pair number of the matching.
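In MatchIt, the matched dataset can be extracted with match.data(); its default columns 'distance', 'weights', and 'subclass' correspond to the text's 'ps', matching weight, and 'match_id', and are renamed accordingly in this sketch:

```r
# Extract the matched dataset from the matching1 object (assumed to exist),
# renaming the standard MatchIt columns to match the text
md <- MatchIt::match.data(matching1,
                          distance = "ps",              # propensity score
                          weights  = "matching_weight", # 1:2 matching weight
                          subclass = "match_id")        # matched-pair number

head(md)  # inspect the matched subset
```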

Fig. 7. Matched data subset.

Treatment effect analysis

Postoperative symptom scores are categorized into two cases: improvement (post_imp = 1) for scores less than 40, and no improvement (post_imp = 0) otherwise. Two approaches evaluated this variable: (1) a standard logistic regression assuming independence of the two groups in the matched data, and (2) a conditional logistic regression of the matched pairs. The following code and results illustrate these analyses. Since this involves 1:k matching with k greater than or equal to 2, the sample weights are incorporated as weighting variables.
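The two analyses can be sketched as below. The matched data frame `md`, its weight and pair columns, and the postoperative score name `post_sx` are assumptions following the example; only the logic is illustrated.

```r
library(survival)  # for clogit(); 'md' is the assumed matched dataset

# Dichotomize the postoperative score ('post_sx' is an assumed name)
md$post_imp <- as.integer(md$post_sx < 40)

# (1) Standard logistic regression on the matched data, with the matching
#     weights included because of the 1:2 matching
fit1 <- glm(post_imp ~ treatment, family = binomial,
            data = md, weights = matching_weight)
summary(fit1)

# (2) Conditional logistic regression, conditioning on the matched pairs
fit2 <- clogit(post_imp ~ treatment + strata(match_id), data = md)
summary(fit2)
```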

Step 6: Sensitivity analysis

In propensity score analyses, a sensitivity analysis evaluates the robustness of the estimated effect size. Various methods have been proposed to assess the impact of omitted variable bias when all relevant covariates are not included. However, implementing all the methods presented in Fig. 3 may not be practical [21-23].

A commonly used sensitivity analysis involves applying different models during the PSM and conducting the same analysis on multiple matched datasets derived from various matching methods [24]. When the effect size of a specific dataset is not robust, further investigation is necessary to identify the cause. This may involve verifying the error in the PSM, checking for omitted covariates, or ensuring that the parameters are correctly included. If the estimated treatment effects remain valid across multiple datasets, it strengthens the validity of causal inference. However, in this study, a separate sensitivity analysis was not conducted.

Propensity score weighting and stratification

While this study primarily focused on matching between two groups, alternative techniques such as PSW and stratification also deserve attention. PSW uses the inverse probability of treatment weights (IPTW), calculated from the propensity score of each subject, to analyze the data. This method strives to equalize covariate distributions across treatment and control groups, improving their comparability. Unlike matching, it includes all available data, enhancing sample efficiency. However, extremely large or small weights can skew the results. To minimize this issue, truncation or stabilization techniques should be applied [25,26].
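The IPTW calculation, with stabilization and truncation, can be illustrated in base R on simulated data (not the article's dataset):

```r
# Minimal base-R sketch of IPTW with stabilization and truncation
set.seed(42)
n  <- 500
x  <- rnorm(n)                       # a single confounder
tr <- rbinom(n, 1, plogis(0.5 * x))  # treatment depends on x

# Propensity scores from a logistic regression of treatment on x
ps <- fitted(glm(tr ~ x, family = binomial))

# Unstabilized ATE weights: 1/ps for treated, 1/(1 - ps) for controls
w <- ifelse(tr == 1, 1 / ps, 1 / (1 - ps))

# Stabilized weights: multiply by the marginal treatment probability,
# which pulls the mean weight toward 1 and reduces variance
w_stab <- ifelse(tr == 1, mean(tr) / ps, (1 - mean(tr)) / (1 - ps))

# Truncation at the 1st and 99th percentiles limits extreme weights
w_trunc <- pmin(pmax(w, quantile(w, 0.01)), quantile(w, 0.99))
```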

Instead of directly applying weighted analysis, another approach has been introduced to construct matched sub-controls based on the IPTW in studies with asymmetric sample sizes, particularly for those with larger control groups. Using these matched sub-controls, a post-genome-wide association study (GWAS) analysis was conducted, identifying additional candidate variants associated with the phenotypes [27].

PSS includes partitioning the dataset into multiple strata based on propensity scores, calculating treatment effects within each stratum, and aggregating these effects. This method helps reduce bias stemming from unobserved confounding variables, leading to more precise and stable estimates. Typically, PSS involves creating strata of equal size based on the propensity score distribution.
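A base-R sketch of this stratification, using illustrative simulated data with a known treatment effect: scores are cut into five equal-sized strata, stratum-specific mean differences are computed, and a sample-size-weighted average is taken.

```r
# Propensity score stratification on simulated data (true effect = 0.5)
set.seed(7)
n  <- 1000
x  <- rnorm(n)                       # confounder
tr <- rbinom(n, 1, plogis(x))        # treatment depends on x
y  <- 1 + 0.5 * tr + x + rnorm(n)    # outcome depends on both

ps <- fitted(glm(tr ~ x, family = binomial))

# Five strata of equal size based on propensity score quintiles
stratum <- cut(ps, quantile(ps, 0:5 / 5), include.lowest = TRUE, labels = 1:5)

# Mean outcome difference (treated - control) within each stratum
diffs <- sapply(levels(stratum), function(s) {
  idx <- stratum == s
  mean(y[idx & tr == 1]) - mean(y[idx & tr == 0])
})

# Aggregate across strata, weighting by stratum size (ATE-style)
ate <- weighted.mean(diffs, table(stratum))
ate  # should be near the true effect of 0.5 (some residual bias remains)
```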

Propensity score analyses with nonbinary treatment

Propensity score analysis can be extended to treatment variables with three or more levels or continuous causal variables. However, no established methodology or tool developed explicitly for PSM exists for these cases. Typically, for treatment variables with three or more levels, a link function based on a multinomial distribution is specified. For continuous variables, a generalized linear regression model is fitted using distributions such as Gaussian or Gamma, depending on the data distribution. Propensity scores are then estimated using these models, and PSW analysis is applied accordingly.

Fundamental approach to treatment effect evaluation

In observational studies, treatment effects are often categorized into ‘average treatment effect for the treated (ATT),’ ‘average treatment effect for the control (ATC),’ or ‘average treatment effect (ATE)’ [28,29]. ATT refers to the average effect of the treatment on the treated group compared with the matched control group. ATC represents the average effect of the treatment on the control group compared with the matched treated group. ATE calculates the average effect of the treatment for the entire population. In this study, PSM primarily aimed to estimate the treatment effect using control groups similar to the treatment group, thus corresponding to ATT. Conversely, PSW analysis estimates the treatment effect based on the entire population, aligning with ATE.

Missing imputation before propensity score analyses

In observational studies, the rate of missing values is generally higher than in experimental studies, and it is common for the probability of missing covariates to differ between the treatment and control groups. Removing missing values on a pairwise basis can significantly distort the treatment effect. Therefore, when analyzing data with missing values, it is recommended to consider the assumptions underlying the missing data mechanisms, such as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), as well as missing data imputation techniques, such as maximum likelihood and multiple imputation. After addressing missing data, conducting the propensity score analysis is advisable [30]. In R, packages like 'mi,' 'mice,' and 'missForest' can help perform missing data imputation [31,32].
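A sketch of multiple imputation with the mice package before a propensity score analysis; `dat` is assumed to contain the treatment, outcome, and covariates with some missing values:

```r
library(mice)  # assumes the 'mice' package is installed and 'dat' exists

# Create 5 imputed datasets with predictive mean matching
imp <- mice(dat, m = 5, method = "pmm", seed = 123)

# One pragmatic approach: run the propensity score analysis within each
# completed dataset and pool the resulting effect estimates
completed1 <- complete(imp, 1)  # first completed dataset
```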

This study introduced PSM to reduce bias in observational studies, providing a comprehensive overview of each step involved in the process. It demonstrated the practical application using an example in R and introduced the GUI program Rex. However, this tutorial is limited to matching treatment variables between two groups, indicating a need for future research to expand PSM to include a treatment variable with three or more levels or continuous causal variables. Additionally, exploring other analysis techniques, such as PSW, stratification, and handling missing data through imputation methods, is a beneficial area for further investigation.

Author’s contributions

Conceptualization, Investigation: BL, SW, JG

Funding acquisition: BL, JG

Project administration: JG

Visualization: BL, NK

Writing–original draft: BL, JG

Writing–review & editing: All authors

All authors read and approved the final manuscript.

Conflict of interest

All authors have no conflicts of interest to declare.

Funding/support

This research was supported by the Global-Learning & Academic Research Institution for Master’s·PhD students, and Postdocs (LAMP) Program of the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (No. RS-2023-00285353).

Data availability

All the R code in this paper is available at http://www.gimlab.online.

Supplementary materials

Supplementary materials can be found via https://doi.org/10.7602/jmis.2024.27.2.55.

jmis-27-2-55-supple.pdf
  1. Biebl M, Alkatout I. Recent advances in minimally invasive surgery. MDPI-Multidisciplinary Digital Publishing Institute; 2021.
  2. Meinzer A, Alkatout I, Krebs TF, et al. Advances and trends in pediatric minimally invasive surgery. J Clin Med 2020;9:3999.
  3. Tang K, Goldman S, Avrumova F, Lebl DR. Background, techniques, applications, current trends, and future directions of minimally invasive endoscopic spine surgery: a review of literature. World J Orthop 2023;14:197-206.
  4. Lin T, Xie Q, Peng T, Zhao X, Chen D. The role of robotic surgery in neurological cases: a systematic review on brain and spine applications. Heliyon 2023;9:e22523.
  5. Ming J, He Y, Yang Y, et al. Health technology assessment of medical devices: current landscape, challenges, and a way forward. Cost Eff Resour Alloc 2022;20:54.
  6. van Lieshout C, Frederix GW, Schoonhoven L. Economic evaluations in medical technological innovations: a mapping review of methodologies. Cost Eff Resour Alloc 2024;22:23.
  7. Millar R, Morton A, Bufali MV, et al. Assessing the performance of health technology assessment (HTA) agencies: developing a multi-country, multi-stakeholder, and multi-dimensional framework to explore mechanisms of impact. Cost Eff Resour Alloc 2021;19:37.
  8. Imbens GW, Rubin DB. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press; 2015.
  9. Pirracchio R, Resche-Rigon M, Chevret S. Evaluation of the propensity score methods for estimating marginal odds ratios in case of small sample size. BMC Med Res Methodol 2012;12:70.
  10. Huber M, Lechner M, Wunsch C. The performance of estimators based on the propensity score. J Econom 2013;175:1-21.
  11. Pan W, Bai H. Propensity score methods for causal inference: an overview. Behaviormetrika 2018;45:317-334.
  12. Leite W. Practical propensity score methods using R. Sage Publications; 2016.
  13. Raad H, Cornelius V, Chan S, Williamson E, Cro S. An evaluation of inverse probability weighting using the propensity score for baseline covariate adjustment in smaller population randomised controlled trials with a continuous outcome. BMC Med Res Methodol 2020;20:70.
  14. Wan F. Matched or unmatched analyses with propensity-score-matched data? Stat Med 2019;38:289-300.
  15. Ho D, Imai K, King G, Stuart EA. MatchIt: nonparametric preprocessing for parametric causal inference. J Stat Softw 2011;42:1-28.
  16. Greifer N. Covariate balance tables and plots: a guide to the cobalt package [Internet]. Published 2024 Apr 2. Available from: https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt.html.
  17. Greifer N. Using WeightIt to estimate balancing weights [Internet]. Published 2024 Jun 9. Available from: https://ngreifer.github.io/WeightIt/articles/WeightIt.html.
  18. Burgette L, Griffin BA, McCaffrey D. Propensity scores for multiple treatments: a tutorial for the mnps function in the twang package. Rand Corporation; 2017.
  19. Lee B, An J, Lee S, Won S. Rex: R-linked EXcel add-in for statistical analysis of medical and bioinformatics data. Genes Genomics 2023;45:295-305.
  20. Zhang Z, Kim HJ, Lonjon G, Zhu Y; written on behalf of AME Big-Data Clinical Trial Collaborative Group. Balance diagnostics after propensity score matching. Ann Transl Med 2019;7:16.
  21. Carnegie NB, Harada M, Hill JL. Assessing sensitivity to unmeasured confounding using a simulated potential confounder. J Res Educ Eff 2016;9:395-420.
  22. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41-55.
  23. Rosenbaum PR. In: Zubizarreta JR, Stuart EA, Small DS, Rosenbaum PR, eds. Handbook of matching and weighting adjustments for causal inference. Chapman and Hall/CRC; 2023:21-38.
  24. Rudolph KE, Stuart EA. Using sensitivity analyses for unobserved confounding to address covariate measurement error in propensity score methods. Am J Epidemiol 2018;187:604-613.
  25. Gürel S, Leite WL. Comparison of propensity score weighting methods to remove selection bias in average treatment effect estimates. Int J Turk Educ Stud 2023;11:989-1031.
  26. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One 2011;6:e18174.
  27. Gim J, Choi S, Im J, Kim JK, Park T. A post-hoc genome-wide association study using matched samples. Int J Data Min Bioinform 2016;14:197-209.
  28. Guo S, Fraser MW. Propensity score analysis: statistical methods and applications. 2nd ed. SAGE Publications; 2014.
  29. Morgan SL, Winship C. Counterfactuals and causal inference: methods and principles for social research. 2nd ed. Cambridge University Press; 2014.
  30. D'Agostino RB Jr, Rubin DB. Estimating and using propensity scores with partially missing data. J Am Stat Assoc 2000;95:749-759.
  31. Allison PD. In: Millsap RE, Maydeu-Olivares A, eds. The Sage handbook of quantitative methods in psychology. Sage Publications; 2009:72-89.
  32. Enders CK. Missing data: an update on the state of the art. Psychol Methods 2023 [accepted manuscript]. DOI: 10.1037/met0000563.

Keywords: Propensity score, Matching, Selection bias, R, Rex

INTRODUCTION

The demand for minimally invasive surgeries has grown since they minimize the surgical site and bleeding, leading to faster recovery times. Consequently, new equipment and technologies incorporating minimally invasive treatments are being developed for various conditions, such as degenerative brain disorders and spinal diseases in older adults and brain tumor removal in infants [1-4]. Accurate evaluations of these new treatments are crucial for several reasons: (1) they clarify the benefits of the new technologies and the potential side effects; (2) they update the guidelines to offer better strategies in clinical practice; and (3) they ensure efficient allocation of medical resources [5,6]. Additionally, future medical research can focus on effective technologies and encourage investment in promising ones [7].

Selecting appropriate outcome indicators and comparable controls is essential to obtain precise treatment effects. If unrelated outcomes are chosen, or if the control group differs from the treatment group, it becomes challenging to determine whether the effects are due to the treatment or other factors. Selecting appropriate outcomes is relatively straightforward, but ensuring well-balanced characteristics between the treatment and control groups is generally problematic. In observational studies, the characteristics of the treatment and control groups usually differ. Fortunately, there are various statistical approaches to strengthen causal inference by controlling for confounders (Fig. 1) [8]. One such method is the propensity score.

Figure 1. Research design and analyses in observational studies for causal inference.

Propensity scores can be applied in several ways, such as matching, weighting, and stratification. Propensity score-based analyses perform efficient evaluations even with a small sample size [9,10]. Notably, they ensure the validity and generalizability of the result in causal inference, whereas conventional analyses infer only association, not causality [11].

This study aimed to provide a practical guide to propensity score analyses, focusing on matching. Unlike previous tutorials on propensity score matching (PSM), this study covers traditional and machine learning-based matching with detailed descriptions using the recently updated R packages. It also offers an easy way of performing PSM with an Excel add-in statistical program called Rex, providing point-and-click analysis for researchers unfamiliar with R programming. To this end, we explained the concept of propensity score and outlined the step-by-step process of PSM, along with available statistical programs. As a practical example, we used PSM to compare the surgical outcomes between patients who underwent noninvasive high-intensity focused ultrasound (HIFU) brain surgery and conventional brain resection surgery. Finally, we summarize additional considerations not covered in this study.

PROPENSITY SCORE MATCHING

Definition of propensity score

When comparing the outcomes of treatment and control groups, differing characteristics, aside from the treatment itself, can make it difficult to determine whether the differences in outcomes are due to the treatment or other factors. This leads to a biased estimate of the treatment effect, with the other influencing factors referred to as ‘confounders.’

One way to obtain an unbiased effect size is through randomized controlled trials (RCTs), which randomly assign subjects to the treatment or control groups. However, when random assignment is not feasible due to the nature of the study, observational studies are necessary. In this case, propensity scores are used to control for confounders in a manner similar to random assignment.

To better understand propensity scores, consider a balanced two-arm RCT. The concept of random assignment in RCTs can be expressed mathematically as P(Xi = 1 | C1i, C2i, ···, Cki) = 0.5, where Xi = 1 indicates that the ith subject is assigned to the treatment group. This equation states that the probability of a subject with k characteristics (C1i, C2i, ···, Cki) being assigned to the treatment group is 0.5, leaving a 0.5 probability of assignment to the control group. In other words, the assignment probabilities are equal between the two groups regardless of a subject's characteristics. It has been shown that, as a result, the k characteristics (C1i, C2i, ···, Cki) of subjects assigned to the two groups are similar.

When adopting this concept to observational studies, the propensity score estimates the probability that the ith subject is assigned to the treatment group using covariates. PSM pairs subjects from both groups with similar scores to emulate the effects of random assignment.

Propensity score analysis process

Propensity score analyses can be broadly classified into the following six steps (Fig. 2).

Figure 2. Process of propensity score analysis.

1. Set variables: Define the causal variable (treatment), outcome variable, and confounding variables.

2. Estimate propensity scores: Choose the optimal model to estimate the propensity scores based on the defined variables.

3. Perform analyses: Use the estimated propensity scores to conduct analyses, such as matching, weighting, and stratification.

4. Check balance: Assess the balance of propensity scores and covariates between the treatment and control groups.

5. Evaluate treatment effect: From the balanced data, evaluate the treatment effect and conduct a sensitivity analysis on the effect size.

6. Iterate if necessary: If covariate imbalance is detected after propensity score analysis, or if the treatment effect estimates are not robust on the sensitivity analysis, redo the process starting from variable selection or propensity score estimation (PSE).

Propensity score matching

The techniques for propensity score analyses are as follows: (1) PSM, which matches subjects with similar propensity scores from the treatment and control groups; (2) propensity score weighting (PSW), which calculates inverse probability of treatment weights from the propensity scores to adjust the covariate distributions of the two groups for comparability; and (3) propensity score stratification (PSS), which stratifies subjects from both groups into strata of similar propensity scores, then calculates and integrates the treatment effects estimated from each stratum [11].

This study focused on PSM, the most frequently used technique among the various propensity score analysis techniques. Generalized linear models (GLMs) and their extension with regularization terms were introduced for propensity score estimation. Machine learning models were also added, including a generalized boosted model (GBM), random forest, Bayesian additive regression trees (BART), and a neural network model. In total, four algorithms were used for matching: nearest neighbor matching, optimal matching, full matching, and genetic matching.

Matched data analyses for treatment effect evaluation

Treatment effects can be estimated if covariate balance is achieved based on the propensity scores. The fundamental approach in causal inference is to estimate the average treatment effect using weighted means. However, due to a lack of consensus on computing the standard error of the estimates, bootstrapping or non-parametric tests are used to estimate the 95% confidence intervals or determine statistical significance [11,12]. Alternatively, a weighted generalized linear regression model can be fitted to estimate the treatment effect [13].

In practice, the most widely used method is to treat the matched data as a regular dataset and apply conventional statistical analyses. In this case, the matched data can be regarded as pooled data, and the analyses can be conducted under the independence assumption. Alternatively, pairwise or conditional analyses can be performed based on the matched pairs [14]. Since there is still debate about the suitable approach, researchers are advised to follow the methods commonly used in their field. The available methods for each case are shown in Fig. 3.

Figure 3. Matched data analyses based on outcome type and independence assumption.

Statistical programs for propensity score matching

Various statistical programs, including R, STATA, SAS, and Python, can analyze propensity scores. Each program possesses unique features and strengths, and the selection of a specific program hinges on the characteristics of the data and the requirements of the analysis. This study introduces practical applications using R and Rex. R is open-source statistical software that is freely available and continually enhanced by its users. R packages for propensity score analyses include MatchIt, cobalt, WeightIt, twang, and others [15-18]. This study employed the MatchIt and cobalt packages for PSM. In contrast, Rex is an Excel add-in program that offers various R-based statistical analysis modules in a graphical user interface (GUI) format [19]. Rex includes complimentary modules for descriptive statistics, group comparison, categorical data analysis, and survival analysis. However, certain modules, such as PSM and structural equation modeling, require purchase (a 2-week free trial is available).

PRACTICAL EXAMPLE USING R

This section describes an example of propensity score matching using R. Implementation using Rex is presented in Supplementary Material 1.

Example data

HIFU is a noninvasive surgical technique used to treat target areas in the brain without the need for an incision, offering shorter recovery times and reduced infection risks compared with conventional deep brain surgery (DBS). In this example, PSM was conducted between the HIFU treatment group (treatment = 1) and the DBS control group (treatment = 0). Covariates included age (in years), sex (1 = male, 2 = female), presence of comorbidities (0 = no, 1 = yes; referred to as "comor"), and the preoperative symptom score (in points; referred to as "pre_sx"). The objective was to estimate the effect of HIFU treatment on the postoperative symptom score. The data for this analysis were generated using the R code provided in Supplementary Material 2.

Categorical covariates were converted to factors after importing the data saved as a CSV file. Note that entering categorical covariates as hand-made dummy variables (one fewer than the number of levels) in the model can yield different propensity scores or matching outcomes; converting them to factors in R is therefore recommended instead. However, machine learning models cannot be fitted if the treatment variable is not numeric and coded as 0 and 1, so the treatment variable should not be converted to a factor even though it is categorical.
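This type-conversion step can be sketched as follows; a few illustrative rows stand in for the imported CSV file (the real data are generated in Supplementary Material 2):

```r
# Illustrative rows standing in for the imported CSV data
dat <- data.frame(
  treatment = c(1, 0, 1, 0),
  age       = c(63, 58, 70, 66),
  sex       = c(1, 2, 2, 1),
  comor     = c(0, 1, 1, 0),
  pre_sx    = c(55, 48, 60, 52)
)

# Convert categorical covariates to factors (recommended over hand-made
# dummy variables, which can change the propensity scores or matching)
dat$sex   <- factor(dat$sex,   levels = c(1, 2), labels = c("male", "female"))
dat$comor <- factor(dat$comor, levels = c(0, 1), labels = c("no", "yes"))

# The treatment variable stays numeric and coded 0/1 so that machine
# learning distance methods can still be fitted
str(dat)
```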

Basic usage of MatchIt::matchit()

The ‘matchit()’ function available in the MatchIt package (version 4.5.5) facilitates PSE and PSM. Table 1 outlines the primary function arguments for PSM and PSE for two treatment groups.

Table 1. Primary arguments in MatchIt::matchit() for PSM

For use | Argument | Format (with default values) | Details
PSE or PSM | formula(a) | formula = treatment ~ covariate1 + covariate2 + … | The model representing the relationship between the treatment and covariates for propensity score estimation. Only the treatment variable is entered for matching based on a precalculated propensity score vector
PSE or PSM | data | data = dataframe | A data frame containing the variables specified in the formula
PSE or PSM | distance | distance = 'glm' | The method for calculating the propensity score used for matching
PSE or PSM | link | link = 'logit' | The link function when the distance is set to 'glm', 'ridge', 'lasso', 'elasticnet', or 'gbm'
PSE or PSM | distance.options | distance.options = list() | Model-specific options for computing the propensity score
PSM | method | method = 'nearest' | The method for matching
PSM | exact | exact = NULL | Exact matching on specific variables
PSM | ratio | ratio = 1 | The number of control subjects to be matched with one treated subject
PSM | replace | replace = FALSE | Matching without replacement (each control subject can be matched to only one treated subject)
PSM | discard | discard = 'none' | How to manage subjects that fall outside the common support region of the propensity scores before matching
PSM | reestimate | reestimate = FALSE | Whether to reestimate the propensity scores after some subjects have been discarded based on the 'discard' option
PSM | caliper | caliper = NULL | The width of the caliper to be used in the matching
PSM | std.caliper | std.caliper = TRUE | How the caliper is measured
PSM | m.order | m.order = NULL | The order in which the matching is performed

PSE, propensity score estimation; PSM, propensity score matching.

a) The argument 'formula' must be specified in any case.



The basic structure of the analysis code is as follows. To conduct only PSE, utilize the ‘formula,’ ‘data,’ ‘distance,’ ‘link,’ and ‘distance.options’ parameters. For PSM based on pre-estimated propensity scores, input the relevant propensity score vector in ‘distance’ and specify the treatment variable in ‘formula.’ When performing both PSE and PSM simultaneously, all arguments are applied. Default values are used for arguments that are not explicitly designated.
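A minimal sketch of this structure is shown below; the data frame name ‘dat’, the variable names, and the ‘my_ps_vector’ object are assumptions based on the example dataset, not part of the original code.

```r
library(MatchIt)

# (a) PSE only: estimate propensity scores without matching (method = NULL)
ps_only <- matchit(treat ~ age + sex + comor + pre_sx,
                   data = dat, method = NULL,
                   distance = "glm", link = "logit")

# (b) PSM on pre-estimated scores: supply the score vector to 'distance'
ps_pre <- matchit(treat ~ age + sex + comor + pre_sx,
                  data = dat, distance = my_ps_vector)

# (c) PSE and PSM together: unspecified arguments keep their defaults
ps_both <- matchit(treat ~ age + sex + comor + pre_sx,
                   data = dat, method = "nearest",
                   distance = "glm", link = "logit")
```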

Step 1: Variable selection

For PSE, the causal variable (treatment), the outcome variable, and the covariates expected to be associated with the treatment and outcome variables need to be set [12]. Among the covariates that precede or co-occur with the treatment, a variable correlated with both the treatment and outcome is called a ‘confounder.’ In contrast, if a treatment affects an intermediate indicator, which affects the outcome, then that intermediate indicator is called a ‘mediator.’ Controlling for mediators in the model can lead to underestimation of the treatment effect. Therefore, researchers should review relevant factors from previous studies on the causal relationship when estimating propensity scores. Mediators should be excluded, and confounders should be included in the model for PSE.

In this example, the surgery type (‘treat’) is the causal variable, and the postoperative symptom score (‘post_sx’) is the outcome. Confounders related to both the treatment assignment and the outcome are age, sex, presence of comorbidities (‘comor’), and the preoperative symptom score (‘pre_sx’). These were selected as covariates for the PSE.

Step 2: Propensity score estimation

Methods for estimating propensity scores can be divided into parametric and non-parametric methods. The most commonly used parametric method is the GLM, including logistic regression. By specifying probability distributions and link functions, GLMs can be applied to binary, multinomial, and continuous treatment variables. These models can also be extended to regularized regression models such as Ridge, Lasso, and Elastic Net, which add penalty terms to the model’s loss function.

Recently, machine learning techniques have been widely used for estimating propensity scores. These non-parametric methods do not assume a specific distribution and can model complex patterns inherent in data. However, it is necessary to adjust the hyperparameters to prevent overfitting.

In the ‘matchit()’ function, ‘method=NULL’ allows for PSE without matching. Table 2 lists the available models for estimation in ‘matchit().’ To build various PSE models, refer to the model-specific options for each method.

Table 2. Specifications of each model in matchit() for propensity score estimation

Argument | Details | Utilized function | Specific options
distance = ‘glm’ (default) or distance = ‘ps’ | A generalized linear model (e.g., logistic regression) | glm() | family, link
distance = ‘ridge’ | A generalized linear model with L2 regularization (Ridge model) | glmnet::cv.glmnet() | family, link
distance = ‘lasso’ | A generalized linear model with L1 regularization (Lasso model) | glmnet::cv.glmnet() | family, link
distance = ‘elasticnet’ | A generalized linear model with L1 and L2 regularization (Elastic Net model) | glmnet::cv.glmnet() | family, link, alpha
distance = ‘gbm’ | A generalized boosted model | gbm::gbm() | n.trees, interaction.depth, shrinkage, bag.fraction, n.minobsinnode
distance = ‘randomforest’ | A random forest | randomForest::randomForest() | mtry, ntree, nodesize, sampsize, maxnodes
distance = ‘bart’ | Bayesian additive regression trees (BART) | dbarts::bart2() | num.trees, num.burn, num.iter, alpha, beta, sigma
distance = ‘nnet’ | A single-hidden-layer neural network | nnet::nnet() | size, decay, maxit, linout, trace


Generalized linear model with various link functions

In GLMs, various models can be specified by the link function. The logistic regression model, which is most widely used for a binary treatment group, uses the link function of logit (link = ‘logit’). Features of each link function are listed in Table 3.

Table 3. Link functions in matchit()

Argument | Function | Model | Suitable case
link = ‘logit’ | log(p / (1 − p)) | Logistic regression | Commonly used for binary outcomes in practice
link = ‘probit’ | Φ⁻¹(p) | Probit regression | Used for binary outcomes with more extreme values close to 0 or 1
link = ‘cloglog’ | log(−log(1 − p)) | Complementary log-log regression | Used for binary outcomes of rare events
link = ‘cauchit’ | tan(π(p − 0.5)) | Cauchit regression | Used for binary outcomes with heavy tails; very sensitive to outliers


Here is an example of fitting a GLM with a logit link function by setting ‘distance = ‘glm’’ and ‘link = ‘logit’’. Because the default probability distribution is binomial (family = ‘binomial’), no additional ‘distance.options’ were specified.
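A minimal sketch of this call is shown below; the object name ‘glm1’ follows the text, while the data frame name ‘dat’ and the variable names are assumptions based on the example dataset.

```r
library(MatchIt)

# Logistic regression PSE without matching (method = NULL);
# family = 'binomial' is the default, so no distance.options are needed
glm1 <- matchit(treat ~ age + sex + comor + pre_sx,
                data = dat, method = NULL,
                distance = "glm", link = "logit")

head(glm1$distance)  # the estimated propensity scores
```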

Generalized linear model with regularization

Propensity scores can also be estimated using regularized regression models, which add a regularization term to the loss function of the generalized linear model discussed in the previous section. Specifying ‘ridge,’ ‘lasso,’ or ‘elasticnet’ for the ‘distance’ parameter in the ‘matchit()’ function performs Ridge, Lasso, or Elastic Net regression, respectively. Ridge regression handles multicollinearity, and Lasso regression performs variable selection by shrinking some weights to zero. Elastic Net combines the strengths of both models.

Special options for these models include the parameter lambda, which adjusts the regularization strength. The ‘cv.glmnet()’ function in the glmnet package, used internally by ‘matchit(),’ automatically selects the optimal lambda value, so it does not need to be specified separately. For Elastic Net, however, an additional parameter alpha adjusts the proportion of L1 regularization, with a default value of 0.5. Below is example code for fitting Ridge and Elastic Net models based on logistic regression; the ‘glmr2’ model corresponds to an Elastic Net model with 25% L1 (lasso) regularization.
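A minimal sketch of these two fits is shown below; the object names ‘glmr2’ (from the text) and ‘glmr1’, the data frame name ‘dat’, and the variable names are assumptions based on the example dataset.

```r
library(MatchIt)

# Ridge-penalized logistic regression; lambda is chosen internally
# by glmnet::cv.glmnet(), so it is not specified here
glmr1 <- matchit(treat ~ age + sex + comor + pre_sx,
                 data = dat, method = NULL,
                 distance = "ridge", link = "logit")

# Elastic Net with alpha = 0.25, i.e., 25% L1 (lasso) regularization
glmr2 <- matchit(treat ~ age + sex + comor + pre_sx,
                 data = dat, method = NULL,
                 distance = "elasticnet", link = "logit",
                 distance.options = list(alpha = 0.25))
```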

Generalized boosted model

The GBM is a boosting technique that combines multiple decision trees, iteratively updating observation weights so that each new tree reduces the errors of the previous one. The final estimated probability, used as the propensity score, aggregates the predicted probabilities of all the generated trees.

The GBM effectively captures nonlinear relationships and interaction effects. However, to avoid overfitting, it is necessary to tune the parameters, such as the number of trees (n.trees), depth (interaction.depth), and learning rate (shrinkage). The hyperparameters available in this model are shown in Table 4.

Table 4. Distance options for generalized boosted model in matchit()

Argument | Default value | Details | Impact
n.trees | 5,000 | The number of trees to be generated | More trees can improve model fit on the training set but increase the risk of overfitting.
interaction.depth | 3 | The maximum depth of each tree, or allowable interactions | Greater depth can capture complex patterns better but may lead to overfitting.
shrinkage | 0.01 | Learning rate: a scalar coefficient multiplied to each tree | Lower values make the model learn more slowly and can improve generalization, but too low a value makes training very slow.
n.minobsinnode | 10 | The minimum number of observations required in the terminal node of each tree | Larger values limit tree growth, helping to prevent overfitting.


The code below is an example of PSE based on a GBM. Explaining the details of the search strategies for the optimal hyperparameters is beyond the scope of this study. Please refer to the relevant code in ‘(1) Generalized boosted model’ section of Supplementary Material 3 as a reference.
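A minimal sketch of a GBM-based PSE call is shown below; the data frame name ‘dat’ and the variable names are assumptions based on the example dataset, and the hyperparameter values simply restate the defaults from Table 4 rather than tuned choices.

```r
library(MatchIt)

# GBM-based PSE; distance.options is passed through to gbm::gbm()
gbm1 <- matchit(treat ~ age + sex + comor + pre_sx,
                data = dat, method = NULL, distance = "gbm",
                distance.options = list(n.trees = 5000,
                                        interaction.depth = 3,
                                        shrinkage = 0.01,
                                        n.minobsinnode = 10))
```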

Random forest

A random forest model generates multiple bootstrap samples and builds a decision tree on each, considering a random subset of variables at every split. The model then combines the predictions of the individual trees to produce a final estimated probability, which serves as the propensity score. This method is robust against overfitting and is useful for evaluating the relative importance of variables, especially when there are many covariates whose individual importance is unknown. Relevant options are summarized in Table 5.

Table 5. Distance options for random forest in matchit()

Argument | Default value | Details | Impact
ntree | 500 | The number of trees to be generated | More trees can improve model stability and accuracy but increase computation.
mtry | sqrt(p) | The number of covariates to consider at each split, where p is the total number of covariates | Larger values can improve model accuracy but may increase the risk of overfitting and computation.
nodesize | 1 | The minimum number of observations required in the terminal node of each tree | Larger values make the trees less complex, which can help prevent overfitting but may reduce accuracy.
maxnodes | NULL | The maximum number of terminal nodes that trees can have | Limiting the number of nodes can reduce overfitting, but too small a value may result in underfitting.
importance | FALSE | Whether to compute variable importance | Variable importance can provide insights into the model but increases computation time.


The code below provides an example of PSE based on a random forest model. Explaining the details of the search strategies for the optimal hyperparameters is beyond the scope of this study. Please refer to the relevant code in ‘(2) Random forest’ section of Supplementary Material 3 as a reference.
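A minimal sketch of a random forest PSE call is shown below; the object name ‘rf1’ follows the text, while the data frame name ‘dat’, the variable names, and the option values are assumptions for illustration.

```r
library(MatchIt)

# Random forest PSE; distance.options is passed to randomForest::randomForest()
rf1 <- matchit(treat ~ age + sex + comor + pre_sx,
               data = dat, method = NULL, distance = "randomforest",
               distance.options = list(ntree = 500))
```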

Bayesian additive regression trees

BART is an ensemble model based on the Bayesian approach combining multiple regression trees. It employs the Markov chain Monte Carlo (MCMC) algorithm to sample parameters from Bayesian prior distributions, aiming to reduce errors from prior trees, thereby iteratively producing new regression trees.

For each generated tree, the final estimated probability is calculated by weighting predicted probabilities according to the contribution from the Bayesian posterior distribution. Relevant options are outlined in Table 6.

Table 6. Distance options for Bayesian additive regression trees (BART) in matchit()

Argument | Default value | Details | Impact
num_trees | 200 | The number of trees to be generated | More trees can improve model stability and accuracy but increase computation.
k | 2 | A hyperparameter controlling the variance of the tree predictions | Larger values reduce variance but can make the model more conservative, potentially underfitting.
alpha | 0.95 | The base terminal node probability in the tree-building process | Higher values increase the probability of smaller trees, reducing complexity and overfitting.
beta | 2 | A hyperparameter influencing the depth of the trees | Higher values encourage deeper trees, which make the model more complex, potentially overfitting.
n_min | 5 | The minimum number of observations required in the terminal node of each tree | Larger values limit tree growth, helping to prevent overfitting.
n_iter | 1,000 | The number of iterations for the MCMC algorithm | More iterations can improve model convergence and stability but increase computation time.
burn_in | 100 | The number of initial MCMC iterations to be discarded | Larger values ensure more stable models but increase computation time.

MCMC, Markov chain Monte Carlo.



This method effectively estimates propensity scores when there is a significant variance in the covariates between the treatment and control groups. Notably, the prediction from each tree not only provides estimates but also captures the uncertainty of these estimates via MCMC sampling. Consequently, the final estimated probabilities are computed to reflect this uncertainty. However, this comes with a substantial computational burden.

The code below is an example of a PSE based on BART. Explaining the details of the search strategies for the optimal hyperparameters is beyond the scope of this study. Please refer to the relevant code in ‘(3) Bayesian additive regression trees’ section of Supplementary Material 3 as a reference.
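A minimal sketch of a BART-based PSE call is shown below, using the default hyperparameters; the object name ‘bart1’, the data frame name ‘dat’, and the variable names are assumptions based on the example dataset.

```r
library(MatchIt)

# BART-based PSE; dbarts::bart2() is called internally by matchit().
# Defaults are kept here; hyperparameters from Table 6 can be passed
# via distance.options if tuning is desired.
bart1 <- matchit(treat ~ age + sex + comor + pre_sx,
                 data = dat, method = NULL, distance = "bart")
```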

Neural network model

A neural network is a network of interconnected neurons organized into layers. A neural network model (NNET) mimics this structure, with the covariates as the input layer and the treatment variable as the output layer. A basic network with only these two layers is equivalent to a generalized linear model.

Adding a layer with hidden neurons between these two layers creates a multilayer structure capable of capturing nonlinear relationships. Neurons in each hidden layer, often termed nodes, receive inputs from the previous layer, compute a weighted linear combination, apply an activation function, and pass the outputs to the subsequent layer. To prevent overfitting, a regularization parameter (decay) controls the magnitude of the weights, similar to Ridge regression. Increasing the number of layers and nodes per layer allows the model to capture more complex patterns in the data.

A single-hidden-layer neural network is specified by setting ‘distance = ‘nnet’’ in ‘matchit().’ The number of nodes in the hidden layer (size) and the regularization parameter (decay) are tuned to optimize the model (refer to Table 7). Owing to longer training times, this approach is recommended when computing resources are sufficient.

Table 7. Distance options for a single-hidden-layer neural network in matchit()

Argument | Default value | Details | Impact
size | 5 | The number of nodes in the hidden layer | More nodes can capture more complex patterns but increase the risk of overfitting and computation time.
decay | 0.1 | The weight decay parameter for regularization | Higher values help prevent overfitting but may lead to underfitting.
maxit | 100 | The maximum number of iterations for training the neural network | More iterations improve convergence but increase the risk of overfitting and computation time.
linout | FALSE | Specifies whether the activation function should be linear | For propensity scores of a binary treatment, a nonlinear activation function is more appropriate.
trace | TRUE | Whether to produce output during training | Outputs provide insight into the training process, which can be useful for debugging and monitoring, but also increase computation.


The code below is an example of PSE based on a neural network model. Explaining the details of the search strategies for the optimal hyperparameters is beyond the scope of this study. Please refer to the relevant code in ‘(4) Neural network model with a single-hidden-layer’ section of Supplementary Material 3.
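A minimal sketch of a neural network PSE call is shown below; the object name ‘nnet1’, the data frame name ‘dat’, and the variable names are assumptions, and the size and decay values simply restate the defaults from Table 7.

```r
library(MatchIt)

# Single-hidden-layer neural network PSE; distance.options is passed
# to nnet::nnet()
nnet1 <- matchit(treat ~ age + sex + comor + pre_sx,
                 data = dat, method = NULL, distance = "nnet",
                 distance.options = list(size = 5, decay = 0.1))
```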

Model selection for propensity score estimation

The criteria for selecting the optimal model for PSE can be divided into model fitness and the distribution of propensity scores. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are goodness-of-fit measures for generalized linear models, but they cannot be computed for machine learning models. Therefore, the propensity score distribution is used to select the most appropriate model.

The propensity score distribution between treatment and control groups is assessed by the common support region (CSR). When there is no overlap in propensity scores between the two groups, known as ‘separation,’ the groups cannot be compared, making it impossible to calculate treatment effects. In this case, adjusting the model and its parameters or redefining the covariate settings is crucial. Here are the code and results for inspecting the CSR of estimated propensity scores from glm1 and rf1 objects. Since glm1 demonstrates a broader and more uniform CSR compared with rf1, it is a better model (Fig. 4).
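One way to sketch this inspection is to plot the score distributions by treatment group for each model; the data frame name ‘dat’ and the variable names are assumptions, while ‘glm1’ and ‘rf1’ are the objects fitted above.

```r
# Side-by-side overlaid histograms of propensity scores by group
par(mfrow = c(1, 2))
for (m in list(glm1 = glm1, rf1 = rf1)) {
  hist(m$distance[dat$treat == 1], col = rgb(1, 0, 0, 0.4),
       xlim = c(0, 1), main = "Propensity scores",
       xlab = "Propensity score")
  hist(m$distance[dat$treat == 0], col = rgb(0, 0, 1, 0.4), add = TRUE)
}
```

A model whose two group histograms overlap over a broad range of scores has a wider common support region.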

Figure 4. Propensity score distribution from two models.

Step 3: Propensity score matching

Once optimal propensity scores are calculated, different matching methods are applied. Popular algorithms are nearest neighbor matching, optimal matching, full matching, and genetic matching. Table 8 summarizes the features of each algorithm and the scenarios in which they are useful.

Table 8. Matching algorithms in matchit()

Argument | Matching algorithm | Details | Suitable case
method = NULL | None | Estimate the propensity score without matching | None
method = ‘nearest’ (default) | Nearest neighbor matching | Match each subject to the closest propensity score within the designated caliper | Evenly distributed propensity scores
method = ‘optimal’ | Optimal pair matching | Select the optimal match from all possible matching combinations | High-dimensional data with many covariates; high imbalance between treated and control groups
method = ‘full’ | Optimal full matching | Find the overall optimal matching across all subjects | Useful when all subjects should be used
method = ‘genetic’ | Matching by genetic algorithm | Optimize covariate balance across groups with a genetic algorithm | High-dimensional data with many covariates, high imbalance between treated and control groups, or small sample size


Additionally, options used for matching include exact, ratio, replace, discard, reestimate, caliper, std.caliper, and m.order, as presented in Table 1. Table 9 outlines the detailed descriptions of each option.

Table 9. Options for propensity score matching in matchit()

Argument | Format | Details
exact | exact = ~ covariate1 + covariate2 | Exact matching on specific variables (in this format, match subjects with exactly the same values of covariate1 and covariate2)
ratio | 1 (default) | The number of control subjects to be matched with one treated subject
replace | FALSE (default) | Matching without replacement (each control subject is matched with at most one treated subject)
replace | TRUE | Matching with replacement (control subjects may be matched with several treated subjects)
discard | ‘none’ (default) | No units are discarded
discard | ‘control’ | Discard only control subjects with propensity scores outside the common support region
discard | ‘treated’ | Discard only treated subjects with propensity scores outside the common support region
discard | ‘both’ | Discard both treated and control subjects with propensity scores outside the common support region
reestimate | FALSE (default) | Use the original propensity scores for matching, even if some subjects are discarded
reestimate | TRUE | Reestimate the propensity scores using only the subjects remaining after the discard process
caliper | NULL (default; no caliper) | The width of the caliper to be used in matching
std.caliper | TRUE (default) | The caliper is defined as a fraction or multiple of the standard deviation of the logit of the propensity score
std.caliper | FALSE | The caliper is defined in the raw units of the propensity score
m.order | NULL (default) | Corresponds to ‘largest’
m.order | ‘random’ | Match in a random order
m.order | ‘smallest’ | Match subjects with the smallest propensity scores first
m.order | ‘largest’ | Match subjects with the largest propensity scores first


Using the propensity scores estimated from the glm1 model, nearest neighbor matching was conducted within a caliper of 10% of the standard deviation of the logit of the propensity scores, with age and sex required to match exactly. The code was set up for a 1:2 matching ratio without replacement. After matching, the results were examined using the ‘summary(matching1)’ function. Owing to the exact matching on age and sex, the final matched sample includes 11 subjects in the treatment group and 13 in the control group, totaling 24 subjects, which is relatively small (Fig. 5).
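A minimal sketch of this matching call is shown below; the object names ‘matching1’ and ‘glm1’ follow the text, while the data frame name ‘dat’ and the variable names are assumptions based on the example dataset.

```r
library(MatchIt)

# 1:2 nearest neighbor matching on the glm1 scores, exact on age and sex,
# caliper = 10% of the SD of the logit of the propensity score
matching1 <- matchit(treat ~ age + sex + comor + pre_sx,
                     data = dat, method = "nearest",
                     distance = glm1$distance,
                     exact = ~ age + sex,
                     caliper = 0.1, std.caliper = TRUE,
                     ratio = 2, replace = FALSE)

summary(matching1)  # counts of matched and unmatched subjects
```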

Figure 5. Number of matched and unmatched cases

Step 4: Covariate balance assessment

After PSM, it is essential to verify whether a balance has been achieved between the covariates of the treatment and control groups. This usually involves comparing the standardized covariates between the two groups and evaluating whether the means and variances of the propensity scores are similar. Histograms or other visualizations can be used to compare the propensity score distributions before and after matching.

When comparing group means and variances, since the units of each covariate are different, the propensity scores and covariates are standardized. The group difference in means and the ratio of variances are then calculated. These are respectively referred to as the standardized mean difference (SMD) or absolute SMD (ASMD) and the variance ratio (VR). Typically, if the absolute value of the SMD is less than 0.1 and the VR falls between 0.5 and 2.0, it is considered that there is no imbalance in the covariates and propensity scores [20].

Checking the covariate balance before and after matching can also be done with histograms or density plots. However, a Love plot is commonly used to visualize changes in the SMD before and after matching at a glance. If an imbalance in the covariates remains, it is recommended that the propensity scores be reestimated.

The following code was used to examine the SMD and VR for the matching1 object created in step 3. Histograms, density plots, and a Love plot were generated to visualize the covariate balance. Since all covariates had an SMD under 0.1 and a VR between 0.5 and 2.0, it was determined that a covariate balance was achieved (Fig. 6). The Love plot in Fig. 6C illustrates the changes in the ASMD for each covariate before and after matching. Initially, as depicted by the blue line, each covariate had an ASMD greater than 0.1, sorted from the highest ASMD starting with comorbidity. After matching, represented by the red line, the ASMD for all four covariates fell below 0.1, indicating an effective covariate balancing.
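A minimal sketch of this balance check is shown below, using MatchIt’s built-in summary and plot methods on the ‘matching1’ object from step 3.

```r
# Standardized mean differences and variance ratios after matching
s <- summary(matching1)
s$sum.matched[, c("Std. Mean Diff.", "Var. Ratio")]

# Visual checks of balance before and after matching
plot(matching1, type = "histogram")  # propensity score histograms
plot(matching1, type = "density")    # covariate density plots
plot(s)                              # Love plot of ASMD before/after matching
```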

Figure 6. Plot for checking the covariate balance. (A) Histogram of propensity score before and after propensity score matching (PSM). (B) Density plot of mean difference of each covariate before and after PSM. (C) Absolute standardized mean difference plot.

Step 5: Treatment effect evaluation

Once the post-PSM has achieved covariate balance, the matched data are utilized to perform analyses to evaluate the treatment effect.

Matched data extraction

The process for extracting the matched dataset is outlined below. Because the matched data were obtained from the matching1 object generated in step 3, age and sex are identical within each of the nine 1:2-matched sets (Fig. 7). In this dataset, the variable ‘ps’ represents the estimated propensity score, ‘matching weight’ denotes the sample weight resulting from the 1:2 matching, and ‘match_id’ indicates the pair number of the matching.
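A minimal sketch of the extraction is shown below; MatchIt’s match.data() returns only the matched rows, and here its default column names are renamed to match those described in the text (‘ps’, a matching weight, and ‘match_id’; the underscore in ‘matching_weight’ is an assumption).

```r
library(MatchIt)

# Extract the matched subset with renamed bookkeeping columns
md <- match.data(matching1,
                 distance = "ps",               # propensity score column
                 weights  = "matching_weight",  # 1:2 matching weights
                 subclass = "match_id")         # matched-set identifier

head(md)
```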

Figure 7. Matched data subset.

Treatment effect analysis

Postoperative symptom scores were dichotomized as improvement (post_imp = 1) for scores less than 40 and no improvement (post_imp = 0) otherwise. Two approaches evaluated this variable: (1) a standard logistic regression assuming independence of the two groups in the matched data, and (2) a conditional logistic regression on the matched pairs. The following code and results illustrate these analyses. Because this involves 1:k matching with k greater than or equal to 2, the sample weights are incorporated as weighting variables.
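A sketch of the two analyses is shown below, assuming the matched data frame ‘md’ with the column names ‘matching_weight’ and ‘match_id’ used above; the conditional model uses survival::clogit(), one common implementation, which requires method = "approximate" when weights are supplied.

```r
library(survival)

# Derive the binary improvement indicator
md$post_imp <- as.integer(md$post_sx < 40)

# (1) Weighted logistic regression, treating the groups as independent
fit1 <- glm(post_imp ~ treat, family = binomial,
            data = md, weights = matching_weight)
summary(fit1)

# (2) Conditional logistic regression within matched sets
fit2 <- clogit(post_imp ~ treat + strata(match_id),
               data = md, weights = matching_weight,
               method = "approximate")
summary(fit2)
```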

Step 6: Sensitivity analysis

In propensity score analyses, a sensitivity analysis evaluates the robustness of the estimated effect size. Various methods have been proposed to assess the impact of omitted variable bias when all relevant covariates are not included. However, implementing all the methods presented in Fig. 3 may not be practical [21–23].

A commonly used sensitivity analysis involves applying different models during the PSM and conducting the same analysis on multiple matched datasets derived from various matching methods [24]. When the effect size of a specific dataset is not robust, further investigation is necessary to identify the cause. This may involve verifying the error in the PSM, checking for omitted covariates, or ensuring that the parameters are correctly included. If the estimated treatment effects remain valid across multiple datasets, it strengthens the validity of causal inference. However, in this study, a separate sensitivity analysis was not conducted.

CONSIDERATIONS

Propensity score weighting and stratification

While this study primarily focused on matching between two groups, alternative techniques such as PSW and stratification also deserve attention. PSW uses the inverse probability of treatment weights (IPTW), calculated from the propensity score of each subject, to analyze the data. This method strives to equalize covariate distributions across treatment and control groups, improving their comparability. Unlike matching, it includes all available data, enhancing sample efficiency. However, extremely large or small weights can skew the results. To minimize this issue, truncation or stabilization techniques should be applied [25,26].

Instead of directly applying weighted analysis, another approach has been introduced to construct matched sub-controls based on the IPTW in studies with asymmetric sample sizes, particularly for those with larger control groups. Using these matched sub-controls, a post-genome-wide association study (GWAS) analysis was conducted, identifying additional candidate variants associated with the phenotypes [27].

Propensity score stratification (PSS) involves partitioning the dataset into multiple strata based on propensity scores, calculating treatment effects within each stratum, and aggregating these effects. This method helps reduce bias stemming from unobserved confounding variables, leading to more precise and stable estimates. Typically, PSS creates strata of equal size based on the propensity score distribution.

Propensity score analyses with nonbinary treatment

Propensity score analysis can be extended to treatment variables with three or more levels or continuous causal variables. However, no established methodology or tool developed explicitly for PSM exists for these cases. Typically, for treatment variables with three or more levels, a link function based on a multinomial distribution is specified. For continuous variables, a generalized linear regression model is fitted using distributions such as Gaussian or Gamma, depending on the data distribution. Propensity scores are then estimated using these models, and PSW analysis is applied accordingly.

Fundamental approach to treatment effect evaluation

In observational studies, treatment effects are often categorized into ‘average treatment effect for the treated (ATT),’ ‘average treatment effect for the control (ATC),’ or ‘average treatment effect (ATE)’ [28,29]. ATT refers to the average effect of the treatment on the treated group compared with the matched control group. ATC represents the average effect of the treatment on the control group compared with the matched treated group. ATE calculates the average effect of the treatment for the entire population. In this study, PSM primarily aimed to estimate the treatment effect using control groups similar to the treatment group, thus corresponding to ATT. Conversely, PSW analysis estimates the treatment effect based on the entire population, aligning with ATE.

Missing imputation before propensity score analyses

In observational studies, the missing values rate is generally higher than in experimental studies. It is common for the probability of missing covariates to differ between the treatment and control groups. Removing missing values on a pairwise basis can significantly distort the treatment effect. Therefore, when analyzing data with missing values, it is recommended to consider the assumptions underlying missing data mechanisms, such as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), as well as missing data imputation techniques, such as maximum likelihood and multiple imputation. After addressing missing data, conducting propensity score analysis is advisable [30]. In R, packages like ‘mi,’ ‘mice,’ and ‘missForest’ can help perform missing data imputation [31,32].

CONCLUSION

This study introduced PSM to reduce bias in observational studies, providing a comprehensive overview of each step involved in the process. It demonstrated the practical application using an example in R and introduced the GUI program Rex. However, this tutorial is limited to matching treatment variables between two groups, indicating a need for future research to expand PSM to a treatment variable with three or more levels or to continuous causal variables. Additionally, exploring other analysis techniques, such as PSW, stratification, and handling missing data through imputation methods, is a beneficial area for further investigation.

Notes

Author’s contributions

Conceptualization, Investigation: BL, SW, JG

Funding acquisition: BL, JG

Project administration: JG

Visualization: BL, NK

Writing–original draft: BL, JG

Writing–review & editing: All authors

All authors read and approved the final manuscript.

Conflict of interest

All authors have no conflicts of interest to declare.

Funding/support

This research was supported by the Global-Learning & Academic Research Institution for Master’s·PhD students, and Postdocs (LAMP) Program of the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (No. RS-2023-00285353).

Data availability

All the R code in this paper is available at http://www.gimlab.online.

Supplementary materials

Supplementary materials can be found via https://doi.org/10.7602/jmis.2024.27.2.55.

jmis-27-2-55-supple.pdf

Figure 1. Research design and analyses in observational studies for causal inference.

Figure 2. Process of propensity score analysis.

Figure 3. Matched data analyses based on outcome type and independence assumption.

Table 1. Primary arguments in MatchIt::matchit() for PSM.

| For use | Argument | Format (with default value) | Details |
| --- | --- | --- | --- |
| PSE or PSM | formula^a) | formula = treatment ~ covariate1 + covariate2 + … | The model representing the relationship between the treatment and covariates for propensity score estimation; only the treatment variable is entered for matching based on a pre-calculated propensity score vector |
| PSE or PSM | data | data = dataframe | A data frame containing the variables specified in the formula |
| PSE or PSM | distance | distance = 'glm' | The method for calculating the propensity score used for matching |
| PSE or PSM | link | link = 'logit' | The link function when distance is set to 'glm', 'ridge', 'lasso', 'elasticnet', or 'gbm' |
| PSE or PSM | distance.options | distance.options = list() | Model-specific options for computing the propensity score |
| PSM | method | method = 'nearest' | The method for matching |
| PSM | exact | exact = NULL | Exact matching on specific variables |
| PSM | ratio | ratio = 1 | The number of control subjects to be matched with one treated subject |
| PSM | replace | replace = FALSE | Matching without replacement (restricts each control subject to be matched with one treated subject) |
| PSM | discard | discard = 'none' | How to handle subjects that fall outside the common support region of the propensity scores before matching |
| PSM | reestimate | reestimate = FALSE | Whether to reestimate the propensity scores after some subjects have been discarded via the 'discard' option |
| PSM | caliper | caliper = NULL | The width of the caliper to be used in matching |
| PSM | std.caliper | std.caliper = TRUE | How the caliper is measured |
| PSM | m.order | m.order = NULL | The order in which matching is performed |

PSE, propensity score estimation; PSM, propensity score matching.

a)The argument 'formula' must be specified in any case.
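Putting the primary arguments together, a minimal call might look like the following sketch. Here `df`, `trt`, `age`, and `sex` are hypothetical stand-ins for the researcher's own data frame, binary treatment variable, and covariates; the argument values shown are the defaults from Table 1.

```r
library(MatchIt)

# 1:1 nearest neighbor matching on a logistic-regression propensity score.
m.out <- matchit(trt ~ age + sex,
                 data     = df,
                 method   = "nearest",   # matching algorithm (default)
                 distance = "glm",       # propensity score model (default)
                 ratio    = 1,           # one control per treated subject
                 replace  = FALSE)       # match without replacement

summary(m.out)               # covariate balance before and after matching
matched <- match.data(m.out) # matched subset for the outcome analysis
```

The matched data frame returned by match.data() is then analyzed with the outcome model appropriate to the study design.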


Table 2. Specifications of each model in matchit() for propensity score estimation.

| Argument | Details | Utilized function | Specific options |
| --- | --- | --- | --- |
| distance = 'glm' (default) or distance = 'ps' | A generalized linear model (e.g., logistic regression) | glm() | family, link |
| distance = 'ridge' | A generalized linear model with L2 regularization (ridge model) | glmnet::cv.glmnet() | family, link |
| distance = 'lasso' | A generalized linear model with L1 regularization (lasso model) | glmnet::cv.glmnet() | family, link |
| distance = 'elasticnet' | A generalized linear model with L1 and L2 regularization (elastic net model) | glmnet::cv.glmnet() | family, link, alpha |
| distance = 'gbm' | A generalized boosted model | gbm::gbm() | n.trees, interaction.depth, shrinkage, bag.fraction, n.minobsinnode |
| distance = 'randomforest' | A random forest | randomForest::randomForest() | mtry, ntree, nodesize, sampsize, maxnodes |
| distance = 'bart' | Bayesian additive regression trees (BART) | dbarts::bart2() | num.trees, num.burn, num.iter, alpha, beta, sigma |
| distance = 'nnet' | A single-hidden-layer neural network | nnet::nnet() | size, decay, maxit, linout, trace |
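Swapping the propensity score model only requires changing `distance` (and, if desired, passing model-specific settings through `distance.options`). The sketch below uses a random forest as an example; `df`, `trt`, `age`, `sex`, and `bmi` are hypothetical names.

```r
library(MatchIt)

# Propensity scores from a random forest instead of logistic regression;
# matchit() calls randomForest::randomForest() internally, and the options
# in `distance.options` are passed through to it (see Table 5).
m.rf <- matchit(trt ~ age + sex + bmi,
                data             = df,
                distance         = "randomforest",
                distance.options = list(ntree = 1000))
```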

Table 3. Link functions in matchit().

| Argument | Function | Model | Suitable case |
| --- | --- | --- | --- |
| link = 'logit' | log(p / (1 − p)) | Logistic regression | Commonly used for binary outcomes in practice |
| link = 'probit' | Φ⁻¹(p) | Probit regression | Binary outcomes with many extreme values close to 0 or 1 |
| link = 'cloglog' | log(−log(1 − p)) | Complementary log-log regression | Binary outcomes of rare events |
| link = 'cauchit' | tan(π(p − 0.5)) | Cauchit regression | Binary outcomes with heavy tails or influential outliers |
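The `link` argument is specified alongside `distance = 'glm'`. For instance, if treatment assignment is rare, the complementary log-log link from Table 3 might be chosen; variable names below are hypothetical.

```r
library(MatchIt)

# Same GLM propensity model as the default, but with a complementary
# log-log link, which can suit rare treatment assignment.
m.cll <- matchit(trt ~ age + sex,
                 data     = df,        # hypothetical data frame
                 distance = "glm",
                 link     = "cloglog")
```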

Table 4. Distance options for the generalized boosted model in matchit().

| Argument | Default value | Details | Impact |
| --- | --- | --- | --- |
| n.trees | 5,000 | The number of trees to be generated | More trees can improve model fit on the training set but increase the risk of overfitting. |
| interaction.depth | 3 | The maximum depth of each tree, or allowable interactions | Greater depth can capture complex patterns better but may lead to overfitting. |
| shrinkage | 0.01 | Learning rate: a scalar coefficient multiplied to each tree | Lower values make the model learn more slowly and can improve generalization, but values that are too low make training very slow. |
| n.minobsinnode | 10 | The minimum number of observations required in the terminal node of each tree | Larger values limit tree growth, helping to prevent overfitting. |
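The options in Table 4 are supplied through `distance.options` and passed to gbm::gbm(). The values below are purely illustrative, not tuning recommendations, and the data frame and variable names are hypothetical.

```r
library(MatchIt)

# Generalized boosted model for the propensity score, with a few of the
# gbm::gbm() settings from Table 4 overridden for illustration.
m.gbm <- matchit(trt ~ age + sex + bmi,
                 data     = df,
                 distance = "gbm",
                 distance.options = list(n.trees           = 3000,
                                         interaction.depth = 2,
                                         shrinkage         = 0.005))
```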

Table 5. Distance options for random forest in matchit().

| Argument | Default value | Details | Impact |
| --- | --- | --- | --- |
| ntree | 500 | The number of trees to be generated | More trees can improve model stability and accuracy but increase computation. |
| mtry | sqrt(p) | The number of covariates to consider at each split, where p is the total number of covariates | Larger values can improve model accuracy but may increase the risk of overfitting and computation. |
| nodesize | 1 | The minimum number of observations required in the terminal node of each tree | Larger values make the trees less complex, which can help prevent overfitting but may reduce accuracy. |
| maxnodes | NULL | The maximum number of terminal nodes that trees can have | Limiting the number of nodes can reduce overfitting, but too small a value may result in underfitting. |
| importance | FALSE | Whether to compute variable importance | Variable importance can provide insights into the model but increases computation time. |

Table 6. Distance options for Bayesian additive regression trees (BART) in matchit().

| Argument | Default value | Details | Impact |
| --- | --- | --- | --- |
| num_trees | 200 | The number of trees to be generated | More trees can improve model stability and accuracy but increase computation. |
| k | 2 | A hyperparameter controlling the variance of the tree predictions | Larger values reduce variance but can make the model more conservative, potentially underfitting. |
| alpha | 0.95 | The base terminal node probability in the tree-building process | Higher values increase the probability of smaller trees, reducing complexity and overfitting. |
| beta | 2 | A hyperparameter influencing the depth of the trees | Higher values encourage deeper trees, making the model more complex and potentially overfitting. |
| n_min | 5 | The minimum number of observations required in the terminal node of each tree | Larger values limit tree growth, helping to prevent overfitting. |
| n_iter | 1,000 | The number of iterations for the MCMC algorithm | More iterations can improve model convergence and stability but increase computation time. |
| burn_in | 100 | The number of initial MCMC iterations to be discarded | Larger values ensure more stable models but increase computation time. |

MCMC, Markov chain Monte Carlo.


Table 7. Distance options for a single-hidden-layer neural network in matchit().

| Argument | Default value | Details | Impact |
| --- | --- | --- | --- |
| size | 5 | The number of nodes in the hidden layer | More nodes can capture more complex patterns but increase the risk of overfitting and computation time. |
| decay | 0.1 | The weight decay parameter for regularization | Higher values help prevent overfitting but may lead to underfitting. |
| maxit | 100 | The maximum number of iterations for training the neural network | More iterations improve convergence but increase the risk of overfitting and computation time. |
| linout | FALSE | Whether the output activation function is linear | For the propensity score of a binary treatment, a nonlinear activation function is more appropriate. |
| trace | TRUE | Whether to produce output during training | Output provides insight into the training process, which can be useful for debugging and monitoring, but also increases computation. |

Table 8. Matching algorithms in matchit().

| Argument | Matching algorithm | Details | Suitable case |
| --- | --- | --- | --- |
| method = NULL | None | Estimate the propensity score without matching | None |
| method = 'nearest' (default) | Nearest neighbor matching | Match each subject to the closest propensity score within the designated caliper | Evenly distributed propensity scores |
| method = 'optimal' | Optimal pair matching | Select the optimal pairing from all possible matching combinations | High-dimensional data with many covariates, or high imbalance between treated and control groups |
| method = 'full' | Optimal full matching | Find the overall optimal matching using all subjects | Useful when all subjects should be used |
| method = 'genetic' | Matching by genetic algorithm | Optimize covariate balance across groups with a genetic algorithm | High-dimensional data with many covariates, high imbalance between treated and control groups, or small sample size |
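Choosing an algorithm from Table 8 only changes the `method` argument. The sketch below requests optimal full matching, which retains every subject; note that the 'optimal' and 'full' methods require the optmatch package to be installed. Variable names are hypothetical.

```r
library(MatchIt)

# Optimal full matching: all treated and control subjects are used,
# grouped into matched sets with optimal overall balance.
m.full <- matchit(trt ~ age + sex,
                  data   = df,     # hypothetical data frame
                  method = "full") # requires the optmatch package
```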

Table 9. Options for propensity score matching in matchit().

| Argument | Format | Details |
| --- | --- | --- |
| exact | exact = ~ covariate1 + covariate2 | Exact matching on specific variables (in this format, match subjects with exactly the same values of covariate1 and covariate2) |
| ratio | 1 (default) | The number of control subjects to be matched with one treated subject |
| replace | FALSE (default) | Matching without replacement (restrict each control subject to be matched with one treated subject) |
| | TRUE | Matching with replacement (allow control subjects to be matched with several treated subjects) |
| discard | 'none' (default) | No units are discarded |
| | 'control' | Discard only control subjects with propensity scores outside the common support region |
| | 'treated' | Discard only treated subjects with propensity scores outside the common support region |
| | 'both' | Discard both treated and control subjects with propensity scores outside the common support region |
| reestimate | FALSE (default) | Use the original propensity score for matching, even if some subjects are discarded |
| | TRUE | Reestimate the propensity score using only the subjects remaining after the discard process |
| caliper | NULL (default; no caliper) | The width of the caliper to be used in matching |
| std.caliper | TRUE (default) | The caliper is defined as a fraction or multiple of the standard deviation of the logit of the propensity score |
| | FALSE | The caliper is defined in the raw units of the propensity score |
| m.order | NULL (default) | Corresponds to 'largest' |
| | 'random' | Match in a random order |
| | 'smallest' | Match subjects with the smallest propensity scores first |
| | 'largest' | Match subjects with the largest propensity scores first |
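Several of the options in Table 9 are often combined in one call. The sketch below requests exact matching on one covariate, 1:2 matching, and a caliper of 0.2 standard deviations of the logit of the propensity score; all data and variable names are hypothetical, and the caliper width is illustrative.

```r
library(MatchIt)

# Nearest neighbor matching with exact matching on sex, two controls per
# treated subject, and a standardized caliper of 0.2.
m.cal <- matchit(trt ~ age + sex + bmi,
                 data        = df,
                 method      = "nearest",
                 exact       = ~ sex,      # match within levels of sex
                 ratio       = 2,          # 1:2 matching
                 caliper     = 0.2,        # width of the caliper
                 std.caliper = TRUE,       # in SD units of the logit of the PS
                 m.order     = "largest")  # match largest scores first
```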

References

1. Biebl M, Alkatout I. Recent advances in minimally invasive surgery. MDPI-Multidisciplinary Digital Publishing Institute; 2021.
2. Meinzer A, Alkatout I, Krebs TF, et al. Advances and trends in pediatric minimally invasive surgery. J Clin Med 2020;9:3999.
3. Tang K, Goldman S, Avrumova F, Lebl DR. Background, techniques, applications, current trends, and future directions of minimally invasive endoscopic spine surgery: a review of literature. World J Orthop 2023;14:197-206.
4. Lin T, Xie Q, Peng T, Zhao X, Chen D. The role of robotic surgery in neurological cases: a systematic review on brain and spine applications. Heliyon 2023;9:e22523.
5. Ming J, He Y, Yang Y, et al. Health technology assessment of medical devices: current landscape, challenges, and a way forward. Cost Eff Resour Alloc 2022;20:54.
6. van Lieshout C, Frederix GW, Schoonhoven L. Economic evaluations in medical technological innovations: a mapping review of methodologies. Cost Eff Resour Alloc 2024;22:23.
7. Millar R, Morton A, Bufali MV, et al. Assessing the performance of health technology assessment (HTA) agencies: developing a multi-country, multi-stakeholder, and multi-dimensional framework to explore mechanisms of impact. Cost Eff Resour Alloc 2021;19:37.
8. Imbens GW, Rubin DB. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press; 2015.
9. Pirracchio R, Resche-Rigon M, Chevret S. Evaluation of the propensity score methods for estimating marginal odds ratios in case of small sample size. BMC Med Res Methodol 2012;12:70.
10. Huber M, Lechner M, Wunsch C. The performance of estimators based on the propensity score. J Econom 2013;175:1-21.
11. Pan W, Bai H. Propensity score methods for causal inference: an overview. Behaviormetrika 2018;45:317-334.
12. Leite W. Practical propensity score methods using R. Sage Publications; 2016.
13. Raad H, Cornelius V, Chan S, Williamson E, Cro S. An evaluation of inverse probability weighting using the propensity score for baseline covariate adjustment in smaller population randomised controlled trials with a continuous outcome. BMC Med Res Methodol 2020;20:70.
14. Wan F. Matched or unmatched analyses with propensity-score-matched data? Stat Med 2019;38:289-300.
15. Ho D, Imai K, King G, Stuart EA. MatchIt: nonparametric preprocessing for parametric causal inference. J Stat Softw 2011;42:1-28.
16. Greifer N. Covariate balance tables and plots: a guide to the cobalt package [Internet]. Published 2024 Apr 2. Available from: https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt.html.
17. Greifer N. Using WeightIt to estimate balancing weights [Internet]. Published 2024 Jun 9. Available from: https://ngreifer.github.io/WeightIt/articles/WeightIt.html.
18. Burgette L, Griffin BA, McCaffrey D. Propensity scores for multiple treatments: a tutorial for the mnps function in the twang package. Rand Corporation; 2017.
19. Lee B, An J, Lee S, Won S. Rex: R-linked EXcel add-in for statistical analysis of medical and bioinformatics data. Genes Genomics 2023;45:295-305.
20. Zhang Z, Kim HJ, Lonjon G, Zhu Y; written on behalf of AME Big-Data Clinical Trial Collaborative Group. Balance diagnostics after propensity score matching. Ann Transl Med 2019;7:16.
21. Carnegie NB, Harada M, Hill JL. Assessing sensitivity to unmeasured confounding using a simulated potential confounder. J Res Educ Eff 2016;9:395-420.
22. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41-55.
23. Rosenbaum PR. In: Zubizarreta JR, Stuart EA, Small DS, Rosenbaum PR, eds. Handbook of matching and weighting adjustments for causal inference. Chapman and Hall/CRC; 2023:21-38.
24. Rudolph KE, Stuart EA. Using sensitivity analyses for unobserved confounding to address covariate measurement error in propensity score methods. Am J Epidemiol 2018;187:604-613.
25. Gürel S, Leite WL. Comparison of propensity score weighting methods to remove selection bias in average treatment effect estimates. Int J Turk Educ Stud 2023;11:989-1031.
26. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One 2011;6:e18174.
27. Gim J, Choi S, Im J, Kim JK, Park T. A post-hoc genome-wide association study using matched samples. Int J Data Min Bioinform 2016;14:197-209.
28. Guo S, Fraser MW. Propensity score analysis: statistical methods and applications. 2nd ed. SAGE Publications; 2014.
29. Morgan SL, Winship C. Counterfactuals and causal inference: methods and principles for social research. 2nd ed. Cambridge University Press; 2014.
30. D'Agostino RB Jr, Rubin DB. Estimating and using propensity scores with partially missing data. J Am Stat Assoc 2000;95:749-759.
31. Allison PD. In: Millsap RE, Maydeu-Olivares A, eds. The Sage handbook of quantitative methods in psychology. Sage Publications; 2009:72-89.
32. Enders CK. Missing data: an update on the state of the art. Psychol Methods 2023 [accepted manuscript]. DOI: 10.1037/met0000563.


Journal of Minimally Invasive Surgery

pISSN 2234-778X
eISSN 2234-5248