This guide provides an introduction to applying the
multicate package in health data analyses. The explanations
below are based on the original paper that is based on for development
of this package: Comparison of methods that combine multiple
randomized trials to estimate heterogeneous treatment effects.
Statistics in medicine (Brantner et al., 2024).1 The detailed github
code and readme are in this link(https://github.com/dobengjhu/multicate).
To estiamte CATE, you can use the function
estiamte_cate() and this function requires parameter like
below:
c('var1','var2','var3').c('var1','var2','var3'). Default is
NULL.causal forest for the
estimation method you may adjust extra arguments to compelte splitting
tress faster by setting a smaller number of trees. (e.g.extra_args =
list(num.trees = 50))
The result objects of estimate_cate() and
summary() provide different information.
estimate_cate() result object:
summary() of the result object of estimate_cate():
ate)studycate) with minimum,
median, maximum CATE.
While the primary purpose of this package is for RCTs, but the base dataset can be from randomized controlled trials, observational studies, or a combination of both. Our functions are primarily designed for IPD (Individual Patient Data). The dataset must contain at least 4 key components, a study indicator variable, a treatment indicator variable, An outcome variable, and covariate variables.
You should be careful to use which data formats your data has. For this package, the treatment variable should be numeric (e.g. 0 or 1), the study ID should be character (factor is allowed.), and the outcome must be numeric. Note that this package is primarily designed for continuous outcome, and covariates should be all numeric.
| Data column | Data format |
|---|---|
| treatment | numeric(0,1) |
| studyid | character |
| outcome | numeric |
| covariates | numeric |
Packages like grf (used for causal_forest) will support
missing values in a specific way. The explanations below are from the
grf github reference page.2
Missing values GRF can handle missing covariate values, meaning that a the input covariate matrix can contain NA in some cells instead of numerical values.
These missing values can be informative, in the sense that the fact that a covariate is missing can be predictive of the unobserved value itself or of the outcome Y, or it can be non-informative. For instance, an individual’s annual income could be missing because this individual is wealthy and prefers to conceal his/her income and this in turn is likely to be predictive, for instance, of the type of housing of this individual.
GRF handles missing values implicitly using the missing incorporated in attributes criterion (MIA; Twala et al., 2008), meaning that as soon as there is a missing value in a variable j, there are three candidate splits for a given threshold.
split the samples with observed value for j according to the threshold and send all samples with missing value for j to the left,
split the samples with observed value for j according to the threshold and send all samples with missing value for j to the right,
send all samples with observed value for j to the left and all samples with missing value for j to the right.
Note that by default, missing values are sent to the left, which implies that if the covariates from the training set are completely observed but there are missing values occurring in a new covariate matrix from the test set, then the prediction functions will not produce an error but simply send the incomplete observations to the left side of the corresponding nodes.
Missing values occur almost inevitably in most applications and the way to handle them depends on the type of analysis one wishes to perform and on the mechanism that generated these missing values. For a broader discussion on treatment effect estimation with missing attributes please see Mayer et al. (2019).
If you need to include factor variables, we recommend the following:
The brief summary of solutions suggested by the grf vignett is as follows:
fastdummies to convert
categorical variables into dummy variables easily.Means encoding: Use the sufrep package for
target/means encoding.
Oridinal encoding: If your categorical variable has a natural order (e.g. months), encode it as a numeric variable accordingly.
See the grf vignette on categorical inputs for mode
deail.3
Ensure that the column names or formulas passed to estimate_cate() do
not contain spaces or logical connectors (like ‘and’). The
janitor package is helpful for cleaning column names by
replacing spaces with underscores (e.g., ‘variable name’ becomes
‘variable_name’). This step is especially important when using the
ensemble forest method.
These errors may be caused by lack of sufficient variation or sparsity in one or more covariates. When a variable has very few observations in a particular category (e.g., a binary variable with only a handful of “Yes” values), it can lead to unstable splits in tree-based models like causal forest, or to overfitting, especially in small subgroups.
Before applying the model to data, it is recommended to examine the distribution of each covariate to identify rare levels or highly imbalanced variables that may affect model performance.
dbarts and
grfpackages.
grf’s
causal_forest.
importance is set to impurity, and
keep.inbag is set to TRUE.
Plot4 BLP(Best Linear projection figure) is not available for the ensemble forest method. You can refer to the table below to see which plots are supported by each combination of aggregation and estimation methods.
| No | Plots from plot() |
Aggregation Method | Estimation Method |
|---|---|---|---|
| 1 | Histogram of estimated CATEs | Available to all | Available to all |
| 2 | Boxplot of CATEs by study ID | Available to all | Available to all |
| 3 | 95% CI for all CATEs | Study specific, Study indicator | Available to all |
| 4 | Best Linear Projection | Study specific, Study indicator | Causal forest |
| 5 | Interpretation tree | Available to all | Available to all |
To get started, install and load the multicate package
using either install.packages(multicate) or
pak::pak("dobengjhu/multicate"). If you wouldl like to know
what kinds of data would be needed, for the exploration purpose, you can
refer to our fake dataset dummy_tbl. This fake dataset
includes three previously conducted trials comparing the same two
treatments. It contains:
An outcome variable called response
A treatment indicator tx
Five signal covariates (var to var5)
that may act as effect modifiers
You can view a summary of the dataset with
summary(dummy_tbl).
In this document, we used the packages as follows.
# 1) Install multicate package
#install.packages("multicate")
#pak::pak("dobengjhu/multicate")
library(multicate)
#2) Additional packages for data cleansing
library(tibble)
#install.packages("tableone")
library(tableone)
library(tidyverse)
#install.packages("fastDummies") # if not already installed
library(fastDummies)
#install.packages("kableExtra")
library(kableExtra)
#install.packages("performance")
library(performance)
In this example, we use data from a study of Right Heart Catheterization (RHC), a diagnostic procedure used to measure cardiac function in critically ill patients. While RHC can guide urgent and ongoing treatment decisions, it also carries a risk of serious complications. The observational dataset originates from Murphy and Cluff (1990), was later reanalyzed by Connors et al. (1996), and has since become a benchmark dataset in causal inference research.
The dataset (rhc.xls) includes information on 5,735
hospitalized adult patients across five U.S. medical centers. The
treatment variable (column 2) indicates whether RHC was
administered within 24 hours of admission (TRUE for
treatment; FALSE for control). The outcome variable of this
study is originally dth30 (column 54), records 30-day
post-admission mortality (TRUE for death;
FALSE for survival), but in this example, we set the
outcome Glasgow Coma Score (scoma1 variable) as our
continuous coutcome.
Covariate information (column 3 to 53): age, sex, race (black, white, other), years of education,income, type of medical insurance (private, Medicare, Medicaid, private and Medicare, Medicare and Medicaid, or none), primary disease category, secondary disease category, 10 categories of admission diagnosis, activities of daily living (ADL) and Duke Activity Status Index (DASI) 2 weeks before admission, do-not-resuscitate status on day 1 of admission, cancer (none, localized, metastatic), an estimate of the probability of surviving 2 months, acute physiology component of the APACHE III score, Glasgow Coma Score, weight, temperature, mean blood pressure, respiratory rate, heart rate, PaO2=FIO2 ratio, PaCO2, pH, WBC count, hematocrit, sodium, potassium, creatinine, bilirubin, albumin, urine output, 12 categories of comorbidity illness, and whether the patient transferred from another hospital. Since the number of covariates is too large for demonstration purposes, we will use a smaller subset in the analysis. In practice, however, you can use datasets with many variables.
To demonstrate the functionality of the multicate
package in the context of multiple studies, we simulate study membership
by arbitrarily dividing the dataset into five groups, treating them as
if they came from five distinct medical centers. Although the original
dataset does not indicate which patient belongs to which hospital, this
simulated structure allows us to explore treatment effect heterogeneity
across centers and individual-level covariates.
#Load data
rhc <- read.csv(file="rhc.xls", header=TRUE)
rhc <- rhc %>% mutate(across(where(is.logical), as.numeric))
rhc <- rhc %>% subset(select = c(-ID))
rhc$center <- rep(1:5, each=nrow(rhc)/5)
rhc$center <- as.character(rhc$center)
#Choose covariates to be used for this study
rhc <- rhc %>% select(treatment, center, scoma1,
race, ninsclas,
hrt1, resp1, bili1, pafi1, aps1, temp1, das2d3pc, sod1, liverhx)
| center | treatment | scoma1 | race | ninsclas | hrt1 | resp1 | bili1 | pafi1 | aps1 | temp1 | das2d3pc | sod1 | liverhx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | white | Medicare | 124 | 10 | 1.0097656 | 68.0000 | 46 | 38.69531 | 23.50000 | 145 | 0 |
| 1 | 1 | 0 | white | Private & Medicare | 137 | 38 | 0.6999512 | 218.3125 | 50 | 38.89844 | 14.75195 | 137 | 0 |
| 1 | 1 | 0 | white | Private | 130 | 40 | 1.0097656 | 275.5000 | 82 | 36.39844 | 18.13672 | 146 | 0 |
| 1 | 0 | 0 | white | Private & Medicare | 58 | 26 | 0.3999634 | 156.6562 | 48 | 35.79688 | 22.92969 | 117 | 0 |
| 1 | 1 | 41 | white | Medicare | 125 | 27 | 1.0097656 | 478.0000 | 72 | 34.79688 | 21.05078 | 126 | 0 |
| 1 | 0 | 0 | white | Medicare | 134 | 36 | 1.0097656 | 184.1875 | 38 | 39.19531 | 17.50000 | 138 | 0 |
| 1 | 0 | 26 | white | Private | 135 | 10 | 1.0097656 | 148.7500 | 29 | 39.39844 | 15.40625 | 136 | 0 |
| 1 | 0 | 100 | white | Private | 102 | 34 | 0.8999023 | 240.0000 | 25 | 39.19531 | 29.14453 | 136 | 0 |
| 1 | 0 | 0 | white | Private | 118 | 30 | 1.0998535 | 333.3125 | 47 | 38.39844 | 26.28906 | 136 | 0 |
| 1 | 1 | 0 | white | Medicaid | 141 | 40 | 0.5000000 | 68.0000 | 48 | 38.50000 | 23.25781 | 146 | 0 |
As described above, the multicate package builds on
other modeling packages, which require the input data to be fully
numeric and free of missing (NA) values. To meet these requirements, we
removed the missing values in the dataset accordingly.
After removing missing values (NA) and simplifying categorical
variable levels, we convert all factor and character covariates
variables into numeric or dummy variables using fastDummies
package. The janitor package’s function,
janitor::clean_names() is helpful for standardizing column
names by replacing spaces and special characters (e.g., converting
condition_no symptoms to
condition_no_symptoms).
This is the table to show how dummy variables are created via
fast_dummies.
| race_other | race_white | ninsclas_medicare | ninsclas_medicare_medicaid | ninsclas_no_insurance | ninsclas_private | ninsclas_private_medicare |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 |
This table is the final table about the data we will use for analysis.
| Category | Variables | Data.type | Description |
|---|---|---|---|
| Treatment | treatment | num | [treatment] RHC was administered within 24 hours of admission (Yes = 1, No = 0) |
| Studies | (fake) center | factor | [studies] Arbitarily created variable referring to the hospitals/centers that RHC was administered at |
| Outcome | scoma1 | int | [outcome] Glasgow Coma Score. It is used to assess a neurological status of a patient, with higher indicating a better prognosis. |
| Covariates | hrt1 | int | Heart rate |
| resp1 | num | Respiratroy rate | |
| bili1 | num | Bilirubin in critical illness. High levels are a marker of severity. | |
| pafi1 | num | PaO2/FIO2 ratio (P/F ratio) is a measure of oxygenation in the blood used to assess respiratory failure and the severity of conditions like Acute Respiratory Distress Syndrome (ARDS). | |
| aps1 | int | Acute physiology component of the APACHE score is a value from 0 to 60 that measures a critically ill patients physiological state based on 12 clinical variables collected within the first 24 hours of ICU admission. | |
| temp1 | num | Temperature | |
| das2d3pc | num | DASI (Duke Activity Status Index) is a 12-question, self-administered questionnaire used to measure a patient functional capacity, especially those with cardiovascular disease. | |
| sod1 | int | Sodium | |
| liverhx | num | Cirrhosis, Hepatic Failure | |
| race_other | int | Race Other | |
| race_white | int | Race White | |
| ninsclas_medicare | int | Type of medical insurance - Medicare | |
| ninsclas_medicare_medicaid | int | Type of medical insurance - Medicare and Medicaid | |
| ninsclas_no_insurance | int | Type of medical insurance - No insurance | |
| ninsclas_private | int | Type of medical insurance - Private | |
| ninsclas_private_medicare | int | Type of medical insurance - Private and Medicare |
The multicate package provides tools to combine multiple
RCTs to estimate CATE. It supports three options for handling multiple
studies: (1) pooling trials using a study indicator
("studyindicator"), (2) aggregating datasets via an
ensemble forest approach ("ensembleforest"), and (3)
estimating effects separately within each study without pooling
("studyspecific"). Technically, the last option is not
‘aggregation of multiple studies’, but it is included for settings where
trials are too heterogeneous to justify combining information.
For CATE estimation, the package implements two methods: (1) the
S-learner, which fits a single model for the outcome, treating the
treatment as one of covariates ("slearner"), and (2) the
causal forest, which recursively partitions covariates to best split
based on treatment effect heterogeneity. Further details can be found in
Brantner et al.(2024)9 or in the ‘Background’ section of the
website10
The example below is the sample with causal forest for the estimation method, and study indicator for the aggregation method. Study indicator method pools all data together but keep study ID as an indicator variable and includes them as covariates in the single-study method selected in estiamtion method.
set.seed(100)
cate_mod2 <- estimate_cate(rhc_clean,
estimation_method = "causalforest",
aggregation_method = "studyindicator",
study_col = "center",
treatment_col = "treatment",
outcome_col = "scoma1",
covariate_col = NULL,
drop_col = c("race","ninsclas"))
summary_cate_mod2 <- summary(cate_mod2)
#result
#cate_mod2
#result summary
summary_cate_mod2
#>
#> Estimation Method: causalforest
#> Aggregation Method: studyindicator
#>
#> Variable Importance:
#> hrt1 resp1 bili1 pafi1 aps1 temp1 das2d3pc
#> Value 0.08616683 0.2156197 0.1305946 0.07425747 0.2340573 0.08996657 0.06862788
#> sod1 liverhx race_other race_white ninsclas_medicare
#> Value 0.04009024 0.02242125 0.0003493235 0.002782086 0.002467816
#> ninsclas_medicare_medicaid ninsclas_no_insurance ninsclas_private
#> Value 0.0011903 0.004070067 0.005487103
#> ninsclas_private_medicare center_1 center_2 center_3 center_4
#> Value 0.005270401 0.00320616 0.003143155 0.00296119 0.002510454
#> center_5
#> Value 0.004759996
#>
#> Overall ATE:
#> Estimate Std. Error
#> -3.4392 0.7547
#>
#> Study-Specific ATE:
#> Minimum CATE Median CATE Maximum CATE
#> 1 -11.55 -2.735 2.789
#> 2 -11.89 -2.738 2.523
#> 3 -12.25 -2.842 2.693
#> 4 -12.39 -2.959 2.738
#> 5 -11.21 -2.799 2.960
Study specific method fits the estimation method separately for each study and report out study-specific estiamtes.
set.seed(100)
cate_mod <- estimate_cate(rhc_clean,
estimation_method = "causalforest",
aggregation_method = "studyspecific",
study_col = "center",
treatment_col = "treatment",
outcome_col = "scoma1",
covariate_col = NULL,
drop_col = c("race","ninsclas"))
summary_cate_mod <- summary(cate_mod)
#result
summary_cate_mod
#>
#> Estimation Method: causalforest
#> Aggregation Method: studyspecific
#>
#> Variable Importance:
#> 1 2 3 4
#> hrt1 8.815784e-02 0.1578055817 0.1274604448 0.095335491
#> resp1 3.590424e-01 0.0822063168 0.1319580095 0.105403118
#> bili1 5.913912e-02 0.1064171989 0.0436751343 0.172664202
#> pafi1 8.506052e-02 0.1229695073 0.1898693456 0.149199309
#> aps1 1.004332e-01 0.1530942033 0.1619463391 0.128054155
#> temp1 9.885726e-02 0.1599027173 0.1048710218 0.152842805
#> das2d3pc 8.559751e-02 0.0671526922 0.1203708546 0.061798845
#> sod1 6.227833e-02 0.0827432365 0.0943740709 0.091291067
#> liverhx 1.825143e-02 0.0012113027 0.0033732315 0.018979862
#> race_other 1.350108e-03 0.0009375085 0.0001988937 0.001009043
#> race_white 3.908517e-03 0.0081592309 0.0027869287 0.002858875
#> ninsclas_medicare 3.787342e-03 0.0027478012 0.0055582432 0.002197582
#> ninsclas_medicare_medicaid 8.774907e-05 0.0027714856 0.0002542968 0.001523082
#> ninsclas_no_insurance 5.122805e-05 0.0083558126 0.0046291151 0.010114266
#> ninsclas_private 2.587322e-02 0.0247227159 0.0050555514 0.005159331
#> ninsclas_private_medicare 8.124280e-03 0.0188026885 0.0036185189 0.001568968
#> 5
#> hrt1 0.1572957473
#> resp1 0.1122038054
#> bili1 0.1253152619
#> pafi1 0.0861340315
#> aps1 0.2649563450
#> temp1 0.0881977771
#> das2d3pc 0.0664309899
#> sod1 0.0661750486
#> liverhx 0.0006311238
#> race_other 0.0017621565
#> race_white 0.0032868101
#> ninsclas_medicare 0.0048008674
#> ninsclas_medicare_medicaid 0.0012967847
#> ninsclas_no_insurance 0.0086710562
#> ninsclas_private 0.0073342039
#> ninsclas_private_medicare 0.0055079907
#>
#> Overall ATE:
#> Estimate Std. Error
#> -3.417 NA
#>
#> Study-Specific ATE:
#> Minimum CATE Median CATE Maximum CATE
#> 1 -22.339 -2.734 5.4008
#> 2 -6.629 -1.218 4.5192
#> 3 -14.692 -5.160 0.7346
#> 4 -10.975 -4.400 3.6490
#> 5 -11.593 -2.068 8.8241
Study indicator aggregation method does not provide standard error.
#study indicator
summary_cate_mod$ate
#> Estimate Std. Error
#> -3.41695 NA
Study specific aggregation method does provide standard error information.
#study specific
summary_cate_mod2$ate
#> Estimate Std. Error
#> -3.4391637 0.7546843
The plot() funciton with estimate_cate()
objects provides five visualizations to help interpret the estiamted
CATEs:
estimation_method = "causalforest")For example, in the fourth plot below, study 1 shows a BLP
coefficient for hrt1 (heart rate) close to zero, whereas
study 2 shows a coefficient of approximately 6. This indicates that in
study 2, individuals with higher heart rates tend to have larger
treatment effect heterogeneity, while heart rate appears to have little
relationship with treatment effect heterogeneity in study 1.
In the example below, resp1(respiratory rate) and
aps1(acute physiology component of the APACHE score,
meaning a critical illness level within the first 24 hours of ICU
admission) appear as the primary splitting variables, indicating they
are key modifiers of the treatment effect.
#plot(cate_mod)
Additionally, the plot_vteffect() provides a more
focused visualization if you are interested in exploring treatment
effect heterogeneity with respect to a specific
continuous covariate. It plots the estimated CATE
(tau_hat) for each observation against the selected
covariate, allowing you to see how treatment effects vary across its
range and across different studies.
For example, if you’re interested in how treatment effects differ by
aps1 across studies, you can specify this variable using
the covariate_name = "aps1" parameter.
plot_vteffect(cate_mod, covariate_name = "aps1")
Brantner, C. L., Nguyen, T. Q., Tang, T., Zhao, C., Hong, H., & Stuart, E. A. (2024). Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Statistics in medicine, 43(7), 1291-1314. https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9955↩︎
https://grf-labs.github.io/grf/REFERENCE.html#missing-values-1↩︎
https://grf-labs.github.io/grf/articles/categorical_inputs.html↩︎
Hill, J. L. (2011). Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics, 20(1), 217–240. https://doi.org/10.1198/jcgs.2010.08162↩︎
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10), 4156-4165. https://www.pnas.org/doi/abs/10.1073/pnas.1804597116↩︎
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. https://arxiv.org/abs/1610.01271↩︎
Brantner, C. L., Nguyen, T. Q., Tang, T., Zhao, C., Hong, H., & Stuart, E. A. (2024). Comparison of methods that combine multiple randomized trials to estimate heterogeneous treatment effects. Statistics in medicine, 43(7), 1291-1314. https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9955↩︎