Data Preparation for Mixed-Subjects Design
Create and validate data objects for MSD estimation.
Create an MSD data object
Creates a validated data object for mixed-subjects design estimation. Accepts either a single combined dataframe or two separate dataframes for observed and unobserved units. Column names default to "Y", "D", "S0", "S1" and can be overridden explicitly.
Usage
msd_data(
  data = NULL,
  observed = NULL,
  unobserved = NULL,
  outcome = "Y",
  treatment = "D",
  pred_control = "S0",
  pred_treated = "S1",
  obs_outcome = NULL,
  obs_treatment = NULL,
  obs_pred_control = NULL,
  obs_pred_treated = NULL,
  unobs_treatment = NULL,
  unobs_pred_control = NULL,
  unobs_pred_treated = NULL
)
Arguments
- data
A combined dataframe containing both observed and unobserved units. Observed units have non-missing Y values; unobserved units have Y = NA. If provided, observed and unobserved should be NULL.
- observed
A dataframe of observed (labeled) units with columns for outcome Y, predictions S0/S1, and treatment D. If provided, data should be NULL.
- unobserved
A dataframe of unobserved (unlabeled) units with columns for predictions S0/S1 and treatment D (no Y column needed).
- outcome
Name of the outcome column. Default: "Y".
- treatment
Name of the treatment column. Default: "D".
- pred_control
Name of the control prediction column. Default: "S0".
- pred_treated
Name of the treatment prediction column. Default: "S1".
- obs_outcome
Outcome column name for observed data only (overrides outcome).
- obs_treatment
Treatment column name for observed data only (overrides treatment).
- obs_pred_control
Control prediction column name for observed data only (overrides pred_control).
- obs_pred_treated
Treatment prediction column name for observed data only (overrides pred_treated).
- unobs_treatment
Treatment column name for unobserved data only (overrides treatment).
- unobs_pred_control
Control prediction column name for unobserved data only (overrides pred_control).
- unobs_pred_treated
Treatment prediction column name for unobserved data only (overrides pred_treated).
Value
An S3 object of class "msd_data" containing:
- observed
Dataframe of observed units with standardized columns Y, S0, S1, D
- unobserved
Dataframe of unobserved units with standardized columns S0, S1, D (or NULL)
- has_S0
Logical indicating if S0 predictions are available
- has_S1
Logical indicating if S1 predictions are available
- has_both_predictions
Logical indicating if both S0 and S1 are available
- n1
Number of treated observed units
- n0
Number of control observed units
- m
Number of unobserved units
- col_mapping
List of original column names used
Details
The function supports flexible column name specification:
Default column names:
By default, the function expects columns named "Y" (outcome), "D" (treatment),
"S0" (control prediction), and "S1" (treatment prediction). Override these
using the outcome, treatment, pred_control, and
pred_treated arguments.
Global specification:
Use outcome, treatment, pred_control, pred_treated
to set column names that apply to both observed and unobserved dataframes.
Per-dataframe specification:
Use obs_* and unobs_* arguments when column names differ
between observed and unobserved dataframes. These override global settings.
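The precedence rule above can be sketched in a few lines of base R. This is only an illustration of "per-dataframe overrides global", not the package's internal code; the helper name resolve_col is hypothetical.

```r
# Hypothetical helper (not part of the package): use the per-dataframe
# name when one is supplied, otherwise fall back to the global name.
resolve_col <- function(specific, global) {
  if (is.null(specific)) global else specific
}

# Global setting, intended for both dataframes
treatment <- "treated"

# Per-dataframe settings: none for observed, an override for unobserved
obs_treatment <- NULL
unobs_treatment <- "D"

obs_col <- resolve_col(obs_treatment, treatment)      # falls back to the global name
unobs_col <- resolve_col(unobs_treatment, treatment)  # uses the per-dataframe name
```

Here the observed dataframe would be read using the "treated" column, while the unobserved dataframe would be read using "D".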
Two input modes:
Mode 1: Single combined dataframe
Provide a single dataframe via the data argument. The function splits it
into observed units (non-missing outcome) and unobserved units (outcome is NA).
Mode 2: Separate dataframes
Provide two separate dataframes via observed and unobserved.
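As a rough illustration of Mode 1, the split on missing Y can be reproduced by hand in base R. This sketch assumes the default column names and a treatment indicator coded 1/0 (as in the examples); it mimics the documented behavior and the n1, n0, and m counts, and is not the package's actual implementation.

```r
# Combined data: rows with Y = NA are the unobserved (unlabeled) units
combined <- data.frame(
  Y  = c(1.2, 0.8, NA, 1.5, NA),
  S0 = c(1.0, 0.7, 1.1, 1.3, 0.9),
  S1 = c(1.1, 0.9, 1.2, 1.4, 1.0),
  D  = c(1, 0, 1, 1, 0)
)

# Split on whether the outcome is missing
is_observed <- !is.na(combined$Y)
observed <- combined[is_observed, ]
unobserved <- combined[!is_observed, c("S0", "S1", "D")]  # no Y column needed

# Counts analogous to the n1, n0, and m fields of the returned object
n1 <- sum(observed$D == 1)  # treated observed units
n0 <- sum(observed$D == 0)  # control observed units
m <- nrow(unobserved)       # unobserved units
```

The resulting observed and unobserved dataframes have the same shape as the inputs expected in Mode 2.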
Examples
# Default column names (Y, D, S0, S1)
obs_df <- data.frame(
  Y = c(1.2, 0.8, 1.5, 0.9),
  S0 = c(1.0, 0.7, 1.3, 0.8),
  S1 = c(1.1, 0.9, 1.4, 1.0),
  D = c(1, 0, 1, 0)
)
msd <- msd_data(observed = obs_df)
# Custom column names (same in both dataframes)
obs_df2 <- data.frame(
  response = c(1.2, 0.8, 1.5, 0.9),
  pred_ctrl = c(1.0, 0.7, 1.3, 0.8),
  pred_trt = c(1.1, 0.9, 1.4, 1.0),
  treated = c(1, 0, 1, 0)
)
unobs_df2 <- data.frame(
  pred_ctrl = c(1.1, 0.9),
  pred_trt = c(1.2, 1.0),
  treated = c(1, 0)
)
msd2 <- msd_data(
  observed = obs_df2,
  unobserved = unobs_df2,
  outcome = "response",
  treatment = "treated",
  pred_control = "pred_ctrl",
  pred_treated = "pred_trt"
)
# Different column names in observed vs unobserved
obs_df3 <- data.frame(
  outcome = c(1.2, 0.8),
  claude_pred_0 = c(1.0, 0.7),
  claude_pred_1 = c(1.1, 0.9),
  treatment = c(1, 0)
)
unobs_df3 <- data.frame(
  s0_claude = c(1.1, 0.9),
  s1_claude = c(1.2, 1.0),
  D = c(1, 0)
)
msd3 <- msd_data(
  observed = obs_df3,
  unobserved = unobs_df3,
  obs_outcome = "outcome",
  obs_treatment = "treatment",
  obs_pred_control = "claude_pred_0",
  obs_pred_treated = "claude_pred_1",
  unobs_treatment = "D",
  unobs_pred_control = "s0_claude",
  unobs_pred_treated = "s1_claude"
)