Minimal Example
Consider the observed data with time ordering W -> A -> Y. W is a continuous baseline covariate that affects A and Y. A is a binary treatment variable that affects Y. Y is a binary outcome variable.
W ~ N(0, 1)
A ~ binomial with P(A = 1) = expit(W)
Y ~ binomial with P(Y = 1) = expit(W + A)
where \(expit(z) = \frac{1}{1 + e^{-z}}\)
We want to know \(E[Y_1]\) the expected value of Y, intervening to set A to 1.
rexpit <- function(x) rbinom(n=length(x), size=1, prob=plogis(x))
n <- 10000
W <- rnorm(n)
A <- rexpit(W)
Y <- rexpit(W + A)
data <- data.frame(W, A, Y)
head(data)
#> W A Y
#> 1 -0.2364122 0 0
#> 2 -0.2201795 0 0
#> 3 0.6247098 1 1
#> 4 0.4090073 0 0
#> 5 2.0155026 1 1
#> 6 -1.1306505 0 1
We could try to use the simple mean of Y or the mean of Y when A = 1, but these will be biased.
Now we use ltmle()
.
result <- ltmle(data, Anodes = "A", Ynodes = "Y", abar = 1)
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W
#>
#> Estimate of time to completion: < 1 minute
result
#> Call:
#> ltmle(data = data, Anodes = "A", Ynodes = "Y", abar = 1)
#>
#> TMLE Estimate: 0.6931207
Because we’re using simulated data, we can calculate the true value of \(E[Y_1]\)
Single time point
Time ordering of data is W1 -> W2 -> W3 -> A -> Y
n <- 1000
W1 <- rnorm(n)
W2 <- rbinom(n, size=1, prob=0.3)
W3 <- rnorm(n)
A <- rexpit(-1 + 2 * W1 + W3)
Y <- rexpit(-0.5 + 2 * W1^2 + 0.5 * W2 - 0.5 * A + 0.2 * W3 * A - 1.1 * W3)
data <- data.frame(W1, W2, W3, A, Y)
True value of \(E[Y_1]\) is approximately 0.5939.
SuperLearner semiparametric estimation using all parents as regressors
result <- ltmle(data, Anodes="A", Lnodes=NULL, Ynodes="Y", abar=1, SL.library="default")
#> Loading required namespace: SuperLearner
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W1 + W2 + W3 + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W1 + W2 + W3
#>
#> Loading required package: nnls
#> Loading required namespace: arm
#> Estimate of time to completion: 1 minute
TMLE estimate:
summary(result)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = "A", Lnodes = NULL, Ynodes = "Y",
#> abar = 1, SL.library = "default")
#>
#> Parameter Estimate: 0.60079
#> Estimated Std Err: 0.053992
#> p-value: <2e-16
#> 95% Conf Interval: (0.49497, 0.70661)
IPTW estimate:
summary(result, estimator="iptw")
#> Estimator: iptw
#> Call:
#> ltmle(data = data, Anodes = "A", Lnodes = NULL, Ynodes = "Y",
#> abar = 1, SL.library = "default")
#>
#> Parameter Estimate: 0.59706
#> Estimated Std Err: 0.053743
#> p-value: <2e-16
#> 95% Conf Interval: (0.49173, 0.7024)
SuperLearner semiparametric estimation using correctly specified regressors. This passes W1^2, W2, W3, A and W3:A as columns of matrix X to SuperLearner for the Q regression and W1 and W3 as columns of matrix X to SuperLearner for the g regression.
result <- ltmle(data, Anodes="A", Lnodes=NULL, Ynodes="Y",
Qform=c(Y="Q.kplus1 ~ I(W1^2) + W2 + W3*A"),
gform="A ~ W1 + W3", abar=1, SL.library="default")
#> Estimate of time to completion: 1 minute
summary(result)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = "A", Lnodes = NULL, Ynodes = "Y",
#> Qform = c(Y = "Q.kplus1 ~ I(W1^2) + W2 + W3*A"), gform = "A ~ W1 + W3",
#> abar = 1, SL.library = "default")
#>
#> Parameter Estimate: 0.60087
#> Estimated Std Err: 0.041261
#> p-value: <2e-16
#> 95% Conf Interval: (0.51999, 0.68174)
glm using correctly specified Qform and gform
result <- ltmle(data, Anodes="A", Lnodes=NULL, Ynodes="Y",
Qform=c(Y="Q.kplus1 ~ I(W1^2) + W2 + W3*A"), gform="A ~ W1 + W3",
abar=1, SL.library=NULL)
#> Estimate of time to completion: < 1 minute
summary(result)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = "A", Lnodes = NULL, Ynodes = "Y",
#> Qform = c(Y = "Q.kplus1 ~ I(W1^2) + W2 + W3*A"), gform = "A ~ W1 + W3",
#> abar = 1, SL.library = NULL)
#>
#> Parameter Estimate: 0.60077
#> Estimated Std Err: 0.041135
#> p-value: <2e-16
#> 95% Conf Interval: (0.52015, 0.6814)
Get summary measures (additive treatment effect, odds ratio, relative risk) for abar=1 vs abar=0
result <- ltmle(data, Anodes="A", Lnodes=NULL, Ynodes="Y",
abar=list(1, 0), SL.library="default")
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W1 + W2 + W3 + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W1 + W2 + W3
#>
#> Estimate of time to completion: 1 minute
summary(result)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = "A", Lnodes = NULL, Ynodes = "Y",
#> abar = list(1, 0), SL.library = "default")
#>
#> Treatment Estimate:
#> Parameter Estimate: 0.60079
#> Estimated Std Err: 0.053992
#> p-value: <2e-16
#> 95% Conf Interval: (0.49497, 0.70661)
#>
#> Control Estimate:
#> Parameter Estimate: 0.68227
#> Estimated Std Err: 0.029951
#> p-value: <2e-16
#> 95% Conf Interval: (0.62357, 0.74097)
#>
#> Additive Treatment Effect:
#> Parameter Estimate: -0.081479
#> Estimated Std Err: 0.061743
#> p-value: 0.18695
#> 95% Conf Interval: (-0.20249, 0.039535)
#>
#> Relative Risk:
#> Parameter Estimate: 0.88058
#> Est Std Err log(RR): 0.10002
#> p-value: 0.20353
#> 95% Conf Interval: (0.72382, 1.0713)
#>
#> Odds Ratio:
#> Parameter Estimate: 0.70085
#> Est Std Err log(OR): 0.26413
#> p-value: 0.17838
#> 95% Conf Interval: (0.41763, 1.1761)
Censoring
Time ordering of data is W -> C -> Y. C is a censoring node.
The parameter of interest is Y, intervening to set C to be uncensored.
Censoring nodes are very similar to treatment nodes. The main difference
is that after censoring, other data is expected to be NA. It is assumed
that the parameter of interest always intervenes to set censoring nodes
to uncensored, so this is not specified in abar
. g and Q
regressions always stratify on censoring nodes, whereas g and Q
regressions can either stratify or pool over treatment nodes (by using
the stratify
argument.)
Censoring nodes in data
should be factors with two
levels - “censored” and “uncensored”. The utility function
BinaryToCensoring
can be used to facilitate this.
n <- 100000
W <- rnorm(n)
C <- BinaryToCensoring(is.censored = rexpit(W))
summary(C)
#> censored uncensored
#> 49982 50018
Y <- rep(NA, n)
Y[C == "uncensored"] <- rexpit(W[C == "uncensored"])
data <- data.frame(W, C, Y)
head(data, 20)
#> W C Y
#> 1 -1.566259331 uncensored 0
#> 2 -0.554999113 censored NA
#> 3 -1.096109019 censored NA
#> 4 -1.548840734 uncensored 0
#> 5 -0.005849666 uncensored 0
#> 6 1.362967323 uncensored 1
#> 7 -0.392581007 uncensored 0
#> 8 -0.731283380 censored NA
#> 9 -0.629759353 uncensored 1
#> 10 0.624508916 uncensored 1
#> 11 -0.589029307 uncensored 0
#> 12 0.813695840 censored NA
#> 13 1.165955241 uncensored 1
#> 14 0.870392623 censored NA
#> 15 -0.588027933 uncensored 1
#> 16 -0.011944668 censored NA
#> 17 0.438052724 censored NA
#> 18 0.838405249 censored NA
#> 19 1.077958468 uncensored 0
#> 20 0.242747049 censored NA
result <- ltmle(data, Anodes = NULL, Cnodes = "C", Ynodes = "Y", abar = NULL)
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W
#>
#> gform not specified, using defaults:
#> formula for C:
#> C ~ W
#>
#> Estimate of time to completion: < 1 minute
summary(result)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = NULL, Cnodes = "C", Ynodes = "Y",
#> abar = NULL)
#>
#> Parameter Estimate: 0.50271
#> Estimated Std Err: 0.0023171
#> p-value: <2e-16
#> 95% Conf Interval: (0.49816, 0.50725)
The naive estimate is biased (the true value is 0.5):
Longitudinal data
Time ordering of data is W -> A1 -> L -> A2 -> Y. L is
time-dependent covariate because it occurs after a treatment (or
censoring) variable, A1. We indicate this using the Lnodes
argument.
n <- 1000
W <- rnorm(n)
A1 <- rexpit(W)
L <- 0.3 * W + 0.2 * A1 + rnorm(n)
A2 <- rexpit(W + A1 + L)
Y <- rexpit(W - 0.6 * A1 + L - 0.8 * A2)
data <- data.frame(W, A1, L, A2, Y)
head(data)
#> W A1 L A2 Y
#> 1 1.6389219 1 2.7790856 1 1
#> 2 -0.4181572 1 -0.7027012 0 0
#> 3 -0.2713617 0 -1.1406436 0 1
#> 4 -0.1870266 1 1.1303360 1 0
#> 5 -0.8462701 1 0.3262920 1 0
#> 6 -0.4968114 0 -0.1593570 0 0
Treatment regime of interest: set A1 to 0, set A2 to 0
ltmle(data, Anodes=c("A1", "A2"), Lnodes="L", Ynodes="Y", abar=c(0, 0))
#> Qform not specified, using defaults:
#> formula for L:
#> Q.kplus1 ~ W + A1
#> formula for Y:
#> Q.kplus1 ~ W + A1 + L + A2
#>
#> gform not specified, using defaults:
#> formula for A1:
#> A1 ~ W
#> formula for A2:
#> A2 ~ W + A1 + L
#>
#> Estimate of time to completion: < 1 minute
#> Call:
#> ltmle(data = data, Anodes = c("A1", "A2"), Lnodes = "L", Ynodes = "Y",
#> abar = c(0, 0))
#>
#> TMLE Estimate: 0.4926867
Longitudinal data with censoring
W -> A1 -> C -> L -> A2 -> Y
n <- 1000
W <- rnorm(n)
A1 <- rexpit(W)
C <- BinaryToCensoring(is.censored = rexpit(0.6 * W - 0.5 * A1))
uncensored <- C == "uncensored"
L <- A2 <- Y <- rep(NA, n)
L[uncensored] <- (0.3 * W[uncensored] + 0.2 * A1[uncensored] + rnorm(sum(uncensored)))
A2[uncensored] <- rexpit(W[uncensored] + A1[uncensored] + L[uncensored])
Y[uncensored] <- rexpit(W[uncensored] - 0.6 * A1[uncensored] + L[uncensored] - 0.8 * A2[uncensored])
data <- data.frame(W, A1, C, L, A2, Y)
head(data)
#> W A1 C L A2 Y
#> 1 -0.4613304 0 censored NA NA NA
#> 2 -0.9127803 0 censored NA NA NA
#> 3 -0.9895212 0 censored NA NA NA
#> 4 2.1708838 1 censored NA NA NA
#> 5 0.7778003 1 censored NA NA NA
#> 6 -1.2484291 0 uncensored -2.197913 0 0
Treatment regime of interest: set A1 to 1, set A2 to 0, set C to uncensored:
ltmle(data, Anodes=c("A1", "A2"), Cnodes = "C", Lnodes="L", Ynodes="Y", abar=c(1, 0))
#> Qform not specified, using defaults:
#> formula for L:
#> Q.kplus1 ~ W + A1
#> formula for Y:
#> Q.kplus1 ~ W + A1 + L + A2
#>
#> gform not specified, using defaults:
#> formula for A1:
#> A1 ~ W
#> formula for C:
#> C ~ W + A1
#> formula for A2:
#> A2 ~ W + A1 + L
#>
#> Estimate of time to completion: < 1 minute
#> Call:
#> ltmle(data = data, Anodes = c("A1", "A2"), Cnodes = "C", Lnodes = "L",
#> Ynodes = "Y", abar = c(1, 0))
#>
#> TMLE Estimate: 0.4948815
Dynamic treatment
Treatment regime of interest is: Always treat at time 1 (A1 = 1), treat at at time 2 (A2 = 1) if L > 0
abar <- matrix(nrow=n, ncol=2)
abar[, 1] <- 1
abar[, 2] <- L > 0
result.abar <- ltmle(data, Anodes=c("A1", "A2"), Cnodes = "C", Lnodes="L", Ynodes="Y", abar=abar)
#> Qform not specified, using defaults:
#> formula for L:
#> Q.kplus1 ~ W + A1
#> formula for Y:
#> Q.kplus1 ~ W + A1 + L + A2
#>
#> gform not specified, using defaults:
#> formula for A1:
#> A1 ~ W
#> formula for C:
#> C ~ W + A1
#> formula for A2:
#> A2 ~ W + A1 + L
#>
#> Estimate of time to completion: < 1 minute
result.abar
#> Call:
#> ltmle(data = data, Anodes = c("A1", "A2"), Cnodes = "C", Lnodes = "L",
#> Ynodes = "Y", abar = abar)
#>
#> TMLE Estimate: 0.3400308
The regime can also be specified as a rule
function.
rule
is a function applied to each row of data
which returns a numeric vector of the same length as
Anodes
.
rule <- function(row) c(1, row["L"] > 0)
result.rule <- ltmle(data, Anodes=c("A1", "A2"), Cnodes = "C", Lnodes="L", Ynodes="Y", rule=rule)
#> Qform not specified, using defaults:
#> formula for L:
#> Q.kplus1 ~ W + A1
#> formula for Y:
#> Q.kplus1 ~ W + A1 + L + A2
#>
#> gform not specified, using defaults:
#> formula for A1:
#> A1 ~ W
#> formula for C:
#> C ~ W + A1
#> formula for A2:
#> A2 ~ W + A1 + L
#>
#> Estimate of time to completion: < 1 minute
result.rule
#> Call:
#> ltmle(data = data, Anodes = c("A1", "A2"), Cnodes = "C", Lnodes = "L",
#> Ynodes = "Y", rule = rule)
#>
#> TMLE Estimate: 0.3400308
Specfifying the regime using abar
and using
rule
give the same result:
summary(result.abar)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = c("A1", "A2"), Cnodes = "C", Lnodes = "L",
#> Ynodes = "Y", abar = abar)
#>
#> Parameter Estimate: 0.34003
#> Estimated Std Err: 0.040897
#> p-value: <2e-16
#> 95% Conf Interval: (0.25987, 0.42019)
summary(result.rule)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = c("A1", "A2"), Cnodes = "C", Lnodes = "L",
#> Ynodes = "Y", rule = rule)
#>
#> Parameter Estimate: 0.34003
#> Estimated Std Err: 0.040897
#> p-value: <2e-16
#> 95% Conf Interval: (0.25987, 0.42019)
Variance estimation
Consider a simple point treatment problem with observed data \(W, A, Y\). But there is a positivity problem - for small values of \(W\), \(Prob(A = 1)\) is very small.
The true parameter value, \(E[Y_1]\) is approximately 0.697.
The true TMLE standard deviation (the standard deviation of the TMLE estimate if we ran it many times on many sets of data) is approximately 0.056.
The true IPTW standard deviation (the standard deviation of the IPTW estimate if we ran it many times on many sets of data) is approximately 0.059.
n <- 1000
W <- rnorm(n)
A <- rexpit(4 * W)
Y <- rexpit(W + A)
df <- data.frame(W, A, Y)
The default variance.method
is "tmle"
- use
TMLE in order to approximate the variance of the TMLE estimator. The
estimated standard deviation is close to the true TMLE standard
deviation.
r1 <- ltmle(df, Anodes="A", Ynodes="Y", abar = 1, estimate.time=FALSE)
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W
#>
print(summary(r1))
#> Estimator: tmle
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE)
#>
#> Parameter Estimate: 0.6705
#> Estimated Std Err: 0.057671
#> p-value: <2e-16
#> 95% Conf Interval: (0.55746, 0.78353)
If variance.method
is "ic"
, variance is
estimated using the estimated Influence Curve. This is fast to compute,
but may be significantly anticonservative in data with positivity
violations.
r2 <- ltmle(df, Anodes="A", Ynodes="Y", abar = 1, estimate.time=FALSE,
variance.method="ic")
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W
#>
#> Warning in CheckForVarianceWarning(inputs, g.ratio): Variance estimate is based
#> on influence curve only, which may be significantly anticonservative because
#> your data appears to contain positivity violations. It is recommended to use
#> variance.method='tmle' or variance.method='iptw' to obtain a more robust
#> variance estimate (but run time may be significantly longer). See
#> variance.method details in ?ltmle
print(summary(r2))
#> Estimator: tmle
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE,
#> variance.method = "ic")
#>
#> Parameter Estimate: 0.6705
#> Estimated Std Err: 0.044465
#> p-value: <2e-16
#> 95% Conf Interval: (0.58335, 0.75765)
If variance.method
is "iptw"
, then use
IPTW in order to approximate the variance of the TMLE
estimator. This is faster to compute than
variance.method = "tmle"
but less accurate (and slower to
compute than variance.method = "ic"
but more accurate).
r3 <- ltmle(df, Anodes="A", Ynodes="Y", abar = 1, estimate.time=FALSE,
variance.method="iptw")
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W
#>
print(summary(r3))
#> Estimator: tmle
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE,
#> variance.method = "iptw")
#>
#> Parameter Estimate: 0.6705
#> Estimated Std Err: 0.055025
#> p-value: <2e-16
#> 95% Conf Interval: (0.56265, 0.77834)
If we use the IPTW estimator, variance.method
does not
change the estimated standard deviation (it only affects the estimated
standard deviation of the TMLE estimator).
print(summary(r1, estimator="iptw"))
#> Estimator: iptw
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE)
#>
#> Parameter Estimate: 0.71528
#> Estimated Std Err: 0.059648
#> p-value: <2e-16
#> 95% Conf Interval: (0.59837, 0.83218)
print(summary(r2, estimator="iptw")) #the same - variance.method only affects TMLE
#> Estimator: iptw
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE,
#> variance.method = "ic")
#>
#> Parameter Estimate: 0.71528
#> Estimated Std Err: 0.059648
#> p-value: <2e-16
#> 95% Conf Interval: (0.59837, 0.83218)
print(summary(r3, estimator="iptw")) #the same - variance.method only affects TMLE
#> Estimator: iptw
#> Call:
#> ltmle(data = df, Anodes = "A", Ynodes = "Y", abar = 1, estimate.time = FALSE,
#> variance.method = "iptw")
#>
#> Parameter Estimate: 0.71528
#> Estimated Std Err: 0.059648
#> p-value: <2e-16
#> 95% Conf Interval: (0.59837, 0.83218)
We can see that the values of g are very small.
summary(r1$cum.g)
#> V1
#> Min. :0.01000
#> 1st Qu.:0.06569
#> Median :0.44668
#> Mean :0.48270
#> 3rd Qu.:0.90980
#> Max. :1.00000
summary(r1$cum.g.unbounded)
#> V1
#> Min. :0.000006
#> 1st Qu.:0.065685
#> Median :0.446682
#> Mean :0.482000
#> 3rd Qu.:0.909795
#> Max. :0.999999
head(data.frame(df, g = r1$cum.g, g.unbounded = r1$cum.g.unbounded), 20)
#> W A Y g g.unbounded
#> 1 2.613635376 1 1 0.99992685 0.999926845
#> 2 -0.505210549 0 1 0.13305643 0.133056426
#> 3 -0.478839225 0 1 0.14457092 0.144570919
#> 4 -1.502913015 0 0 0.01000000 0.003989621
#> 5 -1.284217170 0 1 0.01000000 0.008828602
#> 6 0.004108767 1 0 0.49674578 0.496745781
#> 7 -0.298192950 0 0 0.24643720 0.246437203
#> 8 -0.487429147 0 1 0.14073208 0.140732084
#> 9 -0.749106452 0 0 0.05921951 0.059219507
#> 10 1.191500848 1 1 0.98695241 0.986952411
#> 11 0.902451522 1 1 0.96337663 0.963376634
#> 12 -1.466762913 0 0 0.01000000 0.004550489
#> 13 2.637774884 1 1 0.99993302 0.999933022
#> 14 -0.744537625 0 0 0.06015654 0.060156539
#> 15 0.621926065 0 1 0.90418770 0.904187705
#> 16 -0.842966847 0 0 0.04276009 0.042760091
#> 17 0.345541627 0 0 0.77463424 0.774634236
#> 18 -0.433685727 1 1 0.16619600 0.166196001
#> 19 0.966106594 1 1 0.97075513 0.970755127
#> 20 0.211162967 0 1 0.67778523 0.677785227
Hierarchical data and the id variable
The id
argument can be used to specify hierarchical
data, such as people in a household.
num.households <- 500
people.in.household <- round(runif(num.households, min = 1, max = 10))
length(people.in.household)
#> [1] 500
n <- sum(people.in.household)
n
#> [1] 2807
W.household <- rnorm(num.households)
length(W.household)
#> [1] 500
W.household.expanded <- rep(W.household, times = people.in.household)
W.indiv <- rnorm(n)
length(W.indiv)
#> [1] 2807
A <- rexpit(1.5 * W.household.expanded + 0.4 * W.indiv)
Y <- rexpit(-1 + 2.3 * W.household.expanded - 0.6 * W.indiv + 1.2 * A)
id can be an integer, factor, or character (or any type that can be coerced to factor),
id <- 1:num.households
id.expanded <- rep(id, times = people.in.household)
data <- data.frame(W.household.expanded, W.indiv, A, Y)
head(cbind(id.expanded, data), 20)
#> id.expanded W.household.expanded W.indiv A Y
#> 1 1 -0.3229165 -0.296158709 1 0
#> 2 1 -0.3229165 0.224816142 1 0
#> 3 1 -0.3229165 -0.886436607 0 0
#> 4 1 -0.3229165 0.001967513 0 0
#> 5 2 1.2151784 -0.648021179 1 1
#> 6 2 1.2151784 -0.414830831 1 1
#> 7 3 0.3594622 -0.963051160 0 1
#> 8 3 0.3594622 2.449803009 1 0
#> 9 3 0.3594622 2.587829366 1 0
#> 10 3 0.3594622 0.813768358 1 1
#> 11 3 0.3594622 -0.269002664 0 0
#> 12 4 0.1935798 1.111126398 1 1
#> 13 4 0.1935798 -0.783816635 0 0
#> 14 4 0.1935798 0.055422708 0 0
#> 15 4 0.1935798 0.791141848 0 0
#> 16 4 0.1935798 -1.863196200 0 1
#> 17 4 0.1935798 1.192558718 0 1
#> 18 5 0.6699777 -0.951043060 1 1
#> 19 5 0.6699777 0.780967979 1 1
#> 20 5 0.6699777 -0.756337601 1 1
result.without.id <- ltmle(data, Anodes = "A", Ynodes = "Y", abar = 0)
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W.household.expanded + W.indiv + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W.household.expanded + W.indiv
#>
#> Estimate of time to completion: < 1 minute
result.with.id <- ltmle(data, Anodes = "A", Ynodes = "Y", abar = 0, id = id.expanded)
#> Qform not specified, using defaults:
#> formula for Y:
#> Q.kplus1 ~ W.household.expanded + W.indiv + A
#>
#> gform not specified, using defaults:
#> formula for A:
#> A ~ W.household.expanded + W.indiv
#>
#> Estimate of time to completion: < 1 minute
Omitting the id argument makes the individuals seem more independent than they are, which gives artificially low variance estimates.
summary(result.without.id)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = "A", Ynodes = "Y", abar = 0)
#>
#> Parameter Estimate: 0.32815
#> Estimated Std Err: 0.016057
#> p-value: <2e-16
#> 95% Conf Interval: (0.29668, 0.35962)
summary(result.with.id)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = "A", Ynodes = "Y", abar = 0, id = id.expanded)
#>
#> Parameter Estimate: 0.32815
#> Estimated Std Err: 0.020807
#> p-value: <2e-16
#> 95% Conf Interval: (0.28737, 0.36893)
The influence curve is a vector with length equal to the number of independent units.
Multiple time-dependent covariates and treatments at each time point, continuous Y values
age -> gender -> A1 -> L1a -> L1b -> Y1 -> A2 -> L2a -> L2b -> Y2
n <- 1000
age <- rbinom(n, 1, 0.5)
gender <- rbinom(n, 1, 0.5)
A1 <- rexpit(age + gender)
L1a <- 2*age - 3*gender + 2*A1 + rnorm(n)
L1b <- rexpit(age + 1.5*gender - A1)
Y1 <- plogis(age - gender + L1a + 0.7*L1b + A1 + rnorm(n))
A2 <- rexpit(age + gender + A1 - L1a - L1b)
L2a <- 2*age - 3*gender + 2*A1 + A2 + rnorm(n)
L2b <- rexpit(age + 1.5*gender - A1 - A2)
Y2 <- plogis(age - gender + L1a + L1b + A1 + 1.8*A2 + rnorm(n))
data <- data.frame(age, gender, A1, L1a, L1b, Y1, A2, L2a, L2b, Y2)
Also show some different ways of specifying the nodes (either names or indexes works):
result <- ltmle(data, Anodes=c(3, 7), Lnodes=c("L1a", "L1b", "L2a", "L2b"),
Ynodes=grep("^Y", names(data)), abar=c(1, 0))
#> Qform not specified, using defaults:
#> formula for L1a:
#> Q.kplus1 ~ age + gender + A1
#> formula for L2a:
#> Q.kplus1 ~ age + gender + A1 + L1a + L1b + Y1 + A2
#>
#> gform not specified, using defaults:
#> formula for A1:
#> A1 ~ age + gender
#> formula for A2:
#> A2 ~ age + gender + A1 + L1a + L1b + Y1
#>
#> Estimate of time to completion: < 1 minute
#> Warning in CheckForVarianceWarning(inputs, g.ratio): Variance estimate is based
#> on influence curve only, which may be significantly anticonservative because
#> your data appears to contain positivity violations. Robust variance estimate is
#> not currently available with non binary outcomes but this will be addressed in
#> a future release.
summary(result)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = c(3, 7), Lnodes = c("L1a", "L1b",
#> "L2a", "L2b"), Ynodes = grep("^Y", names(data)), abar = c(1,
#> 0))
#>
#> Parameter Estimate: 0.82284
#> Estimated Std Err: 0.014516
#> p-value: <2e-16
#> 95% Conf Interval: (0.79439, 0.85129)
Specifying Qform
Usually you would specify a Qform
for all of the Lnodes
and Ynodes but in this case L1a, L1b, Y1 is a “block” of L/Y nodes not
separated by Anodes or Cnodes (the same is true for L2a, L2b, Y2). Only
one regression is required at the first L/Y node in a block. You can
pass regression formulas for the other L/Y nodes, but they will be
ignored.
result <- ltmle(data, Anodes=c(3, 7), Lnodes=c("L1a", "L1b", "L2a", "L2b"),
Ynodes=grep("^Y", names(data)), abar=c(1, 0),
Qform=c(L1a="Q.kplus1 ~ 1", L2a="Q.kplus1 ~ 1"))
#> gform not specified, using defaults:
#> formula for A1:
#> A1 ~ age + gender
#> formula for A2:
#> A2 ~ age + gender + A1 + L1a + L1b + Y1
#>
#> Estimate of time to completion: < 1 minute
#> Warning in CheckForVarianceWarning(inputs, g.ratio): Variance estimate is based
#> on influence curve only, which may be significantly anticonservative because
#> your data appears to contain positivity violations. Robust variance estimate is
#> not currently available with non binary outcomes but this will be addressed in
#> a future release.
summary(result)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = c(3, 7), Lnodes = c("L1a", "L1b",
#> "L2a", "L2b"), Ynodes = grep("^Y", names(data)), Qform = c(L1a = "Q.kplus1 ~ 1",
#> L2a = "Q.kplus1 ~ 1"), abar = c(1, 0))
#>
#> Parameter Estimate: 0.85108
#> Estimated Std Err: 0.027434
#> p-value: <2e-16
#> 95% Conf Interval: (0.79731, 0.90485)
Gives the same result but prints a message saying some regression formulas will be dropped:
result <- ltmle(data, Anodes=c(3, 7), Lnodes=c("L1a", "L1b", "L2a", "L2b"),
Ynodes=grep("^Y", names(data)), abar=c(1, 0),
Qform=c(L1a="Q.kplus1 ~ 1", L1b="Q.klus1~A1",
Y1="Q.kplus1~L1a", L2a="Q.kplus1 ~ 1", L2b="Q.klus1~A1", Y2="Q.kplus1~A2 + gender"))
#> L/Y nodes (after removing blocks) : L1a L2a
#> Qform names : L1a L1b Y1 L2a L2b Y2
#> The following nodes are not being considered as L/Y nodes because they are part of a block
#> of L/Y nodes. They are being dropped from Qform:
#> L1b
#> Y1
#> L2b
#> Y2
#> gform not specified, using defaults:
#> formula for A1:
#> A1 ~ age + gender
#> formula for A2:
#> A2 ~ age + gender + A1 + L1a + L1b + Y1
#>
#> Estimate of time to completion: < 1 minute
#> Warning in CheckForVarianceWarning(inputs, g.ratio): Variance estimate is based
#> on influence curve only, which may be significantly anticonservative because
#> your data appears to contain positivity violations. Robust variance estimate is
#> not currently available with non binary outcomes but this will be addressed in
#> a future release.
summary(result)
#> Estimator: tmle
#> Call:
#> ltmle(data = data, Anodes = c(3, 7), Lnodes = c("L1a", "L1b",
#> "L2a", "L2b"), Ynodes = grep("^Y", names(data)), Qform = c(L1a = "Q.kplus1 ~ 1",
#> L1b = "Q.klus1~A1", Y1 = "Q.kplus1~L1a", L2a = "Q.kplus1 ~ 1",
#> L2b = "Q.klus1~A1", Y2 = "Q.kplus1~A2 + gender"), abar = c(1,
#> 0))
#>
#> Parameter Estimate: 0.85108
#> Estimated Std Err: 0.027434
#> p-value: <2e-16
#> 95% Conf Interval: (0.79731, 0.90485)
If there were a Anode or Cnode between L1b and Y1, Y1 would also need a Q regression formula.