--- title: "Dynamic Data Definition" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Dynamic Data Definition} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r chunkname, echo=-1} data.table::setDTthreads(2) ``` ```{r, echo = FALSE, message = FALSE} options(digits = 3) library(simstudy) library(ggplot2) library(scales) library(grid) library(gridExtra) library(survival) library(gee) library(data.table) odds <- function (p) p/(1 - p) # TODO temporary remove when added to package plotcolors <- c("#B84226", "#1B8445", "#1C5974") cbbPalette <- c("#B84226","#B88F26", "#A5B435", "#1B8446", "#B87326","#B8A526", "#6CA723", "#1C5974") ggtheme <- function(panelback = "white") { ggplot2::theme( panel.background = element_rect(fill = panelback), panel.grid = element_blank(), axis.ticks = element_line(colour = "black"), panel.spacing =unit(0.25, "lines"), # requires package grid panel.border = element_rect(fill = NA, colour="gray90"), plot.title = element_text(size = 8,vjust=.5,hjust=0), axis.text = element_text(size=8), axis.title = element_text(size = 8) ) } ``` Often, we'd like to explore data generation and modeling under different scenarios. For example, we might want to understand the operating characteristics of a model given different variance or other parametric assumptions. There is functionality built into `simstudy` to facilitate this type of dynamic exploration. First, the functions `updateDef` and `updateDefAdd` essentially allow us to edit lines of existing data definition tables. Second, there is a built-in mechanism - called *double-dot* reference - to access external variables that do not exist in a defined data set or data definition. ## Updating existing definition tables The `updateDef` function updates a row in a definition table created by functions `defData` or `defRead`. Analogously, `updateDefAdd` function updates a row in a definition table created by functions `defDataAdd` or `defReadAdd`. The original data set definition includes three variables `x`, `y`, and `z`, all normally distributed: ```{r} defs <- defData(varname = "x", formula = 0, variance = 3, dist = "normal") defs <- defData(defs, varname = "y", formula = "2 + 3*x", variance = 1, dist = "normal") defs <- defData(defs, varname = "z", formula = "4 + 3*x - 2*y", variance = 1, dist = "normal") defs ``` In the first case, we are changing the relationship of `y` with `x` as well as the variance: ```{r} defs <- updateDef(dtDefs = defs, changevar = "y", newformula = "x + 5", newvariance = 2) defs ``` In this second case, we are changing the distribution of `z` to *Poisson* and updating the link function to *log*: ```{r} defs <- updateDef(dtDefs = defs, changevar = "z", newdist = "poisson", newlink = "log") defs ``` And in the last case, we remove a variable from a data set definition. Note in the case of a definition created by `defData` that it is not possible to remove a variable that is a predictor of a subsequent variable, such as `x` or `y` in this case. ```{r} defs <- updateDef(dtDefs = defs, changevar = "z", remove = TRUE) defs ``` ## Double-dot external variable reference For a truly dynamic data definition process, `simstudy` (as of `version 0.2.0`) allows users to reference variables that exist outside of data generation. These can be thought of as a type of hyperparameter of the data generation process. The reference is made directly in the formula itself, using a double-dot ("..") notation before the variable name. Here is a simple example: ```{r} def <- defData(varname = "x", formula = 0, variance = 5, dist = "normal") def <- defData(def, varname = "y", formula = "..B0 + ..B1 * x", variance = "..sigma2", dist = "normal") def ``` ```{r} B0 <- 4; B1 <- 2; sigma2 <- 9 set.seed(716251) dd <- genData(100, def) fit <- summary(lm(y ~ x, data = dd)) coef(fit) fit$sigma ``` It is easy to create a new data set on the fly with a difference variance assumption without having to go to the trouble of updating the data definitions. ```{r} sigma2 <- 16 dd <- genData(100, def) fit <- summary(lm(y ~ x, data = dd)) coef(fit) fit$sigma ``` The double-dot notation can be flexibly applied using `lapply` (or the parallel version `mclapply`) to create a range of data sets under different assumptions: ```{r, fig.width = 5} sigma2s <- c(1, 2, 6, 9) gen_data <- function(sigma2, d) { dd <- genData(200, d) dd$sigma2 <- sigma2 dd } dd_4 <- lapply(sigma2s, function(s) gen_data(s, def)) dd_4 <- rbindlist(dd_4) ggplot(data = dd_4, aes(x = x, y = y)) + geom_point(size = .5, color = "grey30") + facet_wrap(sigma2 ~ .) + theme(panel.grid = element_blank()) ``` ## Using non-scalar double-dot variable reference The double-dot notation is also *array-friendly*. For example if we want to create a mixture distribution from a vector of values (which we can also do using a *categorical* distribution), we can define the mixture formula in terms of the vector. In this case we are generating permuted block sizes of 2 and 4: ```{r} defblk <- defData(varname = "blksize", formula = "..sizes[1] | .5 + ..sizes[2] | .5", dist = "mixture") defblk ``` ```{r} sizes <- c(2, 4) genData(1000, defblk) ``` In this second example, there is a vector variable *tau* of positive real numbers that sum to 1, and we want to calculate the weighted average of three numbers using *tau* as the weights. We could use the following code to estimate a weighted average *theta*: ```{r} tau <- rgamma(3, 5, 2) tau <- tau / sum(tau) tau d <- defData(varname = "a", formula = 3, variance = 4) d <- defData(d, varname = "b", formula = 8, variance = 2) d <- defData(d, varname = "c", formula = 11, variance = 6) d <- defData(d, varname = "theta", formula = "..tau[1]*a + ..tau[2]*b + ..tau[3]*c", dist = "nonrandom") set.seed(1) genData(4, d) ``` We can simplify the calculation of *theta* by using matrix multiplication: ```{r} d <- updateDef(d, changevar = "theta", newformula = "t(..tau) %*% c(a, b, c)") set.seed(1) genData(4, d) ``` These arrays can also have **multiple dimensions**, as in a $2 \times 2$ matrix. If we want to specify the mean outcomes for a factorial study design with two interventions $a$ and $b$, we can use a simple matrix and draw the means directly from the matrix, which in this example is stored in the variable *effect*: ```{r} effect <- matrix(c(0, 4, 5, 7), nrow = 2) effect ``` Using double dot notation, it is possible to reference the matrix cell values directly: ```{r} d1 <- defData(varname = "a", formula = ".5;.5", variance = "1;2", dist = "categorical") d1 <- defData(d1, varname = "b", formula = ".5;.5", variance = "1;2", dist = "categorical") d1 <- defData(d1, varname = "outcome", formula = "..effect[a, b]", dist="nonrandom") ``` ```{r} dx <- genData(1000, d1) dx ``` It is possible to generate normally distributed data based on these means: ```{r} d1 <- updateDef(d1, "outcome", newvariance = 9, newdist = "normal") dx <- genData(1000, d1) ``` The plot shows the individual values as well as the mean values by intervention arm: ```{r, echo=FALSE} dsum <- dx[, .(outcome=mean(outcome)), keyby = .(a, b)] ggplot(data = dx, aes(x = factor(a), y = outcome)) + geom_jitter(aes(color = factor(b)), width = .2, alpha = .4, size = .2) + geom_point(data = dsum, size = 2, aes(color = factor(b))) + geom_line(data = dsum, linewidth = 1, aes(color = factor(b), group = factor(b))) + scale_color_manual(values = cbbPalette, name = " b") + theme(panel.grid = element_blank()) + xlab ("a") ```