---
title: "Customized Distributions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Customized Distributions}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r chunkname, echo=-1}
data.table::setDTthreads(2)
```

```{r, echo = FALSE, message = FALSE}
library(simstudy)
library(ggplot2)
library(scales)
library(grid)
library(gridExtra)
library(survival)
library(gee)
library(data.table)
library(ordinal)

odds <- function (p)  p/(1 - p) # TODO temporary remove when added to package
plotcolors <- c("#B84226", "#1B8445", "#1C5974")

cbbPalette <- c("#B84226","#B88F26", "#A5B435", "#1B8446",
                "#B87326","#B8A526", "#6CA723", "#1C5974") 

ggtheme <- function(panelback = "white") {
  
  ggplot2::theme(
    panel.background = element_rect(fill = panelback),
    panel.grid = element_blank(),
    axis.ticks =  element_line(colour = "black"),
    panel.spacing =unit(0.25, "lines"),  # requires package grid
    panel.border = element_rect(fill = NA, colour="gray90"), 
    plot.title = element_text(size = 8,vjust=.5,hjust=0),
    axis.text = element_text(size=8),
    axis.title = element_text(size = 8)
  )  
  
}

```

Custom distributions can be specified in `defData` and `defDataAdd` by setting the argument *dist* to "custom". When defining a custom distribution, you provide the name of the user-defined function as a string in the *formula* argument. The arguments of the custom function are listed in the *variance* argument, separated by commas and formatted as "**arg_1 = val_form_1, arg_2 = val_form_2, $\dots$, arg_K = val_form_K**".

Here, the *arg_k's* represent the names of the arguments passed to the customized function, where $k$ ranges from $1$ to $K$. You can use values or formulas for each *val_form_k*. If formulas are used, ensure that the variables have been previously generated. Double dot notation is available in specifying *value_formula_k*. One important requirement of the custom function is that the parameter list used to define the function must include an argument"**n = n**", but do not include $n$ in the definition as part of `defData` or `defDataAdd`.

### Example 1

Here is an example where we would like to generate data from a zero-inflated beta distribution. In this case, there is a user-defined function `zeroBeta` that takes on shape parameters $a$ and $b$, as well as $p_0$, the proportion of the sample that is zero. Note that the function also takes an argument $n$ that will not to be be specified in the data definition; $n$ will represent the number of observations being generated:

```{r}
zeroBeta <- function(n, a, b, p0) {
  betas <- rbeta(n, a, b)
  is.zero <- rbinom(n, 1, p0)
  betas*!(is.zero)
}
```

The data definition specifies a new variable $zb$ that sets $a$ and $b$ to 0.75, and $p_0 = 0.02$:

```{r}
def <- defData(
  varname = "zb", 
  formula = "zeroBeta", 
  variance = "a = 0.75, b = 0.75, p0 = 0.02", 
  dist = "custom"
)
```

The data are generated:

```{r}
set.seed(1234)
dd <- genData(100000, def)
```

```{r, echo = FALSE}
dd
```

A plot of the data reveals dis-proportion of zero's:

```{r, fig.width = 6, fig.height = 3, echo = FALSE}
ggplot(data = dd, aes(x = zb)) +
  geom_histogram(binwidth = 0.01, boundary = 0, fill = "grey60") +
  theme(panel.grid = element_blank()) 
```

### Example 2

In this second example, we are generating sets of truncated Gaussian distributions with means ranging from $-1$ to $1$. The limits of the truncation vary across three different groups. `rnormt` is a customized (user-defined) function that generates the truncated Gaussiandata. The function requires four arguments (the left truncation value, the right truncation value, the distribution average and the standard deviation).

```{r}
rnormt <- function(n, min, max, mu, s) {
  
  F.a <- pnorm(min, mean = mu, sd = s)
  F.b <- pnorm(max, mean = mu, sd = s)
  
  u <- runif(n, min = F.a, max = F.b)
  qnorm(u, mean = mu, sd = s)
  
}
```


In this example, truncation limits vary based on group membership. Initially, three groups are created, followed by the generation of truncated values. For Group 1, truncation occurs within the range of $-1$ to $1$, for Group 2, it's $-2$ to $2$ and for Group 3, it's $-3$ to $3$. We'll generate three data sets, each with a distinct mean denoted by M, using the double-dot notation to implement these different means.

```{r}
def <-
  defData(
    varname = "limit", 
    formula = "1/4;1/2;1/4",
    dist = "categorical"
  ) |>
  defData(
    varname = "tn", 
    formula = "rnormt", 
    variance = "min = -limit, max = limit, mu = ..M, s = 1.5",
    dist = "custom"
  )
```

The data generation requires three calls to `genData`. The output is a list of three data sets:

```{r}
mus <- c(-1, 0, 1)
dd <-lapply(mus, function(M) genData(100000, def))
```

Here are the first six observations from each of the three data sets:

```{r, echo=FALSE}
lapply(dd, function(D) head(D))
```

A plot highlights the group differences.

```{r, fig.width = 8, fig.height = 6, echo = FALSE}
pfunc <- function(dx, i) {
  ggplot(data = dx, aes(x = tn)) +
    geom_histogram(aes(fill = factor(limit)), binwidth = 0.05, boundary = 0, alpha = .8) +
    facet_grid( ~ limit) +
    theme(panel.grid = element_blank(),
          legend.position = "none") +
    scale_fill_manual(values = plotcolors) +
    scale_x_continuous(breaks = seq(-3, 3, by =1)) +
    scale_y_continuous(limits = c(0, 1000)) +
    ggtitle(paste("mu =", mus[i]))
}

plist <- lapply(seq_along(dd), function(a) pfunc(dd[[a]], a))
grid.arrange(grobs = plist, nrow = 3)
```