The package morseDR is dedicated to the analysis of data from standard toxicity tests at target time. It provides a simple workflow for exploring/visualising a dataset and computing estimates of risk assessment indicators, such as the typical \(LC_{50}\) or \(EC_{50}\) values. This document illustrates a typical use of morseDR on binary, count and quantitative continuous data, which can be followed step by step to analyse new datasets. Details on models and parameters can be found in the vignette Dose-Response models in the morseDR package.
The following example shows all the steps to perform a dose-response analysis on standard binary toxicity test data and to produce estimates of the \(LC_x\). We are using survival data as an example, but the same analysis can be performed for any type of binary data (e.g., mobility or burrowing).
We will use a dataset from the library named cadmium2, which contains both survival and reproduction data from a chronic laboratory toxicity test. In this experiment, snails were exposed to six concentrations of a heavy metal contaminant (cadmium) for 56 days.
The data from a survival toxicity test should be collected in a data.frame with a specific layout. This is documented in the section on the binaryData() function in the *Reference manual*, and likewise for modelData(..., type = 'binary'). You can examine the datasets provided in the package, e.g., cadmium2.
First, we load the dataset and use the binaryDataCheck()
function to check if it has the expected layout:
data(cadmium2)
binaryDataCheck(cadmium2)
#> Correct formatting
#> [1] msg
#> <0 rows> (or 0-length row.names)
The output Correct formatting
just informs that the
dataset is well-formatted.
The BinaryData object
The BinaryData class corresponds to correctly formatted binary data and is the basic layout used for the subsequent operations.
Note that if the call to binaryDataCheck() returns no error (namely, the output Correct formatting), it is guaranteed that the binaryData() function will not fail.
survData_Cd <- binaryData(cadmium2)
head(survData_Cd)
#> conc time Nsurv Nrepro replicate Ninit
#> 1 0 0 5 0 1 5
#> 2 0 3 5 262 1 5
#> 3 0 7 5 343 1 5
#> 4 0 10 5 459 1 5
#> 5 0 14 5 328 1 5
#> 6 0 17 5 742 1 5
Note also the use of the function modelData(..., type = 'binary'):
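The call below is a sketch based on the equivalent modelData() usage shown for count and continuous data later in this document; the object name is illustrative.
# equivalent to binaryData(cadmium2), via the generic modelData() interface
survData_Cd_bis <- modelData(cadmium2, type = 'binary')
head(survData_Cd_bis)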
The plot()
function can be used to plot the number of
surviving individuals as a function of time for all concentrations and
replicates.
If the argument pool.replicate is TRUE, the data points at a given time point and a given concentration are pooled and only the mean number of survivors is plotted. To observe the full dataset, this option must be set to FALSE.
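As an illustration, here is a sketch of such a call; the addlegend and pool.replicate arguments are assumed to behave as in the count-data plotting example later in this vignette.
# plot the number of survivors through time, keeping every replicate visible
plot(survData_Cd, addlegend = TRUE, pool.replicate = FALSE)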
By selecting one of the concentrations tested, we can visualise the particular plot corresponding to that specified concentration:
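For instance, reusing the concentration = 124 value from the count-data example below (a sketch, not necessarily the exact call used to produce the original figure):
# focus on a single tested concentration
plot(survData_Cd, concentration = 124, addlegend = TRUE, pool.replicate = FALSE)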
We can also pool the replicates. In such a situation, the legend becomes useless.
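A possible call, under the same assumptions as above:
# pool replicates: only the mean number of survivors is shown
plot(survData_Cd, concentration = 124, pool.replicate = TRUE)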
We can also plot the survival rate, at a given time point, as a
function of concentration, with binomial confidence intervals around the
data. This is achieved by using the doseResponse()
function
and setting the target.time
option (the default is the end
of the experiment).
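The object plotted below can be created as follows; this is a sketch in which target.time = 21 and pool.replicate = FALSE are assumed, to contrast with the pooled call further down.
# dose-response data at the chosen target time, keeping replicates separate
survData_Cd_DR <- doseResponse(survData_Cd, target.time = 21, pool.replicate = FALSE)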
plot(survData_Cd_DR, log.scale = TRUE)
#> When using log-scale, data with concentration at 0 are removed
#> Warning in scale_x_log10(): log-10 transformation introduced infinite values.
# pooling replicates
survData_Cd_DR <- doseResponse(survData_Cd, target.time = 21, pool.replicate = TRUE)
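The pooled dose-response data can then be plotted in the same way (a sketch):
plot(survData_Cd_DR, log.scale = TRUE)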
We are now ready to fit the probabilistic model to the survival data, in order to describe the relationship between the concentration of the chemical compound and survival at a given target time. Our model assumes that the latter is a log-logistic function of the former, from which the package provides estimates of the parameters.
We can use the fit() function, which automatically recognises the class of the BinaryData object:
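The fit object used in the remainder of this section can be obtained as follows; this is a sketch in which target.time = 21 mirrors the cadmium1 example below.
# fit the log-logistic dose-response model at the target time
survFit_Cd <- fit(survData_Cd, target.time = 21)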
An object of class FitTT, here specifically of class BinaryFitTT, is returned with the estimated parameters as a posterior[^1] distribution, quantifying the uncertainty around their true values.
Parameter estimates can be obtained by using the summary() function. If the inference went well, we expect the quantiles of the posterior distributions to be closer together than those of the priors, meaning that the data were helpful in reducing the uncertainty on the true values of the parameters.
summary(survFit_Cd)
#> Summary:
#>
#> The loglogisticbinom_3 model with a binomial stochastic part was used !
#>
#> Priors on parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 1.000e+00 1.259e-02 7.943e+01
#> d 5.000e-01 2.500e-02 9.750e-01
#> e 1.227e+02 5.390e+01 2.793e+02
#>
#> Posteriors of the parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 8.571e+00 3.968e+00 1.599e+01
#> d 9.571e-01 9.087e-01 9.864e-01
#> e 2.365e+02 2.100e+02 2.534e+02
Then, we can compute any \(EC_x\) or \(LC_x\) values as the median (i.e., a point estimate) and the 2.5% and 97.5% quantiles of the posterior distribution (i.e., a measure of uncertainty, also known as the 95% credible interval).
LCx_Cd <- xcx(survFit_Cd, x = c(10, 20, 30, 40, 50))
LCx_Cd$quantiles
#> median Qinf95 Qsup95
#> x: 10 183.1935 125.4077 214.1247
#> x: 20 201.1952 152.6669 226.1819
#> x: 30 214.2854 173.7298 235.1099
#> x: 40 225.6554 192.4798 243.6166
#> x: 50 236.5119 210.0399 253.3907
We also have access to the full distribution stored in the XCx object:
head(LCx_Cd$distribution)
#> x: 10 x: 20 x: 30 x: 40 x: 50
#> [1,] 159.6457 183.7281 201.7122 217.7599 233.6077
#> [2,] 190.9065 206.3875 217.3655 226.7987 235.8152
#> [3,] 141.7979 165.9161 184.1751 200.6318 217.0246
#> [4,] 169.4401 188.7189 202.7315 214.9907 226.8920
#> [5,] 207.6755 221.1105 230.5177 238.5269 246.1215
#> [6,] 210.9375 221.0854 228.0991 234.0141 239.5770
And we can plot the distribution:
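For example, assuming a plot() method is available for XCx objects:
plot(LCx_Cd)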
The fit can also be plotted as follows:
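A possible call (a sketch; the log.scale argument is assumed to behave as in the dose-response plots above):
plot(survFit_Cd, log.scale = TRUE)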
This plot shows the estimated relationship between the concentration of the chemical compound and the survival probability (orange curve). It is computed by setting each parameter to the median of its posterior. To assess the uncertainty around this median curve, we generate many such curves by sampling the parameters in the posterior distribution. This results in the grey band, which for any given concentration shows an interval (namely the credible interval) that is expected to contain the observed survival probability 95% of the time.
The experimental data are shown as black dots and correspond to the observed survival probabilities when all replicates are pooled. The black error bars correspond to the 95% binomial confidence interval, which is another, simpler way of bounding the most likely value of the observed survival probability for a tested concentration. In favourable situations, we expect that the credible interval around the estimated curve and the confidence interval around the experimental data overlap to a large extent.
Note that the fit() function will warn you if the estimated \(LC_{x}\) lies outside the range of the tested concentrations, as in the following example:
data("cadmium1")
doubtful_fit <- fit(binaryData(cadmium1), target.time = 21)
#> Warning: The LC50 estimation (model parameter e) lies outside the range
#> of tested concentrations and may be unreliable as the prior
#> distribution on this parameter is defined from this range !
In this example, the experimental design does not include enough high concentrations: we are missing measurements that would have led to more than 50% effect at the highest tested concentration. Therefore, even if the fit delivers an estimate of parameter \(e\), it should be considered unreliable.
The fit can be further checked using what is known as a posterior predictive check (PPC): the principle is to plot the observed values against their corresponding predictions, plotted as medians along with their 95% credible intervals. If the fit is correct, we expect about 95% of the data to fall within the credible intervals.
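A possible way to obtain this plot, following the ppc() pattern used for count data later in this document and assuming the resulting object has a plot() method:
ppcSurvFit_Cd <- ppc(survFit_Cd)
plot(ppcSurvFit_Cd)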
The black dots on the above PPC plot correspond to the observed values (\(x\) coordinate) plotted against their corresponding estimated medians (\(y\) coordinate), along with their 95% credible intervals (vertical segments). Ideally, observations and predictions should be in agreement, so we would expect the black dots to be aligned along the first bisector line of equation \(Y = X\) (i.e., the 1:1 line). The implemented model provides a tolerable variation around the predicted mean value as an interval in which we expect 95% of the dots to lie on average. The intervals are coloured in green if they overlap with the 1:1 line, otherwise they are coloured in red.
The summary of PPC is also available:
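Using the PPC object created above (a sketch):
summary(ppcSurvFit_Cd)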
The expectation is to get narrower orange (posterior) distributions compared to the grey (prior) ones, as well as posterior distributions that are not constrained in their tails by the priors (e.g., parameters \(b\) and \(e\)), except where this corresponds to a biological or mathematical constraint (e.g., parameter \(d\)).
Using the library GGally, you can get a correlation plot between parameters:
library(GGally)
#> Loading required package: ggplot2
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
Cd_posterior <- posterior(survFit_Cd)
ggpairs(Cd_posterior)
The expectation is to get potato-shaped clouds in the pairwise parameter planes.
To test the goodness of fit, you can extract the model, an element of the FitTT object, to run extra modules like dic or waic:
library(rjags)
#> Loading required package: coda
#> Linked to JAGS 4.3.2
#> Loaded modules: basemod,bugs
fit <- survFit_Cd
model <- rjags::jags.model(file = textConnection(fit$model.specification$model.text),
data = fit$jags.data,
n.chains = length(fit$mcmc),
n.adapt = 3000)
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 6
#> Unobserved stochastic nodes: 9
#> Total graph size: 62
#>
#> Initializing model
n_iter <- nrow(fit$mcmc[[1]])
dic.samples(model, n.iter = n_iter)
#> Mean deviance: 19.23
#> penalty 3.133
#> Penalized deviance: 22.36
If you have JAGS version > 4.3.0, you can load the dic module with load.module("dic") and then compute both the deviance and the WAIC:
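The following continuation of the rjags session above is a sketch: it loosely follows the recipe published for the JAGS dic module, in which the WAIC monitor returns the pointwise penalty term.
load.module("dic")
# monitor the mean deviance and the pointwise WAIC penalty contributions
dw <- rjags::jags.samples(model, c("deviance", "WAIC"),
                          type = "mean", n.iter = n_iter)
# sum the pointwise contributions over the observed nodes
deviance <- sum(dw$deviance)
p_waic <- sum(dw$WAIC)
waic <- deviance + p_waic
c(deviance = deviance, p_waic = p_waic, waic = waic)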
The steps for dose-response analysis of count data are exactly the same as those described above for binary data, except that the ultimate goal is to get \(EC_x\) estimates. We use reproduction data as an example, but the same analysis can be performed for any type of count data.
Here is a typical session:
# (1) load dataset
data(cadmium2)
# (2) check structure and integrity of the dataset
countDataCheck(cadmium2)
#> Correct formatting
#> [1] msg
#> <0 rows> (or 0-length row.names)
# (3) create a `countData` object
dat <- countData(cadmium2)
head(dat)
#> conc time Nsurv Nrepro replicate Ninit Nreprocumul Nindtime
#> 1 0 0 5 0 1 5 0 0
#> 2 0 3 5 262 1 5 262 15
#> 3 0 7 5 343 1 5 605 35
#> 4 0 10 5 459 1 5 1064 50
#> 5 0 14 5 328 1 5 1392 70
#> 6 0 17 5 742 1 5 2134 85
The last step, countData(), can also be done using the function modelData(..., type = 'count'):
# (3) create a `countData` object
dat <- modelData(cadmium2, type = 'count')
head(dat)
#> conc time Nsurv Nrepro replicate Ninit Nreprocumul Nindtime
#> 1 0 0 5 0 1 5 0 0
#> 2 0 3 5 262 1 5 262 15
#> 3 0 7 5 343 1 5 605 35
#> 4 0 10 5 459 1 5 1064 50
#> 5 0 14 5 328 1 5 1392 70
#> 6 0 17 5 742 1 5 2134 85
Then we can plot the data:
# (4) represent the cumulated number of offspring as a function of time
plot(dat, concentration = 124, addlegend = TRUE, pool.replicate = FALSE)
And also plot the dose-response:
# (5) represent the reproduction rate as a function of concentration
dat_DR <- doseResponse(dat, target.time = 28)
plot(dat_DR)
# (7) fit a concentration-effect model at target-time
reproFit <- fit(dat, stoc.part = "bestfit", target.time = 21, quiet = TRUE)
summary(reproFit)
#> Summary:
#>
#> The GP model with a Gamma Poisson stochastic part was used !
#>
#> Priors on parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 1.000e+00 1.259e-02 7.943e+01
#> d 1.819e+01 1.538e+01 2.100e+01
#> e 1.488e+02 7.902e+01 2.804e+02
#> omega 1.000e+00 1.585e-04 6.310e+03
#>
#> Posteriors of the parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 4.052e+00 3.024e+00 5.746e+00
#> d 1.759e+01 1.535e+01 1.998e+01
#> e 1.357e+02 1.138e+02 1.677e+02
#> omega 1.346e+00 7.713e-01 2.629e+00
# (8) Get ECx estimates
ECx_count <- xcx(reproFit, x = c(10, 20, 30, 40, 50))
ECx_count$quantiles
#> median Qinf95 Qsup95
#> x: 10 78.77297 56.69611 111.9041
#> x: 20 96.30477 73.64211 129.7089
#> x: 30 109.97662 87.43214 143.0404
#> x: 40 122.76155 100.50471 155.1730
#> x: 50 135.66685 113.82796 167.7047
As above, the expectation from the PPC plot is that 95% of the data will fall within the predicted 95% credible intervals.
For count data analyses, the morseDR package automatically compares a model that neglects the between-replicate variability (called the Poisson model) with another that takes it into account (called the Gamma-Poisson model). However, you can manually choose either one or the other with the stoc.part option. Setting this option to "bestfit" lets the fit() function decide which model fits the data best. The corresponding choice can be seen by calling the summary() function:
summary(reproFit)
#> Summary:
#>
#> The GP model with a Gamma Poisson stochastic part was used !
#>
#> Priors on parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 1.000e+00 1.259e-02 7.943e+01
#> d 1.819e+01 1.538e+01 2.100e+01
#> e 1.488e+02 7.902e+01 2.804e+02
#> omega 1.000e+00 1.585e-04 6.310e+03
#>
#> Posteriors of the parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 4.052e+00 3.024e+00 5.746e+00
#> d 1.759e+01 1.535e+01 1.998e+01
#> e 1.357e+02 1.138e+02 1.677e+02
#> omega 1.346e+00 7.713e-01 2.629e+00
When the Gamma-Poisson model is selected (or has been automatically selected), the summary shows an additional parameter called omega, which quantifies the variability between replicates: the higher omega, the higher the between-replicate variability.
In morseDR, count datasets are a special case of binary datasets. Indeed, a count dataset includes the same information as a binary dataset plus the count records. For this reason, the S3 class countData inherits from the binaryData class, which means that any operation on a binaryData object is permitted on a countData object. In particular, to use the plot function related to the binary data analysis on a countData object, we can first use binaryData() as a formatting function:
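A possible illustration, simply reformatting the raw data frame with binaryData() before plotting (a sketch):
plot(binaryData(cadmium2), addlegend = TRUE, pool.replicate = FALSE)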
[^1]: In Bayesian inference, the parameters of a model are estimated from the data using what is called a prior, which is a probability distribution that represents an initial guess of the true value of the model parameters before looking at the data. The posterior distribution represents the uncertainty in the parameters after looking at the data and combining them with the prior information. To obtain a point estimate of the parameters, it is typical to compute the mean or median of the posterior distribution. We can quantify the uncertainty by reporting the standard deviation or an inter-quantile distance from this posterior distribution.
The summary of PPC is also available:
ppcReproFit <- ppc(reproFit)
summary(ppcReproFit)
#> Summary:
#>
#> Percent of Observation in the 95% Credible Interval: 97.2
The morseDR
package also provides turnkey R functions
for performing dose-response analysis of quantitative
continuous toxicity test data in a Bayesian framework,
including the estimation of the x% effective toxicity value, which can
be an x% effective rate (\(ER_x\)), an
x% effective concentration (\(EC_x\))
or any other expression of your choice. We use growth data as an
example, but the same analysis can be performed for any type of
quantitative continuous data.
You can access the datasets already provided in the package with the data() function. The continuousData() function converts a data frame into a continuousData object after performing all the checks required for a quantitative continuous analysis.
data("cadmium_daphnia")
gCdd <- continuousData(cadmium_daphnia)
head(gCdd)
#> # A tibble: 6 × 4
#> time replicate conc measure
#> <dbl> <chr> <dbl> <dbl>
#> 1 0 D 0 0.99
#> 2 0 D 0 1.01
#> 3 0 D 0 0.99
#> 4 0 D 0 0.94
#> 5 0 D 0 0.99
#> 6 0 D 0 0.94
# Equivalent to the above line
gCdd <- modelData(cadmium_daphnia, type = "continuous")
head(gCdd)
#> # A tibble: 6 × 4
#> time replicate conc measure
#> <dbl> <chr> <dbl> <dbl>
#> 1 0 D 0 0.99
#> 2 0 D 0 1.01
#> 3 0 D 0 0.99
#> 4 0 D 0 0.94
#> 5 0 D 0 0.99
#> 6 0 D 0 0.94
As with other analyses on binary and count data, you can check the
compliance of the dataset by using the
continuousDataCheck()
function.
continuousDataCheck(cadmium_daphnia)
#> Correct formatting
#> [1] msg
#> <0 rows> (or 0-length row.names)
The dataset can then be plotted with different options:
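For example (a sketch; additional plotting options exist but are not shown here):
plot(gCdd)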
Before the inference step, you can run the t.test function on the loaded data via the doseResponse() function.
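The object plotted below can be created as follows; this is a sketch assuming the default target.time (the end of the experiment):
gCdd_DR <- doseResponse(gCdd)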
plot(gCdd_DR, log.scale = TRUE)
#> When using log-scale, data with concentration at 0 are removed
#> Warning in scale_x_log10(): log-10 transformation introduced infinite values.
By default, the dose-response analysis is performed at the end of the experiment (namely the maximal time point in the dataset). However, you can specify a different target.time.
The fitting process is done using the fit()
function:
fit_gCdd <- fit(gCdd)
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 210
#> Unobserved stochastic nodes: 214
#> Total graph size: 672
#>
#> Initializing model
#> Warning: The EC50 estimation (model parameter e) lies outside the range
#> of tested concentrations and may be unreliable as the prior
#> distribution on this parameter is defined from this range !
Then you can plot the results:
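For instance (a sketch):
plot(fit_gCdd)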