The package morseDR is dedicated to the analysis of data from standard toxicity tests at target time. It provides a simple workflow for exploring/visualising a dataset and computing estimates of risk assessment indicators, such as the typical \(LC_{50}\) or \(EC_{50}\) values. This document illustrates a typical use of morseDR on binary, count and quantitative continuous data, which can be followed step by step to analyse new datasets. Details on models and parameters can be found in the vignette Dose-Response models in the morseDR package.
The following example shows all the steps to perform a dose-response analysis on standard binary toxicity test data and to produce estimates of the \(LC_x\). We are using survival data as an example, but the same analysis can be performed for any type of binary data (e.g., mobility or burrowing).
We will use a dataset from the library named cadmium2, which contains both survival and reproduction data from a chronic laboratory toxicity test. In this experiment, snails were exposed to six concentrations of a heavy metal contaminant (cadmium) for 56 days.
The data from a survival toxicity test should be collected in a data.frame with a specific layout. This is documented in the section on the binaryData() function in the *Reference manual*, and likewise for modelData(..., type = 'binary'). You can examine the datasets provided in the package, e.g., cadmium2.
First, we load the dataset and use the binaryDataCheck()
function to check if it has the expected layout:
data(cadmium2)
binaryDataCheck(cadmium2)
#> Correct formatting
#> [1] msg
#> <0 rows> (or 0-length row.names)
The output Correct formatting
just informs that the
dataset is well-formatted.
The BinaryData object
The BinaryData class corresponds to correctly formatted binary data and is the basic layout used for the subsequent operations.
Note that if the call to binaryDataCheck() returns no error (namely, the output Correct formatting), it is guaranteed that the binaryData() function will not fail.
survData_Cd <- binaryData(cadmium2)
head(survData_Cd)
#> conc time Nsurv Nrepro replicate Ninit
#> 1 0 0 5 0 1 5
#> 2 0 3 5 262 1 5
#> 3 0 7 5 343 1 5
#> 4 0 10 5 459 1 5
#> 5 0 14 5 328 1 5
#> 6 0 17 5 742 1 5
Note also the use of the function modelData(..., type = 'binary'):
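The call below is a sketch based on the equivalent modelData() usage shown for count and continuous data later in this document; the object name is illustrative.
# equivalent to binaryData(cadmium2), via the generic modelData() interface
survData_Cd_bis <- modelData(cadmium2, type = 'binary')
head(survData_Cd_bis)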
The plot()
function can be used to plot the number of
surviving individuals as a function of time for all concentrations and
replicates.
If the argument pool.replicate is TRUE, the data points at a given time point and a given concentration are pooled and only the mean number of survivors is plotted. To observe the full dataset, this option must be set to FALSE.
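As an illustration, here is a sketch of such a call; the addlegend and pool.replicate arguments are assumed to behave as in the count-data plotting example later in this vignette.
# plot the number of survivors through time, keeping every replicate visible
plot(survData_Cd, addlegend = TRUE, pool.replicate = FALSE)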
By selecting one of the concentrations tested, we can visualise the particular plot corresponding to that specified concentration:
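For instance, reusing the concentration = 124 value from the count-data example below (a sketch, not necessarily the exact call used to produce the original figure):
# focus on a single tested concentration
plot(survData_Cd, concentration = 124, addlegend = TRUE, pool.replicate = FALSE)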
We can also pool the replicates. In such a situation, the legend becomes useless.
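A possible call, under the same assumptions as above:
# pool replicates: only the mean number of survivors is shown
plot(survData_Cd, concentration = 124, pool.replicate = TRUE)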
We can also plot the survival rate, at a given time point, as a
function of concentration, with binomial confidence intervals around the
data. This is achieved by using the doseResponse()
function
and setting the target.time
option (the default is the end
of the experiment).
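The object plotted below can be created as follows; this is a sketch in which target.time = 21 and pool.replicate = FALSE are assumed, to contrast with the pooled call further down.
# dose-response data at the chosen target time, keeping replicates separate
survData_Cd_DR <- doseResponse(survData_Cd, target.time = 21, pool.replicate = FALSE)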
plot(survData_Cd_DR, log.scale = TRUE)
#> When using log-scale, data with concentration at 0 are removed
#> Warning in scale_x_log10(): log-10 transformation introduced infinite values.
# pooling replicates
survData_Cd_DR <- doseResponse(survData_Cd, target.time = 21, pool.replicate = TRUE)
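The pooled dose-response data can then be plotted in the same way (a sketch):
plot(survData_Cd_DR, log.scale = TRUE)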
We are now ready to fit the probabilistic model to the survival data, in order to describe the relationship between the concentration of the chemical compound and survival at a given target time. Our model assumes that the latter is a log-logistic function of the former, from which the package provides estimates of the parameters.
We can use the fit() function, which automatically recognises the class of the BinaryData object:
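The fit object used in the remainder of this section can be obtained as follows; this is a sketch in which target.time = 21 mirrors the cadmium1 example below.
# fit the log-logistic dose-response model at the target time
survFit_Cd <- fit(survData_Cd, target.time = 21)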
An object of class FitTT, here specifically of class BinaryFitTT, is returned with the estimated parameters as a posterior[^1] distribution, quantifying the uncertainty around their true values.
Parameter estimates can be obtained by using the summary() function. If the inference went well, we expect the quantiles of the posterior distributions to be closer together than those of the priors, meaning that the data were helpful in reducing the uncertainty on the true values of the parameters.
summary(survFit_Cd)
#> Summary:
#>
#> The loglogisticbinom_3 model with a binomial stochastic part was used !
#>
#> Priors on parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 1.000e+00 1.259e-02 7.943e+01
#> d 5.000e-01 2.500e-02 9.750e-01
#> e 1.227e+02 5.390e+01 2.793e+02
#>
#> Posteriors of the parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 8.571e+00 3.968e+00 1.599e+01
#> d 9.571e-01 9.087e-01 9.864e-01
#> e 2.365e+02 2.100e+02 2.534e+02
Then, we can compute any \(EC_x\) or \(LC_x\) values as the median (i.e., a point estimate) and the 2.5% and 97.5% quantiles of the posterior distribution (i.e., a measure of uncertainty, also known as the 95% credible interval).
LCx_Cd <- xcx(survFit_Cd, x = c(10, 20, 30, 40, 50))
LCx_Cd$quantiles
#> median Qinf95 Qsup95
#> x: 10 183.1935 125.4077 214.1247
#> x: 20 201.1952 152.6669 226.1819
#> x: 30 214.2854 173.7298 235.1099
#> x: 40 225.6554 192.4798 243.6166
#> x: 50 236.5119 210.0399 253.3907
We also have access to the full distribution stored in the XCx object:
head(LCx_Cd$distribution)
#> x: 10 x: 20 x: 30 x: 40 x: 50
#> [1,] 159.6457 183.7281 201.7122 217.7599 233.6077
#> [2,] 190.9065 206.3875 217.3655 226.7987 235.8152
#> [3,] 141.7979 165.9161 184.1751 200.6318 217.0246
#> [4,] 169.4401 188.7189 202.7315 214.9907 226.8920
#> [5,] 207.6755 221.1105 230.5177 238.5269 246.1215
#> [6,] 210.9375 221.0854 228.0991 234.0141 239.5770
And we can plot the distribution:
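For example, assuming a plot() method is available for XCx objects:
plot(LCx_Cd)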
The fit can also be plotted as follows:
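A possible call (a sketch; the log.scale argument is assumed to behave as in the dose-response plots above):
plot(survFit_Cd, log.scale = TRUE)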
This plot shows the estimated relationship between the concentration of the chemical compound and the survival probability (orange curve). It is computed by setting each parameter to the median of its posterior. To assess the uncertainty around this median curve, we generate many such curves by sampling the parameters in the posterior distribution. This results in the grey band, which for any given concentration shows an interval (namely the credible interval) that is expected to contain the observed survival probability 95% of the time.
The experimental data are shown as black dots and correspond to the observed survival probabilities when all replicates are pooled. The black error bars correspond to the 95% binomial confidence interval, which is another, simpler way of bounding the most likely value of the observed survival probability for a tested concentration. In favourable situations, we expect that the credible interval around the estimated curve and the confidence interval around the experimental data overlap to a large extent.
Note that the fit() function will warn you if the estimated \(LC_{x}\) lies outside the range of the tested concentrations, as in the following example:
data("cadmium1")
doubtful_fit <- fit(binaryData(cadmium1), target.time = 21)
#> Warning: The LC50 estimation (model parameter e) lies outside the range
#> of tested concentrations and may be unreliable as the prior
#> distribution on this parameter is defined from this range !
In this example, the experimental design does not include enough high concentrations: we are missing measurements that would have led to more than 50% effect at the highest tested concentration. Therefore, even if the fit delivers an estimate of parameter \(e\), it should be considered unreliable.
The fit can be further checked using what is known as a posterior predictive check (PPC): the principle is to plot the observed values against their corresponding predictions, plotted as medians along with their 95% credible intervals. If the fit is correct, we expect about 95% of the data to fall within the credible intervals.
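A possible way to obtain this plot, following the ppc() pattern used for count data later in this document and assuming the resulting object has a plot() method:
ppcSurvFit_Cd <- ppc(survFit_Cd)
plot(ppcSurvFit_Cd)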
The black dots on the above PPC plot correspond to the observed values (\(x\) coordinate) plotted against their corresponding estimated medians (\(y\) coordinate), along with their 95% credible intervals (vertical segments). Ideally, observations and predictions should be in agreement, so we would expect the black dots to be aligned along the first bisector line of equation \(Y = X\) (i.e., the 1:1 line). The implemented model provides a tolerable variation around the predicted mean value as an interval in which we expect 95% of the dots to lie on average. The intervals are coloured in green if they overlap with the 1:1 line, otherwise they are coloured in red.
The summary of PPC is also available:
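Using the PPC object created above (a sketch):
summary(ppcSurvFit_Cd)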
The expectation is to get narrower orange (posterior) distributions compared to the grey (prior) ones, as well as posterior distributions that are not constrained in their tails by the priors (e.g., parameters \(b\) and \(e\)), except where this corresponds to a biological or mathematical constraint (e.g., parameter \(d\)).
Using the library GGally, you can get a correlation plot between parameters:
library(GGally)
#> Loading required package: ggplot2
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
Cd_posterior <- posterior(survFit_Cd)
ggpairs(Cd_posterior)
The expectation is to get potato-shaped clouds in the pairwise parameter planes.
To test the goodness of fit, you can extract the model, an element of the FitTT object, to run extra modules like dic or waic:
library(rjags)
#> Loading required package: coda
#> Linked to JAGS 4.3.2
#> Loaded modules: basemod,bugs
fit <- survFit_Cd
model <- rjags::jags.model(file = textConnection(fit$model.specification$model.text),
data = fit$jags.data,
n.chains = length(fit$mcmc),
n.adapt = 3000)
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 6
#> Unobserved stochastic nodes: 9
#> Total graph size: 62
#>
#> Initializing model
n_iter <- nrow(fit$mcmc[[1]])
dic.samples(model, n.iter = n_iter)
#> Mean deviance: 19.23
#> penalty 3.133
#> Penalized deviance: 22.36
If you have JAGS version > 4.3.0, you can load the dic module with load.module("dic") and then compute both the deviance and the WAIC:
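The following continuation of the rjags session above is a sketch: it loosely follows the recipe published for the JAGS dic module, in which the WAIC monitor returns the pointwise penalty term.
load.module("dic")
# monitor the mean deviance and the pointwise WAIC penalty contributions
dw <- rjags::jags.samples(model, c("deviance", "WAIC"),
                          type = "mean", n.iter = n_iter)
# sum the pointwise contributions over the observed nodes
deviance <- sum(dw$deviance)
p_waic <- sum(dw$WAIC)
waic <- deviance + p_waic
c(deviance = deviance, p_waic = p_waic, waic = waic)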
The steps for dose-response analysis of count data are exactly the same as those described above for binary data, except that the ultimate goal is to get \(EC_x\) estimates. We use reproduction data as an example, but the same analysis can be performed for any type of count data.
Here is a typical session:
# (1) load dataset
data(cadmium2)
# (2) check structure and integrity of the dataset
countDataCheck(cadmium2)
#> Correct formatting
#> [1] msg
#> <0 rows> (or 0-length row.names)
# (3) create a `countData` object
dat <- countData(cadmium2)
head(dat)
#> conc time Nsurv Nrepro replicate Ninit Nreprocumul Nindtime
#> 1 0 0 5 0 1 5 0 0
#> 2 0 3 5 262 1 5 262 15
#> 3 0 7 5 343 1 5 605 35
#> 4 0 10 5 459 1 5 1064 50
#> 5 0 14 5 328 1 5 1392 70
#> 6 0 17 5 742 1 5 2134 85
The last step, countData(), can also be done using the function modelData(..., type = 'count'):
# (3) create a `countData` object
dat <- modelData(cadmium2, type = 'count')
head(dat)
#> conc time Nsurv Nrepro replicate Ninit Nreprocumul Nindtime
#> 1 0 0 5 0 1 5 0 0
#> 2 0 3 5 262 1 5 262 15
#> 3 0 7 5 343 1 5 605 35
#> 4 0 10 5 459 1 5 1064 50
#> 5 0 14 5 328 1 5 1392 70
#> 6 0 17 5 742 1 5 2134 85
Then we can plot the data:
# (4) represent the cumulated number of offspring as a function of time
plot(dat, concentration = 124, addlegend = TRUE, pool.replicate = FALSE)
And also plot the dose-response:
# (5) represent the reproduction rate as a function of concentration
dat_DR <- doseResponse(dat, target.time = 28)
plot(dat_DR)
# (7) fit a concentration-effect model at target-time
reproFit <- fit(dat, stoc.part = "bestfit", target.time = 21, quiet = TRUE)
summary(reproFit)
#> Summary:
#>
#> The GP model with a Gamma Poisson stochastic part was used !
#>
#> Priors on parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 1.000e+00 1.259e-02 7.943e+01
#> d 1.819e+01 1.538e+01 2.100e+01
#> e 1.488e+02 7.902e+01 2.804e+02
#> omega 1.000e+00 1.585e-04 6.310e+03
#>
#> Posteriors of the parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 4.052e+00 3.024e+00 5.746e+00
#> d 1.759e+01 1.535e+01 1.998e+01
#> e 1.357e+02 1.138e+02 1.677e+02
#> omega 1.346e+00 7.713e-01 2.629e+00
# (8) Get ECx estimates
ECx_count <- xcx(reproFit, x = c(10, 20, 30, 40, 50))
ECx_count$quantiles
#> median Qinf95 Qsup95
#> x: 10 78.77297 56.69611 111.9041
#> x: 20 96.30477 73.64211 129.7089
#> x: 30 109.97662 87.43214 143.0404
#> x: 40 122.76155 100.50471 155.1730
#> x: 50 135.66685 113.82796 167.7047
As above, the expectation from the PPC plot is that 95% of the data will fall within the predicted 95% credible intervals.
For count data analyses, the morseDR package automatically compares a model that neglects the between-replicate variability (called the Poisson model) with another that takes it into account (called the Gamma-Poisson model). However, you can manually choose either one or the other with the stoc.part option. Setting this option to "bestfit" lets the fit() function decide which model fits the data best. The corresponding choice can be seen by calling the summary() function:
summary(reproFit)
#> Summary:
#>
#> The GP model with a Gamma Poisson stochastic part was used !
#>
#> Priors on parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 1.000e+00 1.259e-02 7.943e+01
#> d 1.819e+01 1.538e+01 2.100e+01
#> e 1.488e+02 7.902e+01 2.804e+02
#> omega 1.000e+00 1.585e-04 6.310e+03
#>
#> Posteriors of the parameters (quantiles):
#>
#> 50% 2.5% 97.5%
#> b 4.052e+00 3.024e+00 5.746e+00
#> d 1.759e+01 1.535e+01 1.998e+01
#> e 1.357e+02 1.138e+02 1.677e+02
#> omega 1.346e+00 7.713e-01 2.629e+00
When the Gamma-Poisson model is selected (or has been automatically selected), the summary shows an additional parameter called omega, which quantifies the variability between replicates: the higher omega, the higher the between-replicate variability.
In morseDR, count datasets are a special case of binary datasets. Indeed, a count dataset includes the same information as a binary dataset plus the count records. For this reason, the S3 class countData inherits from the binaryData class, which means that any operation on a binaryData object is permitted on a countData object. In particular, to use the plot function related to the binary data analysis on a countData object, we can first use binaryData() as a formatting function:
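A possible illustration, simply reformatting the raw data frame with binaryData() before plotting (a sketch):
plot(binaryData(cadmium2), addlegend = TRUE, pool.replicate = FALSE)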
[^1]: In Bayesian inference, the parameters of a model are estimated from the data using what is called a prior, which is a probability distribution that represents an initial guess of the true value of the model parameters before looking at the data. The posterior distribution represents the uncertainty in the parameters after looking at the data and combining them with the prior information. To obtain a point estimate of the parameters, it is typical to compute the mean or median of the posterior distribution. We can quantify the uncertainty by reporting the standard deviation or an inter-quantile distance from this posterior distribution.
The summary of PPC is also available:
ppcReproFit <- ppc(reproFit)
summary(ppcReproFit)
#> Summary:
#>
#> Percent of Observation in the 95% Credible Interval: 97.2
The morseDR
package also provides turnkey R functions
for performing dose-response analysis of quantitative
continuous toxicity test data in a Bayesian framework,
including the estimation of the x% effective toxicity value, which can
be an x% effective rate (\(ER_x\)), an
x% effective concentration (\(EC_x\))
or any other expression of your choice. We use growth data as an
example, but the same analysis can be performed for any type of
quantitative continuous data.
You can access the datasets already provided in the package with the data() function. The continuousData() function converts a data frame into a continuousData object after performing all the checks required for a quantitative continuous analysis.
data("cadmium_daphnia")
gCdd <- continuousData(cadmium_daphnia)
head(gCdd)
#> # A tibble: 6 × 4
#> time replicate conc measure
#> <dbl> <chr> <dbl> <dbl>
#> 1 0 D 0 0.99
#> 2 0 D 0 1.01
#> 3 0 D 0 0.99
#> 4 0 D 0 0.94
#> 5 0 D 0 0.99
#> 6 0 D 0 0.94
# Equivalent to the above line
gCdd <- modelData(cadmium_daphnia, type = "continuous")
head(gCdd)
#> # A tibble: 6 × 4
#> time replicate conc measure
#> <dbl> <chr> <dbl> <dbl>
#> 1 0 D 0 0.99
#> 2 0 D 0 1.01
#> 3 0 D 0 0.99
#> 4 0 D 0 0.94
#> 5 0 D 0 0.99
#> 6 0 D 0 0.94
As with other analyses on binary and count data, you can check the
compliance of the dataset by using the
continuousDataCheck()
function.
continuousDataCheck(cadmium_daphnia)
#> Correct formatting
#> [1] msg
#> <0 rows> (or 0-length row.names)
The dataset can then be plotted with different options:
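For example (a sketch; additional plotting options exist but are not shown here):
plot(gCdd)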
Before the inference step, you can run the t.test function on the loaded data via the doseResponse() function.
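The object plotted below can be created as follows; this is a sketch assuming the default target.time (the end of the experiment):
gCdd_DR <- doseResponse(gCdd)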
plot(gCdd_DR, log.scale = TRUE)
#> When using log-scale, data with concentration at 0 are removed
#> Warning in scale_x_log10(): log-10 transformation introduced infinite values.
By default, the dose-response analysis is performed at the end of the experiment (namely the maximal time point in the dataset). However, you can specify a different target.time.
The fitting process is done using the fit()
function:
fit_gCdd <- fit(gCdd)
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 210
#> Unobserved stochastic nodes: 214
#> Total graph size: 672
#>
#> Initializing model
#> Warning: The EC50 estimation (model parameter e) lies outside the range
#> of tested concentrations and may be unreliable as the prior
#> distribution on this parameter is defined from this range !
Then you can plot the results:
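For instance (a sketch):
plot(fit_gCdd)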