This function imputes non-detect (censored) values in environmental laboratory analytical data using survival models with automatic distribution selection. It validates data quality requirements and fits multiple distributions to select the best model based on AIC. Each imputed value is guaranteed to be below its respective detection limit and above the specified minimum value.
Usage
impute_nondetect(
dt,
value_col = "value",
cens_col = "censored",
parameter_col = NULL,
unit_col = NULL,
dist = c("gaussian", "lognormal", "weibull", "exponential", "logistic", "loglogistic"),
min_observations = 25,
max_censored_pct = 75,
min_value = 0,
control = survival::survreg.control(),
verbose = FALSE
)
Arguments
- dt
A data.frame or data.table containing laboratory analytical data
- value_col
Character string specifying the column name containing values
- cens_col
Character string specifying the column name containing censoring indicators (0 = non-detect/censored, 1 = detected/observed)
- parameter_col
Character string specifying the column name containing parameter names (optional, for validation)
- unit_col
Character string specifying the column name containing units (optional, for validation)
- dist
Character vector of distributions to test. Options include:
"gaussian", "lognormal", "weibull", "exponential", "logistic", "loglogistic"
- min_observations
Minimum number of observations required for modeling (default: 25)
- max_censored_pct
Maximum percentage of censored values allowed (default: 75)
- min_value
Minimum allowable value for imputed concentrations (default: 0; use 1e-10 for strictly positive distributions)
- control
A survreg.control object used to control the fitting algorithm, e.g. maximum number of iterations and convergence tolerance. Defaults to survival::survreg.control(). Increase maxiter (e.g. survreg.control(maxiter = 200)) when convergence warnings are raised for complex datasets.
- verbose
Logical indicating whether to display progress messages and distribution fitting information (default: FALSE)
Value
A data.table with additional columns:
- [value_col]_imputed
Imputed values for non-detect observations
- [value_col]_final
Final values combining original detected and imputed non-detect values
The returned object also has attributes containing model information:
- best_model
The fitted survival model object
- best_distribution
Name of the best-fitting distribution
- detection_limits
Vector of all detection limits found in the data
- max_detection_limit
The highest detection limit (for reference)
- parameter
Parameter name (if parameter_col provided)
- unit
Unit of measurement (if unit_col provided)
- aic
AIC value of the best model
- sample_size
Total number of observations
- censored_pct
Percentage of censored observations
- convergence_warnings
Character vector of convergence warning messages emitted by survreg when fitting the best-selected distribution. An empty character vector (character(0)) indicates clean convergence. These warnings are always captured silently; set verbose = TRUE to also print them to the console.
Details
The function performs the following validation and modelling steps:
- Ensures sufficient sample size (>= min_observations)
- Checks that the censoring percentage is reasonable (<= max_censored_pct)
- Validates that only one parameter and one unit are present (if columns provided)
- Tests multiple distributions and selects the best based on AIC
- Generates random imputed values below each observation's detection limit and above min_value
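The AIC-based selection step can be illustrated with a minimal sketch. It is not the package's internal code: the simulated data, the single detection limit of 2, and the variable names are assumptions made for illustration. It fits each candidate distribution to left-censored data with survival::survreg, skips fits that error, and keeps the one with the lowest AIC.

```r
library(survival)

set.seed(42)
# Simulated concentrations (assumption: lognormal, one detection limit of 2)
true_conc <- rlnorm(60, meanlog = 1, sdlog = 0.5)
dl <- 2
censored <- as.integer(true_conc >= dl)  # 1 = detected, 0 = non-detect
value <- pmax(true_conc, dl)             # non-detects are reported at the limit

candidates <- c("gaussian", "lognormal", "weibull",
                "exponential", "logistic", "loglogistic")

# Fit every candidate; a distribution that fails to fit is returned as NULL
fits <- lapply(candidates, function(d) {
  tryCatch(
    suppressWarnings(
      survreg(Surv(value, censored, type = "left") ~ 1, dist = d)
    ),
    error = function(e) NULL
  )
})
names(fits) <- candidates
fits <- Filter(Negate(is.null), fits)

# Compare the surviving fits by AIC and pick the smallest
aic <- vapply(fits, AIC, numeric(1))
best_distribution <- names(which.min(aic))
```

With type = "left", an event indicator of 0 marks a left-censored observation, which matches the 0 = non-detect coding documented for cens_col.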
For non-detect observations (censored = 0), the value in value_col
is treated as the detection limit for that specific analysis, allowing for
different detection limits across samples or analytical methods.
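How a value can be drawn below a per-sample detection limit is sketched here. The documentation does not state the sampling scheme the function uses, so this example assumes one common approach: inverse-CDF sampling from the fitted distribution truncated to the interval between min_value and the detection limit. The data and all variable names are illustrative.

```r
library(survival)

set.seed(1)
# Simulated data with three different detection limits (assumption)
true_conc <- rlnorm(100, meanlog = 1, sdlog = 0.8)
dl <- sample(c(1, 2, 5), 100, replace = TRUE)
censored <- as.integer(true_conc >= dl)         # 1 = detected, 0 = non-detect
value <- ifelse(censored == 1, true_conc, dl)   # non-detects carry their limit

# Left-censored lognormal fit; the intercept is meanlog, the scale is sdlog
fit <- survreg(Surv(value, censored, type = "left") ~ 1, dist = "lognormal")
mu <- unname(coef(fit))
sigma <- fit$scale

# Draw uniform probabilities between the CDF at min_value and the CDF at each
# detection limit, then map them back through the quantile function
min_value <- 0
nd <- censored == 0
p_lo <- plnorm(min_value, mu, sigma)
p_hi <- plnorm(value[nd], mu, sigma)
imputed <- qlnorm(runif(sum(nd), p_lo, p_hi), mu, sigma)

# Combine detected values with imputed non-detects
value_final <- value
value_final[nd] <- imputed
```

Because each draw is bounded by that observation's own limit, samples analysed with different methods keep their individual upper bounds.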
Convergence control: The control argument is passed directly to
survreg. Any convergence warnings raised during fitting
are silently captured and stored in the convergence_warnings attribute of
the result, rather than being printed to the console. This makes the function
safe for batch processing while still preserving a full diagnostic record. When
verbose = TRUE, captured warnings are also printed to the console.
Distributions that fail to fit entirely (hard errors) are silently skipped in
all cases.
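The capture-and-skip behaviour described above can be reproduced in base R. The helper below, fit_quietly, is hypothetical and not part of the package; it shows one way to collect warning messages without printing them while turning hard errors into NULL.

```r
# Hypothetical helper: evaluates `expr`, records warning messages silently,
# and returns NULL as the result if evaluation raises an error.
fit_quietly <- function(expr) {
  warnings_seen <- character(0)
  result <- tryCatch(
    withCallingHandlers(
      expr,
      warning = function(w) {
        # Store the message, then suppress the console output
        warnings_seen <<- c(warnings_seen, conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) NULL
  )
  list(result = result, warnings = warnings_seen)
}

# A computation that warns still returns its value
ok <- fit_quietly({
  warning("Ran out of iterations and did not converge")
  42
})

# A computation that errors is skipped
bad <- fit_quietly(stop("fit failed"))
```

Checking length(ok$warnings) == 0 is a compact test for clean convergence in batch runs.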
Note: This function should be applied to data containing only ONE parameter at a time. Different environmental parameters have different distributions and should not be modelled together.
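For a dataset holding several parameters, one possible pattern is to split the data first and impute each subset separately. The wrapper below, impute_by_parameter, is a hypothetical helper, and the stand-in impute_fun is only there so the sketch runs without the package; in practice impute_fun would call impute_nondetect.

```r
# Hypothetical helper: applies an imputation function to each parameter's rows.
# Returns a named list with one result per parameter.
impute_by_parameter <- function(dt, parameter_col, impute_fun) {
  groups <- split(dt, dt[[parameter_col]])
  lapply(groups, impute_fun)
}

# Toy data with two parameters (illustrative values only)
lab_data <- data.frame(
  parameter = rep(c("lead", "zinc"), times = c(3, 2)),
  value     = c(5, 5, 12, 40, 10),
  censored  = c(0, 0, 1, 1, 0)
)

# Stand-in for the real call, e.g.
#   function(d) impute_nondetect(d, value_col = "value", cens_col = "censored")
results <- impute_by_parameter(lab_data, "parameter", function(d) nrow(d))
```

Returning a list rather than one combined table keeps each parameter's result, and whatever attributes it carries, as a separate object.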
Examples
# Load example data
data(multi_censored_data)
# Basic imputation with default settings
set.seed(123)
result <- impute_nondetect(
dt = multi_censored_data,
value_col = "value",
cens_col = "censored",
verbose = FALSE
)
# View imputed values for non-detects
head(result[censored == 0, .(value, value_imputed, value_final)])
#> value value_imputed value_final
#> <num> <num> <num>
#> 1: 25 8.487192 8.487192
#> 2: 15 10.300671 10.300671
#> 3: 5 4.937120 4.937120
#> 4: 5 4.798058 4.798058
#> 5: 5 4.408386 4.408386
#> 6: 5 4.284067 4.284067
# Check best distribution selected
attr(result, "best_distribution")
#> [1] "lognormal"
# Check whether the best model converged cleanly
attr(result, "convergence_warnings") # an empty result (character(0), or NULL as printed here) means no warnings
#> NULL
# Increase max iterations for difficult datasets
result <- impute_nondetect(
dt = multi_censored_data,
value_col = "value",
cens_col = "censored",
control = survival::survreg.control(maxiter = 200)
)
# With parameter and unit validation
result <- impute_nondetect(
dt = multi_censored_data,
value_col = "value",
cens_col = "censored",
parameter_col = "parameter",
unit_col = "unit"
)
# For strictly positive values (avoiding exactly zero)
result <- impute_nondetect(
dt = multi_censored_data,
value_col = "value",
cens_col = "censored",
min_value = 1e-10,
verbose = FALSE
)
