A zero-inflated model effectively captures the nuances of data sets characterized by a preponderance of zeros. It operates by distinguishing between two distinct processes: 1) Determining whether the outcome is zero and 2) predicting values for non-zero outcomes. This dual approach is particularly suited to asking questions such as: “Are there cells present and, if so, how many?”
To handle data sets with an abundance of zeros, we employ models such as hurdle_poisson()
and Zero_inflated_poisson
both designed for scenarios where standard counting models such as Poisson or negative binomial models are inadequate (3).Generally speaking, a key difference between hurdle_poisson()
and Zero_inflated_poisson
is that the latter incorporates an additional probability component specifically for zeros, improving its ability to handle data sets where zeros are not only common but also significant. We will see the impact these features have on our modeling strategy using brms
.
Fitting a Poisson_Obstacle Model
Let's start using the hurdle_poisson()
distribution in our modeling scheme:
Hurdle_Fit1 <- brm(Cells ~ Hemisphere,
data = Svz_data,
family = hurdle_poisson(),
# seed for reproducibility purposes
seed = 8807,
control = list(adapt_delta = 0.99),
# this is to save the model in my laptop
file = "Models/2024-04-19_CountsZeroInflated/Hurdle_Fit1.rds",
file_refit = "never")# Add loo for model comparison
Hurdle_Fit1 <-
add_criterion(Hurdle_Fit1, c("loo", "waic", "bayes_R2"))
Let's look at the results using the standard summary function.
summary(Hurdle_Fit1)
Given this family distribution, estimates are shown on the logarithmic scale (mu = log). In practical terms, this means that the number of cells in the contralateral subventricular zone (SVZ) can be expressed as exp(1,11) = 3.03. Similarly, the ipsilateral hemisphere is estimated to have exp(1.07) = 2.91 times the number of cells. These results align well with our expectations and offer a consistent interpretation of the cellular distribution between the two hemispheres.
Furthermore, the hu
The parameter within “Family Specific Parameters” sheds light on the probability of observing zero cell counts. Indicates a 38% probability of nothing happening. This probability highlights the need for a zero-inflated model approach and justifies its use in our analysis.
To better visualize the implications of these findings, we can take advantage of the conditional_effects
function. This tool in brms
The package allows us to plot the estimated effects of different predictors on the response variable, providing a clear graphical representation of how the predictors influence the expected cell counts.
Hurdle_CE <-
conditional_effects(Hurdle_Fit1)Hurdle_CE <- plot(Hurdle_CE,
plot = FALSE)((1))
Hurdle_Com <- Hurdle_CE +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")
Hurdle_CE_hu <-
conditional_effects(Hurdle_Fit1, dpar = "hu")
Hurdle_CE_hu <- plot(Hurdle_CE_hu,
plot = FALSE)((1))
Hurdle_hu <- Hurdle_CE_hu +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")
Hurdle_Com | Hurdle_hu
These graphs paint a more logical picture than our first model. The graph on the left shows the two parts of the model (“mu” and “hu”). Furthermore, if this model is adequate, we should see more aligned predictions when using pp_check
:
pp_check(Hurdle_Fit1, ndraws = 100) +
labs(title = "Hurdle regression") +
theme_classic()
As expected, our model predictions have a lower bound at 0.
Modeling data dispersion.
Looking at the data presented in the graph on the right of Figure 5 reveals a discrepancy between our empirical findings and our theoretical understanding of the topic. Based on established knowledge, we expect a higher probability of non-zero cell counts in the subventricular zone (SVZ) of the ipsilateral hemisphere, especially after injury. This is because the ipsilateral SVZ normally becomes a center of cellular activity, with significant cell proliferation following injury. Our data, indicating non-zero prevalent counts in this region, support this biological expectation.
However, the current model predictions do not fully align with these insights. This divergence underscores the importance of incorporating scientific understanding into our statistical models. Relying solely on standard tests without contextual adaptation can lead to misleading conclusions.
To address this, we can refine our model by specifically adjusting the hu
parameter, which represents the probability of zero occurring. This allows us to more accurately reflect the expected biological activity in the SVZ of the ipsilateral hemisphere. Then we build a second obstacle model:
Hurdle_Mdl2 <- bf(Cells ~ Hemisphere,
hu ~ Hemisphere)Hurdle_Fit2 <- brm(
formula = Hurdle_Mdl2,
data = Svz_data,
family = hurdle_poisson(),
# seed for reproducibility purposes
seed = 8807,
control = list(adapt_delta = 0.99),
# this is to save the model in my laptop
file = "Models/2024-04-19_CountsZeroInflated/Hurdle_Fit2.rds",
file_refit = "never")
# Add loo for model comparison
Hurdle_Fit2 <-
add_criterion(Hurdle_Fit2, c("loo", "waic", "bayes_R2"))
Let's first see if the results graph aligns with our hypothesis:
Hurdle_CE <-
conditional_effects(Hurdle_Fit2)Hurdle_CE <- plot(Hurdle_CE,
plot = FALSE)((1))
Hurdle_Com <- Hurdle_CE +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")
Hurdle_CE_hu <-
conditional_effects(Hurdle_Fit2, dpar = "hu")
Hurdle_CE_hu <- plot(Hurdle_CE_hu,
plot = FALSE)((1))
Hurdle_hu <- Hurdle_CE_hu +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")
Hurdle_Com | Hurdle_hu
This revised modeling approach appears to be a substantial improvement. By specifically taking into account the higher probability of zero counts (~75%) in the contralateral hemisphere, the model now aligns more closely with both the observed data and our scientific knowledge. This adjustment not only reflects the expected lower cellular activity in this region, but also improves the precision of our estimates. With these changes, the model now offers a more nuanced interpretation of cellular dynamics after injury. Let's see the summary and the TRANSFORMATION FOR THE hu
parameters (don't look at the others) to display them on a probability scale using the logit2prob
function we created at the beginning.
logit2prob(fixef(Hurdle_Fit2))
Although the estimates for cell number are similar, the hu
The parameters (on the logit scale) tell us that the probability of seeing zeros in the contralateral hemisphere is:
Instead:
It represents a drastic reduction to approximately 0.23% of the probability of observing zero cell counts in the injured (ipsilateral) hemisphere. This is a notable change in our estimates.
Now, let us explore whether a zero_inflated_poisson()
The distribution family changes these ideas.