Don't think too much about "outliers", instead use a Student's t distribution | by Daniel Manrique-Castaño | March 2024

TO Student's t distribution It is nothing more than a Gaussian distribution with heavier tails. In other words, we can say that the Gaussian distribution is a special case of the Student's t distribution. The Gaussian distribution is defined by the mean (μ) and the standard deviation (σ). The Student's t distribution, on the other hand, adds an additional parameter, the degrees of freedom (df), which controls the “thickness” of the distribution. This parameter assigns greater probability to events that are further away from the average. This feature is particularly useful for small sample sizes, such as in biomedicine, where the assumption of normality is questionable. Note that as the degrees of freedom increase, the Student's t distribution approaches the Gaussian distribution. We can visualize this using density plots:

# Load necessary libraries
library(ggplot2)# Set seed for reproducibility
set.seed(123)
# Define the distributions
x <- seq(-4, 4, length.out = 200)
y_gaussian <- dnorm(x)
y_t3 <- dt(x, df = 3)
y_t10 <- dt(x, df = 10)
y_t30 <- dt(x, df = 30)
# Create a data frame for plotting
df <- data.frame(x, y_gaussian, y_t3, y_t10, y_t30)
# Plot the distributions
ggplot(df, aes(x)) +
geom_line(aes(y = y_gaussian, color = "Gaussian")) +
geom_line(aes(y = y_t3, color = "t, df=3")) +
geom_line(aes(y = y_t10, color = "t, df=10")) +
geom_line(aes(y = y_t30, color = "t, df=30")) +
labs(title = "Comparison of Gaussian and Student t-Distributions",
x = "Value",
y = "Density") +
scale_color_manual(values = c("Gaussian" = "blue", "t, df=3" = "red", "t, df=10" = "green", "t, df=30" = "purple")) +
theme_classic()

Figure 1: Comparison of Gaussian and Student t distributions with different degrees of freedom.

Note on Figure 1 that the hill around the mean becomes smaller as the degrees of freedom decrease as a result of the probability mass going toward the tails, which are thicker. This property is what gives the Student's t distribution a reduced sensitivity to outliers. For more details on this matter, you can consult this Blog.

We load the required libraries:

library(ggplot2)
library(brms)
library(ggdist)
library(easystats)
library(dplyr)
library(tibble)
library(ghibli)

So, let's skip the data simulations and get serious. We will be working with real data that I acquired from mice that performed the rotarod test.

First, we load the dataset into our environment and set the corresponding factor levels. The data set contains animal identifications, an outcome variable (genotype), an indicator for two different days on which the test was performed (day), and different trials for the same day. For this article, we modeled only one of the trials (Test3). We'll save the other tests for a future article on modeling variation.

As data management implies, our modeling strategy will be based on Genotype and Day as categorical predictors of the distribution of Trial3.

In biomedical science, categorical predictors or grouping factors are more common than continuous predictors. Scientists in this field like to divide their samples into groups or conditions and apply different treatments.

data <- read.csv("Data/Rotarod.csv")
data$Day <- factor(data$Day, levels = c("1", "2"))
data$Genotype <- factor(data$Genotype, levels = c("WT", "KO"))
head(data)

Let's have an initial view of the data using Rain cloud patches as shown Guilherme A. Franchi, PhD in this great blog post.

edv <- ggplot(data, aes(x = Day, y = Trial3, fill=Genotype)) +
scale_fill_ghibli_d("SpiritedMedium", direction = -1) +
geom_boxplot(width = 0.1,
outlier.color = "red") +
xlab('Day') +
ylab('Time (s)') +
ggtitle("Rorarod performance") +
theme_classic(base_size=18, base_family="serif")+
theme(text = element_text(size=18),
axis.text.x = element_text(angle=0, hjust=.1, vjust = 0.5, color = "black"),
axis.text.y = element_text(color = "black"),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position="bottom")+
scale_y_continuous(breaks = seq(0, 100, by=20), 
limits=c(0,100)) +
# Line below adds dot plots from {ggdist} package 
stat_dots(side = "left", 
justification = 1.12,
binwidth = 1.9) +
# Line below adds half-violin from {ggdist} package
stat_halfeye(adjust = .5, 
width = .6, 
justification = -.2, 
.width = 0, 
point_colour = NA)
edv

Figure 2: Exploratory data visualization.

Figure 2 It looks different from the original Guilherme A. Franchi, PhD because we are plotting two factors instead of one. However, the nature of the plot is the same. Pay attention to the red dots, these are what can be considered extreme observations that tilt the measures of central tendency (especially the mean) towards one direction. We also observe that the variances are different, so sigma modeling can also give better estimates. Our task now is to model the output using the brms package.

Don't think too much about “outliers”, instead use a Student's t distribution | by Daniel Manrique-Castaño | March 2024

Technical Terrence Team

S&P 500 hits new high, Nasdaq takes a step back

Leave a Reply Cancel reply

Recommended.

Crypto Analysts Bullish on Cardano and Bitcoin, Everlodge Will Rebound

Unity announces the release of Muse: a text-to-video gaming platform that lets you create natural language textures, sprites, and animations

Long-Term Bitcoin Price: How High Can BTC Go in 2025/2026?

5 UK stocks I'd put my full-year ISA into for passive income

Bitcoin predictions to watch as price reclaims $34,000

Categories

Important Links