Choosing between frequentist and Bayesian approaches has been the great statistical debate of the last century, with a recent rise in the adoption of Bayesian methods in the sciences.
What is the difference?
The philosophical difference is actually quite subtle: some propose that the great Bayesian critic, Fisher, was himself a Bayesian in some respects. While there are countless articles that delve into the theoretical differences, what are the practical benefits? What does Bayesian analysis offer the everyday data scientist that the plethora of widely adopted frequentist methods don't already provide? This article aims to give a practical introduction to the motivation, formulation, and application of Bayesian methods. Let's dive in.
While frequentists are concerned with describing the exact distributions of the data, the Bayesian view is more subjective. Subjectivity and statistics? Yes, the two actually go together.
Let's start with something simple, like flipping a coin. Suppose you flip a coin 10 times and get heads 7 times. What is the probability that it will come up heads?
P(heads) = 7/10 (0.7)?
Obviously, we are plagued by a low sample size here. However, the Bayesian view allows us to encode our beliefs directly, stating that if the coin is fair, the probability of it coming up heads or tails must be equal, that is, 1/2. Although in this example the choice seems quite obvious, the debate becomes far more nuanced for more complex, less obvious phenomena.
Still, this simple example is a powerful starting point, highlighting both the major benefit and the major flaw of Bayesian analysis:
Benefit: Dealing with a lack of data. Let's say you are modeling the spread of an infection in a country where data collection is sparse. Will you use the small amount of data to draw all your insights? Or would you rather fold patterns commonly observed in similar countries, i.e., prior beliefs, into your model? Although the choice is clear, it leads directly to the flaw below.
Flaw: the prior belief is difficult to formulate. For example, if the coin is actually unfair, it would be a mistake to assume that P(heads) = 0.5, and there is almost no way to find the true P(heads) without a long-running experiment. In that case, assuming P(heads) = 0.5 would actually be detrimental to finding the truth. However, every statistical model (frequentist or Bayesian) must make assumptions at some level, and the "statistical inferences" made in the human mind are actually very similar to Bayesian inference: we construct prior belief systems that influence our decisions in each new situation. Furthermore, formulating erroneous prior beliefs is usually not a death sentence from a modeling perspective either, as long as we can learn from enough data (more on this in later articles).
So what does all this look like mathematically? Bayes' rule lays the groundwork. Suppose we have a parameter θ that defines some model that could describe our data (for example, θ could represent the mean, the variance, the slope with respect to the covariate, etc.). Bayes' rule states that
P(θ = t | data) ∝ P(data | θ = t) * P(θ = t)
In simpler words,
- P(θ = t | data) represents the conditional probability that θ is equal to t, given our data (also known as the 'posterior').
- Instead, P(data | θ = t) represents the probability of observing our data if θ = t (also known as the 'likelihood').
- Finally, P(θ = t) is simply the probability that θ takes the value t (the infamous 'prior').
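To make this concrete, here is a quick numerical check for the coin example: a minimal sketch assuming a flat prior (an illustrative choice, not a requirement) and using scipy's binomial distribution for the likelihood, comparing just two candidate values of θ.

```python
from scipy.stats import binom

# Compare two candidate values of theta for 7 heads out of 10 flips
for t in (0.5, 0.7):
    likelihood = binom.pmf(7, 10, t)   # P(data | theta = t)
    prior = 1.0                        # flat prior: both candidates equally plausible a priori
    print(t, likelihood * prior)       # unnormalized posterior P(theta = t | data)

# prints roughly 0.117 for theta = 0.5 and 0.267 for theta = 0.7:
# under a flat prior, the data favor 0.7 by a factor of about 2.3
```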
So what is this mysterious t? It can take many possible values, depending on what θ means. In fact, you want to try many values and check the likelihood of your data under each one. This is a key step, and you really hope you have checked the best possible values of θ, i.e., those that cover the region of maximum probability of observing your data (the global maxima, for those who care).
And that's the crux of everything Bayesian inference does!
- Form a prior belief about the possible values of θ,
- Scale it by the likelihood at each value of θ, given the observed data, and
- Return the computed result, i.e., the posterior, which tells you the probability of each tested value of θ (a minimal sketch of these three steps is shown below).
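Here is a minimal sketch of these three steps for the coin example, using a simple grid of candidate θ values and a uniform prior (both are illustrative choices, not the only option):

```python
import numpy as np
from scipy.stats import binom

# 1) Prior belief over candidate values of theta (uniform here, for illustration)
theta_grid = np.linspace(0, 1, 101)
prior = np.ones_like(theta_grid)
prior /= prior.sum()

# 2) Likelihood of the observed data (7 heads in 10 flips) at each candidate theta
likelihood = binom.pmf(7, 10, theta_grid)

# 3) Scale the prior by the likelihood and normalize to get the posterior
posterior = prior * likelihood
posterior /= posterior.sum()

print(theta_grid[np.argmax(posterior)])  # the most probable theta (~0.7 with this flat prior)
```

Plotting the prior, likelihood, and posterior arrays from this sketch produces exactly the kind of picture described next.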
Graphically, this looks like:
This highlights the next big advantages of Bayesian statistics:
- We get an idea of the full shape of the distribution of θ (e.g., how wide the peak is, how heavy the tails are, etc.), which can allow for stronger inferences. Why? Simply because we can not only better understand but also quantify the uncertainty (compared to a traditional point estimate with a standard deviation).
- Since the process is iterative, we can constantly update our beliefs (estimates) as more data flows into our model, making it much easier to build fully online models (see the sketch below).
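As an illustration of that online updating, here is a minimal sketch using Beta-Binomial conjugacy for the coin example; the Beta(10, 10) prior (encoding "probably fair") and the batch counts are made-up numbers chosen purely for illustration.

```python
from scipy.stats import beta

# Prior belief: Beta(10, 10), i.e., the coin is probably close to fair
a, b = 10, 10

# Data arrives in batches of (heads, tails); each posterior becomes the next prior
batches = [(7, 3), (4, 6), (6, 4)]
for heads, tails in batches:
    a, b = a + heads, b + tails

posterior = beta(a, b)
print(posterior.mean())          # updated point estimate of P(heads)
print(posterior.interval(0.95))  # 95% credible interval: quantified uncertainty, not just a point
```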
Easy enough! But not entirely…
This process involves many calculations, since you have to compute the likelihood for every possible value of θ. Well, maybe this is easy if we assume that θ lies in a small range like (0, 1). We can use a brute-force grid method, testing values at discrete intervals (10 values at steps of 0.1, or 100 values at steps of 0.01, and so on… you get the idea) to map the entire space at the desired resolution, which is exactly what the grid sketch above does.
But what if the space is huge and God forbid there are additional parameters involved, like in any real-life modeling scenario?
Now we have to test not only the possible values of each parameter but also all their possible combinations, i.e., the solution space expands exponentially, making a grid search computationally infeasible. Fortunately, physicists have worked on the problem of efficient sampling, and today there are advanced algorithms (e.g., Metropolis-Hastings MCMC, variational inference) that can quickly explore high-dimensional parameter spaces and home in on the regions of high probability. You also don't need to code these complex algorithms yourself; probabilistic programming languages such as PyMC or STAN make the process very agile and intuitive.
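To give a feel for what these samplers do, here is a toy random-walk Metropolis-Hastings sketch for the coin example; real libraries like PyMC or STAN use far more sophisticated variants, so treat this purely as an illustration of the idea.

```python
import numpy as np
from scipy.stats import binom

def log_unnormalized_posterior(theta):
    # Flat prior on (0, 1) times the binomial likelihood of 7 heads in 10 flips
    if not 0 < theta < 1:
        return -np.inf
    return binom.logpmf(7, 10, theta)

rng = np.random.default_rng(0)
theta, samples = 0.5, []
for _ in range(5000):
    proposal = theta + rng.normal(0, 0.1)  # random-walk proposal around the current value
    # Accept the proposal with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta):
        theta = proposal
    samples.append(theta)

print(np.mean(samples[1000:]))  # discard burn-in; approximate posterior mean of theta
```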
STAN
STAN is my favorite, as it lets you interface with the more common data science languages like Python, R, Julia, MATLAB, etc., which helps with adoption. STAN is based on state-of-the-art Hamiltonian Monte Carlo sampling techniques that virtually guarantee convergence in a reasonable time for well-specified models. In my next article, I will cover how to get started with STAN for simple and not-so-simple regression models, with a complete Python code tutorial. I will also cover the complete Bayesian modeling workflow, which involves model specification, fitting, visualization, comparison, and interpretation.
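As a small taste of what that interface looks like from Python, here is a hypothetical sketch assuming the cmdstanpy package and a Stan program saved as coin.stan (a placeholder for a simple binomial model with data fields N and y); the full workflow is left for the next article.

```python
from cmdstanpy import CmdStanModel

# Compile the Stan program (coin.stan is a placeholder for a simple binomial model)
model = CmdStanModel(stan_file="coin.stan")

# Run Hamiltonian Monte Carlo sampling on the observed data
fit = model.sample(data={"N": 10, "y": 7})

# Posterior summaries and convergence diagnostics
print(fit.summary())
```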
Follow us and stay tuned!