A/B testing, also known as split testing, allows businesses to experiment with different versions of a web page or marketing asset to determine which performs best in terms of user engagement, click-through rates, and most importantly, conversion rates.
Conversion rates—the percentage of visitors who complete a desired action, such as making a purchase or signing up for a newsletter—are often the key metrics that determine the success of online campaigns. By carefully testing variations of a web page, businesses can make data-driven decisions that significantly improve these rates. Whether it’s modifying the color of a call-to-action button, changing the headline, or reorganizing the layout, A/B testing provides actionable insights that can transform the effectiveness of your online presence.
In this post, I will show how to perform Bayesian A/B testing to analyze conversion rates. We will then walk through a more complicated example in which we analyze how customer behavior changes after an intervention. Finally, we will discuss how this approach compares to a frequentist one and what the potential advantages and disadvantages are.
Let's say we want to improve our e-commerce website. To do this, we expose two groups of customers to two versions of our website where, for example, we change a button. We stop the experiment after a certain number of visitors have been exposed to each version. The result is a binary array per variant, with a 1 indicating a conversion and a 0 indicating no conversion.
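The original dataset isn't included here, but arrays with this structure are easy to write down; as a stand-in, the following simply encodes the visitor counts and conversions reported below (100 visitors per variant, with 5 and 3 conversions):

import numpy as np

# illustrative stand-in data: 100 visitors per variant,
# 5 conversions for A and 3 for B (matching the numbers reported below)
obsA = np.array([1] * 5 + [0] * 95)
obsB = np.array([1] * 3 + [0] * 97)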
We can summarize the data in a contingency table that shows us the (relative) frequencies.
# rows: variant A, variant B; columns: conversions, non-conversions
contingency = np.array(((obsA.sum(), (1 - obsA).sum()), (obsB.sum(), (1 - obsB).sum())))
In our case, we showed each variation to 100 customers. In the first variation, 5 customers (5%) converted; in the second, 3 (3%).
Frequentist setup
We will perform a statistical test to measure whether this result is significant or just due to chance. In this case, we will use a Chi2 (chi-squared) test that compares the observed frequencies with those that would be expected if there were no real difference between the two versions (the null hypothesis). For more information, see this blog post that goes into more detail.
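Applied to the contingency table from above, such a test could look roughly like this (a sketch using SciPy's chi2_contingency; the original post may use a slightly different call):

from scipy.stats import chi2_contingency

# compare the observed cell counts with those expected under the null hypothesis
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")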
In this case, the p-value does not fall below the significance threshold (e.g., 5%), and therefore we cannot reject the null hypothesis that the two variants have the same effect on the conversion rate.
However, the Chi2 test has some drawbacks that can lead to misleading conclusions. First, it is very sensitive to sample size. With a large sample, even tiny differences become statistically significant, while with a small sample the test may fail to detect real differences. This is especially problematic if the expected frequency in any cell of the contingency table is less than five; in that case another test (such as Fisher's exact test) should be used. Furthermore, the test says nothing about the magnitude or practical significance of the difference. Finally, when running multiple A/B tests simultaneously, the probability of finding at least one significant result purely by chance increases. The Chi2 test does not account for this multiple-comparisons problem, which can lead to false positives unless it is properly controlled (e.g., with a Bonferroni correction), as sketched below.
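As a quick illustration of such a correction (not something the original post implements), the Bonferroni adjustment effectively scales the significance threshold by the number of tests; statsmodels provides it alongside other methods. The p-values below are made up:

from statsmodels.stats.multitest import multipletests

# hypothetical p-values from several simultaneous A/B tests
p_values = [0.04, 0.20, 0.03, 0.60]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print(reject)      # which null hypotheses can still be rejected after correction
print(p_adjusted)  # p-values scaled by the number of tests (capped at 1)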
Another common mistake occurs when interpreting the results of the Chi2 test (or any statistical test). The p-value gives us the probability of observing data at least as extreme as ours, given that the null hypothesis is true. It does not make a statement about the distribution of conversion rates or their difference. And this is a major limitation: we cannot make statements like "the probability that the conversion rate of variant B is 2% is x%", because for that we would need the probability distribution of the conversion rate (conditional on the observed data).
These drawbacks highlight the importance of understanding the limitations of the Chi2 test and using it appropriately within its limits. When applying this test, it is essential to complement it with other statistical methods and contextual analysis to ensure accurate and meaningful conclusions.
Bayesian setup
After looking at the frequentist way of approaching A/B testing, let's look at the Bayesian version. Here, we model the data-generating process (and therefore the conversion rate) directly. That is, we specify a likelihood and a prior that together could produce the observed outcome. Think of this as specifying a "story" of how the data could have been generated.
In this case, I am using the Python package PyMC for modeling, since it has a clear and concise syntax. Within the with statement, we specify the distributions that we combine into a data-generating process.
import pymc as pm

with pm.Model() as ConversionModel:
    # priors: any conversion rate between 0 and 1 is equally likely
    pA = pm.Uniform('pA', 0, 1)
    pB = pm.Uniform('pB', 0, 1)
    # track the difference in conversion rates as a derived quantity
    delta = pm.Deterministic('delta', pA - pB)
    # likelihood: each visitor either converts (1) or does not (0)
    likA = pm.Bernoulli('obsA', pA, observed=obsA)
    likB = pm.Bernoulli('obsB', pB, observed=obsB)
    trace = pm.sample(2000)
We have pA and pB, which are the conversion probabilities in groups A and B respectively. With pm.Uniform we specify our prior belief about these parameters. This is where we could encode prior knowledge. In our case, we are neutral and allow any conversion rate between 0 and 1 to be equally likely.
PyMC then allows us to draw samples from the posterior distribution, which represents our updated belief about the parameters after seeing the data. We now have a full probability distribution for the conversion probabilities.
From these distributions, we can directly read off quantities of interest, such as credible intervals. This allows us to answer questions such as "What is the probability that the conversion rate lies between x% and y%?"
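For example, with the trace from above, such quantities could be computed roughly as follows (a sketch using ArviZ; the 94% interval is simply ArviZ's default):

import arviz as az

# credible interval for the difference in conversion rates (pA - pB)
print(az.hdi(trace, var_names=['delta']))

# posterior probability that variant A converts better than variant B
delta_samples = trace.posterior['delta'].values.flatten()
print(f"P(pA > pB) = {(delta_samples > 0).mean():.2f}")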
The Bayesian approach allows for much more flexibility, as we will see later. The interpretation of the results is also simpler and more intuitive than in the frequentist context.
Now we’ll look at a more complicated example of an A/B test. Suppose we expose subjects to some intervention at the beginning of the observation period. This would be the A/B part, where one group receives intervention A and the other receives intervention B. We then observe the interaction of the two groups with our platform over the next 100 days (perhaps something like the number of logins). What we might see is the following.
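The plot from the original post isn't reproduced here, but data of this shape could be simulated roughly as follows (the group sizes, day range, change points, and effect sizes are illustrative assumptions; the resulting ind_id, x, and obs arrays feed into the model below):

import numpy as np

rng = np.random.default_rng(0)
n_ind, n_days = 20, 100                       # 10 individuals per group, 100 observation days
ind_id = np.arange(n_ind)
x = np.tile(np.arange(n_days), (n_ind, 1))    # day index per individual, shape (n_ind, n_days)

true_switch = rng.integers(20, 80, size=n_ind)   # hypothetical change point per individual
mu1_true = rng.normal(5.0, 1.0, size=n_ind)      # interaction intensity before the change point
shift = np.where(ind_id < 10, 1.0, 4.0)          # group B (ids 10-19) reacts more strongly
mu2_true = mu1_true + shift                      # intensity after the change point
mu = np.where(x < true_switch[:, None], mu1_true[:, None], mu2_true[:, None])
obs = rng.normal(mu, 1.0).T                      # daily interactions, shape (n_days, n_ind)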
Now we want to know whether these two groups show a significant difference in their response to the intervention. How would we figure this out with a statistical test? Frankly, I don't know. Someone would have to come up with a test for exactly this scenario. The alternative is to return to the Bayesian setting, where we first specify a data-generating process. We assume that each individual is independent and that their interactions with the platform are normally distributed. Each individual has a change point at which their behavior changes. This change point occurs only once, but it can happen at any time during the observation period. Before the change point we assume a mean interaction intensity of mu1, and after it an intensity of mu2. The syntax might seem a bit complicated, especially if you've never used PyMC before; in that case, I'd recommend checking out their learning material.
with pm.Model(coords={
    'ind_id': ind_id,
}) as SwitchPointModel:
    # observation noise for each individual
    sigma = pm.HalfCauchy("sigma", beta=2, dims="ind_id")
    # draw a switchpoint from a uniform distribution for each individual
    switchpoint = pm.DiscreteUniform("switchpoint", lower=0, upper=100, dims="ind_id")
    # priors for the interaction intensity before (mu1) and after (mu2) the switchpoint
    mu1 = pm.HalfNormal("mu1", sigma=10, dims="ind_id")
    mu2 = pm.HalfNormal("mu2", sigma=10, dims="ind_id")
    diff = pm.Deterministic("diff", mu1 - mu2)
    # create a deterministic variable for the mean: mu1 before the switchpoint, mu2 after it
    intercept = pm.math.switch(switchpoint >= x.T, mu1, mu2)
    y = pm.Normal("y", mu=intercept, sigma=sigma, observed=obs)
    trace = pm.sample()
The model can then show us the distribution of each individual's change point location, as well as the distribution of the difference in behavior before and after it.
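With ArviZ, these distributions can be plotted directly from the trace; for example (picking one individual from each group to keep the figure small):

import arviz as az
import matplotlib.pyplot as plt

# posterior of the change point and the before/after difference for individuals 0 and 10
az.plot_posterior(trace, var_names=['switchpoint', 'diff'], coords={'ind_id': [0, 10]})
plt.show()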
We can take a closer look at these differences with a forest diagram.
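Such a plot could be produced along these lines (again a sketch with ArviZ):

import arviz as az
import matplotlib.pyplot as plt

# one interval per individual for the before/after difference in intensity
az.plot_forest(trace, var_names=['diff'], combined=True)
plt.show()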
We can clearly see that Group A (ids 0 to 9) and Group B (ids 10 to 19) behave differently, with Group B showing a much stronger response to the intervention.
Bayesian inference offers a lot of flexibility for modeling situations where we don't have much data or where we are concerned about model uncertainty. In addition, it forces us to make our assumptions explicit and to think them through. In simpler scenarios, frequentist statistical tests are often easier to use, but we need to be aware of the assumptions that come with them.
All the code used in this article can be found on my GitHub. Unless otherwise stated, all images are created by the author.