There are different methods to update the policy of a reinforcement learning agent at each iteration. A few weeks ago we started experimenting with replacing our current method with a Bayesian inference step. Some of the data workloads in our agent are written in SQL and run on GCP's BigQuery engine. We use this stack because it provides scalable compute, machine learning packages, and a simple SQL interface.
The Bayesian inference step we want to implement uses one of the alternative parameterizations of the beta distribution, which means we need to be able to work with a beta distribution in SQL. While working on this, I discovered that drawing random samples in SQL is a topic with very few well-documented examples. So I'm writing about it here…
BigQuery does not have a beta distribution. In fact, it cannot sample from any nontrivial distribution. So my first intuition was to take the definition of the beta distribution, write it in SQL, set the parameters using a CTE, draw a random number between 0 and 1, and evaluate the function at that point.
But it's 2024, so I asked ChatGPT how they would do it:
Me: “How do you draw random samples from a beta distribution in BigQuery?”
ChatGPT:
WITH beta_draws AS (
  SELECT
    (alpha - 1) / (alpha + beta - 2) AS a,
    (beta - 1) / (alpha + beta - 2) AS b
  FROM
    (SELECT 2 AS alpha, 5 AS beta) -- Set your desired alpha and beta values
)
SELECT
  (a * POW(RAND(), alpha - 1) * POW(1 - RAND(), beta - 1) + b) AS beta_draw
FROM
  beta_draws,
  UNNEST(GENERATE_ARRAY(1, 1000)) AS rand_num
Me thinking to myself: Right, so clearly that won't work.
Do you see the problem in the code? ChatGPT draws two different x values for the presumed beta distribution PDF (one per RAND() call). I fixed this, cleaned up the query a bit, and tried 1,000 draws. Here is the SQL code to do that:
WITH raw_data AS (
  SELECT
    'user_a' AS contact_id
    , 0.674 AS probability
    , 72 AS lifetime_messages_received
), parameters AS (
  SELECT
    contact_id
    , probability * lifetime_messages_received AS alpha
    , (1.0 - probability) * lifetime_messages_received AS beta
    , RAND() AS x
  FROM raw_data
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000)) AS draw_id
)
SELECT
  contact_id
  , ARRAY_AGG(POW(x, alpha - 1.0) * POW(1.0 - x, beta - 1.0)) AS beta_x
FROM parameters
GROUP BY contact_id
Thanks everyone, that's all! See you in the next post!
WRONG!
Let's take a reliable implementation of drawing from a beta distribution with the same parameters and compare the results. I used SciPy's beta.rvs() in Python, and here are two 100-bin histograms that let you compare the two drawn distributions.
from scipy.stats import beta

alpha_param = 0.674 * 72
beta_param = (1 - 0.674) * 72
scipy_beta_draw = beta.rvs(alpha_param, beta_param, size=1000)
Well, you don't need a magnifying glass to see that the distributions are different. I went back to the definition of the beta distribution and realized the discrepancy could be because the beta PDF also has a normalization constant, which depends on the gamma function, that I did not include in the calculation.
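To see how far off we are without squinting at histograms, here's a quick numeric check (my own illustrative sketch, not part of the original experiment): it compares real beta draws with what the BigQuery query actually computes, namely the unnormalized PDF evaluated at uniform random points.

```python
import numpy as np
from scipy.stats import beta, uniform

alpha_param = 0.674 * 72
beta_param = (1 - 0.674) * 72

# Proper draws from the beta distribution, for reference
scipy_beta_draw = beta.rvs(alpha_param, beta_param, size=1000)

# What the BigQuery query computes: the unnormalized beta PDF
# evaluated at uniformly random points -- not beta samples at all
x = uniform.rvs(size=1000)
pdf_values = x ** (alpha_param - 1) * (1 - x) ** (beta_param - 1)

# Real draws center on alpha / (alpha + beta) ~= 0.674, while the
# unnormalized PDF values are vanishingly small for these parameters
print(scipy_beta_draw.mean(), pdf_values.max())
```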
Problem: the gamma function has no closed-form expression, and BigQuery does not provide an implementation of it or anything close. So at this point I decided to switch to Python, a language I am more familiar with, to make my experimentation more efficient. The idea was that once I got it working in Python, I would be able to translate it to SQL. I would still need some way to approximate a gamma function, but one step at a time.
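As an aside, if a gamma approximation were ever needed directly in SQL, Stirling's formula is one candidate: it uses only sqrt, division, and pow, all of which BigQuery has. A rough sketch (my addition, not part of the original workflow):

```python
import math

def stirling_gamma(x: float) -> float:
    """Stirling's approximation to the gamma function.

    Uses only sqrt, division and pow, so the same expression could
    be written in BigQuery SQL. The relative error shrinks roughly
    as 1 / (12 * x), so it is already below 1% for x above ~9.
    """
    return math.sqrt(2 * math.pi / x) * (x / math.e) ** x

# For the alpha used in this post (~48.5) the approximation is
# within about 0.2% of the exact value
print(stirling_gamma(48.5), math.gamma(48.5))
```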
Let's implement a manual draw of a beta distribution in Python, but now with the correct constant using SciPy's gamma function:
import numpy as np
from scipy.special import gamma
from scipy.stats import uniform

alpha_param = 0.674 * 72
beta_param = (1 - 0.674) * 72
constant = gamma(alpha_param + beta_param) / (gamma(alpha_param) * gamma(beta_param))
scipy_manual_beta_draw = np.array([
    constant * pow(x, alpha_param - 1) * pow(1 - x, beta_param - 1)
    for x in uniform.rvs(size=1000)
])
Let's examine the distribution again using a histogram of 100 bins:
The first thing we notice is that the scale is now different, but the shape still looks like the one drawn in BigQuery.
… something is wrong… It's time to take a little walk to think
…
After a short walk:
What does it really mean to draw from a random distribution? What I have implemented so far is a random sampling of the beta probability density function (PDF) and it didn't work.
So I had to brush up on some statistics.
Here are a couple of good reviews on the topic:
In short, the bottom line is that drawing from a random variable actually means sampling the inverse cumulative distribution function (CDF), not the probability density function (PDF) as I had been doing until now.
Of course. My probability professor, who I just learned passed away from an illness in 2020, would have encouraged me to “brush up on the basics” at this point.
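Before tackling the beta case, it helps to see the principle on a distribution whose inverse CDF has a closed form, where inverse-transform sampling is just two lines. A sketch using the exponential distribution (my example, not from the reviews above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Inverse-transform sampling for an exponential distribution with
# rate lam: the CDF is F(x) = 1 - exp(-lam * x), so the inverse CDF
# (quantile function) is F_inv(q) = -ln(1 - q) / lam
lam = 2.0
q = rng.uniform(size=100_000)   # uniform draws in (0, 1)
samples = -np.log(1 - q) / lam  # push them through the inverse CDF

# The sample mean should be close to the true mean 1 / lam = 0.5
print(samples.mean())
```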
OK. Let's revise the Python code so that it samples from the inverse CDF (also called the quantile function) of our beta distribution, and compare the result with the distribution drawn using SciPy's beta.rvs():
import numpy as np
from scipy.special import gamma
from scipy.stats import uniform, beta

alpha_param = 0.674 * 72
beta_param = (1 - 0.674) * 72
n_draws = 1000
# Use SciPy RVS for comparison
scipy_beta_draw = beta.rvs(alpha_param, beta_param, size=n_draws)
# Manual beta draw with the help of the SciPy Gamma function
# We start with a discrete analogue of the Beta PDF we wish to draw from.
# This is just sampling from the PDF at fixed intervals but do check out
# this review for a more in-depth treatment of the subject:
# https://jsdajournal.springeropen.com/articles/10.1186/s40488-015-0028-6
# Set the resolution for generating the discrete PDF
n_samples = 1000
# The beta distribution is supported on the range (0, 1), so we set the
# pdf min and max parameters accordingly
pdf_min = 0.0
pdf_max = 1.0
x_span = np.linspace(pdf_min, pdf_max, n_samples)
constant = gamma(alpha_param + beta_param) / (gamma(alpha_param) * gamma(beta_param))
beta_pdf = np.array([
    constant * pow(x, alpha_param - 1) * pow(1 - x, beta_param - 1)
    for x in x_span
])
# Using the discrete Beta PDF, we now compute a discrete Beta CDF.
# To do that, we integrate the PDF: for each point x, we sum the PDF up to
# that point and multiply by the width of each sample.
freq = 1.0 / n_samples
beta_cdf = beta_pdf.cumsum() * freq
def inv(cdf, q):
    """Return the inverse CDF (quantile function) value for quantile q"""
    return x_span[np.argmin(cdf < q)]
# Finally, we can now draw n_draws from the discrete inverse of CDF, aka
# generate random samples from a beta distribution
manual_beta_draw = np.array([
    inv(beta_cdf, x)
    for x in uniform.rvs(size=n_draws)
])
*phew*, this looks much better:
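If you want more than a visual check, a Kolmogorov-Smirnov test can quantify how close the manual draws are to the true distribution. Here's an illustrative sketch that rebuilds the discrete inverse-CDF sampler, using SciPy's beta.pdf in place of the explicit gamma-function constant (the two are equivalent):

```python
import numpy as np
from scipy.stats import beta, kstest, uniform

alpha_param = 0.674 * 72
beta_param = (1 - 0.674) * 72
n_samples = 1000
n_draws = 1000

# Discrete PDF and CDF on a fixed grid; beta.pdf already includes
# the gamma-based normalization constant
x_span = np.linspace(0.0, 1.0, n_samples)
beta_cdf = np.cumsum(beta.pdf(x_span, alpha_param, beta_param)) / n_samples

# Scale the uniform draws to (0, max(cdf)) so discretization error
# at the upper end cannot push a draw off the grid
q = uniform.rvs(size=n_draws, random_state=42) * beta_cdf[-1]
manual_beta_draw = np.array([x_span[np.argmin(beta_cdf < v)] for v in q])

# Two-sided KS test against the exact beta CDF: a non-tiny p-value
# means the manual draws are indistinguishable from real ones here
result = kstest(manual_beta_draw, lambda v: beta.cdf(v, alpha_param, beta_param))
print(result.pvalue)
```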
Now that we've properly sampled a random variable, it's time to get back to SQL. For the sake of simplicity, and because BigQuery does not readily come with an implementation of a gamma function¹, I will use the logistic distribution (with parameters a=0 and b=1) instead of the beta.
-- The following 3 parameters need to be adjusted based on the support of the
-- PDF of the distribution you wish to draw from. These values are set for a
-- logistic distribution with a=0 and b=1
DECLARE pdf_min INT64 DEFAULT -10;
DECLARE pdf_max INT64 DEFAULT 10;
DECLARE n_samples INT64 DEFAULT 5000;
DECLARE sampling_step FLOAT64 DEFAULT (pdf_max - pdf_min) / n_samples;
-- The number of random draws you wish to perform
DECLARE n_draws INT64 DEFAULT 1000;
WITH pdf AS (
  -- The discrete sampling of the logistic distribution PDF
  SELECT
    x
    , exp(-x) / pow(1 + exp(-x), 2) AS y -- a=0, b=1
  FROM UNNEST(GENERATE_ARRAY(pdf_min, pdf_max, sampling_step)) AS x
), cdf AS (
  -- The discrete CDF
  SELECT
    x
    , SUM(y)
      OVER (
        ORDER BY x
      ) * (1.0 / n_samples) AS y
  FROM pdf
), random_draws AS (
  -- Random draws in the range of (0, max(cdf))
  SELECT
    RAND() * (SELECT MAX(y) FROM cdf) AS q
    , draw_id
  FROM UNNEST(GENERATE_ARRAY(1, n_draws)) AS draw_id
)
-- Calculate the inverse CDF per draw using the quantile function by generating
-- an array of the discrete support of the distribution and returning the value
-- at the last index before the randomly generated number becomes larger than the CDF
SELECT
  ARRAY_AGG(x ORDER BY x)[OFFSET(SUM(CAST(y < q AS INT64)))] AS x
FROM random_draws
JOIN cdf
  ON TRUE
GROUP BY draw_id;
Let us now compare the distributions of the three sampling methods:
- SciPy's logistic.rvs()
- Manually sampling the logistic distribution PDF in Python and drawing random samples from it as per Step 2 above
- Doing the same in SQL
This seems like a success to me!
The SQL code above uses the logistic distribution, but it should work for any distribution for which you can get a discrete representation of the PDF by sampling it at fixed intervals.
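To close the loop, the whole recipe fits in one small general-purpose Python function. This is my own generalization of the approach (the name `draw_from_pdf` is made up), but it mirrors the SQL above step for step:

```python
import numpy as np

def draw_from_pdf(pdf, pdf_min, pdf_max, n_samples=5000, n_draws=1000, seed=0):
    """Draw from an arbitrary 1-D distribution given only its PDF.

    Mirrors the SQL recipe: sample the PDF on a fixed grid, integrate
    it into a discrete CDF, then invert the CDF at uniform random
    points. `pdf` is any callable, normalized or not.
    """
    rng = np.random.default_rng(seed)
    x_span = np.linspace(pdf_min, pdf_max, n_samples)
    step = (pdf_max - pdf_min) / n_samples
    cdf = np.cumsum(pdf(x_span)) * step
    # Scale the uniform draws to (0, max(cdf)) so an unnormalized
    # PDF (or truncated support) still works
    q = rng.uniform(0, cdf[-1], size=n_draws)
    # searchsorted finds the first grid index where cdf >= q
    return x_span[np.searchsorted(cdf, q)]

# Logistic distribution with a=0 and b=1, as in the SQL example
draws = draw_from_pdf(lambda x: np.exp(-x) / (1 + np.exp(-x)) ** 2, -10, 10)
```

The logistic mean should come out near 0 and the standard deviation near pi / sqrt(3) ≈ 1.81, which is an easy sanity check on any new distribution you plug in.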