Implementation of Inverse Transform Sampling, Rejection Sampling, and Importance Sampling in Python
In this post, we will discuss how to sample from a probability distribution. This is a common need: in an ML setting, it most often arises when running inference in probabilistic models. However, because the distributions involved are often complex, exact inference is frequently intractable. Therefore, the main focus of this post is to present approximate methods for this task, in particular numerical sampling, known as Monte Carlo methods.
Still, for introductory purposes, we will first introduce inverse transform sampling, which allows exact sampling from arbitrary (but tractable) distributions. Then we will shift our focus to approximate methods, allowing sampling or moment estimation for (almost) arbitrary distributions: we start with rejection sampling and then move on to importance sampling.
Please note that this post is the first in a series acquainting readers with [1], and this particular post covers parts of Chapter 11: Sampling Methods.
Inverse transform sampling
The inverse transform sampling method allows us to sample from any distribution for which we know how to calculate the inverse of the cumulative distribution function (CDF). It involves sampling y from U[0, 1] (the uniform distribution) and then computing the desired x via x = F⁻¹(y).
That is, we generate a random variable y ~ U[0, 1] and then assert that x = F⁻¹(y) — the inverse of the cumulative distribution function (denoted by F) applied to y — follows our desired target distribution (more specifically, its probability density function, denoted by f).
This is the case if and only if the CDFs are the same, that is: P(F⁻¹(y) ≤ x) = F(x). To prove this, let's look at the left-hand side and apply F to both sides of the inequality inside the probability (valid since F is monotonically increasing): P(F⁻¹(y) ≤ x) = P(y ≤ F(x)). Since for the uniform distribution over [0, 1] we have P(y ≤ z) = z, we get, as desired: P(F⁻¹(y) ≤ x) = F(x). (Proof adapted from COMP 480 / 580.)
Python implementation
Let's try this for the exponential distribution, defined by the pdf f(x) = λe^(−λx) for x ≥ 0. Taking the CDF and inverting it, we get x = F⁻¹(y) = −ln(1 − y) / λ.
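Spelled out, the inversion is a short worked calculation (λ is the rate parameter, called RATE in the code below):

F(x) = \int_0^x \lambda e^{-\lambda t}\, dt = 1 - e^{-\lambda x},
\qquad
y = F(x) = 1 - e^{-\lambda x}
\;\Longleftrightarrow\;
x = F^{-1}(y) = -\frac{1}{\lambda}\ln(1 - y).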
Thus giving us our desired sampling formula! Let’s convert this to Python.
First we define a function exp_distr evaluating the pdf at the given x values, and then a function exp_distr_sampled sampling from the exponential distribution via the inverse transform method and the formula derived above. Finally, we plot the true pdf and a histogram of our sampled values:
import numpy as np
import matplotlib.pyplot as plt

NUM_SAMPLES = 10000
RATE = 1.5


def exp_distr(x: np.ndarray, rate: float) -> np.ndarray:
    # pdf of the exponential distribution: f(x) = rate * exp(-rate * x).
    return rate * np.exp(-x * rate)


def exp_distr_sampled(num_samples: int, rate: float) -> np.ndarray:
    # Inverse transform sampling: draw y ~ U[0, 1] and apply the inverse CDF.
    x = np.random.uniform(0, 1, num_samples)
    return -1 / rate * np.log(1 - x)


x = np.linspace(0, 5, NUM_SAMPLES)
y_true = exp_distr(x, RATE)
y_sampled = exp_distr_sampled(NUM_SAMPLES, RATE)

plt.plot(x, y_true, "r-")
plt.hist(y_sampled, bins=30, density=True)
plt.show()
Running this code produces something like the following, which shows that our sampling does indeed produce correct results:
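In addition to the visual comparison, a quick numerical sanity check (a minimal sketch reusing y_sampled and RATE from the code above) is to compare the sample mean with the theoretical mean 1/λ of the exponential distribution:

# The empirical mean should be close to 1 / RATE ≈ 0.667.
print(f"Sample mean: {np.mean(y_sampled):.3f} vs theoretical mean: {1 / RATE:.3f}")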
When applicable, inverse transform sampling works great. However, as mentioned, it requires that we are able to invert the CDF. This is only possible for rather simple distributions; even for a normal distribution it is significantly more involved, although possible. And when inverting the CDF is too difficult or impossible, we have to resort to other methods, which we discuss next.
We start with rejection sampling, then introduce importance sampling, and conclude this article with a discussion of the limitations of these methods and an outlook.
Note that both methods still require us to be able to evaluate p(x) for a given x.
Rejection sampling
Rejection sampling introduces a simpler proposal distribution q, from which we can sample and which we use to approximate p. It follows the idea that if q(x) ≥ p(x) for all x, we can draw samples from q but reject all samples that lie above p — eventually resulting in exactly the distribution p.
As mentioned above, it is possible to apply inverse transform sampling to normal distributions, but it is cumbersome; therefore, for demonstration purposes, let us try to approximate a normal distribution (p) here, using a uniform distribution as the proposal distribution (q). Let's visualize the above intuition in this scenario:
Now, with rejection sampling, we would draw samples from the uniform distribution q and reject all samples above p — thus, in the limit, obtaining points that exactly fill the area under p, i.e., samples from this distribution.
Formally, we first select our proposal distribution q and a scale coefficient k such that kq(x) ≥ p(x) for all x. Then, we sample a value x_0 from q. Next, we generate a value u_0 from the uniform distribution over [0, kq(x_0)]. If u_0 > p(x_0), we reject the sample; otherwise, we accept it.
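Written as a generic, non-vectorized sketch — where p, q, sample_q and k are hypothetical placeholders for the target density, the proposal density, a sampler for the proposal, and the scale factor — a single accept/reject step looks like this (the concrete, vectorized implementation for our normal-vs-uniform scenario follows below):

import numpy as np

def rejection_sample_once(p, q, sample_q, k) -> float:
    # Repeat until a candidate is accepted.
    while True:
        x0 = sample_q()                       # draw a candidate x_0 from the proposal q
        u0 = np.random.uniform(0, k * q(x0))  # draw u_0 ~ U[0, k*q(x_0)]
        if u0 <= p(x0):                       # accept if u_0 lies below p(x_0)
            return x0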
Let’s implement this in Python:
from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np

NUM_SAMPLES = 10000
MEAN = 0
STANDARD_DEVIATION = 1
MIN_X = -4
MAX_X = 4


def uniform_distr(low: float, high: float) -> float:
    # Density of the uniform distribution over [low, high].
    return 1 / (high - low)


def normal_dist(
    x: np.ndarray, mean: float, standard_deviation: float
) -> np.ndarray:
    return (
        1
        / (standard_deviation * np.sqrt(2 * np.pi))
        * np.exp(-0.5 * ((x - mean) / standard_deviation) ** 2)
    )


x = np.linspace(MIN_X, MAX_X, NUM_SAMPLES)
y_true = normal_dist(x, MEAN, STANDARD_DEVIATION)


def rejection_sampling(
    num_samples: int, min_x: float, max_x: float
) -> Tuple[np.ndarray, float]:
    x = np.random.uniform(min_x, max_x, num_samples)
    # uniform_distr(-4, 4) = 0.125 -> we need to scale this to ~0.5, i.e.
    # select k = 4.
    k = 4
    u = np.random.uniform(np.zeros_like(x), uniform_distr(min_x, max_x) * k)
    (idx,) = np.where(u < normal_dist(x, MEAN, STANDARD_DEVIATION))
    return x[idx], len(idx) / num_samples


y_sampled, acceptance_prob = rejection_sampling(NUM_SAMPLES * 10, MIN_X, MAX_X)
print(f"Acceptance probability: {acceptance_prob}")

plt.plot(x, y_true, "r-")
plt.hist(y_sampled, bins=30, density=True)
plt.show()
Giving the following result:
Acceptance probability: 0.24774
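This matches what we should expect: the acceptance probability is the ratio of the area under p to the area under kq, which here is roughly 1/k = 0.25. A minimal back-of-the-envelope check (reusing uniform_distr, MIN_X and MAX_X from the code above):

k = 4
area_kq = k * uniform_distr(MIN_X, MAX_X) * (MAX_X - MIN_X)  # area under k*q over [-4, 4] = k = 4
area_p = 1.0  # p is a normalized density lying almost entirely within [-4, 4]
print(f"Expected acceptance probability: {area_p / area_kq}")  # -> 0.25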
Importance sampling
Often, we are only interested in expectations, which is where importance sampling comes in: it is a technique for approximating expectations (or other moments) of a complex distribution p, again using a proposal distribution q.
The expectation of a (continuous) random variable X with density p is defined as E[X] = ∫ x p(x) dx.
We then use a little math trick to rephrase this as E[X] = ∫ x (p(x) / q(x)) q(x) dx. Now what does this formula say? It is the expectation of x p(x) / q(x) under q — that is, to calculate E[X], we can evaluate this term using samples drawn from q! The coefficients p(x) / q(x) are called importance weights.
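Written out in one line, including the Monte Carlo estimate that the code below computes via np.mean:

\mathbb{E}_p[X] = \int x\, p(x)\, dx = \int x\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_q\!\left[X \frac{p(X)}{q(X)}\right] \approx \frac{1}{N} \sum_{i=1}^{N} x_i \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q.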
For demonstration purposes, let us again set p to be a normal distribution, and use another normal distribution for q:
import numpy as np

NUM_SAMPLES = 10000
MEAN_P = 3
MEAN_Q = 2


def normal_dist(
    x: np.ndarray, mean: float, standard_deviation: float
) -> np.ndarray:
    return (
        1
        / (standard_deviation * np.sqrt(2 * np.pi))
        * np.exp(-0.5 * ((x - mean) / standard_deviation) ** 2)
    )


q_sampled = np.random.normal(loc=MEAN_Q, size=NUM_SAMPLES)
# Importance-weight the samples: x * p(x) / q(x).
p_sampled = (
    q_sampled
    * normal_dist(q_sampled, MEAN_P, 1)
    / normal_dist(q_sampled, MEAN_Q, 1)
)
print(
    f"Resulting expectation when sampling from q {np.mean(p_sampled)} vs true expectation {MEAN_P}"
)
In the code, we set p to be a normal distribution with mean 3. Then, we draw NUM_SAMPLES samples from the proposal distribution q, which is a normal distribution with mean 2, and use the reformulation above to calculate the expectation of X under p, giving roughly the correct result (~3 vs 3).
Let us end this section with a discussion of variance: the variance of the resulting weighted samples is Var_q[x p(x) / q(x)] = E_q[(x p(x) / q(x))²] − E_p[X]².
For our case, this means that the variance increases the more p and q differ from each other, i.e., the less they agree. To demonstrate this, comparing the variances for MEAN_Q = 3.2 and MEAN_Q = 5, we get ~0.20 and ~91.71, respectively.
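A minimal sketch to reproduce this comparison (reusing normal_dist, MEAN_P and NUM_SAMPLES from the code above; the exact numbers vary between runs, since the estimate for a distant proposal is itself very noisy):

for mean_q in (3.2, 5):
    q_sampled = np.random.normal(loc=mean_q, size=NUM_SAMPLES)
    weighted = (
        q_sampled
        * normal_dist(q_sampled, MEAN_P, 1)
        / normal_dist(q_sampled, mean_q, 1)
    )
    print(f"MEAN_Q = {mean_q}: variance of the weighted samples ~ {np.var(weighted):.2f}")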
However, note that importance sampling is also commonly used as a variance reduction technique when q is chosen wisely — but this could be a topic for a future post.
In this post, we introduced three sampling methods: inverse transform sampling, rejection sampling, and importance sampling.
Inverse transform sampling can be used for relatively simple distributions, for which we know how to invert the CDF.
For more complex distributions, we have to resort to rejection or importance sampling. Still, for both we need to be able to evaluate the pdf of the distribution in question. In addition, there are other drawbacks: rejection sampling is wasteful when we cannot "box" p tightly with kq — which becomes especially tricky in higher dimensions. Similarly, for importance sampling it is difficult, especially in higher dimensions, to find good proposal distributions q yielding appropriate importance weights.
If any of the drawbacks mentioned above is too severe, we have to resort to more powerful methods, for example from the family of Markov chain Monte Carlo (MCMC) methods. These place much less stringent requirements on the distributions we want to approximate and are less constrained, for example, in high-dimensional spaces.
This ends this introduction to sampling methods. Please note that all images, unless otherwise stated, are by the author. I hope you enjoyed this post, thanks for reading!
References:
[1] Bishop, Christopher M., “Pattern Recognition and Machine Learning”, 2006, https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf