In this new post, we will try to understand how the multinomial naive Bayes classifier works and provide working examples using Python and scikit-learn.
What we will see:
- What the multinomial distribution is. Unlike Gaussian naive Bayes classifiers, which assume the features follow a Gaussian distribution, multinomial naive Bayes classifiers are based on a multinomial distribution.
- How the general approach to building such classifiers rests on Bayes' theorem, along with the naive assumption that the input features are independent of each other given the target class.
- How to “fit” a multinomial classifier by estimating the multinomial probabilities for each class, using the smoothing trick to handle features that never appear in a class.
- How the class probabilities of a new sample are computed, using the log-space trick to avoid numerical underflow.
All images by author.
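To give a sense of where we are heading, here is a minimal sketch of what the end result looks like with scikit-learn's `MultinomialNB`. The feature matrix, labels, and class names below are made up purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count data: each row is a sample, each column a feature count
# (e.g. word counts in a document); the two class labels are invented.
X = np.array([
    [2, 1, 0, 3],
    [0, 4, 1, 0],
    [3, 0, 0, 2],
    [0, 3, 2, 1],
])
y = np.array(["sports", "tech", "sports", "tech"])

clf = MultinomialNB()  # alpha=1.0 by default (Laplace smoothing)
clf.fit(X, y)

new_sample = np.array([[1, 0, 0, 2]])
print(clf.predict(new_sample))        # predicted class
print(clf.predict_proba(new_sample))  # per-class probabilities
```

We will unpack what happens inside `fit` and `predict_proba` step by step in the rest of the post.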
If you are already familiar with the multinomial distribution, you can skip to the next part.
The first important step in understanding the multinomial naive Bayes classifier is to understand what the multinomial distribution is.
In simple words, it describes the probabilities of the counts obtained when an experiment with a finite number of possible outcomes is repeated N times: for example, rolling a 6-faced die, say 10 times, and counting the number of times each face appears. Another example is counting the number of occurrences of each vocabulary word in a text.
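As a quick illustration of this idea, here is a small sketch that draws one sample from a multinomial distribution with NumPy: a fair 6-faced die rolled 10 times, returning the count for each face (the seed and numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# One draw from a multinomial distribution: N=10 trials (die rolls),
# 6 equally likely outcomes (the faces of a fair die).
counts = rng.multinomial(n=10, pvals=[1 / 6] * 6)

print(counts)        # e.g. [2 1 3 0 2 2] -- number of times each face appeared
print(counts.sum())  # always 10, the total number of rolls
```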
You can also view the multinomial distribution as an extension of the binomial distribution: instead of tossing a coin with 2 possible outcomes (binomial), you roll a die with 6 possible outcomes (multinomial). As with the binomial distribution, the probabilities of all possible outcomes must add up to 1. Then we could have: