CLASSIFICATION ALGORITHM
Unlike baseline approaches such as dummy classifiers or the similarity-based reasoning of KNN, Naive Bayes leverages probability theory: it combines the individual probabilities of each “cue” (or feature) into a final prediction. This simple yet powerful method has proven invaluable in many machine learning applications.
Naive Bayes is a machine learning algorithm that uses probability to classify data. It is based on Bayes' Theorem, a formula for computing conditional probabilities. The “naive” part refers to its key assumption: it treats all features as independent of each other, even when they might not be in reality. This simplification, while often unrealistic, greatly reduces computational complexity and works well in many practical scenarios.
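Concretely, Bayes' Theorem says that the probability of a class given the observed features is proportional to the class's prior probability times the likelihood of those features under that class; the independence assumption lets us write that likelihood as a simple product over features. A minimal sketch of this "multiply the cues" idea, using made-up numbers purely for illustration:
# Made-up numbers, purely to illustrate how independent cues multiply into one score
p_class = 0.6                        # P(class), the prior
p_features_given_class = [0.7, 0.4]  # P(each feature value | class), assumed independent

score = p_class
for p in p_features_given_class:
    score *= p                       # naive independence: likelihoods simply multiply
print(score)                         # unnormalized score; compare across classes and pick the largest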
There are three main types of Naive Bayes classifiers. The key difference between these types lies in the assumption they make about the feature distribution:
- Bernoulli Naive Bayes: Suited for binary/boolean features. It assumes each feature is a binary-valued (0/1) variable.
- Multinomial Naive Bayes: Generally used for discrete counts. It is often used in text classification, where features may be word counts.
- Gaussian Naive Bayes: Assumes that continuous features follow a normal distribution.
A good place to start is the simplest of the three, Bernoulli Naive Bayes. The “Bernoulli” in its name comes from the assumption that each feature takes a binary value.
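For orientation, each of the three variants above has its own class in scikit-learn. The constructors below are the standard ones; which to use depends on how your features are distributed:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

bernoulli_nb = BernoulliNB()      # binary (0/1) features
multinomial_nb = MultinomialNB()  # discrete count features, e.g. word counts
gaussian_nb = GaussianNB()        # continuous features assumed to be normally distributed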
In this article, we will use an artificial golf dataset (inspired by [1]) as an example. The task is to predict whether a person will play golf based on the weather conditions.
# IMPORTING DATASET #
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

dataset_dict = {
'Outlook': ('sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'),
'Temperature': (85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0),
'Humidity': (85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0),
'Wind': (False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False),
'Play': ('No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
}
df = pd.DataFrame(dataset_dict)
# ONE-HOT ENCODE 'Outlook' COLUMN
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
# CONVERT 'Wind' (bool) AND 'Play' (Yes/No) COLUMNS TO BINARY INDICATORS
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
# Set feature matrix x and target vector y
x, y = df.drop(columns='Play'), df['Play']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.5, shuffle=False)
print(pd.concat([X_train, y_train], axis=1), end='\n\n')
print(pd.concat([X_test, y_test], axis=1))
We will adapt it slightly for Bernoulli Naive Bayes by converting our features to binary.
# One-hot encode the categorized columns and drop them after, but do it separately for training and test sets
# Define categories for 'Temperature' and 'Humidity' for the training set
X_train['Temperature'] = pd.cut(X_train['Temperature'], bins=[0, 80, 100], labels=['Warm', 'Hot'])
X_train['Humidity'] = pd.cut(X_train['Humidity'], bins=[0, 75, 100], labels=['Dry', 'Humid'])

# Similarly, define categories for the test set
X_test['Temperature'] = pd.cut(X_test['Temperature'], bins=[0, 80, 100], labels=['Warm', 'Hot'])
X_test['Humidity'] = pd.cut(X_test['Humidity'], bins=[0, 75, 100], labels=['Dry', 'Humid'])

# One-hot encode the categorized columns
one_hot_columns_train = pd.get_dummies(X_train[['Temperature', 'Humidity']], drop_first=True, dtype=int)
one_hot_columns_test = pd.get_dummies(X_test[['Temperature', 'Humidity']], drop_first=True, dtype=int)

# Drop the categorized columns from training and test sets
X_train = X_train.drop(['Temperature', 'Humidity'], axis=1)
X_test = X_test.drop(['Temperature', 'Humidity'], axis=1)

# Concatenate the one-hot encoded columns with the original DataFrames
X_train = pd.concat([one_hot_columns_train, X_train], axis=1)
X_test = pd.concat([one_hot_columns_test, X_test], axis=1)

print(pd.concat([X_train, y_train], axis=1), '\n')
print(pd.concat([X_test, y_test], axis=1))
Bernoulli Naive Bayes operates on data where every feature is either 0 or 1. In outline, the algorithm works as follows (a compact from-scratch sketch follows this list):
- Calculate the probability of each class in the training data.
- For each feature and class, calculate the probability that the feature is 1 and 0 given the class.
- For a new instance: For each class, multiply its probability by the probability of each feature value (0 or 1) for that class.
- Predict the class with the highest resulting probability.
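To make these steps concrete, here is a minimal from-scratch sketch of the same procedure. The helper below is hypothetical (not part of the article's pipeline) and omits smoothing for brevity:
# A minimal, unsmoothed sketch of Bernoulli Naive Bayes prediction (illustrative only)
def bernoulli_nb_predict(X_train, y_train, new_instance):
    scores = {}
    for c in y_train.unique():
        X_c = X_train[y_train == c]
        score = len(X_c) / len(X_train)                 # P(class)
        for feature, value in new_instance.items():     # multiply per-feature likelihoods
            p_one = X_c[feature].mean()                 # P(feature = 1 | class)
            score *= p_one if value == 1 else 1 - p_one
        scores[c] = score
    return max(scores, key=scores.get)                  # class with the highest score
For example, calling bernoulli_nb_predict(X_train, y_train, X_test.iloc[0].to_dict()) would classify the first test row.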
The training process for Bernoulli Naive Bayes involves calculating probabilities from the training data:
1. Class probability calculation: For each class, calculate its probability: (Number of instances in this class) / (Total number of instances)
from fractions import Fraction

def calc_target_prob(attr):
    total_counts = attr.value_counts().sum()
    prob_series = attr.value_counts().apply(lambda x: Fraction(x, total_counts).limit_denominator())
    return prob_series
print(calc_target_prob(y_train))
2. Feature probability calculation: For each feature and each class, calculate:
- (Number of instances where the feature is 0 in this class) / (Number of instances in this class)
- (Number of instances where the feature is 1 in this class) / (Number of instances in this class)
from fractions import Fraction

def sort_attr_label(attr, lbl):
    return (pd.concat([attr, lbl], axis=1)
            .sort_values([attr.name, lbl.name])
            .reset_index()
            .rename(columns={'index': 'ID'})
            .set_index('ID'))

def calc_feature_prob(attr, lbl):
    total_classes = lbl.value_counts()
    counts = pd.crosstab(attr, lbl)
    prob_df = counts.apply(lambda x: [Fraction(c, total_classes[x.name]).limit_denominator() for c in x])
    return prob_df
print(sort_attr_label(y_train, X_train['sunny']))
print(calc_feature_prob(X_train['sunny'], y_train))
for col in X_train.columns:
    print(calc_feature_prob(X_train[col], y_train), "\n")
3. Smoothing (optional): Add a small value (usually 1) to the numerator and denominator of each probability calculation to avoid zero probabilities.
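To see what smoothing does, consider a feature value that never appears together with some class in the training data. Without smoothing its likelihood is 0, which wipes out the entire product for that class; with Laplace smoothing (alpha = 1) the Bernoulli likelihood becomes (count + 1) / (class count + 2). A small hypothetical illustration:
# Hypothetical illustration of Laplace smoothing for one Bernoulli feature likelihood
from fractions import Fraction

def smoothed_prob(feature_count, class_count, alpha=1):
    # P(feature = 1 | class) with alpha added to the count and 2*alpha to the class total
    return Fraction(feature_count + alpha, class_count + 2 * alpha)

print(smoothed_prob(0, 5))  # 1/7 instead of 0, so the product of probabilities is not zeroed out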
# In sklearn, all of the steps above are summarized in the 'fit' method:
from sklearn.naive_bayes import BernoulliNB
nb_clf = BernoulliNB(alpha=1)
nb_clf.fit(X_train, y_train)
4. Store results: Save all calculated probabilities for use during classification.
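In scikit-learn, these stored probabilities are available (in log form) as fitted attributes of the model:
# The fitted model keeps the (log) probabilities it reuses at prediction time
print(nb_clf.class_log_prior_)   # log P(class) for each class
print(nb_clf.feature_log_prob_)  # log P(feature = 1 | class) for each class and feature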
Given a new instance whose features are each 0 or 1, prediction proceeds as follows (a manual sketch of the scoring appears after the list):
1. Collection of probabilities: For each possible class:
- Start with the probability of this class occurring (class probability).
- For each feature in the new instance, collect the probability that this feature is 0/1 for this class.
2. Score calculation and prediction: For each class:
- Multiply all the collected probabilities together.
- The result is the score for this class.
- The class with the highest score is the prediction.
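The snippet below sketches this scoring by hand for the first test instance using the fitted log-probabilities, and compares the result with scikit-learn's predict_proba. It is an illustrative reconstruction, not part of the original pipeline:
import numpy as np

# Manually score the first test instance: log P(class) + sum of log P(feature value | class)
x_new = X_test.iloc[0].values
log_p1 = nb_clf.feature_log_prob_          # log P(feature = 1 | class)
log_p0 = np.log(1 - np.exp(log_p1))        # log P(feature = 0 | class)
log_scores = nb_clf.class_log_prior_ + (x_new * log_p1 + (1 - x_new) * log_p0).sum(axis=1)

# Normalize and compare with sklearn's probabilities for the same instance
print(np.exp(log_scores) / np.exp(log_scores).sum())
print(nb_clf.predict_proba(X_test.iloc[[0]]))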
y_pred = nb_clf.predict(X_test)
print(y_pred)
# Evaluate the classifier
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
Bernoulli Naive Bayes has a few important parameters (a constructor call showing all three follows this list):
- Alpha (α): This is the smoothing parameter. It adds a small count to each feature to avoid zero probabilities. The default value is usually 1.0 (Laplace smoothing), as shown above.
- Binarize: If your features are not already binary, this threshold converts them. Any value above the threshold becomes 1; any value at or below it becomes 0.
- Fit prior: Whether to learn class prior probabilities from the data or assume uniform (50/50) priors.
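For reference, all three map directly onto the constructor arguments of scikit-learn's BernoulliNB (the threshold of 0.5 below is just an illustrative choice):
from sklearn.naive_bayes import BernoulliNB

# alpha: Laplace smoothing; binarize: threshold for turning numeric features into 0/1;
# fit_prior: learn class priors from the data (False would assume uniform priors)
example_clf = BernoulliNB(alpha=1.0, binarize=0.5, fit_prior=True)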
Like any machine learning algorithm, Bernoulli Naive Bayes has its strengths and limitations.
Pros:
- Simplicity: Easy to implement and understand.
- Efficiency: Fast to train and predict, and it works well with large feature spaces.
- Performance on small datasets: It can work well even with limited training data.
- Handles high-dimensional data: It works well with many features, especially in text classification.
Cons:
- Independence assumption: It assumes that all features are independent, which is often not true in real-world data.
- Limited to binary features: In its pure form, it only works with binary data.
- Sensitivity to input data: It may be sensitive to how features are binarized.
- Zero-frequency problem: Without smoothing, zero probabilities can strongly affect predictions.
The Bernoulli Naive Bayes classifier is a simple yet powerful machine learning algorithm for binary classification. It excels in text analysis and spam detection, where features are typically binary. This probabilistic model, known for its speed and efficiency, works well with small data sets and high-dimensional spaces.
Despite its naive assumption of feature independence, it often rivals more complex models in accuracy. Bernoulli Naive Bayes serves as an excellent baseline and a practical real-time classification tool. The complete code below puts all of the steps together.
# Import needed libraries
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset
dataset_dict = {
'Outlook': ('sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'),
'Temperature': (85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0),
'Humidity': (85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0),
'Wind': (False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False),
'Play': ('No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
}
df = pd.DataFrame(dataset_dict)
# Prepare data for model
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
# Split data into training and testing sets
x, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.5, shuffle=False)
# Scale numerical features (for automatic binarization)
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])
# Train the model
nb_clf = BernoulliNB()
nb_clf.fit(X_train, y_train)
# Make predictions
y_pred = nb_clf.predict(X_test)
# Check accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")