Because we use an unsupervised learning algorithm, there is no widely available measure of accuracy. However, we can use domain knowledge to validate our groups.
By visually inspecting the groups, we can see that some benchmarking groups have a mix of budget and luxury hotels, which does not make business sense as the demand for hotels is fundamentally different.
We can scroll down to the data and notice some of those differences, but can we come up with our own measure of precision?
We want to create a function to measure the consistency of the recommended benchmark sets on each feature. One way to do this is by calculating the variance of each feature for each set. For each group, we can calculate an average of the variance of each characteristic and then we can average the variance of each group of hotels to obtain a total model score.
From our domain knowledge, we know that to establish a comparable set of benchmarks, we must prioritize hotels from the same brand, possibly from the same market and from the same country, and if we use different markets or countries, then the market level should be the same.
With that in mind, we want our measure to have a higher penalty for variation in those characteristics. To do this, we will use a weighted average to calculate the variance of each reference set. We will also print the variation of the key and secondary features separately.
In summary, to create our precision measure, we need to:
- Calculate the variance of categorical variables: A common approach is to use an entropy-based measure, where greater diversity in categories yields higher entropy (our stand-in for variance).
- Calculate the variance of numerical variables: We can calculate the variance, the standard deviation, or the range (difference between maximum and minimum values). This measures the spread of numerical data within each group.
- Normalize the data: Normalize the variance scores for each feature before applying weights, so that no feature dominates the weighted average due to scale differences alone.
- Apply weights for different metrics: Weight each type of variation according to its importance to the grouping logic.
- Calculate weighted averages: Calculate the weighted average of these variance scores for each group.
- Aggregate scores across groups: The total score is the average of these weighted variance scores across all groups or rows. A lower average score indicates that our model is effectively grouping similar hotels together, minimizing variation within each group.
import numpy as np
import pandas as pd
from collections import Counter
from scipy.stats import entropy
from sklearn.preprocessing import MinMaxScaler

def categorical_variance(data):
    """
    Calculate entropy for a categorical variable from a list.
    A higher entropy value indicates data with diverse classes.
    A lower entropy value indicates a more homogeneous subset of data.
    """
    # Count frequency of each unique value
    value_counts = Counter(data)
    total_count = sum(value_counts.values())
    # Convert counts to probabilities (a list, because entropy() expects an array-like)
    probabilities = [count / total_count for count in value_counts.values()]
    return entropy(probabilities)
# Set scoring weights, giving higher weights to the most important features
scoring_weights = {"BRAND": 0.3,
                   "Room_count": 0.025,
                   "Market": 0.25,
                   "Country": 0.15,
                   "Market Tier": 0.15,
                   "HCLASS": 0.05,
                   "Demand": 0.025,
                   "Price range": 0.025,
                   "distance_to_airport": 0.025}
def calculate_weighted_variance(df, weights):
    """
    Calculate the weighted variance score for clusters in the dataset.
    Returns the mean categorical score and the mean numerical score.
    """
    # Initialize a DataFrame to store the variances
    variance_df = pd.DataFrame()

    # 1. Calculate variances for numerical features
    numerical_features = ['Room_count', 'Demand', 'Price range', 'distance_to_airport']
    for feature in numerical_features:
        variance_df[feature] = df[feature].apply(np.var)

    # 2. Calculate entropy for categorical features
    categorical_features = ['BRAND', 'Market', 'Country', 'Market Tier', 'HCLASS']
    for feature in categorical_features:
        variance_df[feature] = df[feature].apply(categorical_variance)

    # 3. Normalize the variance and entropy values to the [0, 1] range
    scaler = MinMaxScaler()
    normalized_variances = pd.DataFrame(scaler.fit_transform(variance_df),
                                        columns=variance_df.columns,
                                        index=variance_df.index)

    # 4. Compute weighted scores (also stored on df for later inspection)
    cat_weights = {feature: weights[feature] for feature in categorical_features}
    num_weights = {feature: weights[feature] for feature in numerical_features}
    cat_weighted_scores = normalized_variances[categorical_features].mul(cat_weights)
    df['cat_weighted_variance_score'] = cat_weighted_scores.sum(axis=1)
    num_weighted_scores = normalized_variances[numerical_features].mul(num_weights)
    df['num_weighted_variance_score'] = num_weighted_scores.sum(axis=1)

    return df['cat_weighted_variance_score'].mean(), df['num_weighted_variance_score'].mean()
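To make the scoring steps concrete, here is a minimal sketch on three made-up benchmark sets. The data and the names `shannon_entropy`, `min_max`, `brands`, and `rooms` are illustrative assumptions, not part of the hotel dataset. Entropy is computed by hand with NumPy to make the formula explicit; it should agree with `scipy.stats.entropy` and its default natural logarithm. Only two of the nine features are used, with their weights taken from `scoring_weights`.

```python
from collections import Counter

import numpy as np


def shannon_entropy(values):
    """Entropy (natural log) of the category frequencies in `values`."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-(probs * np.log(probs)).sum())


def min_max(x):
    """Rescale an array to the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())


# Three hypothetical benchmark sets of four hotels each (made-up data)
brands = [["A", "A", "A", "A"],   # one brand only
          ["A", "A", "B", "B"],   # two brands
          ["A", "B", "C", "D"]]   # four different brands
rooms = [[100, 100, 100, 100],
         [100, 120, 140, 160],
         [50, 150, 250, 350]]

# Steps 1 and 2: one spread value per set and per feature
brand_entropy = np.array([shannon_entropy(b) for b in brands])
room_variance = np.array([np.var(r) for r in rooms])

# Step 3: normalize each feature's scores across the sets
brand_norm = min_max(brand_entropy)
room_norm = min_max(room_variance)

# Steps 4 to 6: apply the feature weights, then average across sets
cat_scores = 0.3 * brand_norm      # weight of "BRAND" in scoring_weights
num_scores = 0.025 * room_norm     # weight of "Room_count" in scoring_weights
cat_mean, num_mean = cat_scores.mean(), num_scores.mean()

print(brand_norm, room_norm)
print(cat_mean, num_mean)
```

The single-brand set gets the lowest possible score and the four-brand set the highest, which is the behavior we want from a consistency measure.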
To keep our code clean and keep track of our experiments, let's also define a function to store the results of our experiments.
# Define a function to store the results of our experiments
def model_score(data: pd.DataFrame,
                weights: dict = scoring_weights,
                model_name: str = "model_0"):
    cat_score, num_score = calculate_weighted_variance(data, weights)
    results = {"Model": model_name,
               "Primary features score": cat_score,
               "Secondary features score": num_score}
    return results

model_0_score = model_score(results_model_0, scoring_weights)
model_0_score
Now that we have a baseline, let's see if we can improve our model.
Improving our model through experimentation
Until now, we didn't have to know what was going on under the hood when we ran this code:
nns = NearestNeighbors()
nns.fit(data_scaled)
nns_results_model_0 = nns.kneighbors(data_scaled)[1]
To improve our model, we will need to understand the model parameters and how we can interact with them to obtain better reference sets.
Let's start by looking at the scikit-learn documentation and source code:
# The code below is taken directly from the scikit-learn source
from sklearn.neighbors._base import KNeighborsMixin, NeighborsBase, RadiusNeighborsMixin

class NearestNeighbors_(KNeighborsMixin, RadiusNeighborsMixin, NeighborsBase):
    """Unsupervised learner for implementing neighbor searches.

    Parameters
    ----------
    n_neighbors : int, default=5
        Number of neighbors to use by default for :meth:`kneighbors` queries.

    radius : float, default=1.0
        Range of parameter space to use by default for :meth:`radius_neighbors`
        queries.

    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
        Algorithm used to compute the nearest neighbors:

        - 'ball_tree' will use :class:`BallTree`
        - 'kd_tree' will use :class:`KDTree`
        - 'brute' will use a brute-force search.
        - 'auto' will attempt to decide the most appropriate algorithm
          based on the values passed to :meth:`fit` method.

        Note: fitting on sparse input will override the setting of
        this parameter, using brute force.

    leaf_size : int, default=30
        Leaf size passed to BallTree or KDTree. This can affect the
        speed of the construction and query, as well as the memory
        required to store the tree. The optimal value depends on the
        nature of the problem.

    metric : str or callable, default='minkowski'
        Metric to use for distance computation. Default is "minkowski", which
        results in the standard Euclidean distance when p = 2. See the
        documentation of `scipy.spatial.distance`_ and
        the metrics listed in
        :class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric
        values.

    p : float (positive), default=2
        Parameter for the Minkowski metric from
        sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is
        equivalent to using manhattan_distance (l1), and euclidean_distance
        (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

    metric_params : dict, default=None
        Additional keyword arguments for the metric function.
    """

    def __init__(
        self,
        *,
        n_neighbors=5,
        radius=1.0,
        algorithm="auto",
        leaf_size=30,
        metric="minkowski",
        p=2,
        metric_params=None,
        n_jobs=None,
    ):
        super().__init__(
            n_neighbors=n_neighbors,
            radius=radius,
            algorithm=algorithm,
            leaf_size=leaf_size,
            metric=metric,
            p=p,
            metric_params=metric_params,
            n_jobs=n_jobs,
        )
There's quite a bit going on here.
The NearestNeighbors class inherits from NeighborsBase, which is the base class for nearest neighbor estimators. This class handles common functionality required for nearest neighbor searches, such as:
- n_neighbors (the number of neighbors to use)
- radius (the radius for radius-based neighbor lookups)
- algorithm (the algorithm used to calculate nearest neighbors, such as 'ball_tree', 'kd_tree' or 'brute')
- metric (the distance metric to use)
- metric_params (additional keyword arguments for the metric function)
The NearestNeighbors class also inherits from the KNeighborsMixin and RadiusNeighborsMixin classes. These mixin classes add specific neighbor search functionality to NearestNeighbors:
- KNeighborsMixin provides functionality to find a fixed number k of nearest neighbors to a point. It does this by finding the distances to the neighbors and their indices, and by constructing a graph of connections between points based on the k nearest neighbors of each point.
- RadiusNeighborsMixin is based on the radius neighbors algorithm, which finds all neighbors within a given radius of a point. This method is useful in scenarios where the focus is on capturing all points within a meaningful distance threshold rather than a fixed number of points.
Based on our scenario, KNeighborsMixin provides the functionality we need.
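The difference between the two mixins is easiest to see on a toy example. The sketch below uses four made-up points on a line (an assumption for illustration, not the hotel data) and calls `kneighbors` and `radius_neighbors` on the same fitted estimator. The helper variable `rad_sets` is just a convenience for printing.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up points on a line (hypothetical data)
X = np.array([[0.0], [1.0], [3.0], [10.0]])

nn = NearestNeighbors(n_neighbors=2, radius=2.5).fit(X)

# KNeighborsMixin: always returns exactly n_neighbors per query point.
# When we query the training data itself, each point's closest
# neighbor is the point itself (distance 0).
knn_dist, knn_idx = nn.kneighbors(X)

# RadiusNeighborsMixin: returns every point within `radius`, so the
# number of neighbors varies from one query point to the next.
rad_dist, rad_idx = nn.radius_neighbors(X)
rad_sets = [sorted(idx.tolist()) for idx in rad_idx]

print(knn_idx.tolist())
print(rad_sets)
```

The isolated point at 10.0 still gets a second neighbor from `kneighbors`, but only finds itself with `radius_neighbors`. Since every hotel needs a benchmark set of a fixed size, the k-neighbors behavior is the one that fits our scenario.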
We need to understand one key parameter before we can improve our model: the distance metric.
The documentation mentions that NearestNeighbors uses the “Minkowski” distance by default and gives us a reference to the SciPy API.
In scipy.spatial.distance, we can see two mathematical representations of the “Minkowski” distance:
$$\|u - v\|_p = \left( \sum_i |u_i - v_i|^p \right)^{1/p}$$
This formula calculates the p-th root of the sum of the absolute differences between corresponding elements, each raised to the power p.
The second mathematical representation of the “Minkowski” distance is:
$$\|u - v\|_p = \left( \sum_i w_i \, |u_i - v_i|^p \right)^{1/p}$$
This is very similar to the first, but it introduces weights w_i on the differences, emphasizing or downplaying specific dimensions. This is useful when certain features are more relevant than others. By default, the weights are set to None, which gives all features the same weight of 1.0.
This is a great option to improve our model, as it allows us to convey domain knowledge to our model and emphasize the similarities that are most relevant to users.
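Here is a small sketch of how weights can change which point counts as "nearest". The function `weighted_minkowski` is a hand-written illustration of the weighted formula (it is not the SciPy or scikit-learn implementation), and the three two-feature "hotels" and the weights are made up.

```python
import numpy as np


def weighted_minkowski(u, v, p=2, w=None):
    """Weighted Minkowski distance: (sum_i w_i * |u_i - v_i|^p)^(1/p)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    w = np.ones_like(u) if w is None else np.asarray(w, dtype=float)
    return float((w * np.abs(u - v) ** p).sum() ** (1 / p))


# Made-up points with two features each
hotel_a = np.array([0.0, 0.0])
hotel_b = np.array([3.0, 0.0])   # differs from A on feature 0 only
hotel_c = np.array([0.0, 4.0])   # differs from A on feature 1 only

# Without weights, B is closer to A than C is
d_ab = weighted_minkowski(hotel_a, hotel_b)
d_ac = weighted_minkowski(hotel_a, hotel_c)

# Weighting feature 0 four times as much makes C the closer one
w = np.array([4.0, 1.0])
d_ab_w = weighted_minkowski(hotel_a, hotel_b, w=w)
d_ac_w = weighted_minkowski(hotel_a, hotel_c, w=w)

# Equivalent view: multiply each feature by w_i^(1/p), then use the
# unweighted distance on the rescaled points
scale = w ** (1 / 2)
d_ab_scaled = weighted_minkowski(hotel_a * scale, hotel_b * scale)

print(d_ab, d_ac)        # unweighted distances
print(d_ab_w, d_ac_w)    # weighted distances
```

The last step shows a useful way to think about the weights: they stretch the axes of important features, so differences on those features cost more.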
If we look at the formulas, we see the parameter p. This parameter affects the “path” the algorithm takes to calculate the distance. By default, p=2, which represents the Euclidean distance.
You can think of Euclidean distance as calculating distance by drawing a straight line between 2 points. This is the shortest path between them; however, it is not always the most desirable way to calculate distance, especially in high-dimensional spaces. For more information on why this occurs, see this excellent paper: https://bib.dbvis.de/uploadedFiles/155.pdf
Another common value for p is 1. This represents the Manhattan distance. It can be thought of as the distance between two points measured along a grid-like path.
On the other hand, if we increase p towards infinity, we end up with the Chebyshev distance, defined as the maximum absolute difference between any corresponding elements of the vectors. Basically, it measures the worst-case difference, making it useful in scenarios where you want to ensure that no feature varies too much.
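As a quick numeric illustration of how p changes the result, the sketch below computes the distance between two made-up points for several values of p, using `np.linalg.norm` on the difference vector. The points are arbitrary and chosen only because the arithmetic is easy to follow.

```python
import numpy as np

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])
diff = u - v

manhattan = np.linalg.norm(diff, ord=1)        # p = 1: |3| + |4|
euclidean = np.linalg.norm(diff, ord=2)        # p = 2: sqrt(3^2 + 4^2)
large_p = np.linalg.norm(diff, ord=50)         # p = 50: dominated by the largest difference
chebyshev = np.linalg.norm(diff, ord=np.inf)   # p -> infinity: max(|3|, |4|)

print(manhattan, euclidean, large_p, chebyshev)
```

As p grows, the distance shrinks from the sum of the differences towards the single largest difference, which is why the Chebyshev distance behaves like a worst-case measure.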
By reading and familiarizing ourselves with the documentation, we have discovered some possible options to improve our model.
By default, n_neighbors is 5. However, for our benchmark sets, we want to compare each hotel to its 3 most similar hotels. To do this, we need to set n_neighbors=4 (the hotel in question + 3 peers):
nns_1 = NearestNeighbors(n_neighbors=4)
nns_1.fit(data_scaled)
nns_1_results_model_1 = nns_1.kneighbors(data_scaled)[1]
results_model_1 = clean_results(nns_results=nns_1_results_model_1,
                                encoders=encoders,
                                data=data_clean)
model_1_score = model_score(results_model_1, scoring_weights, model_name="baseline_k_4")
model_1_score
According to the documentation, we can pass weights to the distance calculation to emphasize some features over others. Based on our domain knowledge, we have identified the features we want to emphasize: in this case, Brand, Market, Country and Market Tier.
# Set up weights for the distance calculation
weights_dict = {"BRAND": 5,
                "Room_count": 2,
                "Market": 4,
                "Country": 3,
                "Market Tier": 3,
                "HCLASS": 1.5,
                "Demand": 1,
                "Price range": 1,
                "distance_to_airport": 1}
# Transform the weights dictionary into a list, keeping the column order of the scaled data
weights = [weights_dict[idx] for idx in scaler.get_feature_names_out()]

nns_2 = NearestNeighbors(n_neighbors=4, metric_params={'w': weights})
nns_2.fit(data_scaled)
nns_2_results_model_2 = nns_2.kneighbors(data_scaled)[1]
results_model_2 = clean_results(nns_results=nns_2_results_model_2,
                                encoders=encoders,
                                data=data_clean)
model_2_score = model_score(results_model_2, scoring_weights, model_name="baseline_with_weights")
model_2_score
Passing domain knowledge to the model using weights significantly improved the score (remember that a lower variance score is better). Next, let's test the impact of the distance measure.
So far we have been using the Euclidean distance. Let's see what happens if we use the Manhattan distance instead.
nns_3 = NearestNeighbors(n_neighbors=4, p=1, metric_params={'w': weights})
nns_3.fit(data_scaled)
nns_3_results_model_3 = nns_3.kneighbors(data_scaled)[1]
results_model_3 = clean_results(nns_results=nns_3_results_model_3,
                                encoders=encoders,
                                data=data_clean)
model_3_score = model_score(results_model_3, scoring_weights, model_name="Manhattan_with_weights")
model_3_score
Decreasing p from 2 to 1 resulted in some nice improvements. Let's see what happens when p approaches infinity.
To use the Chebyshev distance, we need to change the metric parameter.
The default sklearn Chebyshev metric does not have a weight parameter. To work around this, we will define a custom weighted_chebyshev metric.
# Define the custom weighted Chebyshev distance function
def weighted_chebyshev(u, v, w):
    """Calculate the weighted Chebyshev distance between two points."""
    return np.max(w * np.abs(u - v))

nns_4 = NearestNeighbors(n_neighbors=4, metric=weighted_chebyshev, metric_params={'w': weights})
nns_4.fit(data_scaled)
nns_4_results_model_4 = nns_4.kneighbors(data_scaled)[1]
results_model_4 = clean_results(nns_results=nns_4_results_model_4,
                                encoders=encoders,
                                data=data_clean)
model_4_score = model_score(results_model_4, scoring_weights, model_name="Chebyshev_with_weights")
model_4_score
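A custom Python metric is called once per pair of points, which can be slow on larger datasets. One possible alternative, sketched below on made-up random data, is to multiply each feature by its weight and then use the built-in "chebyshev" metric. This works because max(w_i * |u_i - v_i|) equals max(|w_i * u_i - w_i * v_i|) for non-negative weights. The data, the weights, and the choice of `algorithm="brute"` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def weighted_chebyshev(u, v, w):
    """Weighted Chebyshev distance: the largest weighted absolute difference."""
    return np.max(w * np.abs(u - v))


rng = np.random.default_rng(0)
X = rng.random((20, 3))          # made-up data: 20 points, 3 features
w = np.array([5.0, 1.0, 2.0])    # hypothetical feature weights

# Option 1: custom callable metric, with the weights passed via metric_params
nn_custom = NearestNeighbors(n_neighbors=4, algorithm="brute",
                             metric=weighted_chebyshev,
                             metric_params={"w": w}).fit(X)
dist_custom, idx_custom = nn_custom.kneighbors(X)

# Option 2: scale each feature by its weight, then use the built-in metric
X_weighted = X * w
nn_builtin = NearestNeighbors(n_neighbors=4, algorithm="brute",
                              metric="chebyshev").fit(X_weighted)
dist_builtin, idx_builtin = nn_builtin.kneighbors(X_weighted)

# Both options should yield the same neighbor distances
same_distances = np.allclose(dist_custom, dist_builtin)
print(same_distances)
```

The custom metric keeps the weights explicit in the model definition, while the pre-scaling approach trades that readability for speed.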
We managed to reduce the variance scores of the primary features through experimentation.
Let's visualize the results.
results_df = pd.DataFrame([model_0_score, model_1_score, model_2_score, model_3_score, model_4_score]).set_index("Model")
results_df.plot(kind='barh')
Using Manhattan distance with weights appears to provide the most accurate reference sets for our needs.
The last step before implementing the reference sets would be to examine the sets with the highest primary feature scores and identify what steps to take with them.
# Histogram of Primary features score
results_model_3["cat_weighted_variance_score"].plot(kind="hist")

exceptions = results_model_3[results_model_3["cat_weighted_variance_score"] >= 0.4]
print(f"There are {exceptions.shape[0]} benchmark sets with significant variance across the primary features")
These 18 cases will need to be reviewed to ensure that the benchmark sets are relevant.
As you can see, with a few lines of code and some understanding of nearest neighbor search, we managed to establish internal benchmark sets. Now we can distribute the sets and start measuring the hotels' KPIs against their reference sets.
You don't always need to focus on the most advanced machine learning methods to deliver value. Very often, simple machine learning can offer great value.
What are some of the low-hanging fruits in your business that you could easily address with machine learning?
World Bank. World Development Indicators. Retrieved June 11, 2024, from https://datacatalog.worldbank.org/search/dataset/0038117
Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (n.d.). On the Surprising Behavior of Distance Metrics in High Dimensional Space. IBM T. J. Watson Research Center and Institute of Computer Science, University of Halle. Retrieved from https://bib.dbvis.de/uploadedFiles/155.pdf
SciPy v1.10.1 Manual. scipy.spatial.distance.minkowski. Retrieved June 11, 2024, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.minkowski.html
GeeksforGeeks. Haversine formula to find distance between two points on a sphere. Retrieved June 11, 2024, from https://www.geeksforgeeks.org/haversine-formula-to-find-distance-between-two-points-on-a-sphere/
scikit-learn. Neighbors Module. Retrieved June 11, 2024, from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors