REGRESSION ALGORITHM
Continuing our exploration of the nearest neighbor classifier, let's move on to its counterpart in the world of regression. The nearest neighbor regressor applies the same intuitive concept to predict continuous values. But as our data sets grow, finding these neighbors efficiently becomes a real challenge. That's where KD Trees and Ball Trees come in.
It's frustrating that there is no clear guide that really explains how these algorithms work. Sure, there are some 2D visualizations, but they often don't make it clear how the trees function in multidimensional space.
Here we will explain what is actually happening in these algorithms without relying on the overly simplified 2D representation. We'll focus on the construction of the trees themselves and see which calculations (and numbers) really matter.
The nearest neighbor regressor is a simple predictive model that estimates values by averaging the results of nearby data points. This method is based on the idea that similar inputs are likely to produce similar results.
To illustrate our concepts, we will use our usual data set. This data set helps predict the number of golfers visiting the site on a given day. It includes factors such as weather outlook, temperature, humidity and wind conditions. Our objective is to estimate the daily participation of golfers based on these variables.
To use nearest neighbor regression effectively, we must preprocess our data. This involves converting categorical variables to numerical format and scaling numerical features.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'])

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Identify numerical columns
numerical_columns = ['Temperature', 'Humidity']

# Create a ColumnTransformer to scale only numerical columns
ct = ColumnTransformer([
    ('scaler', StandardScaler(), numerical_columns)
], remainder='passthrough')

# Fit the ColumnTransformer on the training data and transform both training and test data
X_train_transformed = ct.fit_transform(X_train)
X_test_transformed = ct.transform(X_test)

# Convert the transformed data back to DataFrames
feature_names = numerical_columns + [col for col in X_train.columns if col not in numerical_columns]
X_train_scaled = pd.DataFrame(X_train_transformed, columns=feature_names, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_transformed, columns=feature_names, index=X_test.index)
The nearest neighbor regressor works similarly to its classifier counterpart, but instead of voting for a class, it averages the target values. Here is the basic process:
- Calculate the distance between the new data point and all points in the training set.
- Select the K nearest neighbors based on these distances.
- Compute the average of the target values of these K neighbors.
- Assign this average as the predicted value for the new data point.
The above approach, which searches all data points to find neighbors, is known as the brute-force method. It is simple but can be slow for large data sets.
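To make the brute-force idea concrete, here is a minimal NumPy sketch of the four steps above. Note that brute_force_knn_predict is a hypothetical helper name (not part of any library); it expects plain arrays, so the commented example converts the preprocessed DataFrames from above.
import numpy as np

def brute_force_knn_predict(X_train, y_train, query, k=5):
    # Compute the Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Pick the indices of the K smallest distances
    nearest_idx = np.argsort(distances)[:k]
    # Average the target values of those K neighbors
    return y_train[nearest_idx].mean()

# Example (hypothetical): predict for the first test row using the scaled data above
# prediction = brute_force_knn_predict(X_train_scaled.to_numpy(dtype=float),
#                                      y_train.to_numpy(),
#                                      X_test_scaled.to_numpy(dtype=float)[0], k=5)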
Here, we will explore two more efficient algorithms for finding nearest neighbors:
KD Tree (K-Dimensional Tree) is a binary tree structure used to organize points in a k-dimensional space. It is particularly useful for tasks such as nearest neighbor searches and range searches in multidimensional data.
Training steps:
1. Build KD tree:
a. Start with a root node that contains all the points.
b. Choose a feature to split on. Any feature chosen at random would work, but a good heuristic is to pick the feature whose median value is closest to the midpoint between its maximum and minimum values.
c. Split the node on the chosen feature at that midpoint.
d. Recursively repeat steps a-c until a stopping criterion is reached, usually a minimum leaf size (see "minimum leaf samples" here).
2. Save the target values:
Along with each point in the KD tree, store its corresponding target value. The minimum and maximum values along each feature are also stored for every node, as in the simplified sketch below.
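As a rough illustration of the construction steps above, here is a simplified sketch of how such a tree could be built. It is not scikit-learn's actual implementation; KDNode, build_kd_tree, and min_leaf_size are hypothetical names, and the splitting heuristic follows the midpoint rule described above.
import numpy as np

class KDNode:
    def __init__(self, points, targets):
        self.points, self.targets = points, targets                     # data stored in this node
        self.mins, self.maxs = points.min(axis=0), points.max(axis=0)   # per-feature bounds, kept for pruning
        self.feature = self.split_value = None
        self.left = self.right = None

def build_kd_tree(points, targets, min_leaf_size=5):
    node = KDNode(points, targets)
    if len(points) <= min_leaf_size:
        return node                                   # stopping criterion: minimum leaf size
    # Choose the feature whose median is closest to the midpoint of its range
    midpoints = (node.mins + node.maxs) / 2
    medians = np.median(points, axis=0)
    feature = int(np.argmin(np.abs(medians - midpoints)))
    split_value = midpoints[feature]
    mask = points[:, feature] <= split_value          # split at the midpoint of the chosen feature
    if mask.all() or (~mask).all():
        return node                                   # all points fall on one side; keep as a leaf
    node.feature, node.split_value = feature, split_value
    node.left = build_kd_tree(points[mask], targets[mask], min_leaf_size)
    node.right = build_kd_tree(points[~mask], targets[~mask], min_leaf_size)
    return node

# Example (hypothetical): build a tree on the scaled training data from above
# root = build_kd_tree(X_train_scaled.to_numpy(dtype=float), y_train.to_numpy(), min_leaf_size=5)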
Regression/prediction steps:
1. Traverse the KD tree:
a. Start at the root node.
b. At each node, compare the query point (the test point) against the node's split feature and split value.
c. Recursively traverse left or right based on this comparison.
d. Upon reaching a leaf node, add its points to the candidate set.
2. Refine your search:
a. Work your way back up the tree, checking whether there may be closer points on other branches.
b. Use distance calculations to the stored minimum and maximum bounds of unexplored branches to decide whether they could contain closer points and need to be explored.
3. Find K nearest neighbors:
a. Among the candidate points found, select the K points closest to the query point.
4. Perform regression:
a. Compute the average of the target values of the K nearest neighbors.
b. This average is the predicted value for the query point.
By using a KD tree, the average time complexity of finding nearest neighbors can be reduced from O(n) with the brute-force method to O(log n) in many cases, where n is the number of points in the data set. This makes KNN regression much more efficient for large data sets.
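scikit-learn also exposes the KD tree directly through its KDTree class, so the whole search-and-average procedure can be sketched in a few lines. This uses the preprocessed data from earlier; the variable names (kd_tree, y_pred_kd) are only illustrative.
import numpy as np
from sklearn.neighbors import KDTree

# Build the KD tree on the scaled training features
kd_tree = KDTree(X_train_scaled.to_numpy(dtype=float), leaf_size=5)

# Find the 5 nearest neighbors of every test point
distances, indices = kd_tree.query(X_test_scaled.to_numpy(dtype=float), k=5)

# Regression step: average the target values of those neighbors
y_pred_kd = y_train.to_numpy()[indices].mean(axis=1)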
Ball Tree is another spatial partitioning data structure that organizes points into a series of nested hyperspheres. It is particularly effective for high-dimensional data where KD trees can become less efficient.
Training steps:
1. Build the ball tree:
a. Compute the centroid of all points in the node (their mean). This becomes the pivot point.
b. Make the first branch:
– Find the first center: choose the point furthest from the pivot as the first center, with its distance from the pivot as the radius of the node.
– Find the second center: from the first center, select the furthest point as the second center.
– Partition: split the remaining points into two child nodes based on which center they are closest to.
c. Recursively apply steps a and b to each child node until a stopping criterion is met, usually a minimum leaf size (see "minimum leaf samples" here).
2. Save the target values:
Along with each point in the ball tree, store its corresponding target value. The centroid and radius of each node are also stored; a simplified sketch of the splitting step follows below.
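To make the branching step concrete, here is a simplified sketch of how a single node could be split following the steps above. This is only an illustration of the idea, not scikit-learn's actual implementation, and split_ball_node is a hypothetical helper name.
import numpy as np

def split_ball_node(points, targets):
    # Pivot: the centroid (mean) of all points in the node
    centroid = points.mean(axis=0)
    dist_to_centroid = np.linalg.norm(points - centroid, axis=1)
    radius = dist_to_centroid.max()                   # radius of this node's hypersphere
    # First center: the point furthest from the pivot
    center_1 = points[np.argmax(dist_to_centroid)]
    # Second center: the point furthest from the first center
    center_2 = points[np.argmax(np.linalg.norm(points - center_1, axis=1))]
    # Partition the points by whichever center is closer
    closer_to_1 = (np.linalg.norm(points - center_1, axis=1)
                   <= np.linalg.norm(points - center_2, axis=1))
    left = (points[closer_to_1], targets[closer_to_1])
    right = (points[~closer_to_1], targets[~closer_to_1])
    return centroid, radius, left, right

# Example (hypothetical): split the root node built from the scaled training data
# centroid, radius, left, right = split_ball_node(X_train_scaled.to_numpy(dtype=float), y_train.to_numpy())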
Regression/prediction steps:
1. Go through the ball tree:
a. Start at the root node.
b. At each node, calculate the distance between the query point and the center of each child hypersphere.
c. Traverse the closer hypersphere first.
d. Upon reaching a leaf node, add its points to the candidate set.
2. Refine your search:
a. Determine whether other branches need to be explored.
b. If the distance from the query point to a hypersphere's center minus its radius is greater than the current Kth nearest distance, that branch cannot contain a closer point and can be safely ignored.
3. Find K nearest neighbors:
a. Among the candidate points found, select the K points closest to the query point.
4. Perform regression:
a. Compute the average of the target values of the K nearest neighbors.
b. This average is the predicted value for the query point.
Ball trees can be more efficient than KD trees for high-dimensional data or when the dimensionality is greater than the log number of samples. They maintain good performance even when the number of dimensions increases, making them suitable for a wide range of data sets.
The average time complexity of querying a Ball tree is O(log n), similar to a KD tree, but Ball trees often hold up better in higher dimensions, where KD tree queries can degrade toward linear time.
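As with the KD tree, scikit-learn exposes this structure directly through its BallTree class, so the same search-and-average procedure can be sketched as follows (again using the preprocessed data from earlier; ball_tree and y_pred_ball are illustrative names).
import numpy as np
from sklearn.neighbors import BallTree

# Build the ball tree on the scaled training features
ball_tree = BallTree(X_train_scaled.to_numpy(dtype=float), leaf_size=5)

# Find the 5 nearest neighbors of every test point
distances, indices = ball_tree.query(X_test_scaled.to_numpy(dtype=float), k=5)

# Regression step: average the target values of those neighbors
y_pred_ball = y_train.to_numpy()[indices].mean(axis=1)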
Regardless of which algorithm we choose, they all return the same nearest neighbors and therefore the same predictions.
You can follow this simple rule to choose the best one:
– For small data sets (<1000 samples), 'brute' may be fast enough and guarantees finding the exact nearest neighbors.
– For larger data sets with few features (<20), 'kd_tree' is usually fastest.
– For larger data sets with many features, 'ball_tree' usually works better.
The 'auto' option in scikit-learn applies roughly this rule of thumb for you automatically.
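In practice you rarely build these trees by hand; you simply pass the choice to KNeighborsRegressor through its algorithm parameter. A minimal sketch on the preprocessed data from above, where all four options should give identical predictions and only the search speed differs:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import root_mean_squared_error

# Try each neighbor-search algorithm; the predictions are identical, only the speed differs
for algo in ['brute', 'kd_tree', 'ball_tree', 'auto']:
    knn = KNeighborsRegressor(n_neighbors=5, algorithm=algo)
    knn.fit(X_train_scaled, y_train)
    rmse = root_mean_squared_error(y_test, knn.predict(X_test_scaled))
    print(f"{algo:>9}: RMSE = {rmse:.4f}")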
Although KNN regression has other parameters besides the algorithm we just discussed (brute force, KD tree, Ball tree), the ones you should mainly consider are:
- Number of Neighbors (K).
– Smaller K: more sensitive to local patterns, but may cause overfitting.
– Larger K: smoother predictions, but important local variations may be missed.
Unlike classification, an even value of K is fine in regression, since we are averaging values rather than voting.
- Leaf Size
This is the stopping condition for building a KD tree or a Ball tree. Generally, it affects the speed of construction and querying, as well as the memory required to store the tree; a quick sweep over both parameters is sketched below.
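A quick way to see the effect of these two parameters is to sweep a few values and compare the test error. The value grids below are arbitrary choices for illustration, applied to the preprocessed golf data set from above.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import root_mean_squared_error

# Sweep a few values of K and leaf_size; leaf_size affects speed and memory, not the predictions
for k in [2, 3, 5, 7]:
    for leaf_size in [5, 30]:
        knn = KNeighborsRegressor(n_neighbors=k, algorithm='kd_tree', leaf_size=leaf_size)
        knn.fit(X_train_scaled, y_train)
        rmse = root_mean_squared_error(y_test, knn.predict(X_test_scaled))
        print(f"K={k}, leaf_size={leaf_size}: RMSE = {rmse:.4f}")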
Advantages:
- Simplicity and Versatility: Easy to understand and implement; it can be used for both classification and regression tasks.
- No assumptions: It makes no assumptions about the data distribution, making it suitable for complex data sets.
- No training phase: You can quickly incorporate new data without needing to retrain.
- Interpretability: Predictions can be explained by examining nearest neighbors.
Disadvantages:
- Computational complexity: Prediction time can be slow, especially with large data sets, although optimized structures (KD Tree, Ball Tree) help in lower-dimensional settings.
- Curse of dimensionality: Performance degrades in high-dimensional spaces, affecting both accuracy and efficiency.
- Memory intensive: Requires storing all training data.
- Sensitive to feature scaling and irrelevant features: May be biased by features at larger scales or those that are not important for prediction.
The K-Nearest Neighbors (KNN) regressor is a basic but powerful tool in machine learning. Its simple approach makes it ideal for beginners and its flexibility ensures it is also useful for experts.
As you learn more about data analysis, use KNN to understand the basics of regression before exploring more advanced methods. By mastering KNN and how to calculate nearest neighbors, you will build a solid foundation for tackling more complex challenges in data analysis.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'])

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Identify numerical columns
numerical_columns = ['Temperature', 'Humidity']

# Create a ColumnTransformer to scale only numerical columns
ct = ColumnTransformer([
    ('scaler', StandardScaler(), numerical_columns)
], remainder='passthrough')

# Fit the ColumnTransformer on the training data and transform both training and test data
X_train_transformed = ct.fit_transform(X_train)
X_test_transformed = ct.transform(X_test)

# Convert the transformed data back to DataFrames
feature_names = numerical_columns + [col for col in X_train.columns if col not in numerical_columns]
X_train_scaled = pd.DataFrame(X_train_transformed, columns=feature_names, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_transformed, columns=feature_names, index=X_test.index)
# Initialize and train KNN Regressor
knn = KNeighborsRegressor(n_neighbors=5,
                          algorithm='kd_tree',  # or 'ball_tree', 'brute', 'auto'
                          leaf_size=5)          # default is 30
knn.fit(X_train_scaled, y_train)
# Make predictions
y_pred = knn.predict(X_test_scaled)
# Calculate and print RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")