REGRESSION ALGORITHM
When people start learning about data analysis, they usually start with linear regression. There's a good reason for this: it's one of the most useful and easiest ways to understand how regression works. The most common linear regression approaches are called “least squares” methods: they fit the model by minimizing the squared differences between predictions and actual values. The most basic type is ordinary least squares (OLS), which finds the best straight line to draw through your data points.
However, sometimes OLS is not enough, especially when the data has many related features that can make the results unstable. That's where Ridge regression comes in. Ridge regression does the same job as OLS but adds a special control that helps prevent the model from becoming too sensitive to any single feature.
Here, we'll walk through two key types of least squares regression, explore how these algorithms fit a line through your data points, and look at how they differ in theory.
Linear regression is a statistical method that predicts numerical values using a linear equation. It models the relationship between a dependent variable and one or more independent variables by fitting a straight line (or plane, in multiple dimensions) through the data points. The model calculates a coefficient for each feature, representing its impact on the result. To get a result, you plug the feature values of your data into the linear equation and compute the predicted value.
To illustrate our concepts, we will use our standard data set that predicts the number of golfers who will visit us on a given day. This data set includes variables such as weather outlook, temperature, humidity, and wind conditions.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='')

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
While not strictly required, standardizing the numerical features first helps us use linear regression (and especially ridge regression) effectively.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Create dataset
data = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny',
                'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny',
                'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71, 81, 74, 76, 78, 82,
                    67, 85, 73, 88, 77, 79, 80, 66, 84],
    'Humidity': [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80, 88, 92, 85, 75, 92,
                 90, 85, 88, 65, 70, 60, 95, 70, 78],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False,
             True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41,
                    14, 34, 29, 49, 36, 57, 21, 23, 41]
}

# Process data
df = pd.get_dummies(pd.DataFrame(data), columns=['Outlook'])
df['Wind'] = df['Wind'].astype(int)

# Split data
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
numerical_cols = ['Temperature', 'Humidity']
ct = ColumnTransformer([('scaler', StandardScaler(), numerical_cols)], remainder='passthrough')

# Transform data
X_train_scaled = pd.DataFrame(
    ct.fit_transform(X_train),
    columns=numerical_cols + [col for col in X_train.columns if col not in numerical_cols],
    index=X_train.index
)
X_test_scaled = pd.DataFrame(
    ct.transform(X_test),
    columns=X_train_scaled.columns,
    index=X_test.index
)
Linear regression predicts numbers by fitting a straight line (or hyperplane) to the data:
- The model finds the best line by making the gaps between the actual values and the line's predicted values as small as possible. This is called “least squares.”
- Each input gets a number (coefficient/weight) that shows how much it changes the final answer. There is also a starting number (intercept/bias) that is used when all inputs are zero.
- To predict a new response, the model takes each input, multiplies it by its coefficient, adds everything together, and then adds the intercept. This gives you the expected response (see the short sketch below).
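As a minimal sketch of that last step, here is how a prediction would be computed by hand. The intercept, coefficients, and new data point below are made-up illustrative values, not numbers fitted to our golf dataset:
import numpy as np
# Hypothetical fitted parameters (illustrative values only)
intercept = 37.0                           # prediction when all inputs are zero
coefficients = np.array([5.2, -3.1, 1.4])  # one weight per feature
# A hypothetical new data point with three feature values
x_new = np.array([0.8, -0.5, 1.0])
# Prediction = intercept + sum of (feature value * its coefficient)
y_pred = intercept + np.dot(coefficients, x_new)
print(f"{y_pred:.2f}")  # 37.0 + 0.8*5.2 + (-0.5)*(-3.1) + 1.0*1.4 = 44.11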
Let's start with ordinary least squares (OLS), the fundamental approach to linear regression. The goal of OLS is to find the line that best fits our data points. We do this by measuring how “wrong” our predictions are compared to the actual values and then finding the line that makes these errors as small as possible. When we say “error,” we mean the vertical distance between each point and our line; in other words, how far our predictions are from reality. Let's first see what happens in the 2D case.
In the 2D case
In the 2D case, we can imagine the linear regression algorithm like this:
Here is the explanation of the above process:
1. We start with a training set, where each row has:
· x: our input feature (the numbers 1, 2, 3, 1, 2)
· y: our target values (0, 1, 1, 2, 3)
2. We can plot these points on a scatterplot, and we want to find the line y = β₀ + β₁x that best fits these points.
3. For any given line (any β₀ and β₁), we can measure how good it is by:
· calculating the vertical distance (d₁, d₂, d₃, d₄, d₅) from each point to the line
· these distances are |y - (β₀ + β₁x)| for each point
4. Our optimization goal is to find β₀ and β₁ that minimize the sum of squared distances: d₁² + d₂² + d₃² + d₄² + d₅². In vector notation, this is written as ||y - Xβ||², where X = [1 x] contains our input data (with a column of 1s for the intercept) and β = [β₀ β₁]ᵀ contains our coefficients.
5. The optimal solution has a closed form: β = (XᵀX)⁻¹Xᵀy. Calculating this, we obtain β₀ = -0.196 (intercept) and β₁ = 0.761 (slope).
This vector notation makes the formula more compact and shows that we are actually working with matrices and vectors rather than individual points. We will see more details of our calculation below in the multidimensional case.
In the multidimensional case (golf dataset)
Again, the goal of OLS is to find coefficients (β) that minimize the squared differences between our predictions and the actual values. Mathematically, we express this as minimizing ||y - Xβ||², where X is our data matrix and y contains our target values.
The training process follows these key steps:
Training step
1. Prepare our data matrix X. This involves adding a column of ones to account for the bias/intercept term (β₀).
2. Instead of iteratively searching for the best coefficients, we can calculate them directly using the normal equation:
β = (XᵀX)⁻¹Xᵀy
where:
· β is the vector of estimated coefficients,
· X is the data matrix of the dataset (including a column of ones for the intercept),
· y is the vector of target values (labels),
· Xᵀ represents the transpose of the matrix X,
· ⁻¹ represents the inverse of the matrix.
Let's break this down (a minimal sketch of these steps follows the list):
a. We multiply Xᵀ (X transpose) by X, giving us a square matrix.
b. We calculate the inverse of this matrix.
c. We calculate Xᵀy.
d. We multiply (XᵀX)⁻¹ and Xᵀy to obtain our coefficients.
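As a minimal NumPy sketch of these steps (assuming the X_train_scaled and y_train variables from the preprocessing code above), the normal equation can be computed directly. A pseudo-inverse is used here because the one-hot Outlook columns always sum to the column of ones, so XᵀX is singular for this dataset; this is exactly the kind of numerical problem that ridge regression addresses later:
import numpy as np
# a. Build the design matrix: a column of ones (for the intercept) plus the scaled features
X_mat = np.column_stack([np.ones(len(X_train_scaled)), X_train_scaled.to_numpy(dtype=float)])
y_vec = y_train.to_numpy(dtype=float)
# b.-d. Normal equation: beta = (X^T X)^(-1) X^T y
XtX = X_mat.T @ X_mat          # square matrix X^T X
XtX_inv = np.linalg.pinv(XtX)  # pseudo-inverse, since XtX is singular for this data
Xty = X_mat.T @ y_vec          # X^T y
beta = XtX_inv @ Xty           # estimated coefficients; first entry is the intercept
print(beta)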
Test step
Once we have our coefficients, making predictions is straightforward: we simply multiply our new data point by these coefficients to get the prediction.
In matrix notation, for a new data point x*, the prediction ŷ* is calculated as
ŷ* = x*β = (1, x₁, x₂, …, xₚ) × (β₀, β₁, β₂, …, βₚ)ᵀ,
where β₀ is the intercept and β₁ through βₚ are the coefficients of each feature.
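Continuing the NumPy sketch above (reusing beta and X_test_scaled), a prediction for a single point would look like this; the choice of the first test row is just for illustration:
# Prepend a 1 for the intercept, then take the dot product with the coefficients
x_star = np.concatenate([[1.0], X_test_scaled.iloc[0].to_numpy(dtype=float)])
y_star = x_star @ beta
print(f"Predicted number of players: {y_star:.2f}")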
Evaluation step
We can apply the same process to all the test points. For our data set, this gives the final predictions together with their RMSE.
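Here is a minimal continuation of the sketch above that computes all test predictions and the RMSE by hand (assuming X_test_scaled, y_test, and beta from the earlier snippets):
# Build the test design matrix the same way as for training
X_test_mat = np.column_stack([np.ones(len(X_test_scaled)), X_test_scaled.to_numpy(dtype=float)])
y_pred_ols = X_test_mat @ beta
# Root mean squared error between actual and predicted player counts
rmse_ols = np.sqrt(np.mean((y_test.to_numpy(dtype=float) - y_pred_ols) ** 2))
print(f"OLS RMSE: {rmse_ols:.4f}")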
Now, let's consider ridge regression, which builds on OLS and addresses some of its limitations. The key idea of ridge regression is that sometimes the optimal OLS solution involves very large coefficients, which can cause overfitting.
Ridge regression adds a penalty term (λ||β||²) to the objective function. This term discourages large coefficients by adding their squared values to what we are minimizing. The complete objective becomes:
min ||y - Xβ||² + λ||β||²
The λ (lambda) parameter controls how much we penalize large coefficients. When λ = 0, we obtain OLS; as λ increases, the coefficients shrink toward zero (but never quite reach it).
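A quick way to see this shrinkage on our data is to fit scikit-learn's Ridge (whose alpha parameter plays the role of λ) with increasing values and watch the size of the coefficient vector fall; this sketch assumes X_train_scaled and y_train from the preprocessing code above:
import numpy as np
from sklearn.linear_model import Ridge
# As alpha (lambda) grows, the norm of the coefficient vector shrinks toward zero
for alpha in [0.01, 1, 100, 10000]:
    ridge = Ridge(alpha=alpha).fit(X_train_scaled, y_train)
    print(f"alpha={alpha:>8}: ||coefficients|| = {np.linalg.norm(ridge.coef_):.3f}")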
Training step
- Like OLS, prepare our data matrix X. This involves adding a column of ones to account for the intercept term (β₀).
- The ridge training process follows a similar pattern to OLS, but with one modification. The closed-form solution becomes:
β = (XᵀX + λI)⁻¹Xᵀy
where:
· I is the identity matrix (with the first diagonal element, corresponding to β₀, sometimes set to 0 to exclude the intercept from regularization in some implementations),
· λ is the regularization value,
· y is the vector of observed values of the dependent variable,
· other symbols remain as defined in the OLS section.
Let's break this down (a minimal sketch follows the list):
a. We add λI to XᵀX. The value of λ can be any positive number (say 0.1).
b. We calculate the inverse of this matrix. The benefits of adding λI to XᵀX before inverting are:
· it makes the matrix invertible, even if XᵀX is not (solving a key numerical problem with OLS)
· it shrinks the coefficients in proportion to λ
c. We multiply (XᵀX + λI)⁻¹ and Xᵀy to obtain our coefficients.
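Reusing X_mat, XtX, and Xty from the OLS sketch above, the ridge closed form is only a small change. The value λ = 0.1 is just the example value mentioned above, and the full identity matrix is used for simplicity; as noted, some implementations zero out its first element so the intercept is not penalized:
# Ridge closed form: add lambda to the diagonal of X^T X before inverting
lam = 0.1
I_mat = np.eye(X_mat.shape[1])  # full identity; some implementations set I_mat[0, 0] = 0
beta_ridge = np.linalg.inv(XtX + lam * I_mat) @ Xty
print(beta_ridge)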
Test step
The prediction process remains the same as OLS: multiply the new data points by the coefficients. The difference lies in the coefficients themselves, which tend to be smaller and more stable than their OLS counterparts.
Evaluation step
We can apply the same process to all the test points. For our data set, this again gives the final predictions together with their RMSE.
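Continuing the sketch (with X_test_mat and y_test from the OLS evaluation snippet), the ridge RMSE can be computed the same way:
y_pred_ridge = X_test_mat @ beta_ridge
rmse_ridge = np.sqrt(np.mean((y_test.to_numpy(dtype=float) - y_pred_ridge) ** 2))
print(f"Ridge RMSE: {rmse_ridge:.4f}")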
Final Comments: Choosing between OLS and Ridge
The choice between OLS and Ridge often depends on your data:
- Use OLS when you have well-behaved data with low multicollinearity and enough samples (relative to features)
- Use Ridge when you have:
– Many features (relative to samples)
– Multicollinearity in your features.
– Signs of overfitting with OLS
With Ridge, you'll have to choose λ. Start with a range of values (often logarithmically spaced) and choose the one that gives the best validation performance.
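Here is a minimal sketch of such a sweep with scikit-learn, assuming the preprocessed X_train_scaled, X_test_scaled, y_train, and y_test from above. Because this toy dataset has no separate validation split, the test set stands in for one here; in practice you would use a proper validation set or cross-validation:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import root_mean_squared_error
# Try logarithmically spaced regularization strengths and keep the best one
alphas = np.logspace(-2, 2, 9)   # 0.01, ..., 100
best_alpha, best_rmse = None, np.inf
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train_scaled, y_train)
    rmse = root_mean_squared_error(y_test, model.predict(X_test_scaled))
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse
print(f"Best alpha: {best_alpha:.2f}, RMSE: {best_rmse:.4f}")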
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.metrics import root_mean_squared_error

# Create dataset
data = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny',
                'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny',
                'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71, 81, 74, 76, 78, 82,
                    67, 85, 73, 88, 77, 79, 80, 66, 84],
    'Humidity': [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80, 88, 92, 85, 75, 92,
                 90, 85, 88, 65, 70, 60, 95, 70, 78],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False,
             True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41,
                    14, 34, 29, 49, 36, 57, 21, 23, 41]
}

# Process data
df = pd.get_dummies(pd.DataFrame(data), columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df = df[['sunny', 'overcast', 'rain', 'Temperature', 'Humidity', 'Wind', 'Num_Players']]

# Split data
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
numerical_cols = ['Temperature', 'Humidity']
ct = ColumnTransformer([('scaler', StandardScaler(), numerical_cols)], remainder='passthrough')

# Transform data
X_train_scaled = pd.DataFrame(
    ct.fit_transform(X_train),
    columns=numerical_cols + [col for col in X_train.columns if col not in numerical_cols],
    index=X_train.index
)
X_test_scaled = pd.DataFrame(
    ct.transform(X_test),
    columns=X_train_scaled.columns,
    index=X_test.index
)
# Initialize and train the model
#model = LinearRegression() # Option 1: OLS Regression
model = Ridge(alpha=0.1) # Option 2: Ridge Regression (alpha is the regularization strength, equivalent to λ)
# Fit the model
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
# Calculate and print RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
# Additional information about the model
print("\nModel Coefficients:")
print(f"Intercept : {model.intercept_:.2f}")
for feature, coef in zip(X_train_scaled.columns, model.coef_):
    print(f"{feature:13}: {coef:.2f}")