Here we go with another post on PySpark. I've enjoyed writing about this topic because I feel we're lacking good blog posts on PySpark, especially when it comes to machine learning with MLlib. By the way, that's Spark's Machine Learning library, designed to work with Big Data in a parallelized environment.
I can say that the Spark documentation is excellent. It's super organized, and the examples are easy to follow. But working with machine learning in Spark is not the friendliest thing to do.
In this post, I work on a PCA model to help me create a diamond classification, and along the way I faced a couple of challenges that we will see in the next few lines.
I have already written about PCA before and how it is useful for dimensionality reduction as well as for creating rankings. However, this is the first time I've done this using Spark, with the goal of reproducing the technique in a Big Data environment.
Let's see the result.
Let's start our coding with the modules to import.
from pyspark.sql.functions import col, sum, when, mean, countDistinct
from pyspark.sql import functions as F
from pyspark.ml.feature import PCA
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.window import Window
Data set
The data set that will be used in this exercise is diamonds, from the ggplot2 package, licensed under Creative Commons 4.0.
Here, I load it from the Databricks sample data sets and remove two known outliers from one of the variables. PCA is sensitive to outliers: they tend to dominate a component because of their very large, distorted variance.
# Point to the file path
path = '/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv'

# Load the data
df = spark.read.csv(path, header=True, inferSchema=True)

# Remove two known outliers in the y variable
df = df.filter(col('y') < 30)
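As a quick sanity check, one could inspect the loaded DataFrame to confirm that the schema was inferred correctly and that the outliers are gone. This step is my own addition, not part of the original walkthrough:

df.printSchema()                    # confirm numeric columns were inferred as numbers
print(df.count())                   # row count after filtering out the outliers
df.select('carat', 'depth', 'table', 'y').describe().show()   # basic summary statistics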
Next, since PCA is a technique that works on numerical values, I have chosen to work with the carat, table and depth variables from the data. I…
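The text breaks off here, but as a sketch of where this setup is heading, the three chosen columns can be assembled into a single feature vector and passed to MLlib's PCA. The column names come from the data set itself; the choice of k=2 components and the output column names are my own illustrative assumptions, not necessarily what the author used.

from pyspark.ml.feature import VectorAssembler, PCA

# Assemble the chosen numerical columns into one feature vector (illustrative)
assembler = VectorAssembler(inputCols=['carat', 'table', 'depth'], outputCol='features')
df_features = assembler.transform(df)

# Fit a PCA model; k=2 is an assumed value, chosen only for illustration
pca = PCA(k=2, inputCol='features', outputCol='pca_features')
pca_model = pca.fit(df_features)

# How much variance each principal component explains
print(pca_model.explainedVariance)

# Project the rows onto the principal components
pca_model.transform(df_features).select('pca_features').show(5, truncate=False)

In practice, the variables would probably be standardized first (for example with StandardScaler), since carat, table and depth sit on very different scales and PCA is driven by variance.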