Here we go with another post on PySpark. I've enjoyed writing about this topic because I feel we're lacking good blog posts on PySpark, especially when it comes to machine learning with MLlib. By the way, that's Spark's Machine Learning library, designed to work with Big Data in a parallelized environment.
I can say that the Spark documentation is excellent. It's super organized, and the examples are easy to follow. But working with machine learning in Spark is not the friendliest thing to do.
In this post, I work on a PCA model to help me create a diamond classification, and along the way I faced a couple of challenges that we will see in the next few lines.
I have already written about PCA before and how it is useful for dimensionality reduction as well as for creating rankings. However, this is the first time I've done this using Spark, with the goal of reproducing the technique in a Big Data environment.
Let's see the result.
Let's start our coding with the modules to import.
from pyspark.sql.functions import col, sum, when, mean, countDistinct
from pyspark.sql import functions as F
from pyspark.ml.feature import PCA
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.window import Window
Data set
The data set that will be used in this exercise is diamonds, from the ggplot2 package, licensed under Creative Commons 4.0.
Here, I load it from the Databricks sample data sets and remove two known outliers from one of the variables. PCA is sensitive to outliers: they tend to dominate a component because of their very large, distorted variance.
# Point to the file path
path = '/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv'

# Load the data
df = spark.read.csv(path, header=True, inferSchema=True)

# Remove two known outliers in the y variable
df = df.filter(col('y') < 30)
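As a quick sanity check, one could inspect the loaded DataFrame to confirm that the schema was inferred correctly and that the outliers are gone. This step is my own addition, not part of the original walkthrough:

df.printSchema()                    # confirm numeric columns were inferred as numbers
print(df.count())                   # row count after filtering out the outliers
df.select('carat', 'depth', 'table', 'y').describe().show()   # basic summary statistics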
Next, since PCA is a technique that works on numerical values, I have chosen to work with the carat, table and depth variables from the data. I…
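The text breaks off here, but as a sketch of where this setup is heading, the three chosen columns can be assembled into a single feature vector and passed to MLlib's PCA. The column names come from the data set itself; the choice of k=2 components and the output column names are my own illustrative assumptions, not necessarily what the author used.

from pyspark.ml.feature import VectorAssembler, PCA

# Assemble the chosen numerical columns into one feature vector (illustrative)
assembler = VectorAssembler(inputCols=['carat', 'table', 'depth'], outputCol='features')
df_features = assembler.transform(df)

# Fit a PCA model; k=2 is an assumed value, chosen only for illustration
pca = PCA(k=2, inputCol='features', outputCol='pca_features')
pca_model = pca.fit(df_features)

# How much variance each principal component explains
print(pca_model.explainedVariance)

# Project the rows onto the principal components
pca_model.transform(df_features).select('pca_features').show(5, truncate=False)

In practice, the variables would probably be standardized first (for example with StandardScaler), since carat, table and depth sit on very different scales and PCA is driven by variance.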