Exploratory Data Analysis (or EDA) is a central phase within the Data Analysis Process, emphasizing a thorough investigation of the internal details and characteristics of a data set.
Its main objective is to discover underlying patterns, understand the structure of the data set and identify possible anomalies or relationships between variables.
When performing EDA, data professionals verify the quality of the data, ensuring that subsequent analyses are based on accurate and insightful information and reducing the likelihood of errors at later stages.
So let's walk through the basic steps for carrying out a good EDA in our next Data Science project.
I'm pretty sure you've already heard the phrase:
Garbage in, garbage out
The quality of input data is always the most important factor for any successful data project.
Unfortunately, most data is dirty at the outset. Through the process of Exploratory Data Analysis, a data set that is almost usable can be transformed into one that is completely usable.
To be clear, EDA is not a magic solution that can purify any data set. Still, numerous EDA strategies are effective at addressing typical problems encountered in data sets.
So… let's learn the most basic steps, following Ayodele Oluleye's book Exploratory Data Analysis with Python Cookbook.
Step 1: Data Collection
Any data project begins with the data itself. In this first step, data is collected from various sources for further analysis.
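As a minimal sketch, pandas can load data from several common sources in a single line each; the file names below are hypothetical placeholders for your actual sources:

```python
import pandas as pd

# Hypothetical file names: replace them with your actual sources
customers = pd.read_csv("customers.csv")  # flat file
sales = pd.read_excel("sales.xlsx")       # spreadsheet (requires openpyxl)
events = pd.read_json("events.json")      # JSON export
```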
Step 2: Summary Statistics
In data analysis, handling tabular data is quite common. During the analysis of such data, it is often necessary to obtain quick information about the patterns and distribution of the data.
These initial insights serve as the basis for further exploration and in-depth analysis and are known as summary statistics.
They provide a concise overview of the distribution and patterns of the data set, summarized through metrics such as mean, median, mode, variance, standard deviation, range, percentiles, and quartiles.
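As a quick sketch, here is how these metrics can be computed with pandas on a small made-up data set:

```python
import pandas as pd

# Small made-up data set for illustration
df = pd.DataFrame({"age": [23, 35, 31, 46, 29],
                   "income": [32000, 58000, 47000, 71000, 39000]})

# describe() returns count, mean, std, min, quartiles, and max in one call
print(df.describe())

# Individual summary statistics
print(df["age"].median())                        # median
print(df["age"].mode())                          # mode
print(df["age"].var())                           # variance
print(df["income"].quantile([0.25, 0.5, 0.75]))  # quartiles
```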
Step 3: Data Preparation for EDA
Before beginning our exploration, it is usually necessary to prepare the data for further analysis. Data preparation involves transforming, aggregating, or cleaning data using the Python pandas library to meet the needs of your analysis.
This step adapts to the structure of the data and may include grouping, aggregating, merging, sorting, categorizing, and dealing with duplicates.
In Python, the pandas library makes it easy to accomplish this task through its various modules.
The tabular data preparation process does not follow a universal method; instead, it is determined by the specific characteristics of our data, including its rows, columns, data types, and the values they contain.
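As a sketch of some of these operations (the tables and column names below are invented for illustration):

```python
import pandas as pd

# Invented transactional data for illustration
sales = pd.DataFrame({"customer_id": [1, 2, 1, 3, 2, 2],
                      "amount": [120.0, 80.5, 120.0, 42.0, 95.0, 80.5]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "city": ["Barcelona", "Madrid", "Valencia"]})

sales = sales.drop_duplicates()                    # deal with duplicates
merged = sales.merge(customers, on="customer_id")  # merge the two tables
per_city = merged.groupby("city")["amount"].sum()  # group and aggregate
print(per_city.sort_values(ascending=False))       # sort the result
```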
Step 4: Data Visualization
Visualization is a core component of EDA, making complex relationships and trends within the data set easily understandable.
Using the right charts can help us identify trends within a large data set and find hidden patterns or outliers. Python offers several libraries for data visualization, including Matplotlib and Seaborn, among others.
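As a quick sketch, the snippet below uses the `tips` example data set bundled with Seaborn to plot a distribution and a relationship side by side:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# "tips" is one of the small example data sets bundled with Seaborn
tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(tips["total_bill"], ax=axes[0])                     # distribution
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[1])  # relationship
plt.tight_layout()
plt.show()
```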
Step 5: Variable Analysis
Variable analysis can be univariate, bivariate, or multivariate. Each provides insight into the distribution of, and correlations between, the variables in the data set. The techniques vary depending on the number of variables analyzed:
Univariate analysis
The primary goal of univariate analysis is to examine each variable within our data set on its own. During this analysis, we can discover information such as median, mode, maximum, range, and outliers.
This type of analysis is applicable to both categorical and numerical variables.
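A brief sketch with made-up data, examining one numerical and one categorical variable in isolation:

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"salary": [30000, 45000, 38000, 52000, 120000],
                   "department": ["IT", "HR", "IT", "Sales", "IT"]})

# Numerical variable: distribution statistics
print(df["salary"].describe())
print("Range:", df["salary"].max() - df["salary"].min())

# Categorical variable: frequency of each category
print(df["department"].value_counts())
```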
Bivariate analysis
Bivariate analysis aims to reveal information between two chosen variables and focuses on understanding the distribution and relationship between these two variables.
Since we analyze two variables at the same time, this type of analysis can be more complicated. It can encompass three different pairings of variables: numerical-numerical, numerical-categorical, and categorical-categorical.
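The sketch below shows one common technique for each pairing, again using Seaborn's bundled `tips` data set:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")

# Numerical-numerical: correlation coefficient and scatter plot
print(tips["total_bill"].corr(tips["tip"]))
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()

# Numerical-categorical: distribution of a number within each category
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()

# Categorical-categorical: cross-tabulation of counts
print(pd.crosstab(tips["day"], tips["smoker"]))
```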
Multivariate analysis
A common challenge with large data sets is the simultaneous analysis of multiple variables. Although univariate and bivariate analysis methods provide valuable information, this is typically not sufficient to analyze data sets containing multiple variables (typically more than five).
This problem of managing high-dimensional data, often called the curse of dimensionality, is well documented. Having a large number of variables can be advantageous, as it allows more knowledge to be extracted. At the same time, this advantage can work against us because of the limited number of techniques available for analyzing or visualizing multiple variables at the same time.
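One common workaround is to visualize many pairwise relationships at once. A sketch using Seaborn's bundled `iris` data set:

```python
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

# Correlation heatmap across all numerical variables at once
sns.heatmap(iris.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

# Pair plot: every pairwise scatter plot plus per-variable distributions
sns.pairplot(iris, hue="species")
plt.show()
```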
Step 6: Time Series Data Analysis
This step focuses on examining data points collected at regular time intervals. Time series data concerns values that change over time: our data set consists of a group of data points recorded at regular intervals.
When we analyze time series data, we can usually discover patterns or trends that repeat over time and exhibit temporal seasonality. Key components of time series data include trends, seasonal variations, cyclical variations, and irregular variations or noise.
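A minimal sketch with a synthetic daily series (the weekly pattern is fabricated for illustration); `seasonal_decompose` from statsmodels is one common way to separate these components:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic daily series: a weekly seasonal pattern plus random noise
idx = pd.date_range("2023-01-01", periods=120, freq="D")
rng = np.random.default_rng(42)
values = 10 + np.sin(2 * np.pi * idx.dayofweek / 7) + rng.normal(0, 0.3, len(idx))
ts = pd.Series(values, index=idx)

print(ts.resample("W").mean())      # downsample to weekly means
print(ts.rolling(window=7).mean())  # 7-day rolling average smooths the noise

# Separate trend, seasonal, and residual (noise) components
result = seasonal_decompose(ts, model="additive", period=7)
result.plot()
plt.show()
```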
Step 7: Handling Outliers and Missing Values
Outliers and missing values can distort analysis results if not properly addressed, which is why we should always dedicate a phase to handling them. Identifying, removing, or replacing these data points before the analysis begins is crucial to maintaining the integrity of the results.
- Outliers are data points that deviate significantly from the rest, usually taking unusually high or low values.
- Missing values are the absence of a data point for a specific variable or observation.
A critical initial step in addressing missing values and outliers is to understand why they are present in the data set. This understanding usually guides the selection of the most appropriate method to address them. Additional factors to consider are the characteristics of the data and the specific analysis that will be performed.
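As a sketch with made-up values, one common approach is median imputation for missing values and the 1.5 × IQR rule for flagging outliers:

```python
import numpy as np
import pandas as pd

# Made-up values: one missing entry and one suspiciously large one
df = pd.DataFrame({"age": [25, 31, 28, np.nan, 34, 29, 95]})

# Missing values: count them, then impute (here, with the median) or drop them
print(df["age"].isna().sum())
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
print(df[mask])  # only the row with age 95 is flagged
```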
EDA not only improves data set clarity but also allows data professionals to navigate the curse of dimensionality by providing strategies for managing data sets with numerous variables.
Through these meticulous steps, EDA with Python equips analysts with the tools necessary to extract meaningful insights from data, laying a solid foundation for all subsequent data analysis efforts.
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and currently works in the field of Data Science applied to human mobility. He is a part-time content creator focused on data science and technology. You can contact him on LinkedIn, Twitter, or Medium.