This article discusses time-based data visualization in Python with the Pandas library. Time series data is a treasure trove of insights, and with skillful resampling you can transform raw temporal data into visually compelling narratives. Whether you are a data enthusiast, scientist, analyst, or simply curious to unravel the hidden stories in time-based data, this article will give you the knowledge and tools to revamp your data visualization skills. So, let's start discussing Pandas resampling techniques and turning data into informative and engaging temporal masterpieces.
While working with time-based data visualization, data resampling is crucial and very useful. It lets you control the granularity of your data to extract meaningful insights and create visually appealing representations. Depending on your requirements, you can upsample your time series to a finer frequency or downsample it to a coarser one.
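As a minimal sketch of both directions (the hourly series below is invented purely for illustration), downsampling aggregates data to a coarser frequency, while upsampling expands it to a finer one and fills in the new slots:

import pandas as pd

# An invented hourly series, just to show the mechanics
hourly = pd.Series(
    range(48),
    index=pd.date_range("2023-01-01", periods=48, freq="H"),
)

# Downsample: hourly -> daily means (coarser granularity)
daily = hourly.resample("D").mean()

# Upsample: hourly -> 30-minute slots, interpolating the new points
half_hourly = hourly.resample("30T").interpolate()

print(daily.head())
print(half_hourly.head())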
Basically, the two main purposes of data resampling are mentioned below:
- Granularity adjustment: Resampling lets you change the time intervals at which data points are collected or aggregated, so you keep the vital information instead of the noise. This eliminates noisy data points and makes the data more manageable for visualization.
- Alignment: It also helps align data from multiple sources recorded at different time intervals, ensuring consistency when creating visualizations or performing analysis (a short sketch follows this list).
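To illustrate the alignment use case with a minimal, hedged sketch (both feeds below are made up for illustration), resampling an hourly feed to daily frequency lets it be joined cleanly with a daily feed:

import pandas as pd

# Two invented feeds recorded at different frequencies
hourly_sensor = pd.Series(
    range(72),
    index=pd.date_range("2023-01-01", periods=72, freq="H"),
    name="SensorAvg",
)
daily_report = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.date_range("2023-01-01", periods=3, freq="D"),
    name="Report",
)

# Bring the hourly feed onto the daily grid, then join the two sources
aligned = pd.concat([hourly_sensor.resample("D").mean(), daily_report], axis=1)
print(aligned)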
For example,
Suppose you have daily stock price data for a particular company from a stock exchange, and your goal is to visualize long-term trends without including noisy data points in your analysis. To do this, you can resample the daily data at a monthly frequency by taking the average closing price of each month. As a result, the data becomes smaller for visualization purposes, and its analysis can provide better insights.
import pandas as pd
# Sample daily stock price data
data = {
'Date': pd.date_range(start="2023-01-01", periods=365, freq='D'),
'StockPrice': [100 + i + 10 * (i % 7) for i in range(365)]
}
df = pd.DataFrame(data)
# Resample to monthly frequency
monthly_data = df.resample('M', on='Date').mean()
print(monthly_data.head())
In the example above, we resampled the daily data into monthly intervals and calculated the average closing price for each month, resulting in a smoother, less noisy representation of the stock price data that makes it easier to identify long-term trends and patterns for decision making.
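To see the smoothing effect directly, a quick comparison plot (a minimal sketch reusing the df and monthly_data objects from the snippet above, and assuming matplotlib is available) could look like this:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
# Raw daily prices: dense and noisy
plt.plot(df['Date'], df['StockPrice'], label="Daily", alpha=0.5)
# Monthly averages: a smoother long-term trend
plt.plot(monthly_data.index, monthly_data['StockPrice'], label="Monthly mean", linewidth=2)
plt.title('Daily vs. Monthly Average Stock Price')
plt.legend()
plt.show()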
When working with time series data, the main parameter for resampling is frequency, which you must select correctly for practical and insightful visualizations. Basically, there is a balance between granularity, which means how detailed the data is, and clarity, which means how well data patterns are revealed.
For example,
Imagine you have temperature data recorded every minute for a year, and you have to visualize the annual temperature trend. Using minute-level data would result in an overly dense and cluttered graph. On the other hand, if you aggregate the data to annual averages, you could lose valuable information.
# Sample minute-level temperature data
data = {
'Timestamp': pd.date_range(start="2023-01-01", periods=525600, freq='T'),
'Temperature': [20 + 10 * (i % 1440) / 1440 for i in range(525600)]
}
df = pd.DataFrame(data)
# Resample to different frequencies
daily_avg = df.resample('D', on='Timestamp').mean()
monthly_avg = df.resample('M', on='Timestamp').mean()
yearly_avg = df.resample('Y', on='Timestamp').mean()
print(daily_avg.head())
print(monthly_avg.head())
print(yearly_avg.head())
In this example, we resampled minute-level temperature data into daily, monthly, and annual averages. Depending on your analytical or visualization goals, you can choose the level of detail that best suits your purpose. Daily averages reveal daily temperature patterns, while annual averages provide a high-level overview of annual trends.
By selecting the optimal resampling frequency, you can balance the amount of detail in your data with the clarity of your visualizations, ensuring that your audience can easily discern the patterns and insights you want to convey.
When working with time-based data, it is essential to understand various aggregation methods and techniques. These methods allow you to effectively summarize and analyze your data, revealing different aspects of your time-based information. Standard aggregation methods include calculating sums and means or applying custom functions.
For example,
Consider that you have a data set containing a year of daily sales data for a retail store, and you want to analyze the annual sales trend. To do this, you can use aggregation methods to calculate the total sales for each month and year.
# Sample daily sales data
data = {
'Date': pd.date_range(start="2023-01-01", periods=365, freq='D'),
'Sales': [1000 + i * 10 + 5 * (i % 30) for i in range(365)]
}
df = pd.DataFrame(data)
# Calculate monthly and yearly sales with the aggregation method
monthly_totals = df.resample('M', on='Date').sum()
yearly_totals = df.resample('Y', on='Date').sum()
print(monthly_totals.head())
print(yearly_totals.head())
In this example, we resample the daily sales data into monthly and yearly totals using the sum() aggregation method. By doing this, you can analyze the sales trend at different levels of granularity. Monthly totals provide information on seasonal variations, while annual totals provide a high-level overview of annual performance.
Depending on your specific analysis requirements, you can also use other aggregation methods, such as calculating means and medians, or apply custom functions that are meaningful for the distribution of your data set. These methods allow you to extract valuable information from your time-based data by summarizing it in a way that makes sense for your analysis or visualization goals.
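As a hedged sketch of those alternatives (reusing the sales df from the snippet above), .agg() lets you apply several built-in and custom functions in a single pass; sales_range here is a made-up helper for illustration:

# A custom aggregation: spread between the best and worst day in each month
def sales_range(x):
    return x.max() - x.min()

monthly_stats = df.resample('M', on='Date')['Sales'].agg(['sum', 'mean', 'median', sales_range])
print(monthly_stats.head())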
Handling missing data is a critical aspect of working with time series, ensuring that your visualizations and analyses remain accurate and informative even when dealing with gaps in your data.
For example,
Imagine you are working with a historical temperature data set, but some days are missing temperature readings due to equipment malfunctions or data collection errors. You must handle these missing values to create meaningful visualizations and maintain data integrity.
import numpy as np
# Sample temperature data with missing values
data = {
'Date': pd.date_range(start="2023-01-01", periods=365, freq='D'),
'Temperature': [25 + np.random.randn() * 5 if np.random.rand() > 0.2 else np.nan for _ in range(365)]
}
df = pd.DataFrame(data)
# Forward-fill missing values (fill with the previous day's temperature)
df['Temperature'] = df['Temperature'].ffill()
# Visualize the temperature data
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Temperature'], label="Temperature", color="blue")
plt.title('Daily Temperature Over Time')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.show()
Output:
(Plot: Daily Temperature Over Time; image by author)
In the example above, we first simulate missing temperature values (about 20% of the data) and then use the forward-fill (ffill) method to fill the gaps, meaning that each missing temperature is replaced by the previous day's temperature.
Handling missing data this way ensures that your visualizations accurately represent the underlying trends and patterns in the time series, preventing gaps from distorting your insights or misleading your audience. Other strategies, such as interpolation or backward fill, can be employed depending on the nature of the data and the research question.
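For reference, here is a minimal sketch of those alternatives; it assumes a Temperature column that still contains NaN gaps (the df above has already been forward-filled, so you would re-create it first):

# Linear interpolation: estimate each missing value from its neighbors
interpolated = df['Temperature'].interpolate(method='linear')

# Backward fill: replace each missing value with the next observed one
backfilled = df['Temperature'].bfill()

print(interpolated.head())
print(backfilled.head())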
Data resampling in Pandas lets you visualize trends and patterns in sequential or time-based data, helping you gather insights and communicate results effectively. As a result, you can create clear and informative visual representations of your data that highlight its different components, including trends, seasonality, and irregular patterns (possibly noise in the data).
For example,
Suppose you have a data set containing daily website traffic collected over the past few years. Your objective is to visualize the overall traffic trend over those years, identify seasonal patterns, and detect irregular spikes or drops in traffic.
# Sample daily website traffic data
data = {
'Date': pd.date_range(start="2019-01-01", periods=1095, freq='D'),
'Visitors': [500 + 10 * ((i % 365) - 180) + 50 * (i % 30) for i in range(1095)]
}
df = pd.DataFrame(data)
# Create a line plot to visualize the trend
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Visitors'], label="Daily Visitors", color="blue")
plt.title('Website Traffic Over Time')
plt.xlabel('Date')
plt.ylabel('Visitors')
plt.grid(True)
# Add seasonal decomposition plot
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df['Visitors'], model="additive", period=365)
result.plot()
plt.show()
Output:
(Plots: Website Traffic Over Time and its seasonal decomposition; images by author)
In the example above, we first created a line chart to visualize daily website traffic over time. This graph shows the overall growth and any irregular patterns in the data set. Then, to decompose the data into its components, we used the seasonal decomposition technique from the statsmodels library, which separates the trend, seasonality, and residual components.
This way, you can effectively communicate website traffic trends, seasonality, and anomalies to stakeholders, improving your ability to derive important insights from time-based data and turn them into data-driven decisions.
Colab notebook link: https://colab.research.google.com/drive/19oM7NMdzRgQrEDfRsGhMavSvcHx79VDK#scrollTo=nHg3oSjPfS-Y
In this article, we discussed time-based data resampling in Python. To conclude, let's summarize the important points covered:
- Time-based resampling is a powerful technique for transforming and summarizing time series data to gain better insights for decision making.
- Careful selection of resampling frequency is essential to balance granularity and clarity in data visualization.
- Aggregation methods like sum, mean, and custom functions help reveal different aspects of time-based data.
- Effective visualization techniques help identify trends, seasonalities, and irregular patterns, facilitating clear communication of findings.
- Real-world use cases in finance, weather, and website traffic analysis demonstrate the broad impact of time-based resampling.
Aryan Garg is a B.Tech. Electrical Engineering student, currently in the final year of his degree. His interest lies in web development and machine learning. He has pursued this interest and is eager to work further in these directions.