Data preparation and exploratory analysis
Now that we have outlined our approach, let's take a look at our data and see what kind of features we are working with.
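As a point of reference, here is a minimal sketch of loading and summarizing the data (the file path is a placeholder; substitute wherever the raw deliveries CSV actually lives):
import pandas as pd

# hypothetical path to the raw deliveries data -- adjust as needed
df = pd.read_csv('datasets/deliveries.csv')

# high-level summary: row count, dtypes, and non-null counts per column
df.info()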
From the above, we see that our data contains ~197,000 deliveries, with a mix of numerical and non-numerical features. None of the features are missing a large percentage of values (the lowest non-null count is ~181,000), so we likely won't have to drop any feature entirely.
Let's check whether our data contains duplicate deliveries, and whether there are any observations for which we cannot compute the delivery time.
print(f"Number of duplicates: {df.duplicated().sum()} \n")print(pd.DataFrame({'Missing Count': df(('created_at', 'actual_delivery_time')).isna().sum()}))
We see that all deliveries are unique. However, there are 7 deliveries missing a value for 'actual_delivery_time', which means we cannot calculate the delivery duration for these orders. Since there are only a handful of these, we'll drop these observations from our data.
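A minimal sketch of that cleanup, keeping only rows where both timestamps are present (assuming the DataFrame is named df as above):
# drop the handful of rows missing 'created_at' or 'actual_delivery_time'
df = df.dropna(subset=['created_at', 'actual_delivery_time'])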
Next, we create our prediction target. We want to predict the delivery duration (in seconds), which is the time elapsed between when the customer placed the order ('created_at') and when they received the order ('actual_delivery_time').
# convert columns to datetime
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'], utc=True)

# create prediction target
df['seconds_to_delivery'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds()
The last thing we'll do before splitting our train/test data is check for missing values. We already saw the non-null counts for each feature above, but let's look at the proportions to get a better picture.
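A quick way to see those proportions (a small sketch; the formatting is ours):
# percentage of missing values per column, largest first
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(2))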
We see that the marketplace features ('total_onshift_dashers', 'total_busy_dashers', 'total_outstanding_orders') have the highest percentage of missing values (~8% missing). The feature with the second-highest missing rate is 'store_primary_category' (~2%). All other features are missing <1% of their values.
Since none of the features have a high missing count, we won't drop any of them. Later, we'll analyze the feature distributions to help us decide how to properly handle the missing observations for each feature.
But first, let's split our train/test data. We'll proceed with an 80/20 split, and write the test data to a separate file that we won't touch until final model evaluation.
from sklearn.model_selection import train_test_split
import os

# shuffle
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)
# split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# write test data to separate file
directory = 'datasets'
file_name = 'test_data.csv'
file_path = os.path.join(directory, file_name)
os.makedirs(directory, exist_ok=True)
test_df.to_csv(file_path, index=False)
Now, let's dig into the details of our training data. We'll define our numerical and categorical features up front, to make it clear which columns are being referred to in subsequent exploratory steps.
categorical_feats = [
    'market_id',
    'store_id',
    'store_primary_category',
    'order_protocol'
]

numeric_feats = [
    'total_items',
    'subtotal',
    'num_distinct_items',
    'min_item_price',
    'max_item_price',
    'total_onshift_dashers',
    'total_busy_dashers',
    'total_outstanding_orders',
    'estimated_order_place_duration',
    'estimated_store_to_consumer_driving_duration'
]
Let's address the categorical features with missing values ('market_id', 'store_primary_category', 'order_protocol'). Since there is little missing data across these features (<3%), we'll simply impute those missing values with an "unknown" category.
- This way, we won't have to drop data from other features.
- Perhaps the absence of a feature value carries some predictive power for delivery duration, i.e., these features may not be missing at random.
- In addition, we'll add this imputation step to our preprocessing pipeline during modeling, so we don't have to manually duplicate this work on our test set (see the sketch below).
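As a preview of that pipeline step, the imputation could be expressed with scikit-learn's SimpleImputer (a minimal sketch of the idea; the actual pipeline will be assembled later during modeling):
from sklearn.impute import SimpleImputer

# constant-fill imputer that replaces missing categorical values with "unknown"
categorical_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
For now, though, we'll simply impute directly on the training DataFrame: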
missing_cols_categorical = ['market_id', 'store_primary_category', 'order_protocol']

train_df[missing_cols_categorical] = train_df[missing_cols_categorical].fillna("unknown")
Let's look at the cardinality of our categorical features.
pd.DataFrame({'Cardinality': train_df[categorical_feats].nunique()}).rename_axis('Feature')
Since 'Market_id' & 'Order_Procol' has low cardinality, we can visualize its distributions easily. On the other hand, 'Store_id' & 'Store_primary_category' are high cardinality characteristics. We will take a deeper look at the later.
import seaborn as sns
import matplotlib.pyplot as plt

categorical_feats_subset = [
    'market_id',
    'order_protocol'
]
# Set up the grid
fig, axes = plt.subplots(1, len(categorical_feats_subset), figsize=(13, 5), sharey=True)
# Create barplots for each variable
for i, col in enumerate(categorical_feats_subset):
    sns.countplot(x=col, data=train_df, ax=axes[i])
    axes[i].set_title(f"Frequencies: {col}")
# Adjust layout
plt.tight_layout()
plt.show()
Some key things to consider:
- ~70% of orders placed have a 'market_id' of 1, 2, or 4
- <1% of orders have an 'order_protocol' of 6 or 7
Unfortunately, we have no additional information about these variables, such as which cities/locations the 'market_id' values correspond to, or what each 'order_protocol' number represents. At this point, requesting additional data about these fields could be a good idea, since it could help us investigate delivery duration trends across higher-level location/protocol groupings.
Let's look at our higher-cardinality categorical features. Maybe each 'store_primary_category' has an associated 'store_id' range? If so, we might not need 'store_id' at all, since 'store_primary_category' would already encapsulate much of the information about the store being ordered from.
store_info = train_df[['store_id', 'store_primary_category']]

store_info.groupby('store_primary_category')['store_id'].agg(['min', 'max'])
Clearly this is not the case: we see that 'store_id' ranges overlap across the levels of 'store_primary_category'.
A quick look at the distinct values and associated frequencies for 'store_id' & 'store_primary_category' shows that these features have high cardinality and are sparsely distributed. In general, high-cardinality categorical features can be problematic in regression tasks, particularly for regression algorithms that require purely numerical inputs. When these high-cardinality features are encoded, they can drastically expand the feature space, making the available data sparse and reducing the model's ability to generalize to new observations in that feature space. For a better, more rigorous explanation of this phenomenon, you can read more about it here.
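To make that expansion concrete, here is a quick illustrative check of how many columns a plain one-hot encoding of 'store_id' alone would add:
# each unique store_id would become its own column under one-hot encoding
print(f"Unique store_id values: {train_df['store_id'].nunique()}")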
Let's get a sense of how sparsely distributed these features are.
store_id_values = train_df['store_id'].value_counts()

# Plot the histogram
plt.figure(figsize=(8, 5))
plt.bar(store_id_values.index, store_id_values.values, color='skyblue')
# Add titles and labels
plt.title('Value Counts: store_id', fontsize=14)
plt.xlabel('store_id', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()
We see that a handful of stores have hundreds of orders, but most have far fewer than 100.
To handle the high cardinality of 'store_id', we'll create a new feature, 'store_id_freq', which bins the 'store_id' values by frequency.
- We'll group the 'store_id' values into five different percentile bins, shown below.
- 'store_id_freq' will have far lower cardinality than 'store_id', while retaining the relevant information about the popularity of the store the delivery was ordered from.
- For more inspiration behind this logic, see this thread.
import numpy as np

def encode_frequency(freq, percentiles) -> str:
    if freq < percentiles[0]:
        return '(0-50)'
    elif freq < percentiles[1]:
        return '(50-75)'
    elif freq < percentiles[2]:
        return '(75-90)'
    elif freq < percentiles[3]:
        return '(90-99)'
    else:
        return '99+'

value_counts = train_df['store_id'].value_counts()
percentiles = np.percentile(value_counts, [50, 75, 90, 99])

# apply encode_frequency to each store_id based on its number of orders
train_df['store_id_freq'] = train_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))

pd.DataFrame({'Count': train_df['store_id_freq'].value_counts()}).rename_axis('Frequency Bin')
Our encoding shows that ~60,000 deliveries were ordered from stores in the 90-99th popularity percentile, while ~12,000 deliveries were ordered from stores in the 0-50th popularity percentiles.
Now that we have (attempted to) capture the relevant 'store_id' information in a lower dimension, let's try to do something similar with 'store_primary_category'.
Let's look at the most popular levels of 'store_primary_category'.
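For example (a quick sketch):
# most common store categories in the training data
train_df['store_primary_category'].value_counts().head(20)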
A quick look shows that many of these 'store_primary_category' levels are not mutually exclusive (e.g., 'american' and 'burger'). Further investigation reveals many more examples of this kind of overlap.
So let's try to map these various store categories into a few basic, broader groups.
store_category_map = {
    'american': ['american', 'burger', 'sandwich', 'barbeque'],
    'asian': ['asian', 'chinese', 'japanese', 'indian', 'thai', 'vietnamese', 'dim-sum', 'korean',
              'sushi', 'bubble-tea', 'malaysian', 'singaporean', 'indonesian', 'russian'],
    'mexican': ['mexican'],
    'italian': ['italian', 'pizza'],
}

def map_to_category_type(category: str) -> str:
    for category_type, categories in store_category_map.items():
        if category in categories:
            return category_type
    return "other"

train_df['store_category_type'] = train_df['store_primary_category'].apply(lambda x: map_to_category_type(x))

value_counts = train_df['store_category_type'].value_counts()
# Plot pie chart
plt.figure(figsize=(6, 6))
value_counts.plot.pie(autopct='%1.1f%%', startangle=90, cmap='viridis', labels=value_counts.index)
plt.title('Category Distribution')
plt.ylabel('') # Hide y-axis label for aesthetics
plt.show()
This grouping is admittedly crude, and there may well be a better way to group these store categories. We'll proceed with it for now for simplicity's sake.
We've done quite a bit of digging into our categorical features. Now let's look at the distributions of our numerical features.
# Create grid for boxplots
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15)) # Adjust figure size
axes = axes.flatten()  # Flatten the 5x2 axes into a 1D array for easier iteration

# Generate boxplots for each numeric feature
for i, column in enumerate(numeric_feats):
    sns.boxplot(y=train_df[column], ax=axes[i])
    axes[i].set_title(f"Boxplot for {column}")
    axes[i].set_ylabel(column)

# Remove any unused subplots (if any)
for i in range(len(numeric_feats), len(axes)):
    fig.delaxes(axes[i])
# Adjust layout for better spacing
plt.tight_layout()
plt.show()
Many of the distributions appear quite skewed due to the presence of outliers.
In particular, there seems to be an order with more than 400 items. This seems odd, since the next largest order contains fewer than 100 items.
Let's take a closer look at that 400+ item order.
train_df[train_df['total_items'] == train_df['total_items'].max()]