How long would you keep your gym membership before deciding to cancel it? Or Netflix, if you are a series fan but too busy to give two hours of your time to your sofa and your TV? When should you upgrade or replace your smartphone? Which route is best once you consider traffic, road closures and the time of day? How long until your car needs servicing? These are all regular (but not trivial) questions that we face in daily life, usually without thinking much (or at all) about the factors that influence our next course of action. Surely (or maybe after reading these lines) one would be interested to know which factor or factors have the greatest influence on the expected time until a given event (one of the above, or any other for that matter) occurs. In statistics, this is referred to as time-to-event analysis or survival analysis, and it is the focus of this study.
In survival analysis, one aims to analyze the time until an event occurs. In this article, I employ survival analysis to predict when a registered member is likely to leave (churn), specifically the number of days until a member cancels their membership contract. Because the variable of interest is a number of days, one point is worth stating explicitly: the time-to-event dependent variable is continuous, that is, it can take any value within a certain range. Survival analysis is the technique designed for this kind of target.
DATA
This study was conducted using a proprietary dataset provided by a private organization in the tutoring industry. The data consists of records anonymized for confidentiality purposes and collected over a period of just over two years, from July 2022 to October 2024. All analyses were conducted in compliance with ethical standards, ensuring data privacy and anonymity. Therefore, to respect the confidentiality of the data provider, any specific organizational details and unique identifier details have been omitted.
The final dataset after pre-processing (i.e. tackling nulls, normalizing to handle outliers, aggregating to remove duplicates and grouping to a sensible level) contains a total of 44,197 records, one per unique identifier. Five columns were input into the model: 1) Age, 2) Number of visits, 3) First visit and 4) Last visit during membership, and 5) Tenure. The latter represents the number of days a membership was held and is therefore the time-to-event target variable. The visit-based variables were feature-engineered for this study by performing calculations and aggregations on the raw data for each identifier over the period under analysis. Finally, and very importantly, the dataset is composed ONLY of uncensored records. That is, every unique identifier has experienced the event (membership cancellation) by the time of the analysis, so there are no censored records in which an individual survived (did not cancel their membership) beyond the observed duration. This is key when selecting the modelling technique, as I will explain next.
Among the different techniques used in survival analysis, three stand out as the most commonly used:
Kaplan-Meier Estimator
- This is a non-parametric model, hence no assumptions are made about the distribution of the data.
- KM does not consider how individual features affect churn, thus it does not offer feature-based insights.
- It is widely used for exploratory analysis to assess what the survival curve looks like.
- Very importantly, it does not provide personalized predictions.
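To make the estimator concrete, here is a minimal sketch of the Kaplan-Meier calculation in plain Python. The function name `kaplan_meier` and the tenure values are hypothetical and chosen only for illustration; they are not taken from the study's dataset. At each distinct event time the running survival probability is multiplied by one minus the fraction of at-risk subjects who experience the event.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time.

    durations: observed times (e.g. tenure in days)
    events:    1 if the event (cancellation) was observed, 0 if censored
    """
    # Number of observed events at each time
    deaths = Counter(t for t, e in zip(durations, events) if e)
    # Number of subjects leaving the risk set at each time (event or censoring)
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the share that survives time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical tenures in days; every event is observed (no censoring)
tenures = [30, 45, 45, 90, 120]
curve = kaplan_meier(tenures, [1] * len(tenures))
for t, s in curve:
    print(f"S({t}) = {s:.2f}")
```

When every record is uncensored, as in this sketch, the estimate at time t reduces to the share of subjects whose tenure exceeds t. The result is a single curve for the whole population, which is why the estimator cannot give personalized predictions.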
Cox Proportional Hazards (PH) Model
- The Cox PH Model is a semi-parametric model so it does not assume any specific distribution of the survival time, making it more flexible for a wider range of data.
- It estimates the hazard function.
- It relies heavily on uncensored as well as censored data to differentiate between individuals still “at risk” of experiencing the event and those who have already had it. Thus, if only uncensored data is analyzed, the model treats every individual as having experienced the event, which can yield biased results and lead the Cox PH model to perform poorly.
AFT Model
- It does not require censored data, so it can be used where everyone has experienced the event.
- It directly models the relationship between the covariates and the survival time.
- Used when time-to-event outcomes are of primary interest.
- The model estimates the time-to-event explicitly and therefore provides direct predictions of the duration until cancellation.
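As an illustration of how an AFT model turns covariates into a predicted duration, the sketch below fits a log-normal AFT model with a single covariate. It is a simplified stand-in, not the model fitted in this study: it assumes a log-normal error distribution and fully uncensored data, in which case the maximum-likelihood coefficients coincide with ordinary least squares on log(T). The function names and the visit and tenure values are hypothetical.

```python
import math

def fit_lognormal_aft(x, durations):
    """Fit log(T) = b0 + b1 * x + noise by least squares (one covariate).

    Assumption: no censoring and normally distributed noise, so this is
    the maximum-likelihood fit of the coefficients of a log-normal AFT model.
    """
    y = [math.log(t) for t in durations]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def predict_median_duration(b0, b1, x_new):
    # exp(linear predictor) is the median duration under a log-normal AFT model
    return math.exp(b0 + b1 * x_new)

# Hypothetical data: number of visits and tenure in days
visits = [2, 5, 8, 12, 20]
tenure_days = [30, 60, 95, 150, 300]

b0, b1 = fit_lognormal_aft(visits, tenure_days)

# exp(b1) is the acceleration factor: the multiplicative change in
# duration associated with one extra visit
acceleration_factor = math.exp(b1)
print(f"acceleration factor per visit: {acceleration_factor:.3f}")
print(f"predicted median tenure at 10 visits: {predict_median_duration(b0, b1, 10):.0f} days")
```

The exponentiated coefficient reads as a multiplier on time: a value above 1 means the covariate stretches the expected time to cancellation, and a value below 1 means it shortens it.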
Given the characteristics of the dataset used in this study, I have selected the Accelerated Failure Time (AFT) Model as the most suitable technique. This choice is driven by two key factors: (1) the dataset contains only uncensored data, and (2) the analysis focuses on generating individual-level predictions for each unique identifier.
Now before diving any deeper into the methodology and model output, I will cover some key concepts:
Survival Function: Gives the probability that the event has not yet occurred by time t, providing insight into the likelihood of survival over time.
Hazard Function: The rate at which the event occurs at a point in time t, given survival up to that time. It captures how the risk of the event changes over time.
Time-to-event: Refers to the (target) variable capturing the time until an event occurs.
Censoring: Flag identifying subjects for whom the event has not yet occurred within the timeframe of the analysis. NOTE: In this piece of work only uncensored data is analyzed; that is, the survival time of every subject in the study is known.
Concordance Index: A measure of how well the model predicts the relative ordering of survival times. It measures ranking accuracy rather than absolute accuracy, assessing the proportion of all comparable pairs of subjects whose predicted survival times are ordered in the same way as their actual outcomes.
Akaike Information Criterion (AIC): A measure that evaluates the quality of a model by rewarding goodness of fit while penalizing the number of parameters used. When comparing several models fitted to the same data, the one with the lowest AIC is considered the best.
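The last two concepts are evaluation metrics, and both are short enough to sketch in plain Python. The helper names and the example values below are hypothetical. The concordance function assumes fully uncensored data, skips pairs with tied actual durations and gives half credit to tied predictions. The AIC function uses the standard definition, AIC = 2k - 2 ln(L), where k is the number of parameters and L is the maximized likelihood.

```python
from itertools import combinations

def concordance_index(actual, predicted):
    """Share of comparable pairs whose predicted order matches the actual order.

    Assumptions: no censoring; pairs with tied actual durations are skipped;
    tied predictions count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    for (a_i, p_i), (a_j, p_j) in combinations(zip(actual, predicted), 2):
        if a_i == a_j:
            continue  # not comparable: actual durations are tied
        comparable += 1
        if p_i == p_j:
            concordant += 0.5
        elif (a_i < a_j) == (p_i < p_j):
            concordant += 1.0
    return concordant / comparable

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 * ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical actual and predicted tenures in days
actual_days = [30, 60, 95, 150, 300]
predicted_days = [40, 55, 120, 100, 280]

c_index = concordance_index(actual_days, predicted_days)
print(f"concordance index: {c_index:.2f}")

# Hypothetical log-likelihoods for two candidate models
print(aic(log_likelihood=-100.0, n_params=3))
print(aic(log_likelihood=-99.5, n_params=6))
```

A concordance index of 0.5 corresponds to random ranking and 1.0 to perfect ranking. In the AIC comparison, the second model fits slightly better but uses twice as many parameters, so it ends up with the higher (worse) score.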
Next, I will expand on the first two concepts.
In mathematical terms:
The survival function is given by:
$$S(t) = P(T > t)$$
where,
T is a random variable representing the time to event — duration until the event occurs.
S