A practical guide to evaluating external data and deciding whether it will enrich and improve your models
For the past five years, I served as vice president of data science, artificial intelligence, and research at two publicly traded companies. In both positions, artificial intelligence was critical to the company's core product. We partnered with data providers who enriched our data with relevant features that improved the performance of our models. Having had my fair share of problems with data providers, I wrote this post to help you save time and money when trying new suppliers.
Warning: Don't start this process until you have very clear business metrics for your model and have spent a decent amount of time optimizing your model. Working with most data providers for the first time is usually a long process (weeks at best, but often months) and can be very expensive (some data providers I've worked with cost tens of thousands of dollars a year, others have racked up millions of dollars a year when operating at scale).
Since this is usually a large investment, don't even begin the process unless you can clearly formulate how the go/no-go decision will be made. This is the number one mistake I've seen, so re-read that sentence. For me, this has always required translating all decision inputs into dollars.
For example, your model's performance metric could be the PRAUC (area under the precision-recall curve) of a classification model predicting fraud. Let's say your PRAUC increases from 0.9 to 0.92 with the new data added, which could be a huge improvement from a data science perspective. However, the enrichment costs 25 cents per call. To determine whether this is worth it, you will need to convert the incremental PRAUC into margin dollars. This stage can take time and will require a good understanding of the business model: how exactly does a higher PRAUC translate into higher revenue or margin for your company? For most data scientists, this is not always easy.
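To make this concrete, here is a minimal back-of-the-envelope sketch. All of the numbers and the mapping from metric lift to dollars are made up for illustration; your own translation will depend entirely on your business model.

```python
# Hypothetical numbers: translate a metric lift into monthly dollars vs. enrichment cost.
monthly_transactions = 1_000_000       # assumed volume
enrichment_cost_per_call = 0.25        # vendor price per call, dollars

# Suppose offline analysis shows the enriched model blocks an extra 0.2% of
# transactions that would otherwise become fraud losses of roughly $400 each.
extra_fraud_caught_rate = 0.002
avg_fraud_loss_per_txn = 400.0

monthly_cost = monthly_transactions * enrichment_cost_per_call
monthly_savings = monthly_transactions * extra_fraud_caught_rate * avg_fraud_loss_per_txn

print(f"cost: ${monthly_cost:,.0f}, savings: ${monthly_savings:,.0f}, "
      f"ROI: {monthly_savings / monthly_cost:.1f}x")
```

With these made-up numbers the enrichment pays for itself roughly three times over, which sounds healthy but may still fall short of the buffer discussed later in this post.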
This post won't cover every aspect of selecting a data provider (for example, we won't discuss contract negotiation), but it will cover the main aspects expected of you as a data science leader.
If you are the decision maker and your business operates at a large scale, chances are you regularly receive unsolicited emails from vendors. While a random provider may have some value, it is generally best to talk to industry experts and understand which data providers are commonly used in your industry. There are huge network effects and economies of scale when working with data, so the largest and most well-known vendors can generally provide more value. Don't rely on vendors that offer solutions for every problem or industry, and remember that the most valuable data is often the kind that takes the most work to create, not something that can be easily scraped off the internet.
Some points to cover when starting initial conversations:
- Who are your clients? How many significant clients do you have in our industry?
- Cost (at least order of magnitude), as this could be a deal breaker in the early stages of the deal
- Time travel capability: Do they have the technical ability to “travel back in time” and tell you how the data existed at a point in time? This is critical when running a historical proof of concept (more on this below).
- Technical restrictions: latency (pro tip: always look at p99 or other higher percentiles, not averages), uptime SLA, etc.
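On the latency point, a quick sketch (with simulated numbers) shows why tail percentiles tell a very different story from the average:

```python
import numpy as np

# Hypothetical response times (ms) from a trial integration: mostly fast,
# with a long tail of slow calls -- a common pattern with external APIs.
rng = np.random.default_rng(42)
latencies_ms = np.concatenate([
    rng.normal(120, 20, size=9_500),   # typical calls
    rng.normal(900, 150, size=500),    # slow tail (retries, cold caches, timeouts)
])

print(f"mean: {latencies_ms.mean():.0f} ms")               # looks harmless
print(f"p50:  {np.percentile(latencies_ms, 50):.0f} ms")
print(f"p99:  {np.percentile(latencies_ms, 99):.0f} ms")   # what your worst requests actually see
```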
Assuming the vendor has met the requirements above, you are ready to plan a proof of concept. You should have a baseline model with a clear evaluation metric that can be translated into business metrics. Your model should have a training set and an out-of-time test set (perhaps also one or more validation sets). Typically, you will send the relevant fields from the training and test sets, with their timestamps, so the vendor can append their data as it existed historically (time travel). You can then retrain your model with their features and evaluate the difference on the out-of-time test set.
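As a rough illustration of that last step, here is a minimal sketch of the before/after comparison. The file names, column names, and model choice are all assumptions made for the example, not a prescription:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score

# Hypothetical files and column names -- substitute your own.
train = pd.read_csv("train.csv")               # historical training set
test = pd.read_csv("test_oot.csv")             # out-of-time test set
vendor = pd.read_csv("vendor_features.csv")    # vendor features, point-in-time correct

# Join the vendor's features on your observation key.
train = train.merge(vendor, on="order_id", how="left")
test = test.merge(vendor, on="order_id", how="left")

base_cols = ["amount", "account_age_days", "num_prior_orders"]
vendor_cols = ["vendor_risk_score", "vendor_email_age_days"]

def evaluate(feature_cols):
    """Train on the training set, report PRAUC on the out-of-time test set."""
    model = GradientBoostingClassifier()
    model.fit(train[feature_cols].fillna(-1), train["is_fraud"])
    scores = model.predict_proba(test[feature_cols].fillna(-1))[:, 1]
    return average_precision_score(test["is_fraud"], scores)

print(f"baseline PRAUC: {evaluate(base_cols):.3f}")
print(f"enriched PRAUC: {evaluate(base_cols + vendor_cols):.3f}")
```

Note that the comparison is made on the out-of-time test set only; improvements that show up only on the training set are a red flag, as the story below illustrates.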
Ideally, you should not share your target variable with the vendor. Vendors may sometimes request your target variable in order to “calibrate/tune” their model, train a custom model, perform feature selection, or otherwise tailor their features to your needs. If you decide to share the target variable, make sure it is only for the training set, never the test set.
If the above paragraph gave you the shivers, congratulations: you already understand the leakage risk. Vendors will always be eager to prove the value of their data, and this is especially true for smaller vendors (for whom every deal can make a big difference).
One of my worst experiences working with a vendor was a few years ago. A new data vendor had just raised a Series A, created a lot of buzz, and promised extremely relevant data for one of our models. It was a new product for which we lacked relevant data, and we thought this could be a good way to get things up and running. We went ahead and started a POC, during which their model improved our AUC from 0.65 to 0.85 on our training set. On the test set, their model failed completely: they had wildly overfit the training set. After discussing this with them, they asked for the target variable from the test set so they could analyze the situation. They put their senior data scientist on it and asked for a second iteration. We waited a few more weeks for new data to be collected (to serve as a new, unseen test set). Once again, they dramatically improved the AUC on the new training set, only to bomb once more on the test set. Needless to say, we did not move forward with them.
If the POC results look promising, a few practices will help keep the economics on your side:
- Set a higher ROI threshold: Start by calculating ROI: estimate the incremental net margin the enriched model generates relative to the enrichment cost. Any project should clear a comfortably positive return, but since there is plenty of room for issues that erode it (data drift, gradual rollout, limited usefulness in some segments, etc.), set a higher threshold than you normally would. I have occasionally required a 5x financial return on enrichment costs as a minimum to move forward with a vendor, as a buffer against data drift, potential overfitting, and uncertainty in our point estimate of ROI.
- Partial enrichment: Perhaps the ROI across the entire model is not sufficient, but some segments show a much higher lift than others. In that case, it may be best to split your model in two and enrich only those segments. For example, if you run a classification model to identify fraudulent payments, the new data may provide a strong ROI in Europe but not elsewhere.
- Phased enrichment: If you have a classification model, you may consider splitting your decision into two phases:
- Phase 1: Run the existing model; every observation scored far enough from your decision threshold is decided here.
- Enrich only the observations close to your decision threshold (or above it, depending on the use case).
- Phase 2: Run a second, enriched model on those observations to refine the decision.
This approach can be very useful for reducing costs: you enrich only a small subset of observations and still capture most of the lift, especially when working with imbalanced data. It won't be as useful if the enriched data causes large swings in scores; for example, if seemingly very safe orders are later identified as fraud thanks to the enriched data, you will have to enrich most (if not all) of the data to capture that lift. Phased enrichment will also potentially double your latency, since you are running two similar models sequentially, so you need to carefully weigh the trade-off between latency, cost, and performance gain. A rough sketch of the two-phase setup follows below.
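Here is a minimal sketch of what such a two-phase decision could look like. The thresholds, column names, and the enrich() call are all hypothetical placeholders standing in for your own models and vendor integration:

```python
import numpy as np

# Hypothetical feature lists, consistent with the earlier sketch.
BASE_COLS = ["amount", "account_age_days", "num_prior_orders"]
VENDOR_COLS = ["vendor_risk_score", "vendor_email_age_days"]

LOW, HIGH = 0.05, 0.50   # only scores inside this band justify paying for enrichment

def decide(orders, base_model, enriched_model, enrich):
    """Two-phase decision: run the cheap model first, enrich only the uncertain band."""
    # Phase 1: score everything with the existing (unenriched) model.
    base_scores = base_model.predict_proba(orders[BASE_COLS])[:, 1]
    decisions = np.where(base_scores >= HIGH, "decline", "approve")

    # Phase 2: enrich and re-score only the uncertain middle band.
    uncertain = (base_scores >= LOW) & (base_scores < HIGH)
    if uncertain.any():
        enriched = enrich(orders.loc[uncertain])   # vendor call: this is what costs money
        new_scores = enriched_model.predict_proba(enriched[BASE_COLS + VENDOR_COLS])[:, 1]
        decisions[uncertain] = np.where(new_scores >= HIGH, "decline", "approve")

    return decisions
```

The band between LOW and HIGH controls the cost/performance trade-off directly: the wider you make it, the more of the lift you capture and the more you pay per decision.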
Working effectively with data providers can be a long and tedious process, but the increase in the performance of your models can be significant. Hopefully, this guide will help you save time and money. Happy modeling!