Recommendations are ubiquitous in our digital lives, from e-commerce giants to streaming services. However, behind every great recommendation system lies a challenge that can significantly affect its effectiveness: sampling bias.
In this article, I will explain how sampling bias arises during the training of recommendation models and how we can address it in practice.
Let’s dive in!
In general, we can formulate the recommendation problem as follows: given a query x (which may contain user information, context, previously clicked items, etc.), find the set of items {y1, ..., yk} that the user will most likely be interested in.
One of the main challenges for large-scale recommender systems is the low-latency requirement. The sets of users and items are vast and dynamic, so exhaustively scoring every candidate and greedily picking the best ones is infeasible. To meet the latency budget, recommender systems are therefore generally split into two main stages: retrieval and ranking.
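The two-stage idea can be sketched as follows. This is a toy illustration, not a real system: `cheap_score` and `precise_score` are hypothetical stand-ins for the retrieval similarity and the heavy ranking model.

```python
# Toy sketch of the two-stage pipeline described above. `cheap_score` and
# `precise_score` are illustrative stand-ins, not real models.

def cheap_score(query, item):
    # Stand-in for a fast similarity, e.g. a dot product of precomputed embeddings.
    return -abs(query - item)

def precise_score(query, item):
    # Stand-in for an expensive ranking model, run only on the retrieved candidates.
    return -abs(query - item) ** 2

def recommend(query, catalog, retrieve_k=300, final_k=10):
    # Stage 1 (retrieval): cheap score over the full catalog, keep a few hundred.
    candidates = sorted(catalog, key=lambda y: cheap_score(query, y), reverse=True)[:retrieve_k]
    # Stage 2 (ranking): expensive score over the survivors only.
    return sorted(candidates, key=lambda y: precise_score(query, y), reverse=True)[:final_k]

print(recommend(42, list(range(100_000)), final_k=5))  # → [42, 41, 43, 40, 44]
```

The key design point is that the expensive model never sees the full catalog: it only scores the few hundred candidates that survive retrieval.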
Retrieval is a cheap and efficient way to quickly narrow the vast pool of candidates (millions or billions) down to the top few hundred. Retrieval optimization has two main objectives:
- During the training phase, we want to encode users and items into embeddings that capture user behavior and preferences.
- During inference, we want to quickly retrieve relevant elements through Approximate Nearest Neighbors (ANN).
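To make the second objective concrete, here is a minimal sketch of inner-product retrieval over precomputed item embeddings. For clarity it computes the exact top-k with NumPy; a production system would replace this brute-force step with an ANN index (e.g. FAISS or ScaNN) to stay within the latency budget. The shapes and embeddings are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(100_000, 64)).astype(np.float32)  # precomputed item tower outputs
query_embedding = rng.normal(size=(64,)).astype(np.float32)          # query tower output at request time

# Score every item by inner product with the query embedding.
scores = item_embeddings @ query_embedding

# Exact top-k; an ANN index would approximate this step much faster.
k = 300
top_k = np.argpartition(scores, -k)[-k:]
top_k = top_k[np.argsort(scores[top_k])[::-1]]  # sort the k survivors by score, best first
```

`argpartition` finds the k best candidates in linear time without fully sorting all 100,000 scores; only the k survivors are sorted.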
For the first objective, one of the most common approaches is the two-tower neural network. The model gained popularity for addressing cold-start issues by incorporating item content features.
In detail, queries and items are encoded by their corresponding DNN towers so that relevant (query, item) embedding pairs remain…
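A minimal two-tower sketch in NumPy might look like the following. All dimensions and weights here are arbitrary assumptions for illustration; a real model would learn the tower parameters by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, w1, w2):
    # One hidden layer with ReLU, then L2-normalize the output embedding
    # so the dot product between the two towers is a cosine similarity.
    h = np.maximum(x @ w1, 0.0)
    e = h @ w2
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

# Hypothetical sizes: 32-dim query features, 48-dim item features, 16-dim embeddings.
wq1, wq2 = rng.normal(size=(32, 64)), rng.normal(size=(64, 16))  # query tower weights
wi1, wi2 = rng.normal(size=(48, 64)), rng.normal(size=(64, 16))  # item tower weights

query = rng.normal(size=(1, 32))   # one query's features
items = rng.normal(size=(5, 48))   # five candidate items' features

# Relevance is the dot product of the two towers' embeddings.
similarity = tower(query, wq1, wq2) @ tower(items, wi1, wi2).T  # shape (1, 5)
```

Because the item tower does not depend on the query, item embeddings can be precomputed offline and indexed, leaving only the query tower to run at request time.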