Where the assumptions behind the two-tower architecture break down, and how to go beyond them
Two-tower models are among the most common architectural design choices in modern recommender systems: the key idea is to have one tower that learns relevance and a second, shallow tower that learns observation biases, such as position bias.
In this post, we'll take a closer look at two assumptions behind two-tower models, in particular:
- the *factorization assumption*, that is, the hypothesis that we can simply multiply the probabilities calculated by the two towers or add their logits (see the sketch after this list), and
- the *positional independence assumption*, that is, the hypothesis that the only variable determining position bias is the position of the item itself, not the context in which it is displayed.
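To make the factorization assumption concrete, here is a minimal sketch of the two ways the towers' outputs are typically combined. The tensor names are illustrative assumptions; note that adding logits multiplies odds rather than probabilities, so the two variants share the same factorization idea but are not numerically identical:

```python
import torch

# Hypothetical per-impression logits from each tower.
relevance_logits = torch.randn(8)  # main tower: relevance
bias_logits = torch.randn(8)       # shallow tower: position bias

# Variant 1: multiply the towers' probabilities,
# i.e. P(click) = P(relevant) * P(observed).
p_click_mult = torch.sigmoid(relevance_logits) * torch.sigmoid(bias_logits)

# Variant 2: add the towers' logits and apply a single sigmoid.
p_click_add = torch.sigmoid(relevance_logits + bias_logits)
```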
We will see where both assumptions break down and how to go beyond these limitations with newer algorithms such as the MixEM model, the Dot Product model, and XPA.
Let's start with a brief reminder.
Two-tower models: the story so far
The main learning goal of classification models in recommender systems is relevance: we want the model to predict the best possible content given the context. Here, context simply means everything we have learned about the user, for example from their previous interactions or their search history, depending on the application.
However, classification models often suffer from certain observation biases, that is, the tendency of users to interact more or less with an impression depending on how it was presented to them. The most prominent observation bias is position bias: the tendency for users to interact more with items that are displayed first.
The key idea in two-tower models is to train two “towers”, i.e. neural networks, in parallel: the main tower learns relevance, and a second, shallow tower learns observation biases such as position bias.
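As a rough illustration, a two-tower model of this kind might look like the following PyTorch sketch. The layer sizes, names, and the logit-addition combination are assumptions for illustration, not the exact architecture of any particular production system:

```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """Minimal two-tower sketch: a deep relevance tower plus a
    shallow position-bias tower, combined by adding logits."""

    def __init__(self, feature_dim: int = 64, num_positions: int = 50):
        super().__init__()
        # Main tower: learns relevance from the context features
        # (everything we know about the user and the candidate item).
        self.relevance_tower = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )
        # Shallow tower: learns position bias from the position alone,
        # which is precisely the positional independence assumption.
        self.bias_tower = nn.Embedding(num_positions, 1)

    def forward(self, features: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        relevance_logit = self.relevance_tower(features).squeeze(-1)
        bias_logit = self.bias_tower(positions).squeeze(-1)
        # Factorization assumption: the towers' contributions combine additively.
        return relevance_logit + bias_logit

# Usage on a toy batch:
model = TwoTowerModel()
features = torch.randn(4, 64)              # hypothetical context features
positions = torch.randint(0, 50, (4,))     # displayed positions
click_logits = model(features, positions)  # train with a BCE-with-logits loss
```

At serving time the bias tower is typically dropped and items are ranked by the relevance logit alone: the whole point of the shallow tower is to absorb observation bias during training so that the main tower learns unbiased relevance.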