A prompt-based experiment to improve both accuracy and transparent reasoning in content personalization.
At DER SPIEGEL, we continually explore ways to improve how we recommend news articles to our readers. In our most recent (offline) experiment, we investigated whether large language models (LLMs) could effectively predict which articles a reader would be interested in, based on their reading history.
Our approach
We conducted a study with readers who took part in a survey in which they rated their interest in various news articles. This gave us a ground truth for readers' preferences. For each participant, we had two key pieces of information: their actual reading history (the articles they had read before taking the survey) and their ratings of a set of articles presented in the survey. Read more about this mixed-methods approach to offline evaluation of news recommender systems here:
We then used the Anthropic API to access Claude 3.5 Sonnet, a state-of-the-art language model, as our recommendation engine. For each reader, we provided the model with their reading history (article title and summary) and asked it to predict how interested they would be in the survey articles. Here is the prompt we used:
You are a news recommendation system. Based on the user's reading history,
predict how likely they are to read new articles. Score each article from 0 to 1000,
where 1000 means highest likelihood to read.

Reading history (previous articles read by the user):
(List of previously read articles with titles and summaries)
Please rate the following articles (provide a score 0-1000 for each):
(List of candidate articles to rate)
You must respond with a JSON object in this format:
{
"recommendations": [
{
"article_id": "article-id-here",
"score": score
}
]
}
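For illustration, here is a minimal sketch of how such a prompt could be assembled and the model's JSON reply parsed. The helper names and the dictionary fields (`title`, `summary`, `id`) are assumptions for this sketch, not our production code.

```python
import json

def build_prompt(history, candidates):
    """Assemble the recommendation prompt from a user's reading history
    and the candidate articles to score (titles and summaries only)."""
    history_lines = "\n".join(f"- {a['title']}: {a['summary']}" for a in history)
    candidate_lines = "\n".join(f"- [{a['id']}] {a['title']}: {a['summary']}" for a in candidates)
    return (
        "You are a news recommendation system. Based on the user's reading history, "
        "predict how likely they are to read new articles. Score each article from "
        "0 to 1000, where 1000 means highest likelihood to read.\n\n"
        f"Reading history (previous articles read by the user):\n{history_lines}\n\n"
        f"Please rate the following articles (provide a score 0-1000 for each):\n"
        f"{candidate_lines}\n\n"
        "You must respond with a JSON object in this format:\n"
        '{"recommendations": [{"article_id": "article-id-here", "score": score}]}'
    )

def parse_scores(response_text):
    """Extract {article_id: score} from the model's JSON reply."""
    data = json.loads(response_text)
    return {r["article_id"]: r["score"] for r in data["recommendations"]}
```

The prompt string returned by `build_prompt` would then be sent as a user message via the Anthropic API, and the text of the reply fed to `parse_scores`.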
Using this approach, we can now compare the actual survey scores with the LLM's score predictions. This comparison provides an ideal dataset for evaluating the language model's ability to predict readers' interests.
Key results and findings
The findings were impressively robust. To understand performance, we can look at two key metrics. First, Accuracy@5: the LLM achieved a score of 56%, meaning that when the system recommended its top 5 articles for a user (out of 15), on average (almost) 3 of these 5 articles were actually among the articles that the user rated highest in our survey. Looking at the distribution of these predictions reveals even more impressive results: for 24% of users, the system correctly identified 4 or even all 5 of their top articles. For another 41% of users, it correctly identified 3 of their top 5 items.
To put this in perspective, if we were to recommend articles at random, we would only achieve 38.8% accuracy (see our previous Medium article for more details). Even recommendations based solely on article popularity (recommending what most people read) only reach 42.1%, and our previous approach using an embedding-based technique reached 45.4%.
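Per user, Accuracy@5 boils down to a simple set overlap between the model's top-5 ranking and the survey's top-5 ranking. A minimal sketch (the function name and the example article IDs are our own; the scores are loosely modeled on the example shown later in this post):

```python
def precision_at_k(predicted_scores, actual_scores, k=5):
    """Fraction of the model's top-k articles that also appear in the
    user's top-k articles from the survey. Both arguments are dicts
    mapping article_id -> score."""
    top_pred = set(sorted(predicted_scores, key=predicted_scores.get, reverse=True)[:k])
    top_true = set(sorted(actual_scores, key=actual_scores.get, reverse=True)[:k])
    return len(top_pred & top_true) / k

# Hypothetical user: 4 of the model's top 5 are in the survey top 5.
predicted = {"a": 850, "b": 800, "c": 780, "d": 750, "e": 720, "f": 650}
actual    = {"a": 253, "b": 757, "c": 586, "d": 797, "e": 766, "f": 601}
print(precision_at_k(predicted, actual))  # 0.8
```

Averaging this value over all survey participants yields the reported Accuracy@5.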
The graph below shows the improvement: while having some knowledge about users is better than guessing (random model), the LLM-based approach shows the strongest performance. Even compared to our more sophisticated embedding-based approach, the LLM achieves a significant improvement in prediction accuracy.
As a second evaluation metric, we use Spearman correlation. At 0.41, it represents a substantial improvement over our embedding-based approach (0.17). This also shows that the LLM is not only better at finding relevant articles, but also at understanding how much a reader might prefer one article over another.
Beyond performance: the power of explainability
What sets LLM-based recommendations apart is not just their performance but their ability to explain their decisions in natural language. Below is an example of how our system analyzes a user's reading patterns and explains its recommendations (prompt not shown):
User has 221 articles in reading history

Top 5 Comparison:
--------------------------------------------------------------------------------
Top 5 Predicted by Claude:
1. Wie ich mit 38 Jahren zum ersten Mal lernte, strukturiert zu arbeiten (Score: 850, Actual Value: 253.0)
2. Warum wir den Umgang mit der Sonne neu lernen müssen (Score: 800, Actual Value: 757.0)
3. Lohnt sich ein Speicher für Solarstrom vom Balkon? (Score: 780, Actual Value: 586.0)
4. »Man muss sich fragen, ob dieser spezielle deutsche Weg wirklich intelligent ist« (Score: 750, Actual Value: 797.0)
5. Wie Bayern versucht, sein Drogenproblem unsichtbar zu machen (Score: 720, Actual Value: 766.0)
Actual Top 5 from Survey:
4. »Man muss sich fragen, ob dieser spezielle deutsche Weg wirklich intelligent ist« (Value: 797.0, Predicted Score: 750)
5. Wie Bayern versucht, sein Drogenproblem unsichtbar zu machen (Value: 766.0, Predicted Score: 720)
2. Warum wir den Umgang mit der Sonne neu lernen müssen (Value: 757.0, Predicted Score: 800)
6. Abitur als Lotterie? (Value: 601.0, Predicted Score: 650)
3. Lohnt sich ein Speicher für Solarstrom vom Balkon? (Value: 586.0, Predicted Score: 780)
Performance Metrics:
--------------------------------------------------------------------------------
Spearman Correlation: 0.673
Precision@5: 0.800
Claude's Analysis:
--------------------------------------------------------------------------------
1. Key patterns in user's reading history:
- Strong interest in social and political issues
- Regular engagement with lifestyle and personal development content
- Interest in health and wellness topics
- Frequent reading of articles about economic and financial matters
- Attraction to critical analysis and commentary pieces
- Interest in food and cooking content
2. Scoring explanation:
Highest scored articles:
- "Wie ich mit 38 Jahren zum ersten Mal lernte, strukturiert zu arbeiten" (850) - Aligns with the user's interest in personal development and lifestyle content
- "Warum wir den Umgang mit der Sonne neu lernen müssen" (800) - Matches interest in health and wellness topics
- "Lohnt sich ein Speicher für Solarstrom vom Balkon?" (780) - Corresponds to interest in practical advice and economic considerations
Medium scored articles:
- "Man muss sich fragen, ob dieser spezielle deutsche Weg wirklich intelligent ist" (750) - Fits pattern of interest in political commentary
- "Wie Bayern versucht, sein Drogenproblem unsichtbar zu machen" (720) - Matches interest in social issues and critical reporting
- "Abitur als Lotterie?" (650) - Aligns with interest in educational and social topics
Lower scored articles:
- "Eine Brise Formel 1" (550) - Limited sports content in reading history
- "Reizender Absatz" (450) - Less alignment with demonstrated interests
- "Hier wird jetzt auf ganz, ganz hohem Niveau gemeckert" (400) - Style and topic less aligned with user preferences
The scoring prioritizes articles that match the user's demonstrated interests in social issues, practical advice, and critical analysis while giving lower scores to sports and lighter content that appears less frequently in their reading history.
Instead of functioning as a black box, the system can articulate why it thinks a particular article might be interesting to a reader: "Since you frequently read articles about practical advice and economic issues, you may find this analysis on the profitability of balcony solar storage especially relevant." This kind of transparent reasoning could make recommendations feel more personal and trustworthy.
Conclusion
While our results are promising, several challenges need to be addressed. The most important one is cost, driven by the long prompts (hundreds of article summaries per user). At approximately $0.21 per user for a single recommendation run, scaling this to our full readership would be prohibitively expensive. Testing high-performing open-source models such as Llama 3.1 (https://ai.meta.com/blog/meta-llama-3-1/) could potentially reduce these costs. Additionally, the current implementation is relatively slow, taking several seconds per user. For a news platform where content is updated frequently and readers' interests evolve, sometimes even within a single day, we would need to run these recommendations several times a day to remain relevant.
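To see why cost is the blocker, a back-of-envelope projection helps. Only the $0.21 per-user figure comes from the experiment; the readership size and refresh frequency below are hypothetical assumptions for illustration.

```python
# Back-of-envelope cost projection for scaling the LLM recommender.
cost_per_user = 0.21          # USD per single recommendation run (measured)
active_readers = 500_000      # assumed readership, not a real figure
runs_per_day = 3              # assumed refresh frequency, not a real figure

daily_cost = cost_per_user * active_readers * runs_per_day
yearly_cost = daily_cost * 365
print(f"Daily: ${daily_cost:,.0f}, yearly: ${yearly_cost:,.0f}")
```

Even under these rough assumptions, the projection lands in the hundreds of thousands of dollars per day, which is why cheaper models or shorter prompts are a prerequisite for production use.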
Additionally, we used a single, simple prompt without any prompt engineering or optimization. There is likely (significant) room for improvement through systematic prompt refinement. Furthermore, our current implementation only uses the titles and summaries of articles, without taking advantage of available metadata. We could potentially increase performance by incorporating additional signals, such as read time per article (how much time users spent reading each article) or overall article popularity. In any case, due to the high API costs, running iterative evaluation processes is currently not an option.
All in all, the combination of strong predictive performance and natural-language explanations suggests that LLMs can become a valuable tool in news recommendation systems. And beyond recommendations, they add a new way of analyzing the user journey in digital news. Their ability to process and interpret reading histories along with metadata opens up exciting possibilities: from understanding content journeys and topic progression to creating personalized review summaries.