Many of my clients ask me for advice on which large language model (LLM) to use to build products tailored for Dutch-speaking users. However, most of the available benchmarks are multilingual and do not focus specifically on Dutch. As a machine learning engineer and PhD researcher in machine learning at the University of Amsterdam, I know how crucial benchmarks have been for the advancement of AI, but I also understand the risks of relying on them blindly. That’s why I decided to experiment and run some Dutch-specific benchmarks myself.
In this post, you will find an in-depth look at my first attempt at benchmarking several large language models (LLMs) on real Dutch exam questions. I will guide you through the entire process, from collecting over 12,000 exam PDFs to extracting question-answer pairs and automatically scoring the models’ performance using an LLM as a judge. You will see how models like o1-preview, o1-mini, GPT-4o, GPT-4o-mini and Claude-3 performed across Dutch education levels, from VMBO to VWO, and whether the higher cost of certain models actually translates into better results. This is just a first attempt at tackling the problem, and I may go deeper in future posts, exploring other models and tasks. I will also talk about the challenges and costs involved and share some insights on which models offer the best value for Dutch language tasks. If you are building or scaling LLM-based products for the Dutch market, this post will provide you with valuable insights to help guide your choices as of September 2024.
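To give a concrete sense of the scoring step, here is a minimal sketch of how LLM-based grading can work. The `grade_answer` helper, the prompt wording, and the use of gpt-4o-mini as the judge model are illustrative assumptions on my part, not the exact setup I describe later in this post.

```python
# Sketch of an LLM-as-a-judge grading step, assuming the OpenAI Python SDK (openai>=1.0).
# The helper name, prompt wording, and judge model are illustrative choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade_answer(question: str, reference_answer: str, model_answer: str) -> bool:
    """Ask a judge model whether a candidate answer matches the official answer key."""
    prompt = (
        "You are grading a Dutch 'Nederlands' exam question.\n"
        f"Question: {question}\n"
        f"Official answer key: {reference_answer}\n"
        f"Candidate answer: {model_answer}\n"
        "Does the candidate answer match the answer key? Reply with only YES or NO."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```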
It’s becoming increasingly common for companies like OpenAI to make bold, almost outlandish, claims about the capabilities of their models, often without sufficient real-world validation to back them up. That’s why it’s so important to evaluate these models against benchmarks, especially when they’re marketed as solutions for everything from complex reasoning to understanding nuanced languages. With such grandiose claims, it’s vital to conduct objective testing to see how well they actually work, and more specifically, how they handle the unique challenges of the Dutch language.
I was surprised to discover that there has been no extensive research into benchmarking LLMs for Dutch, which led me to take matters into my own hands on a rainy afternoon. With so many institutions and companies increasingly relying on these models, it seemed like the right time to dive in and start validating them. So here is my first attempt to begin filling that gap, and I hope it offers valuable insights for anyone working with the Dutch language.
Many of my clients work with Dutch products and need AI models that are cost-effective and high-performing in language understanding and processing. While large language models (LLMs) have made impressive progress, most available benchmarks focus on English or multilingual capabilities, and often neglect the nuances of smaller languages like Dutch. This lack of focus on Dutch matters because linguistic differences can lead to large performance gaps when a model is asked to understand non-English text.
Five years ago, deep learning NLP models for Dutch (such as the first versions of BERT) were far from mature. At that time, traditional methods like TF-IDF combined with logistic regression often outperformed early deep learning models on the Dutch language tasks I worked on. While models (and datasets) have improved greatly since then, especially with the rise of pre-trained multilingual LLMs and transformers, it is still critical to check how well these advances carry over to specific languages like Dutch. The assumption that performance improvements in English transfer to other languages does not always hold, especially for complex tasks like reading comprehension.
That’s why I focused on creating a custom benchmark for Dutch, using real data from the Dutch “Nederlands” exams (these exams become public domain after their publication). These exams don’t just involve simple linguistic processing; they test “reading comprehension,” requiring students to understand the intent behind various texts and answer nuanced questions about them. This type of task is particularly important because it reflects real-world applications, such as processing and summarizing legal documents, news articles, or client queries written in Dutch.
By evaluating LLMs on this specific task, I wanted to gain deeper insights into how the models handle the complexity of the Dutch language, especially when they are asked to interpret intents, draw conclusions, and respond with accurate answers. This is crucial for companies building products tailored to Dutch-speaking users. My goal was to create a more specific and relevant benchmark to help identify which models deliver the best performance for Dutch, rather than relying on general multilingual benchmarks that don’t fully capture the complexities of the language.