Prompting: zero-shot versus few-shot
Getting meaningful answers from LLMs can be challenging. So what's the best way to ask an LLM to label your data? As we can see from Table 1, previous studies explored zero-shot prompting, few-shot prompting, or both. With zero-shot prompting, the prompt asks the LLM for a response without showing it any examples. With few-shot prompting, the prompt includes several examples so that the LLM knows what the desired response looks like.
Studies differ on which approach yields better results: some use few-shot prompts for their tasks, others zero-shot prompts. You may therefore want to explore what works best for your particular use case and model.
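As a minimal sketch of the difference, here is what the two prompt styles might look like for a sentiment-labeling task (the task, texts, and labels are hypothetical and not taken from any of the cited studies):

```python
# Hypothetical sentiment-labeling task; the example texts and labels are
# illustrative only.

zero_shot_prompt = """Label the sentiment of the following text as "positive" or "negative".

Text: {text}
Label:"""

few_shot_prompt = """Label the sentiment of the following text as "positive" or "negative".

Text: I loved this movie, the acting was superb.
Label: positive

Text: The plot was dull and the pacing was painfully slow.
Label: negative

Text: {text}
Label:"""

# The zero-shot version asks for a label directly; the few-shot version first
# shows the model a couple of solved examples.
print(zero_shot_prompt.format(text="An unexpected gem of a film."))
```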
If you're wondering how to get started with writing good prompts, Sander Schulhoff & Shyamal H Anadkat have created LearnPrompting, which can help you with the basics as well as more advanced techniques.
Prompting: sensitivity
LLMs are sensitive to minor modifications in the prompt: changing a single word can affect the response. If you want to account for that variability to some extent, you can address it as in study (3). First, they let an expert on the task write the initial prompt. Then, using GPT, they generate 4 more prompts with a similar meaning and average the results across the 5 prompts. Alternatively, you can explore moving away from handwritten prompts altogether and replacing them with signatures, letting DSPy optimize the prompt for you, as shown in Leonie Monigatti's blog post.
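A rough sketch of the averaging idea from study (3) is shown below. It assumes a hypothetical `query_llm(prompt, text)` helper (not part of any specific library) that returns one label per call, and approximates "averaging" the five prompts with a majority vote over categorical labels:

```python
from collections import Counter

def query_llm(prompt: str, text: str) -> str:
    """Hypothetical helper: formats the prompt with the text, sends it to
    your LLM of choice, and returns the predicted label as a string."""
    raise NotImplementedError

# One expert-written prompt plus GPT-generated paraphrases with similar meaning
# (placeholder wording; in the study there are five prompts in total).
prompts = [
    "Classify the sentiment of this text as positive or negative: {text}",
    "Is the sentiment of the following text positive or negative? {text}",
    # ... three more paraphrases ...
]

def label_with_prompt_ensemble(text: str) -> str:
    # Query the LLM once per prompt variant and aggregate the answers.
    labels = [query_llm(p, text) for p in prompts]
    # Majority vote over the per-prompt labels.
    return Counter(labels).most_common(1)[0][0]
```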
Model choice
Which model should you choose to label your dataset? There are a few factors to weigh; let's briefly go through the key ones:
- Open source vs closed source: Do you opt for the latest model with the best performance? Or is open source customization more important to you? You'll need to think about things like your budget, performance requirements, customization and ownership preferences, security needs, and community support requirements.
- Guardrails: LLMs have safety guardrails in place to prevent them from responding with unwanted or harmful content. If your task involves sensitive content, models may refuse to label your data. Additionally, LLMs vary in the strength of their safeguards, so you should explore and compare them to find the most suitable one for your task.
- Model size: LLMs come in different sizes, and larger models may perform better, but they also require more computing resources. If you prefer to use an open-source LLM and have limited compute, you might consider quantization (see the sketch after this list). For closed-source models, larger variants currently come with higher per-request costs. But is bigger always better?
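As an example of the quantization option, here is a minimal sketch of loading an open-source model in 4-bit precision with Hugging Face transformers and bitsandbytes. The model name is only an example, and the exact arguments should be checked against the library versions you have installed:

```python
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example model id; substitute the open-source LLM you actually want to use.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit quantization greatly reduces the memory footprint compared to fp16,
# at some cost in accuracy.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)/CPU
)
```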
Model bias
According to study (3), larger and instruction-tuned³ models show superior labeling performance. However, the study does not evaluate bias in its results. Other research efforts show that bias tends to increase with both scale and ambiguous contexts. Several studies also warn of left-leaning tendencies and a limited ability to accurately represent the views of minority groups (e.g., older people or underrepresented religions). In general, today's LLMs show considerable cultural bias and respond with stereotyped views of minority individuals. Depending on your task and your goals, these are things to consider at every stage of your project.
“By default, LLM responses tend to be more similar to the opinions of certain populations, such as those in the United States and some European and South American countries” — quote from study (2)
Model parameter: temperature
A parameter mentioned in most of the studies in Table 1 is temperature, which adjusts the “creativity” of the LLM's output. Studies (5) and (6) experiment with higher and lower temperatures and find that LLMs give more consistent responses at lower temperatures without sacrificing accuracy, which is why they recommend lower values for annotation tasks.
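As an illustration, a low temperature can be set directly in the API call. The sketch below uses the OpenAI Python client; the model name and prompt are placeholders, not recommendations from the cited studies:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    temperature=0.0,      # low temperature -> more consistent labels
    messages=[
        {
            "role": "user",
            "content": 'Label the sentiment of "Great service!" as positive or negative.',
        },
    ],
)
print(response.choices[0].message.content)
```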
Language limitations
As we can see in Table 1, most studies measure the labeling performance of LLMs on English datasets. Study (7) explores tasks in French, Dutch, and English and observes a considerable decline in performance in languages other than English. At the moment, LLMs work best in English, but alternatives are being developed to extend their benefits to non-English speakers. Two such initiatives are YugoGPT (for Serbian, Croatian, Bosnian, Montenegrin) by Aleksa Gordic and Aya (101 different languages) by Cohere For AI.
Human reasoning and behavior (natural language explanations)
In addition to simply requesting a label from the LLM, we can also ask it to provide an explanation for the chosen label. One of the studies (10) concludes that GPT provides explanations that are comparable to, if not clearer than, those produced by humans. However, researchers from Carnegie Mellon and Google highlight that LLMs are not yet capable of simulating human decision making and do not show human-like behavior in their choices. They find that instruction-tuned models show even less human-like behavior and argue that LLMs should not be used to replace humans in the annotation process. I would also be cautious about using natural language explanations at this stage.
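If you do want to experiment with explanations, the request can simply be folded into the prompt itself. A minimal, hypothetical sketch (task and output format are illustrative only):

```python
# Hypothetical prompt asking for both a label and a one-sentence explanation.
label_and_explain_prompt = """Label the sentiment of the following text as "positive" or "negative",
then explain your choice in one sentence.

Text: {text}
Label:
Explanation:"""

print(label_and_explain_prompt.format(text="The support team resolved my issue in minutes."))
```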
“Substitution undermines three values: the representation of the interests of the participants; inclusion and empowerment of participants in the development process” — quote from Agnew (2023)