A deep dive into stochastic decoding with temperature, top_p, top_k and min_p
When a Large Language Model (LLM) is asked a question, the model generates a probability for each possible token in its vocabulary.
After sampling a token from this probability distribution, we can add the selected token to our input message so that the LLM can generate the probabilities for the next token.
This sampling process can be controlled by parameters such as the famous temperature and top_p.
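To make this loop concrete, here is a minimal sketch in plain Python and NumPy. The `fake_model` function and the tiny five-token vocabulary are stand-ins I made up for illustration; a real LLM would run a forward pass over the full context and return logits for its entire vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary for illustration

def softmax(logits: np.ndarray) -> np.ndarray:
    # Turn raw scores (logits) into a probability distribution over the vocabulary.
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

def fake_model(context: list[str]) -> np.ndarray:
    # Stand-in for a real forward pass: returns one logit per vocabulary token.
    return rng.normal(size=len(vocab))

context = ["the"]
for _ in range(4):
    probs = softmax(fake_model(context))       # probabilities for the next token
    next_id = rng.choice(len(vocab), p=probs)  # sample one token from the distribution
    context.append(vocab[next_id])             # feed the chosen token back as input

print(" ".join(context))
```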
In this article, I will explain and visualize the sampling strategies that define the output behavior of LLMs. By understanding what these parameters do and configuring them based on our use case, we can improve the output generated by LLMs.
For this article, I will use vLLM as the inference engine and Microsoft's new Phi-3.5-mini-instruct model with AWQ quantization. To run this model locally, I use the NVIDIA GeForce RTX 2060 GPU in my laptop.
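As a rough sketch of what such a setup can look like with vLLM's offline Python API (the model identifier below is a placeholder for whatever AWQ-quantized checkpoint you actually use; the same model can also be served behind vLLM's OpenAI-compatible server, which is how the logprobs are retrieved later with the OpenAI SDK):

```python
from vllm import LLM, SamplingParams

# Placeholder identifier: point this at the AWQ-quantized Phi-3.5-mini-instruct
# checkpoint you actually use (e.g. a community AWQ build on the Hugging Face Hub).
llm = LLM(model="path/or/hub-id-of-phi-3.5-mini-instruct-awq", quantization="awq")

# SamplingParams exposes the knobs discussed in this article
# (temperature, top_p, top_k, min_p) plus logprobs for inspecting the distribution.
params = SamplingParams(temperature=0.7, top_p=0.9, logprobs=5, max_tokens=64)

outputs = llm.generate(["Explain what temperature does during sampling."], params)
print(outputs[0].outputs[0].text)
```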
Table of Contents
· Understanding Sampling with Logprobs
∘ LLM decoding theory
∘ Retrieving Logprobs with the OpenAI Python SDK
· Greedy decoding
· Temperature
· Top-k sampling
· Top-p sampling
· Combining Top-p…