GGUF is a binary file format designed for efficient storage and fast loading of large language models (LLMs) with GGML, a C-based tensor library for machine learning.
GGUF encapsulates all the components required for inference, including the tokenizer configuration and model metadata, within a single file. Models from many families, such as Llama 3, Phi, and Qwen2, can be converted to GGUF. Additionally, the format supports models quantized to lower precisions, which improves speed and memory efficiency on CPUs.
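To illustrate what a GGUF file contains, the snippet below reads the metadata keys and tensor records from a local file. This is a minimal sketch, assuming the `gguf` Python package that ships with llama.cpp; the file name is a placeholder.

```python
from gguf import GGUFReader

# Placeholder path to a local GGUF file.
reader = GGUFReader("gemma-2-9b-it-Q4_K_M.gguf")

# Metadata key/value pairs: architecture, hyperparameters, tokenizer vocabulary, ...
for key in list(reader.fields)[:10]:
    print(key)

# Tensor records stored in the same file.
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```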
We often write “GGUF quantization”, but GGUF itself is just a file format, not a quantization method. llama.cpp implements several quantization algorithms that reduce the model size and serialize the resulting model into the GGUF format.
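In practice, the first step is usually to convert the original Hugging Face checkpoint into an unquantized (e.g., FP16) GGUF file with llama.cpp's conversion script, before any quantization is applied. A minimal sketch, assuming a local clone of llama.cpp and a downloaded model directory (paths and file names are illustrative):

```python
import subprocess

# Convert a Hugging Face checkpoint to an FP16 GGUF file.
subprocess.run([
    "python", "llama.cpp/convert_hf_to_gguf.py",
    "gemma-2-9b-it",                        # local model directory (placeholder)
    "--outfile", "gemma-2-9b-it-f16.gguf",  # unquantized GGUF output
    "--outtype", "f16",
], check=True)
```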
In this article, we will see how to accurately quantize an LLM and convert it to GGUF, using an importance matrix (imatrix) and the K-Quantization method. I provide the GGUF conversion code for Gemma 2 Instruct, using an imatrix. It works the same way with other models supported by llama.cpp: Qwen2, Llama 3, Phi-3, etc. We will also see how to evaluate the quantization accuracy and inference performance of the resulting models.
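To give a rough idea of the pipeline before we go through it step by step, the sketch below computes an importance matrix on a calibration file and then applies a K-quant type guided by it. It assumes a local llama.cpp build providing the `llama-imatrix` and `llama-quantize` binaries; file names and the calibration dataset are placeholders, not the exact commands used later in the article.

```python
import subprocess

# 1) Compute the importance matrix from a calibration text file.
subprocess.run([
    "llama.cpp/llama-imatrix",
    "-m", "gemma-2-9b-it-f16.gguf",  # unquantized GGUF from the conversion step
    "-f", "calibration.txt",         # placeholder calibration dataset
    "-o", "imatrix.dat",
], check=True)

# 2) Quantize to a K-quant type (here Q4_K_M), guided by the importance matrix.
subprocess.run([
    "llama.cpp/llama-quantize",
    "--imatrix", "imatrix.dat",
    "gemma-2-9b-it-f16.gguf",
    "gemma-2-9b-it-Q4_K_M.gguf",
    "Q4_K_M",
], check=True)
```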