Image by author
For many of us, exploring the possibilities of LLMs has seemed out of reach. Whether it's downloading complicated software, figuring out coding, or needing powerful machines, getting started with an LLM can seem daunting. But imagine if we could interact with these powerful language models as easily as launching any other program on our computers: no installation, no coding, just click and talk. This accessibility is key for both developers and end users. llamafile emerges as a novel solution, merging llama.cpp with Cosmopolitan Libc into a single framework. This framework reduces the complexity of LLMs by offering a single-file executable, called a "llamafile," that runs on local machines with no installation required.
So how does it work? llamafile offers two convenient methods to run an LLM:
- The first method involves downloading the latest release of llamafile along with the corresponding model weights from Hugging Face (see the sketch after this list). Once you have those files, you're ready!
- The second method is even simpler: you can use existing example llamafiles that have the weights built in.
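To make the first method concrete, here is a minimal command-line sketch. The release URL and weights file name below are illustrative assumptions; grab the actual latest binary from the llamafile GitHub releases page and any GGUF weights you like:

# Download the llamafile runtime (URL is an assumption; use the latest release link)
curl -L -o llamafile https://github.com/Mozilla-Ocho/llamafile/releases/latest/download/llamafile
chmod +x llamafile
# Point the runtime at a separately downloaded GGUF weights file (example name)
./llamafile -m mistral-7b-instruct.Q4_0.gguf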
In this tutorial, you will work with the LLaVa model's llamafile using the second method. It's a 7-billion-parameter model quantized to 4 bits that you can chat with, upload images to, and ask questions about. Example llamafiles for other models are also available, but we will work with the LLaVa model, since its llamafile is 3.97 GB while Windows has a maximum executable file size of 4 GB. The process is quite simple, and you can run the LLM by following the steps below.
First, you need to download the llava-v1.5-7b-q4.llamafile executable (3.97 GB) from the download link in the llamafile repository.
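If you prefer the terminal, a download command might look like the following. The Hugging Face URL here is an assumption; use whichever link the llamafile README currently points to:

curl -L -o llava-v1.5-7b-q4.llamafile https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile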
Open your computer's terminal and navigate to the directory where the file is located. Then, on macOS or Linux, run the following command to grant your computer permission to execute this file.
chmod +x llava-v1.5-7b-q4.llamafile
If you are on Windows, instead rename the file by appending ".exe" to the end. You can run the following command in the terminal for this purpose.
rename llava-v1.5-7b-q4.llamafile llava-v1.5-7b-q4.llamafile.exe
Run the llamafile with the following command (the -ngl flag sets how many model layers to offload to your GPU; 9999 offloads as many as possible).
./llava-v1.5-7b-q4.llamafile -ngl 9999
macOS uses zsh as its default shell, and if you run into the error
zsh: exec format error: ./llava-v1.5-7b-q4.llamafile
then you need to run the file through bash instead (note the quotes, so the -ngl flag is passed to the llamafile rather than to bash):
bash -c './llava-v1.5-7b-q4.llamafile -ngl 9999'
For Windows, your command may look like this:
llava-v1.5-7b-q4.llamafile.exe -ngl 9999
After you run the llamafile, it should automatically open your default browser and display the user interface shown below. If it doesn't, open the browser and navigate to http://localhost:8080 manually.
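The browser UI is not the only way in: llamafile embeds the llama.cpp server, so the same port also answers HTTP requests. A minimal sketch, assuming the default port 8080 and the llama.cpp-style /completion endpoint:

# Query the running llamafile server from another terminal
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Briefly, what is the LLaVa model?", "n_predict": 128}'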
Image by author
Let's start by asking the interface a simple question about the LLaVa model itself. Below is the response generated by the model:
Image by author
The answer highlights the approach used to develop the LLaVa model and its applications. The response was generated reasonably fast. Let's try another task. We will upload the following sample image of a bank card and extract the required information from it.
Image by Ruby Thompson
Here is the answer:
Image by author
Once again, the answer is quite reasonable. The authors of LLaVa claim that it achieves top-notch performance across a variety of tasks. Feel free to explore various tasks, observe their successes and limitations, and experience the excellent performance of LLaVa yourself.
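If you would rather script such experiments than use the upload button, the server may also accept images over HTTP. A rough sketch, assuming the llama.cpp-style image_data payload (the field names and the [img-10] prompt tag are assumptions; the browser UI is the safer route):

# Base64-encode the image (on macOS, use: base64 -i card.png)
IMG=$(base64 -w0 card.png)
# Reference the image in the prompt via its id (assumed llama.cpp-style API)
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"USER:[img-10]List the details on this card.\nASSISTANT:\", \"image_data\": [{\"data\": \"$IMG\", \"id\": 10}], \"n_predict\": 128}"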
Once your interaction with the LLM is complete, you can shut down the llamafile by returning to the terminal and pressing Ctrl + C.
Distributing and running LLMs has never been easier. In this tutorial, we explained how easily you can run and experiment with different models using a single llamafile executable. This not only saves time and resources, but also expands the accessibility and real-world usefulness of LLMs. We hope you found this tutorial useful, and we would love to hear your thoughts on it. Also, if you have any questions or comments, please feel free to contact us. We are always happy to help and value your feedback.
Thank you for reading!
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She is the co-author of the eBook "Maximize Productivity with ChatGPT." As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is a passionate advocate for change and founded FEMCodes to empower women in STEM fields.