You can find the code in this GitHub repository:
https://github.com/amirarsalan90/personal_llm_assistant
The main components of the application include:
Llama-cpp-python is a Python binding for llama.cpp, a library that implements many large language models in C/C++. Because it is widely adopted by the open source community, I decided to use it in this tutorial.
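To make the role of llama-cpp-python concrete, here is a minimal sketch of loading a GGUF model and generating a completion. The model path and prompt are placeholders (not files from the repository), and any chat-tuned GGUF model downloaded from Hugging Face would work in their place:

from llama_cpp import Llama

# Load a local GGUF model. The path below is a placeholder;
# point it at whatever GGUF file you have downloaded.
llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf")

# Generate a short completion from a plain-text prompt.
output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])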
Note: I tested this app on a system with an NVIDIA RTX 4090 GPU.
First things first, let's create a new conda environment:
conda create --name assistant python=3.10
conda activate assistant
Next, we need to install llama-cpp-python. As mentioned in the llama-cpp-python documentation, llama.cpp supports several hardware acceleration backends to speed up inference. To take advantage of the GPU and run the LLM on it, we will build llama-cpp-python with the cuBLAS backend. I had some trouble offloading the model to the GPU, and eventually found this post explaining how to install it correctly:
export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
pip install 'llama-cpp-python[server]'
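To verify that the cuBLAS build actually offloads work to the GPU, a quick sanity check like the following can help. The model path is again a placeholder; with verbose=True, the llama.cpp startup log should report layers being offloaded to CUDA:

from llama_cpp import Llama

# n_gpu_layers=-1 asks llama.cpp to offload every layer to the GPU.
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    verbose=True,  # prints backend/offload info at load time
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])

With the [server] extra installed, the same model can also be exposed over an OpenAI-compatible HTTP API by running python -m llama_cpp.server --model <path to gguf file>.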