Turn your computer into an AI machine - Install Ollama
Introduction
In the first part of this blog series I installed Open WebUI and everything needed to get it running. The Nvidia GPU drivers and the CUDA toolkit are also already in place.
Before I can use Open WebUI as a chat interface for a local large language model (LLM), I need software that provides an interface to such an LLM, and that’s what Ollama does. It helps to get Llama 3.2, Mistral, Gemma 2 and other large language models up and running. A full list of supported LLMs is provided in the Model Library. Below the model library there is also the following note:
You should have at least 8 GB of RAM available to run the 7B models,
16 GB to run the 13B models, and 32 GB to run the 33B models.
If your computer has around 40 GByte of RAM (you also need some RAM for your OS) you can basically run a 33B model with Ollama (the B stands for billion and we’re talking about parameters). But as mentioned in my previous blog post, running an LLM on a CPU isn’t much fun because it takes a long time to answer your questions (AFAIK that’s true even for a very current CPU, but I might be wrong here). Nevertheless, with >= 64 GByte of RAM you can try running the Llama 3.3 70B model with Ollama. That’s basically a GPT-4 class model and that’s pretty impressive (who would have thought that a year ago…). The only problem here is that with “only” 64 GByte of RAM you won’t be able to start many other programs anymore 😉
So going by these numbers, even if you own an Nvidia 4090 with 24 GByte VRAM you can’t run a 33B or even a 70B model completely on that GPU. Normally, the more parameters a model has, the more complex patterns it can recognize in data and the more nuances in language it can capture. Models with more parameters can generally produce more coherent and contextually relevant responses, handle more complex tasks, and learn from more diverse data. However, more parameters also mean that the model requires more computational resources (like memory and processing power) to train and run.
So with my Nvidia 4080 SUPER with 16 GByte VRAM I can only run 13B models, maybe even 14B, or of course models with fewer parameters completely on the GPU. These models are already not that bad, but of course not as good as the models you get with e.g. ChatGPT from OpenAI.
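As a rough sanity check for numbers like these, the memory needed for the model weights alone is the number of parameters times the bytes per parameter. The following is only a back-of-the-envelope sketch: estimate_weights_gb is a helper name made up for this post, the bits per weight depend on how a particular model was quantized, and the context (KV cache) plus runtime overhead come on top, so real usage is higher.

```shell
#!/usr/bin/env bash
# Rough estimate of the memory needed for the model weights alone.
# Arguments: parameters in billions, bits per weight (e.g. 16, 8 or 4).
# This ignores the KV cache and runtime overhead, so treat it as a lower bound.
estimate_weights_gb() {
  local params_billion="$1"
  local bits_per_weight="$2"
  # 1 billion parameters at 8 bits each is about 1 GByte;
  # the "+ 7" rounds up to the next full GByte
  echo $(( ($params_billion * $bits_per_weight + 7) / 8 ))
}

estimate_weights_gb 13 16   # 13B model with 16-bit weights
estimate_weights_gb 13 4    # the same model quantized to 4 bits
estimate_weights_gb 70 4    # 70B model quantized to 4 bits
```

By this estimate a 13B model needs about 26 GByte for 16-bit weights but only about 7 GByte at 4 bits, while a 4-bit 70B model still needs about 35 GByte.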
Download and install Ollama
Let’s switch to the /tmp directory. In this directory I download the Ollama archive (that’s currently around 1.7 GByte, so it may take a while):
cd /tmp
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
I extract the content of that archive to $HOME/.local (see file-hierarchy - Home directory). That will create bin and lib directories there (if they don’t exist already).
# Ensure $HOME/.local directory exists (independent of the current directory)
mkdir -p "$HOME/.local"
# Extract archive
tar -C "$HOME/.local" -xzf ollama-linux-amd64.tgz
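To confirm that the archive ended up where the rest of this post expects it, a small check like the following can help. It’s just a sketch: check_ollama_install is a helper name made up for this example, and it only looks for an executable at bin/ollama below the given prefix.

```shell
#!/usr/bin/env bash
# Check that an executable "ollama" exists below the given prefix.
# The prefix defaults to $HOME/.local; check_ollama_install is a helper
# name made up for this example.
check_ollama_install() {
  local prefix="${1:-$HOME/.local}"
  if [ -x "$prefix/bin/ollama" ]; then
    echo "found $prefix/bin/ollama"
  else
    echo "missing $prefix/bin/ollama" >&2
    return 1
  fi
}

# Print the version only if the binary is really there
if check_ollama_install "$HOME/.local"; then
  "$HOME/.local/bin/ollama" --version
fi
```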
Note: If you want to update Ollama at some point in the future, just run the commands above again and restart Ollama.
Adjust .bashrc
Make sure that $HOME/.local/bin is the first entry in your PATH variable by executing echo $PATH. If this is not the case, add the following to $HOME/.bashrc:
export PATH=$HOME/.local/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
To make the changes effective right away run:
source $HOME/.bashrc
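Instead of eyeballing the output of echo $PATH, the first entry can also be checked with a few lines of shell. This is only a sketch, and is_first_in_path is a helper name made up for this example.

```shell
#!/usr/bin/env bash
# Succeeds if the given directory is the first entry of a PATH-like string.
# The second argument defaults to the current $PATH.
# is_first_in_path is a helper name made up for this example.
is_first_in_path() {
  local dir="$1"
  local path_value="${2:-$PATH}"
  # ${path_value%%:*} strips everything from the first colon onwards
  [ "${path_value%%:*}" = "$dir" ]
}

if is_first_in_path "$HOME/.local/bin"; then
  echo "PATH is fine"
else
  echo "\$HOME/.local/bin is not the first PATH entry"
fi
```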
Ollama environment variables
I set the OLLAMA_HOST variable to make sure that Ollama not only listens on 127.0.0.1 (localhost) but on all interfaces. This of course depends on your security needs and might not fit your setup. In that case just leave out that variable.
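The Ollama documentation lists /usr/share/ollama/.ollama/models as the default location for downloaded model files on Linux. That path applies when Ollama runs as the ollama system user; started from your own user account as shown below, the models end up in $HOME/.ollama/models. Depending on how many models you want to download, you might want to specify a different location. A few models can already take 30-50 GByte. A different location can be specified by setting the OLLAMA_MODELS variable to a different path.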
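A minimal sketch of pointing Ollama at a different model directory could look like this. The helper name use_models_dir and the path /data/ollama/models are placeholders made up for this example, so pick whatever location has enough free space on your machine.

```shell
#!/usr/bin/env bash
# Create the model directory if needed and export OLLAMA_MODELS for this shell.
# use_models_dir is a helper name made up for this example.
use_models_dir() {
  local dir="$1"
  mkdir -p "$dir" || return 1
  export OLLAMA_MODELS="$dir"
  # Show how much space is left on that filesystem
  df -h "$dir"
}

# Example call; /data/ollama/models is a placeholder path:
# use_models_dir /data/ollama/models
```

The variable has to be set in the environment of the ollama serve process, because the server is the one that stores the models.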
Start Ollama
Now let’s start Ollama:
OLLAMA_HOST=0.0.0.0:11434 $HOME/.local/bin/ollama serve
Besides some other output I also get:
time=2024-12-03T21:35:21.131+01:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-12-03T21:35:21.338+01:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a2e0feb2-d63e-2326-bdbf-1223b4ad4b3b library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4080 SUPER" total="15.7 GiB" available="15.4 GiB"
Ollama detected my GPU. So that’s fine and models can be downloaded next.
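Since ollama serve keeps running in the foreground, a second terminal is handy to check that the server really answers. The sketch below polls the /api/version endpoint of the Ollama HTTP API with curl. wait_for_ollama is a helper name made up for this example, and the URL assumes the default port 11434 used above.

```shell
#!/usr/bin/env bash
# Poll the Ollama HTTP endpoint until it answers or the attempts run out.
# Arguments: base URL (default http://127.0.0.1:11434), number of attempts.
# wait_for_ollama is a helper name made up for this example.
wait_for_ollama() {
  local url="${1:-http://127.0.0.1:11434}"
  local attempts="${2:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors, -s: no progress output
    if curl -fs --max-time 2 "$url/api/version" >/dev/null; then
      return 0
    fi
    # Wait a second before the next attempt, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  return 1
}

# Example:
# wait_for_ollama http://127.0.0.1:11434 5 && echo "Ollama is up"
```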
Downloading models
All available models are listed on the Models page. Currently the best model available for a chat bot is most probably llama3.1. There are three sizes available: 8B, 70B and 405B. Only the 8B model fits in my GPU’s VRAM, so I’ll pull that one (that’s around 5 GByte to download). As ollama serve keeps running in the first terminal, I run this in a second one:
ollama pull llama3.1:8b
Another interesting model is llama3.2-vision:11b. That’s a model that can e.g. tell you what’s in a picture (that’s about 8 GByte to download):
ollama pull llama3.2-vision:11b
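Once a model is pulled, ollama list shows what’s available locally, and a quick request against the /api/generate endpoint confirms that the model actually answers. The following is a minimal sketch: build_generate_payload is a helper name made up for this example, and it does no JSON escaping at all, so it only works for simple prompts.

```shell
#!/usr/bin/env bash
# Build the JSON body for Ollama's /api/generate endpoint.
# build_generate_payload is a helper name made up for this example. It does no
# JSON escaping, so the prompt must not contain double quotes or backslashes.
build_generate_payload() {
  local model="$1"
  local prompt="$2"
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$model" "$prompt"
}

# Example (needs a running "ollama serve" and the pulled model):
# ollama list
# curl -s http://127.0.0.1:11434/api/generate \
#   -d "$(build_generate_payload llama3.1:8b 'Why is the sky blue?')"
```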
With Ollama installed and one or more models downloaded, Open WebUI can be configured and used. That will happen in the next blog post.