Turn your computer into an AI machine - Install Ollama

In the first part of this blog series I installed Open WebUI and everything needed to get it running. The Nvidia GPU drivers and the CUDA toolkit are also already in place.

Before I can use Open WebUI as a chat interface for a local large language model (LLM) I need software that provides an interface to such an LLM, and that's what Ollama does. It helps you get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models. A full list of supported LLMs is provided in the Model Library. Below the model library there is also the following note:

plain

You should have at least 8 GB of RAM available to run the 7B models,
16 GB to run the 13B models, and 32 GB to run the 33B models.

If your computer has around 40 GByte of RAM (you also need some RAM for your OS) you can basically run a 33B model with Ollama (the B stands for billion and we're talking about parameters). But as mentioned in my previous blog post, running an LLM on a CPU isn't much fun because of the long time it takes to answer your questions (and AFAIK that's true even on a very current CPU, but I might be wrong here). Nevertheless, with >= 64 GByte of RAM you can try running the Llama 3.3 70B model with Ollama. That's basically a GPT-4 class model and that's pretty impressive (who would have thought that a year ago…). The only problem here is that with “only” 64 GByte of RAM you won't be able to start many other programs anymore 😉

So even if you own an Nvidia 4090 with 24 GByte VRAM you can't run a 33B or even a 70B model on that GPU. Normally the more parameters a model has, the more complex patterns it can recognize in data. A larger number typically indicates a more complex model that can capture more nuances in language. Generally, models with more parameters can generate more coherent and contextually relevant responses, handle more complex tasks, and learn from more diverse data. However, having more parameters also means that the model requires more computational resources (like memory and processing power) to train and run.

So with my Nvidia 4080 SUPER with 16 GByte VRAM I can only run 13B models (maybe even 14B) or of course models with a lower number of parameters completely on the GPU. These models are already not that bad, but of course not as good as the models you get with e.g. ChatGPT from OpenAI.
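
To get a very rough feeling for these numbers, here is a back-of-the-envelope estimate (my own simplification, not an official formula): with 4-bit quantization a model needs roughly half a byte per parameter for the weights alone, plus extra for the context and runtime overhead.

bash

# Very rough rule of thumb: weights only, 4-bit quantization, no overhead
# 13B parameters * 0.5 byte per parameter ~= 6.5 GByte
awk 'BEGIN { printf "%.1f GByte\n", 13e9 * 0.5 / 1e9 }'
# An 8B model ends up at roughly 4-5 GByte, which matches the ~4.9 GB
# that "ollama list" reports later for llama3.1:8b (Q4_K_M).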

Let's switch to the /tmp directory. In this directory I download the Ollama archive (that's currently around 1.7 GByte, so this may take a while):

bash

cd /tmp
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz

I extract the content of that archive to $HOME/.local (see file-hierarchy - Home directory). That will create bin and lib directories there (if they are not already there).

bash

# Ensure that the $HOME/.local directory exists
mkdir -p $HOME/.local
# Extract the archive (downloaded to /tmp above)
tar -C $HOME/.local -xzf /tmp/ollama-linux-amd64.tgz

Note: If you want to update Ollama at some point in the future, just run the commands above again and restart the service. To avoid problems with errors like undefined symbol: ggml_backend_cuda_reg, delete the $HOME/.local/lib/ollama directory before you “untar” the archive.
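
Put together, an update could look like this (a sketch based on the commands above; adjust the restart to whichever systemd variant described further below you use):

bash

# Remove the old runtime libraries to avoid "undefined symbol" errors
rm -rf $HOME/.local/lib/ollama
# Download and extract the new version
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o /tmp/ollama-linux-amd64.tgz
tar -C $HOME/.local -xzf /tmp/ollama-linux-amd64.tgz
# Restart Ollama (here for the user unit described further below)
systemctl --user restart ollama.service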

Make sure that $HOME/.local/bin is the first entry in your PATH variable by executing echo $PATH. If this is not the case, add the following to $HOME/.bashrc:

bash

export PATH=$HOME/.local/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH

To make the changes effective right away run:

bash

source $HOME/.bashrc

I specify the OLLAMA_HOST variable to make sure that Ollama not only listens on 127.0.0.1 (localhost) but on all interfaces. This of course depends on your security needs and might not fit your setup. In that case just remove the variable.

By default Ollama stores the downloaded model files in /usr/share/ollama/.ollama/models. Depending on how many models you want to download you might want to specify a different location. A few models can already take 30-50 GByte. A different location can be specified by setting the OLLAMA_MODELS variable to a different path.
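
For example (the path below is just a placeholder, pick whatever fits your disk layout; for the systemd unit shown later this would go into an additional Environment= line):

bash

# Keep the downloaded models on a bigger/extra disk (placeholder path)
export OLLAMA_MODELS=/data/ollama/models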

Now let's start Ollama:

bash

OLLAMA_HOST=0.0.0.0:11434 $HOME/.local/bin/ollama serve

Besides some other output I also get:

plain

time=2024-12-03T21:35:21.131+01:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-12-03T21:35:21.338+01:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a2e0feb2-d63e-2326-bdbf-1223b4ad4b3b library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4080 SUPER" total="15.7 GiB" available="15.4 GiB"

Ollama detected my GPU. So that’s fine and models can be downloaded next.
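
A quick way to check that the Ollama API is reachable (locally or, with OLLAMA_HOST=0.0.0.0, from another host in your network) is a simple curl call against port 11434:

bash

# Should answer with a short status text ("Ollama is running")
curl http://localhost:11434/
# Lists the locally available models as JSON (empty for now)
curl http://localhost:11434/api/tags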

Instead of starting Ollama manually you can also create an ollama.service file and use systemd to start Ollama. The ollama.service file could look like this:

ini

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="HOME=/home/<user>"
Environment="PATH=$HOME/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0:11434"
ExecStart=/home/<user>/.local/bin/ollama serve
User=<user>
Group=<group>
Restart=always
RestartSec=3

[Install]
WantedBy=default.target

This method installs the service file system-wide. Replace <user> with your username and <group> with your group name. If you want Ollama to listen only on localhost, just remove Environment="OLLAMA_HOST=0.0.0.0:11434". Store the file in the /etc/systemd/system/ directory. Run sudo systemctl daemon-reload to load the new file and sudo systemctl start ollama.service to start the service. If you want to enable the service permanently run sudo systemctl enable ollama.service. Check the logs with sudo journalctl -f -t ollama.
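
The commands for the system-wide variant in short (assuming ollama.service is in your current directory):

bash

# Install and manage the service system-wide
sudo cp ollama.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl start ollama.service
sudo systemctl enable ollama.service   # optional: enable permanently
sudo journalctl -f -t ollama           # follow the logs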

If you don’t want to install the service file system-wide, you can use a user unit file. Create the directory $HOME/.config/systemd/user/ (if it doesn’t exist already) and put the ollama.service file there. Run systemctl --user daemon-reload and systemctl --user start ollama.service to start the service. If you want to enable the service permanently run systemctl --user enable ollama.service.
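
And the same steps for the user unit (no root required):

bash

# Install and manage the service as a user unit
mkdir -p $HOME/.config/systemd/user
cp ollama.service $HOME/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user start ollama.service
systemctl --user enable ollama.service   # optional: enable permanently
journalctl --user -f -u ollama.service   # follow the logs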

All available models are listed on the Models page. Currently the best model available for a chat bot is most probably llama3.1. It comes in three sizes: 8B, 70B and 405B. llama3.3 would also be an option and is a GPT-4 class model. But to load that one you need around 64 GByte of RAM to run it at least on the CPU. And be prepared that responses will take quite a long time. If you’ve got a GPU with > 50 GByte of VRAM: Congratulations! 😉

Only the 8B model fits in my GPU’s VRAM. So I’ll pull that one (that’s around 5 GByte to download):

bash

ollama pull llama3.1:8b

Another interesting model is llama3.2-vision:11b. That’s a model that can, for example, tell you what’s in a picture (that’s about 8 GByte to download):

bash

ollama pull llama3.2-vision:11b

By default Ollama has only a context length/context window (num_ctx) of 2048 (see the Modelfile parameters). For the bigger models that’s quite often too small. Ollama doesn’t throw an error if a request exceeds the context window. The context window becomes important the longer your requests get. So if you start a chat and play this “question/answer” game for a while, the text gets longer and longer. And every time the whole text needs to be submitted to the model. If the context window is too small, the oldest messages in the chat will be silently discarded and the answers you get might not be that useful anymore because some context is missing.
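
Besides the Modelfile approach I’ll use below, num_ctx can also be set per request via Ollama’s REST API. A quick sketch (the model name assumes the llama3.1:8b model pulled above):

bash

# Pass num_ctx as an option for a single request (REST API on port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the following text: ...",
  "options": { "num_ctx": 8192 }
}'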

To figure out the maximum context size of a model you can use the ollama show command. First let’s see which models are installed:

bash

ollama list

NAME                   ID              SIZE      MODIFIED     
llama3.2-vision:11b    38107a0cd119    7.9 GB    2 months ago    
llama3.1:8b            46e0c10c039e    4.9 GB    2 months ago

To get the context length:

bash

ollama show llama3.1:8b

  Model
    architecture        llama     
    parameters          8.0B      
    context length      131072    
    embedding length    4096      
    quantization        Q4_K_M
...

So for this model the context length is 131072. Now before you set num_ctx to that value: be careful 😉 As usual: the bigger the context length, the more CPU and RAM/VRAM you need. A good compromise is 8192 or 16384. This Reddit comment gives you a very rough impression of what to expect. So for 4 to 8 bit precision the difference between a context length of 256 and one of 16384 is roughly 0.5 to 1.5 GByte that you need to add to the RAM/VRAM requirements.

To make a model with a bigger context length available to Ollama I’ll now create a Modelfile. I’ll create a directory $HOME/ollama_models and, for the llama3.2-vision:11b model, a subdirectory llama3.2-vision-11b-16k (16k for a 16384 context length). In this directory I’ll create a file called Modelfile. The content of that file is just two lines:

plain

FROM llama3.2-vision:11b
PARAMETER num_ctx 16384

This looks a bit like a Dockerfile. The first line specifies which model to use as a base. In the second line I just set the value for the num_ctx parameter.

ollama create llama3.2-vision:11b-16k creates a model from that Modelfile. It doesn’t download the base model again, so the new model only takes up a few additional bytes, but now with a different parameter.
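
Putting the steps together, it could look like this (a sketch; the -f flag points ollama create to the Modelfile):

bash

# Create the directory layout and the Modelfile
mkdir -p $HOME/ollama_models/llama3.2-vision-11b-16k
cd $HOME/ollama_models/llama3.2-vision-11b-16k
printf 'FROM llama3.2-vision:11b\nPARAMETER num_ctx 16384\n' > Modelfile
# Register the new model variant with Ollama
ollama create llama3.2-vision:11b-16k -f Modelfile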

bash

ollama list

NAME                       ID              SIZE      MODIFIED           
llama3.2-vision:11b-16k    a681e5aed4ed    7.9 GB    About a minute ago    
llama3.2-vision:11b        38107a0cd119    7.9 GB    2 months ago          
llama3.1:8b                46e0c10c039e    4.9 GB    2 months ago

With Ollama installed and one or more models downloaded and/or configured, Open WebUI can be configured and used. That will happen in the next blog post.