Turn your computer into an AI machine - Install Ollama

Introduction
In the first part of this blog series I installed Open WebUI and everything to get it running. The Nvidia GPU drivers are also already in place and the CUDA toolkit too.
Before I can use Open WebUI as a chat interface for a local large language model (LLM), I need software that provides an interface to such an LLM, and that's what Ollama does. It lets you get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models. A full list of supported LLMs is provided in the Model Library. Below the model library there is also the following note:
You should have at least 8 GB of RAM available to run the 7B models,
16 GB to run the 13B models, and 32 GB to run the 33B models.
If your computer has around 40 GByte of RAM (you also need some RAM for your OS) you can basically run a 33B model with Ollama (the B stands for billion and we're talking about parameters). But as mentioned in my previous blog post, running an LLM on a CPU isn't much fun because of the long time it takes to answer your questions (and AFAIK not even on a very current CPU, but I might be wrong here). Nevertheless, with >= 64 GByte of RAM you can try running the Llama 3.3 70B model with Ollama. That's basically a GPT-4 class model and that's pretty impressive (who would have thought that a year ago…). The only problem here is that with "only" 64 GByte of RAM you won't be able to start many other programs anymore 😉
So even if you own an Nvidia 4090 with 24 GByte VRAM you can't run a 33B or even 70B model on that GPU. Normally, the more parameters a model has, the more complex patterns it can recognize in data. A larger number typically indicates a more complex model that can capture more nuances in language. Generally, models with more parameters can generate more coherent and contextually relevant responses, handle more complex tasks, and learn from more diverse data. However, having more parameters also means that the model requires more computational resources (like memory and processing power) to train and run.
So with my Nvidia 4080 SUPER with 16 GByte VRAM I can only run 13B models, maybe even 14B, or of course models with a lower number of parameters completely on the GPU. These models are already not that bad, but of course not as good as the models you get with e.g. ChatGPT from OpenAI.
Download and install Ollama
Let's switch to the /tmp directory. In this directory I download the Ollama archive (that's currently around 1.7 GByte, so it may take a while):
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
I extract the content of that archive to $HOME/.local (see file-hierarchy - Home directory). That will create (if not already there) bin and lib directories there.
# Ensure the $HOME/.local directory exists
[[ ! -d $HOME/.local ]] && mkdir $HOME/.local
# Extract archive
tar -C $HOME/.local -xzf ollama-linux-amd64.tgz
Note: If you want to update Ollama sometime in the future, just run the commands above again and restart the service. To avoid problems with errors like undefined symbol: ggml_backend_cuda_reg, delete the $HOME/.local/lib/ollama directory before you "untar" the archive.
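A minimal update sketch, assuming the archive is downloaded to /tmp again and Ollama runs via the systemd user service described further below, could look like this:
# Remove the old runtime libraries to avoid "undefined symbol" errors
rm -rf $HOME/.local/lib/ollama
# Download and extract the new release
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o /tmp/ollama-linux-amd64.tgz
tar -C $HOME/.local -xzf /tmp/ollama-linux-amd64.tgz
# Restart the service so the new binary is used
systemctl --user restart ollama.service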
Adjust .bashrc
Make sure that $HOME/.local/bin is the first entry in your PATH variable by executing echo $PATH. If this is not the case, add the following to $HOME/.bashrc:
export PATH=$HOME/.local/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH
To make the changes effective right away run:
source $HOME/.bashrc
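To verify that the shell now picks up the Ollama binary from $HOME/.local/bin, a quick check is:
# Should print something like /home/<user>/.local/bin/ollama
which ollama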
Ollama environment variables
I specify the OLLAMA_HOST variable to make sure that Ollama listens not only on 127.0.0.1 (localhost) but on all interfaces. This of course depends on your security needs and might not fit your setup. In that case just remove that variable.
By default Ollama stores the downloaded model files in /usr/share/ollama/.ollama/models. Depending on how many models you want to download, you might want to specify a different location. A few models can already take 30-50 GByte. A different location can be specified by setting the OLLAMA_MODELS variable to a different path.
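One way to set both variables would be to add something like this to $HOME/.bashrc (the model path below is just a hypothetical example, adjust it to a disk with enough free space):
# Make Ollama listen on all interfaces (remove if localhost only is fine for you)
export OLLAMA_HOST=0.0.0.0:11434
# Hypothetical example: store downloaded models on a bigger disk
export OLLAMA_MODELS=/data/ollama/models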
Start Ollama manually
Now let's start Ollama:
OLLAMA_HOST=0.0.0.0:11434 $HOME/.local/bin/ollama serve
Besides some other output I also get:
time=2024-12-03T21:35:21.131+01:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-12-03T21:35:21.338+01:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a2e0feb2-d63e-2326-bdbf-1223b4ad4b3b library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4080 SUPER" total="15.7 GiB" available="15.4 GiB"
Ollama detected my GPU. So that's fine and models can be downloaded next.
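To quickly check from another terminal that the server is reachable, you can query the version endpoint (assuming the default port 11434):
# Returns the installed version as JSON, e.g. {"version":"..."}
curl http://127.0.0.1:11434/api/version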
Start Ollama with a systemd service file
Instead of starting Ollama manually, you can also create an ollama.service file and use systemd to start Ollama. The ollama.service file could look like this:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
Environment="HOME=/home/<user>"
Environment="PATH=$HOME/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0:11434"
ExecStart=/home/<user>/.local/bin/ollama serve
User=<user>
Group=<group>
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
Using systemd system manager
This method installs the service file system wide. Replace <user> with your username and <group> with your group name. If you want Ollama to listen only on localhost, just remove Environment="OLLAMA_HOST=0.0.0.0:11434". Store the file in the /etc/systemd/system/ directory. Run sudo systemctl daemon-reload to load the new file and sudo systemctl start ollama.service to start the service. If you want to enable the service permanently, run sudo systemctl enable ollama.service. Check the logs with sudo journalctl -f -t ollama.
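Summarized, the system-wide setup could look like this (assuming ollama.service was created in the current directory):
sudo cp ollama.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl start ollama.service
# Optional: start the service automatically at boot
sudo systemctl enable ollama.service
# Follow the logs
sudo journalctl -f -t ollama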
Using systemd user system manager
If you don't want to install the service file system wide, you can use a user unit file. Create the directory $HOME/.config/systemd/user/ (if it doesn't exist already) and put the ollama.service file there (for a user unit the User= and Group= lines can be dropped, as the service runs as your user anyway). Run systemctl --user daemon-reload and systemctl --user start ollama.service to start the service. If you want to enable the service permanently, run systemctl --user enable ollama.service.
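The same steps for the user setup, summarized (again assuming ollama.service was created in the current directory):
mkdir -p $HOME/.config/systemd/user
cp ollama.service $HOME/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user start ollama.service
# Optional: start the service automatically when you log in
systemctl --user enable ollama.service
# Follow the logs
journalctl --user -f -u ollama.service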
Downloading models
All available models are listed at the Models page. Currently the best model available for a chat bot is most probably llama3.1. There are three sizes available: 8B, 70B and 405B. llama3.3 would also be an option and is a GPT-4 class model. But to load that one you need around 64 GByte of RAM to be able to run it at least on the CPU. And be prepared that responses will take quite a while. If you have a GPU with > 50 GByte of VRAM: Congratulations! 😉
Only the 8B model fits in my GPU's VRAM. So I'll pull that one (that's around 5 GByte to download):
ollama pull llama3.1:8b
Another interesting model is llama3.2-vision:11b. That's a model that can e.g. tell you what's in a picture (that's about 8 GByte to download):
ollama pull llama3.2-vision:11b
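For a quick test of the chat model directly on the command line (the prompt is just an example):
# One-shot prompt against the llama3.1:8b model
ollama run llama3.1:8b "Why is the sky blue?"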
Adjusting Ollama context length
By default Ollama has only a context length/context window (num_ctx) of 2048 (see Modelfile parameter). For the bigger models that's quite often too small. Ollama doesn't throw an error if the request exceeds the context window. The context window becomes important the longer your request gets. So if you start a chat and play this "question/answer" game for a while, the text gets longer and longer. And every time the whole text needs to be submitted to the model. If the context window is too small, the oldest messages in the chat will be silently discarded and the answers you get might not be that useful anymore if some context is missing.
To figure out the max. context size of a model you can use the ollama show command. First let's see what models are installed:
ollama list
NAME                   ID              SIZE      MODIFIED
llama3.2-vision:11b    38107a0cd119    7.9 GB    2 months ago
llama3.1:8b            46e0c10c039e    4.9 GB    2 months ago
To get the context length:
ollama show llama3.1:8b
  Model
    architecture        llama
    parameters          8.0B
    context length      131072
    embedding length    4096
    quantization        Q4_K_M
  ...
So for this model the context length is 131072. Now before you set num_ctx to that value, be careful 😉 As usual: the bigger the context length, the more CPU and RAM/VRAM you need. A good compromise is 8192 or 16384. This Reddit comment gives you a very rough impression of what to expect. So for 4 to 8 bit precision the difference between a 256 and a 16384 context length is roughly 0.5 to 1.5 GByte that you need to add to the RAM/VRAM requirements.
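By the way, to see how much memory a loaded model actually takes and whether it runs on the GPU, you can check with ollama ps after sending it a prompt. The output below is only an illustration of what such a line may look like, not a real measurement:
ollama ps
# NAME           ID              SIZE      PROCESSOR    UNTIL
# llama3.1:8b    46e0c10c039e    6.2 GB    100% GPU     4 minutes from now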
To make a model with a bigger context length available to Ollama, I'll now create a Modelfile. I'll create a directory $HOME/ollama_models. For the llama3.2-vision:11b model I'll create a directory llama3.2-vision-11b-16k (16k for 16384 context length). In this directory I'll create a file called Modelfile. The content of that file is just two lines:
FROM llama3.2-vision:11b
PARAMETER num_ctx 16384
This looks a bit like a Dockerfile. The first line specifies which model you want to use. In the next line I just set the value for the num_ctx parameter.
ollama create llama3.2-vision:11b-16k (run in the directory that contains the Modelfile) creates a model from that Modelfile. But it doesn't download the base model again. So this just uses a few more bytes but now with a different parameter.
ollama list
NAME                       ID              SIZE      MODIFIED
llama3.2-vision:11b-16k    a681e5aed4ed    7.9 GB    About a minute ago
llama3.2-vision:11b        38107a0cd119    7.9 GB    2 months ago
llama3.1:8b                46e0c10c039e    4.9 GB    2 months ago
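As far as I can tell, ollama show should now also list num_ctx among the parameters of the new model, and the model can be run like any other one:
ollama show llama3.2-vision:11b-16k
# Expect num_ctx 16384 to show up in the parameters section
ollama run llama3.2-vision:11b-16k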
With Ollama installed and one or more models downloaded and/or configured, Open WebUI can be configured and used, which will happen in the next blog post.