Turn your computer into an AI machine - Install Ollama

In the first part of this blog series I installed Open WebUI and everything needed to get it running. The Nvidia GPU drivers and the CUDA toolkit are already in place as well.

Before I can use Open WebUI as a chat interface for a local large language model (LLM) I need software that provides an interface to such an LLM, and that’s what Ollama does. It helps you get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models. A full list of supported LLMs is provided in the Model Library. Below the model library there is also the following note:

plain

You should have at least 8 GB of RAM available to run the 7B models,
16 GB to run the 13B models, and 32 GB to run the 33B models.

If your computer has around 40 GByte of RAM (the OS needs some RAM too) you can basically run a 33B model with Ollama (the B stands for billion, and it counts the model’s parameters). But as mentioned in my previous blog post, running an LLM on a CPU isn’t much fun because it takes a long time to answer your questions (AFAIK that’s true even for a very recent CPU, but I might be wrong here). Nevertheless, with >= 64 GByte of RAM you can try running the Llama 3.3 70B model with Ollama. That’s basically a GPT-4 class model, which is pretty impressive (who would have thought that a year ago…). The only problem is that with “only” 64 GByte of RAM you won’t be able to start many other programs anymore 😉
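The rule of thumb from the note can be turned into a quick check. The following is only a minimal sketch, assuming a Linux system that exposes MemAvailable in /proc/meminfo. The function names largest_model_class and available_ram_gb are made up for this example, and the thresholds are simply the ones quoted in the note.

```bash
#!/usr/bin/env bash
# Map an amount of RAM (whole GB) to the largest model class that the
# note in the model library considers runnable.
# NOTE: largest_model_class is a hypothetical helper, not part of Ollama.
largest_model_class() {
  local ram_gb=$1
  if   (( ram_gb >= 32 )); then echo "33B"
  elif (( ram_gb >= 16 )); then echo "13B"
  elif (( ram_gb >= 8  )); then echo "7B"
  else echo "none"
  fi
}

# Currently available RAM in GB, rounded down.
# /proc/meminfo reports the value in kB (assumption: Linux only).
available_ram_gb() {
  awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}
```

Running largest_model_class "$(available_ram_gb)" then prints the class for the current machine. Rounding down keeps the estimate on the cautious side.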

Going by these numbers, even if you own an Nvidia 4090 with 24 GByte of VRAM you can’t run a 33B, let alone a 70B, model completely on that GPU. Normally, the more parameters a model has, the more complex patterns it can recognize in data and the more nuances in language it can capture. Models with more parameters can generally produce more coherent and contextually relevant responses, handle more complex tasks, and learn from more diverse data. However, more parameters also mean that the model requires more computational resources (like memory and processing power) to train and run.

So with my Nvidia 4080 SUPER with 16 GByte of VRAM I can only run 13B models, maybe even 14B, and of course models with fewer parameters, completely on the GPU. These models are already not that bad, but of course not as good as the models you get with ChatGPT from OpenAI, for example.

Let’s switch to the /tmp directory. There I download the Ollama archive (currently around 1.7 GByte, so this may take a while):

bash

curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz

I extract the content of that archive to $HOME/.local (see file-hierarchy - Home directory). That creates the bin and lib directories there if they don’t exist yet.

bash

# Ensure the $HOME/.local directory exists (independent of the current directory)
mkdir -p "$HOME/.local"
# Extract the archive into $HOME/.local
tar -C "$HOME/.local" -xzf ollama-linux-amd64.tgz

Note: If you want to update Ollama at some point in the future, just run the download and extract commands above again and restart Ollama.
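To confirm that the archive ended up where expected, a small check like the following can help. This is just a sketch: check_ollama_install is a hypothetical helper name, and it only tests that an executable exists at bin/ollama below the given prefix.

```bash
#!/usr/bin/env bash
# Check that an executable ollama binary exists below the given prefix
# (defaults to $HOME/.local).
# NOTE: check_ollama_install is a hypothetical helper for illustration.
check_ollama_install() {
  local prefix=${1:-$HOME/.local}
  if [[ -x "$prefix/bin/ollama" ]]; then
    echo "found: $prefix/bin/ollama"
  else
    echo "missing: $prefix/bin/ollama" >&2
    return 1
  fi
}
```

A possible follow-up is check_ollama_install && "$HOME/.local/bin/ollama" --version, which prints the installed version.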

Make sure that $HOME/.local/bin is the first entry in your PATH variable by executing echo $PATH. If this is not the case, add the following to $HOME/.bashrc:

bash

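export PATH="$HOME/.local/bin:$PATH"
# Append the old value only if it is set, to avoid a trailing ":" (an empty entry means the current directory)
export LD_LIBRARY_PATH="$HOME/.local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"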

To make the changes effective right away run:

bash

source $HOME/.bashrc
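Since $HOME/.bashrc is read by every new interactive shell, and again by the source command above, plain export lines add the same directory to PATH once more each time. A minimal sketch of a duplicate-safe variant follows; path_prepend is a name made up for this example and not a shell builtin.

```bash
#!/usr/bin/env bash
# Prepend a directory to PATH only if it is not already listed.
# NOTE: path_prepend is a hypothetical helper name.
path_prepend() {
  local dir=$1
  case ":$PATH:" in
    *":$dir:"*) ;;                        # already present, nothing to do
    *) PATH="$dir${PATH:+:$PATH}" ;;      # put the directory in front
  esac
}
```

In $HOME/.bashrc the PATH line could then become path_prepend "$HOME/.local/bin" followed by export PATH. Note that the helper leaves PATH untouched if the directory is already listed further back, so it doesn’t guarantee the first position.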

In the start command further down I set the OLLAMA_HOST variable to make sure that Ollama listens not only on 127.0.0.1 (localhost) but on all interfaces. Whether that’s acceptable depends on your security needs and might not fit your setup. If it doesn’t, just leave that variable out.

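The Ollama documentation lists /usr/share/ollama/.ollama/models as the default location for downloaded model files on Linux. That applies when Ollama runs as the dedicated ollama system user; started from your own account as shown here, it should use $HOME/.ollama/models instead. Depending on how many models you want to download you might want to specify a different location, as a few models can already take 30-50 GByte. To do so, set the OLLAMA_MODELS variable to a different path.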
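Before downloading anything it can be useful to see where the models would end up and how much space is left there. The sketch below assumes that Ollama runs under your own account, where the default location is typically $HOME/.ollama/models (the /usr/share/ollama path applies when Ollama runs as the dedicated ollama service user). models_dir is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print the directory Ollama would use for models: OLLAMA_MODELS if set,
# otherwise the assumed per-user default $HOME/.ollama/models.
# NOTE: models_dir is a hypothetical helper for illustration.
models_dir() {
  echo "${OLLAMA_MODELS:-$HOME/.ollama/models}"
}
```

With that, mkdir -p "$(models_dir)" && df -h "$(models_dir)" shows the free space on the target file system. To move the models elsewhere, export OLLAMA_MODELS with a path of your choice before starting Ollama.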

Now let’s start Ollama:

bash

OLLAMA_HOST=0.0.0.0:11434 $HOME/.local/bin/ollama serve

Among other output I get the following lines:

plain

time=2024-12-03T21:35:21.131+01:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-12-03T21:35:21.338+01:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-a2e0feb2-d63e-2326-bdbf-1223b4ad4b3b library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4080 SUPER" total="15.7 GiB" available="15.4 GiB"

Ollama detected my GPU. So that’s fine and models can be downloaded next.
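Whether the server actually answers requests can be checked from a second terminal. The sketch below assumes that curl is installed and uses the /api/version endpoint of Ollama’s HTTP API. The helper names ollama_base_url and ollama_is_up are made up for this example.

```bash
#!/usr/bin/env bash
# Build the base URL a client on the same machine would use.
# A server bound to 0.0.0.0 is reached via 127.0.0.1 locally.
# NOTE: ollama_base_url and ollama_is_up are hypothetical helpers.
ollama_base_url() {
  local bind=${1:-127.0.0.1:11434}
  local host=${bind%%:*}
  local port=11434                      # fallback if no port is given
  if [[ $bind == *:* ]]; then port=${bind##*:}; fi
  if [[ $host == "0.0.0.0" ]]; then host=127.0.0.1; fi
  echo "http://$host:$port"
}

# Succeeds if the server answers on its version endpoint within 5 seconds.
ollama_is_up() {
  curl -fsS --max-time 5 "$(ollama_base_url "${1:-}")/api/version" >/dev/null
}
```

For the start command used above, ollama_is_up 0.0.0.0:11434 && echo "Ollama is reachable" would be a matching call.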

All available models are listed on the Models page. Currently the best model available for a chat bot is most probably llama3.1. It comes in three sizes: 8B, 70B and 405B. Only the 8B model fits in my GPU’s VRAM, so I’ll pull that one (that’s around 5 GByte to download):

bash

ollama pull llama3.1:8b

Another interesting model is llama3.2-vision:11b. That’s a model that can, for example, tell you what’s in a picture (that’s about 8 GByte to download):

bash

ollama pull llama3.2-vision:11b
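After pulling, ollama list shows the models that are available locally. The following sketch checks for a specific model in that output. It assumes the output has one header line and the model name in the first column, and has_model is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Succeeds if the given model name appears in the first column of
# `ollama list`-style output read from stdin (the header line is skipped).
# NOTE: has_model is a hypothetical helper for illustration.
has_model() {
  local name=$1
  awk -v want="$name" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}
```

A quick smoke test could then look like ollama list | has_model llama3.1:8b && ollama run llama3.1:8b "Why is the sky blue?", which sends a single prompt to the model and prints the answer.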

With Ollama installed and one or more models downloaded, Open WebUI can be configured and used, which will happen in the next blog post.