Robert Važan

Local LLMs on Linux with Ollama

I finally got around to setting up a local LLM, almost a year after I declared that AGI is here. I have low-cost hardware and I didn't want to tinker too much, so after messing around for a while, I settled on CPU-only Ollama and Open WebUI, both of which can be installed easily and securely in a container. Ollama has a big model library while Open WebUI is rich in convenient features. Ollama is built on top of the highly optimized llama.cpp.

Setup is super simple. No GPU is needed. Both projects have instructions for running in Docker containers. See the relevant Ollama blog post and the Open WebUI README. I have tweaked the instructions a bit to use podman instead of docker (I am using Fedora) and to restart the containers automatically after reboot:

podman run -d --name ollama --replace --restart=always \
    -p 11434:11434 -v ollama:/root/.ollama --stop-signal=SIGKILL \
    docker.io/ollama/ollama
podman run -d --name open-webui --replace --restart=always \
    -p 3000:8080 -v open-webui:/app/backend/data \
    --add-host=host.docker.internal:host-gateway \
    ghcr.io/open-webui/open-webui:main
systemctl --user enable podman-restart
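
To check that everything came up, poke the published ports: the Ollama API answers its root endpoint with a short status message, and Open WebUI should be reachable in a browser at http://localhost:3000.

curl http://localhost:11434/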

I have also created some aliases/scripts to make it very convenient to invoke ollama from the command line, because without them, the containerized CLI interface gets a bit verbose:

podman exec -it ollama ollama run tinyllama

Alternatively, run the CLI interface in a separate container:

podman run -it --rm --add-host=host.docker.internal:host-gateway \
    -e OLLAMA_HOST=host.docker.internal docker.io/ollama/ollama run tinyllama
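
The alias itself is trivial. One line like this in ~/.bashrc (the alias name is my own choice) makes plain ollama commands work as if the CLI were installed natively:

alias ollama='podman exec -it ollama ollama'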

Why run LLMs locally?

I used to have a GPT-4 subscription, but it was barely paying for itself. It saved less than 10% of my time and I wasted a lot of time tinkering with it. Local LLMs are free and increasingly good. Then there are all the issues with the cloud. A cloud LLM can change, disappear, or get more expensive at any moment. It keeps asking for my feedback and other data, which only serves the operator while my own data gets locked up. I am quite sensitive about privacy and freedom, and although I don't run into guardrails often, it's annoying when I do. I am also hoping that a local LLM will offer more control, because even though GPT-4 is smart, it's often unnecessarily creative when I just want it to follow instructions. The API gives more control, but it can get crazy expensive if some script gets stuck in a loop.

Choosing models

My current favorite models are dolphin-mistral 7B and deepseek-coder 6.7B. If you have plenty of DDR5 RAM, you might be interested in mixtral 8x7B. 3B and smaller models are really fast even on CPU, but they are a confused, hallucinating mess. If you must, orca-mini 3B is the least bad one. Uncensored models, if they work well, are preferable, because they are more amenable to tweaking with crafted prompts. 4-bit quantization makes models smaller and faster with negligible loss of accuracy. 3-bit quantization cuts into accuracy perceptibly, but it's still better than resorting to a smaller model. There's no point in running models with more than 4 bits per parameter. If you have powerful hardware, just run a larger model instead.
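
Models are pulled by name, optionally with a tag that picks the size and quantization; the default tags in the Ollama library are 4-bit quants, which matches the advice above. For the models mentioned here, the pulls look roughly like this (double-check the current tag names in the library):

podman exec -it ollama ollama pull dolphin-mistral
podman exec -it ollama ollama pull deepseek-coder:6.7b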

Open WebUI provides a convenient UI for Ollama's custom modelfiles, which can be used to tweak parameters. This is important for smaller models, which are often unsure of themselves. This uncertainty manifests as overly wide output probability distributions. To compensate, I tighten the available parameters (temperature, top_k, top_p) to narrow the distribution. I have even created custom modelfiles with greedy sampling (top_k = 1) for when I absolutely don't want any creativity. Beware that narrowing the output distribution is a hack that makes smaller models vulnerable to repeat loops, so use it with care. Sufficiently large models are confident in their output and should be controlled only via the prompt. Ditto for smaller models used for creative output. As for other parameters, I remove the output length limit (num_predict = -1) and relax repeat_penalty to allow the model to output repetitive code when I need it. I also expand the context size (num_ctx) above the default 2048, because Ollama does not handle conversations that overflow the context well. Mistral and Mixtral support 32K and DeepSeek Coder 16K, but beware of memory requirements. Custom modelfiles can also be used to tweak the system prompt, but this can hurt model performance if the model wasn't trained with diverse system prompts.
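
For illustration, a greedy-sampling modelfile along the lines described above could look like this (the base model, the 8K context, and the comments are my example choices; the parameter names are standard Ollama modelfile parameters):

FROM dolphin-mistral
# greedy sampling: always pick the most likely token
PARAMETER top_k 1
# no output length limit
PARAMETER num_predict -1
# relaxed repetition penalty (1.0 disables it), useful for repetitive code
PARAMETER repeat_penalty 1.0
# larger context than the default 2048; watch memory usage
PARAMETER num_ctx 8192

You can paste the same thing into Open WebUI's modelfiles UI, or register it through the containerized CLI:

podman cp Modelfile ollama:/tmp/Modelfile
podman exec -it ollama ollama create dolphin-greedy -f /tmp/Modelfile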

No matter how much you tweak them, small local models aren't of much practical use. Compared to cloud LLMs, a local 7B model is an alpha-stage technology demo. Nobody would pay for GPT-4 if the free 175B ChatGPT was good enough, so what do you expect from a 7B model? People mostly use local LLMs for entertainment, especially role-play. The more serious business use cases rely on fine-tuning, which is currently impractical for individual users and next to impossible without a high-end GPU. Smaller models are however okay for simple Q&A and topic exploration. They can serve as a natural language scripting engine if the task is simple enough and the LLM is properly instructed and provided with examples. Summarization and document indexing are feasible, but you need a GPU to process the long prompts quickly. Code completion might work well if the editor supports it and you have a GPU for fast context processing. Text completion might work even without a GPU if you write the text top-down.

Speeding things up

Hardware is a big problem, BTW. I have a few-months-old computer, but a low-cost one. I am not an LLM nerd like the guys hanging out at /r/LocalLLaMA who build multi-GPU rigs just to run the largest LLMs. GPUs are bloody expensive these days and they do not have anywhere near enough RAM. I therefore opted for a cheap box with an iGPU and lots of system RAM. The downside is that inference is slow. Token rate is about 20-35% lower than what you would guess from model size and memory bandwidth, probably because of context access, but also because some parts of inference are not memory-bound. Models become barely comfortable to use at speeds above 10 tokens/second, which is approximately what you can expect from a 7B model like Mistral on 2-channel DDR4-3200.
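
A back-of-envelope check: dual-channel DDR4-3200 moves roughly 51 GB/s and a 4-bit 7B model weighs about 4 GB, so the naive ceiling is around 12 tokens/second, and the 20-35% overhead brings that down to roughly the 10 tokens/second observed in practice. To see what your own machine does, ollama run has a verbose mode that prints timing statistics, including prompt evaluation and generation rates, after each response:

podman exec -it ollama ollama run dolphin-mistral --verbose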

To speed things up, my system prompt, if the model supports system prompts well, usually consists of only two sentences: one giving the AI a role (assistant) and one requesting brevity and accuracy. Ollama can cache the system prompt, but keeping it short still helps a bit, especially with the first query. I stick to one model to avoid the cost of Ollama switching models and clearing the attention cache. The attention cache (also called KV cache) is essential for performance in multi-turn conversations. If it is cleared, Ollama reconstructs it by reevaluating the whole conversation from the beginning, which is slow on CPU. Open WebUI can use the configured LLM to generate titles in your chat history, but it's such a performance killer that I have disabled the feature. By default, Ollama unloads the model and discards the attention cache after 5 minutes of inactivity. This can be configured as "Keep Alive" in Open WebUI settings and I have set it to a higher value to ensure a quick response even if I come back to the conversation a bit later.
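
Keep-alive can also be set per request through the Ollama API, which is handy in scripts. A request with just a model name preloads the model, and recent Ollama versions accept a keep_alive field controlling how long it stays resident (the one-hour value is my example):

curl http://localhost:11434/api/generate \
    -d '{"model": "dolphin-mistral", "keep_alive": "1h"}'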

Real-time priority

Llama.cpp is very sensitive to competition from background processes, running as much as 2x slower even if the background process is in a cgroup with a low CPU share. The most likely cause is that the background process interferes with scheduling of llama.cpp's threads: some thread falls behind and the other threads sit idle while they wait for it to catch up. This is hard to fix purely within llama.cpp code, at least for the transformer architecture, which requires the implementation to repeatedly parallelize small chunks of work, syncing threads after every chunk. To fix this at the system level, we can tinker with scheduler configuration, specifically with real-time priorities. I run CPU-hogging background processes all the time, so I invested the necessary effort into granting Ollama real-time priority:

sudo podman run -d --name ollama --replace --restart=always \
    -p 11434:11434 -v ollama:/root/.ollama --stop-signal=SIGKILL \
    --cap-add=SYS_NICE --entrypoint=/bin/sh \
    docker.io/ollama/ollama \
    -c 'chrt 1 /bin/ollama serve'
sudo systemctl enable podman-restart

Rootless podman ignores SYS_NICE, so run with sudo. I tried both the round-robin scheduler (chrt default) and the FIFO scheduler, but I don't see any difference. Interestingly, real-time schedulers are 10-20% slower than the default scheduler on an unloaded system, probably because the default scheduler is a bit smarter about spreading load evenly over all cores. But the massive boost under load is worth it. With real-time priority, Ollama performs almost as well as it does on an unloaded system. The system remains stable, because I have a CPU with hyperthreading, which Ollama does not use, so apparent CPU usage is only 50% and the system can schedule other processes freely. I nevertheless noticed significant interference with other real-time processes, notably audio playback. Be warned that without hyperthreading, Ollama with real-time priority will probably crash the system.
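
To confirm that the real-time policy took effect, check the scheduling class of the server process from the host; containerized processes show up in the host's process table, so something like this should report a round-robin (or FIFO) policy with priority 1 (pgrep -xo picks the oldest process named exactly "ollama"):

chrt -p $(pgrep -xo ollama)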

Thread count

Ollama allocates one thread per physical core, but this can be configured in a custom modelfile. My experiments show that inference can work with fewer threads, because it is bottlenecked on RAM bandwidth. It even runs slightly faster with one thread less on an unloaded system. But prompt processing can definitely use all available cores. Increasing the thread count beyond one thread per core actually worsens performance, probably because instruction-level parallelism already fully utilizes all cores and additional threads just introduce thread coordination issues. It's therefore best to stick with the default number of threads.
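
In case you want to experiment anyway, the relevant modelfile knob is num_thread (the value here is just an example for a 6-core CPU):

FROM dolphin-mistral
# override the default of one thread per physical core
PARAMETER num_thread 5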

Making good use of iGPU

Running on CPU cores is the trouble-free solution, but CPU-only computers usually have an iGPU as well, which happens to be faster than all CPU cores combined despite its tiny size and low power consumption. With some tinkering and a bit of luck, you can employ the iGPU to improve performance.

Even though iGPUs can dynamically allocate host RAM via UMA/GTT/GART and llama.cpp supports it via a compile-time switch, there's currently no UMA support in Ollama. The only option is to reserve some RAM as dedicated VRAM in the BIOS if your system supports it (some notebooks don't). In my case, it defaults to a puny 512MB, but it can be configured to any power of two up to 16GB. I opted for 8GB of VRAM, which is sufficient for a quantized 7B model (4GB), KV cache and buffers (1GB), desktop and applications (1-2GB), and some headroom (1GB). Multimodal llava 7B is a bit larger (5GB), but it still fits. If you want to run 13B models, you will need to reserve 16GB of RAM as VRAM.

How you run Ollama with GPU support depends on the GPU vendor. I have an AMD processor, so these instructions are AMD-only. To make Ollama use the iGPU on AMD processors, you will need the Docker image variant that bundles ROCm, AMD's GPU compute stack. It's a separate image, because ROCm adds 4GB to the image size (no kidding). You will also have to give it a few more parameters:

podman run -d --name ollama --replace --restart=always \
    -p 11434:11434 -v ollama:/root/.ollama --stop-signal=SIGKILL \
    --device /dev/dri --device /dev/kfd \
    -e HSA_OVERRIDE_GFX_VERSION=9.0.0 -e HSA_ENABLE_SDMA=0 \
    docker.io/ollama/ollama:rocm

ROCm has a very short list of supported GPUs. The environment variables trick ROCm into using the unsupported iGPU in my Ryzen 5600G. You might have to adjust the variables and Ollama/ROCm versions for other unsupported GPUs. According to what I have read on the topic of GPU access in containers, the container remains properly sandboxed despite all the sharing.
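
Whether ROCm actually picked up the iGPU and how many layers were offloaded is easiest to see in the server log right after the first request; the exact wording of the messages varies between versions, but GPU detection and layer offload counts are there:

podman logs -f ollama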

If you do all this and Ollama does not error out, crash, or hang, you should get a nice performance boost. In my case, I am observing over 2x faster prompt processing, approaching 40 tokens/second for a 7B model on an unloaded system. Processing of image inputs in llava is also 2x faster, but it's still impractical at 17 seconds per image. Generation speed is still limited by memory bandwidth at 10 tokens/second, but it is no longer impacted by background workload on the CPU, which is the killer feature of iGPU inference for me. Ollama still uses some CPU time even if the whole model runs on the iGPU, but the CPU load is negligible now.

The whole thing is a bit shaky though. An unsupported GPU means this can break with any future update. Inference on the iGPU sometimes goes off the rails and produces garbage until Ollama is restarted. Even when it works, the output is a tiny bit different from what the CPU produces (with top_k = 1), and the first run on the iGPU produces slightly different output than the second and subsequent runs. Ollama sometimes fails to offload all layers to the iGPU when switching models, reporting low VRAM as if parts of the previous model were still in VRAM. This hurts performance and it gets worse over time, but restarting Ollama fixes the problem for a while. Offloading of Mixtral layers to the iGPU is broken. The model just hangs.

What to expect in the future

This sets priorities for future hardware purchases. Nothing else on my computer suffers from hardware constraints as much as local LLMs. If you are willing to pay hundreds of euros per year per subscription for access to cloud models, you might as well spend a thousand euros or more on new hardware to run models locally and get local model benefits like privacy, control, and choice. Local compute also eliminates usage caps and network latency of cloud models.

High-end DDR5 doubles memory bandwidth, which makes Mixtral and 13B dense models sufficiently fast, but larger dense models will not be practical without more memory channels, which are currently rare and expensive. GPUs have a wide memory bus, but they instead constrain model size via limited VRAM. You need 2x16GB for 30B+ models and 3x16GB for 70B models. 24GB GPUs are unreasonably expensive. A smaller 8-16GB GPU setup is still useful for multimodal models like llava and for long prompts, but even some iGPUs are going to be fast enough for that. The newly announced CPUs with in-package high-speed RAM will enable iGPUs to run 30B+ models.

There are also plenty of opportunities for software and model optimizations, which is where I hope to get a significant performance boost in the next year or two. Mistral shows that a well-trained 7B model can deliver impressive results. A properly trained 3B model could approximate it while delivering lightning speed on large prompts. Code and text completion is an obvious application for local LLMs, but editor support is still scarce and often cumbersome. Domain-specific models could crush much larger generalists, but there are hardly any specialist models at the moment. Lightweight local fine-tuning could fix style and conventions without excessive prompting, but it's not exactly a pushbutton experience yet. Letting LLMs access tools, the Internet, and supporting databases can help overcome their size limitations. RWKV, Mamba, and ternary networks promise faster inference and other benefits. Speculative decoding can help a lot, but open-weights models don't ship with the draft models it needs. Beam search would be essentially free for local inference. iGPUs and AMD/Intel dGPUs could help with multimodal models, long prompts, and energy efficiency, but most of them sit idle for lack of software support. MoE and sparsity are underutilized.

I am very optimistic about software improvements. The area is exciting and attracts lots of talented people. I am not going to contribute anything beyond bug reports though, because I need to tend to my own business and LLMs are a mere productivity boost for me. Money for training hardware will keep coming from governments and enterprises worried about data security. Hardware will also improve, although not as quickly, because hardware is expensive to change and also because vendors hesitate to commit to computational primitives that might be rendered obsolete by next year's software optimizations.

I am confident there will be steady and fairly fast progress in local LLMs, but cloud LLMs will not go away. With high sparsity and other optimizations, cloud LLMs will eventually grow to be as big as search engines. Instead of replacing cloud LLMs, local LLMs will evolve to support different use cases.