Setting Up Nixie

The rig arrived. Time to turn a gaming PC into an inference server.

Ubuntu Server

USB stick, minimal install. OpenSSH server selected during setup, everything else skipped. Grabbed the IP, went headless. From this point on, everything is SSH.

BIOS

A few settings matter for GPU workloads:

Above 4G Decoding: enabled. Required for 24GB VRAM.
Re-Size BAR Support: enabled. Improves GPU memory access.
IOMMU: enabled. Needed for PCIe device isolation, useful if you ever want VM passthrough.
SVM Mode: enabled. AMD’s virtualisation support.

These are easy to miss and cause mysterious problems later.

Staying Alive

AI models eat memory. On a headless box you can’t see it happening, and by the time Linux’s OOM killer fires the system is usually already unresponsive. SSH stops responding. The only fix is a power cycle.

The Hard Way: This happened. Walked to the garage, pressed the power button, walked back. Still no SSH. Checked the switch. No lights. Swapped the cable. Still no lights. Dug out a keyboard and monitor to log in at the console. Then realised I'd forgotten my password.

GRUB recovery. Mash Escape during boot, edit the linux line, change ro quiet splash to rw init=/bin/bash, Ctrl+X. Root shell. Reset the password. Then hit a DNS issue. Minimal Ubuntu install, no dhclient, no dig, empty resolv.conf. Manually added nameservers to get online. Eventful afternoon.

Two things prevent this:

zram creates compressed swap in RAM. A model that almost fits will run instead of killing the system.

ALGO=zstd
PERCENT=50
PRIORITY=100

That gives about 16GB of compressed swap on a 32GB system.

earlyoom watches memory and kills processes before things lock up. Critically, it preserves SSH access:

EARLYOOM_ARGS="-r 60 -m 5 -s 5 --avoid '(^|/)(sshd|systemd|init)$' --prefer '(^|/)(ollama|python|llama)'"

Kill the inference process, not the SSH session. On a headless box this is the difference between a minor inconvenience and driving to the garage to press the power button.

NVIDIA Drivers

sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt update
sudo apt install -y nvidia-driver-580-open nvidia-cuda-toolkit

With Secure Boot enabled, the driver needs a Machine Owner Key enrolled. During installation you set a password, then on reboot there’s a blue MOK Management screen to complete enrollment. Miss it and the driver won’t load. For a home inference box, Secure Boot is optional. I turned it off.

nvidia-smi

RTX 3090, 24GB VRAM, driver loaded. Done.

Ollama

curl -fsSL https://ollama.ai/install.sh | sh

First models pulled:

llama3.2:3b for testing
llama3.1:8b for general use
llama3.1:70b-instruct-q4_K_M because I wanted to see what 24GB could actually handle

The 70b model pushes the VRAM right to the limit. It loads, it runs, but there’s no headroom for anything else. The sweet spot turned out to be the 7-14b range. Enough capability to be useful, enough VRAM left over for concurrent requests.

What I Actually Wanted

Ollama was running. I could hit the API from my Mac. But what I really wanted was agents and humans in the same chat, triggering things. Not just asking a model questions. I wanted to send a message and have it kick off a workflow. Nobody was really doing this yet.

Claude had no voice at the time. ChatGPT had just added it. Both were single-user, single-conversation tools. I wanted multiple chat rooms, phone clients, bots that could act on messages. The art of the possible was obvious but no one had built the plumbing.

Mattermost

I looked at the options. Slack is SaaS with message limits on the free tier. Discord is SaaS. Mattermost was open source, self-hosted, already had bot features. Closest to what I needed.

I ran it in Docker alongside Ollama and wrote a Go bot to wire them together. Started with a sample from GitHub, rewrote it for my use case: poll channels, maintain per-user conversation context, feed messages to Ollama, post responses back.

It worked. Self-hosted chat talking to local models. Multiple rooms. Phone client. No data leaving the network.

Then I hit the paywall. Some of the features I needed were enterprise-only. That took the wind out of it. I shelved the project, went skiing with the family, and came back to it a few weeks later with fresh eyes and a different approach.

SSH Config

On the Mac, one line in ~/.ssh/config made everything seamless:

Host nixie
    HostName nixie
    User carl
    ForwardAgent yes
    LocalForward 11434 localhost:11434

Tailscale handles the hostname resolution. Port forwarding means Ollama’s API is available on localhost:11434 from the Mac, as if it were running locally. Any tool that talks to Ollama just works without knowing it’s hitting a remote machine.

Ubuntu Server

BIOS

Staying Alive

NVIDIA Drivers

Ollama

What I Actually Wanted

Mattermost

SSH Config

Finding an Inference Box on Facebook Marketplace

Semantic Recipe Search with LangChain, pgvector, and Local Embeddings

Ubuntu Server

BIOS

Staying Alive

NVIDIA Drivers

Ollama

What I Actually Wanted

Mattermost

SSH Config

Related

Finding an Inference Box on Facebook Marketplace

Semantic Recipe Search with LangChain, pgvector, and Local Embeddings