Setting Up Nixie
Ubuntu Server on a headless box with an RTX 3090. NVIDIA drivers, Ollama, memory tuning, and a Go bot wired to Mattermost.
The rig arrived. Time to turn a gaming PC into an inference server.
Ubuntu Server
USB stick, minimal install. OpenSSH server selected during setup, everything else skipped. Grabbed the IP, went headless. From this point on, everything is SSH.
BIOS
A few settings matter for GPU workloads:
- Above 4G Decoding: enabled. Lets the firmware map PCIe device memory above the 4GB address boundary, and is a prerequisite for Re-Size BAR.
- Re-Size BAR Support: enabled. Improves GPU memory access.
- IOMMU: enabled. Needed for PCIe device isolation, useful if you ever want VM passthrough.
- SVM Mode: enabled. AMD’s virtualisation support.
These are easy to miss and cause mysterious problems later.
Staying Alive
AI models eat memory. On a headless box you can’t see it happening, and by the time Linux’s OOM killer fires the system is usually already unresponsive. SSH stops responding. The only fix is a power cycle.
The Hard Way: This happened. Walked to the garage, pressed the power button, walked back. Still no SSH. Checked the switch. No lights. Swapped the cable. Still no lights. Dug out a keyboard and monitor to log in at the console. Then realised I'd forgotten my password.
GRUB recovery. Mash Escape during boot, edit the linux line, change ro quiet splash to rw init=/bin/bash, Ctrl+X. Root shell. Reset the password. Then hit a DNS issue. Minimal Ubuntu install, no dhclient, no dig, empty resolv.conf. Manually added nameservers to get online. Eventful afternoon.
Two things prevent this:
zram creates compressed swap in RAM. A model that almost fits will run instead of killing the system. The settings below use the variable names from the zram-tools package's /etc/default/zramswap:
ALGO=zstd
PERCENT=50
PRIORITY=100
That gives about 16GB of compressed swap on a 32GB system.
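The 16GB figure is just PERCENT applied to total memory. A minimal sketch of that arithmetic, assuming the size is derived from the MemTotal line in /proc/meminfo (which reports slightly less than the installed RAM); the function names are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// memTotalKiB pulls the MemTotal value (in KiB) out of /proc/meminfo text.
func memTotalKiB(meminfo string) (int64, error) {
	for _, line := range strings.Split(meminfo, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, errors.New("MemTotal not found")
}

// zramSizeKiB applies a PERCENT-style setting to total memory.
func zramSizeKiB(totalKiB, percent int64) int64 {
	return totalKiB * percent / 100
}

func main() {
	data, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		fmt.Println("cannot read /proc/meminfo (not Linux?):", err)
		return
	}
	total, err := memTotalKiB(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	size := zramSizeKiB(total, 50) // PERCENT=50 from the config above
	fmt.Printf("zram device: %.1f GiB\n", float64(size)/(1024*1024))
}
```

Note that this is the uncompressed capacity of the swap device. How much RAM it actually costs depends on how well the pages compress.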
earlyoom watches memory and kills processes before things lock up. Critically, it preserves SSH access:
EARLYOOM_ARGS="-r 60 -m 5 -s 5 --avoid '(^|/)(sshd|systemd|init)$' --prefer '(^|/)(ollama|python|llama)'"
Kill the inference process, not the SSH session. On a headless box this is the difference between a minor inconvenience and driving to the garage to press the power button.
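It is worth checking what those two patterns actually match before trusting them. A rough sketch that runs the regexes from the line above against a few sample process names. It assumes matching happens against a bare process name, and it collapses earlyoom's scoring into three labels, which is a simplification. The sample names are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied from the EARLYOOM_ARGS line above.
var (
	avoid  = regexp.MustCompile(`(^|/)(sshd|systemd|init)$`)
	prefer = regexp.MustCompile(`(^|/)(ollama|python|llama)`)
)

// classify reports which pattern a process name hits.
// Simplification: avoid wins if both patterns match.
func classify(name string) string {
	switch {
	case avoid.MatchString(name):
		return "avoid"
	case prefer.MatchString(name):
		return "prefer"
	default:
		return "neutral"
	}
}

func main() {
	names := []string{"sshd", "sshd-session", "systemd", "ollama", "python3", "llama-server", "nginx"}
	for _, name := range names {
		fmt.Printf("%-13s %s\n", name, classify(name))
	}
}
```

The avoid pattern ends in $, so it only protects exact names: a process called sshd-session would fall through to neutral. The prefer pattern has no $, so python3 and llama-server both match.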
NVIDIA Drivers
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt update
sudo apt install -y nvidia-driver-580-open nvidia-cuda-toolkit
With Secure Boot enabled, the driver needs a Machine Owner Key enrolled. During installation you set a password, then on reboot there’s a blue MOK Management screen to complete enrollment. Miss it and the driver won’t load. For a home inference box, Secure Boot is optional. I turned it off.
nvidia-smi
RTX 3090, 24GB VRAM, driver loaded. Done.
Ollama
curl -fsSL https://ollama.ai/install.sh | sh
First models pulled:
- llama3.2:3b for testing
- llama3.1:8b for general use
- llama3.1:70b-instruct-q4_K_M because I wanted to see what 24GB could actually handle
The 70b model pushes the VRAM right to the limit. It loads, it runs, but there’s no headroom for anything else. The sweet spot turned out to be the 7-14b range. Enough capability to be useful, enough VRAM left over for concurrent requests.
What I Actually Wanted
Ollama was running. I could hit the API from my Mac. But what I really wanted was agents and humans in the same chat, triggering things. Not just asking a model questions. I wanted to send a message and have it kick off a workflow. Nobody was really doing this yet.
Claude had no voice at the time. ChatGPT had just added it. Both were single-user, single-conversation tools. I wanted multiple chat rooms, phone clients, bots that could act on messages. The art of the possible was obvious but no one had built the plumbing.
Mattermost
I looked at the options. Slack is SaaS with message limits on the free tier. Discord is SaaS. Mattermost was open source, self-hosted, already had bot features. Closest to what I needed.
I ran it in Docker alongside Ollama and wrote a Go bot to wire them together. Started with a sample from GitHub, rewrote it for my use case: poll channels, maintain per-user conversation context, feed messages to Ollama, post responses back.
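The core of that loop, minus the Mattermost side, looks roughly like this. It is a sketch, not the original bot: the Bot type, the ChatFunc indirection, and the message cap are all assumptions made for illustration. The only real API used is Ollama's POST /api/chat with streaming turned off.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// Message is one entry in the "messages" array of Ollama's /api/chat.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// ChatFunc sends a full history to a model and returns the reply.
type ChatFunc func(history []Message) (Message, error)

// OllamaChat returns a ChatFunc backed by POST {baseURL}/api/chat.
func OllamaChat(baseURL, model string) ChatFunc {
	client := &http.Client{Timeout: 2 * time.Minute}
	return func(history []Message) (Message, error) {
		body, err := json.Marshal(map[string]any{
			"model":    model,
			"messages": history,
			"stream":   false, // ask for one JSON object, not a stream
		})
		if err != nil {
			return Message{}, err
		}
		resp, err := client.Post(baseURL+"/api/chat", "application/json", bytes.NewReader(body))
		if err != nil {
			return Message{}, err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return Message{}, fmt.Errorf("ollama returned %s", resp.Status)
		}
		var out struct {
			Message Message `json:"message"`
		}
		err = json.NewDecoder(resp.Body).Decode(&out)
		return out.Message, err
	}
}

// Bot keeps a separate, capped conversation history per user ID.
type Bot struct {
	chat    ChatFunc
	maxMsgs int // hypothetical cap so context does not grow without bound
	mu      sync.Mutex
	history map[string][]Message
}

func NewBot(chat ChatFunc, maxMsgs int) *Bot {
	return &Bot{chat: chat, maxMsgs: maxMsgs, history: make(map[string][]Message)}
}

// Reply adds the user's text to their history, asks the model, stores the
// answer, and returns it. The lock is held across the model call for
// simplicity, which serialises requests.
func (b *Bot) Reply(userID, text string) (string, error) {
	b.mu.Lock()
	defer b.mu.Unlock()

	msgs := append(b.history[userID], Message{Role: "user", Content: text})
	answer, err := b.chat(msgs)
	if err != nil {
		return "", err // stored history is left unchanged on failure
	}
	msgs = append(msgs, answer)
	if len(msgs) > b.maxMsgs {
		msgs = msgs[len(msgs)-b.maxMsgs:] // drop the oldest messages
	}
	b.history[userID] = msgs
	return answer.Content, nil
}

func main() {
	bot := NewBot(OllamaChat("http://localhost:11434", "llama3.1:8b"), 20)
	reply, err := bot.Reply("carl", "Say hello in five words.")
	if err != nil {
		fmt.Println("ollama not reachable:", err)
		return
	}
	fmt.Println(reply)
}
```

Keeping the model call behind a ChatFunc means the context handling can be exercised with a stub, without a GPU in the loop. A real bot would call Reply from whatever loop reads new posts from the chat server.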
It worked. Self-hosted chat talking to local models. Multiple rooms. Phone client. No data leaving the network.
Then I hit the paywall. Some of the features I needed were enterprise-only. That took the wind out of it. I shelved the project, went skiing with the family, and came back to it a few weeks later with fresh eyes and a different approach.
SSH Config
On the Mac, one Host entry in ~/.ssh/config made everything seamless:
Host nixie
    HostName nixie
    User carl
    ForwardAgent yes
    LocalForward 11434 localhost:11434
Tailscale handles the hostname resolution. Port forwarding means Ollama’s API is available on localhost:11434 from the Mac, as if it were running locally. Any tool that talks to Ollama just works without knowing it’s hitting a remote machine.
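A quick way to confirm the tunnel is doing its job is to list the models from the Mac side. A small sketch, assuming an SSH session to nixie is open so the forward is active. It uses Ollama's GET /api/tags endpoint; the function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// parseTags extracts model names from the JSON body of GET /api/tags.
func parseTags(data []byte) ([]string, error) {
	var payload struct {
		Models []struct {
			Name string `json:"name"`
		} `json:"models"`
	}
	if err := json.Unmarshal(data, &payload); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(payload.Models))
	for _, m := range payload.Models {
		names = append(names, m.Name)
	}
	return names, nil
}

// listModels asks the Ollama instance at baseURL which models it has pulled.
func listModels(baseURL string) ([]string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/api/tags")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseTags(data)
}

func main() {
	// localhost:11434 here is the Mac end of the LocalForward above.
	names, err := listModels("http://localhost:11434")
	if err != nil {
		fmt.Println("tunnel not up or Ollama not running:", err)
		return
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```

If the list comes back with the models pulled on nixie, the forward is working and nothing on the Mac needs to know the server's real address.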