My Ollama server runs on a dedicated Ubuntu machine with an RTX 3090. It pulls 100W at idle and up to 500W under GPU load. It was running 24/7 even though I only use it a few times a day through Friday and my Telegram bot. That’s roughly 72 kWh/month just idling — around R$ 63/month here in Curitiba for a machine that’s actually working maybe 30 minutes a day.
The fix: a transparent proxy that wakes the machine via Wake-on-LAN when a request comes in, and a systemd timer that suspends it when idle.
## The Architecture
```
friday (host) → localhost:11434 (proxy) → 192.168.1.18:11434 (ollama)
                                                ↑
                                        WOL if unreachable
```
The proxy sits on my homelab server (which runs 24/7 anyway for all other services). It listens on port 11434 — the same port Ollama uses. Any service that previously pointed at the Ollama machine’s IP now points at 127.0.0.1:11434 instead. Zero changes to the clients.
When a request arrives:
- Proxy checks if the Ollama machine is reachable (TCP connect, cached for 5 seconds)
- If it’s up, forward the request immediately — no overhead
- If it’s down, send a WOL magic packet and poll every 2 seconds until it wakes (up to 90 seconds)
- Forward the request with full streaming support
On the Ollama machine side, a systemd timer checks every 2 minutes:
- Are any models loaded? (Ollama auto-unloads after 5 minutes of inactivity)
- Is anyone SSH’d in?
- If both are no → suspend
So the flow is: I send a Telegram message → friday hits the proxy → the proxy wakes the machine (~15 seconds) → Ollama serves the request → 5 minutes of silence → the model unloads → next timer tick → the machine suspends.
## The WOL Proxy
The proxy is a single Python file using aiohttp. It runs in a Docker container with network_mode: host so it can send UDP broadcast packets for WOL and listen on port 11434 directly.
`ollama-wol-proxy.py`:

```python
#!/usr/bin/env python3
import asyncio
import logging
import os
import socket
import time

from aiohttp import ClientSession, ClientTimeout, web

log = logging.getLogger("ollama-wol-proxy")

OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "192.168.1.18")
OLLAMA_PORT = int(os.environ.get("OLLAMA_PORT", "11434"))
LISTEN_PORT = int(os.environ.get("LISTEN_PORT", "11434"))
MAC_ADDRESS = os.environ.get("MAC_ADDRESS", "")
BROADCAST_IP = os.environ.get("BROADCAST_IP", "192.168.1.255")

WOL_TIMEOUT = 90          # seconds to wait for the machine to wake
WOL_POLL_INTERVAL = 2     # seconds between reachability polls while waking
REACHABLE_CACHE_TTL = 5   # seconds to cache the TCP reachability check
REQUEST_TIMEOUT = 600     # seconds allowed for a single proxied request

_last_reachable_check: float = 0
_last_reachable_result: bool = False
_wake_lock = asyncio.Lock()
```

The key design decisions:
Reachability caching — Every request checks if Ollama is up via a TCP connect attempt, but the result is cached for 5 seconds. This avoids hammering the network on every request while still detecting a suspended machine quickly.
Wake lock — If 5 requests arrive while the machine is asleep, you don’t want 5 WOL packets and 5 independent polling loops. An asyncio.Lock() ensures only one wake cycle runs at a time. All concurrent requests wait on the same lock and proceed together once the machine is up.
Streaming support — LLM responses are streamed token-by-token. The proxy uses iter_any() to forward chunks as they arrive, so the user sees tokens appearing in real time, not a delayed bulk response.
The WOL magic packet itself is pure Python — no external dependencies:
```python
def send_wol(mac: str) -> None:
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    magic = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic, (BROADCAST_IP, 9))
```

A WOL magic packet is just 6 bytes of `0xFF` followed by the target MAC address repeated 16 times, sent as a UDP broadcast to port 9.
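A quick offline sanity check of that layout:

```python
mac = "cc:28:aa:ca:d3:00"
mac_bytes = bytes.fromhex(mac.replace(":", ""))
magic = b"\xff" * 6 + mac_bytes * 16

assert len(mac_bytes) == 6          # a MAC address is 6 bytes
assert len(magic) == 102            # 6 sync bytes + 16 × 6-byte MAC
assert magic[:6] == b"\xff" * 6
assert magic[6:12] == mac_bytes
```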
## Docker Setup
```dockerfile
FROM python:3.12-slim
RUN pip install --no-cache-dir aiohttp
COPY ollama-wol-proxy.py /app/proxy.py
CMD ["python3", "-u", "/app/proxy.py"]
```

```yaml
# docker-compose.yml
services:
  ollama-proxy:
    build: .
    container_name: ollama-proxy
    network_mode: host
    restart: unless-stopped
    env_file: .env
```

The `.env` file holds the target machine's MAC address and IP:
```
OLLAMA_HOST=192.168.1.18
OLLAMA_PORT=11434
LISTEN_PORT=11434
MAC_ADDRESS=cc:28:aa:ca:d3:00
BROADCAST_IP=192.168.1.255
```
## The Ollama Machine Setup
### Enabling Wake-on-LAN
WOL needs to be enabled on the NIC. On Ubuntu with a wired connection:
```bash
# Check current status
sudo ethtool eno1 | grep Wake-on
#   Wake-on: d   ← disabled

# Enable it
sudo ethtool -s eno1 wol g
#   Wake-on: g   ← enabled
```

This doesn't survive reboots. To persist it, create a script that runs when the network interface comes up:
```bash
#!/bin/bash
# /etc/networkd-dispatcher/configuring.d/wol.sh
ethtool -s eno1 wol g
```

Make it executable with `chmod +x`.
### The Auto-Sleep Script
This is the script that decides whether to suspend:
```bash
#!/bin/bash
# /usr/local/bin/ollama-autosleep.sh

# Don't suspend if models are loaded (ollama unloads after 5min idle)
MODELS_LOADED=$(curl -sf http://localhost:11434/api/ps 2>/dev/null | grep -c '"model"' || true)
[ "${MODELS_LOADED:-0}" -gt 0 ] && exit 0

# Don't suspend if someone is logged in via SSH
LOGGED_IN=$(who | wc -l)
[ "$LOGGED_IN" -gt 0 ] && exit 0

logger -t ollama-autosleep "Ollama idle, suspending"
systemctl suspend
```

Two checks, both intentional:
Models loaded — Ollama’s /api/ps endpoint returns currently loaded models. By default, Ollama keeps a model in memory for 5 minutes after the last request (keep_alive). Each new request resets this timer. So the machine won’t suspend mid-conversation — only after 5 minutes of complete silence.
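The shell script greps for `"model"` in the raw JSON; a structured version of the same check (a sketch, assuming the documented `/api/ps` response shape with a top-level `models` array) could be:

```python
import json

def models_loaded(ps_body: str) -> int:
    """Count currently loaded models from an Ollama /api/ps response body."""
    try:
        return len(json.loads(ps_body).get("models", []))
    except (ValueError, AttributeError):
        return 0

# Abridged example of what /api/ps returns while a model is resident:
sample = '{"models": [{"name": "llama3:8b", "model": "llama3:8b"}]}'
```

The grep version is good enough here, but the structured check won't miscount if a future response happens to contain the substring `"model"` somewhere unexpected.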
SSH sessions — If I’m SSH’d in doing maintenance, I don’t want the machine to suspend under me.
I initially had more checks (uptime threshold, active TCP connections), but simplified it down to these two. The model-loaded check effectively covers the “active connection” case because Ollama won’t unload while serving requests.
### Systemd Timer
The timer runs the check every 2 minutes, with a 5-minute grace period after boot to let services stabilize:
```ini
# /etc/systemd/system/ollama-autosleep.service
[Unit]
Description=Ollama auto-sleep idle check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ollama-autosleep.sh
```

```ini
# /etc/systemd/system/ollama-autosleep.timer
[Unit]
Description=Check Ollama idle every 2 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=2min

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now ollama-autosleep.timer`.
Make sure your `ollama.service` is enabled too (`systemctl is-enabled ollama`) — it needs to start automatically when the machine wakes from suspend.
## The Client Change
The only change on the client side is swapping the IP:
{
"providers": {
"custom": {
"apiKey": "ollama",
"apiBase": "http://127.0.0.1:11434/v1"
}
}
}From 192.168.1.18 to 127.0.0.1. That’s it. The proxy is completely transparent — same port, same API, same streaming behavior.
## The Numbers
Here in Curitiba, Copel charges around R$ 0.88/kWh for residential.
| | Before | After |
|---|---|---|
| Daily uptime | 24h | ~1.25h |
| Idle power | 100W × 24h = 2,400 Wh | 100W × 1h = 100 Wh |
| GPU power | 350W × 2h = 700 Wh | 350W × 15min = 88 Wh |
| Monthly consumption | ~93 kWh | ~5.6 kWh |
| Monthly cost | ~R$ 82 | ~R$ 5 |
| Annual savings | | ~R$ 920 |
The actual savings depend on usage. Heavy usage (hours of inference daily) shifts the balance, but the idle elimination is the big win. A machine that sits idle 22 hours a day shouldn’t be drawing 100W for 22 hours.
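A back-of-the-envelope check of those figures, under the stated assumptions (100 W idle, 350 W under GPU load, R$ 0.88/kWh, 30-day month; the "after" usage is an estimate):

```python
RATE_BRL_PER_KWH = 0.88   # Copel residential rate, approximate
DAYS = 30

def monthly_kwh(idle_hours: float, load_hours: float) -> float:
    """Daily idle + GPU-load energy, scaled to a month, in kWh."""
    return (100 * idle_hours + 350 * load_hours) * DAYS / 1000

before = monthly_kwh(idle_hours=24, load_hours=2)    # always on, ~2h inference/day
after = monthly_kwh(idle_hours=1, load_hours=0.25)   # ~1h awake, ~15min inference/day
annual_savings = (before - after) * RATE_BRL_PER_KWH * 12
```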
## Edge Cases
A few scenarios I considered:
Machine already awake — The TCP check is cached for 5 seconds. If the machine is up, the request forwards immediately with negligible overhead.
Multiple requests during wake — The asyncio lock serializes wake attempts. One WOL packet is sent, all waiting requests proceed together once the machine responds.
Request arrives while suspending — The auto-sleep script checks for loaded models and active connections before suspending. If a request is in-flight, Ollama will have a model loaded, so the script exits without suspending.
WOL fails — The proxy returns a 503 after 90 seconds. The client can retry.
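On the client side that retry can be as simple as the following stdlib-only sketch (a hypothetical helper, not part of the proxy):

```python
import time
import urllib.error
import urllib.request

def post_with_retry(url: str, data: bytes, retries: int = 3,
                    backoff: float = 2.0) -> bytes:
    """POST with exponential backoff on 503 (proxy gave up waking the machine)."""
    for attempt in range(retries):
        try:
            req = urllib.request.Request(
                url, data=data, headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 503 or attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)
```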
Proxy itself crashes — Docker’s restart: unless-stopped brings it back automatically.
## Key Takeaways
- WOL is a simple, reliable way to wake machines on demand — the magic packet is just a UDP broadcast
- `network_mode: host` in Docker is necessary for both UDP broadcast (WOL) and binding to a specific port on the host
- Ollama's `keep_alive` timer is the natural "idle detector" — no need to build custom activity tracking
- The proxy adds essentially zero latency when the machine is already awake (one cached TCP check)
- A systemd timer is cleaner than a cron job for periodic checks — built-in logging, dependency management, and `OnBootSec` for boot delay
## Updates
- 2026-03-03 — Initial publication
