Series
Multi-part guides that build up from scratch to a complete, working system.
Self-hosted AI Inference with Ollama
Go from a bare server to a production-grade local AI inference stack. Each post adds one layer — runtime, reverse proxy, async queue, chat UI, voice interface, code assistant integration, and observability. By the end you have a complete, self-hosted setup you understand top to bottom.
9 parts
74 min total
April 2026
- 1. Ollama from zero: run a local LLM and hit its API (12 min)
  Install Ollama, pull a model, and call the API from the terminal and from code. The starting point for everything that follows in this series.
- 2. Serving Ollama properly: nginx, TLS, and remote access (6 min)
  Move Ollama from a localhost experiment to a properly served API: nginx reverse proxy, TLS termination, authentication, and remote access with a hardened systemd unit.
- 3. Async inference with queues: BullMQ and Redis in front of Ollama (10 min)
  Decouple callers from Ollama with a BullMQ job queue backed by Redis. Clients enqueue inference requests and poll for results — no more hanging HTTP connections during long generations.
- 4. Adding a UI: Open WebUI wired to your Ollama stack (5 min)
  Deploy Open WebUI as a chat interface for your Ollama stack. Covers Docker-based deployment, nginx integration, user management, and locking down access.
- 5. Voice interface for Ollama: speech-to-text and text-to-speech with Open WebUI (4 min)
  Add a voice interface to your Ollama stack: a local speaches server for speech-to-text and text-to-speech, wired into Open WebUI.
- 6. Open WebUI extensions: memory, tools, code execution, vision, and SSO (6 min)
  Extend your Open WebUI setup with persistent memory, custom tools, sandboxed code execution, vision and OCR, and single sign-on. Each feature is self-contained — add the ones relevant to your use case.
- 7. Claude Code + Ollama: local model as your code assistant backend (7 min)
  Wire Ollama into Claude Code via MCP so you can route specific tasks to a local model. Covers the MCP server setup, Claude Code configuration, and practical use cases where local inference makes sense.
- 8. Observability for your Ollama stack: Prometheus, Grafana, Loki, and structured logs (16 min)
  Add Prometheus metrics, Loki log aggregation, and Grafana dashboards to your Ollama stack. Track token throughput, generation latency, queue depth, and nginx traffic — broken down by model and route.
- 9. Ollama infra series wrap-up: architecture, hardware lessons, and what to add next (3 min)
  The full architecture diagram for the Ollama stack we built across 8 posts, hardware lessons from running it, and a prioritized list of what to tackle next.
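As a taste of where part 1 starts, Ollama's `/api/generate` endpoint can be called with nothing but the Python standard library. A minimal sketch, assuming Ollama's default listen address (`localhost:11434`) and that a model has already been pulled; the model name `llama3` here is a placeholder for whichever one you use:

```python
import json
import urllib.request

# Ollama's default local address; part 2 of the series puts nginx in front of this.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False makes Ollama return one JSON object instead of
    # a stream of newline-delimited chunks.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The completed text lives under the "response" key.
        return json.loads(resp.read())["response"]
```

With a model pulled, `generate("llama3", "Why is the sky blue?")` returns the completion as a plain string; the same request works with `curl` against the same endpoint.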