This is part 2 of the Ollama infra series. Part 1 covered installation and the basic API. Here we harden the setup for use by other machines on your network or the internet: nginx in front, TLS, a shared secret for access control, and a systemd unit that survives reboots and crashes.


Why put nginx in front

Ollama’s built-in server is not designed to be internet-facing. It has no TLS support, no authentication, and no rate limiting. nginx handles all of that and more:

  • TLS termination (your clients talk HTTPS, nginx talks HTTP to Ollama on localhost)
  • A shared secret header as a lightweight auth layer
  • Request logging in a format you can feed into observability tools later
  • Connection limits and timeouts
  • A stable public address that does not change if you move Ollama to a different port

Binding Ollama to localhost only

Depending on how it was configured in part 1, Ollama may be listening on 0.0.0.0:11434, i.e. on every interface. Pin it to localhost so nothing on the network reaches it directly:

systemctl edit ollama

This opens an override file. Add:

[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"

Save and reload:

systemctl daemon-reload
systemctl restart ollama

Verify:

ss -tlnp | grep 11434
# should show 127.0.0.1:11434, not 0.0.0.0:11434

From this point, nothing outside the machine can reach Ollama; only local processes, including nginx, can connect.
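To confirm the bind took effect, check the port from a different machine on your network. A minimal sketch; the hostname is a placeholder for your server's LAN address:

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Run from a DIFFERENT machine, replacing the placeholder host.
# After the localhost bind, 11434 should report closed from outside:
# print(is_port_open("inference-host", 11434))   # expect False
```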


nginx configuration

Install nginx if you do not have it:

apt install nginx

Create /etc/nginx/sites-available/inference:

upstream ollama_backend {
    server 127.0.0.1:11434;
    keepalive 8;
}

server {
    listen 80;
    server_name inference.yourdomain.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name inference.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/inference.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/inference.yourdomain.com/privkey.pem;
    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         HIGH:!aNULL:!MD5;

    # Long timeouts — LLM inference takes time
    proxy_read_timeout    300s;
    proxy_connect_timeout 10s;
    proxy_send_timeout    300s;

    # Required for streaming responses
    proxy_buffering       off;
    proxy_cache           off;

    location /direct/ {
        if ($http_x_ollama_key != "your-secret-here") {
            return 403;
        }
        rewrite ^/direct/(.*) /$1 break;
        proxy_pass         http://ollama_backend;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
        proxy_set_header   Host localhost;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto $scheme;
    }

    # json_combined is defined in nginx.conf; see the log format section below
    access_log /var/log/nginx/inference-access.log json_combined;
    error_log  /var/log/nginx/inference-error.log;
}

Enable and test:

ln -s /etc/nginx/sites-available/inference /etc/nginx/sites-enabled/inference
nginx -t
systemctl reload nginx
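The config above uses "your-secret-here" as a placeholder. Replace it with a high-entropy value; one way to generate one is Python's secrets module (equivalently, `openssl rand -hex 32`):

```python
import secrets

# 32 random bytes -> 64 hex characters. Paste this into the nginx
# config and into every client's X-Ollama-Key header.
token = secrets.token_hex(32)
print(token)
```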

TLS with Let’s Encrypt

If your server has a public domain pointing at it:

apt install certbot python3-certbot-nginx
certbot --nginx -d inference.yourdomain.com

Certbot will modify your nginx config to add the certificate paths and set up auto-renewal via a systemd timer.

For a private network (no public domain), use a self-signed certificate or a wildcard cert from your internal CA. With self-signed certs, clients must either trust the CA or disable certificate verification — only do the latter on private networks.


Hardening the systemd unit

The default Ollama unit works but has no restart policy or resource limits. A complete override:

systemctl edit ollama

[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=4"
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

Relevant environment variables:

Variable                    What it controls
OLLAMA_HOST                 Bind address
OLLAMA_MAX_LOADED_MODELS    How many models stay in memory simultaneously
OLLAMA_NUM_PARALLEL         Concurrent requests per model
OLLAMA_KEEP_ALIVE           How long a model stays loaded after the last request (default: 5m)
OLLAMA_MAX_QUEUE            Max queued requests before returning 503

OLLAMA_KEEP_ALIVE controls the trade-off between cold-start latency and memory use. For a server that handles bursts, set it to 30m or longer. For a memory-constrained machine with multiple models, set it to 5m or 0.
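The keep-alive window can also be overridden per request: Ollama's generate endpoint accepts a keep_alive field in the request body, which takes precedence over the server-wide setting for that model load. A sketch of such a payload, to be sent with the authenticated client shown in the next section:

```python
# Per-request override of OLLAMA_KEEP_ALIVE. "0" unloads the model as
# soon as the response finishes; -1 keeps it resident indefinitely;
# durations like "30m" work as well.
payload = {
    "model": "llama3.2:3b",
    "prompt": "What is a keepalive?",
    "stream": False,
    "keep_alive": "30m",
}
# client.post("/direct/api/generate", json=payload)
```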


Calling the remote API

From any machine with network access, add the secret header:

$ curl -s https://inference.yourdomain.com/direct/api/generate \
    -H "X-Ollama-Key: your-secret-here" \
    -d '{
    "model": "llama3.2:3b",
    "prompt": "What is a keepalive?",
    "stream": false
  }' | jq
{
  "model": "llama3.2:3b",
  "created_at": "2026-04-21T09:54:18.622159665Z",
  "response": "A keepalive, in computing and networking contexts, refers to a periodic message or signal sent ...",
  "done": true,
  "done_reason": "stop",
  "context": [...],
  "total_duration": 26965557054,
  "load_duration": 255380905,
  "prompt_eval_count": 31,
  "prompt_eval_duration": 450871626,
  "eval_count": 357,
  "eval_duration": 25763962391
}
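The *_duration fields are nanoseconds. A quick way to turn the sample response above into a throughput figure:

```python
# Values taken from the response above.
eval_count = 357               # "eval_count": tokens generated
eval_duration = 25_763_962_391 # "eval_duration": ns spent generating

tokens_per_sec = eval_count / (eval_duration / 1e9)
print(f"{tokens_per_sec:.1f} tok/s")  # about 13.9 tok/s
```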

With the Python client:

import httpx

client = httpx.Client(
    base_url="https://inference.yourdomain.com",
    headers={"X-Ollama-Key": "your-secret-here"},
    timeout=120,
)
response = client.post("/direct/api/generate", json={
    "model": "llama3.2:3b",
    "prompt": "What is a keepalive?",
    "stream": False,
})
print(response.json()["response"])
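With "stream": true, Ollama sends one JSON object per line until a final object with "done": true; this is what the proxy_buffering off directive in the nginx config exists for. A sketch of consuming such a stream, with the parsing split out so it stands alone; the live-server usage in the comment assumes the httpx client defined above:

```python
import json

def iter_tokens(lines):
    """Yield the "response" fragment from each newline-delimited JSON
    chunk, stopping after the final chunk where "done" is true."""
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break

# Against a live server:
#   with client.stream("POST", "/direct/api/generate",
#                      json={"model": "llama3.2:3b",
#                            "prompt": "What is a keepalive?",
#                            "stream": True}) as r:
#       for fragment in iter_tokens(r.iter_lines()):
#           print(fragment, end="", flush=True)
```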

Log format for observability

nginx’s default combined log format is fine, but a JSON format is easier to parse later when we add structured logging in part 7. Add to /etc/nginx/nginx.conf inside the http block:

log_format json_combined escape=json
    '{'
    '"time":"$time_iso8601",'
    '"remote_addr":"$remote_addr",'
    '"x_forwarded_for":"$http_x_forwarded_for",'
    '"request_id":"$request_id",'
    '"host":"$host",'
    '"method":"$request_method",'
    '"uri":"$uri",'
    '"args":"$args",'
    '"protocol":"$server_protocol",'
    '"status":$status,'
    '"bytes_sent":$bytes_sent,'
    '"body_bytes_sent":$body_bytes_sent,'
    '"request_time":"$request_time",'
    '"upstream_addr":"$upstream_addr",'
    '"upstream_status":"$upstream_status",'
    '"upstream_response_time":"$upstream_response_time",'
    '"upstream_connect_time":"$upstream_connect_time",'
    '"upstream_header_time":"$upstream_header_time",'
    '"referer":"$http_referer",'
    '"user_agent":"$http_user_agent"'
    '}';

Then in your inference server block:

access_log /var/log/nginx/inference-access.log json_combined;

Each request now lands as a JSON object — easy to tail, easy to ship to a log aggregator, easy to query with jq.

A sample entry, pretty-printed with jq:

$ tail -f /var/log/nginx/inference-access.log | jq .
{
  "time": "2026-04-21T12:04:25+02:00",
  "remote_addr": "127.0.0.1",
  "x_forwarded_for": "",
  "request_id": "32a558cfdd53f4e2134e48e572d8e306",
  "host": "inference.yourdomain.com",
  "method": "POST",
  "uri": "/api/generate",
  "args": "",
  "protocol": "HTTP/1.1",
  "status": 200,
  "bytes_sent": 4091,
  "body_bytes_sent": 3905,
  "request_time": "25.246",
  "upstream_addr": "127.0.0.1:11434",
  "upstream_status": "200",
  "upstream_response_time": "25.246",
  "upstream_connect_time": "0.001",
  "upstream_header_time": "25.246",
  "referer": "",
  "user_agent": "curl/8.5.0"
}
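jq one-liners cover ad-hoc queries; for anything recurring, a short Python pass over the same file works too. A sketch that counts requests and tracks the worst request_time per status code, assuming the log path from the config above:

```python
import json

def summarize(log_lines):
    """Count requests and track the slowest request_time per status."""
    by_status = {}
    for line in log_lines:
        entry = json.loads(line)
        by_status.setdefault(entry["status"], []).append(
            float(entry["request_time"])
        )
    return {
        status: {"count": len(times), "max_s": max(times)}
        for status, times in sorted(by_status.items())
    }

# with open("/var/log/nginx/inference-access.log") as f:
#     print(summarize(f))
```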

What you have now

  • Ollama bound to localhost, unreachable directly
  • nginx in front with TLS and a shared secret
  • A hardened systemd unit that restarts on failure
  • JSON access logs ready for the observability post
  • Remote clients can call the API from anywhere with the right credentials