Ollama from zero: run a local LLM and hit its API
This is part 1 of the Ollama infra series. By the end you will have Ollama running, a model pulled, and the HTTP API responding to requests. Every subsequent post in the series builds on this foundation.
What Ollama is
Ollama is a runtime that manages local language models. It handles downloading model weights, quantization selection, memory mapping, and exposes a local HTTP server with an API that is compatible with a subset of the OpenAI API. You run a model the same way you run a Docker container — pull it by name, run it, stop it.
What it is not: a cloud service, a chat UI, or a fine-tuning tool. It is a local inference server. Everything happens on your machine.
Hardware requirements
Ollama runs on Linux, macOS, and Windows. GPU acceleration is optional but strongly recommended for anything above 3B parameters.
| Setup | What works well |
|---|---|
| CPU only | 1B–3B models at low throughput |
| 8 GB VRAM (consumer GPU) | 7B models at comfortable speed |
| 16 GB VRAM | 13B models, or 7B with headroom |
| 24 GB+ VRAM | 30B+ models |
For this series, any machine with 8 GB RAM can follow along using a 3B model.
VRAM, RAM, and how models load
Your computer has two kinds of memory:
- RAM — system memory the CPU reads from. Fast enough for general computation, typically 16–64 GB on a desktop.
- VRAM — memory physically on the GPU chip. The GPU can only run math on data already in VRAM. Much higher bandwidth than RAM for the kind of parallel operations a GPU does. Typically 4–24 GB on a consumer card.
When you run a model, Ollama loads the weights from disk into memory. If the model fits in VRAM, the GPU handles all inference — fast. If it does not fit, Ollama offloads layers to RAM — slower.
You can see how a loaded model is placed in memory with:
$ ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
llama3.2:3b a80c4f17acd5 2.5 GB 100% CPU 4096 4 minutes from now
100% GPU means the model is fully in VRAM. A split like 50% GPU / 50% CPU means it partially spilled to RAM. In the example above, 100% CPU means the model is entirely in RAM.
Quantization
A language model is a large list of numbers — the weights. A 7B model has 7 billion of them. At full precision each number takes 4 bytes: 7B × 4 bytes = 28 GB. That is too large for most hardware.
Quantization is lossy compression applied to those weights. Instead of storing each weight as a precise 32-bit float, you store an approximation using fewer bits:
| Format | Bits per weight | 7B model size | Quality loss |
|---|---|---|---|
| F32 | 32 | ~28 GB | none (baseline) |
| F16 | 16 | ~14 GB | negligible |
| Q8_0 | 8 | ~7 GB | very small |
| Q4_K_M | 4 | ~4.5 GB | small, usually acceptable |
| Q2_K | 2 | ~2.5 GB | noticeable |
Ollama defaults to Q4_K_M when you pull a model without specifying a tag. It is the practical sweet spot — roughly half the size of Q8, with quality that is hard to distinguish for most tasks. The tag you pull determines the quantization: llama3.2:3b gets the default, llama3.2:3b-instruct-q8_0 gets Q8.
The rule of thumb: ~0.6 GB of VRAM per billion parameters at Q4_K_M. A 7B model needs ~4.5 GB, an 11B model needs ~7 GB, a 3B model needs ~2 GB.
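The arithmetic behind the rule of thumb is simple enough to sketch as a quick estimator. This is a toy calculation using the ~0.6 GB-per-billion figure from above, not a measurement — real usage varies by architecture and adds KV cache and runtime overhead on top:

```python
def estimate_vram_gb(params_billion: float, gb_per_billion: float = 0.6) -> float:
    """Rough VRAM estimate for a Q4_K_M-quantized model (weights only)."""
    return params_billion * gb_per_billion

# Rule-of-thumb estimates for common sizes
for size in (3, 7, 11):
    print(f"{size}B -> ~{estimate_vram_gb(size):.1f} GB")
```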
Tags follow the pattern name:size-variant-quantization. Some examples:
| Tag | What it is |
|---|---|
| llama3.2:3b | Default 3B, Q4_K_M quantization |
| llama3.2:3b-instruct-q8_0 | 3B instruct-tuned, Q8 quantization |
| llama3.2-vision:11b | 11B multimodal model (text + images) |
| codellama:7b | 7B model trained on code |
| nomic-embed-text | Embedding model, not for generation |
Not all models are for chat or generation. Embedding models (used for semantic search and RAG) are also in the registry — they have a different API endpoint (/api/embeddings) and do not respond to /api/generate.
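For example, assuming nomic-embed-text has been pulled, the embeddings endpoint takes a prompt and returns a vector instead of text:

```shell
curl http://localhost:11434/api/embeddings -s \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "What is a reverse proxy?"
  }' | jq '.embedding | length'
```

The output is the embedding dimension (768 for nomic-embed-text) rather than generated tokens — which is why these models have their own endpoint.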
Installing Ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
The installer places the binary at /usr/local/bin/ollama and registers a systemd service that starts on boot. The service runs as the ollama user created during installation.
Verify:
ollama --version
macOS:
Download the app from ollama.com and drag it to Applications. The menu bar icon indicates when Ollama is running.
Windows:
Download the installer from ollama.com. It installs a background service.
Finding models
Ollama’s model registry is at ollama.com/library. Every model page shows available tags, parameter counts, quantization variants, and download size. Check the page before pulling — it tells you the VRAM requirement and whether the model supports tools or vision.
Once a model is pulled, you can inspect its details locally:
$ ollama show llama3.2:3b
Model
architecture llama
parameters 3.2B
context length 131072
embedding length 3072
quantization Q4_K_M
Capabilities
completion
tools
Parameters
stop "<|start_header_id|>"
stop "<|end_header_id|>"
stop "<|eot_id|>"
Context length
The context length in ollama show is the maximum number of tokens the model was trained on — llama3.2:3b reports 131072. That is not what Ollama allocates by default.
Ollama’s default context window depends on available VRAM: 4k for machines with less than 24 GiB, scaling up from there. To override it, set the environment variable before starting the server:
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
For a persistent systemd setup, add it to the service's environment and restart:
systemctl edit ollama
[Service]
Environment="OLLAMA_CONTEXT_LENGTH=32768"
systemctl restart ollama
A larger context window requires more VRAM — the KV cache grows linearly with context size. If the model no longer fits in VRAM after increasing it, Ollama offloads layers to RAM and generation slows noticeably. Verify what is actually allocated after restarting:
ollama ps
# NAME ... CONTEXT
# llama3.2:3b ... 32768
For detailed guidance see the Ollama context length documentation.
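The linear growth is easy to check against the runner logs shown later in this post. A back-of-the-envelope calculation, assuming llama3.2:3b's architecture (28 layers, 8 KV heads of head dimension 128, f16 cache — numbers taken from the model card, not reported by Ollama itself):

```python
def kv_cache_mib(ctx: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Combined K and V cache size in MiB.

    Each token stores one K vector and one V vector per layer;
    f16 means 2 bytes per value.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return ctx * per_token / (1024 * 1024)

# Assumed llama3.2:3b dims: 28 layers, 8 KV heads, head_dim 128
print(kv_cache_mib(4096, 28, 8, 128))   # 448.0 MiB at the default context
print(kv_cache_mib(32768, 28, 8, 128))  # 3584.0 MiB — 8x the context, 8x the cache
```

The 448 MiB figure matches the `llama_kv_cache: CPU KV buffer size = 448.00 MiB` line in the service logs.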
Pulling and running a model
Pull a model by name:
# A capable 3B model — fits in 2 GB of VRAM or 4 GB of RAM
ollama pull llama3.2:3b
Once downloaded, run an interactive session:
ollama run llama3.2:3b
Type a prompt and press Enter. Type /bye to exit.
To list what you have pulled:
ollama list
To remove a model:
ollama rm llama3.2:3b
The HTTP API
When Ollama is running (either as a service or via ollama serve), it listens on port 11434 by default. You do not need to do anything extra — the service started during installation is already running.
Check it:
curl http://localhost:11434
# Ollama is running
Generate endpoint
The core endpoint is /api/generate. It accepts a model name and a prompt, and streams back tokens:
curl http://localhost:11434/api/generate -s \
-d '{
"model": "llama3.2:3b",
"prompt": "What is a reverse proxy?",
"stream": false
}' | jq
With stream: false you get a single JSON response. With stream: true (the default), you get newline-delimited JSON objects, one per token.
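A minimal sketch of consuming the streaming form — fed simulated chunks here rather than a live connection, but the parsing is identical when iterating over a response body line by line:

```python
import json

def collect_stream(lines):
    """Accumulate the 'response' fragments from newline-delimited JSON chunks."""
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Simulated chunks in the shape /api/generate streams them
sample = [
    '{"response": "A reverse", "done": false}',
    '{"response": " proxy...", "done": true}',
]
print(collect_stream(sample))  # A reverse proxy...
```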
The response:
{
"model": "llama3.2:3b",
"created_at": "2026-04-21T09:17:16.256470627Z",
"response": "A reverse proxy is a server that acts as an intermediary between...",
"done": true,
"done_reason": "stop",
"context": [...],
"total_duration": 34746343448,
"load_duration": 298291517,
"prompt_eval_count": 31,
"prompt_eval_duration": 352002710,
"eval_count": 432,
"eval_duration": 33600883965
}
total_duration is in nanoseconds. eval_count is tokens generated. These fields matter later when we add observability.
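Those fields make throughput trivial to compute. Plugging in the values from the response above:

```python
# Fields copied from the /api/generate response shown above
eval_count = 432                    # tokens generated
eval_duration_ns = 33_600_883_965   # generation time in nanoseconds

tokens_per_second = eval_count / (eval_duration_ns / 1e9)
print(f"{tokens_per_second:.1f} tokens/s")  # 12.9 tokens/s
```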
Chat endpoint
The /api/chat endpoint follows the OpenAI messages format:
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.2:3b",
"messages": [
{ "role": "user", "content": "Explain nginx in one paragraph." }
],
"stream": false
}'
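The messages array is the entire conversation state — the server keeps nothing between calls, so multi-turn chat means appending each reply and resending the full list. A sketch of that bookkeeping (the ask helper and httpx usage are illustrative; only the message shape is fixed by the API):

```python
import httpx

def ask(messages):
    """Send the full history to /api/chat and return the assistant message."""
    r = httpx.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3.2:3b", "messages": messages, "stream": False},
        timeout=120.0,
    )
    return r.json()["message"]  # {"role": "assistant", "content": "..."}

history = [{"role": "user", "content": "Explain nginx in one paragraph."}]
reply = ask(history)
history.append(reply)  # keep the assistant turn
history.append({"role": "user", "content": "Now compare it to Apache."})
reply = ask(history)   # the second call sees all three prior messages
```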
OpenAI-compatible endpoint
Ollama also exposes /v1/chat/completions, which is compatible with the OpenAI client spec. Any tool that accepts a custom base URL and an API key (you can pass any string) can point at Ollama directly:
curl http://localhost:11434/v1/chat/completions \
-H "Authorization: Bearer ignored" \
-d '{
"model": "llama3.2:3b",
"messages": [{ "role": "user", "content": "Hello" }]
}'
This compatibility layer is what makes it easy to swap Ollama in for existing tools — something we use in post 7 with Claude Code.
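For instance, with the official openai Python package, the base_url points at Ollama and the api_key can be any non-empty string, since Ollama ignores it:

```python
from openai import OpenAI

# Ollama ignores the key; the client library just requires one to be set
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)
```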
Calling from code
Python
import httpx
response = httpx.post(
"http://localhost:11434/api/generate",
json={"model": "llama3.2:3b", "prompt": "What is DNS?", "stream": False},
timeout=60.0,
)
print(response.json()["response"])
Or using the ollama Python package:
pip install ollama
import ollama
response = ollama.generate(model="llama3.2:3b", prompt="What is DNS?")
print(response["response"])
Node.js
npm install ollama
import ollama from "ollama";

const response = await ollama.generate({
  model: "llama3.2:3b",
  prompt: "What is DNS?",
});
console.log(response.response);
Checking service status
# Is the service running?
$ systemctl status ollama
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
Active: active (running) since Mon 2026-04-20 17:51:10 CEST; 17h ago
Main PID: 3473 (ollama)
Tasks: 29 (limit: 7108)
Memory: 3.6G ()
CGroup: /system.slice/ollama.service
├─ 3473 /usr/local/bin/ollama serve
└─102649 /usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c>
Apr 21 11:15:51 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:15:51 | 200 | 7.204189099s | 127.0.0.1 | POST "/api/generate"
Apr 21 11:16:23 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:16:23 | 200 | 30.295522344s | 127.0.0.1 | POST "/api/generate"
Apr 21 11:16:32 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:16:32 | 500 | 1.76122077s | 127.0.0.1 | POST "/api/generate"
Apr 21 11:17:16 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:17:16 | 200 | 35.029927984s | 127.0.0.1 | POST "/api/generate"
Apr 21 11:18:15 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:18:15 | 200 | 32.624924588s | 127.0.0.1 | POST "/api/generate"
Apr 21 11:18:51 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:18:51 | 200 | 12.967860968s | 127.0.0.1 | POST "/api/chat"
Apr 21 11:19:10 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:19:10 | 200 | 2.623931704s | 127.0.0.1 | POST "/v1/chat/completions"
Apr 21 11:19:28 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:19:28 | 200 | 9.945051648s | 127.0.0.1 | POST "/api/chat"
Apr 21 11:19:54 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:19:54 | 500 | 5.003496873s | 127.0.0.1 | POST "/api/generate"
Apr 21 11:20:49 mguarinos ollama[3473]: [GIN] 2026/04/21 - 11:20:49 | 200 | 31.820382643s | 127.0.0.1 | POST "/api/generate"
# What is using port 11434?
$ ss -tlnp | grep 11434
LISTEN 0 4096 0.0.0.0:11434 0.0.0.0:*
# Recent logs
$ journalctl -u ollama -n 50
Apr 21 11:15:46 mguarinos ollama[3473]: print_info: max token length = 256
Apr 21 11:15:46 mguarinos ollama[3473]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Apr 21 11:15:46 mguarinos ollama[3473]: load_tensors: CPU model buffer size = 1918.35 MiB
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: constructing llama_context
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: n_seq_max = 1
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: n_ctx = 4096
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: n_ctx_seq = 4096
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: n_batch = 512
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: n_ubatch = 512
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: causal_attn = 1
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: flash_attn = auto
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: kv_unified = false
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: freq_base = 500000.0
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: freq_scale = 1
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utiliz>
Apr 21 11:15:50 mguarinos ollama[3473]: llama_context: CPU output buffer size = 0.50 MiB
Apr 21 11:15:50 mguarinos ollama[3473]: llama_kv_cache: CPU KV buffer size = 448.00 MiB
Apr 21 11:15:51 mguarinos ollama[3473]: llama_kv_cache: size = 448.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 224.00 MiB, V (f16): 2>
Apr 21 11:15:51 mguarinos ollama[3473]: llama_context: Flash Attention was auto, set to enabled
Apr 21 11:15:51 mguarinos ollama[3473]: llama_context: CPU compute buffer size = 256.50 MiB
Apr 21 11:15:51 mguarinos ollama[3473]: llama_context: graph nodes = 875
Apr 21 11:15:51 mguarinos ollama[3473]: llama_context: graph splits = 1
Apr 21 11:15:51 mguarinos ollama[3473]: time=2026-04-21T11:15:51.845+02:00 level=INFO source=server.go:1402 msg="llama runner started in 6.23 secon>
Apr 21 11:15:51 mguarinos ollama[3473]: time=2026-04-21T11:15:51.854+02:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
Apr 21 11:15:51 mguarinos ollama[3473]: time=2026-04-21T11:15:51.856+02:00 level=INFO source=server.go:1364 msg="waiting for llama runner to start >
Apr 21 11:15:51 mguarinos ollama[3473]: time=2026-04-21T11:15:51.872+02:00 level=INFO source=server.go:1402 msg="llama runner started in 6.26 secon>
What you have now
- Ollama installed and running as a systemd service
- A model pulled and tested interactively
- The HTTP API responding on port 11434
- Working /api/generate, /api/chat, and /v1/chat/completions endpoints