Grafana: dashboards as code, not clicks
This is part 5 of the Production Observability series. We have PromQL queries from part 4. Now we put them in Grafana — but not by clicking through the UI. We provision everything from YAML and JSON files, so the setup is version-controlled and reproducible.
Why provision from files
Grafana’s UI is convenient for exploration. It is a poor foundation for a production setup.
When you configure datasources and build dashboards through the UI, the configuration lives in Grafana’s internal SQLite database. There is no version history, no code review, and no way to recreate the setup from scratch. When the container is replaced, everything is gone.
Grafana supports provisioning — loading datasources and dashboards from files at startup. The datasource configuration in provisioning/datasources/datasources.yml is read on start and applied automatically. Dashboard JSON files in a watched directory are imported without any manual steps.
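Concretely, the files in this post live in the following layout on the host:

grafana/
├── provisioning/
│   ├── datasources/
│   │   └── datasources.yml      # datasource definitions
│   └── dashboards/
│       └── dashboards.yml       # dashboard provider config
└── dashboards/
    └── service.json             # the dashboard itself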
Docker Compose addition
Add Grafana to docker-compose.yml:
grafana:
  image: grafana/grafana:13.0.1
  restart: unless-stopped
  ports:
    - "3000:3000"
  environment:
    - GF_SECURITY_ADMIN_USER=admin
    - GF_SECURITY_ADMIN_PASSWORD=admin
    - GF_USERS_ALLOW_SIGN_UP=false
  volumes:
    - ./grafana/provisioning:/etc/grafana/provisioning
    - ./grafana/dashboards:/var/lib/grafana/dashboards
    - grafana-data:/var/lib/grafana
The grafana-data volume persists Grafana’s internal state (user sessions, plugin data) across container restarts, while the provisioned configuration comes from mounted files.
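The named volume also needs a top-level declaration in docker-compose.yml, alongside whatever volumes earlier parts of the series already declare:

volumes:
  grafana-data: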
Datasource provisioning
grafana/provisioning/datasources/datasources.yml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - matcherRegex: '"trace_id": "([a-f0-9]+)"'
          name: TraceID
          url: "$${__value.raw}"
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      nodeGraph:
        enabled: true
      serviceMap:
        datasourceUid: prometheus
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: false
        spanStartTimeShift: "-1m"
        spanEndTimeShift: "1m"
        tags:
          - key: service.name
            value: service_name
      tracesToMetrics:
        datasourceUid: prometheus
        tags:
          - key: service.name
            value: job
      search:
        hide: false
      lokiSearch:
        datasourceUid: loki
The uid field is the stable identifier used when one datasource references another — the exemplarTraceIdDestinations in Prometheus points to the Tempo datasource by uid: tempo, and the derivedFields in Loki do the same. Without UIDs, these cross-datasource links break if you rename a datasource.
The Tempo jsonData block wires up all the cross-signal navigation that makes Grafana useful as an investigation tool:
- serviceMap.datasourceUid — tells Grafana where to find the service graph metrics (generated by Tempo’s service-graphs processor and pushed to Prometheus via remote write).
- tracesToLogs — when viewing a trace span, adds a “Logs” button that runs a Loki query filtered to that trace ID and service, jumping you directly to the logs for that request.
- tracesToMetrics — when viewing a trace span, adds a “Metrics” button that opens the Prometheus dashboard for the relevant service.
- nodeGraph.enabled — enables the node graph panel in trace views, showing the service call graph inline.
We are adding Loki and Tempo datasources now even though those services come in posts 8 and 9. Grafana handles missing backends gracefully — the datasource shows as unavailable rather than erroring out the entire provisioning step.
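Once the stack is up, you can confirm the datasources were applied via Grafana’s HTTP API, using the admin credentials from the compose file:

# Lists all datasources; each entry carries the uid set in datasources.yml.
curl -s -u admin:admin http://localhost:3000/api/datasources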
Dashboard provisioning
grafana/provisioning/dashboards/dashboards.yml:
apiVersion: 1

providers:
  - name: default
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
This tells Grafana to watch /var/lib/grafana/dashboards (mapped from ./grafana/dashboards on the host) and import any JSON files it finds. The updateIntervalSeconds: 30 means changes to dashboard files are picked up without restarting Grafana.
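Because foldersFromFilesStructure is enabled, the directory layout maps to Grafana folders — files in the root of the watched path land in the General folder, and each subdirectory becomes a folder of the same name (slo.json here is a hypothetical example):

dashboards/
├── service.json              # General folder
└── order-service/
    └── slo.json              # "order-service" folder in Grafana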
Building the service dashboard
Grafana dashboards are JSON. Rather than explaining the entire schema, let us walk through the key building blocks that appear in the service dashboard.
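For orientation, a minimal but valid dashboard file looks like the sketch below — the building blocks that follow slot into templating.list and panels (the uid, title, and schemaVersion values are illustrative placeholders):

{
  "uid": "service",
  "title": "Order Service",
  "schemaVersion": 39,
  "refresh": "30s",
  "time": {"from": "now-1h", "to": "now"},
  "templating": {"list": []},
  "panels": []
}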
Variables
Variables make dashboards dynamic. Define them in the dashboard’s templating.list array:
{
  "name": "job",
  "type": "query",
  "datasource": {"type": "prometheus", "uid": "prometheus"},
  "query": "label_values(http_requests_total, job)",
  "refresh": 2,
  "includeAll": false
}
This creates a $job dropdown populated from the job label values in http_requests_total. Add it to queries as {job="$job"} and the dashboard filters by the selected service automatically.
Request rate panel
A time series panel showing requests per second:
{
  "type": "timeseries",
  "title": "Request Rate",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{job=\"$job\"}[5m])) by (endpoint)",
      "legendFormat": "{{endpoint}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "custom": {"lineWidth": 2}
    }
  }
}
Latency panel
p50, p95, p99 on one panel:
{
  "type": "timeseries",
  "title": "Request Latency",
  "targets": [
    {
      "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job=\"$job\"}[5m])) by (le))",
      "legendFormat": "p50"
    },
    {
      "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"$job\"}[5m])) by (le))",
      "legendFormat": "p95"
    },
    {
      "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"$job\"}[5m])) by (le))",
      "legendFormat": "p99"
    }
  ],
  "fieldConfig": {
    "defaults": {"unit": "s"}
  }
}
Error rate panel
A stat panel showing current error fraction:
{
  "type": "stat",
  "title": "Error Rate",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{job=\"$job\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"$job\"}[5m]))",
      "legendFormat": "Error Rate"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 0.01},
          {"color": "red", "value": 0.05}
        ]
      }
    }
  }
}
The colour thresholds here implement a visual SLO: green below a 1% error rate, yellow from 1% to 5%, red above 5%.
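To make that concrete: at 200 req/s with 4 req/s returning 5xx, the query evaluates to 4 / 200 = 0.02, which percentunit renders as 2% and the 0.01 threshold colours yellow.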
The complete dashboard JSON
The full grafana/dashboards/service.json is in the repository. It includes:
- A $job variable to switch between services
- Four stat panels: request rate, error rate, p99 latency, orders in flight
- A time series panel: request rate by endpoint
- A time series panel: p50/p95/p99 latency
- A time series panel: error rate over time with threshold band
- A time series panel: orders created vs failed per minute
To add it: place the JSON file in grafana/dashboards/ and Grafana imports it within 30 seconds. No clicking required.
The $job dropdown filters all panels to a single service.
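To verify the import without opening the UI, Grafana’s search API works well (the query string here assumes the dashboard title contains “service”):

# Returns matching dashboards once provisioning has picked up the file.
curl -s -u admin:admin "http://localhost:3000/api/search?query=service"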
Iterating on dashboards
The common workflow:
- Build or modify a dashboard in the Grafana UI
- Go to Dashboard settings → JSON Model → copy the JSON
- Save it to grafana/dashboards/ in the repository
- Commit and push
On the next docker compose up, the dashboard is provisioned automatically. Anyone who clones the repository gets the same dashboards without any manual setup.
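If you prefer to skip the copy-paste, the export step can be scripted against Grafana’s HTTP API — a sketch, assuming the dashboard’s uid is service (use whatever uid your JSON declares) and that jq is installed:

# Fetch the dashboard by uid and strip the API envelope with jq,
# leaving just the dashboard JSON ready to commit.
curl -s -u admin:admin http://localhost:3000/api/dashboards/uid/service \
  | jq .dashboard > grafana/dashboards/service.json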
One practical note: when a dashboard is provisioned from a file, Grafana marks it as “read only” in the UI — you cannot save changes back via the UI. You can still explore it, modify it temporarily, and copy the JSON from the JSON Model view. This is intentional: the file is the source of truth.
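If you would rather allow saving from the UI, the dashboard provider has an allowUiUpdates option — a judgement call, since UI edits are then stored in Grafana’s database and can drift from the file:

providers:
  - name: default
    type: file
    allowUiUpdates: true   # UI saves go to Grafana's DB, not back to the file
    options:
      path: /var/lib/grafana/dashboards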
What you have now
- Grafana running with all three datasources provisioned from YAML — Prometheus, Loki (coming in post 8), Tempo (coming in post 9)
- A service dashboard loaded from JSON, filterable by $job
- Rate, error, and latency panels using the PromQL patterns from post 4
In the next post we define SLOs for the order service and build the error budget view that turns these metrics into actionable reliability targets.