<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://mguarinos.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mguarinos.com/" rel="alternate" type="text/html" /><updated>2026-04-20T11:56:21+00:00</updated><id>https://mguarinos.com/feed.xml</id><title type="html">Manuel Guarinos</title><subtitle>Writing on cloud engineering and SRE — things that took iterations to get right, and a few that just worked.</subtitle><author><name>Manuel Guarinos</name></author><entry><title type="html">GitHub Actions to AWS without stored credentials: OIDC role federation</title><link href="https://mguarinos.com/posts/2026/04/20/github-aws-oidc-cicd/" rel="alternate" type="text/html" title="GitHub Actions to AWS without stored credentials: OIDC role federation" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://mguarinos.com/posts/2026/04/20/github-aws-oidc-cicd</id><content type="html" xml:base="https://mguarinos.com/posts/2026/04/20/github-aws-oidc-cicd/"><![CDATA[<p>The default way people wire up GitHub Actions to AWS is to create an IAM user, generate an access key, and paste <code class="language-plaintext highlighter-rouge">AWS_ACCESS_KEY_ID</code> and <code class="language-plaintext highlighter-rouge">AWS_SECRET_ACCESS_KEY</code> into GitHub secrets. It works. It also means you have a long-lived credential sitting in your repository’s secret store that never expires, needs manual rotation, and produces audit trails that say “IAM user <code class="language-plaintext highlighter-rouge">github-ci</code> did this” with no indication of which repository, branch, or workflow was responsible.</p>

<p>OIDC federation eliminates the credential entirely. GitHub issues a short-lived signed JWT for each workflow run. AWS STS validates that JWT against a trust policy you control, checks that the claims match - which repository, which branch, which environment - and returns temporary credentials scoped to a specific IAM role. The credentials expire in one hour. There is nothing to rotate. There is no secret to leak. Every AWS CloudTrail event carries the full OIDC subject claim, so you know exactly what triggered it.</p>

<p>This post covers the full setup: the trust model, how to wire it up with the AWS CLI, how to restrict access by branch and environment, multi-environment role design, and the compliance advantages you get without extra effort.</p>

<hr />

<h2 id="how-the-trust-works">How the trust works</h2>

<p>When a GitHub Actions workflow runs with <code class="language-plaintext highlighter-rouge">id-token: write</code> permission, GitHub mints a JWT from its OIDC endpoint. That token contains claims describing exactly what triggered the run:</p>

<table>
  <thead>
    <tr>
      <th>Claim</th>
      <th>Example value</th>
      <th>What it describes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">iss</code></td>
      <td><code class="language-plaintext highlighter-rouge">https://token.actions.githubusercontent.com</code></td>
      <td>The issuer - GitHub’s OIDC server</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">sub</code></td>
      <td><code class="language-plaintext highlighter-rouge">repo:org/repo:ref:refs/heads/main</code></td>
      <td>Repository and trigger context</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">aud</code></td>
      <td><code class="language-plaintext highlighter-rouge">sts.amazonaws.com</code></td>
      <td>Intended audience</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">exp</code></td>
      <td><code class="language-plaintext highlighter-rouge">now + 5min</code></td>
      <td>Token lifetime - very short on purpose</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">repository</code></td>
      <td><code class="language-plaintext highlighter-rouge">org/repo</code></td>
      <td>Repository full name</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">ref</code></td>
      <td><code class="language-plaintext highlighter-rouge">refs/heads/main</code></td>
      <td>Git ref that triggered the run</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">environment</code></td>
      <td><code class="language-plaintext highlighter-rouge">production</code></td>
      <td>GitHub Environment, if configured</td>
    </tr>
  </tbody>
</table>

<p>The workflow then calls <code class="language-plaintext highlighter-rouge">aws-actions/configure-aws-credentials</code> with a role ARN. That action presents the JWT to AWS STS via <code class="language-plaintext highlighter-rouge">AssumeRoleWithWebIdentity</code>. STS validates the JWT signature (against GitHub’s published JWKS), checks the <code class="language-plaintext highlighter-rouge">aud</code> claim equals <code class="language-plaintext highlighter-rouge">sts.amazonaws.com</code>, and evaluates your IAM role’s trust policy conditions against the <code class="language-plaintext highlighter-rouge">sub</code> and other claims. If everything matches, STS returns temporary credentials. If anything fails - wrong repository, wrong branch, wrong environment - the call is rejected before any AWS action can occur.</p>
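<p>If you want to see these claims for yourself, any job with <code class="language-plaintext highlighter-rouge">id-token: write</code> can request a token from the runner's endpoint. A throwaway debugging step - do not leave it in a real workflow, since the raw JWT is a live bearer credential for a few minutes:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: Dump OIDC claims (debugging only)
  run: |
    # ACTIONS_ID_TOKEN_REQUEST_URL and _TOKEN are injected by the runner
    JWT=$(curl -sSf -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&amp;audience=sts.amazonaws.com" | jq -r '.value')
    # Decode the payload segment (base64url, possibly unpadded)
    echo "$JWT" | cut -d. -f2 | python3 -c 'import sys, base64; s = sys.stdin.read().strip(); print(base64.urlsafe_b64decode(s + "=" * (-len(s) % 4)).decode())'
</code></pre></div></div>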

<hr />

<h2 id="setting-up-the-oidc-provider-in-aws">Setting up the OIDC provider in AWS</h2>

<p>Before any role can trust GitHub tokens, AWS needs to know about GitHub’s OIDC endpoint. You register it once per account as an IAM Identity Provider:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws iam create-open-id-connect-provider <span class="se">\</span>
  <span class="nt">--url</span> https://token.actions.githubusercontent.com <span class="se">\</span>
  <span class="nt">--client-id-list</span> sts.amazonaws.com <span class="se">\</span>
  <span class="nt">--thumbprint-list</span> 6938fd4d98bab03faadb97b34396831e3780aea1
</code></pre></div></div>

<p>This is an account-level resource. One provider covers all roles in the account. If you manage multiple accounts (staging, production), run this once in each.</p>

<p>The <code class="language-plaintext highlighter-rouge">client-id-list</code> value <code class="language-plaintext highlighter-rouge">sts.amazonaws.com</code> must match the <code class="language-plaintext highlighter-rouge">aud</code> claim GitHub puts in the token when the workflow uses <code class="language-plaintext highlighter-rouge">aws-actions/configure-aws-credentials</code>. This is a fixed agreement between the action and AWS - do not change it.</p>
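<p>You can confirm the registration - and that the audience list is what you expect - from the CLI:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws iam list-open-id-connect-providers

aws iam get-open-id-connect-provider \
  --open-id-connect-provider-arn \
  arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com
</code></pre></div></div>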

<hr />

<h2 id="the-iam-role-and-trust-policy">The IAM role and trust policy</h2>

<p>Every environment that GitHub deploys to gets its own IAM role. The trust policy is where you express who is allowed to assume it. Save this as <code class="language-plaintext highlighter-rouge">trust-policy.json</code>:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"Version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2012-10-17"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Statement"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Principal"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"Federated"</span><span class="p">:</span><span class="w"> </span><span class="s2">"arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"</span><span class="w">
      </span><span class="p">},</span><span class="w">
      </span><span class="nl">"Action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sts:AssumeRoleWithWebIdentity"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Condition"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"StringEquals"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"token.actions.githubusercontent.com:aud"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sts.amazonaws.com"</span><span class="p">,</span><span class="w">
          </span><span class="nl">"token.actions.githubusercontent.com:sub"</span><span class="p">:</span><span class="w"> </span><span class="s2">"repo:your-org/your-repo:ref:refs/heads/main"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws iam create-role <span class="se">\</span>
  <span class="nt">--role-name</span> github-prod-deploy <span class="se">\</span>
  <span class="nt">--assume-role-policy-document</span> file://trust-policy.json
</code></pre></div></div>

<p>Two conditions, both must pass:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">aud</code> must equal <code class="language-plaintext highlighter-rouge">sts.amazonaws.com</code> - prevents tokens minted for other services from being used here.</li>
  <li><code class="language-plaintext highlighter-rouge">sub</code> must match your pattern - scopes the role to a specific repository and trigger context.</li>
</ul>
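<p>A quick read-back confirms the role landed with the conditions you intended - the CLI returns the trust document already URL-decoded:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws iam get-role \
  --role-name github-prod-deploy \
  --query 'Role.AssumeRolePolicyDocument.Statement[0].Condition'
</code></pre></div></div>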

<hr />

<h2 id="attaching-permissions-to-the-role">Attaching permissions to the role</h2>

<p>The trust policy controls who can assume the role. A separate permissions policy controls what they can do once they have it. Without one, the role can be assumed but every AWS call will be denied.</p>

<p>Save this as <code class="language-plaintext highlighter-rouge">deploy-policy.json</code>, scoped to exactly what your pipeline needs:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"Version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2012-10-17"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Statement"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Action"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"s3:PutObject"</span><span class="p">,</span><span class="w"> </span><span class="s2">"s3:DeleteObject"</span><span class="p">,</span><span class="w"> </span><span class="s2">"s3:GetObject"</span><span class="p">,</span><span class="w"> </span><span class="s2">"s3:ListBucket"</span><span class="p">],</span><span class="w">
      </span><span class="nl">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"arn:aws:s3:::my-prod-bucket"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"arn:aws:s3:::my-prod-bucket/*"</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Action"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"lambda:UpdateFunctionCode"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"lambda:PublishVersion"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"lambda:UpdateAlias"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"lambda:GetFunction"</span><span class="w">
      </span><span class="p">],</span><span class="w">
      </span><span class="nl">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"arn:aws:lambda:eu-west-1:123456789012:function:my-function"</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cloudfront:CreateInvalidation"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"arn:aws:cloudfront::123456789012:distribution/ABCDEF123456"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws iam put-role-policy <span class="se">\</span>
  <span class="nt">--role-name</span> github-prod-deploy <span class="se">\</span>
  <span class="nt">--policy-name</span> deploy <span class="se">\</span>
  <span class="nt">--policy-document</span> file://deploy-policy.json
</code></pre></div></div>
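<p>Before wiring the role into a workflow, you can dry-run the permissions with the IAM policy simulator. A sketch - the first action should come back <code class="language-plaintext highlighter-rouge">allowed</code>, the second <code class="language-plaintext highlighter-rouge">implicitDeny</code>, since the policy above never grants it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/github-prod-deploy \
  --action-names s3:PutObject s3:DeleteBucket \
  --resource-arns arn:aws:s3:::my-prod-bucket/index.html \
  --query 'EvaluationResults[].[EvalActionName,EvalDecision]'
</code></pre></div></div>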

<hr />

<h2 id="restricting-by-branch-tag-and-environment">Restricting by branch, tag, and environment</h2>

<p>The <code class="language-plaintext highlighter-rouge">sub</code> claim is the primary restriction surface. Its format depends on what triggered the workflow:</p>

<table>
  <thead>
    <tr>
      <th>Trigger</th>
      <th>Subject claim</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Push to branch <code class="language-plaintext highlighter-rouge">main</code></td>
      <td><code class="language-plaintext highlighter-rouge">repo:org/repo:ref:refs/heads/main</code></td>
    </tr>
    <tr>
      <td>Push to any branch</td>
      <td><code class="language-plaintext highlighter-rouge">repo:org/repo:ref:refs/heads/*</code></td>
    </tr>
    <tr>
      <td>Tag matching <code class="language-plaintext highlighter-rouge">v*</code></td>
      <td><code class="language-plaintext highlighter-rouge">repo:org/repo:ref:refs/tags/v*</code></td>
    </tr>
    <tr>
      <td>GitHub Environment <code class="language-plaintext highlighter-rouge">production</code></td>
      <td><code class="language-plaintext highlighter-rouge">repo:org/repo:environment:production</code></td>
    </tr>
    <tr>
      <td>Pull request</td>
      <td><code class="language-plaintext highlighter-rouge">repo:org/repo:pull_request</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">workflow_dispatch</code></td>
      <td><code class="language-plaintext highlighter-rouge">repo:org/repo:workflow_dispatch</code></td>
    </tr>
  </tbody>
</table>

<p>A role that deploys to production should use the environment form, not the branch form. The difference matters: anyone can push to <code class="language-plaintext highlighter-rouge">main</code> if branch protection is misconfigured. A GitHub Environment with required reviewers cannot be bypassed without a human approval. The <code class="language-plaintext highlighter-rouge">sub</code> claim will contain <code class="language-plaintext highlighter-rouge">environment:production</code> only after that gate is cleared.</p>

<p>For a role used by pull requests to run a plan or generate deployment diffs, the <code class="language-plaintext highlighter-rouge">pull_request</code> subject restricts it to read operations triggered from PRs - no direct pushes can assume it.</p>
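<p>IAM condition values also accept arrays, which OR together within a key - so a single role can trust more than one context. A production role assumable from either the gated environment or a version tag, for example:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
  },
  "StringLike": {
    "token.actions.githubusercontent.com:sub": [
      "repo:your-org/your-repo:environment:production",
      "repo:your-org/your-repo:ref:refs/tags/v*"
    ]
  }
}
</code></pre></div></div>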

<h3 id="wildcards-in-sub-conditions">Wildcards in <code class="language-plaintext highlighter-rouge">sub</code> conditions</h3>

<p>When you need a role accessible from any branch in a repository (e.g., a shared CI role that only reads from S3), use <code class="language-plaintext highlighter-rouge">StringLike</code> with a wildcard:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">"StringLike"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nl">"token.actions.githubusercontent.com:sub"</span><span class="p">:</span><span class="w"> </span><span class="s2">"repo:your-org/your-repo:*"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Be deliberate about wildcards. <code class="language-plaintext highlighter-rouge">repo:your-org/*:*</code> would allow any repository in your org to assume the role - useful for a shared read-only role, dangerous for a deploy role.</p>

<hr />

<h2 id="multi-environment-role-design">Multi-environment role design</h2>

<p>The right model is one role per environment per function, each with the minimum permissions it needs.</p>

<figure>
  <img src="/assets/images/github-aws-oidc-cicd/multi-env.svg" alt="Branch-to-role mapping: feature branches get no AWS access, develop maps to a staging role, main and tags map to a production role with an approval gate, pull requests map to a read-only role" />
  <figcaption>Each branch context maps to a dedicated IAM role. Production requires a GitHub Environment with a required reviewer - the OIDC subject claim for that environment is only issued after the gate passes.</figcaption>
</figure>

<p>The trust policy structure is the same as the one in the previous section - only the <code class="language-plaintext highlighter-rouge">sub</code> condition changes per role. For staging, scoped to the <code class="language-plaintext highlighter-rouge">develop</code> branch:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># trust-policy-staging.json - sub: "repo:your-org/your-repo:ref:refs/heads/develop"</span>
aws iam create-role <span class="se">\</span>
  <span class="nt">--role-name</span> github-staging-deploy <span class="se">\</span>
  <span class="nt">--assume-role-policy-document</span> file://trust-policy-staging.json
</code></pre></div></div>

<p>For production, scoped to the GitHub Environment instead of a branch:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># trust-policy-prod.json - sub: "repo:your-org/your-repo:environment:production"</span>
aws iam create-role <span class="se">\</span>
  <span class="nt">--role-name</span> github-prod-deploy <span class="se">\</span>
  <span class="nt">--assume-role-policy-document</span> file://trust-policy-prod.json
</code></pre></div></div>

<hr />

<h2 id="the-workflow-side">The workflow side</h2>

<p>Here is what the GitHub Actions side looks like, based on the <a href="https://github.com/mguarinos/streamline">Streamline project</a>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">jobs</span><span class="pi">:</span>
  <span class="na">prepare</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
    <span class="na">permissions</span><span class="pi">:</span>
      <span class="na">contents</span><span class="pi">:</span> <span class="s">read</span>          <span class="c1"># no id-token here - this job doesn't touch AWS</span>

  <span class="na">deploy-frontend</span><span class="pi">:</span>
    <span class="na">needs</span><span class="pi">:</span> <span class="s">prepare</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
    <span class="na">environment</span><span class="pi">:</span> <span class="s">production</span>   <span class="c1"># triggers the GitHub Environment gate</span>
    <span class="na">permissions</span><span class="pi">:</span>
      <span class="na">id-token</span><span class="pi">:</span> <span class="s">write</span>         <span class="c1"># required to request the OIDC token</span>
      <span class="na">contents</span><span class="pi">:</span> <span class="s">read</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>

      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">aws-actions/configure-aws-credentials@v4</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="c1"># AWS_ROLE_ARN, AWS_REGION and FRONTEND_BUCKET are illustrative secret/variable names</span>
          <span class="na">role-to-assume</span><span class="pi">:</span> <span class="s">${{ secrets.AWS_ROLE_ARN }}</span>
          <span class="na">aws-region</span><span class="pi">:</span> <span class="s">${{ vars.AWS_REGION }}</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Deploy</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">aws s3 sync frontend/ "s3://${{ vars.FRONTEND_BUCKET }}" \</span>
            <span class="s">--cache-control "public, max-age=31536000, immutable"</span>
</code></pre></div></div>

<p>Three things to notice:</p>

<p><strong><code class="language-plaintext highlighter-rouge">id-token: write</code> is job-scoped.</strong> The <code class="language-plaintext highlighter-rouge">prepare</code> job reads the repository and detects what changed - it never touches AWS, so it doesn’t request the OIDC permission. Only the jobs that call <code class="language-plaintext highlighter-rouge">configure-aws-credentials</code> need <code class="language-plaintext highlighter-rouge">id-token: write</code>. At the workflow level the default permission is <code class="language-plaintext highlighter-rouge">id-token: none</code>, which is correct.</p>

<p><strong><code class="language-plaintext highlighter-rouge">environment: production</code> is where the approval gate lives.</strong> Set this on the job, not the workflow. GitHub will pause the job and require the designated reviewers to approve before the OIDC token is issued. The <code class="language-plaintext highlighter-rouge">sub</code> claim will contain <code class="language-plaintext highlighter-rouge">environment:production</code> only after approval - matching your IAM trust policy condition.</p>

<p><strong><code class="language-plaintext highlighter-rouge">role-to-assume</code> accepts an ARN, not a key pair.</strong> There is no <code class="language-plaintext highlighter-rouge">aws-access-key-id</code> or <code class="language-plaintext highlighter-rouge">aws-secret-access-key</code>. The action handles the full OIDC exchange internally and exports the standard <code class="language-plaintext highlighter-rouge">AWS_*</code> environment variables for subsequent steps.</p>
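<p>A cheap step to add right after the credentials action is an identity check - it confirms which role was actually assumed before anything destructive runs:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: Verify assumed identity
  run: |
    # Arn should read arn:aws:sts::123456789012:assumed-role/github-prod-deploy/SESSION
    aws sts get-caller-identity
</code></pre></div></div>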

<hr />

<h2 id="compliance-and-audit-advantages">Compliance and audit advantages</h2>

<h3 id="no-credential-to-rotate-leak-or-audit">No credential to rotate, leak, or audit</h3>

<p>Long-lived IAM credentials require rotation policies, leak detection (when AWS finds keys exposed in public repositories, it quarantines them), access key age alarms in Security Hub, and periodic audits of who has credentials and whether they are still needed. OIDC eliminates all of this for CI/CD. The credential surface shrinks to the GitHub-issued token, which expires in minutes and cannot be reused outside the context it was issued for.</p>

<h3 id="cloudtrail-records-the-full-identity-chain">CloudTrail records the full identity chain</h3>

<p>Every <code class="language-plaintext highlighter-rouge">AssumeRoleWithWebIdentity</code> call creates a CloudTrail event. That event includes:</p>

<ul>
  <li>The role ARN that was assumed</li>
  <li>The OIDC subject claim: <code class="language-plaintext highlighter-rouge">repo:your-org/your-repo:environment:production</code></li>
  <li>The OIDC issuer</li>
  <li>The source IP of the GitHub runner</li>
  <li>The resulting session ARN</li>
</ul>

<p>Every subsequent AWS API call in that session carries the session ARN. You can trace any S3 put, Lambda invocation, or CloudFront invalidation back to the exact repository, branch, and workflow run that triggered it.</p>

<figure>
  <img src="/assets/images/github-aws-oidc-cicd/cloudtrail-screenshot.png" alt="AWS CloudTrail event showing UpdateFunctionCode triggered by GitHubActions with a temporary ASIA access key and the runner's source IP" />
  <figcaption>A real CloudTrail event from a Streamline deploy. The access key starts with <code>ASIA</code> - the STS temporary credential prefix, not a long-lived <code>AKIA</code> key. The username is the IAM role session name set by the workflow, and the source IP belongs to a GitHub-hosted runner.</figcaption>
</figure>
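<p>The same events are queryable from the CLI, which is handy for a periodic review of what has been assuming your deploy roles:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRoleWithWebIdentity \
  --max-results 5 \
  --query 'Events[].CloudTrailEvent'
</code></pre></div></div>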

<h3 id="soc-2-and-iso-27001-alignment">SOC 2 and ISO 27001 alignment</h3>

<p>Both frameworks require demonstrable least-privilege access and evidence that access is scoped to need. The trust policy <code class="language-plaintext highlighter-rouge">sub</code> condition is machine-readable proof that production credentials can only be issued to workflows running against a specific environment after human approval. The IAM role configuration and the GitHub Environment settings together constitute an auditable, version-controlled access control - auditors can inspect both without relying on convention or documentation.</p>

<p>The absence of stored credentials also satisfies key management controls: there is no AWS credential in your secret store, so there is nothing to rotate, nothing that can be extracted from a compromised runner cache, and no access key age to report.</p>

<hr />

<h2 id="what-you-might-miss">What you might miss</h2>

<p><strong>The <code class="language-plaintext highlighter-rouge">aud</code> condition is not optional.</strong> STS checks the token's audience against the provider's registered client IDs, but repeating the check in the trust policy costs nothing and keeps the role safe if that list is ever loosened. Without it, a token a workflow requested with a custom <code class="language-plaintext highlighter-rouge">audience</code> - intended for some other integration entirely - could still satisfy a <code class="language-plaintext highlighter-rouge">sub</code>-only trust policy. The <code class="language-plaintext highlighter-rouge">sub</code> condition says who may assume the role; the <code class="language-plaintext highlighter-rouge">aud</code> condition confirms the token was minted for AWS STS specifically.</p>

<p><strong><code class="language-plaintext highlighter-rouge">StringLike</code> vs <code class="language-plaintext highlighter-rouge">StringEquals</code> for wildcards.</strong> Use <code class="language-plaintext highlighter-rouge">StringEquals</code> for exact matches - it is strict and leaves no room for misinterpretation. Use <code class="language-plaintext highlighter-rouge">StringLike</code> only when you need <code class="language-plaintext highlighter-rouge">*</code> or <code class="language-plaintext highlighter-rouge">?</code>. Do not use <code class="language-plaintext highlighter-rouge">StringLike</code> with an exact value; it works, but it signals that the intent was something more permissive.</p>

<p><strong>Branch protection and environment protection are separate layers.</strong> The OIDC trust policy restricts which context can assume a role. Branch protection rules prevent who can push to the branch in the first place. GitHub Environment required reviewers gate who can deploy. All three are independent - losing one does not compromise the others, but the strongest posture uses all three.</p>

<p><strong>Session duration.</strong> The default session for <code class="language-plaintext highlighter-rouge">AssumeRoleWithWebIdentity</code> is one hour, which is also the maximum unless the role’s <code class="language-plaintext highlighter-rouge">MaxSessionDuration</code> is extended. Deployments taking longer than an hour will fail mid-run with expired credentials. Set <code class="language-plaintext highlighter-rouge">role-duration-seconds</code> in the action or extend the role’s max session if needed.</p>
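<p>Concretely, for the role created earlier - first raise the ceiling on the role itself:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Seconds; the hard cap is 43200 (12 hours)
aws iam update-role \
  --role-name github-prod-deploy \
  --max-session-duration 7200
</code></pre></div></div>

<p>Then request the longer session by passing <code class="language-plaintext highlighter-rouge">role-duration-seconds: 7200</code> to <code class="language-plaintext highlighter-rouge">configure-aws-credentials</code>; a requested duration above the role's maximum makes the assume call fail outright.</p>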

<p><strong>IAM permission boundaries prevent privilege escalation.</strong> If a deployment role has <code class="language-plaintext highlighter-rouge">iam:CreateRole</code> or <code class="language-plaintext highlighter-rouge">iam:AttachRolePolicy</code>, it can in theory create a new role with more permissions than it has. A permission boundary applied to all roles created by the deploy role caps what those child roles can ever do.</p>
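<p>The enforcement hook is the <code class="language-plaintext highlighter-rouge">iam:PermissionsBoundary</code> condition key: grant the IAM actions only when the target role carries a specific boundary policy. A sketch - the role path prefix and boundary policy name here are illustrative:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "Effect": "Allow",
  "Action": ["iam:CreateRole", "iam:AttachRolePolicy"],
  "Resource": "arn:aws:iam::123456789012:role/app-*",
  "Condition": {
    "StringEquals": {
      "iam:PermissionsBoundary": "arn:aws:iam::123456789012:policy/deploy-boundary"
    }
  }
}
</code></pre></div></div>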

<hr />

<h2 id="putting-it-together">Putting it together</h2>

<p>The setup reduces to five things:</p>

<ol>
  <li><strong>An IAM OIDC provider</strong> - registered once per AWS account, pointing at <code class="language-plaintext highlighter-rouge">https://token.actions.githubusercontent.com</code>.</li>
  <li><strong>One IAM role per environment per access level</strong> - each with a trust policy that <code class="language-plaintext highlighter-rouge">StringEquals</code> the <code class="language-plaintext highlighter-rouge">sub</code> claim to the exact context allowed.</li>
  <li><strong>A scoped permissions policy on each role</strong> - listing only the specific resource ARNs the pipeline needs to touch.</li>
  <li><strong>A GitHub Environment</strong> for each production-grade deployment target - with required reviewers and (optionally) a deployment wait timer.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">permissions: id-token: write</code></strong> on the specific jobs that call <code class="language-plaintext highlighter-rouge">configure-aws-credentials</code>, and only those jobs.</li>
</ol>

<p>The resulting pipeline has no secrets to manage in GitHub, produces a full attribution chain in CloudTrail, and can only deploy to production after a human explicitly approves it through a gated GitHub Environment - all without any changes to how the actual deployment steps work.</p>]]></content><author><name>Manuel Guarinos</name></author><category term="aws" /><category term="devops" /><category term="security" /><category term="aws" /><category term="github-actions" /><category term="oidc" /><category term="iam" /><category term="cicd" /><category term="security" /><category term="compliance" /><summary type="html"><![CDATA[Replace long-lived AWS credentials in GitHub secrets with short-lived tokens using OIDC federation. Covers trust policy setup, per-branch and per-environment scoping, multi-environment role design, and the compliance gains you get for free.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://mguarinos.com/assets/images/github-aws-oidc-cicd/header.svg" /><media:content medium="image" url="https://mguarinos.com/assets/images/github-aws-oidc-cicd/header.svg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">The Kubernetes Operator Pattern: teaching your cluster to manage anything</title><link href="https://mguarinos.com/posts/2026/04/15/kubernetes-operator-pattern/" rel="alternate" type="text/html" title="The Kubernetes Operator Pattern: teaching your cluster to manage anything" /><published>2026-04-15T00:00:00+00:00</published><updated>2026-04-15T00:00:00+00:00</updated><id>https://mguarinos.com/posts/2026/04/15/kubernetes-operator-pattern</id><content type="html" xml:base="https://mguarinos.com/posts/2026/04/15/kubernetes-operator-pattern/"><![CDATA[<p>When you run <code class="language-plaintext highlighter-rouge">kubectl apply</code>, nothing executes your manifest directly. 
The API server writes your desired state to etcd, and a control loop running somewhere in the cluster notices the gap between what you asked for and what currently exists - then closes it. That loop is a controller. A Deployment is a controller. A ReplicaSet is a controller. The entire Kubernetes architecture is built on this pattern.</p>

<p>An operator is what happens when you take that same pattern and point it at something you own: a database, a DNS record, an SSL certificate, a Slack channel. The operator extends Kubernetes’ reconciliation model to resources that have nothing to do with running containers.</p>

<p>This post uses a Cloudflare DNS operator as the running example - an operator that watches <code class="language-plaintext highlighter-rouge">CloudflareDNSRecord</code> objects in the cluster and syncs them to the Cloudflare API. The source is <a href="https://github.com/mguarinos/kubernetes-cloudflare-dns-operator">here</a>.</p>

<hr />

<h2 id="the-problem-operators-solve">The problem operators solve</h2>

<p>Helm charts and plain manifests handle static configuration well. You describe what you want, apply it, and Kubernetes makes it happen. But this works because Kubernetes itself knows how to reconcile pods, services, and config maps. It has no idea what a Cloudflare DNS record is.</p>

<p>The traditional answer is automation outside the cluster: a CI pipeline that calls <code class="language-plaintext highlighter-rouge">curl</code> against the Cloudflare API when a variable changes, a shell script someone runs manually, a Terraform workspace that drifts quietly for months. These work, but they all share the same problem: the external resource is not a first-class citizen in the cluster. You can’t <code class="language-plaintext highlighter-rouge">kubectl get</code> it, you can’t set a <code class="language-plaintext highlighter-rouge">dependsOn</code>, you can’t see its status alongside your other resources. And when someone edits it directly in the Cloudflare dashboard, nothing notices.</p>

<p>Operators bring external resources inside Kubernetes’ reconciliation boundary. Once you have an operator, a DNS record is just another Kubernetes object.</p>

<figure>
  <img src="/assets/images/kubernetes-operator-pattern/cloudflare-dns-console.png" alt="Cloudflare DNS dashboard showing a list of DNS records for a zone" />
  <figcaption>The Cloudflare DNS console - records that exist here are the live state. The operator's job is to keep this in sync with what Kubernetes says.</figcaption>
</figure>

<hr />

<h2 id="crds-giving-kubernetes-new-vocabulary">CRDs: giving Kubernetes new vocabulary</h2>

<p>Before you can write an operator, you need to teach Kubernetes what your resource type looks like. Custom Resource Definitions (CRDs) are the mechanism. A CRD is itself a Kubernetes manifest - you apply it once, and from that point on the API server accepts and stores objects of that type.</p>

<p>The DNS operator’s CRD registers the <code class="language-plaintext highlighter-rouge">CloudflareDNSRecord</code> kind under <code class="language-plaintext highlighter-rouge">dns.operator.io/v1</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apiextensions.k8s.io/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">CustomResourceDefinition</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">cloudflarednsrecords.dns.operator.io</span>   <span class="c1"># &lt;plural&gt;.&lt;group&gt;</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">group</span><span class="pi">:</span> <span class="s">dns.operator.io</span>
  <span class="na">scope</span><span class="pi">:</span> <span class="s">Namespaced</span>
  <span class="na">names</span><span class="pi">:</span>
    <span class="na">plural</span><span class="pi">:</span> <span class="s">cloudflarednsrecords</span>
    <span class="na">singular</span><span class="pi">:</span> <span class="s">cloudflarednsrecord</span>
    <span class="na">kind</span><span class="pi">:</span> <span class="s">CloudflareDNSRecord</span>
    <span class="na">shortNames</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">cfdr</span><span class="pi">]</span>
  <span class="na">versions</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">v1</span>
      <span class="na">served</span><span class="pi">:</span> <span class="kc">true</span>
      <span class="na">storage</span><span class="pi">:</span> <span class="kc">true</span>
      <span class="c1"># Status is a separate write path - the operator patches it without</span>
      <span class="c1"># triggering an on.update on the spec.</span>
      <span class="na">subresources</span><span class="pi">:</span>
        <span class="na">status</span><span class="pi">:</span> <span class="pi">{}</span>
      <span class="c1"># OpenAPI schema: the API server validates every object before storing it.</span>
      <span class="na">schema</span><span class="pi">:</span>
        <span class="na">openAPIV3Schema</span><span class="pi">:</span>
          <span class="na">type</span><span class="pi">:</span> <span class="s">object</span>
          <span class="na">properties</span><span class="pi">:</span>
            <span class="na">spec</span><span class="pi">:</span>
              <span class="na">type</span><span class="pi">:</span> <span class="s">object</span>
              <span class="na">required</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">zone_id</span><span class="pi">,</span> <span class="nv">name</span><span class="pi">,</span> <span class="nv">type</span><span class="pi">,</span> <span class="nv">content</span><span class="pi">]</span>
              <span class="na">properties</span><span class="pi">:</span>
                <span class="na">zone_id</span><span class="pi">:</span> <span class="pi">{</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">string</span> <span class="pi">}</span>
                <span class="na">name</span><span class="pi">:</span>    <span class="pi">{</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">string</span> <span class="pi">}</span>
                <span class="na">type</span><span class="pi">:</span>
                  <span class="na">type</span><span class="pi">:</span> <span class="s">string</span>
                  <span class="na">enum</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">A</span><span class="pi">,</span> <span class="nv">AAAA</span><span class="pi">,</span> <span class="nv">CNAME</span><span class="pi">,</span> <span class="nv">TXT</span><span class="pi">,</span> <span class="nv">MX</span><span class="pi">,</span> <span class="nv">NS</span><span class="pi">,</span> <span class="nv">SRV</span><span class="pi">,</span> <span class="nv">CAA</span><span class="pi">]</span>
                <span class="na">content</span><span class="pi">:</span> <span class="pi">{</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">string</span> <span class="pi">}</span>
                <span class="na">ttl</span><span class="pi">:</span>     <span class="pi">{</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">integer</span><span class="pi">,</span> <span class="nv">default</span><span class="pi">:</span> <span class="nv">1</span> <span class="pi">}</span>
                <span class="na">proxied</span><span class="pi">:</span> <span class="pi">{</span> <span class="nv">type</span><span class="pi">:</span> <span class="nv">boolean</span><span class="pi">,</span> <span class="nv">default</span><span class="pi">:</span> <span class="nv">false</span> <span class="pi">}</span>
            <span class="na">status</span><span class="pi">:</span>
              <span class="na">type</span><span class="pi">:</span> <span class="s">object</span>
              <span class="na">x-kubernetes-preserve-unknown-fields</span><span class="pi">:</span> <span class="kc">true</span>
</code></pre></div></div>

<p>With the CRD applied and the operator running, you can create records by applying ordinary manifests. Here are two - an A record and a TXT record:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">dns.operator.io/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">CloudflareDNSRecord</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">mguarinos-com-apex-a</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">zone_id</span><span class="pi">:</span> <span class="s2">"</span><span class="s">daeed8d03dd34a9923222a33e96986ff"</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">mguarinos.com"</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s2">"</span><span class="s">A"</span>
  <span class="na">content</span><span class="pi">:</span> <span class="s2">"</span><span class="s">1.1.1.1"</span>
  <span class="na">ttl</span><span class="pi">:</span> <span class="m">1</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">dns.operator.io/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">CloudflareDNSRecord</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">mguarinos-com-apex-txt</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">zone_id</span><span class="pi">:</span> <span class="s2">"</span><span class="s">daeed8d03dd34a9923222a33e96986ff"</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">mguarinos.com"</span>
  <span class="na">type</span><span class="pi">:</span> <span class="s2">"</span><span class="s">TXT"</span>
  <span class="na">content</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Hello</span><span class="nv"> </span><span class="s">world!"</span>
  <span class="na">ttl</span><span class="pi">:</span> <span class="m">300</span>
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl apply <span class="nt">-f</span> records.yaml <span class="nt">-n</span> cf-operator
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cloudflarednsrecord.dns.operator.io/mguarinos-com-apex-a created
cloudflarednsrecord.dns.operator.io/mguarinos-com-apex-txt created
</code></pre></div></div>

<p>The operator picks up both objects immediately and creates the corresponding records in Cloudflare. A few seconds later:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get cfdr <span class="nt">-n</span> cf-operator
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get cfdr -n cf-operator -o wide
NAME                     NAME            TYPE   CONTENT        STATUS          LAST SYNC              AGE   ZONE ID                            RECORD ID                          TTL   PROXIED
mguarinos-com-apex-a     mguarinos.com   A      1.1.1.1        RecordSynced    2026-04-15T22:02:55Z   12s   daeed8d03dd34a9923222a33e96986ff   c62658232a4663be7d9610f69b186572   1     false
mguarinos-com-apex-txt   mguarinos.com   TXT    Hello world!   RecordSynced    2026-04-15T22:03:01Z   6s    daeed8d03dd34a9923222a33e96986ff   aafac48cbcc241ca882e1646e24098a8   300   false

</code></pre></div></div>

<p>CRDs also define a <code class="language-plaintext highlighter-rouge">status</code> subresource, which is a separate write path from the spec. The operator uses it to record what it observed: the Cloudflare record ID it created, the last sync timestamp, and a standard Kubernetes <code class="language-plaintext highlighter-rouge">conditions</code> array. The single condition (<code class="language-plaintext highlighter-rouge">type: Synced</code>) follows the <code class="language-plaintext highlighter-rouge">True</code>/<code class="language-plaintext highlighter-rouge">False</code> convention with a CamelCase <code class="language-plaintext highlighter-rouge">reason</code> token - <code class="language-plaintext highlighter-rouge">RecordSynced</code>, <code class="language-plaintext highlighter-rouge">DriftDetected</code>, or <code class="language-plaintext highlighter-rouge">SyncFailed</code> - and a human-readable <code class="language-plaintext highlighter-rouge">message</code> field. Using the standard conditions format means tools like <code class="language-plaintext highlighter-rouge">kubectl wait</code>, ArgoCD health checks, and other GitOps tooling understand the resource state without any custom logic. The subresource separation means the operator can patch status without triggering a reconciliation of the spec - there is no feedback loop between the two write paths.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl describe cfdr -n cf-operator mguarinos-com-apex-a
Name:         mguarinos-com-apex-a
Namespace:    cf-operator
Labels:       &lt;none&gt;
Annotations:  kopf.zalando.org/last-handled-configuration:
                {"spec":{"content":"1.1.1.1","name":"mguarinos.com","proxied":false,"ttl":1,"type":"A","zone_id":"daeed8d03dd34a9923222a33e96986ff"}}
API Version:  dns.operator.io/v1
Kind:         CloudflareDNSRecord
Metadata:
  Creation Timestamp:  2026-04-15T22:32:47Z
  Finalizers:
    dns.operator.io/cloudflare-cleanup
  Generation:        1
  Resource Version:  7945
  UID:               1c4f445a-fa38-4584-aca7-f1c1d3580d4d
Spec:
  Content:  1.1.1.1
  Name:     mguarinos.com
  Proxied:  false
  Ttl:      1
  Type:     A
  zone_id:  daeed8d03dd34a9923222a33e96986ff
Status:
  Conditions:
    Last Transition Time:  2026-04-15T22:32:48Z
    Message:
    Reason:                RecordSynced
    Status:                True
    Type:                  Synced
  last_sync:    2026-04-15T22:32:48Z
  record_id:    c6853b1fecfd4e3efbd263f835eb70fa
Events:         &lt;none&gt;
</code></pre></div></div>

<hr />

<h2 id="the-reconciliation-loop">The reconciliation loop</h2>

<p>An operator is a process - typically a pod in the cluster - that watches the Kubernetes API for events on its custom resource type and reacts to them. The core of the DNS operator is four handlers:</p>

<p><strong>on.create</strong> - when a <code class="language-plaintext highlighter-rouge">CloudflareDNSRecord</code> object appears, call <code class="language-plaintext highlighter-rouge">cf.dns.records.create</code>, then write the returned Cloudflare record ID into <code class="language-plaintext highlighter-rouge">.status.record_id</code>. That ID is the link between the Kubernetes object and the external resource. Without it, the operator cannot update or delete the record later.</p>

<p><strong>on.update</strong> - when the spec changes, call <code class="language-plaintext highlighter-rouge">cf.dns.records.update</code> with the new values. If the status has no <code class="language-plaintext highlighter-rouge">record_id</code> (the operator was offline when the object was created), fall back to creating the record rather than failing.</p>

<p><strong>on.delete</strong> - delete the Cloudflare record before allowing Kubernetes to remove the object. The finalizer (described below) is what makes this ordering possible.</p>

<p><strong>timer</strong> - every 60 seconds, fetch the live record from Cloudflare and compare it to the spec. If they differ, revert Cloudflare to match Kubernetes.</p>
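<p>A framework-free sketch of the same four handlers makes the flow concrete. The names and the in-memory <code class="language-plaintext highlighter-rouge">FAKE_CF</code> dict (standing in for the Cloudflare API) are illustrative, not the operator's real code; in the operator each function carries the corresponding kopf decorator:</p>

```python
# Illustrative sketch only. In the real operator these are kopf handlers
# (@kopf.on.create / @kopf.on.update / @kopf.on.delete / @kopf.timer) and
# FAKE_CF is the Cloudflare API.
import itertools

FAKE_CF = {}                         # record_id -> record fields
_ids = itertools.count(1)

def cf_create(spec):
    record_id = f"rec-{next(_ids)}"
    FAKE_CF[record_id] = dict(spec)
    return record_id

def on_create(spec, status):
    # Create the record, then persist the Cloudflare record ID in status -
    # that ID is the only link back to the external resource.
    status["record_id"] = cf_create(spec)

def on_update(spec, status):
    # Update in place; fall back to create when status has no record_id
    # (the operator was offline when the object appeared).
    record_id = status.get("record_id")
    if record_id is None:
        status["record_id"] = cf_create(spec)
    else:
        FAKE_CF[record_id] = dict(spec)

def on_delete(status):
    # Delete the external record first; only then may the finalizer be lifted.
    FAKE_CF.pop(status.get("record_id"), None)

def on_timer(spec, status):
    # Every 60 seconds: diff live state against the spec, revert any drift.
    record_id = status["record_id"]
    if FAKE_CF.get(record_id) != dict(spec):
        FAKE_CF[record_id] = dict(spec)
        return "DriftDetected"
    return "RecordSynced"
```

Note that every handler is written to be idempotent: running it twice against the same spec leaves the external state unchanged, which is what lets the framework retry freely.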

<p>When the operator pod starts you can see all of this initialising in the logs:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl logs -n cf-operator cloudflare-dns-operator-6b55f8556d-5hw97
[2026-04-15 21:59:48,041] kopf._core.reactor.r [DEBUG   ] Starting Kopf 1.44.5.
[2026-04-15 21:59:48,042] kopf.activities.star [DEBUG   ] Activity 'on_startup' is invoked.
[2026-04-15 21:59:48,042] __kopf_script_0__src [INFO    ] Cloudflare DNS Operator starting. namespace=cf-operator  secret=cloudflare-api-token  drift_interval=60s
[2026-04-15 21:59:48,043] helpers              [DEBUG   ] K8s config: loaded in-cluster service-account credentials.
[2026-04-15 21:59:48,052] kubernetes.client.re [DEBUG   ] response body: {REDACTED}

[2026-04-15 21:59:48,054] __kopf_script_0__src [INFO    ] Cloudflare token loaded successfully (length=53).
[2026-04-15 21:59:48,055] kopf.activities.star [INFO    ] Activity 'on_startup' succeeded.
[2026-04-15 21:59:48,056] kopf._core.engines.a [INFO    ] Initial authentication has been initiated.
[2026-04-15 21:59:48,056] kopf.activities.auth [DEBUG   ] Activity 'login_via_client' is invoked.
[2026-04-15 21:59:48,057] kopf.activities.auth [DEBUG   ] Client is configured in cluster with service account.
[2026-04-15 21:59:48,058] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2026-04-15 21:59:48,058] kopf._core.engines.a [INFO    ] Initial authentication has finished.
[2026-04-15 21:59:48,152] kopf._cogs.clients.w [DEBUG   ] Starting the watch-stream for customresourcedefinitions.v1.apiextensions.k8s.io cluster-wide.
[2026-04-15 21:59:48,153] kopf._cogs.clients.w [DEBUG   ] Starting the watch-stream for cloudflarednsrecords.v1.dns.operator.io cluster-wide.
</code></pre></div></div>

<p>Together these four handlers mean the operator never needs to be told what changed. It observes events and acts on them. If the operator crashes and restarts, the reconciliation loop catches up automatically - any pending creates become updates, any missed deletes are replayed.</p>

<figure>
  <img src="/assets/images/kubernetes-operator-pattern/reconciliation-loop.svg" alt="The reconciliation loop: Watch → Compare → Reconcile → Patch Status, with a bypass arc for the 'matches' case and a feedback loop back to Watch" />
  <figcaption>The four-step loop. When the live state matches the spec, the Reconcile step is skipped entirely (green arc). The loop repeats on every event and every 60-second timer tick.</figcaption>
</figure>

<hr />

<h2 id="finalizers-the-guarantee-on-delete">Finalizers: the guarantee on delete</h2>

<p>Without a finalizer, <code class="language-plaintext highlighter-rouge">kubectl delete</code> removes the Kubernetes object immediately and the Cloudflare record is left behind. Finalizers prevent that.</p>

<p>When the operator starts, it registers the string <code class="language-plaintext highlighter-rouge">dns.operator.io/cloudflare-cleanup</code> as a finalizer on every object it manages. Kubernetes will not actually delete an object that has a finalizer on it - it only sets a <code class="language-plaintext highlighter-rouge">deletionTimestamp</code> and blocks. The API server then fires a delete event to the operator.</p>

<p>The operator’s delete handler calls <code class="language-plaintext highlighter-rouge">cf.dns.records.delete</code>. If that call succeeds, the handler returns and the operator removes the finalizer. Kubernetes sees the finalizer list is now empty and removes the object. If the Cloudflare call fails the handler raises a temporary error and the framework retries it every 30 seconds. The finalizer stays in place until the deletion is confirmed.</p>

<p>The result: as long as the operator is running, it is impossible for a <code class="language-plaintext highlighter-rouge">kubectl delete</code> to leave an orphaned DNS record.</p>
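<p>The ordering contract can be sketched in a few lines. This is framework-free pseudologic with hypothetical names - kopf manages the finalizer list and the retry schedule for you - but it shows why the object cannot disappear before Cloudflare is clean:</p>

```python
# Sketch of the finalizer contract; in the real operator kopf does this wiring.
FINALIZER = "dns.operator.io/cloudflare-cleanup"

def handle_delete(obj, cf_delete):
    """Run on the delete event. Returns True once Kubernetes may remove
    the object, i.e. once the external record is confirmed gone."""
    try:
        cf_delete(obj["status"]["record_id"])
    except ConnectionError:
        # Temporary failure: keep the finalizer so the object stays blocked;
        # the framework retries the handler (every 30 seconds here).
        return False
    obj["metadata"]["finalizers"].remove(FINALIZER)
    return not obj["metadata"]["finalizers"]     # empty list -> object removed
```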

<figure>
  <img src="/assets/images/kubernetes-operator-pattern/finalizer.svg" alt="Without a finalizer: kubectl delete removes the object immediately, leaving the Cloudflare DNS record orphaned. With a finalizer: the object is blocked until the operator confirms the record is deleted from Cloudflare." />
  <figcaption>Without a finalizer, the Kubernetes object disappears before anything cleans up Cloudflare. The finalizer inverts the order: the external resource is deleted first, then Kubernetes removes the object.</figcaption>
</figure>

<hr />

<h2 id="drift-detection-kubernetes-as-the-source-of-truth">Drift detection: Kubernetes as the source of truth</h2>

<p>The timer handler is where the operator earns its keep.</p>

<p>Kubernetes stores the desired state. Cloudflare holds the live state. These can diverge whenever a human edits a record directly in the Cloudflare dashboard. Without an operator, that divergence is silent - your IaC says one thing, your DNS actually does another.</p>

<p>Every 60 seconds the operator fetches the record from Cloudflare and diffs <code class="language-plaintext highlighter-rouge">content</code>, <code class="language-plaintext highlighter-rouge">proxied</code>, and <code class="language-plaintext highlighter-rouge">ttl</code> against the spec. If anything differs, it logs the exact discrepancy and calls <code class="language-plaintext highlighter-rouge">cf.dns.records.update</code> to revert it. The condition reason is set to <code class="language-plaintext highlighter-rouge">DriftDetected</code> the moment divergence is found, and back to <code class="language-plaintext highlighter-rouge">RecordSynced</code> once the revert succeeds.</p>
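<p>The comparison itself is a per-field diff over the three mutable fields. A minimal sketch (the function name is hypothetical):</p>

```python
# Hypothetical helper: compare the live Cloudflare record to the K8s spec.
def diff_record(live, desired, fields=("content", "proxied", "ttl")):
    """Return {field: (live_value, desired_value)} for every field where
    the Cloudflare record disagrees with the Kubernetes spec."""
    return {f: (live.get(f), desired.get(f))
            for f in fields
            if live.get(f) != desired.get(f)}
```

A non-empty result is what flips the condition to <code class="language-plaintext highlighter-rouge">DriftDetected</code> and triggers the revert.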

<p>To see this in action: go to the Cloudflare dashboard and change the IP on <code class="language-plaintext highlighter-rouge">mguarinos.com</code> from <code class="language-plaintext highlighter-rouge">1.1.1.1</code> to <code class="language-plaintext highlighter-rouge">1.0.0.1</code>. Within 60 seconds the operator logs:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[2026-04-15 22:03:55,213] kopf.objects         [WARNING ] [cf-operator/mguarinos-com-apex-a] Drift detected on record id=c62658232a4663be7d9610f69b186572 name=mguarinos.com:
  content : live='1.0.0.1'                       desired='1.1.1.1'
  proxied : live=False                           desired=False
  ttl     : live=120.0                           desired=1
Reverting to K8s spec.
</code></pre></div></div>

<p>This is what “Kubernetes is the source of truth” actually means in practice: not a policy, but a running process that enforces it. Any change made outside the cluster is overwritten within a minute.</p>

<h2 id="choosing-a-framework">Choosing a framework</h2>

<p>Operators in Go using <a href="https://book.kubebuilder.io/">kubebuilder</a> or the <a href="https://sdk.operatorframework.io/">Operator SDK</a> are the production standard. You get generated boilerplate, built-in status conditions, strong typing, and the full ecosystem of controller-runtime tooling. The tradeoff is that before writing a single line of business logic you are wiring up schemes, registering types, and configuring manager options.</p>

<p><a href="https://kopf.readthedocs.io/">Kopf</a> (Kubernetes Operator Pythonic Framework) flips that tradeoff. A handler is a decorated Python function. The framework handles watches, retries, status patching, finalizer registration, and leader-election. For an operator with a small surface area - a handful of handlers, one external API - the reduction in boilerplate is significant without giving up the important guarantees.</p>

<p>The DNS operator uses kopf. The operator logic is split across three focused files: constants and configuration, shared helpers (K8s client, Cloudflare client, status utilities), and the kopf handlers themselves. For a team already fluent in Python and operating against well-understood Python SDKs (e.g. the Cloudflare client), this is often the right call.</p>

<hr />

<h2 id="when-to-write-an-operator">When to write an operator</h2>

<p>An operator is the right tool when:</p>

<ul>
  <li>You have an external resource with a lifecycle (create, update, delete) that needs to track Kubernetes objects.</li>
  <li>You want drift detection.</li>
  <li>The resource type is long-lived and managed by multiple people, where a CI script or manual Terraform run is too fragile.</li>
</ul>

<p>An operator is overkill when you just need to run a job on deploy, transform a config value, or provision something once. A <code class="language-plaintext highlighter-rouge">Job</code>, a Helm hook, or an init container is simpler and easier to reason about.</p>]]></content><author><name>Manuel Guarinos</name></author><category term="kubernetes" /><category term="infrastructure" /><category term="kubernetes" /><category term="operators" /><category term="kopf" /><category term="cloudflare" /><category term="python" /><category term="crd" /><summary type="html"><![CDATA[Operators extend Kubernetes' reconciliation model beyond pods and services to anything - DNS records, database users, cloud resources. Here's the mental model, the mechanics, and a concrete DNS operator to make it tangible.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://mguarinos.com/assets/images/kubernetes-operator-pattern/header.svg" /><media:content medium="image" url="https://mguarinos.com/assets/images/kubernetes-operator-pattern/header.svg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Streamline: a serverless live streaming platform with 4-hour DVR on AWS</title><link href="https://mguarinos.com/posts/2026/04/13/streamline-serverless-live-streaming-aws/" rel="alternate" type="text/html" title="Streamline: a serverless live streaming platform with 4-hour DVR on AWS" /><published>2026-04-13T00:00:00+00:00</published><updated>2026-04-13T00:00:00+00:00</updated><id>https://mguarinos.com/posts/2026/04/13/streamline-serverless-live-streaming-aws</id><content type="html" xml:base="https://mguarinos.com/posts/2026/04/13/streamline-serverless-live-streaming-aws/"><![CDATA[<p>AWS IVS gives you managed RTMP ingest, LL-HLS transcode, and a built-in 4-hour DVR window. CloudFront gives you a global CDN. Lambda gives you a cold-start-under-200ms API. 
Put them together with a bit of Terraform and you get a live streaming platform that costs nothing at rest, scales automatically, and lets viewers rewind up to four hours - no S3 recording bucket, no media server, no operational overhead.</p>

<p>This post walks through the architecture of <a href="https://github.com/mguarinos/streamline">Streamline</a>, the design decisions behind it, and why certain pieces are wired together the way they are.</p>

<figure>
  <img src="/assets/images/streamline/screenshot-player.png" alt="Streamline player with video quality selector" />
  <figcaption>The Streamline player — Video.js with LL-HLS, quality selector, and a DVR scrubber that lets viewers rewind up to four hours.</figcaption>
</figure>

<figure>
  <img src="/assets/images/streamline/screenshot-obs.png" alt="OBS Studio broadcasting to the Streamline RTMP endpoint" />
  <figcaption>OBS Studio configured with the IVS ingest endpoint and stream key - two fields, then you're live.</figcaption>
</figure>

<hr />

<h2 id="architecture">Architecture</h2>

<p>The whole system fits in one diagram. A broadcaster pushes RTMP to IVS. Viewers hit a single CloudFront distribution that fans out to three origins depending on the URL path. A side channel - EventBridge → Lambda → SSM - keeps stream state without any polling.</p>

<p><img src="/assets/images/streamline/architecture.svg" alt="Streamline architecture — broadcaster to viewer via IVS and CloudFront, with EventBridge/Lambda/SSM state side channel" /></p>

<p>There is no media server. There is no recording bucket. IVS handles ingest and transcode entirely on its own infrastructure. CloudFront is the only public surface - the S3 bucket and Lambda function URL both reject requests that don’t come through CloudFront.</p>

<hr />

<h2 id="the-dvr-window">The DVR window</h2>

<p>IVS STANDARD channels maintain a rolling 4-hour DVR window internally. There is no <code class="language-plaintext highlighter-rouge">recording_configuration_arn</code>, no S3 bucket, no retention policy. The HLS manifest IVS generates contains the full seekable range. Video.js reads it automatically when you set <code class="language-plaintext highlighter-rouge">liveui: true</code> - no special URL parameters or player configuration required beyond that flag.</p>

<p>While a stream is live, a viewer can drag the progress bar all the way back to hour zero. Clicking the <strong>LIVE</strong> button snaps back to the live edge instantly. When the stream ends, the DVR segments are discarded.</p>

<p><img src="/assets/images/streamline/dvr-timeline.svg" alt="DVR timeline — drag to rewind up to 4 hours, LIVE button snaps back to the edge" /></p>

<p>Configuring OBS is a two-field job: paste the <code class="language-plaintext highlighter-rouge">ingest_endpoint</code> and the <code class="language-plaintext highlighter-rouge">stream_key</code> (retrieved from Secrets Manager).</p>

<p><img src="/assets/images/streamline/obs-settings.svg" alt="OBS stream settings — Server and Stream Key fields" /></p>

<hr />

<h2 id="how-request-routing-works">How request routing works</h2>

<p>A single CloudFront distribution handles three completely different types of traffic. The path prefix determines which origin receives the request:</p>

<p><img src="/assets/images/streamline/request-routing.svg" alt="Request routing - CloudFront fans out to S3 (player page), Lambda (status API), and IVS (HLS segments) based on path prefix" /></p>

<p>Each origin has its own cache policy:</p>
<ul>
  <li><strong>S3</strong> (<code class="language-plaintext highlighter-rouge">/*</code>): <code class="language-plaintext highlighter-rouge">index.html</code> gets <code class="language-plaintext highlighter-rouge">must-revalidate</code> (always fresh); other assets get <code class="language-plaintext highlighter-rouge">immutable</code> (hash in filename, 1-year TTL)</li>
  <li><strong>Lambda</strong> (<code class="language-plaintext highlighter-rouge">/api/*</code>): <code class="language-plaintext highlighter-rouge">no-cache</code> - the Lambda itself has a 10-second in-memory cache, so CloudFront doesn’t need to</li>
  <li><strong>IVS</strong> (<code class="language-plaintext highlighter-rouge">/hls/*</code>): 5-second TTL - long enough to reduce origin hits, short enough that the live edge stays fresh</li>
</ul>

<hr />

<h2 id="stream-state-eventbridge--ssm-instead-of-polling">Stream state: EventBridge + SSM instead of polling</h2>

<p>The player needs to know whether a stream is live before it tries to load an HLS manifest. The naive approach - calling IVS <code class="language-plaintext highlighter-rouge">GetStream</code> on every API request - adds unnecessary latency and cost at scale. The approach here is event-driven:</p>

<p><img src="/assets/images/streamline/state-machine.svg" alt="Stream state machine - idle and live states driven by IVS events via EventBridge and Lambda" /></p>

<p>When a broadcaster goes live, IVS fires a <code class="language-plaintext highlighter-rouge">Stream Start</code> event to EventBridge. EventBridge invokes Lambda, which writes <code class="language-plaintext highlighter-rouge">{"status":"live","updatedAt":"..."}</code> to an SSM Parameter. When the stream ends or fails, the same path runs in reverse.</p>

<p>The <code class="language-plaintext highlighter-rouge">/api/stream</code> handler reads this SSM parameter (with a 10-second module-level cache) and returns the current state to the player. The Lambda function never polls IVS directly during normal operation - IVS pushes state changes to it. If the SSM parameter doesn’t exist yet (stream has never been live), <code class="language-plaintext highlighter-rouge">ParameterNotFound</code> is caught and mapped to <code class="language-plaintext highlighter-rouge">idle</code>.</p>
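<p>The read path fits in a few lines. A sketch of the caching and fallback logic (the fetcher is injected so it runs without AWS; every identifier here is illustrative, not the project’s real code):</p>

```typescript
// Sketch of the /api/stream read path: a module-level cache in front of
// an SSM GetParameter call, with ParameterNotFound mapped to "idle".
const TTL_MS = 10_000;
let cached: { value: string; at: number } | null = null;

async function streamState(
  fetchParam: () => Promise<string>, // real code: SSM GetParameter
  now: () => number = Date.now,
) {
  if (cached && now() - cached.at < TTL_MS) return cached.value; // warm hit
  let value: string;
  try {
    value = await fetchParam();
  } catch (err: any) {
    if (err?.name === "ParameterNotFound") {
      value = JSON.stringify({ status: "idle" }); // never been live
    } else {
      throw err;
    }
  }
  cached = { value, at: now() };
  return value;
}
```

<p>Because <code class="language-plaintext highlighter-rouge">cached</code> lives at module scope, it persists across warm invocations of the same Lambda execution environment and is rebuilt on cold start; that is the 10-second cache the cache-policy section relies on.</p>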

<hr />

<h2 id="security-why-the-lambda-rejects-direct-requests">Security: why the Lambda rejects direct requests</h2>

<p>The Lambda function URL is configured with <code class="language-plaintext highlighter-rouge">authorization_type = "AWS_IAM"</code>. Access is granted exclusively to <code class="language-plaintext highlighter-rouge">cloudfront.amazonaws.com</code> with a condition scoped to this specific distribution’s ARN. CloudFront uses an Origin Access Control (OAC) to sign every request to the Lambda origin with SigV4 before forwarding it.</p>

<p>The practical result: requests arriving at the function URL from any other source - curl, another Lambda, another CloudFront distribution - are rejected by IAM before they reach the function code. The same OAC pattern applies to S3, where the bucket policy blocks all public access and only allows requests signed by this distribution’s OAC.</p>
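<p>In policy terms, the grant on the function URL looks roughly like this (the account ID, region, function name, and distribution ID are placeholders; the shape follows the standard CloudFront-to-Lambda OAC pattern, not the project’s literal policy):</p>

```typescript
// Resource-based policy the OAC pattern implies for the function URL.
// All ARN components below are placeholders.
const functionUrlPolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Principal: { Service: "cloudfront.amazonaws.com" },
      Action: "lambda:InvokeFunctionUrl",
      Resource: "arn:aws:lambda:eu-west-1:111122223333:function:streamline-prod",
      Condition: {
        // Only SigV4-signed requests originating from this one
        // distribution satisfy the condition; everything else is
        // rejected by IAM before the function code runs.
        ArnLike: {
          "AWS:SourceArn": "arn:aws:cloudfront::111122223333:distribution/EDFDVBD6EXAMPLE",
        },
      },
    },
  ],
};
```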

<hr />

<h2 id="infrastructure-as-code">Infrastructure as code</h2>

<p>Six focused Terraform modules:</p>

<table>
  <thead>
    <tr>
      <th>Module</th>
      <th>Responsibility</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">ivs</code></td>
      <td>IVS channel, stream key, Secrets Manager secret</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">s3</code></td>
      <td>Frontend bucket, CloudFront OAC</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lambda</code></td>
      <td>IAM role, SSM parameter, function, alias, function URL, EventBridge rule</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">cloudfront</code></td>
      <td>Distribution, three origins, cache behaviours, optional custom domain wiring</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dns</code></td>
      <td>ACM certificate (us-east-1), Route 53 validation records and alias</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">monitoring</code></td>
      <td>CloudWatch alarms for Lambda errors/throttles and CloudFront 5xx; SNS topic</td>
    </tr>
  </tbody>
</table>

<p>The S3 bucket policy and the Lambda permission for CloudFront live in the root module. This is intentional: both need values from two different modules (<code class="language-plaintext highlighter-rouge">s3</code>/<code class="language-plaintext highlighter-rouge">cloudfront</code> and <code class="language-plaintext highlighter-rouge">lambda</code>/<code class="language-plaintext highlighter-rouge">cloudfront</code> respectively), and putting them in either child module would create a circular dependency. Wiring them at the root lets Terraform resolve the dependency order in a single apply.</p>

<p>State locking uses Terraform 1.10’s native S3 locking (<code class="language-plaintext highlighter-rouge">use_lockfile = true</code>). No DynamoDB table required.</p>

<hr />

<h2 id="deployment-pipeline">Deployment pipeline</h2>

<p>Every production deploy is triggered by a semver tag:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git tag v1.0.0 <span class="o">&amp;&amp;</span> git push origin v1.0.0
</code></pre></div></div>

<p>The workflow runs three jobs. <code class="language-plaintext highlighter-rouge">prepare</code> extracts the version and detects which paths changed. <code class="language-plaintext highlighter-rouge">deploy-frontend</code> and <code class="language-plaintext highlighter-rouge">deploy-lambda</code> run in parallel and only execute if their respective paths changed since the previous tag.</p>
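<p>The change detection itself is just a prefix test over the files touched since the previous tag (in the workflow this list comes from a <code class="language-plaintext highlighter-rouge">git diff --name-only</code> between tags; the directory names below are assumptions, not necessarily the repository’s real layout):</p>

```typescript
// Decides which deploy jobs need to run, given the paths changed
// since the last tag. "frontend/" and "lambda/" are assumed names.
function jobsToRun(changedPaths: string[]) {
  return {
    frontend: changedPaths.some((p) => p.startsWith("frontend/")),
    lambda: changedPaths.some((p) => p.startsWith("lambda/")),
  };
}
```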

<p><code class="language-plaintext highlighter-rouge">deploy-lambda</code> builds the TypeScript source, prunes devDependencies, zips <code class="language-plaintext highlighter-rouge">dist/</code> and <code class="language-plaintext highlighter-rouge">node_modules/</code>, uploads the zip to Lambda, waits for propagation, publishes an immutable version snapshot, and points the <code class="language-plaintext highlighter-rouge">live</code> alias at it. Every Lambda version is immutable — rolling back is a single AWS CLI call:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws lambda update-alias <span class="se">\</span>
  <span class="nt">--function-name</span> streamline-prod <span class="se">\</span>
  <span class="nt">--name</span> live <span class="se">\</span>
  <span class="nt">--function-version</span> PREVIOUS_VERSION_NUMBER
</code></pre></div></div>

<p>GitHub Actions authenticates to AWS via OIDC. There are no long-lived AWS credentials stored as secrets.</p>

<hr />

<h2 id="getting-started">Getting started</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/mguarinos/streamline.git
<span class="nb">cd </span>streamline
./scripts/bootstrap.sh          <span class="c"># creates state bucket, OIDC provider, deploy role</span>
<span class="nb">cd </span>terraform
terraform init <span class="nt">-backend-config</span><span class="o">=</span>backend.hcl
terraform apply
terraform output                <span class="c"># note the ingest endpoint and stream key command</span>
</code></pre></div></div>

<p>Full setup instructions are in the <a href="https://github.com/mguarinos/streamline">README</a>.</p>]]></content><author><name>Manuel Guarinos</name></author><category term="aws" /><category term="serverless" /><category term="streaming" /><category term="aws" /><category term="ivs" /><category term="cloudfront" /><category term="lambda" /><category term="terraform" /><summary type="html"><![CDATA[A fully serverless live streaming platform built on AWS IVS, CloudFront, and Lambda — with a built-in 4-hour DVR window, no recording bucket, and a cost near zero at rest.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://mguarinos.com/assets/images/streamline/header.svg" /><media:content medium="image" url="https://mguarinos.com/assets/images/streamline/header.svg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>