Nomad vs k8s for a homelab in 2024
I have been running k3s at home for years. I was curious whether Nomad would be friendlier for homelab workloads where I do not need the full k8s ecosystem. I spent a month running both and moving services back and forth. Here is what I think.
What I ran
- k3s cluster: three NUCs, single control plane (with embedded etcd), flannel CNI, Longhorn for storage, Traefik ingress.
- Nomad cluster: same three NUCs, Consul for service discovery, no storage layer (used host volumes), Fabio for service routing.
The workloads were a mix of homelab services: Jellyfin, Grafana, Immich, a few bespoke Go services, a Postgres, a Redis.
What Nomad got right
The HCL job files are easy to read. Here is a complete Grafana job:
job "grafana" {
  datacenters = ["home"]

  group "grafana" {
    network {
      port "http" { static = 3000 }
    }

    service {
      name = "grafana"
      port = "http"

      check {
        type     = "http"
        path     = "/api/health"
        interval = "10s"
        timeout  = "2s"
      }
    }

    volume "grafana-data" {
      type   = "host"
      source = "grafana-data"
    }

    task "grafana" {
      driver = "docker"

      config {
        image = "grafana/grafana:10.4.2"
        ports = ["http"]
      }

      volume_mount {
        volume      = "grafana-data"
        destination = "/var/lib/grafana"
      }
    }
  }
}
That is the whole thing. No Deployment, Service, ConfigMap, PVC, PV, Ingress tango. For a homelab it is honest and short.
Nomad’s resource scheduling is simple. I pin postgres to a specific node, and the job file says so. Nomad does not have as many abstractions as k8s, which I find a feature at home.
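Pinning a service to a node really is a one-stanza affair. A minimal sketch (the node name is a hypothetical hostname for my setup):

```hcl
job "postgres" {
  datacenters = ["home"]

  group "postgres" {
    # Pin to one machine; node.unique.name is a built-in Nomad attribute.
    constraint {
      attribute = "${node.unique.name}"
      value     = "nuc-1" # hypothetical hostname
    }

    task "postgres" {
      driver = "docker"
      config {
        image = "postgres:16"
      }
    }
  }
}
```

The constraint lives right in the job file, next to everything else about the job, rather than in a separate nodeSelector or affinity block.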
What k8s got right
Operators. I use four operators in my homelab that I would not know how to replace in Nomad. Cert-manager issues and rotates TLS certificates. External-DNS syncs my dnsmasq zones from Service and Ingress resources. Longhorn gives me replicated block storage. The Kubernetes metrics-server plus Prometheus Operator gives me turnkey monitoring.
Volume management. Longhorn is a real product. Its Nomad equivalent would be me writing CSI glue or settling for host volumes. I have had a disk fail in my homelab; Longhorn just dealt with it. Nomad CSI works but is less mature.
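To be concrete about what "settling for host volumes" means: the volume the Grafana job above mounts has to be declared in each client's config by hand, with no replication behind it. A sketch (the path is hypothetical):

```hcl
# In the Nomad client config, e.g. /etc/nomad.d/client.hcl
client {
  # Exposes a local directory as the "grafana-data" volume that the
  # job's volume/volume_mount stanzas reference. If this disk dies,
  # the data dies with it; there is no Longhorn-style replication.
  host_volume "grafana-data" {
    path      = "/srv/nomad/grafana" # hypothetical path
    read_only = false
  }
}
```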
Ecosystem. Every open-source tool ships k8s manifests first. Some ship Helm charts. A few ship Kustomize. Very few ship Nomad jobs, and the ones that do are usually six months behind.
Where they are basically tied
- Scheduling for simple workloads. If your job is “run one copy of this container and let a health check restart it”, both systems do this fine.
- Observability. Both have Prometheus-compatible exporters.
- Secrets. Vault is first-class in both; in k8s via the CSI driver or ESO, in Nomad natively.
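What "natively" means on the Nomad side, as a sketch (the policy name and secret path are hypothetical):

```hcl
task "app" {
  driver = "docker"

  # Nomad fetches and renews a Vault token for this task.
  vault {
    policies = ["app-read"] # hypothetical Vault policy
  }

  # Renders the secret into the allocation and restarts the task
  # when the secret changes; env = true exports it as env vars.
  template {
    data        = <<-EOT
      {{ with secret "secret/data/app" }}
      DB_PASSWORD={{ .Data.data.password }}
      {{ end }}
    EOT
    destination = "secrets/app.env"
    env         = true
  }
}
```

No CSI driver, no External Secrets Operator: the scheduler itself talks to Vault.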
Where Nomad actually broke me
I had three failures that I would not have had with k8s:
- A job updated itself in a way that broke the health check but kept the old allocation running. Nomad considered that fine, and did not roll back. I had to read the allocation history to find out which deployment was deemed successful and rewind manually.
- After a network blip, two Consul nodes got into a state where services were registered but marked critical. A consul reload fixed it, but the blast radius was unclear.
- Fabio did not pick up config changes reliably when the Consul tag style changed. I ended up replacing Fabio with Caddy, which means I lost some of the integrated routing story.
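In fairness to Nomad, the first failure is partly on me: an update stanza with auto_revert enabled would have rolled back automatically when the health check failed. A sketch of what I should have had in the group:

```hcl
group "grafana" {
  update {
    max_parallel     = 1
    health_check     = "checks" # gate the deployment on Consul health checks
    min_healthy_time = "30s"
    healthy_deadline = "5m"
    auto_revert      = true # roll back to the last good version on failure
  }
  # ... tasks as before
}
```

The difference from k8s is that rollback is opt-in here, whereas a Deployment's rolling update blocks on readiness probes by default.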
Where k8s kept winning at home
- Operator ecosystem. The more operators I use, the harder it is to walk away.
- Control plane. My k3s cluster is one binary. Adding or removing nodes is a single k3s agent invocation. Nomad + Consul is two services to manage, and Consul has its own cluster dynamics.
- Documentation. For anything you might want to do in a home cluster, the k8s answer is a StackOverflow search away.
Where k8s is still painful at home
- The memory floor. k3s is slim but still uses 1.5 GiB per node before you schedule anything useful. Nomad is closer to 300 MiB. On a NUC, that matters.
- The mental model. Explaining to someone new why you need a Deployment AND a Service AND an Ingress is tiresome.
- The number of YAML files per service. Even with Helm, you are juggling a lot of files for what is morally “run this container”.
My decision
I put k3s back. Not because Nomad is worse (it is not), but because my homelab is tangled up with the k8s ecosystem in ways that would take more time to unwind than I save. The operators, the Helm charts, the Longhorn volumes, the network policies I have spent weekends tuning — they are already there.
If I were starting fresh in 2024 with no existing setup and no desire to use operators, I would honestly try Nomad first. The HCL is pleasant. The state is simple. The failure modes are mostly legible.
Reflection
This is one of those “it depends” answers that people hate to read on blogs. The “it depends” this time is: what’s your workload, what’s your existing familiarity, and do you want to participate in the k8s ecosystem? If any of those answers is “standard web services, lots of operators, yes”, use k8s. If all three are “handful of boring containers, I can write my own glue, no”, give Nomad a weekend. Related: see my post on why I finally switched from nginx to Caddy.