Baking Hetzner images with Packer and cloud-init
We run a chunk of infrastructure on Hetzner Cloud. For a while we used stock Debian images and let cloud-init install everything at first boot: fail2ban, prometheus node exporter, base ansible bootstrap, TLS roots, a handful of other things. This took 3-4 minutes per new VM. At small scale, fine. At our scale (we often spin up a few dozen for batch jobs), death by a thousand boots.
The approach
Use Packer to produce a custom Hetzner snapshot with all the base packages baked in. Cloud-init on first boot only needs to run the “hostname, ssh keys, network, per-host secrets” part, which is fast.
The Packer config
A Packer HCL file targeting Hetzner:
packer {
  required_plugins {
    hcloud = {
      source  = "github.com/hetznercloud/hcloud"
      version = "~> 1.2"
    }
  }
}

source "hcloud" "base" {
  image         = "debian-12"
  location      = "nbg1"
  server_type   = "cx22"
  ssh_username  = "root"
  snapshot_name = "merce-base-{{timestamp}}"
  snapshot_labels = {
    role       = "base"
    built_from = "packer"
  }
}

build {
  name    = "base"
  sources = ["source.hcloud.base"]

  provisioner "shell" {
    script = "./scripts/apt-setup.sh"
  }

  provisioner "ansible" {
    playbook_file = "./ansible/baseline.yml"
    user          = "root"
    use_proxy     = false
  }

  provisioner "shell" {
    inline = [
      "apt-get clean",
      "rm -rf /var/lib/apt/lists/*",
      "cloud-init clean --logs --machine-id"
    ]
  }
}
cloud-init clean --machine-id is important. Without it, the snapshot retains the machine-id of the Packer-booted instance, and every VM created from the snapshot has the same ID. This causes journald to collapse logs across boots in confusing ways and breaks any tool that uses machine-id for identity.
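For reference, here is a rough shell equivalent of what that final cloud-init clean --logs --machine-id step does, sketched against a scratch root rather than the real filesystem (the exact file set cloud-init touches can differ by version):

```shell
#!/bin/sh
# Rough sketch of what `cloud-init clean --logs --machine-id` removes,
# replayed under a temp directory standing in for /.
root=$(mktemp -d)
mkdir -p "$root/var/lib/cloud/instance" "$root/var/log" "$root/etc"
touch "$root/var/log/cloud-init.log" "$root/var/log/cloud-init-output.log"
echo 0123456789abcdef0123456789abcdef > "$root/etc/machine-id"

rm -rf "$root/var/lib/cloud"/*               # per-instance state: forces a fresh first boot
rm -f "$root/var/log"/cloud-init*.log        # --logs
echo uninitialized > "$root/etc/machine-id"  # --machine-id; systemd regenerates it on next boot
```

With the machine-id marked uninitialized, systemd generates a fresh one on the first boot of every VM cloned from the snapshot.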
The baseline playbook
The ansible playbook is intentionally small. It installs and configures things that do not depend on the specific host:
- hosts: all
  become: true
  tasks:
    - apt:
        update_cache: true
        upgrade: safe
    - apt:
        name:
          - ca-certificates
          - curl
          - vim
          - python3
          - prometheus-node-exporter
          - fail2ban
          - chrony
          - unattended-upgrades
        state: present
    - copy:
        src: files/chrony.conf
        dest: /etc/chrony/chrony.conf
    - copy:
        src: files/sshd-hardening.conf
        dest: /etc/ssh/sshd_config.d/99-hardening.conf
Nothing here is host-specific. Host-specific config (hostname, host SSH keys, per-role packages, secrets) is applied at first boot via cloud-init.
The cloud-init userdata
Cloud-init on first boot reads userdata provided by Terraform. The userdata is minimal because most of the work is already done:
#cloud-config
hostname: ${hostname}
manage_etc_hosts: true
users:
  - default
write_files:
  - path: /etc/prometheus/node-exporter-tags.env
    content: |
      NODE_EXPORTER_ARGS="--collector.textfile.directory=/var/lib/node_exporter/textfile --web.listen-address=:9100"
runcmd:
  - systemctl restart prometheus-node-exporter
  - /usr/local/bin/join-ansible
join-ansible is a small script baked into the image that registers this VM with our ansible controller so it can pull its real config.
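The real script is specific to our controller, but a hypothetical sketch looks like this (the endpoint, token path, and payload shape are invented for illustration; only the pattern matters):

```shell
#!/bin/sh
# Hypothetical sketch of /usr/local/bin/join-ansible.
set -eu

# Build the registration body from the host's FQDN and its role label.
payload() {
  printf 'host=%s&role=%s' "$1" "$2"
}

# The real version POSTs to the controller with a one-time token fetched
# at first boot, along the lines of:
#   curl --fail -sS \
#     -H "Authorization: Bearer $(cat /run/secrets/join-token)" \
#     -d "$(payload "$(hostname -f)" app)" \
#     https://ansible-controller.internal/api/register
```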
Terraform wiring
Terraform picks the latest snapshot:
data "hcloud_image" "base" {
  with_selector = "role=base"
  most_recent   = true
}

resource "hcloud_server" "app" {
  count       = var.app_count
  name        = "app-${count.index + 1}"
  server_type = "cx22"
  image       = data.hcloud_image.base.id
  location    = "nbg1"
  ssh_keys    = [hcloud_ssh_key.ops.id]

  user_data = templatefile("${path.module}/user-data.tpl", {
    hostname = "app-${count.index + 1}.internal"
  })

  labels = {
    role = "app"
  }
}
Results
Boot time to “fully configured and scraping metrics”:
- Before (stock image + full cloud-init): ~4 min
- After (baked image + thin cloud-init): ~35 s
That is a 7x improvement on every boot. For burst workloads this matters because the critical path from “I need capacity” to “capacity is serving traffic” got shorter by minutes.
Packer pipeline
The Packer build runs weekly in CI:
# .gitlab-ci.yml
packer-build:
  image: hashicorp/packer:latest
  script:
    - packer init .
    - packer validate .
    - packer build .
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
That gives us fresh images with the latest apt updates every week; the rules clause restricts the job to runs triggered by a GitLab scheduled pipeline.
Gotchas
- Snapshot counts matter. Hetzner limits snapshots per project. We now run a cleanup step that keeps the last 5 and deletes older ones. Our Packer build is idempotent enough that old snapshots are safe to delete.
- If your baseline installs a kernel, reboot before snapshotting. Otherwise you end up with a kernel image whose modules don’t match the running kernel after the next boot. In Packer this means a shell provisioner that reboots the machine (with expect_disconnect = true) before the final cleanup step.
- Secrets never go in the baked image. It is a snapshot; anyone with access to the Hetzner project can clone it. All secrets are fetched at first boot from Vault.
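The cleanup step mentioned above can be sketched with the hcloud CLI; the retention logic is factored into a small function so it is easy to check, and the flag names assume the hcloud CLI is installed and configured:

```shell
#!/bin/sh
# Sketch of the snapshot-retention step. prune_ids reads "id created" lines
# and prints the ids of everything beyond the 5 newest snapshots.
prune_ids() {
  sort -k2 -r | awk 'NR > 5 { print $1 }'
}

# Wiring against the hcloud CLI (assumes HCLOUD_TOKEN is set):
#   hcloud image list --type snapshot --selector role=base \
#       -o noheader -o columns=id,created \
#     | prune_ids \
#     | xargs -r -n1 hcloud image delete
```

Sorting on the ISO-8601 created timestamp works lexicographically, so no date parsing is needed.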
Reflection
The pattern is: bake what does not change per-host, boot what does. Packer + cloud-init + a thin bootstrap is a durable pattern across cloud providers. We have a parallel Packer config for DigitalOcean that is almost identical, which makes cross-provider burst capacity easy in principle.
Related: see my post on Ansible at scale for the opposite-direction story of what happens when you try to do everything at boot time instead.