Baking Hetzner images with Packer and cloud-init
We run a chunk of infrastructure on Hetzner Cloud. For a while we used stock Debian images and let cloud-init install everything at first boot: fail2ban, prometheus node exporter, base ansible bootstrap, TLS roots, a handful of other things. This took 3-4 minutes per new VM. At small scale, fine. At our scale (we often spin up a few dozen for batch jobs), death by a thousand boots.
The approach
Use Packer to produce a custom Hetzner snapshot with all the base packages baked in. Cloud-init on first boot only needs to run the “hostname, ssh keys, network, per-host secrets” part, which is fast.
The Packer config
A Packer HCL file targeting Hetzner:
packer {
  required_plugins {
    hcloud = {
      source  = "github.com/hetznercloud/hcloud"
      version = "~> 1.2"
    }
  }
}

source "hcloud" "base" {
  image         = "debian-12"
  location      = "nbg1"
  server_type   = "cx22"
  ssh_username  = "root"
  snapshot_name = "merce-base-{{timestamp}}"
  snapshot_labels = {
    role       = "base"
    built_from = "packer"
  }
}

build {
  name    = "base"
  sources = ["source.hcloud.base"]

  provisioner "shell" {
    script = "./scripts/apt-setup.sh"
  }

  provisioner "ansible" {
    playbook_file = "./ansible/baseline.yml"
    user          = "root"
    use_proxy     = false
  }

  provisioner "shell" {
    inline = [
      "apt-get clean",
      "rm -rf /var/lib/apt/lists/*",
      "cloud-init clean --logs --machine-id"
    ]
  }
}
cloud-init clean --machine-id is important. Without it, the snapshot retains the machine-id of the Packer-booted instance, and every VM created from the snapshot has the same ID. This causes journald to collapse logs across boots in confusing ways and breaks any tool that uses machine-id for identity.
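For reference, here is a rough shell equivalent of what that final cloud-init clean --logs --machine-id step does, sketched against a scratch root rather than the real filesystem (the exact file set cloud-init touches can differ by version):

```shell
#!/bin/sh
# Rough sketch of what `cloud-init clean --logs --machine-id` removes,
# replayed under a temp directory standing in for /.
root=$(mktemp -d)
mkdir -p "$root/var/lib/cloud/instance" "$root/var/log" "$root/etc"
touch "$root/var/log/cloud-init.log" "$root/var/log/cloud-init-output.log"
echo 0123456789abcdef0123456789abcdef > "$root/etc/machine-id"

rm -rf "$root/var/lib/cloud"/*               # per-instance state: forces a fresh first boot
rm -f "$root/var/log"/cloud-init*.log        # --logs
echo uninitialized > "$root/etc/machine-id"  # --machine-id; systemd regenerates it on next boot
```

With the machine-id marked uninitialized, systemd generates a fresh one on the first boot of every VM cloned from the snapshot.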
The baseline playbook
The ansible playbook is intentionally small. It installs and configures things that do not depend on the specific host:
- hosts: all
  become: true
  tasks:
    - apt:
        update_cache: true
        upgrade: safe
    - apt:
        name:
          - ca-certificates
          - curl
          - vim
          - python3
          - prometheus-node-exporter
          - fail2ban
          - chrony
          - unattended-upgrades
        state: present
    - copy:
        src: files/chrony.conf
        dest: /etc/chrony/chrony.conf
    - copy:
        src: files/sshd-hardening.conf
        dest: /etc/ssh/sshd_config.d/99-hardening.conf
Nothing here is host-specific. Host-specific config (hostname, host SSH keys, per-role packages, secrets) is applied at first boot via cloud-init.
The cloud-init userdata
Cloud-init on first boot reads userdata provided by Terraform. The userdata is minimal because most of the work is already done:
#cloud-config
hostname: ${hostname}
manage_etc_hosts: true
users:
  - default
write_files:
  - path: /etc/prometheus/node-exporter-tags.env
    content: |
      NODE_EXPORTER_ARGS="--collector.textfile.directory=/var/lib/node_exporter/textfile --web.listen-address=:9100"
runcmd:
  - systemctl restart prometheus-node-exporter
  - /usr/local/bin/join-ansible
join-ansible is a small script baked into the image that registers this VM with our ansible controller so it can pull its real config.
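The real script is specific to our controller, but a hypothetical sketch looks like this (the endpoint, token path, and payload shape are invented for illustration; only the pattern matters):

```shell
#!/bin/sh
# Hypothetical sketch of /usr/local/bin/join-ansible.
set -eu

# Build the registration body from the host's FQDN and its role label.
payload() {
  printf 'host=%s&role=%s' "$1" "$2"
}

# The real version POSTs to the controller with a one-time token fetched
# at first boot, along the lines of:
#   curl --fail -sS \
#     -H "Authorization: Bearer $(cat /run/secrets/join-token)" \
#     -d "$(payload "$(hostname -f)" app)" \
#     https://ansible-controller.internal/api/register
```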
Terraform wiring
Terraform picks the latest snapshot:
data "hcloud_image" "base" {
  with_selector = "role=base"
  most_recent   = true
}

resource "hcloud_server" "app" {
  count       = var.app_count
  name        = "app-${count.index + 1}"
  server_type = "cx22"
  image       = data.hcloud_image.base.id
  location    = "nbg1"
  ssh_keys    = [hcloud_ssh_key.ops.id]

  user_data = templatefile("${path.module}/user-data.tpl", {
    hostname = "app-${count.index + 1}.internal"
  })

  labels = {
    role = "app"
  }
}
Results
Boot time to “fully configured and scraping metrics”:
- Before (stock image + full cloud-init): ~4 min
- After (baked image + thin cloud-init): ~35 s
That is a 7x improvement on every boot. For burst workloads this matters because the critical path from “I need capacity” to “capacity is serving traffic” got shorter by minutes.
Packer pipeline
The Packer build runs weekly in CI:
# .gitlab-ci.yml
packer-build:
  image: hashicorp/packer:latest
  script:
    - packer init .
    - packer validate .
    - packer build .
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
That gives us fresh images with the latest apt updates every week; the rules clause restricts the job to runs triggered by a GitLab scheduled pipeline.
Gotchas
- Snapshot counts matter. Hetzner limits snapshots per project. We now run a cleanup step that keeps the last 5 and deletes older ones. Our Packer build is idempotent enough that old snapshots are safe to delete.
- If your baseline installs a kernel, reboot before snapshotting. Otherwise you end up with a kernel image whose modules don’t match the running kernel after the next boot. In Packer this means a shell provisioner that reboots the machine (with expect_disconnect = true) before the final cleanup step.
- Secrets never go in the baked image. It is a snapshot; anyone with access to the Hetzner project can clone it. All secrets are fetched at first boot from Vault.
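The cleanup step mentioned above can be sketched with the hcloud CLI; the retention logic is factored into a small function so it is easy to check, and the flag names assume the hcloud CLI is installed and configured:

```shell
#!/bin/sh
# Sketch of the snapshot-retention step. prune_ids reads "id created" lines
# and prints the ids of everything beyond the 5 newest snapshots.
prune_ids() {
  sort -k2 -r | awk 'NR > 5 { print $1 }'
}

# Wiring against the hcloud CLI (assumes HCLOUD_TOKEN is set):
#   hcloud image list --type snapshot --selector role=base \
#       -o noheader -o columns=id,created \
#     | prune_ids \
#     | xargs -r -n1 hcloud image delete
```

Sorting on the ISO-8601 created timestamp works lexicographically, so no date parsing is needed.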
Reflection
The pattern is: bake what does not change per-host, boot what does. Packer + cloud-init + a thin bootstrap is a durable pattern across cloud providers. We have a parallel Packer config for DigitalOcean that is almost identical, which makes cross-provider burst capacity easy in principle.
Related: see my post on Ansible at scale for the opposite-direction story of what happens when you try to do everything at boot time instead.