Capabilities bounded the wrong way
A service on one of our boxes suddenly could not bind to port 443 after a config change. The error was Permission denied, which is always a delight because it does not tell you which permission. The answer was in the five different capability sets a Linux process has.
The situation
The service in question was a small Go app serving HTTPS on port 443. It had been running as a non-root user with CAP_NET_BIND_SERVICE granted, which allows binding to low ports without being root. That is a standard pattern.
After a systemd unit change for unrelated reasons, it stopped working:
systemctl status app
# app.service - API service
# Loaded: loaded (/etc/systemd/system/app.service; enabled)
# Active: failed (Result: exit-code)
# Main process exited, code=exited, status=1/FAILURE
Logs:
listen tcp :443: bind: permission denied
User was still app. Binary was unchanged. What changed?
The change
A colleague had added ProtectSystem=strict and ProtectHome=true to the unit, which are good. They had also added:
CapabilityBoundingSet=
An empty CapabilityBoundingSet. This drops all capabilities from the bounding set, which is more aggressive than you might expect.
The capability sets, quickly
Linux has five capability sets per process:
- Inheritable (I): capabilities preserved across execve(), but only gained by the new program if its file also carries them in its inheritable set.
- Permitted (P): capabilities the process is permitted to hold.
- Effective (E): capabilities currently active.
- Bounding (BND): ceiling on what can ever be in P after exec.
- Ambient (A): a modern addition that preserves caps across exec without requiring file capabilities.
For our service to bind to port 443 as non-root, CAP_NET_BIND_SERVICE needs to be in E at the moment of bind(). For systemd units, you usually configure this via:
AmbientCapabilities=CAP_NET_BIND_SERVICE
But that ambient capability is only granted if it is also in the bounding set. If bounding is empty, ambient is dropped, permitted becomes empty, effective becomes empty. Bind fails.
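The rule above is easy to check mechanically before a unit ever reaches a box. What follows is a minimal sketch, not a real systemd tool: check_unit_caps is a name I made up, and it assumes at most one AmbientCapabilities= line and one CapabilityBoundingSet= line, with space-separated names and no line continuations.

```shell
#!/bin/sh
# Sketch: warn when a unit grants an ambient capability that its bounding set excludes.
# check_unit_caps is a hypothetical helper. Assumes a single AmbientCapabilities= line
# and a single CapabilityBoundingSet= line, space-separated, no continuations.
check_unit_caps() {
    local unit_file ambient bounding excluded cap status
    unit_file=$1
    status=0

    # No CapabilityBoundingSet= line means the bounding set is left alone.
    grep -q '^CapabilityBoundingSet=' "$unit_file" || return 0

    ambient=$(sed -n 's/^AmbientCapabilities=//p' "$unit_file" | tail -n 1)
    bounding=$(sed -n 's/^CapabilityBoundingSet=//p' "$unit_file" | tail -n 1)

    for cap in $ambient; do
        case "$bounding" in
            "~"*)
                # Inverted list: everything is kept except the names listed.
                excluded=${bounding#"~"}
                case " $excluded " in
                    *" $cap "*)
                        echo "$cap is ambient but removed from the bounding set"
                        status=1 ;;
                esac ;;
            *)
                # Plain list (possibly empty): only the names listed are kept.
                case " $bounding " in
                    *" $cap "*) ;;
                    *)
                        echo "$cap is ambient but missing from the bounding set"
                        status=1 ;;
                esac ;;
        esac
    done
    return "$status"
}
```

Run against the broken unit, check_unit_caps /etc/systemd/system/app.service would print one warning and return non-zero, which is enough to fail a pre-deploy check.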
Diagnosing
getpcaps or reading /proc/<pid>/status are the two easy tools:
cat /proc/$(pgrep -x app)/status | grep ^Cap
# CapInh: 0000000000000000
# CapPrm: 0000000000000000
# CapEff: 0000000000000000
# CapBnd: 0000000000000000
# CapAmb: 0000000000000000
All zeros. The service had no capabilities at all. Binding to 443 requires either running as root or holding CAP_NET_BIND_SERVICE in CapEff. Neither was true, so bind() failed with EACCES, which Go reports as permission denied.
Decoding a bitmap: capsh --decode=0x0000000000000400 decodes to cap_net_bind_service. But with all zeros there is nothing to decode.
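If capsh is not installed on the box, the bitmap can be decoded by hand. Here is a rough sketch of that: decode_caps is a hypothetical helper, and the name table is my transcription of the bit numbers in <linux/capability.h> (bit 0 is CAP_CHOWN, bit 10 is CAP_NET_BIND_SERVICE), so treat it as an assumption and check it against your kernel headers.

```shell
#!/bin/sh
# Sketch: decode a capability bitmap (hex, as in /proc/<pid>/status) into names.
# decode_caps is a hypothetical helper; names are listed in bit order, starting at bit 0.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod
lease audit_write audit_control setfcap mac_override mac_admin syslog
wake_alarm block_suspend audit_read perfmon bpf checkpoint_restore"

decode_caps() {
    local mask bit out name
    # Accept the value with or without a leading 0x.
    mask=$(( 0x${1#0x} ))
    bit=0
    out=""
    for name in $CAP_NAMES; do
        # Test one bit at a time and collect the matching names, comma-separated.
        if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then
            out="${out:+$out,}cap_$name"
        fi
        bit=$(( $bit + 1 ))
    done
    printf '%s\n' "$out"
}
```

For example, decode_caps 0x0000000000000400 prints cap_net_bind_service, and an all-zero mask prints an empty line.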
The fix
The correct systemd unit:
[Service]
ExecStart=/usr/local/bin/app
User=app
Group=app
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
ProtectSystem=strict
ProtectHome=true
NoNewPrivileges=true
Two lines: AmbientCapabilities= and CapabilityBoundingSet=. Both with the capability we need. Bounding is the ceiling; ambient pushes the cap into effective on exec.
After restart:
cat /proc/$(pgrep -x app)/status | grep ^Cap
# CapInh: 0000000000000400
# CapPrm: 0000000000000400
# CapEff: 0000000000000400
# CapBnd: 0000000000000400
# CapAmb: 0000000000000400
0x400 is CAP_NET_BIND_SERVICE. Bind succeeds.
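Eyeballing hex is fine once, but a post-deploy check can do it for you. A small sketch, assuming the status text arrives on stdin; has_effective_cap is a name invented for this post, and the bit number (10 for CAP_NET_BIND_SERVICE) is passed as the argument.

```shell
#!/bin/sh
# Sketch: succeed if the CapEff line read from stdin has the given capability bit set.
# has_effective_cap is a hypothetical helper; bit 10 is CAP_NET_BIND_SERVICE.
has_effective_cap() {
    local bit hex
    bit=$1
    # Pick the hex value from the "CapEff:" line of a /proc/<pid>/status dump.
    hex=$(awk '$1 == "CapEff:" { print $2 }')
    # Return 2 when no CapEff line was found, so "missing" differs from "not set".
    [ -n "$hex" ] || return 2
    [ $(( (0x$hex >> $bit) & 1 )) -eq 1 ]
}
```

Usage would look like has_effective_cap 10 < "/proc/$(pgrep -x app)/status" && echo "can bind low ports".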
Why CapabilityBoundingSet= empty is a thing
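It is there because for some services you genuinely want no capabilities at all and want the kernel to enforce it. Think a log-reading service that has nothing to do with networking and should never gain capabilities for any reason. Hardening guides sometimes recommend setting CapabilityBoundingSet= with an empty value to drop everything. (Do not confuse this with CapabilityBoundingSet=~ with nothing after the tilde, which does the opposite and resets the bounding set to the full set.)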
The colleague had copied that recommendation from a hardening guide without noticing that our service specifically needed CAP_NET_BIND_SERVICE.
A wider lesson
“No capabilities” is a great default for leaf services that do not need to do anything privileged. For a service that binds to low ports, you must keep CAP_NET_BIND_SERVICE. For a service that writes raw packets, CAP_NET_RAW. For a service that ptraces other processes, CAP_SYS_PTRACE. Know your capability needs before you harden.
A good discipline: for each service, document which capabilities it actually uses. You can discover this by running with default (unrestricted) caps and then whittling down with CapabilityBoundingSet=~cap_thing experiments. Or use strace -e trace=%network and friends to find privileged operations.
Other relevant systemd hardening
If you liked CapabilityBoundingSet=, you will love the rest:
# Prevent setuid/setgid binaries from giving more caps
NoNewPrivileges=true
# Mount the whole file system hierarchy read-only, except /dev, /proc and /sys
ProtectSystem=strict
# Hide /home, /root, /run/user
ProtectHome=true
# Read-only hierarchies
ReadOnlyPaths=/etc /usr
# Private /tmp
PrivateTmp=true
# Filter syscalls
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM
Each of these is a potential foot-gun in specific contexts. SystemCallFilter=@system-service is a reasonable default but will break anything that uses unusual syscalls. Seccomp filters are particularly fun because the failure often shows up as an unrelated-looking error, or as the process being killed with SIGSYS, and neither tells you much about the cause.
I now test hardening changes in staging first, and I run systemd-analyze security app.service which spits out a score and a rundown:
systemd-analyze security app.service
# → Overall exposure level: 2.0 (OK)
# ...
# ✓ CapabilityBoundingSet=~CAP_SYS_MODULE Service cannot load or read kernel modules
# ✓ CapabilityBoundingSet=~CAP_NET_BIND_SERVICE — wait, we need this one
Handy tool. Reads the unit and tells you what you have locked down.
Reflection
Linux capabilities are an incremental improvement over “root or non-root”, and they are essential for running services that want to do one privileged thing without having all privileges. The five sets interact in ways that are obvious once you have internalized them and confusing until you have. If you are writing hardened systemd units, take twenty minutes to read man 7 capabilities and man 5 systemd.exec. It will save you from a bind: permission denied at 2am.
Related: see my post on systemd timers and clock drift for another “systemd does what the config says, even when that’s not what I meant” story.