http.Client tuning for a flaky upstream
One of our services at a previous job had to talk to a third-party API that was, to put it delicately, bad at being an API. Connections got dropped mid-response. Their load balancer occasionally sent us to a node that had forgotten our session. The TLS handshake was slow because they re-issued certs every week without warning. We had to make our http.Client robust against all of this.
The default http.Client in Go is fine for happy-path web browsing. It’s not fine for production-grade backend communication. Here’s the checklist I run through.
Timeouts, all of them.
http.Client.Timeout is the one everyone knows. It’s a whole-request timeout — from dial to response body fully read. But the underlying Transport has finer-grained timeouts that are often more useful:
tr := &http.Transport{
    DialContext: (&net.Dialer{
        Timeout:   5 * time.Second, // TCP connect
        KeepAlive: 30 * time.Second,
    }).DialContext,
    TLSHandshakeTimeout:   5 * time.Second,
    ResponseHeaderTimeout: 10 * time.Second, // server sends nothing
    ExpectContinueTimeout: 1 * time.Second,
    IdleConnTimeout:       90 * time.Second,
    MaxIdleConns:          100,
    MaxIdleConnsPerHost:   10,
    MaxConnsPerHost:       50,
}

client := &http.Client{
    Transport: tr,
    Timeout:   30 * time.Second, // overall ceiling
}
Key ones:
DialContext.Timeout: how long connect(2) can take. Without this, a non-responsive server can hold the dial forever.
TLSHandshakeTimeout: bounds the TLS handshake separately.
ResponseHeaderTimeout: after the request body is sent, how long before we get status and headers. This is the one that catches “server accepted my connection and then forgot about it.”
client.Timeout: the overall ceiling, including reading the response body.
For streaming responses where you WANT a long body read, don’t set client.Timeout — rely on the context you pass to the request: build it with http.NewRequestWithContext (or req = req.WithContext(ctx)) and let the caller cancel.
Connection pooling.
The default transport has MaxIdleConnsPerHost = 2. Two. For a service doing 10,000 req/s to one upstream, two idle connections is almost nothing: anything beyond two concurrent requests ends up on a connection that gets closed instead of pooled, so subsequent requests pay fresh TCP (and TLS) handshakes. Bump it:
tr.MaxIdleConnsPerHost = 100
Rule of thumb: set it to roughly your peak concurrent requests to that host. At high request rates, every reused connection skips a TCP handshake and usually a TLS handshake too, often tens of milliseconds saved per request.
MaxConnsPerHost caps total connections (idle + active) to one host. Useful as a backpressure mechanism — if your upstream can only handle 50 concurrent, setting this to 50 queues new requests in Go rather than flooding the upstream.
Explicit Close: true for one-offs.
If you’re making a single request and don’t want to pool the connection, set req.Close = true. The connection is closed after the response, not returned to the pool. For a proxy service or a one-shot health check, this can avoid subtle pooling issues.
Reading the response body.
Every response body MUST be read to EOF and closed, even if you don’t care about the contents. Otherwise, the connection can’t be returned to the pool and counts against your limits.
resp, err := client.Do(req)
if err != nil {
    return nil, err
}
defer resp.Body.Close()
if resp.StatusCode >= 400 {
    io.Copy(io.Discard, resp.Body) // drain so the conn pools
    return nil, fmt.Errorf("status %d", resp.StatusCode)
}
return io.ReadAll(io.LimitReader(resp.Body, maxBody))
The number of Go services I’ve debugged where someone returned early on a 4xx without draining is embarrassing. Each undrained response forces its connection to be torn down instead of reused, so dial rates creep up, and if Close is skipped too, file descriptors leak until something falls over.
Retries, carefully.
Retry GET, HEAD, and OPTIONS freely. Retry POST/PUT only if the server indicated via a 5xx that it didn’t process the request, AND you’re sure the operation is idempotent. Include backoff with jitter:
func retryDo(client *http.Client, req *http.Request, maxAttempts int) (*http.Response, error) {
    var resp *http.Response
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if attempt > 0 {
            // quadratic backoff plus up to 100ms of jitter (math/rand)
            d := time.Duration(attempt*attempt) * 100 * time.Millisecond
            d += time.Duration(rand.Intn(100)) * time.Millisecond
            select {
            case <-time.After(d):
            case <-req.Context().Done():
                return nil, req.Context().Err()
            }
        }
        // NOTE: if req has a body, it must be rebuilt for each attempt (see GetBody below).
        resp, err = client.Do(req)
        if err == nil && resp.StatusCode < 500 {
            return resp, nil
        }
        if err == nil {
            err = fmt.Errorf("status %d", resp.StatusCode)
            io.Copy(io.Discard, resp.Body) // drain so the conn pools
            resp.Body.Close()
        }
    }
    return nil, err
}
Watch out: http.Request has a Body, which is an io.ReadCloser. Once the body has been sent, retrying requires a fresh copy. For small bodies, buffer them and rewind; for large ones, reconsider retrying.
The http.Request.GetBody field is designed for this: set it to a function that returns a fresh io.ReadCloser. Go’s own transport uses it when it replays a request after a dropped idle connection and when following redirects; your retry loop can call it the same way.
TLS session resumption.
If your upstream is slow at TLS handshakes, session resumption helps a lot. The session cache lives in the http.Transport’s TLS configuration, so resumption only works across connections made by the same transport. If you’re creating a new transport per request (which you shouldn’t be; share transports), you get no resumption benefit, and no connection reuse at all.
// BAD
func fetch(url string) (*http.Response, error) {
    client := &http.Client{Transport: &http.Transport{}} // fresh transport, fresh empty pool, every call!
    return client.Get(url)
}

// GOOD
var defaultClient = &http.Client{Transport: myConfiguredTransport}

func fetch(url string) (*http.Response, error) {
    return defaultClient.Get(url)
}
Observe everything.
Hook httptrace.ClientTrace into your request context if you want detailed timing:
trace := &httptrace.ClientTrace{
    GotConn: func(info httptrace.GotConnInfo) {
        log.Printf("conn: reused=%v, idle=%v, idletime=%v", info.Reused, info.WasIdle, info.IdleTime)
    },
    TLSHandshakeDone: func(state tls.ConnectionState, err error) {
        log.Printf("TLS done: %v", err)
    },
}
ctx = httptrace.WithClientTrace(ctx, trace)
I leave this off in production but turn it on for debugging. It tells you whether connections are being reused, how long each phase took, etc.
Graceful degradation.
When an upstream is flapping, you often don’t want to cascade failures. Consider a circuit breaker (sony/gobreaker is fine) and a hedged request strategy for latency-sensitive GETs. Neither is built into net/http.
Finally, my most important advice: share the transport. One *http.Transport per process, used by however many *http.Clients you need (since http.Client is lightweight, you can have many). Don’t ever construct a fresh transport in a hot path. The pool is tied to the transport.
Most of this lives in an httputil package in my monorepo. I set it up once and it saves me a thousand paper cuts.