Building cert-manager-webhook-spaceship: Notes from the Journey

I wanted cert-manager DNS-01 support for Spaceship DNS, but there was no webhook that fit what I needed. So I built one: cert-manager-webhook-spaceship.

This was my first time publishing a Helm chart and one of my first end-to-end Go projects that I shipped for other people to use. This post is the real journey: what I got wrong, what I fixed, and what finally made it reliable.

Tooling Highlight

cert-manager-webhook-spaceship is the Go webhook that translates cert-manager DNS-01 challenge requests into Spaceship DNS API calls. Source code and chart are in the GitHub repository.

Tooling Highlight

cert-manager handles certificate automation in Kubernetes. The webhook extends it so DNS-01 challenges can be solved through Spaceship DNS. If you are new to cert-manager, start with the official docs.

Tooling Highlight

Helm is how I packaged and published the webhook for repeatable installs, while also keeping a pre-rendered manifest option. Official docs: helm.sh/docs.

Tooling Highlight

Traefik plus Gateway API became the ingress layer I used during migration, which gave this webhook a real-world stress test. Docs: doc.traefik.io.

What I Set Out to Do

I started with three concrete goals:

  • issue wildcard certs with cert-manager
  • use Spaceship as the DNS provider
  • keep the deploy path easy (Helm first, rendered manifest as a fallback)

On paper that sounds straightforward. In practice, the hard part was not writing the webhook server; it was getting all the integration details to agree: cert-manager config, chart defaults, DNS behavior, rollout behavior, and cluster networking.

First Pass: Make the Solver Work

The initial solver implemented the standard DNS-01 lifecycle:

  1. Present challenge TXT record
  2. Wait for cert-manager / ACME validation
  3. Clean up TXT record

Most of the code was typical API integration, but I underestimated how exacting DNS naming has to be. Zone-relative names and trailing dots are easy to get subtly wrong, and the failure mode is frustrating: everything looks mostly correct while challenges keep failing.
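
For context, here is a hedged sketch of the challenge data cert-manager hands the webhook at the Present and CleanUp steps. The field names follow cert-manager's webhook ChallengeRequest type; the domain and key value are illustrative:

```yaml
# Illustrative DNS-01 challenge payload (field names per cert-manager's
# webhook ChallengeRequest; the domain and key are made-up examples).
resolvedFQDN: _acme-challenge.piper.dev.   # full record name, trailing dot included
resolvedZone: piper.dev.                   # enclosing zone, also with a trailing dot
key: "gfj9Xq...Rg85nM"                     # the TXT value the solver must publish
```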

The Chart Work Was More Important Than I Expected

I thought charting would be the easy part. It mostly was, but these details mattered a lot in real usage:

  • chart name consistency (cert-manager-webhook-spaceship)
  • sensible namespace defaults (cert-manager)
  • secret strategy for operators:
    • manual secret (recommended for production)
    • chart-managed secret (secrets.createSecret=true) for quick starts

Supporting both secret modes turned out to be the right call. Some people want everything explicit and externalized; others want one command and done.
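
To make the two modes concrete, here is a hedged values.yaml sketch. Only secrets.createSecret is confirmed by this post; the other key names are illustrative placeholders for whatever the chart actually expects:

```yaml
# Quick start: the chart creates and owns the credentials secret.
# Only secrets.createSecret is confirmed; the other keys are illustrative.
secrets:
  createSecret: true
  apiKey: "<spaceship-api-key>"
  apiSecret: "<spaceship-api-secret>"

# Production alternative: reference a secret you create and manage yourself.
# secrets:
#   createSecret: false
#   existingSecret: spaceship-api-credentials
```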

The Breakages That Actually Taught Me Something

1) groupName mismatch

This one consumed more time than it should have. The webhook group must match between the runtime deployment and the ClusterIssuer/Issuer config. My defaults drifted between the two, and the resulting failures were noisy but never pointed at the mismatch.

I standardized everything on:

```yaml
groupName: acme.spaceship.neoscrib.com
```
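
To see why the match matters, here is a hedged ClusterIssuer sketch. The groupName is the one standardized above; the issuer name, solverName, and config payload are illustrative and depend on the webhook's actual solver registration:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod    # illustrative
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          webhook:
            # Must equal the groupName the webhook runs with,
            # or challenges fail before they ever reach the solver.
            groupName: acme.spaceship.neoscrib.com
            solverName: spaceship   # illustrative
            config: {}              # provider-specific; see the repo
```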

2) TXT record naming edge cases

I fixed normalization around domain/zone handling and trailing dots. Before that fix, challenge records could be created with the wrong name shape.
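
The normalization that fixed it amounts to this mapping: subtract the zone from the FQDN and strip trailing dots before calling a zone-scoped API. The domain is illustrative:

```yaml
# Before: what cert-manager resolves.
resolvedFQDN: _acme-challenge.piper.dev.
resolvedZone: piper.dev.

# After: what a zone-scoped DNS API typically expects.
zone: piper.dev              # trailing dot stripped
recordName: _acme-challenge  # FQDN minus the zone, relative to it
recordType: TXT
```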

3) Secret lifecycle during upgrades

I initially focused on install, then hit upgrade behavior. If the first install creates a secret and later upgrades change flags/values, you can get surprising outcomes unless secret behavior is explicit and documented.
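
The manual mode avoids the problem entirely because the secret never enters Helm's release lifecycle, so upgrades cannot delete or rewrite it. A sketch of such a secret; the name, namespace, and key names are illustrative and should match whatever the chart expects:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: spaceship-api-credentials   # illustrative name
  namespace: cert-manager
type: Opaque
stringData:
  api-key: "<spaceship-api-key>"       # illustrative key names; align them
  api-secret: "<spaceship-api-secret>" # with the chart's expectations
```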

4) Wildcard + apex timing realities

The *.domain + apex case (for example *.piper.dev and piper.dev) is where everything got real. Both names share the same challenge record name, so you get two TXT values on one record, and propagation timing, retries, and backoff all show up quickly.
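
One Certificate covering both names reproduces the situation. Resource names in this sketch are illustrative; the important part is that both dnsNames resolve to the same challenge record:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: piper-dev           # illustrative
  namespace: default
spec:
  secretName: piper-dev-tls
  issuerRef:
    name: letsencrypt-prod  # illustrative issuer name
    kind: ClusterIssuer
  dnsNames:
    - piper.dev       # apex
    - "*.piper.dev"   # wildcard; both challenge _acme-challenge.piper.dev
```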

Publishing Was Its Own Learning Curve

I published the chart through GitHub Pages (gh-pages) and pushed images to Docker Hub. The repeatable flow that worked for me:

  1. push main
  2. package chart (.tgz)
  3. generate/update index.yaml
  4. push gh-pages
  5. push Docker image tag

The process is manual, but it is easy to reason about and easy to recover from when I make a mistake.
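
Step 3 is the least obvious one, so here is a trimmed, illustrative sketch of the index.yaml shape that helm repo index generates; the version, digest, and timestamps are placeholders:

```yaml
apiVersion: v1
entries:
  cert-manager-webhook-spaceship:
    - apiVersion: v2
      name: cert-manager-webhook-spaceship
      version: 1.0.0                   # placeholder chart version
      urls:
        - cert-manager-webhook-spaceship-1.0.0.tgz
      digest: <sha256-of-the-tgz>      # placeholder
      created: "2024-01-01T00:00:00Z"  # placeholder
generated: "2024-01-01T00:00:00Z"      # placeholder
```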

Real-Cluster Validation Is Where Confidence Came From

After deployment, I validated with Let's Encrypt staging first, then production:

  • single-domain certs
  • wildcard + apex together (for example *.piper.dev and piper.dev)
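
The staging-then-production flow is cheap because only the ACME server URL changes between issuers. A hedged sketch (the issuer name is illustrative; the staging URL is Let's Encrypt's documented endpoint):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging   # illustrative
spec:
  acme:
    # Swap in https://acme-v02.api.letsencrypt.org/directory for production
    # once staging issuance works end to end.
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - dns01:
          webhook:
            groupName: acme.spaceship.neoscrib.com
            solverName: spaceship   # illustrative, as above
            config: {}
```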

I also validated it while migrating services from ingress-nginx to Traefik with the Gateway API. That migration forced me to test certificates, routes, middleware allowlists, and external backends in a realistic setup instead of a toy example.
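
For completeness, this is roughly where the issued certificate lands on the Gateway API side: the listener terminates TLS with the secret cert-manager writes. Resource names here are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: traefik-gateway      # illustrative
  namespace: default
spec:
  gatewayClassName: traefik  # illustrative class name
  listeners:
    - name: https
      hostname: "*.piper.dev"
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - name: piper-dev-tls  # the Certificate's secretName from above
```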

What I Would Do Early Next Time

  • start with explicit, shared defaults for groupName
  • document secret modes early
  • validate with staging ACME before production
  • test wildcard + apex early
  • treat Helm packaging and docs as first-class, not cleanup work

Closing

The webhook is now doing exactly what I wanted: cert-manager can issue and renew certificates through Spaceship DNS with a repeatable install path.

The biggest lesson is that shipping infrastructure code is mostly integration work. The core Go code matters, but the real reliability comes from defaults, docs, upgrade behavior, and real-cluster testing.