
PATTERN Cited by 1 source

Startup-time fail-fast on config non-compliance

Shape

When a service is configured to run under a compliance or security policy that requires its deployment environment to hold a specific external state (an OS-level flag set, a kernel feature enabled, a validated library present), the service validates that precondition at startup and refuses to start if it fails. The canonical statement, quoted verbatim from Redpanda's description of its FIPS implementation: "Redpanda will log an error and exit if the underlying operating system isn't properly configured." (Source: sources/2025-05-20-redpanda-implementing-fips-compliance-in-redpanda)

startup:
  if policy = compliance-required:
    check OS state
    check validated modules present + self-tested
    check config coherent (mode=enabled + required paths + OS flag)
    if any check fails:
      log error → exit non-zero
      (do NOT continue in degraded / non-compliant mode)
  else:
    proceed normally
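The shape above can be sketched in Python. This is a minimal illustration under stated assumptions, not any product's implementation: the names `Check`, `run_startup_checks`, and `start_or_exit` are hypothetical, and each check is assumed to be a zero-argument callable that returns an error string, or `None` when it passes.

```python
import logging
import sys
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical type: (name, check), where check returns an error message or None.
Check = Tuple[str, Callable[[], Optional[str]]]

log = logging.getLogger("startup")


def run_startup_checks(checks: Sequence[Check]) -> List[str]:
    """Run every check and collect all failures, so one log pass shows them all."""
    failures = []
    for name, check in checks:
        error = check()
        if error is not None:
            failures.append(f"{name}: {error}")
    return failures


def start_or_exit(compliance_required: bool, checks: Sequence[Check]) -> None:
    """Exit non-zero on any failed check; never fall back to a degraded mode."""
    if not compliance_required:
        return  # policy not in force: proceed normally
    failures = run_startup_checks(checks)
    for failure in failures:
        log.error("startup precondition failed: %s", failure)
    if failures:
        sys.exit(1)
```

Collecting every failure before exiting is a design choice of this sketch: an operator sees all broken preconditions in one restart instead of discovering them one at a time.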

The key property: no silent downgrade. If the service claims to run in compliance mode but the environment can't support it, the service is hard-down rather than silently non-compliant. This is structurally distinct from logging-then-enforcement rollouts: regulated workloads have no warn-only regime, so the deployment either meets the boundary or it doesn't.

Why hard-fail, not soft-degrade

For regulatory and security-boundary use cases, a service that is running and believed compliant, but is not, is worse than a service that is not running at all:

  • Audit falsifiability. A cluster labelled compliant that silently falls back to non-approved primitives (for example, after a transient fault in the OS kernel's crypto state) breaks the compliance guarantee without surfacing the failure. A hard-fail pages the on-call engineer; a soft-fail produces an audit-time surprise.
  • Blast-radius scoping. A startup-time failure is contained to one broker (or node, or cluster) that refuses to start. A runtime compliance breach in a live system may already have processed sensitive data while non-compliant.
  • Trust asymmetry. The cost of a false negative (non-compliant, running, and believed compliant) is catastrophically higher than the cost of a false positive (capable of compliance but refusing to start). The pattern accepts more false positives in order to rule out the catastrophic false negative.

Canonical instance: Redpanda FIPS startup

Redpanda broker startup path when fips_mode: enabled:

  1. Read fips_mode from redpanda.yaml.
  2. Load the FIPS OpenSSL module from openssl_module_directory and run the power-on self-tests of the validated cryptographic module.
  3. Check OS-level FIPS state (on RHEL: read from /proc/sys/crypto/fips_enabled or equivalent).
  4. If any check fails → log error → exit non-zero → the systemd restart loop surfaces the failure to monitoring and paging.
  5. Otherwise → bind ports and begin serving Kafka protocol.
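Steps 2 and 3 can be approximated in Python as follows. This is a hedged sketch, not Redpanda's code: `check_os_fips` and `check_module_directory` are hypothetical names, the module check only confirms that the configured directory exists (loading the module and running its self-tests is out of scope), and the flag file is assumed to contain `1` when FIPS mode is on, as it does on Linux kernels that expose it.

```python
from pathlib import Path
from typing import Optional

# Path named in the text for RHEL; other systems may expose the flag elsewhere.
FIPS_FLAG_PATH = Path("/proc/sys/crypto/fips_enabled")


def check_os_fips(flag_path: Path = FIPS_FLAG_PATH) -> Optional[str]:
    """Return an error message if the OS FIPS flag is unreadable or not '1'."""
    try:
        value = flag_path.read_text().strip()
    except OSError as exc:
        return f"cannot read {flag_path}: {exc}"
    if value != "1":
        return f"{flag_path} is {value!r}, expected '1'"
    return None


def check_module_directory(module_dir: Path) -> Optional[str]:
    """Return an error message if the configured module directory is absent.

    Presence only: this sketch does not load the module or run self-tests.
    """
    if not module_dir.is_dir():
        return f"module directory {module_dir} does not exist"
    return None
```

An unreadable flag file is treated as a failure rather than as "FIPS off, carry on", which keeps the sketch consistent with the no-silent-downgrade property.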

The broker refuses to hold a half-valid state — the config asserts FIPS compliance; any component that can't satisfy the assertion is a startup failure, not a degradation.

The three-state fips_mode dial lets operators pick between:

  • disabled — no startup check fires.
  • enabled — startup check fires; fails fast on OS misconfiguration.
  • permissive — partial startup check fires (module-layer only); OS layer skipped, non-production.
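The three states can be read as a mapping from mode to the set of check layers that run at startup. The sketch below assumes hypothetical layer names (`module`, `os`) and a hypothetical `checks_for` helper; only the three mode values come from the text.

```python
from enum import Enum
from typing import FrozenSet


class FipsMode(Enum):
    DISABLED = "disabled"
    ENABLED = "enabled"
    PERMISSIVE = "permissive"


# Hypothetical check-layer names, used only for this illustration.
MODULE_LAYER = "module"
OS_LAYER = "os"

_CHECKS_BY_MODE = {
    FipsMode.DISABLED: frozenset(),
    FipsMode.ENABLED: frozenset({MODULE_LAYER, OS_LAYER}),
    FipsMode.PERMISSIVE: frozenset({MODULE_LAYER}),
}


def checks_for(mode_value: str) -> FrozenSet[str]:
    """Map a configured mode string to the check layers that run at startup.

    An unrecognised value raises ValueError instead of defaulting to a weaker mode.
    """
    return _CHECKS_BY_MODE[FipsMode(mode_value)]
```

Rejecting an unknown mode string, rather than treating it as `disabled`, applies the same fail-fast idea to the configuration value itself.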

When to apply the pattern

  • Regulatory / compliance boundaries (FIPS, HIPAA, PCI-DSS, digital-sovereignty regimes) that require the runtime to hold a specific state.
  • Security gates where a broken check produces a confidentiality / integrity violation rather than a latency regression (e.g., signing key present, TLS cert valid, tenant isolation capability enabled).
  • Capacity / resource preconditions where partial resource availability is more dangerous than unavailability (e.g., some data-directory filesystems present but not all — serving with half the partitions is worse than not serving).
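The data-directory example in the last bullet can be sketched as an all-or-nothing precondition. The function names are hypothetical, and the sketch assumes the required directories are known from configuration before startup.

```python
from pathlib import Path
from typing import Iterable, List


def missing_data_dirs(required: Iterable[Path]) -> List[Path]:
    """Return every required path that is not present as a directory."""
    return [path for path in required if not path.is_dir()]


def require_all_data_dirs(required: Iterable[Path]) -> None:
    """Raise if any directory is missing; partial availability counts as failure."""
    missing = missing_data_dirs(required)
    if missing:
        names = ", ".join(str(path) for path in missing)
        raise RuntimeError(f"refusing to start, missing data directories: {names}")
```

A caller would let the exception propagate out of the startup path so the process exits non-zero instead of serving a subset of its data.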

When NOT to apply

  • Non-regulatory capacity degradation. Don't fail-fast a serving cluster when a cache warm-up is slow or a non-critical sidecar is unavailable; prefer graceful degradation.
  • Soft compliance claims where a warn-only audit regime is the product. Use logging-then-enforcement instead.
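  • Bootstrap scenarios where the precondition is expected to be absent on the first-ever start (e.g., secrets not yet provisioned). Gate the check on an explicit environment signal that distinguishes a first start from a broken deployment, not on the bare existence of the resource.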

Composition

  • With FIPS mode tri-state: the enabled state activates startup fail-fast; the permissive state bypasses it partially (OS layer skipped) and the disabled state bypasses it entirely.
  • With validated cryptographic modules: the module's own power-on self-tests compose into the broker's startup check — a failed module test propagates to a failed broker startup.
  • With process supervisors (systemd, runit, the Kubernetes container restart policy): the non-zero exit triggers restart loops, which eventually surface as CrashLoopBackOff or a failed systemd unit, visible in monitoring.

Seen in
