Runtime dependency on SaaS provider¶
Definition¶
A runtime dependency on a SaaS provider exists whenever a system needs a third-party hosted service to be available during normal operations — not just at build, deploy, or provisioning time. When that SaaS is down, any code path reaching it is down; when the SaaS has a transitive dependency chain of its own, the caller inherits the entire chain as part of its own reliability envelope.
This is the availability dependency concept applied specifically to externally-operated SaaS rather than first-party services: the caller pays the same arithmetic cost, but with two amplifiers — the caller doesn't own the fix path, and the SaaS's own dependency chain is usually opaque.
The PlanetScale 2025-10-20 example¶
Verbatim from the 2025-10-20 incident post:
The service responsible for creating, resizing, and configuring database branches, which is hosted in AWS us-east-1, was unavailable. It depends on our internal secret-distribution service which depends on Amazon S3 which depends on AWS STS which was impacted by the Amazon DynamoDB outage.
Chain depth: four transitive hops from PlanetScale's control plane to the thing that actually broke (DynamoDB, via an upstream DNS misconfiguration).
new-branch/resize/config service
└─ internal secret-distribution service
└─ Amazon S3
└─ AWS STS
└─ Amazon DynamoDB ← originating failure
Any one hop up the chain going dark would have produced the same observable symptom at PlanetScale's API.
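To make the arithmetic cost concrete, here is a minimal sketch in Python with made-up per-hop availability figures (the incident post publishes no such numbers): a synchronous chain can be no more available than the product of every hop it must traverse.

```python
# Hypothetical per-hop availabilities -- illustrative only, not figures from
# the incident report. A synchronous call path is capped by the product of
# the availabilities of every hop it depends on.
chain = {
    "branch create/resize/config service": 0.999,
    "internal secret-distribution service": 0.999,
    "Amazon S3": 0.999,
    "AWS STS": 0.999,
    "Amazon DynamoDB": 0.999,
}

ceiling = 1.0
for hop, availability in chain.items():
    ceiling *= availability

print(f"availability ceiling: {ceiling:.5f}")                     # ~0.99501
print(f"implied downtime:     {(1 - ceiling) * 8760:.1f} h/year")  # ~43.7 h
```

Five nominally three-nines hops already imply roughly 44 hours of downtime a year, before accounting for correlated failures; see concepts/availability-multiplication-of-dependencies for the general treatment.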
PlanetScale's own remediation commitment (verbatim):
We are taking steps to better understand and become resilient to the failure modes of SaaS we depend on, including for CI/CD, SSO, Web application hosting and incident communication. We are investigating more ambitious ways to reduce our runtime dependence on both internal and AWS services.
The list is load-bearing: CI/CD + SSO + web application hosting + incident communication are four distinct SaaS surfaces, each of which showed impact during the incident.
The two amplifiers of SaaS runtime dependency¶
1. The caller doesn't own the fix path¶
With a first-party dependency, the caller can roll back, patch, bump a timeout, add a retry, change a limit — the fix space is owned. With a SaaS dependency, the caller can only wait. Every production minute during the outage is exposure without agency.
2. The transitive chain is opaque¶
Most SaaS providers don't publish their own runtime dependency graph. A caller planning around "SaaS provider X at 99.9%" implicitly assumes X's upstream chain has independent failure modes from the caller's other dependencies. In practice, the same hyperscaler / region / DNS provider / identity provider shows up in multiple supposedly-independent SaaS dependencies, and a shared-upstream incident takes them all down together.
PlanetScale's 2025-10-20 incident showed this directly: the status page (one SaaS), the dashboard (a different SaaS), and the SSO login flow (a third SaaS) were all impacted at the same time because each sat on the same us-east-1 failure surface.
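A small sketch of why the independence assumption misleads, using invented probabilities purely to show the shape of the error:

```python
# Hypothetical outage probabilities for three SaaS dependencies -- the point
# is the shape of the error, not the specific numbers.
p_outage = 0.001  # assume each SaaS is individually "down" 0.1% of the time

# Naive model: the three SaaS dependencies fail independently, so a
# simultaneous outage of all three looks vanishingly rare.
p_all_down_independent = p_outage ** 3          # 1.0e-09

# Shared-upstream model: all three sit on the same region, so most of each
# provider's outage budget is the *same* event. If that shared region
# accounts for, say, 80% of each provider's downtime, one regional incident
# takes out all three at once.
p_shared_region_down = 0.0008
p_all_down_correlated = p_shared_region_down    # one event, three victims

print(f"independent-failure estimate: {p_all_down_independent:.1e}")  # 1.0e-09
print(f"shared-upstream estimate:     {p_all_down_correlated:.1e}")   # 8.0e-04
```

With a shared upstream, the naive estimate of "all three down together" is off by several orders of magnitude.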
Four surfaces where SaaS runtime dependency typically appears¶
PlanetScale's post-mortem names these explicitly as the four surfaces they are investigating:
- CI/CD — build + deploy pipelines. If the CI SaaS is down during an incident, emergency patches can't ship.
- SSO / IdP — identity federation. If the IdP is down, operators can't log in to fix things and customers can't log in to observe status. "PlanetScale customers using SSO were unable to login if they weren't already."
- Web application hosting — dashboard, customer portal, support UIs. "It's hosted by a provider that, like the PlanetScale control plane, is hosted in AWS us-east-1."
- Incident communication / status page — the channel customers use to learn what's happening. "Finally, during this phase we were unable to post updates to https://planetscalestatus.com, though even if we were the site itself was unavailable for at least half an hour."
The recursive irony: the status page meant to tell customers about an incident was itself hit by the same incident.
Architectural responses¶
The response pattern is the same as for first-party availability dependencies, with tighter discipline because the caller lacks fix authority:
- Remove SaaS from the hot path. If a SaaS call happens on every request, its availability caps the caller's. Move it to an async fan-out where possible (see patterns/transactional-outbox).
- Cache last-known-good responses. Serve from cache during SaaS outages; refresh opportunistically when the SaaS returns (a minimal sketch follows this list).
- Provider-diversify critical surfaces. A status page hosted on a different cloud than the product has a chance of surviving a single-region outage. Incident-communication playbooks that expect the primary status page to always be up are fragile.
- Ask SaaS providers for their dependency graph. Due diligence is often limited to "what's your uptime?" — the load-bearing question is "whose uptime are you transitively dependent on?"
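As an illustration of the cache-last-known-good item above, here is a minimal wrapper sketch. All names (fetch_feature_flags, _CACHE) are hypothetical placeholders, and the refresh policy is an assumption, not anything the PlanetScale post describes.

```python
import time

# Last-known-good store for a single SaaS-backed lookup. A real system would
# persist this and track staleness per key; the sketch keeps it in-process.
_CACHE = {"value": None, "fetched_at": 0.0}

def fetch_feature_flags() -> dict:
    """Stand-in for the actual SaaS call, made with a short client-side timeout."""
    raise NotImplementedError

def get_feature_flags() -> dict:
    """Refresh from the SaaS when it answers; serve last-known-good when it doesn't."""
    try:
        flags = fetch_feature_flags()
        _CACHE["value"] = flags
        _CACHE["fetched_at"] = time.monotonic()
        return flags
    except Exception:
        if _CACHE["value"] is not None:
            # SaaS outage: degrade to stale-but-available instead of failing the request.
            return _CACHE["value"]
        raise  # no last-known-good yet -- the dependency is still hard
```

The wrapper turns a hard runtime dependency into a soft one for reads: an outage degrades freshness rather than availability, the same outcome-shape described in concepts/control-plane-impact-without-data-plane-impact.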
Seen in¶
- sources/2025-11-03-planetscale-aws-us-east-1-incident-2025-10-20 — PlanetScale, Richard Crowley, 2025-11-03. Canonical wiki entry. Phase 1 of the 2025-10-20 incident cascades through PlanetScale → internal secret-distribution → S3 → STS → DynamoDB; phase-1 side effects hit the PlanetScale dashboard (SaaS), SSO logins (SaaS), and the status page https://planetscalestatus.com (SaaS) simultaneously. Remediation commitment names CI/CD + SSO + web application hosting + incident communication as the four SaaS surfaces being hardened.
Related¶
- concepts/availability-dependency — the broader qualitative framing; this concept is the SaaS specialisation.
- concepts/availability-multiplication-of-dependencies — the quantitative cost: each transitive hop multiplies into the availability ceiling.
- concepts/control-plane-data-plane-separation — the design lever that bounds SaaS-dependency impact to the control plane.
- concepts/control-plane-impact-without-data-plane-impact — the successful outcome-shape when the data plane has cached what it needs and doesn't call back into the SaaS-dependent control plane.
- concepts/blast-radius — what SaaS-dependency amplifies through opaque transitive chains.
- systems/aws-s3, systems/aws-sts, systems/dynamodb — the AWS links in the four-hop transitive chain from the 2025-10-20 incident.