How We Debug 1000s of Databases with AI at Databricks¶
Summary¶
Databricks built an internal AI agent platform (Storex) that unifies database investigation across a fleet of thousands of database instances spanning every major cloud, hundreds of regions, and eight regulatory domains. Before the platform, a MySQL incident required an on-call engineer to stitch together Grafana metrics, Databricks dashboards, CLI SHOW ENGINE INNODB STATUS snapshots, and cloud-console slow-query logs — juggling four tools to form one hypothesis. Storex consolidates the data sources behind one central-first sharded architecture (global coordinator + regional shards keeping sensitive data local) with fine-grained access control, then layers an AI chat agent on top that retrieves metrics/logs, correlates signals, and guides engineers to a next step. The agent framework is DsPy-inspired — tools are defined as Scala classes + docstrings, the LLM infers input/output format, and prompts are decoupled from tool implementation so engineers iterate fast. Correctness is protected by a snapshot-replay validation framework that captures production state and replays it through the agent with a judge LLM scoring responses. The architecture explicitly supports specialized agents per domain (one for system/DB issues, another for client-side traffic patterns, etc.) that collaborate on root-cause analysis. Claimed impact: investigation time cut up to 90%, new-hire time-to-first-DB-investigation dropped to <5 minutes.
Key takeaways¶
- Tool fragmentation is the precondition for agent value. At Databricks, engineers jumped between Grafana, internal dashboards, the SHOW ENGINE INNODB STATUS CLI, and cloud consoles to download slow-query logs. Each tool worked in isolation, but no unified workflow existed. The AI win came not from smarter models but from first unifying the data+workflow substrate the agent reasons over. (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
- v1 = static SOP workflow, and it failed. The first iteration codified the debugging runbook as a deterministic agentic workflow. Engineers rejected it — they wanted a diagnostic report with immediate insights, not an automated checklist. Anomaly detection (v2) surfaced signals but still lacked next-step guidance. The breakthrough was a chat assistant that codifies debugging knowledge and supports follow-ups — making investigation interactive rather than a pipeline. (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
- Central-first sharded architecture is the AI integration precondition. Operating thousands of DBs across 3 clouds × hundreds of regions × 8 regulatory domains means the agent would face context fragmentation + unclear governance boundaries + slow iteration loops if it had to speak to every regional API directly. Databricks built a global Storex coordinator that fronts regional shards, keeping sensitive data local/compliant while presenting one interface. (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
- Fine-grained access control at team, resource, and RPC levels is non-negotiable. "Without centralized authorization and policy enforcement, ensuring the agent (and engineers) stay within the right permissions becomes difficult." The agent reuses the same permissions model as engineers — not a separate agent-only permission surface. (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
- Decouple prompts from tools (DsPy-inspired). Engineers define a tool as a normal Scala class + function signature + a short docstring. The LLM infers input format, output structure, and interpretation from the docstring. Prompts can be swapped without touching tool code; tools can be added/removed without restructuring prompts. The underlying infrastructure handles parsing, LLM connections, and conversation state once. Result: fast iteration on both axes independently. (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
- Snapshot-replay + judge-LLM is the regression harness for non-deterministic agents. "How do we prove the agent is getting better without introducing regressions?" — capture snapshots of production state, replay them through the agent under different prompt/tool configurations, score responses with a separate judge LLM for accuracy and helpfulness. This replaces flaky end-to-end tests as the primary correctness signal during agent iteration. (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
- Specialized agents beat one mega-agent. Once the tool-decoupled framework is in place, spinning up domain-specific agents (system+DB issues, client-side traffic patterns, ...) is cheap. Each builds deep expertise in its area while collaborating on end-to-end root-cause analysis. Contrast: a single general agent carrying every tool in context suffers from tool-selection noise. (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
- Hackathon → platform is a valid on-ramp for internal AI tooling. Two-day hackathon prototype unified a few core DB metrics + dashboards into one view. Unpolished but immediately improved basic investigations — enough to unlock org investment. Guiding principle that followed: "move fast and stay customer-obsessed" (direct interviews + shadowing on-call engineers to find real pain). (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
- Three debugging anti-patterns surfaced by on-call shadowing. (a) Fragmented tooling — dashboards, CLIs, manual steps all isolated. (b) Time wasted gathering context — "what changed, what's normal, who has context" dominates over actual mitigation. (c) Unclear guidance on safe mitigation — engineers default to long investigations rather than risk the wrong action. Postmortems rarely surface these — they look like "missing tools" but are really a missing intelligence layer. (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
- Mindset shift: from technical architecture to Critical User Journeys (CUJs). The most meaningful impact wasn't "less toil" or "faster onboarding" — it was refocusing infra teams around the journeys engineers take through the system, not the boxes on an architecture diagram. Platforms get built for the engineer experience, not around internal service topology. (Source: sources/2025-12-03-databricks-ai-agent-debug-databases)
Architecture¶
Before (pre-AI):
- Grafana for metrics.
- Databricks internal dashboards for client workload.
- CLI-driven SHOW ENGINE INNODB STATUS for InnoDB internals (transactions, I/O, deadlocks).
- Cloud console log-in + manual slow-query-log download.
- No cross-tool correlation; senior engineers stitched hypotheses by hand, juniors didn't know where to start.
After (Storex platform):
- Foundation layer — central-first sharded.
- Global Storex coordinator presents one interface for engineers and agents.
- Regional shards hold sensitive data; global routes requests to the right region without replicating regulated data out.
- Integrated with existing infra services (metrics, logs, dashboards, CLI-equivalents) so abstractions are consistent across 3 clouds.
- Fine-grained access control at team / resource / RPC levels — one permission model for humans and agents.
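The coordinator-plus-shards pattern and the shared permission model can be sketched together: a global coordinator resolves which regional shard owns an instance, enforces the same RPC-level policy check for engineer and agent principals, and only ever ships query results (never raw regulated data) out of the region. This is an illustrative sketch in Python, not Storex's actual API; all names here are invented.

```python
from dataclasses import dataclass, field

@dataclass
class RegionalShard:
    """Illustrative regional shard: its region's (possibly regulated) data stays local."""
    region: str
    instances: dict = field(default_factory=dict)  # instance_id -> local diagnostic data

    def fetch_metrics(self, instance_id: str) -> dict:
        # Data leaves the shard only as a query result routed by the coordinator.
        return self.instances.get(instance_id, {})

@dataclass
class Coordinator:
    """Global coordinator: one interface for engineers and agents across all regions."""
    shards: dict       # region -> RegionalShard
    placements: dict   # instance_id -> owning region
    permissions: set   # (principal, rpc, instance_id) grants; same model for humans and agents

    def fetch_metrics(self, principal: str, instance_id: str) -> dict:
        # RPC-level access check applies identically to an engineer or an agent principal.
        if (principal, "fetch_metrics", instance_id) not in self.permissions:
            raise PermissionError(f"{principal} may not fetch_metrics on {instance_id}")
        region = self.placements[instance_id]  # resolve the owning region
        return self.shards[region].fetch_metrics(instance_id)
```

The key property: the agent gets no separate permission surface. Revoking a grant revokes it for humans and agents alike.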
- Tool layer — DsPy-inspired decoupling.
- Tools defined as Scala classes with normal function signatures + short docstring.
- LLM reads docstring to infer input schema, output structure, and result interpretation.
- Framework (inspired by MLflow prompt-optimization tech + DsPy) owns parsing, LLM connection management, and conversation state.
- Swap prompts without touching tools. Swap tools without touching prompts. No infrastructure rewrite per change.
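The decoupling idea can be shown in miniature: a tool is an ordinary function, and the framework derives the LLM-facing spec (name, input schema, description) from the signature and docstring by introspection, so prompts are built from specs and never embed implementations. The post describes Scala classes; this is a hedged Python analogue with invented names.

```python
import inspect
from typing import Callable

def tool_spec(fn: Callable) -> dict:
    """Derive an LLM-facing tool description from an ordinary function.

    The signature gives the input schema; the docstring tells the model what the
    tool does and how to read its output. The implementation never reaches a prompt.
    """
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "params": {name: (p.annotation.__name__
                          if p.annotation is not inspect.Parameter.empty else "any")
                   for name, p in sig.parameters.items()},
        "description": inspect.getdoc(fn) or "",
    }

def innodb_status(instance_id: str) -> str:
    """Return the raw SHOW ENGINE INNODB STATUS output for one instance.

    Interpretation hint for the model: check the TRANSACTIONS and LATEST
    DETECTED DEADLOCK sections first.
    """
    return f"=== INNODB STATUS for {instance_id} ==="  # stub implementation

# Prompts are assembled from specs, so swapping a prompt never touches tool
# code, and adding a tool is just registering another plain function.
TOOLS = {spec["name"]: spec for spec in map(tool_spec, [innodb_status])}
```

Adding a second tool means writing one more annotated function; no prompt restructuring, no parser changes.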
- Agent loop (Storex reasoning loop).
- Agent receives natural-language query ("what's the load source on workspace X?").
- Iteratively decides which tool to call next; accumulates context across calls.
- Returns a diagnostic report with correlated signals, cause/effect narrative, and recommended next steps — not just raw data.
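The loop itself is simple in outline: decide, call, accumulate, repeat, then emit a report rather than raw tool output. A minimal sketch, assuming an `llm_decide` callback that stands in for the model's tool-selection step (the function shapes are assumptions, not Storex's interfaces):

```python
def run_agent(llm_decide, tools: dict, question: str, max_steps: int = 8) -> dict:
    """Minimal reasoning loop of the shape the post describes.

    llm_decide(question, context) returns either ("call", tool_name, args) or
    ("report", diagnostic_dict). Context accumulates across tool calls so later
    decisions can correlate earlier signals.
    """
    context = []
    for _ in range(max_steps):
        action = llm_decide(question, context)
        if action[0] == "report":
            return action[1]  # diagnostic report: correlated signals + next steps
        _, name, args = action
        context.append({"tool": name, "args": args, "result": tools[name](**args)})
    return {"summary": "step budget exhausted", "evidence": context}
```

A scripted `llm_decide` is also what makes the loop unit-testable without a live model, which connects directly to the snapshot-replay harness below.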
- Validation harness.
- Capture production-state snapshots (inputs, tool responses, final state).
- Replay snapshots through candidate agent configurations.
- Judge LLM scores responses on accuracy + helpfulness.
- Cite: Databricks MLflow 3 "LLM judges" docs referenced in the post.
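The harness can be sketched as: a snapshot pins the question and the recorded tool responses, replay answers every tool call from the recording so only the agent's reasoning varies between runs, and a judge function scores the final report. All function shapes here are assumptions for illustration, not MLflow's or Storex's actual API.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """Captured production state: the question plus recorded tool I/O."""
    question: str
    tool_responses: dict  # (tool_name, frozen_args) -> recorded result

def replay(snapshot: Snapshot, candidate_agent, judge) -> float:
    """Replay one snapshot through a candidate prompt/tool configuration.

    Tool calls are answered from the recording, so the tool side is
    deterministic; a separate judge LLM scores the report for accuracy
    and helpfulness (here: any callable returning a rubric score).
    """
    def recorded_tool(name, **args):
        return snapshot.tool_responses[(name, tuple(sorted(args.items())))]
    report = candidate_agent(snapshot.question, recorded_tool)
    return judge(snapshot.question, report)

def compare(snapshots, candidates, judge) -> dict:
    """Mean judge score per candidate config across the snapshot corpus."""
    return {cid: sum(replay(s, agent, judge) for s in snapshots) / len(snapshots)
            for cid, agent in candidates.items()}
```

This is the regression signal during iteration: a prompt or tool change ships only if its corpus-wide judge score does not drop.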
- Multi-agent composition.
- Specialized agents per domain (system/DB, client-side traffic, ...).
- Compose on root-cause analyses; each brings deep domain expertise.
- Paves the way for extending beyond databases into other infra ops (restores, production queries, config updates).
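The composition step can be sketched as a fan-out/merge: each domain agent investigates with only its own tools, and a coordinator merges their findings into one root-cause analysis, instead of one mega-agent carrying every tool in context. A hypothetical sketch; the specialists here are stand-in callables, not Storex's agents.

```python
def compose_root_cause(question: str, specialists: dict) -> dict:
    """Fan the question out to per-domain agents and merge their findings.

    specialists: domain name -> callable(question) -> finding string (or "" if
    that domain sees nothing relevant). Each specialist carries only its own
    tool set, which keeps tool-selection noise low per agent.
    """
    findings = {domain: agent(question) for domain, agent in specialists.items()}
    return {
        "question": question,
        "findings": findings,
        "root_cause": " + ".join(f for f in findings.values() if f),
    }
```

The merge shown is naive string-joining; in practice the post implies another reasoning pass reconciles the specialists' findings, but the division of labor is the point.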
Deliberate exclusions / future work: the post names "AI-assisted production operations" (restores, prod queries, config updates) as the next phase once data + context + guardrails are unified — i.e. mutating actions are still human-driven at article time.
Operational numbers¶
- Fleet: "thousands of database instances" across "hundreds of cloud regions", "every major cloud" (3 clouds), "eight regulatory domains".
- Latency impact: up to 90% reduction in time engineers spend on investigation steps.
- Onboarding impact: new hires with zero context can start a DB investigation in under 5 minutes — previously "nearly impossible".
- Journey length: hackathon (2 days) → v1 static SOP (rejected) → v2 anomaly detection (partial) → v3 chat assistant (breakthrough); multi-iteration, user-feedback-driven.
Systems / concepts / patterns extracted¶
- systems/storex — Databricks' internal AI-agent platform for database debugging; central-first sharded architecture + DsPy-inspired tool framework + snapshot-replay validation.
- systems/dspy — open-source framework for programmatic LLM-prompt construction; inspiration for Storex's prompt/tool decoupling.
- systems/mlflow — Databricks-originated ML lifecycle platform; source of the prompt-optimization tech and "LLM judges" primitive Storex validates against.
- systems/grafana — open-source metrics dashboard; one of the four fragmented tools Storex subsumes.
- systems/mysql — the primary database whose investigation workflow drove Storex's v1 scope.
- concepts/llm-as-judge — pattern of using a separate LLM to score another LLM's output on a rubric; primary regression signal for non-deterministic agents.
- concepts/central-first-sharded-architecture — global coordinator + regional shards, keeping sensitive/regulated data local while presenting one interface.
- concepts/observability — updated with the "agent-assisted debugging" layer above metrics/logs/traces.
- patterns/tool-decoupled-agent-framework — define tools as ordinary code + docstrings, let the LLM infer I/O from signatures; decouples prompt iteration from tool iteration.
- patterns/snapshot-replay-agent-evaluation — capture production-state snapshots, replay through candidate agent configs, judge-LLM-score outputs.
- patterns/specialized-agent-decomposition — per-domain agents collaborating on a root-cause analysis, vs. one mega-agent carrying every tool.
- patterns/hackathon-to-platform — small unpolished prototype proves value → org invests → platform follows user feedback, not a multi-quarter spec.
Caveats¶
- Vendor-authored narrative. Databricks publishes MLflow + DsPy + Databricks Data Intelligence Platform as products; the post explicitly ties Storex's internal framework to those product primitives. Treat claims of novelty carefully — the pattern of "DsPy-style tool decoupling + judge LLMs" is not exclusive to Databricks.
- "Up to 90% reduction" is qualified ("individual steps that once required switching..."). No underlying distribution, no before/after SLO math, no control group. Plausible but uncalibrated.
- "Under 5 minutes to jump-start a DB investigation" is a testimonial-grade claim, not a measured TTFB-equivalent metric.
- Mutating actions (restores, production queries, config updates) are explicitly future work. The article describes a read-heavy debugging agent; the safety story for write-path agent actions is deferred.
- No latency numbers for the agent loop itself (how long from NL question → diagnostic report). Everything is reported as engineer-time saved, not agent compute cost.
- Tier-3 source. Ingested because the post does cover architectural substance (central-first sharding, tool/prompt decoupling, snapshot-replay eval) despite the heavy marketing framing around "Databricks Data Intelligence Platform".