Sources

Per-article summaries of ingested engineering blog posts. Most recent first.

542 pages

Adaptive write request scheduling in Redpanda's Cloud Topics — Redpanda's Cloud Topics write data directly to S3 as "Level Zero" (L0) objects before acknowledging produce requests. The write-request scheduler dynamically adjusts upload…
Build your own vulnerability harness — Cloudflare publishes a detailed practical guide to building a model-agnostic, fleet-wide vulnerability scanning harness — the architecture behind their Vulnerability Discovery…
Long Horizon: How Atlassian Built a Reasoning Engine for Complex AI Tasks — Atlassian replaced Rovo Chat's hierarchical multi-agent orchestrator (the "Hybrid Orchestrator" — a coordinator dispatching to per-product sub-agents like JiraAgent,…
Cloudflare — Bringing more agent harnesses and frameworks to Cloudflare — Cloudflare engineering post (2026-06-17) articulating the three-layer agent platform stack — framework → harness → runtime/platform…
How Dropbox uses MCP and Dash to close the design-to-code security gap — Dropbox's security team built a system that automatically retrieves relevant threat models during code review and evaluates whether code changes align with the security…
Enabling Evolutionary Database Development: Database branching with Lakebase, the conclusion (Part 3) — Part 3 of Databricks' three-part series on Evolutionary Database Development scales the playbook from a single developer (Part 1) and a single developer's expanded practices (Part…
Scaling Security Insights: how we achieved a 10x increase in global scanning capacity — Cloudflare's Security Insights team needed to scale their account/zone scanning throughput from 10 scans/second to 100+ scans/second to enable automatic scanning for all free…
Ingesting the Milky Way: Petabyte-Scale with Zerobus Ingest — Databricks discloses the internal architecture of Zerobus Ingest — their fully managed, serverless, push-based streaming ingestion service that writes directly into Delta tables…
Metric Semantic Layer: How Lyft Governs and Scales Key Data Definitions
AI Serving Platform That Adapts to Your Model — Databricks describes the architecture of Custom Model Serving — their fully managed real-time inference platform for any model packaged in MLflow.…
Route public traffic to private applications with Cloudflare — Cloudflare launches Application Services for Private Origins (closed beta, Enterprise), extending its security, performance, and programmability stack (WAF, bot management,…
Architecting Scalable ML Platforms: The Integrated Infrastructure and Acceleration Behind Rovo — Atlassian describes the architecture of ML Studio, their enterprise-scale ML platform that standardizes modular development, centralizes workflow orchestration,…
Cloud Topics: the Metastore — This post from the Redpanda engineering blog (part of the Cloud Topics series) describes the metastore — a custom-built, Raft-replicated key-value store that serves as the metadata…
Defend against frontier cyber models: Cloudflare's architecture as customer zero — Cloudflare publishes its own internal security architecture as a reference for defending against frontier AI cyber models (models that can discover vulnerabilities,…
Scaling beyond one: How Airbnb evolved its data architecture for a multi-product world — Published 2026-06-09; created 2026-06-11; updated 2026-06-11; tags: data-architecture, data-modeling, offline-data-warehouse, multi-product, schema-evolution,…
Building Agents that Don't Break Themselves — A practical guide from Fly.io demonstrating how to architect AI agent systems so the agent loop (brain) lives on stable infrastructure while all risky command execution (hands)…
Enabling Evolutionary Database Development: database branching with Lakebase, continued (Part 2) — Part 2 of Databricks' three-part series on Evolutionary Database Development revisits Martin Fowler's 2003 seven practices twenty years later,…
Netflix — Dynamically Splitting Wide Partitions in Cassandra for Time Series Workloads — Published 2026-06-03; fetched 2026-06-03; authors: Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk, Kartik Sathyanarayanan; tags: netflix, cassandra,…
Lights Out, Systems On: Validating Instant Power Loss Readiness — Meta introduces Instantaneous PowerLoss Storm, a new testing paradigm within Meta's long-established Disaster Readiness (DR) "Storm" program that validates the infrastructure's…
Apache Spark Real-Time Mode for Gaming: A Better Way to Do Real-Time Sessionization — Databricks presents a real-world gaming sessionization pipeline built with Apache Spark Structured Streaming's Real-Time Mode and the transformWithState operator.…
Enforcing the First AS in BGP AS_PATHs — Cloudflare examines a wave of BGP route hijacks (flagged by Spamhaus) where attackers forged complete AS_PATHs—omitting their own ASN entirely—to impersonate legitimate origins…
Redpanda — How OmniNode uses Redpanda to scale AI agent workflows — A Redpanda Blog guest post (2026-06-02) by Jonah Gray, founder and CEO of OmniNode, on the migration of OmniNode's multi-agent coordination bus from Redis Streams to Redpanda…
Instacart — Semantic IDs: Product Understanding at Scale — Tags: instacart, semantic-id, sid, rq-vae, codebook, contrastive-loss, taxonomy-supervision, hierarchical-batch-sampling, esci, gemma, gemini-flash,…
Instacart — From Scoring to Spelling: Rebuilding Ads Retrieval at Instacart — Tags: instacart, ads, retrieval, candidate-generation, generative-retrieval, semantic-id, rq-vae, beam-search, tiger, contextual-recommendations, bert,…
AWS — Automating contract intelligence with Doczy.ai on AWS — AWS Architecture Blog post (2026-06-02) co-authored with AArete — a global management and technology consulting firm specialising in healthcare…
Debunking 8 data layout myths: why Liquid Clustering outperforms partitioning — Databricks blog post (Tier-3, 2026-06-01) that debunks eight persistent myths about Liquid Clustering vs. Hive-style partitioning on Lakehouse tables,…
Cloudflare — How we reduced core unit boot time from hours to minutes — Cloudflare engineering post (2026-06-01) by the OpenBMC team narrating a fleet-wide regression in core-server boot time — following a routine firmware update,…
Atlassian (2026-06-01) — How We Cut Up to 80% of Engineering Chores Using AI Agents in… — A first-party engineering post from the Jira team describing how they cut up to 80% of engineering time spent on KTLO ("keeping the lights on") chores by using AI agents inside…
Netflix — High-Throughput Graph Abstraction at Netflix: Part I — Published 2026-05-29; fetched 2026-05-30T14:00:43+00:00; tier 1; tags: netflix, graph-abstraction, graph-database, oltp-graph, property-graph, namespace,…
Netflix — From Silos to Service Topology: Why Netflix Built a Real-Time Service Map — Published 2026-05-29; fetched 2026-05-30T14:00:46+00:00; tier 1; tags: netflix, service-topology, service-dependency-graph, real-time-service-map,…
Databricks — Enabling Evolutionary Database Development: database branching with Lakebase — A Tier-3 Databricks Engineering post (Part 1 of a three-part series on Evolutionary Database Development) that frames Lakebase's copy-on-write database branching as the substrate…
Databricks — Databricks at SIGMOD 2026 — A short corporate-blog announcement (Tier-3 source) that nevertheless discloses the first publicly named architecture of Databricks' incremental-view-maintenance engine — Enzyme…
Slack AI: The Path to Multi-Cloud — Three-year retrospective from the Slack AI infrastructure team on evolving the LLM serving substrate behind Slack AI from a single-region SageMaker deployment in early 2023…
Google Research — A New Era of Innovation: Google Research at I/O 2026 — Google Research's I/O 2026 roundup post is a multi-thread position summary of how Google Research's foundational work flows into Gemini and consumer surfaces.…
Databricks — Advancing Apache Iceberg on Databricks: Iceberg v3 GA, Open Sharing, and Unified Governance — A Databricks Blog post (2026-05-28, Tier 3) announcing a coordinated set of Apache Iceberg capability releases — the bulk reaching General Availability…
Cloudflare — How we built Cloudflare's data platform and an AI agent on top of it — Cloudflare engineering post (2026-05-28) describing the two in-house systems they built to consolidate analytics access across the company: Town Lake…
Yelp — Beyond the Menu Tree: How Yelp Built a Smarter Customer Success Chatbot with AI — Yelp Engineering post (2026-05-27) by the Customer & Sales Intelligence Team disclosing the architecture of the LLM-Assisted Customer Success (CS) Chatbot that replaced Yelp's…
Stripe — Expanding Stripe Radar to protect more of your business — A Stripe Sessions feature-roundup announcing the "biggest expansion ever" of Stripe Radar, Stripe's AI-powered fraud prevention engine.…
Redpanda — Redpanda SQL is GA: the query engine that skips the pipeline — A Redpanda Blog launch post (2026-05-27) announcing the General Availability of Redpanda SQL — a Postgres-protocol query engine that runs inside the customer's Redpanda BYOC…
Google Research — Private analytics via zero-trust aggregation — Google Research announces a production private-analytics architecture that composes a new lattice-based secure-aggregation cryptographic protocol with a TEE inside Google's…
Databricks — Reliable LLM Inference at Scale — Databricks' inference platform engineering team (Marius Seritan, Cyrielle Simeone, Andy Zhang, Yu Zhang, Nick Lanham) discloses the production architecture behind a multi-tenant…
Databricks — How the lakebase architecture stays resilient to cloud failures — Tier-3 Databricks reliability post (Jasraj Dange, Hans Norheim, Stas Kelvich, John Spray; published 2026-05-27) that lays out lakebase's reliability roadmap by reframing…
Databricks — Building a FHIR-native health data platform on Databricks Lakebase — A Databricks Blog post (2026-05-27, Tier 3) co-positioning Health Samurai (vendor of the Aidbox FHIR Server and Database) and Lakebase (Databricks' serverless Postgres)…
Databricks — BI Serving Pointers: Maximizing for Performance and TCO — A Databricks Engineering walkthrough of the BI-serving stack on the Databricks Lakehouse Platform, framed bottom-up across four layers — physical storage,…
Cloudflare — Iran's Internet is partially restored, Cloudflare Radar data shows — A short Cloudflare Radar update post published 2026-05-27 reporting that Iran's nationwide Internet shutdown — which began February 28,…
Meta — SilverTorch: Index as Model, a new retrieval paradigm for recommendation systems — Meta's Recommendation Systems team describes SilverTorch, a fully-rebuilt GPU-native retrieval substrate that replaces the traditional retrieval-stage microservice mesh…
Databricks — Scaling for MHHS: how Octopus Energy achieved a 50x cost reduction in margin data engineering — The UK's Market-wide Half-Hourly Settlement (MHHS) regulation forces every supplier to move from two meter reads per month → 48 reads per day…
Databricks — Observability for any agent, anywhere: Production-ready tracing with OpenTelemetry & Unity Catalog — Databricks ships OTel-format trace ingestion direct to Unity Catalog Delta tables, decoupling agent instrumentation from storage so production traces become a first-class lakehouse…
Databricks — How World Bank Group uses Databricks to eradicate poverty through shared knowledge — Databricks Blog customer-success post (2026-05-22) on the World Bank Group's unified data + AI knowledge platform built on Databricks. Tier-3 vendor-blog source.…
Databricks — Accelerating LLM Inference with Prompt Caching for Open-Source Models on Databricks — A short Databricks Blog post (2026-05-22, Tier 3) announcing that implicit prompt caching is now generally available for open-weights models served on the Foundation Model APIs…
Atlassian — From Ambiguous Questions to Action: Research Mode in Rovo Dev CLI — A 2026-05-22 Atlassian Engineering blog post introducing Research Mode — a structured multi-agent investigation workflow inside the Rovo Dev CLI designed for "questions…
Yelp — How Partition Access Visualizations Reduced our Data Lake S3 Cost by 33% — Yelp Engineering post (2026-05-21) by the data-platform team disclosing the partition access visualization technique they built on top of the Yelp S3 SAL pipeline (canonicalised…
Pinterest — Making User-Sequence Data More Cost-Efficient, Faster, and Easier to Use — Published 2026-05-21; fetched 2026-05-24; ingested 2026-05-24; created 2026-05-24; updated 2026-05-24; tags: pinterest, user-sequences, ml-platform,…
Databricks — Databricks for Good and Virtue Foundation: Partnering to Connect Medical Volunteers to Critical Health Services in 72 Countries — Databricks Blog (Databricks-for-Good arm) co-marketing post (2026-05-20) documenting the production-grade rebuild of Virtue Foundation's VF Match platform…
Databricks — Unlock seamless and cost-effective marketing campaigns with Lakebase — A Databricks Blog post (2026-05-20, Tier 3) that frames the bursty marketing-campaign workload as the canonical fit for Lakebase's serverless OLTP economics,…
Databricks — Governing AI agents at scale with Unity Catalog — A 2026-05-20 Databricks Blog vision post extending Databricks' coding-agent-governance playbook (the 2026-04-17 Unity AI Gateway launch…
AWS Architecture Blog — Cyber resilience on AWS: A reference approach for recovery from ransomware and destructive events — A 2026-05-20 AWS Architecture Blog post that lays out a complete cyber-resilience reference architecture for recovering AWS workloads after ransomware, data extortion,…
Redpanda — Cloud Topics: Level Zero garbage collection — Redpanda's 2026-05-19 post is Part 1 of 2 on how Cloud Topics decides when an L0 object — the temporary, mixed-partition object-storage file produced by the Cloud Topics write path…
Databricks — How to Build Real-Time Fraud Detection using Spark Real-Time Mode and Lakebase — Databricks Engineering / Solution Accelerator launch post (2026-05-19) that ships an open-source reference implementation of an end-to-end real-time card-fraud-detection system…
Databricks — How Deutsche Börse built a generative AI tool to tackle the large-scale migration of Zeppelin notebooks to Databricks — A 2026-05-19 Databricks customer-blog post co-authored with Deutsche Börse Group (Frankfurt-headquartered financial-market-infrastructure operator;…
Cloudflare — Announcing Claude Managed Agents on Cloudflare — Cloudflare launch post (2026-05-19) co-announced with Anthropic for Claude Managed Agents — a new Anthropic platform that runs the agent control loop on Anthropic's infrastructure…
AWS — How Synthesia optimizes generative AI video inference on Amazon EC2 G7e instances — AWS Architecture Blog post (2026-05-19) co-authored with Synthesia Research Engineering describing a video-decoding optimisation technique…
Airbnb — Scaling Airbnb's identity graph with a unified knowledge graph infrastructure — Tags: graph-database, identity-graph, knowledge-graph, janusgraph, dynamodb, opensearch, gremlin, tinkerpop, multi-tenant-graph-platform, trust-and-safety,…
Cloudflare — Project Glasswing: what Mythos showed us — Cloudflare engineering writeup (2026-05-18) of several months of running Mythos Preview — Anthropic's cyber frontier model preview,…
Databricks — Backstage with Lakebase (Part 2: Governance) — Thoughtworks Part 2 of a three-part series (Part 1: Deployment Cycles, Part 2: Governance, Part 3: FinOps) on running Backstage (Spotify's state-heavy Internal Developer Portal)…
Instacart — Scaling Personalized Marketing for Multi-Tenant Commerce Platforms — Published 2026-05-14; authors: Brent Scheibelhut, Ryan Martin, Shradha Menon
From latency to instant: Modernizing GitHub Issues navigation performance (GitHub Engineering, 2026-05-14) — GitHub Engineering's 2026-05-14 retrospective from Alexander on the GitHub Issues team on a multi-quarter perf rewrite of the issues#show route…
Databricks — Expanded interoperability with Unity Catalog Open APIs — Databricks Blog post (2026-05-14) announcing two coordinated milestones for Unity Catalog's Open APIs: External Access to Managed Tables in Beta (external engines like Apache…
Cloudflare — Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse — Cloudflare engineering post (2026-05-14) on a year-long investigation into a hidden bottleneck in ClickHouse's query planner that emerged after Cloudflare extended the partitioning…
Atlassian — Optimisation Tools for Jira: Reducing Configuration Bloat and Enhancing Performance — Atlassian Engineering (Atlassian Blog, 2026-05-14) describes the Jira Cloud Optimisation Tools — a set of admin experiences, an async workflow-driven reporting framework,…
Databricks — The Rosetta Stone of CPS: Inside Claroty's AI-Powered Library — Databricks Blog co-marketing post (2026-05-13) describing how Claroty's AI-Powered CPS Library — the asset-identity layer for Claroty's xDome CPS-protection platform…
Databricks — Clinical operations intelligence belongs on the Lakehouse — Databricks Blog post (2026-05-13) announcing the open-source release of the Site Feasibility Workbench as a fully open-source Databricks App…
Databricks — ABAC row filtering and column masking, governed tags, and data classification GA in Unity Catalog — Databricks Blog post (2026-05-13) announcing General Availability for three Unity Catalog governance capabilities that were in preview during Q1–Q2 2026: Attribute-Based Access…
Cloudflare — Browser Run: now running on Cloudflare Containers, it's faster and more scalable — Cloudflare engineering post (2026-05-13) on the migration of Browser Run (rebranded from Browser Rendering on 2026-04-16) off shared Browser Isolation (BISO) infrastructure onto…
AWS — Streaming CloudWatch metrics to VPC-based OpenTelemetry collectors using Lambda — AWS Architecture Blog post (2026-05-13) describing a customer solution that bridges Amazon CloudWatch metrics into a self-hosted OpenTelemetry collector running inside…
Airbnb — Viaduct 1.0 and the future of Airbnb's data mesh — Raw source: raw/airbnb/2026-05-13-viaduct-10-and-the-future-of-airbnbs-data-mesh-79468758.md; tags: airbnb, viaduct, graphql, data-mesh, data-oriented-service-mesh,…
Meta — Migrating Data Ingestion Systems at Meta Scale — A 2026-05-12 Meta Engineering Data Infrastructure post describing how Meta successfully migrated 100% of its data ingestion workload off a legacy customer-owned-pipelines…
Cloudflare — When idle isn't idle: how a Linux kernel optimization became a QUIC bug — Cloudflare engineering post (2026-05-12) on a subtle bug in quiche — Cloudflare's open-source Rust QUIC / HTTP/3 implementation…
AWS — Building hybrid multi-tenant architecture for stateful services on AWS — AWS Architecture Blog post (2026-05-12) by a team running a stateful ad-serving platform at millions of requests per second and billions of dollars in annual advertising revenue.…
MongoDB — Fighting Tool Sprawl: The Case for AI Tool Registries — A 2026-05-11 MongoDB Engineering / Technical blog opinion piece arguing that every enterprise running AI agents at any non-trivial scale needs its own internal tool registry…
Databricks — Unlocking the Archives: Turning Unstructured Documents into a Searchable Database for Groundwater Discovery — Databricks for Good partnered with MapAid (a Stanford-founded nonprofit) and the Sudan Association for Archiving Knowledge (SUDAAK) to turn ~700 scanned PDFs/TIFFs/JPGs (>5,000…
AWS — Choosing between single or multiple organizations in AWS Organizations — Short AWS Architecture Blog decision-framework post (2026-05-11) that distills a cloud-migration advisor's customer conversations into an explicit rubric for when to run a single…
Pinterest — Enhancing Ad Relevance: Integrating Real-Time Context into Sequential Recommender Models — Published 2026-05-08; ingested 2026-05-21; created 2026-05-21; updated 2026-05-21; tags: pinterest, ads, candidate-generation, retrieval, two-tower,…
Netflix — Scaling ArchUnit with Nebula ArchRules — Published 2026-05-08; fetched 2026-05-09; ingested: true; tags: netflix, jvm, gradle, archunit, nebula, architectural-rules, fitness-function, polyrepo,…
Databricks — Pushing the Frontier for Data Agents with Genie — Databricks Engineering post (2026-05-08) describing the architectural techniques behind Genie — Databricks' state-of-the-art data agent for answering complex questions over…
Databricks — How Superhuman and Databricks built a 200K QPS inference platform together — Databricks Engineering post (2026-05-08) co-authored by Databricks Model Serving and Superhuman engineering. Documents the joint migration of Superhuman's grammar-correction model…
Databricks — How Lakebase architecture delivers 5x faster Postgres writes — Databricks Engineering post (2026-05-07) on the Lakebase / Neon team eliminating the Full Page Write tax in Postgres by moving full page image generation out of the compute's WAL…
Cloudflare — How Cloudflare responded to the Copy Fail Linux vulnerability — On 2026-04-29 16:00 UTC, CVE-2026-31431 — a Linux kernel local-privilege-escalation vulnerability named "Copy Fail" — was publicly disclosed by Xint Code.…
Databricks — Rethinking Distributed Systems for Serverless Performance and Reliability — Databricks Engineering post (2026-05-06) laying out the three architectural systems that make truly serverless Apache Spark work under the stated design thesis "stability becomes…
Cloudflare — When DNSSEC goes wrong: how we responded to the .de TLD outage — On 2026-05-05 ~19:30 UTC, DENIC — the registry operator for the .de country-code top-level domain — started publishing incorrect DNSSEC signatures for the .de zone during…
From SSH to REST: A Security-Driven Modernization of Slack's EMR Data Pipelines — Slack's data platform was built around 2017 with a simple orchestration pattern: Airflow would run jobs on EMR clusters by SSH-ing into the cluster's master node and executing…
Redpanda — Little's Law in practice with Cloud Topics — Redpanda's 2026-05-05 post is the production-tuning sequel to the 2026-03-30 Cloud Topics architecture deep-dive. Where the architecture post described the L0-file /…
Databricks — 10 trillion samples a day: Scaling beyond traditional monitoring infra — Databricks' monitoring infrastructure (previously built for an order-of-magnitude lower scale) has more than tripled in the last year to 5 billion active in-memory timeseries…
Airbnb — Monitoring reliably at scale — Raw source: raw/airbnb/2026-05-05-monitoring-reliably-at-scale-07d3d0c6.md; tags: airbnb, observability, monitoring, reliability, circular-dependency,…
Netflix — Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph — Published 2026-05-04; fetched 2026-05-05; ingested: true; tags: netflix, ml-platform, ml-metadata, model-lifecycle, model-registry, knowledge-graph, lineage,…
Instacart — Empowering Carrot Ads with Domain Adaptive Learning — Published 2026-05-04; authors: Trey Zhong, Xiyu Wang
Pinterest — Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer — Published 2026-05-01; ingested 2026-05-02; created 2026-05-02; updated 2026-05-02; tags: pinterest, ml-serving, root-leaf-architecture, network-bottleneck,…
Netflix — State of Routing in Model Serving — Published 2026-05-01; fetched 2026-05-02; ingested: true; tags: netflix, ml-platform, model-serving, model-inference, routing, switchboard, lightbulb, envoy,…
Meta — How Meta Is Strengthening End-to-End Encrypted Backups — A 2026-05-01 Meta Engineering Security post announcing two infrastructure upgrades to WhatsApp's and Messenger's HSM-based Backup Key Vault…
Grafana — Faster fixes, less context sharing: how Grafana Assistant learns your infrastructure before you even ask — Grafana Labs' launch post for infrastructure memory in Grafana Assistant — a zero-configuration background capability that runs a "swarm of AI agents" across a Grafana Cloud…
Cloudflare — Introducing Dynamic Workflows: durable execution that follows the tenant — Cloudflare launches cloudflare/dynamic-workflows — a small (~300 LOC) MIT-licensed TypeScript library that bridges Cloudflare Workflows (durable execution) with Dynamic Workers…
Code Orange: Fail Small is complete. The result is a stronger Cloudflare network — On 2026-05-01, Cloudflare announced the completion of Code Orange: Fail Small — the ~6-month ("two and a bit quarters") organisation-wide engineering-resiliency programme launched…
Databricks — Backstage with Lakebase (Part 1: Deployment Cycles) — Thoughtworks ran a proof-of-concept ripping Backstage (Spotify's state-heavy Internal Developer Portal) off its standard Postgres database and pointing it at Databricks Lakebase…
Cloudflare — Post-quantum encryption for Cloudflare IPsec is generally available — Cloudflare announces general availability of post-quantum encryption for Cloudflare IPsec, implementing the IETF draft draft-ietf-ipsecme-ikev2-mlkem which specifies hybrid ML-KEM…
Cloudflare — Agents can now create Cloudflare accounts, buy domains, and deploy — A joint launch post from Cloudflare and Stripe (2026-04-30) announcing a new agent-provisioning protocol co-designed between the two companies.…
Stripe — Giving agents the ability to pay — On 2026-04-29, Stripe announced Link's wallet for agents, built on top of a new product called Stripe Issuing for agents.…
Grafana — Get observability in the terminal, for you and your agents, with the gcx CLI tool — Grafana Labs' launch post for gcx — a command-line tool that exposes Grafana Cloud's observability lifecycle (instrumentation, alerts, SLOs, synthetic checks,…
Databricks — Companies Winning with AI Built the Data Layer First — Tier-3 Databricks Blog customer-interview post (2026-04-29). Aly McGue interviews Stephen Ecker, CDO of Trinity Industries…
Databricks — Approximate Answers, Exact Decisions: New Sketch Functions for Analytics — Databricks product-engineering post (2026-04-29) announcing four new sketch function families in Databricks SQL / DataFrame / Structured Streaming — built on Apache DataSketches…
Databricks — Databricks and Stripe Projects: Infrastructure Built for Agents — A short joint launch post from Databricks (co-bylined by Brad Van Vugt + Guillaume Rivals, 2026-04-29) announcing that Databricks is a launch partner for Stripe Projects…
Atlassian — Inside Atlassian's Merge Queues: How we ship faster with fewer incidents — Atlassian's Bitbucket team publishes the first-party architecture + production-results post for Bitbucket Merge Queues, their pre-merge validation queue for Bitbucket Cloud.…
Expedia — Expedia's Service Telemetry Analyzer — Published 2026-04-28; tags: expedia, llm, observability, incident-response, rca, fastapi, celery, redis, datadog, prompt-chaining, token-budget, kubernetes,…
Cloudflare — Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions — Cloudflare Radar's quarterly review of observed and confirmed Q1 2026 Internet disruptions, as published by the Radar team on 2026-04-28.…
Airbnb — Skipper: Building Airbnb's embedded workflow engine — Raw source: raw/airbnb/2026-04-28-skipper-building-airbnbs-embedded-workflow-engine-29f5842a.md; tags: airbnb, skipper, workflow-engine, durable-execution,…
Pinterest — From Clicks to Conversions: Architecting Shopping Conversion Candidate Generation — Published 2026-04-27; ingested 2026-05-02; created 2026-05-02; updated 2026-05-02; tags: pinterest, ads, shopping-ads, candidate-generation, retrieval,…
Inside one of the first production deployments of Lakebase: LangGuard's agentic workflow governance engine — Databricks publishes a case study of LangGuard, one of the first startups building its production governance engine on Lakebase — Databricks' serverless Postgres.…
Deloitte optimizes EKS environment provisioning and achieves 89% faster testing environments using Amazon EKS and vCluster — AWS Architecture Blog customer case study (2026-04-27) in which Deloitte — a global professional-services organisation — describes how they eliminated a 30–45-minute-per-cluster…
Netflix — Scaling Camera File Processing at Netflix — Authors: Eric Reinecke, Bhanu Srikanth; tier 1; created 2026-04-26; updated 2026-04-26; tags: netflix, media-production, mps, camera-files, flapi, filmlight,…
Atlassian — Rovo Dev Driven Development: How we built a platform in 4 weeks — A first-person engineering post from the builder of Fireworks — Atlassian's Firecracker-microVM orchestrator on Kubernetes,…
Lyft — How We Built a Smarter Pickup Experience for Gated Communities — Raw source: raw/lyft/2026-04-23-how-we-built-a-smarter-pickup-experience-for-gated-communiti-a5e10272.md; tags: lyft, mapping, pickup, gated-community, routing,…
AWS — Modernizing KYC with AWS serverless solutions and agentic AI for financial services — AWS Architecture Blog reference-architecture post (Jayanth Kolli, Andrew Black — IBM + AWS, 2026-04-23) laying out a cloud-native, event-driven,…
Yelp — How Yelp Keeps Server-Driven UI Consistent Across Four Platforms — Follow-up to Yelp's 2025-07-08 CHAOS backend deep-dive, this post unpacks Konbini — the auto-generated library family that bridges CHAOS (Yelp's SDUI framework) to Cookbook,…
Grafana — Introducing Pyroscope 2.0: faster, more cost-effective continuous profiling at scale — Grafana Labs launch post (2026-04-22) for Pyroscope 2.0, a ground-up rearchitecture of their open-source continuous-profiling database.…
Grafana — Grafana Labs acquires Logline to accelerate needle-in-the-haystack log queries — Grafana Labs acquisition announcement (2026-04-22, GrafanaCON 2026) disclosing that Grafana Labs has acquired Logline, a startup founded by Jason Nochlin (previously CEO…
Databricks — Stop Hand-Coding Change Data Capture Pipelines — Databricks product-engineering post (2026-04-22) arguing that CDC and SCD pipelines — foundational to downstream analytics tables…
Databricks — Multimodal Data Integration: Production Architectures for Healthcare AI — Databricks' healthcare-industry blog post (2026-04-22) argues that the usual blocker for multimodal AI in clinical settings is not model sophistication but data architecture…
Databricks — Are LLM agents good at join order optimization? — Databricks (with UPenn collaborators) ran an experiment applying a frontier LLM agent to the decades-old join-ordering problem in relational query optimizers.…
Cloudflare — Making Rust Workers reliable: panic and abort recovery in wasm-bindgen — Cloudflare describes how Rust Workers — Rust code compiled to WebAssembly and run inside Cloudflare Workers via wasm-bindgen…
AWS Architecture Blog — PACIFIC enables multi-tenant, sovereign product carbon footprint exchange on the Catena-X data space using AWS — PACIFIC — a joint product of BASF and CircularTree, certified on the Catena-X Automotive Network — is a multi-tenant SaaS that lets automotive-supply-chain companies exchange…
All Things Distributed: The invisible engineering behind Lambda's network — Werner Vogels tells the decade-long story of the AWS Lambda networking team — a silent infrastructure retrofit on the jet-in-flight scale: converting Lambda's network topology…
Vercel — We Ralph Wiggum'd WebStreams to make them 10x faster — Vercel engineering post (2026-04-21) discloses fast-webstreams, an experimental npm package that reimplements the WHATWG Web Streams API (ReadableStream / WritableStream /…
Vercel — Preventing the stampede: Request collapsing in the Vercel CDN — Vercel's CDN launched request collapsing as a default behaviour for every ISR route on every deployment. When a cache entry expires (or was never written) and many requests arrive…
Vercel — Making Turborepo 96% faster with agents, sandboxes, and humans — Anthony Shew's 2026-04-21 Vercel engineering post documents an eight-day performance campaign that improved Turborepo's task-graph construction time by 81-91 % on Vercel's internal…
Vercel — Making agent-friendly pages with content negotiation — Vercel's 2026-04-21 engineering post documenting their production implementation of HTTP markdown content negotiation across vercel.com/blog and vercel.com/changelog,…
Vercel — Inside Workflow DevKit: How framework integrations work — Vercel's 2026-04-21 engineering post explains the integration pattern behind the Workflow Development Kit (WDK) — how one workflow-definition artefact (code with "use workflow" /…
Vercel — How we made global routing faster with Bloom filters — Vercel's global routing service — the single-threaded front door that decides whether to serve, rewrite, or 404 every incoming request to every deployment…
Vercel — Chat SDK brings agents to your users — Vercel's launch post for Chat SDK, a TypeScript library for building chat bots that run on Slack, Microsoft Teams, Google Chat, Discord, Telegram, GitHub, Linear,…
Vercel — Bun runtime on Vercel Functions — Vercel's 2026-04-21 post announces Bun as a runtime option for Vercel Functions in public beta, alongside the pre-existing Node.js runtime.…
Vercel — Build knowledge agents without embeddings — Vercel's 2026-04-21 launch post for the open-source Knowledge Agent Template — a production-ready knowledge-agent architecture that replaces the vector-database / chunking /…
Vercel — BotID Deep Analysis catches a sophisticated bot network in real-time — Vercel's 2026-04-21 post is a production-incident narrative describing a single 10-minute window on October 29 at 9:44 am,…
Redpanda — Me and my shadow (link!): Disaster recovery replication made easy — Redpanda (unsigned, 2026-04-21) publishes the mechanism + performance + reciprocal-architecture deep-dive on Shadow Linking…
PlanetScale — The state of online schema migrations in MySQL — Shlomi Noach's 2024-07-23 PlanetScale post is a taxonomic survey of the three mechanism classes available for running non-blocking ALTER TABLE against a live production MySQL…
PlanetScale — The MySQL adaptive hash index — Ben Dicken (PlanetScale, 2024-04-24, re-fetched 2026-04-21) publishes a pedagogical primer on MySQL's Adaptive Hash Index (AHI)…
PlanetScale — Scaling Postgres connections with PgBouncer — Ben Dicken (PlanetScale, 2026-03-13) publishes a canonical field-manual on PgBouncer configuration tuning for Postgres, grounded in the OS-substrate cost of Postgres's…
PlanetScale — Profiling memory usage in MySQL — Ben Dicken (PlanetScale) canonicalises native MySQL memory profiling via performance_schema's memory-instrumentation tables.…
PlanetScale — Postgres High Availability with CDC — Sam Lambert (PlanetScale CEO, 2025-09-12, re-fetched 2026-04-21) argues that Postgres's logical replication design makes high-availability (HA) and CDC operationally coupled…
PlanetScale — Larger than RAM Vector Indexes for Relational Databases — Vicent Martí (PlanetScale) presents the engineering-level design of PlanetScale's production vector index inside MySQL / InnoDB, written after two years of development.…
PlanetScale — Instant deploy requests — Shlomi Noach (PlanetScale, originally 2024-09-04, re-fetched 2026-04-21) announces instant deployments on eligible PlanetScale deploy requests: schema changes whose every statement…
PlanetScale — Increase IOPS and throughput with sharding — Ben Dicken (PlanetScale, originally 2024-08-19, re-fetched 2026-04-21) publishes a pricing-pedagogical post canonicalising IOPS and throughput as first-class database-sizing…
PlanetScale — Identifying and profiling problematic MySQL queries — Ben Dicken (PlanetScale, 2024-03-29) publishes a pedagogical field manual for native MySQL query diagnosis: how to use performance_schema + sys tables to identify which queries…
PlanetScale — Graceful degradation in Postgres — Ben Dicken's 2026-03-31 post reframes PlanetScale Traffic Control (already canonicalised via the 2026-04-11 Keeping a Postgres queue healthy post) from a mixed-workload contention…
PlanetScale — Faster PlanetScale Postgres connections with Cloudflare Hyperdrive — A demo-app narrative post from PlanetScale (Simeon Griggs, 2026-02-19) walking through the architectural decisions behind a real-time prediction-market demo built on PlanetScale…
PlanetScale — Faster backups with sharding — Ben Dicken (PlanetScale, 2024-07-30) canonicalises PlanetScale's production backup architecture for sharded MySQL databases and the shard-parallel backup property…
PlanetScale — Dealing with large tables — Ben Dicken (PlanetScale, 2024-07-10) publishes the canonical three-rung scaling ladder for a single fast-growing table: vertical scaling → vertical sharding → horizontal sharding.…
PlanetScale — Database sharding — Ben Dicken (PlanetScale, 2025-01-09) publishes an interactive primer on database sharding — a pedagogical post but architecturally substantive: it canonicalises the four production…
PlanetScale — Consensus algorithms at scale: Part 8 - Closing thoughts — Closing instalment of Sugu Sougoumarane's Consensus algorithms at scale series on the PlanetScale blog (originally 2022-07-07;…
PlanetScale — Consensus algorithms at scale: Part 7 - Propagating requests — Seventh instalment of Sugu Sougoumarane's Consensus algorithms at scale series on the PlanetScale blog (originally 2022-07-01;…
PlanetScale — Consensus algorithms at scale: Part 6 - Completing requests — Sixth instalment of Sugu Sougoumarane's Consensus algorithms at scale series on the PlanetScale blog (originally 2022-06-21;…
PlanetScale — Consensus algorithms at scale: Part 5 — Handling races — Sugu Sougoumarane (Vitess co-creator, PlanetScale, originally 2022-04-28, re-fetched via RSS 2026-04-21) publishes Part 5 of the Consensus algorithms at scale series.…
PlanetScale — Consensus algorithms at scale: Part 4 — Establishment and revocation — Sugu Sougoumarane (Vitess co-creator, PlanetScale, originally 2022-04-06, re-fetched via RSS 2026-04-21) publishes the fourth instalment of his consensus-algorithms-at-scale…
PlanetScale — Consensus algorithms at scale: Part 3 — Use cases — Third instalment of Sugu Sougoumarane's Consensus algorithms at scale series on the PlanetScale blog (originally 2020-09-26; re-fetched via RSS 2026-04-21;…
PlanetScale — Benchmarking Postgres — Ben Dicken's companion / methodology disclosure to the PlanetScale for Postgres GA launch (2025-07-01). The post announces Telescope,…
PlanetScale — Behind the scenes: How schema reverts work — Holly Guevara and Shlomi Noach (PlanetScale) describe how PlanetScale turns a completed online schema change into a reversible one: the user can click "Revert changes" after…
PlanetScale — Anatomy of a Throttler, part 3 — Shlomi Noach closes his three-part throttler-design series with the client-side axis: who is asking the throttler, why it matters, and how to differentiate between them.…
PlanetScale — Anatomy of a Throttler, part 2 — Shlomi Noach (Vitess maintainer, now at PlanetScale) continues his throttler-architecture series. Part 1 established the problem shape;…
PlanetScale — Anatomy of a Throttler, part 1 — Shlomi Noach (creator of gh-ost, co-author of the Vitess throttler, ex-GitHub, now at PlanetScale) opens a three-part series on throttler design for database systems.…
PlanetScale — AI-Powered Postgres index suggestions — PlanetScale's Rafer Hazen announces AI-powered Postgres index suggestions shipping inside PlanetScale Insights. The product pairs an LLM that proposes CREATE INDEX statements…
Meta — Modernizing the Facebook Groups Search to Unlock the Power of Community Knowledge — Meta re-architected Facebook Groups scoped search — the surface that lets users find answers inside group discussions — from a pure keyword (inverted-index) retrieval system into…
Visibility at Scale: How Figma Detects Sensitive Data Exposure — Figma describes Response Sampling, a two-phase security detection system that inspects a configurable fraction of outbound API responses for sensitive data exposure…
Figma — The Search for Speed in Figma (OpenSearch) — Figma's search team spent several months debugging and re-tuning the search path after upgrading from Elasticsearch to AWS managed OpenSearch Service in late 2023.…
Figma — The Infrastructure Behind AI Search in Figma — Infrastructure companion to Figma's earlier product-narrative post on AI-powered search. Where the product post (see 2026-04-21-figma-how-we-built-ai-powered-search-in-figma)…
Supporting Faster File Load Times with Memory Optimizations in Rust — Figma's Multiplayer server loads each collaborative file into memory to propagate edits across a complex node graph. After the 2024 dynamic page loading rollout…
Figma — Server-side sandboxing — Virtual machines — Part 2 of Figma's security-engineering 3-part series on server-side sandboxing (aka workload isolation) — the practice of accepting that vulnerabilities will exist and minimising…
Figma — Server-side sandboxing — Containers and seccomp — Part 3 of Figma's security-engineering 3-part series on server-side sandboxing (aka workload isolation) — the practice of accepting that vulnerabilities will exist in code…
Figma — Server-side sandboxing — An introduction — Part 1 of Figma's security-engineering 3-part series on server-side sandboxing (a.k.a. workload isolation) — the umbrella intro that frames why you'd sandbox server-side code…
Rolling Out Santa Without Freezing Productivity: Tips from Securing Figma's Fleet — Figma's Endpoint Security team rolled out Santa — the Google-originated open-source macOS binary-authorization tool — to 100% of company laptops over roughly three months,…
Figma Rendering: Powered by WebGPU — Figma's canvas renderer — a C++ codebase compiled to WebAssembly for the browser and to native x64/arm64 for server-side rendering…
Redefining Impact as a Data Scientist (Figma, 2026-04-21) — Figma Engineering post (Data Science author, writing on behalf of the team supporting Billing infrastructure) reframing "data-science impact" in a correctness-heavy domain.…
How We Rebuilt the Foundations of Component Instances — Year-long Figma client-architecture rewrite (2025, 15+ contributors) replacing Instance Updater — the 2016-era self-contained runtime that resolved component-instance properties,…
A Tale of Two Parameter Architectures — and How We Unified Them — Figma retrospective on unifying the architectures behind its two parameter systems — component properties (launched 2022,…
Keeping It 100(x) With Real-time Data At Scale — Figma re-architected LiveGraph — its real-time GraphQL-like data-fetching service — as a multi-tier system ("LiveGraph 100x") to absorb 100× growth in client sessions…
Figma — How We Built AI-Powered Search in Figma — Figma built AI-powered search (shipped at Config 2024) combining visual search (query by screenshot / selected frame / sketch, i.e.…
Figma — How We Built a Custom Permissions DSL at Figma — Figma's engineering team rebuilt permissions enforcement from a Ruby-monolith has_access? method — a growing tangle of if/else branches mixing policy logic with ActiveRecord…
Figma — How Figma's Databases Team Lived to Tell the Scale — Figma's Databases team retrospective on scaling RDS Postgres ~100× since 2020. 2020 baseline was a single Postgres on AWS's largest physical instance;…
How Figma Draws Inspiration From the Gaming World — A Figma engineering post (2026-04-21) by a former game-engine engineer framing Figma's client architecture as a game-engine stack adapted for the browser: a C++ 2D…
Figma — Figma's Next-Generation Data Caching Platform — Figma's Storage Products team built FigCache — a stateless, RESP-wire-protocol proxy service sitting in front of AWS ElastiCache Redis clusters…
Figma — Enforcing Device Trust on Code Changes — Figma's security team adds a cryptographic device-trust check on every Git commit that merges into release branches of its internal monorepo.…
Cloudflare — Moving past bots vs. humans — Cloudflare argues that the "bots vs. humans" frame is no longer useful for web protection. The original browser-vs-server balance…
AWS Architecture Blog — Real-time analytics: Oldcastle integrates Infor with Amazon Aurora and Amazon QuickSight — Oldcastle APG — one of the largest suppliers of construction materials in North America (150+ facilities; hundreds of daily operational-reporting users)…
Airbnb — Building a fault-tolerant metrics storage system at Airbnb — published: 2026-04-21 fetched: 2026-04-22 tags: [airbnb, observability, metrics, prometheus, promxy, shuffle-sharding, multi-cluster, multi-tenancy,…
Pinterest — Smarter URL Normalization at Scale: How MIQPS Powers Content Deduplication — published: 2026-04-20 ingested: 2026-04-23 tags: [pinterest, url-normalization, content-deduplication, canonicalization, query-parameters, offline-analysis,…
Take Control: Customer-Managed Keys for Lakebase Postgres — Databricks launches Customer-Managed Keys (CMK) for Lakebase, its serverless managed-Postgres offering. The technical interest is less the CMK feature itself…
Mercedes-Benz builds a cross-cloud data mesh with Delta Sharing and intelligent replication — Case study from Mercedes-Benz on building a cross-hyperscaler data-sharing backbone between AWS (producer) and Azure (consumer) using Delta Sharing + Unity Catalog,…
Cloudflare — Orchestrating AI Code Review at scale — Cloudflare's 2026-04-20 post details a CI-native AI code-review orchestration system built around OpenCode (open-source coding agent).…
Cloudflare: The AI Engineering Stack We Built Internally — Cloudflare describes the internal AI engineering stack that reached 93% R&D adoption (3,683 users, 47.95M AI requests in 30 days) in 11 months,…
Netflix — The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale — tier: 1 tags: [netflix, live-streaming, broadcast-operations, toc, boc, smpte-2022-7, srt, seamless-switching, signal-redundancy, hub-and-spoke, operator-ratio,…
Governing Coding Agent Sprawl with Unity AI Gateway — Databricks announces Coding Agent Support in Unity AI Gateway — productising the AI-gateway-as-single-choke-point pattern for the specific category of developer coding tools…
Unweight: how we compressed an LLM 22% without sacrificing quality — Cloudflare introduces Unweight, a lossless compression system for LLM weights that shrinks model footprint by 15–22 % while preserving bit-exact outputs…
Shared Dictionaries: compression that keeps up with the agentic web — Cloudflare's 2026-04-17 post announces an open beta opening April 30, 2026 for shared compression dictionaries support on its edge,…
Redirects for AI Training enforces canonical content — Cloudflare's 2026-04-17 post is the dedicated launch of Redirects for AI Training as a one-toggle feature inside AI Crawl Control, available on all paid Cloudflare plans.…
Introducing the Agent Readiness score. Is your site agent-ready? — Cloudflare introduces isitagentready.com, a Lighthouse-style scanner that grades a website against a four-dimension agent-readiness rubric — Agent Discovery, Content for LLMs,…
Introducing Flagship: feature flags built for the age of AI — Cloudflare's 2026-04-17 Agents-Week post launches Flagship — Cloudflare's native feature flag service, in private beta — built on OpenFeature (the CNCF open standard for flag…
Agents Week: network performance update — Cloudflare's Agents Week 2026 performance update reports that between September 2025 and December 2025 Cloudflare moved from being the fastest provider in 40 % of the top-1,000…
Cloudflare — Agents that remember: introducing Agent Memory — Cloudflare's 2026-04-17 post launches Agent Memory (private beta) — an opinionated managed service that extracts information from agent conversations,…
Meta — Post-Quantum Cryptography Migration at Meta: Framework, Lessons, and Takeaways — Meta's Security team publishes a strategy paper on its multi-year PQC migration, describing the principles, prioritisation framework, migration-level ladder,…
Meta — Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale — Meta's Capacity Efficiency team describes a unified AI-agent platform built to automate both halves of hyperscale performance engineering…
GitHub Engineering — How GitHub uses eBPF to improve deployment safety — GitHub hosts its own source code on github.com (they are themselves their largest customer) — so any new host-based deployment system has a built-in circular dependency: if…
Cloudflare Email Service: now in public beta. Ready for your agents — Cloudflare's 2026-04-16 Agents-Week post moves Email Sending out of private beta into public beta, pairing it with the long-standing free Email Routing inbound product to give…
Deploy Postgres and MySQL databases with PlanetScale + Workers — Cloudflare announced the next step of its September-2025 PlanetScale partnership: customers will be able to provision PlanetScale Postgres and MySQL (Vitess) databases directly…
Building the foundation for running extra-large language models — Cloudflare's 2026-04-16 deep-dive on how Workers AI serves extra-large LLMs like Kimi K2.5 (~1T params, ~560 GB of weights).…
Artifacts: versioned storage that speaks Git — Cloudflare's 2026-04-16 post launches Artifacts (private beta, public beta by early May 2026) — a distributed versioned filesystem, built for agents,…
Cloudflare AI Search: the search primitive for your agents — Cloudflare's 2026-04-16 post launches AI Search (formerly AutoRAG) as a plug-and-play managed search primitive for AI agents — hybrid BM25 + vector retrieval on built-in storage,…
Cloudflare's AI Platform: an inference layer designed for agents — A 2026-04-16 Agents-Week post positioning Cloudflare as a unified inference layer: one API (env.AI.run()), 70+ models across 12+ providers, one set of credits.…
Atlassian — Streaming Server-Side Rendering in Confluence — Atlassian's Confluence team adopted React 18 streaming SSR as the second big lever in a multi-year page-load performance effort (p90 latency halved over 12 months;…
Airbnb — Building a high-volume metrics pipeline with OpenTelemetry and vmagent — Companion piece to Airbnb's in-house metrics migration (see 2026-03-17-airbnb-observability-ownership-migration): this post covers the collection + aggregation tier that feeds…
Pinterest — Finding zombies in our systems: A real-world story of CPU bottlenecks — published: 2026-04-15 fetched: 2026-04-21 authors: [Vaibhav Shankar, Raymond Lee, Chia-Wei Chen, Shunyao Li, Yi Li, Ambud Sharma, Saurabh Vishwas Joshi,…
Project Think: building the next generation of AI agents on Cloudflare — Cloudflare announced Project Think (2026-04-15, published alongside the same-day Agent Lee launch) — "the next generation of the Agents SDK"…
Introducing Agent Lee - a new interface to the Cloudflare stack — Cloudflare launched Agent Lee, an in-dashboard AI assistant that understands a user's Cloudflare account and can both troubleshoot and apply changes across the entire platform…
Redpanda — Openclaw is not for enterprise scale — Redpanda unsigned blog post (2026-04-14, ~1,200 words, rhetorical-voice governance essay) arguing that dropping a Claude-Code-class coding agent ("Openclaw") into a sandbox…
Airbnb: Privacy-first connections — Empowering social experiences — type: source created: 2026-04-21 updated: 2026-04-21 tier: 2 tags: [airbnb, privacy, identity, authorization,…
Slack — Managing context in long-run agentic applications — Second post in Slack's Security Engineering series on the Spear multi-agent security-investigation service (first post canonicalised…
Pinterest — Scaling Recommendation Systems with Request-Level Deduplication — published: 2026-04-13 ingested: 2026-04-23 tags: [pinterest, recsys, ranking, retrieval, deduplication, iceberg, training-efficiency, serving-throughput,…
Building a CLI for all of Cloudflare — Cloudflare announced a Technical Preview of the next-generation Wrangler CLI — installable today as npx cf / npm install -g cf…
Netflix — Evaluating Netflix Show Synopses with LLM-as-a-Judge — tags: [llm-as-judge, evaluation, agents-as-a-judge, tiered-rationale, consensus-scoring, automatic-prompt-optimization, dspy-predecessor, binary-scoring,…
Redpanda — Oracle CDC now available in Redpanda Connect — Redpanda (2026-04-09) announces the oracledb_cdc input connector in Redpanda Connect v4.83.0 (enterprise-gated), adding Oracle as the sixth source-database engine in Redpanda's…
Meta — Escaping the Fork: How Meta Modernized WebRTC Across 50+ Use Cases — Meta Engineering's 2026-04-09 post is a multi-year retrospective on retiring a divergent internal WebRTC fork across 50+ RTC use cases — Messenger + Instagram video calling,…
Zalando — Rejecting Invalid Ingress Routes at Apply Time — Zalando runs Skipper as the default Kubernetes ingress controller across 250+ clusters serving 15k+ Ingress objects, ~200k routes, and 500k–2M RPS.…
Pinterest — Performance for Everyone — published: 2026-04-08 tags: [pinterest, mobile, android, ios, web, performance, client-performance, user-perceived-latency, visually-complete, view-tree,…
Build a multi-tenant configuration system with tagged storage patterns — AWS Architecture Blog walkthrough of a multi-tenant configuration service built on two heterogeneous storage backends behind a NestJS gRPC service,…
Yelp — Zero downtime Upgrade: Yelp's Cassandra 4.x Upgrade Story — Yelp Engineering post (2026-04-07) from the Database Reliability Engineering team on upgrading more than a thousand Cassandra nodes from 3.11 to 4.1 with zero downtime.…
Pinterest — Evolution of Multi-Objective Optimization at Pinterest Home Feed — published: 2026-04-07 tags: [pinterest, home-feed, recommendation, multi-objective-optimization, feed-diversification, reranking, dpp, ssd,…
MongoDB Predictive Auto-Scaling: An Experiment — MongoDB Engineering retrospective on the 2023 internal research prototype that explored whether a predictive auto-scaler could outperform MongoDB Atlas's then-existing reactive…
How we built a real-world evaluation platform for autonomous SRE agents at scale — Datadog's retrospective on building the offline, replayable evaluation platform for Bits AI SRE, its autonomous incident-investigation agent.…
Cloudflare targets 2029 for full post-quantum security — Cloudflare publishes an updated Q-Day risk assessment and an accelerated roadmap: full post-quantum security across the entire product suite including authentication by 2029.…
AWS News Blog: Launching S3 Files, making S3 buckets accessible as file systems — The AWS News Blog launch announcement for Amazon S3 Files — the operational/product-launch companion to Andy Warfield's design-essay…
All Things Distributed: S3 Files and the changing face of S3 — Andy Warfield (VP/DE, S3) announces S3 Files, a new S3 feature that integrates Amazon EFS under S3 and lets any S3 bucket or prefix be mounted as a network filesystem from EC2,…
Netflix — Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale — published: 2026-04-06 ingested: 2026-04-22 tags: [netflix, druid, apache-druid, real-time-analytics, olap, time-series-database, dashboard, live-show-monitoring,…
Meta — How Meta used AI to map tribal knowledge in large-scale data pipelines — Meta's Data Platform team points AI coding agents at one of its large-scale data processing pipelines — four repositories,…
AWS Architecture Blog — Unlock efficient model deployment: Simplified Inference Operator setup on Amazon SageMaker HyperPod — Product-announcement post for the SageMaker HyperPod Inference Operator shipping as a native Amazon EKS add-on (replacing the prior Helm-chart install path),…
Synchronizing the Senses — Powering Multimodal Intelligence for Video Search — source: Netflix TechBlog tier: 1 tags: [netflix, video-search, multimodal, elasticsearch, cassandra, kafka, annotation-service, marken, temporal-fusion,…
The uphill climb of making diff lines performant — GitHub Engineering describes the year+ rewrite of the Files changed tab in the React-based pull-request review UI (now the default for all users).…
Redpanda — Supercharging Redpanda Streaming with profile-guided optimization — 2026-04-02 Redpanda engineering deep-dive — the promised mechanism-level companion to the 2026-03-31 Redpanda 26.1 launch post's one-line disclosure "Profile-Guided Optimization…
Netflix — Smarter Live Streaming at Scale: Rolling Out VBR for All Netflix Live Events — tier: 1 tags: [netflix, live-streaming, vbr, cbr, qvbr, capped-vbr, bitrate-ladder, vmaf, rebuffering, adaptive-bitrate-streaming, open-connect,…
Meta — KernelEvolve: How Meta's Ranking Engineer Agent Optimizes AI Infrastructure — Meta Engineering describes KernelEvolve, an agentic kernel-authoring system used by Meta's Ranking Engineer Agent (REA) to autonomously generate and optimize production-grade…
Dropbox — Improving storage efficiency in Magic Pocket, our immutable blob store — Dropbox's Magic Pocket storage team hit an overhead spike after a new erasure-coded write path ("the Live Coder service") rolled out across regions…
Cloudflare — Introducing EmDash: a spiritual successor to WordPress that solves plugin security — Cloudflare announces EmDash (v0.1.0 preview), a new CMS positioned as "the spiritual successor to WordPress." Written in TypeScript, powered by Astro for frontend rendering,…
AWS Architecture Blog — Automate safety monitoring with computer vision and generative AI — AWS Architecture Blog retrospective on a serverless, event-driven computer-vision + generative-AI safety-monitoring solution that continuously analyses fixed-camera feeds across…
Slack — From Custom to Open: Scalable Network Probing and HTTP/3 Readiness with Prometheus — Slack's edge-networking team had a monitoring gap when they started rolling out HTTP/3 on their public edge: existing SaaS and internal black-box probes could not speak HTTP/3…
Redpanda — Redpanda 26.1 delivers the industry's first adaptable streaming engine — 2026-03-31 Redpanda product-launch post for Redpanda 26.1 / Redpanda One (R1), whose headline is the General Availability of Cloud Topics…
Meta — Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads — Meta Engineering's 2026-03-31 ML Applications post describes Meta Adaptive Ranking Model, the serving stack Meta built to scale its Ads ranking models to LLM-scale complexity…
Google Research — Safeguarding cryptocurrency by disclosing quantum vulnerabilities responsibly — Google Research lays out its disclosure philosophy for the 2026 quantum-attack-on-elliptic-curve-crypto result that Cloudflare's 2026-04-07 post cited second-hand.…
AWS Architecture Blog — Streamlining access to powerful disaster recovery capabilities of AWS — Survey-style AWS Architecture Blog post positioning AWS's DR building blocks in a layered "building-blocks" progression: data protection via AWS Backup → compute recovery via AWS…
Redpanda — Under the hood: Redpanda Cloud Topics architecture — Redpanda's 2026-03-30 post is the first architectural deep-dive on Cloud Topics after its general-availability release in Redpanda Streaming 26.1 (prior wiki coverage via the 25.3…
Architecting for agentic AI development on AWS — AWS Architecture Blog prescriptive essay on how to architect AWS systems so AI coding agents can operate effectively. Thesis: most cloud architectures were designed…
Lyft — Beyond A/B Testing: Using Surrogacy and Region-Splits to Measure Long-Term Effects in Marketplaces — type: source created: 2026-04-22 updated: 2026-04-22 company: lyft source_url:…
Dropbox — Reducing our monorepo size to improve developer velocity — Dropbox's server-side monorepo — a single Git repository holding most of the company's backend code across many services and libraries — had grown to 87 GB,…
Expedia — Operating Trino at Scale With Trino Gateway — Expedia Group's data-platform team writes up their production use of Trino Gateway — an open-source proxy / load balancer for Trino clusters (originally forked from Lyft's Presto…
Datadog — When upserts don't update but still write: debugging Postgres performance at scale — Datadog's host-metadata team added an INSERT ... ON CONFLICT DO UPDATE upsert to track last_ingested per host on a dedicated, unindexed,…
How Generali Malaysia optimizes operations with Amazon EKS — Generali Malaysia — one of Malaysia's largest general insurers, part of the Generali Group (~190 years) — migrated to AWS in 2019 and selected Amazon EKS as the target container…
Slack — How Slack Rebuilt Notifications — Slack Engineering retrospective on the Notifications 2.0 project — a ground-up redesign of Slack's notification preference system that migrated millions of users from four…
Pinterest — Building an MCP Ecosystem at Pinterest — tier: 2 tags: [pinterest, mcp, model-context-protocol, agents, mcp-registry, hosted-mcp, envoy, spiffe, jwt, oauth, authorization, service-mesh,…
Meta — Friend Bubbles: Enhancing Social Discovery on Facebook Reels — Meta Engineering describes the recommendation-system architecture behind Friend Bubbles on Facebook Reels — the UI affordance that annotates a Reel with small avatar bubbles…
AI-powered event response for Amazon EKS — AWS Architecture Blog product post on AWS DevOps Agent, a fully managed autonomous AI agent (built on Amazon Bedrock) that investigates operational events on Amazon EKS clusters.…
How we optimized Dash's relevance judge with DSPy — Dropbox Tech post on how the Dash relevance judge — an LLM-as-judge that scores (query, document) pairs on a 1–5 scale — was adapted across three different target models using DSPy…
Airbnb: From vendors to vanguard — hard-won lessons in observability ownership — type: source created: 2026-04-21 updated: 2026-04-21 company: airbnb tier: 2 published: 2026-03-17 tags: [observability, metrics, prometheus, promql, migration,…
Zalando — Search Quality Assurance with AI as a Judge — Zalando's Search & Browse team describes the offline framework they built to validate search quality before launching into a new country with no prior user data.…
Stripe — 10 things we learned building for the first generation of agentic commerce — Stripe's 2026-03-12 retrospective, co-authored by the Agentic Commerce Suite product team, on six months of field experience integrating enterprise retailers (URBN / Anthropologie…
Airbnb: Recommending travel destinations to help users explore — type: source created: 2026-04-21 updated: 2026-04-21 tier: 2 tags: [airbnb, recommendation-systems, transformers, sequence-modeling, ml-serving,…
Redpanda — Redpanda Cloud's BYOVPC for AWS is now Generally Available — Redpanda's 2026-03-11 unsigned launch-announcement post moves Bring Your Own VPC (BYOVPC) for AWS from beta to GA on Redpanda Cloud.…
Fly.io — Unfortunately, Sprites Now Speak MCP — Thomas Ptacek, Fly.io, 2026-03-10. Announces that Sprites now have an official remote MCP server at sprites.dev/mcp — plug into Claude Desktop (or any MCP-speaking agent),…
Meta — FFmpeg at Meta: Media Processing at Scale — Meta Engineering's 2026-03-09 post describes how Meta has deprecated its long-standing internal FFmpeg fork for all DASH video-on-demand (VOD) and livestreaming pipelines,…
When an AI agent came knocking: Catching malicious contributions in Datadog's open source repos — Datadog Engineering retrospective (2026-03-09) on how Datadog's BewAIre LLM-driven PR-review system detected two separate attacks against Datadog's public repositories…
Pinterest — Unified Context-Intent Embeddings for Scalable Text-to-SQL — published: 2026-03-06 tags: [text-to-sql, analytics-agent, llm, rag, embeddings, vector-search, data-catalog, data-governance, query-history, mcp, presto,…
Redpanda — Introducing Iceberg output for Redpanda Connect — Unsigned Redpanda launch post (~1,000 words, 2026-03-05) announcing the iceberg output connector for Redpanda Connect (shipped in Redpanda Connect v4.80.0, enterprise license).…
Designing MCP tools for agents: Lessons from building Datadog's MCP server — Datadog's retrospective on shipping its official MCP (Model Context Protocol) server — the company's first observability interface built specifically for customer AI agents rather…
Airbnb: It wasn't a culture problem — upleveling alert development at Airbnb — type: source created: 2026-04-21 updated: 2026-04-21 company: airbnb tier: 2 published: 2026-03-04 tags: [observability, alerting, prometheus, backtesting,…
Zalando — Why We Ditched Flink Table API Joins: Cutting State by 75% with DataStream Unions — Zalando's Search & Browse team publishes a production-engineering retrospective on the Product Offer Enrichment pipeline…
Pinterest — Unifying Ads Engagement Modeling Across Pinterest Surfaces — tier: 2 tags: [pinterest, ads-ranking, engagement-modeling, ctr-prediction, model-unification, multi-surface-serving, mmoe, multi-gate-mixture-of-experts,…
Netflix — Optimizing Recommendation Systems with JDK's Vector API — Netflix TechBlog post (2026-03-03, Tier 1; Harshad Sane, Netflix) on optimizing the video serendipity scoring hot path in Ranker, Netflix's homepage-row recommendation service.…
How we rebuilt the search architecture for high availability in GitHub Enterprise Server — GitHub Engineering describes a year-long rewrite of the GitHub Enterprise Server (GHES) search-indexing substrate, shipping in 3.19.1 (opt-in via ghe-config app.elasticsearch.ccr…
Meta — Investing in Infrastructure: Meta's Renewed Commitment to jemalloc — Meta Engineering announces that it is renewing investment in jemalloc, the open-source high-performance memory allocator that has long been a load-bearing component of Meta's…
Netflix — Mount Mayhem at Netflix: Scaling Containers on Modern CPUs — Netflix's migration from its legacy virtual-kubelet + docker container runtime to a modern kubelet + containerd runtime on Titus surfaced a startup-path hang: on r5.metal instances…
Pinterest — Bridging the Gap: Diagnosing Online-Offline Discrepancy in Pinterest's L1 Conversion Models — tier: 2 tags: [pinterest, ads, ranking, l1-ranking, cvr, conversion-model, ocpm, cpa, auction, two-tower, embedding, ann, ann-index, online-offline-discrepancy,…
Towards Model-based Verification of a Key-Value Storage Engine — MongoDB's Part 2 follow-up to the 2026-02 distributed-transactions formal-methods-before-shipping series describes how the modular structure of the team's TLA+ spec of MongoDB's…
We deserve a better streams API for JavaScript — James Snell — Cloudflare Workers runtime engineer, Node.js TSC member, multi-runtime implementer of the WHATWG Streams Standard ("Web streams")…
Instacart — Our Early Journey to Transform Instacart's Discovery Recommendations with LLMs — published: 2026-02-26 ingested: 2026-04-22 tags: [instacart, generative-recommendations, recommendations, discovery, personalization, llm, rag, shopping-hub,…
Using LLMs to amplify human labeling and improve Dash search relevance — Dropbox Tech post on how Dash trains the search relevance model that sits under its retrieval tool — specifically, where the labeled relevance judgements come from.…
AWS: Digital Transformation at Santander — How Platform Engineering is Revolutionizing Cloud Infrastructure — Joint AWS × Santander architecture-blog post on Catalyst, Santander's internal platform built in partnership with the AWS Platform Strategy Program (PSP).…
6,000 AWS accounts, three people, one platform: Lessons learned (AWS Architecture Blog, 2026-02-25) — ProGlove (smart-wearable barcode scanners for frontline workers) runs its Insight SaaS platform on AWS in an account-per-tenant model: every tenant gets a dedicated AWS account,…
Pinterest — Piqama: Pinterest Quota Management Ecosystem — tier: 2 tags: [pinterest, piqama, quota, quota-management, rate-limiting, capacity-quota, big-data-platform, moka, yunikorn, pinconf, tidb, online-storage,…
How we rebuilt Next.js with AI in one week (vinext) — Cloudflare's 2026-02-24 post announces vinext — a from-scratch reimplementation of the Next.js API surface on top of Vite (as a Vite plugin),…
Cloudflare — How we rebuilt Next.js with AI in one week — Cloudflare engineering manager Sunil Pai (one engineer + Claude via OpenCode) rebuilt the Next.js API surface from scratch on top of Vite as a drop-in replacement called vinext…
Netflix — MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix — published: 2026-02-23 fetched: 2026-04-21 created: 2026-04-22 updated: 2026-04-22 tags: [netflix, mediafm, multimodal, foundation-model, video-understanding,…
Lyft — Scaling Localization with AI at Lyft — raw: raw/lyft/2026-02-19-scaling-localization-with-ai-at-lyft-dbeb06ee.md tags: [lyft, localization, i18n, machine-translation, llm, drafter-evaluator,…
How we reduced the size of our Agent Go binaries by up to 77% — Datadog Engineering retrospective (2026-02-18) on how the Datadog Agent team cut Go-binary sizes by up to 77% across a 6-month program (Dec 2024 → Jul 2025) spanning versions…
Airbnb Sitar: Safeguarding dynamic configuration changes at scale — type: source created: 2026-04-21 updated: 2026-04-21 company: airbnb tier: 2 published: 2026-02-18 tags: [dynamic-configuration, feature-flags, control-plane,…
Instacart — Turning Data into Velocity: Caper's Edge and Cloud Data Flywheel with Capsight — published: 2026-02-17 authors: [Youming Luo, Andrew Tanner, Matas Sriubiskis, Sylvia Lin, Sikun Zhu, Lei Li, Xiao Zhou] tags: [edge-computing, data-flywheel,…
Expedia — Interleaving for Accelerated Testing (2026-02-17) — raw: raw/expedia/2026-02-17-interleaving-for-accelerated-testing-e4435b36.md published: 2026-02-17 tags: [expedia, search, ranking, experimentation,…
A chat with Byron Cook on automated reasoning and trust in AI systems — Werner Vogels interviews Byron Cook (Amazon Distinguished Scientist + VP) three and a half years after their first conversation on automated reasoning.…
Scaling LLM Post-Training at Netflix — tags: [netflix, llm, post-training, sft, rl, grpo, dpo, knowledge-distillation, ray, fsdp, spmd, vllm, huggingface, lora, mfu, checkpointing, sequence-packing,…
How low-bit inference enables efficient AI — Dropbox's ML team surveys the low-bit inference landscape — reducing numerical precision of activations and weights (from FP16 down through FP8, FP4,…
Google Research — Scheduling in a changing world: Maximizing throughput with time-varying capacity — Google Research post (2026-02-11) on online throughput-maximising scheduling under a time-varying machine-capacity profile…
Redpanda — How to safely deploy agentic AI in the enterprise — Blog recap of a talk by Tyler Akidau (Redpanda CTO, originator of the Google Dataflow / Apache Beam streaming model) at Dragonfly's Modern Data Infrastructure Summit titled…
From Print to Digital: Making Weekly Flyers Shoppable at Instacart Through Computer Vision and LLMs — tags: [instacart, computer-vision, image-segmentation, llm, vlm, multimodal, sam, segment-anything-model, weighted-boxes-fusion, non-maximum-suppression,…
AWS: How Convera built fine-grained API authorization with Amazon Verified Permissions — AWS Architecture Blog post by the Amazon Verified Permissions and Convera teams on Convera's adoption of Amazon Verified Permissions (AVP,…
Fly.io — Litestream Writable VFS — Ben Johnson's follow-up to the 2025-12-11 Litestream VFS ship post, published 2026-02-04. The post discloses two new capabilities added to the VFS…
AWS: Mastering millisecond latency and millions of events — the event-driven architecture behind the Amazon Key Suite — AWS Architecture Blog post by the Amazon Key team on modernizing their access-management platform (In-Garage Delivery, apartment-building access for property managers)…
Instacart — Migrating to Jetpack Compose: How AI Accelerated Our Journey at Caper — raw: raw/instacart/2026-02-03-migrating-to-jetpack-compose-fef2688d.md tags: [instacart, caper, android, jetpack-compose, fragments, migration, refactoring,…
Yelp — How Yelp Built a Back-Testing Engine for Safer, Smarter Ad Budget Allocation — Yelp Engineering post (2026-02-02) describing the Back-Testing Engine their Ad Budget Allocation team built to simulate proposed algorithm changes against historical campaign data…
Sovereign failover — Design for digital sovereignty using the AWS European Sovereign Cloud — Architectural companion to the 2026-01-16 AWS European Sovereign Cloud GA announcement. Where that post was governance / compliance / marketing,…
Fly.io — Litestream Writable VFS — Ben Johnson's 2026-01-29 shipping post announcing two new opt-in modes for Litestream VFS — the SQLite VFS extension shipped 2025-12-11…
Cloudflare — Moltworker: a self-hosted personal AI agent, minus the minis — Cloudflare ports Moltbot (formerly Clawdbot; later renamed OpenClaw as of 2026-01-30) — an open-source self-hosted personal AI agent normally run on a user's own Mac mini / VPS…
Rust at Scale: An Added Layer of Security for WhatsApp — Meta's WhatsApp security team describes its global rollout of a Rust-rewritten media-consistency library — shipping on billions of phones, laptops, desktops, watches,…
Dropbox: VP Josh Clemm on how we use knowledge graphs, MCP, and DSPy in Dash — Edited + condensed version of a talk Josh Clemm (VP of Engineering for Dropbox Dash) gave as a guest speaker in Jason Liu's online RAG course on Maven.…
Redpanda — Engineering Den: Query manager implementation demo — First post in Redpanda's new Engineering Den series, a short (~600 words) post-acquisition disclosure from the Oxla team covering their rewrite of the query manager…
What came first: the CNAME or the A record? — Cloudflare post-mortem on the ~40-minute partial global outage of 1.1.1.1 on 2026-01-08, 17:40–19:55 UTC. Root cause not an attack,…
When protections outlive their purpose — a lesson on managing defense systems at scale — GitHub Engineering's public post-mortem on a quiet-but-sustained false-positive class: legitimate logged-out users browsing GitHub hitting "Too many requests" errors during normal…
Zalando — Paper Announcement: A Practical Approach to Replenishment Optimization with Extended (R, s, Q) Policy and Probabilistic Models — Zalando's blog announcement (2026-01-14) of their Nature Scientific Reports publication on the algorithmic core of the ZEOS Inventory Optimisation System…
Fly.io — The Design & Implementation of Sprites — Thomas Ptacek's 2026-01-14 implementation-deep-dive on Sprites, five days after the 2026-01-09 launch-plus-manifesto. Where the launch post argued the thesis (ephemeral sandboxes…
Redpanda — The convergence of AI and data streaming, Part 1: The coming brick walls — First instalment of a four-part industry-commentary series by Peter Corless (Redpanda), distilled from his talk at the AI-by-the-Bay conference in Oakland.…
Open Sourcing Dicer: Databricks' Auto-Sharder — Databricks open-sourced Dicer, the auto-sharder that underlies "every major Databricks product". Dicer is an intelligent control plane that continuously,…
How Salesforce migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000 EKS clusters — AWS Architecture Blog case study (2026-01-12) documenting Salesforce's mid-2025→early-2026 migration of its Kubernetes platform — 1,000+ Amazon EKS clusters, 1,180+ node pools,…
Fly.io — Code And Let Live — Thomas Ptacek's 2026-01-09 product-launch-plus-manifesto announcing Sprites — Fly.io's new coding-agent substrate — and arguing that ephemeral read-only sandboxes for coding agents…
Wix — MCP Resources: all you need to know — tags: [wix, mcp, model-context-protocol, mcp-resources, mcp-tools, resource-link, embedded-resource, uri-template, rfc3986, rfc6570, subscribe-notify,…
Vercel — How we made v0 an effective coding agent — Vercel's 2026-01-08 retrospective (HN 29 points) on the three production techniques that move v0 — their browser-based AI website builder…
A closer look at a BGP anomaly in Venezuela — On 2026-01-02 between 15:30 and 17:45 UTC, AS8048 (CANTV, Venezuela's state-run ISP) leaked a set of prefixes in the 200.74.224.0/20 subnet…
Hardening eBPF for runtime security: Lessons from Datadog Workload Protection — Datadog Workload Protection's 5-year retrospective on running eBPF at scale — "six hard-won lessons" plus a rollout-safety coda.…
Redpanda — Build a real-time lakehouse architecture with Redpanda and Databricks — Tech-talk recap post (unsigned Redpanda author; ~1,100 words) summarising a joint Redpanda/Databricks session "From Stream to Table: Building a real-time lakehouse architecture…
Lyft — Lyft's Feature Store: Architecture, Optimization, and Evolution — raw: raw/lyft/2026-01-06-lyfts-feature-store-architecture-optimization-and-evolution-615733d0.md tags: [feature-store, ml-infrastructure, lyft, dynamodb, valkey,…
Expedia — Powering Vector Embedding Capabilities — Expedia Group's ML Platform team describes the Embedding Store Service — a centralized vector-embedding platform exposing standardized APIs for creating collections,…
Netflix — The Netflix Simian Army (2011) — Netflix's 2011 post (republished 2026-01-02 on Medium) by Yury Izrailevsky (Director, Cloud & Systems Infrastructure) and Ariel Tseitlin (Director,…
MongoDB Server Security Update, December 2025 — On 2025-12-12 at 19:00 ET, MongoDB's Security Engineering team internally detected a security vulnerability in the MongoDB Server (Community + Enterprise editions)…
Zalando — Contributing to Debezium: Fixing Logical Replication at Scale — Zalando's 2025-12-18 engineering post is the sequel to their 2023-11 pgjdbc upstream fix (2023-11-08-zalando-patching-the-postgresql-jdbc-driver).…
MongoDB (Voyage AI) — Token-count-based Batching: Faster, Cheaper Embedding Inference for Queries — Voyage AI by MongoDB describes the production embedding-inference pipeline it runs for query embeddings (the short, latency-sensitive side of retrieval workloads) and the two…
Inside the feature store powering real-time AI in Dropbox Dash — Dropbox built an internal feature store to power ranking in Dash — their AI-powered universal search product — because off-the-shelf options didn't bridge their split…
Zalando — The Day Our Own Queries DoS’ed Us: Inside Zalando Search — Zalando's Search & Browse team publishes a production-incident retrospective on a Sunday-afternoon Elasticsearch meltdown in which the root cause was a self-inflicted denial…
Lyft — From Python 3.8 to Python 3.10: Our Journey Through a Memory Leak — type: source created: 2026-04-22 updated: 2026-04-22 company: lyft sourceurl:…
Fly.io — Litestream VFS — Ben Johnson's shipping post for Litestream VFS — the SQLite VFS extension teased as proof-of-concept in the 2025-05-20 revamp and flagged as "still not shipped" in the 2025-10-02…
Architecting conversational observability for cloud applications — AWS Architecture Blog reference-architecture post (2025-12-11) for a generative-AI-powered Kubernetes troubleshooting assistant.…
Redpanda — Streaming IoT and event data into Snowflake and ClickHouse — Unsigned Redpanda Blog how-to / vendor-tutorial post (2025-12-09) framing a reference pipeline for IoT + event data streaming: Redpanda → Redpanda Connect → both Snowflake…
Netflix — AV1: Now Powering 30% of Netflix Streaming — Netflix's Encoding Technologies team (Liwei Guo, Zhi Li, Sheldon Radford, Jeff Watts) retrospectively documents the full AV1 deployment arc on the Netflix streaming service,…
Cloudflare outage on December 5, 2025 — On 2025-12-05 at 08:47 UTC, a portion of Cloudflare's network began serving HTTP 500 errors for a subset of customers. The incident was resolved at 09:12 UTC…
How We Debug 1000s of Databases with AI at Databricks — Databricks built an internal AI agent platform (Storex) that unifies database investigation across a fleet of thousands of database instances spanning every major cloud,…
Redpanda — Operationalize Redpanda Connect with GitOps — Redpanda's 2025-12-02 unsigned tutorial post walks through the canonical GitOps deployment shape for Redpanda Connect pipelines on Kubernetes…
The local-first rebellion: How Home Assistant became the most important project in your house (GitHub Blog, 2025-12-02) — GitHub Blog (Open Source / Maintainers column) profile of Franck "Frenck" Nijhof, lead of Home Assistant, framed around Octoverse 2025 data placing Home Assistant among GitHub's…
Slack — Streamlining security investigations with agents — Slack's Security Engineering team built an internal multi-agent system (Spear, announced in this post as the first in a series) that triages security-detection alerts during…
Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall — AWS Architecture Blog reference-architecture post on how to deploy a centralized network inspection topology for Amazon EVS (AWS's managed VMware Cloud Foundation stack running…
Slack — Android VPAT journey — Slack Engineering retrospective (2025-11-19) on how the Slack Android team triaged and resolved accessibility issues surfaced by a 2024 third-party VPAT (Voluntary Product…
Lyft — LyftLearn Evolution: Rethinking ML Platform Architecture — raw: raw/lyft/2025-11-18-lyftlearn-evolution-rethinking-ml-platform-architecture-fe6c6d4a.md tags: [ml-platform, lyft, lyftlearn, sagemaker, kubernetes, eks,…
Scaling real-time file monitoring with eBPF: How we filtered billions of kernel events per minute — Datadog's File Integrity Monitoring (FIM) team describes how they reduced an eBPF-collected file-event stream of >10B events/minute down to ~1M events/minute (≈94% reduction)…
Cloudflare outage on November 18, 2025 — On 2025-11-18 at 11:20 UTC, Cloudflare's network began experiencing significant failures to deliver core network traffic.…
Dropbox: How Dash uses context engineering for smarter AI — Dropbox's ML team describes the context-engineering evolution of Dash from a conventional RAG search surface (semantic + keyword over indexed documents) into an agentic AI…
Instacart — Building The Intent Engine: How Instacart is Revamping Query Understanding with LLMs — raw: raw/instacart/2025-11-13-building-the-intent-engine-how-instacart-is-revamping-query-dc1eda15.md tags: [instacart, query-understanding, search,…
Expedia — Colocating Input Partitions with Kafka Streams When Consuming Multiple Topics: Sub-Topology Matters! — Expedia debugs an in-production Kafka Streams application that consumed from two topics with identical partition counts and similar key strategies,…
Slack — Build better software to build software better — Slack's Quip/Canvas team took their monorepo build from 60 minutes to as low as 10 minutes (cached & parallelised) — a ~6× speed-up…
Redpanda — Redpanda 25.3 delivers near-instant disaster recovery and more — 2025-11-06 Redpanda product-launch post previewing the four headline features of the Redpanda 25.3 release: Shadowing (near-instant multi-region disaster recovery),…
Google Research — DS-STAR: A state-of-the-art versatile data science agent — Google Research introduces DS-STAR — a data-science agent built from four specialised LLM sub-agents (Data File Analyzer, Planner, Coder, Verifier) plus a Router,…
Fly.io — You Should Write An Agent — Thomas Ptacek's 2025-11-06 pedagogical essay arguing that every programmer who wants to reason about LLM agents — "the best hater (or stan) you can be"…
Google Research — Exploring a space-based, scalable AI infrastructure system design — Google Research announces Project Suncatcher — a moonshot research programme investigating whether AI compute can be scaled in low Earth orbit by placing TPU-carrying satellites…
Replication redefined: How we built a low-latency, multi-tenant data replication platform — Datadog Engineering retrospective (2025-11-04) on building the internal managed data-replication platform that powers Postgres-to-Elasticsearch, Postgres-to-Postgres,…
Immutable releases are now generally available on GitHub — GitHub announced the general availability of immutable releases on 2025-10-28 (surfaced here on 2025-10-31). Once a release is published with immutability enabled,…
Toward provably private insights into AI use (Google Research, 2025-10-30) — Google Research introduces Provably Private Insights (PPI): a production serving-infrastructure pattern for answering aggregate analytical questions ("what topics do Recorder users…
Redpanda — Introducing the Agentic Data Plane — Alex Gallego's (Redpanda founder/CEO) productization follow-up to his 2025-04-03 autonomy essay, naming the commercial shape of that vision: the Agentic Data Plane (ADP)…
Redpanda — Governed autonomy: The path to enterprise Agentic AI — 2025-10-28 Redpanda product-launch / vision post naming the Agentic Data Plane (ADP) as Redpanda's packaged answer to the enterprise-Agentic-AI governance problem.…
Slack — Advancing Our Chef Infrastructure: Safety Without Disruption — Archie Gunasekara's 2025-10-23 follow-up to Slack's 2024 Advancing Our Chef Infrastructure post. Describes phase two of Slack's EC2 / Chef deploy-safety work: instead of migrating…
Fly.io — Corrosion — Fly.io's canonical introduction post for Corrosion — the Rust-written state-distribution system that propagates a SQLite database across Fly.io's global worker fleet via a gossip…
Google Research — Solving virtual machine puzzles: How AI is optimizing cloud computing — Google Research post (2025-10-17) introducing a trio of lifetime-aware VM-allocation algorithms — NILAS (Non-Invasive Lifetime Aware Scoring), LAVA (Lifetime-Aware VM Allocation),…
Google Research — Coral NPU: A full-stack platform for Edge AI — Google Research introduces Coral NPU as a full-stack, open reference architecture for low-power on-device ML. The post's load-bearing architectural claim is that existing edge…
Cloudflare — Unpacking Cloudflare Workers CPU Performance Benchmarks — Public-response post to Theo Browne's 2025-10-04 cf-vs-vercel-bench benchmark suite, which showed Cloudflare Workers running CPU-heavy JavaScript up to 3.5× slower than Node.js…
Cars24 Improves Search For 300 Million Users With MongoDB Atlas — MongoDB-Blog case study of Cars24 — Indian multinational online car marketplace serving 300 million users globally across car sales, insurance, maintenance, and financing.…
MongoDB — The Cost of Not Knowing MongoDB, Part 3: appV6R0 to appV6R4 — Third and final installment of MongoDB's senior-developer-authored case study on iteratively tuning a document schema by load-testing against a fixed hardware budget (an…
Fly.io — Kurt Got Got — Fly.io security postmortem (2025-10-08) disclosing that Kurt Mackey, the CEO, was phished and the company's @flydotio Twitter/X account was taken over for ~15 hours.…
Cloudflare — How we found a bug in Go's arm64 compiler — Weeks-long debugging retrospective on a one-instruction race condition in Go's arm64 code generator. On stack frames slightly larger than 1<<12 bytes,…
Slack — Deploy Safety: Reducing customer impact from change — Slack's 2025-10-07 retrospective on the Deploy Safety Program — an 18-month cross-org reliability program (mid-2023 → Jan 2025) that reduced customer impact hours…
Google Research — Speech-to-Retrieval (S2R): A new approach to voice search — Google Research introduces Speech-to-Retrieval (S2R) as a new architectural approach for voice search that bypasses the intermediate text transcript entirely…
Meta — Introducing OpenZL: An Open Source Format-Aware Compression Framework — Meta announces OpenZL, a new open-source lossless compression framework that targets structured data (tabular, columnar, numeric arrays, timeseries, ML tensors,…
Zalando — Accelerating Mobile App Development at Zalando with Rendering Engine and React Native — Zalando is migrating its mobile app (previously two separate codebases — a native iOS app and a native Android app, 90+ screens total) to React Native,…
Redpanda — Real-time analytics at scale: Redpanda and Snowflake Streaming — Unsigned Redpanda benchmark post disclosing a large-scale performance test of a Redpanda → Snowflake streaming pipeline built with Redpanda Connect's snowflakestreaming output…
Fly.io — Litestream v0.5.0 is Here — Ben Johnson's shipping announcement for Litestream v0.5.0 — the first batch of the 2025-05-20 "Litestream: Revamped" redesign is now in users' hands.…
Intelligent Kubernetes Load Balancing at Databricks — Databricks replaced Kubernetes' default L4 kube-proxy load balancing with an in-house proxyless, client-side L7 load-balancing system backed by a custom xDS control plane (Endpoint…
MongoDB — Top Considerations When Choosing A Hybrid Search Solution — MongoDB 2025-09-30 technical-blog post (author implicit; MongoDB product-marketing flavour but with legitimate architectural content) surveys the industry evolution of hybrid…
Expedia — Why You Should Prefer MERGE INTO Over INSERT OVERWRITE in Apache Iceberg — A short Expedia Group Tech post arguing that on apache-iceberg tables, teams should default to MERGE INTO (row-level conditional upsert,…
Yelp — S3 server access logs at scale — Yelp Engineering post (2025-09-26) by the SRE / Storage team on operationalising S3 Server Access Logs (SAL) at Yelp's fleet scale…
MongoDB — From Niche NoSQL To Enterprise Powerhouse: The Story Of MongoDB's Evolution — A 2025-09-25 MongoDB Engineering blog post by Ashish Agrawal (joined MongoDB ~2023 via the Grainite acquisition; prior ~decade at Google on Bigtable / Spanner / Datastore /…
MongoDB — Carrying Complexity, Delivering Agility — A 2025-09-25 MongoDB engineering-leadership manifesto co-authored by Ashish Agrawal (joined MongoDB via the Grainite acquisition ~2023,…
Zalando — Dead Ends or Data Goldmines? Investment Insights from Two Years of AI-Powered Postmortem Analysis — Zalando's SRE-adjacent datastore team built a multi-stage LLM pipeline to mine thousands of archived postmortems for recurring failure patterns across their five Postgres /…
PlanetScale — Processes and Threads — Ben Dicken (PlanetScale, 2025-09-24, re-fetched 2026-04-21) publishes an interactive-article pedagogical piece on operating-system process and thread abstractions that lands,…
Build AI Agents Worth Keeping: The Canvas Framework — MongoDB-Blog thought-leadership post diagnosing why so many enterprise AI agent projects stall in pilot and prescribing a structured design flow to exit that trap.…
Cloudflare — Cap'n Web: a new RPC system for browsers and web servers — Announcement and full design walkthrough for Cap'n Web, a new RPC protocol and pure-TypeScript implementation open-sourced by Cloudflare (MIT, github.com/cloudflare/capnweb).…
MongoDB Community Edition to Atlas: A Migration Masterclass with BharatPE — MongoDB-Blog case study of BharatPE — Indian fintech processing ~₹12,000 crore (~US $1.368 B) in monthly UPI transactions on 45 TB across three self-hosted MongoDB Community…
MongoDB — Modernizing Core Insurance Systems: Breaking The Batch Bottleneck — MongoDB authors a framework-level retrospective on post-migration batch-job regressions observed at insurance customers modernizing core platforms from PL/SQL + legacy RDBMS…
Google Research — Making LLMs more accurate by using all of their layers (SLED) — Google Research introduces SLED (Self Logits Evolution Decoding) — a factuality decoding method that improves LLM accuracy by using the early-exit logits from every transformer…
Post-quantum security for SSH access on GitHub — GitHub announced (effective 2025-09-17) the addition of sntrup761x25519-sha512 / sntrup761x25519-sha512@openssh.com as a new SSH key-exchange algorithm on github.com's SSH…
Shopify — Migrating to React Native's New Architecture — Shopify migrated its two largest mobile apps — Shopify Mobile and Shopify Point of Sale (POS) — to React Native's New Architecture (Fabric renderer + TurboModules + synchronous…
Google Research — Speculative cascades: A hybrid approach for smarter, faster LLM inference — Google Research frames speculative cascades as a unified LLM-serving latency-optimization technique that combines two previously-separate primitives…
Rearchitecting GitHub Pages — Around early 2015, GitHub Pages outgrew its single-machine active/standby origin. The original design ran the entire service on a single pair of machines with user data spread…
Instacart — Simplifying Large-Scale LLM Processing across Instacart with Maple — raw: raw/instacart/2025-08-27-simplifying-large-scale-llm-processing-across-instacart-with-7fe37df1.md tags: [instacart, maple, llm, batch-processing, catalog,…
Google Research — From massive models to mobile magic: The tech behind YouTube real-time generative AI effects — Google Research describes the training-to-serving pipeline behind YouTube's real-time, on-device generative AI effects — the stylised face / image effects users apply while…
Seventh-generation server hardware at Dropbox: our most efficient and capable architecture yet — Dropbox's seventh-generation in-house server hardware — replacing the 2020-era sixth-gen Cartman platform — rolled out across five named tiers: Crush (compute), Dexter (database),…
All Things Distributed — Removing friction from Amazon SageMaker AI development — Werner Vogels surveys four recent SageMaker AI capabilities released to remove friction points that ML builders kept hitting in production: (1) SSH-over-SSM tunneling via a new…
Cloudflare — Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives — Cloudflare reports on Perplexity AI fetching content from origins that have explicitly disallowed Perplexity's declared crawlers…
Instacart — Scaling Catalog Attribute Extraction with Multi-modal LLMs (PARSE) — raw: raw/instacart/2025-08-01-scaling-catalog-attribute-extraction-with-multi-modal-llms-539b02e6.md tags: [instacart, parse, attribute-extraction, catalog,…
Netflix — Linux Performance Analysis in 60,000 Milliseconds — Netflix Performance Engineering (Brendan Gregg + team) publishes a 10-command, 60-second triage checklist for the first minute of any Linux production performance investigation.…
Google Research — Simulating large systems with Regression Language Models — Google Research post (2025-07-29) proposing text-to-text regression with language models as a general, feature-engineering-free path to numeric prediction over complex,…
Instacart — Introducing PIXEL: Instacart's Unified Image Generation Platform — raw: raw/instacart/2025-07-17-introducing-pixel-instacarts-unified-image-generation-platfo-2f29968c.md tags: [instacart, image-generation, generative-ai,…
Google Research — Android Earthquake Alerts: A global system for early warning — Google Research post (2025-07-17) on the Android Earthquake Alerts System (AEA) — a planet-scale earthquake early-warning (EEW) system that uses the Android fleet as a distributed…
Datadog — How we tracked down a Go 1.24 memory regression across hundreds of pods — Datadog rolled Go 1.24 to a data-processing service across hundreds of Kubernetes pods and observed a ~20% RSS increase that did not appear in Go's own runtime metrics.…
Cloudflare 1.1.1.1 incident on July 14, 2025 — Cloudflare post-mortem on the 62-minute global outage of the 1.1.1.1 public DNS Resolver on 2025-07-14, 21:52 – 22:54 UTC.…
AWS — Introducing Amazon S3 Vectors: First cloud storage with native vector support at scale (preview) — Channy Yun (AWS News Blog, 2025-07-16) announces the preview of Amazon S3 Vectors: a new first-class S3 data primitive for storing and querying vector similarity indices as native…
Yelp — Exploring CHAOS: Building a Backend for Server-Driven UI — Yelp Engineering post (2025-07-08) that unpacks the backend of CHAOS, Yelp's internal SDUI framework. A companion to their earlier 2024-03 post introducing CHAOS;…
PlanetScale — Caching — Ben Dicken (PlanetScale) pedagogical deep-dive on caching as "the most elegant, powerful, and pervasive innovation in computing" — the core principle across CPU L1/L2/L3, RAM,…
Netflix — AV1 @ Scale: Film Grain Synthesis, The Awakening — Netflix's Video Algorithms team describes the global rollout of AV1 Film Grain Synthesis (FGS) on the streaming service. FGS has been part of the AV1 standard since inception;…
Cloudflare: Introducing pay per crawl — Enabling content owners to charge AI crawlers for access — Cloudflare announces Pay Per Crawl (private beta, 2025-07-01), a framework that lets publishers monetize AI-crawler access to their content at internet scale by reviving…
Google Research — How we created HOV-specific ETAs in Google Maps — Google Research post (2025-06-30) announcing a Google Maps feature: HOV-specific ETA predictions on routes that include high-occupancy vehicle (HOV) lanes (carpool lanes).…
Zalando — Building a Dynamic Inventory Optimisation System: A Deep Dive — Zalando (2025-06-29) documents the architecture of ZEOS's AI-driven replenishment-recommendation system — a two-stage machine-learning pipeline that produces probabilistic weekly…
Redpanda — Why streaming is the backbone for AI-native data platforms — Redpanda thought-leadership piece (unsigned; originally syndicated to The New Stack) arguing that the defining architectural property of an AI-native data platform is real-time…
Redpanda — Behind the scenes: Redpanda Cloud's response to the GCP outage — Redpanda (unsigned, 2025-06-21) publishes a production-incident retrospective of the 2025-06-12 Google Cloud Platform global outage…
Fly.io — Phoenix.new: The Remote AI Runtime for Phoenix — Chris McCord (creator of Elixir's Phoenix framework) introduces Phoenix.new — a "batteries-included fully-online coding agent tailored to Elixir and Phoenix"…
Defending the Internet: how Cloudflare blocked a monumental 7.3 Tbps DDoS attack — Cloudflare recounts autonomously blocking a 7.3 Tbps / 4.8 Bpps DDoS attack — the largest ever reported — against a hosting-provider customer using Magic Transit.…
Redpanda — Introducing multi-language dynamic plugins for Redpanda Connect — Launch post for dynamic plugins in Redpanda Connect v4.56.0 (Beta, Apache 2.0). The feature breaks the previous Go-only,…
Netflix — Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix — Netflix's Content Engineering org introduces UDA — Unified Data Architecture — an in-house knowledge-graph platform that sits between business concepts and the many data systems…
Instacart — Turbocharging Customer Support Chatbot Development with LLM-Based Automated Evaluation
MongoDB — Conformance Checking at MongoDB: Testing That Our Code Matches Our TLA+ Specs — A. Jesse Jiryu Davis's 2025 retrospective on the 2020 VLDB paper eXtreme Modelling in Practice…
Fly.io — parking_lot: ffffffffffffffff… — A Fly.io long-form debugging retrospective (2025-05-28, Thomas Ptacek, Tier 3) on a weeks-long hunt for why proxies in European regions — especially WAW (Warsaw)…
Yelp — Revenue Automation Series: Testing an Integration with Third-Party System — Yelp Engineering post (2025-05-27) by the Revenue Recognition Team — third post in the Revenue Automation Series after the 2024-12 billing-system modernisation and the 2025-02-19…
Just make it scale: An Aurora DSQL story (Werner Vogels, guest-authored by Niko Matsakis & Marc Bowes) — Werner Vogels hosts a guest post by Sr. Principal Engineers Niko Matsakis (a core Rust language designer) and Marc Bowes on the engineering journey of Aurora DSQL…
Redpanda — Implementing FIPS compliance in Redpanda — Redpanda (unsigned, 2025-05-20) publishes a configuration-walkthrough post on enabling FIPS 140-2 compliance mode in a self-managed Redpanda cluster on RHEL.…
Fly.io — Litestream: Revamped — Ben Johnson's retrospective on the biggest redesign of Litestream since its 2020 launch. Litestream's original design — opening a long-lived read transaction on SQLite,…
Launching MCP Servers on Fly.io — Short developer-blog post ("part showing off, part opinion") by Sam Ruby announcing fly mcp launch — a new flyctl subcommand (shipped in flyctl v0.3.125) that takes an existing…
Redpanda — Getting started with Iceberg Topics on Redpanda BYOC — Redpanda (2025-05-13) publishes a BYOC-customer setup walkthrough for Iceberg Topics, five weeks after the 25.1 GA disclosure…
GitHub Issues search now supports nested queries and boolean operators: Here's how we (re)built it — GitHub rewrote Issues search to support logical AND/OR operators and nested parentheses across all fields (e.g. is:issue state:open author:rileybroughten (type:Bug OR type:Epic)),…
Yelp — Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More — Four years after adopting their in-house Lucene-based search engine Nrtsearch in production (and >90% of Elasticsearch traffic migrated to it),…
Fly.io — Provisioning Machines using MCPs — Sam Ruby's short Fly.io developer-blog post (2025-05-07) marking the mutation transition of Fly.io's flyctl MCP server: the read-only two-tool prototype from a month earlier (30…
Redpanda — A guide to Redpanda on Kubernetes — Redpanda (unsigned, 2025-05-06) publishes a product-altitude guide to the evolution of Redpanda's Kubernetes deployment story — from an early Helm chart,…
Understanding transaction visibility in PostgreSQL clusters with read replicas — AWS's response to Jepsen's 2025-04-29 report on transaction visibility in Amazon RDS for PostgreSQL Multi-AZ clusters. AWS confirms Jepsen's finding but clarifies that the behavior…
Meta — Building Private Processing for AI tools on WhatsApp — A 2025-04-30 Meta Engineering Security post previewing Private Processing — the confidential-computing infrastructure WhatsApp is rolling out so users can invoke AI features…
Redpanda — Need for speed: 9 tips to supercharge Redpanda — A numbered checklist of nine performance-tuning tips for Redpanda clusters, framed around the infrastructure → data architecture → application design triad.…
Yelp — Journey to Zero Trust Access — Yelp Engineering post (2025-04-15) by Corporate Systems + Client Platform Engineering — the first-party narrative of why Yelp retired Ivanti Pulse Secure as its employee VPN…
Fly.io — 30 Minutes With MCP and flyctl — Thomas Ptacek's internal-message-turned-blog post on building the "most basic" MCP server for flyctl — flymcp — in 30 minutes.…
Netflix — How Netflix Accurately Attributes eBPF Flow Logs — Netflix describes how FlowCollector, the backend that consumes ~5M TCP flow-log records per second from per-host FlowExporter sidecars,…
Our Best Customers Are Now Robots — Fly.io developer-blog retrospective by Thomas Ptacek disclosing that over the past ~6 months the growth-driving users on the Fly.io platform are LLM-driven coding agents ("robots")…
Redpanda — Redpanda 25.1: Iceberg Topics now generally available — Redpanda's 2025-04-07 product post announces General Availability of Iceberg Topics on the 25.1 release — the first Kafka-API-compatible streaming broker with a broker-native…
Redpanda — Autonomy is the future of infrastructure — Alex Gallego's (Redpanda founder/CEO) vision essay marking Redpanda's 2025-04-03 $100M Series D + launch of the Redpanda Agents SDK (Python SDK + rpk connect mcp-server + rpk…
Netflix — Globalizing Productions with Netflix's Media Production Suite — Netflix's Studio Engineering team describes the Media Production Suite (MPS) — a set of cloud-based filmmaker tools inside Content Hub that moves original camera + sound media into…
Fly.io — Operationalizing Macaroons — Thomas Ptacek's 2025-03-27 retrospective written as Fly.io hands off internal ownership of the Macaroon stack to a new owner.…
Open-sourcing OpenPubkey SSH (OPKSSH): integrating single sign-on with SSH — Cloudflare announces the open-sourcing of OPKSSH (OpenPubkey SSH) under the OpenPubkey Linux Foundation project umbrella (Apache 2.0, github.com/openpubkey/opkssh).…
Redpanda — 3 powerful connectors for real-time change data capture — Redpanda (2025-03-18) publishes a product-altitude tour of the four CDC input connectors shipped with Redpanda Connect — the company's Kafka-Connect alternative…
Sign in as anyone: Bypassing SAML SSO authentication with parser differentials — Two GitHub Security Lab researchers (Peter Stöckli + an external bug-bounty participant, ahacker1) independently discover an authentication-bypass class in the ruby-saml library…
All Things Distributed: In S3 simplicity is table stakes (S3 at 19) — On S3's 19th birthday (Pi Day 2025), Andy Warfield (VP / Distinguished Engineer, S3) reframes what "simple" means for a storage system operating at hundreds of trillions of objects…
PlanetScale — IO devices and latency — PlanetScale's Ben Dicken publishes a pedagogical history of non-volatile storage devices — tape → HDD → SSD → cloud network-attached storage…
Meta — Strobelight: A profiling service built on open source technology — A 2025-01-21 Meta Engineering post (Production Engineering) describing Strobelight, Meta's fleet-wide profiling orchestrator…
Meta — A case for QLC SSDs in the data center — Meta's Data Center Engineering team makes the architectural case for QLC NAND flash as a new middle storage tier between HDD and TLC flash in hyperscale data centers.…
Fly.io — Taming a Voracious Rust Proxy — A Fly.io incident retrospective (2025-02-26, Tier 3) tracing a CPU-runaway + HTTP-error-spike incident on a couple of IAD edge servers to a TLS-close-notify state-machine bug…
Building and operating a pretty big storage system called S3 — Author: Andy Warfield (VP / Distinguished Engineer, S3), guest post hosted by Werner Vogels on All Things Distributed. Based on Warfield's USENIX FAST '23 keynote.
Zalando — LLM-powered migration of UI component libraries — Zalando's Partner Tech department (B2B applications for retail partners) had accumulated two distinct in-house UI component libraries across 15 sophisticated B2B applications.…
Yelp — Revenue Automation Series: Building Revenue Data Pipeline — Yelp Engineering post (2025-02-19) by the Commerce Platform / Financial Systems team — second in the Revenue Automation Series after the 2024-12 billing-system modernisation post.…
Zalando — Scaling Beyond Limits: Harnessing Route Server for a Stable Cluster — Zalando's platform team runs Skipper as the default Kubernetes Ingress proxy across 200 clusters with ~180 Skipper instances per cluster serving up to 2M requests/second against…
We Were Wrong About GPUs — Retrospective / course-correction post by Thomas Ptacek on Fly.io's 2022-era bet on productising GPU Fly Machines — Firecracker-shaped hardware-virtualized Fly Machines…
Fly.io — The Exit Interview: JP Phillips — Exit-interview blog post (2025-02-12) with JP Phillips, the engineer who led flyd — Fly.io's in-house orchestrator for Fly Machines (Firecracker micro-VMs)…
Redpanda — High availability deployment: Multi-region stretch clusters — Part four of Redpanda's HA/DR series frames the multi-region stretch cluster — a single Redpanda cluster whose brokers are distributed across two or more cloud regions (or…
Fly.io — VSCode's SSH Agent Is Bananas — Fly.io's 2025-02-07 opinion post on VSCode Remote-SSH's architecture viewed as a security posture, from the vantage point of wiring Fly Machines into the VSCode remote-editing…
Yelp — Search query understanding with LLMs: from ideation to production — Yelp Engineering post (2025-02-04) — the canonical first-party disclosure of how Yelp productionised LLMs for search query understanding tasks (segmentation, spell correction,…
Datadog — Husky: Efficient compaction at Datadog scale — Third post in Datadog's Husky series (after introducing-husky and the deep-dive on ingestion). Husky is an observability event store layered over object storage (S3 / GCS / Azure…
Redpanda — Implementing the Medallion Architecture with Redpanda — Redpanda (2025-01-21) publishes a pedagogy-altitude explainer on the Medallion Architecture — Databricks' three-tier Bronze / Silver / Gold data-storage pattern for data lakes…
AWS — Migrating from AWS App Mesh to Amazon ECS Service Connect (App Mesh discontinuation announcement) — AWS announces the end-of-life of AWS App Mesh (discontinued 2026-09-30; no new customers as of 2024-09-24; existing customers can still use the service until EOL,…
Scaling Large Language Models for e-Commerce: The Development of a Llama-Based Customized LLM — eBay's 2025-01-17 post describes the training-infrastructure and data-mix design behind e-Llama — 8-billion and 70-billion parameter LLMs adapted from Meta's Llama 3.1 base models…
Shopify — Five years of React Native at Shopify — Mustafa Ali (Director of Engineering at Shopify) publishes a five-year retrospective on Shopify's 2020 commitment to React Native (RN) as the future of mobile at Shopify,…
Google Research — Extra, Extra — Read All About It: Nearly All Binary Searches and Mergesorts are Broken (2006, republished 2025-01-11) — Joshua Bloch's 2006 Google Research blog post — republished on the Google Research blog 2025-01-11 (164 HN points) — reports that the standard textbook binary search,…
Slack — Automated Accessibility Testing at Slack — Slack Frontend Test Frameworks team retrospective on a 2022-initiated project to add automated accessibility checks to Slack's desktop test infrastructure using Axe Core integrated…
Netflix — Cloud Efficiency at Netflix — Netflix's Platform DSE (Data Science Engineering) team describes the internal data platform that powers cost-and-ownership attribution across Netflix's AWS footprint.…
Meta — Indexing code at scale with Glean — Meta Engineering post (2024-12-19, syndicated to the raw corpus with a 2025-01-01 published date; 132 HN points) on Glean, Meta's open-source code-indexing system.…
Meta — Translating 10M lines of Java to Kotlin — Meta Engineering post (2024-12-18) on the multi-year effort to translate the entire Android codebase at Meta from Java to Kotlin…
Faster continuous integration builds at Canva — Canva's Developer Platform group cut the average PR-to-merge CI time from ~80 min (Apr 2022, trending toward 1–3 h) to <30 min (sometimes 15 min) over ~2 years.…
Stripe — The secret life of DNS packets: investigating complex networks — Stripe's 2024-12-12 incident-investigation retrospective on an hourly spike of DNS SERVFAIL responses for a small percentage of internal requests.…
Canva: The science of routing print orders — Canva's Print team built a configurable rule-driven routing engine that decides which supplier in the global print network produces each item in a checkout cart,…
Prevent factual errors from LLM hallucinations with mathematically sound Automated Reasoning checks (preview) — The preview-launch announcement for Automated Reasoning checks as a new safeguard in Amazon Bedrock Guardrails (AWS News Blog, Antje Barth, 2024-12-04, US West Oregon preview).…
Redpanda — Redpanda 24.3 extends lakehouses with streaming data & CDC — Redpanda's 2024-12-03 24.3 release roundup is the beta announcement of six features that later became the canonical building blocks of the 2025 wiki coverage of Redpanda.…
Meta — How Meta built large-scale cryptographic monitoring — A 2024-12-02 Meta Engineering post (authored by the CryptoEng team — Grace Wu, Ilya Maykov, Srinivas Murri, Isaac Elbaz acknowledged) describing the telemetry system underneath…
Redpanda — Batch tuning in Redpanda to optimize performance (part 2) — Part 2 of James Kinley's two-part batch-tuning series for Redpanda (and, by Kafka-API compatibility, Apache Kafka). Where part 1 built the first-principles framework…
Redpanda — Batch tuning in Redpanda for optimized performance (part 1) — Part 1 of a two-part first-principles explainer on producer-side batching for Redpanda (and, by Kafka-API compatibility, Apache Kafka) streaming brokers.…
All Things Distributed: AWS Lambda turns 10 — a rare look at the PR/FAQ that started it — Werner Vogels publishes the (lightly edited) internal PR/FAQ that launched AWS Lambda in 2014, with 2024-era annotations showing which ideas shipped as originally written,…
Netflix — Netflix's Distributed Counter Abstraction — Netflix introduces the Distributed Counter Abstraction — a counting service built on top of the TimeSeries Abstraction and deployed via the Data Gateway Control Plane.…
What's new with Robinhood, our in-house load balancing service — Robinhood is Dropbox's in-house internal-traffic load balancing service, deployed since 2020 and rebuilt in 2023 around PID-controller-driven feedback-control load balancing over…
Meta — Meta's open AI hardware vision — Meta's 2024-10-15 post — timed to the Open Compute Project (OCP) Global Summit 2024 — announces the next generation of Meta's AI-hardware stack and contributes the designs to OCP.…
AI GPU Clusters, From Your Laptop, With Livebook — Fly.io's 2024-09-24 recap of Chris McCord's and Chris Grainger's ElixirConf 2024 keynote on Livebook + FLAME + the Nx stack…
Netflix — Introducing Netflix's Key-Value Data Abstraction Layer — Netflix introduces the Key-Value (KV) Data Abstraction Layer (DAL) — the most mature of several abstraction services built on top of Netflix's Data Gateway Platform.…
Zalando — Content Creation Copilot: AI-assisted product onboarding — Zalando (2024-09-17) documents the architecture and early production results of its Content Creation Copilot — an internal system that auto-generates product-attribute suggestions…
Lyft — Protocol Buffer Design: Principles and Practices for Collaborative Development — Roman Kotenko (Lyft Media) distils Lyft Media's operational experience designing shared proto3 protobuf schemas across mobile (iOS / Android) and backend (Python) teams into…
Netflix — Noisy Neighbor Detection with eBPF — Netflix describes a per-container run queue latency monitor built on eBPF and attached to two Linux scheduler tracepoints (sched_wakeup + sched_switch).…
Meta — Sapling: Source control that's user-friendly and scalable — A 2022-11-15 Meta Engineering post (fetched 2024-09-10 into the wiki's raw corpus) announcing the open-sourcing of the Sapling client…
Cloudflare — A good day to trie-hard: saving compute 1% at a time — Cloudflare's pingora-origin service — the last Rust-proxy hop before a request leaves Cloudflare's CDN for the customer origin…
Meta — How Meta enforces purpose limitation via Privacy Aware Infrastructure at scale — A 2024-08-31 Meta Engineering post (accompanying Meta's PEPR 2024 presentation) describing Privacy Aware Infrastructure (PAI)…
Slack — Unified Grid: How We Re-Architected Slack for Our Largest Customers — Slack's Unified Grid project (2021–March 2024) replaced the workspace-scoped client and backend architecture that had anchored the product since 2013 with an org-wide architecture…
Google Research — SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL — Google's research paper (published 2024-08-24, surfaced on Hacker News at 308 points) describes piped data-flow syntax added natively to GoogleSQL…
Meta — Leveraging AI for efficient incident response — A Meta Engineering post describing the AI-assisted root-cause analysis (RCA) system Meta uses during investigations of reliability issues in its web monorepo.…
Continuous reinvention: A brief history of block storage at AWS (Marc Olson, guest post on Werner Vogels' blog) — Marc Olson, a ~13-year veteran of the EBS team, narrates EBS's arc from a 2008 HDD-backed shared-disk service into a distributed SSD fleet doing >140 trillion operations/day,…
We're Cutting L40S Prices In Half — Pricing-announcement post for NVIDIA L40S GPUs on Fly.io, cut to $1.25 / hour — the same price as the A10. The substance under the pricing headline is a customer-data-driven…
Figma: How We Migrated onto K8s in Less Than 12 Months — Figma migrated its core compute platform from AWS ECS on EC2 to AWS EKS (Kubernetes) in under 12 months (Q1 2023 plan → January 2024 majority-cutover) with only minor incidents…
Meta — DCPerf: An open source benchmark suite for hyperscale compute applications — Meta open-sourced DCPerf — a benchmark suite where each benchmark is designed by referencing a large Meta production application,…
Meta — A RoCE network for distributed AI training at scale (SIGCOMM 2024) — The SIGCOMM-2024 companion to Meta's 2024-06-12 training overview: an engineering deep-dive on the RoCE (RDMA over Converged Ethernet v2) backend fabric Meta built over four years…
Vercel — How Google Handles JavaScript Throughout the Indexing Process — Vercel + MERJ's 2024-08-01 joint empirical study of Googlebot's rendering behaviour on nextjs.org (with supplemental data from monogram.io and basement.io) over April 2024.…
Segment — $0.6M/year savings by using S3 for change-data-capture for DynamoDB — Twilio Segment (2024-08-01) posts a cost-and-consolidation retrospective on their objects pipeline — the service that stores current state of every Segment object in a ~1 PetaByte…
Fly.io — Making Machines Move — Fly.io's 2024-07-30 engineering post on the year-long rebuild of their fleet-drain capability for stateful Fly Machines — i.e. machines with Fly Volumes (locally-attached NVMe).…
Netflix — Java 21 Virtual Threads: Dude, Where's My Lock? — A Netflix TechBlog post (2024-07-29, Tier 1; Vadim Filanovsky, Mike Huang, Danny Thomas, Martin Chalupa of Netflix's Performance Engineering + JVM Ecosystem teams) documenting…
Amazon's Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2 — Amazon Retail's Business Data Technologies (BDT) team is in the middle of migrating the largest production business-intelligence datasets in Amazon off Apache Spark onto Ray.…
Netflix — Supporting Diverse ML Systems at Netflix — Netflix's Machine Learning Platform (MLP) team describes how Metaflow — the open-source ML framework they started — is integrated with Netflix's internal production stack…
Netflix — Maestro: Netflix's Workflow Orchestrator — Netflix open-sources Maestro, the horizontally scalable workflow orchestrator that runs hundreds of thousands of workflows, launches thousands of workflow instances daily,…
Slack — AI-Powered Conversion from Enzyme to React Testing Library at Slack — Slack's Frontend Test Frameworks team retrospective on migrating 15,000+ frontend Enzyme tests to React Testing Library (RTL) as part of a React 18 upgrade…
Fly.io — AWS without Access Keys — Fly.io's 2024-06-19 post (oidc-cloud-roles) on giving a Fly Machine access to an AWS S3 bucket without ever minting an AWS keypair.…
Meta — Maintaining large-scale AI capacity at Meta — A Meta Production Engineering post describing how Meta maintains — patches, upgrades, verifies — the GPU training fleet that runs "thousands of training jobs every day…
Meta — MLow: Meta's low bitrate audio codec — Meta Engineering's 2024-06-13 post announces MLow (Meta Low Bitrate), a new real-time communication (RTC) audio codec shipped across WhatsApp, Messenger, and Instagram calls.…
Meta — How Meta trains large language models at scale — Meta Engineering's 2024-06-12 post describes the infrastructure shift from training many small recommender models on many GPUs to training few,…
Dropbox — Testing sync at Dropbox (2020) — Isaac Goldberg's (Dropbox) walkthrough of the testing strategy that allowed the team to rewrite Sync Engine Classic into Nucleus…
Dynamic loading of real-time content at Figma — Figma extended its per-page dynamic loading system — already used by viewers and prototypes, which only read file contents — to editors, which also write.…
Pinterest — HBase Deprecation at Pinterest — Alberto Ordonez Pereira (Sr. Staff SWE) + Lianghong Xu (Sr. Manager, Engineering) publish Part 1 of a 3-part Pinterest Engineering series (2024-05-14) on the decade-long arc…
High Scalability — Kafka 101 — Long-form explainer by Stanislav Kozlovski (Apache Kafka committer, guest author for High Scalability, 2024-05-09) distilling Apache Kafka…
Google Research — VideoPrism: A foundational visual encoder for video understanding — Google Research introduces VideoPrism, a video foundation model (ViFM) — a single visual encoder pre-trained once and then frozen for downstream use across 30 of 33…
Picture This: Open Source AI for Image Description — Fly.io developer-facing post by Nolan (Fly Machines team) walking through a weekend-scale open-source image-description service built from Ollama (serving LLaVA,…
Figma's journey to TypeScript — compiling away our custom programming language — Figma migrated the entire Skew codebase underlying its prototype viewer and mobile client to TypeScript via a custom Skew-to-TypeScript transpiler (not a conventional rewrite),…
Canva — Scaling to Count Billions — Canva's Creators-payment pipeline counts billions of content-usage events per month (templates, images, videos) to pay creators.…
Figma — Speeding Up C++ Build Times — Figma's Core team retrospective on cutting C++ cold-build times ~50% after a year in which the codebase grew 10% but build times grew 50%.…
High Scalability — Capturing a billion emo(j)i-ons — Dedeepya Bonthu's repost (from her Medium original) describes how Hotstar (India's live-sports OTT platform, later merged into JioCinema) built the in-house real-time emoji swarm…
High Scalability — Brief History of Scaling Uber — Retrospective by Josh Clemm (Senior Director of Engineering, Uber Eats — originally posted to LinkedIn, republished on High Scalability,…
Fly.io — JIT WireGuard — Fly.io's 2024-03-12 post on replacing push-based WireGuard peer provisioning with a Just-In-Time (JIT) pull-on-first-packet model on their fleet of gateway servers.…
Fly.io — Fly Kubernetes does more now (FKS beta) — Fly.io announces the beta of Fly Kubernetes (FKS) — their "blessed path" managed Kubernetes service. Architecturally, FKS is not a conventional managed-K8s offering (no Nodes,…
High Scalability — Behind AWS S3's Massive Scale — Third-party explainer by Stanislav Kozlovski (Apache Kafka committer, writing as a guest for High Scalability, 2024-03-06) distilling AWS's public material on Amazon S3 into…
Fly.io — Globally Distributed Object Storage with Tigris — Fly.io's 2024-02-15 public-beta announcement for Tigris, a third-party globally distributed, S3-compatible object store integrated into Fly.io as the fly storage create primitive.…
Zalando — Tale of 'metadpata': the revenge of the supertools — Zalando's 2024-01-22 post (author Adrian Chifor, Principal Engineer on the cloud infrastructure team) is a DNS-outage postmortem from November 2022,…
Zalando — Patching the PostgreSQL JDBC Driver — Zalando's 2023-11-08 engineering post (authored by the team running their internal Postgres-sourced event-streaming platform) diagnoses a long-standing runaway-WAL-growth bug…
Zalando — Understanding GraphQL Directives: Practical Use-Cases at Zalando — A 2023 deep-dive by Zalando's UBFF team (author: Boopathi Rajaa) walking through the full taxonomy of GraphQL directives in production at Europe's largest fashion e-commerce…
High Scalability — The Swedbank Outage shows that Change Controls don't work — A 2023-08-16 opinion-analysis piece on High Scalability (authored by a Kosli engineer, republished on Hoff's blog) using the April 2022 Swedbank outage and its SEK 850M (~USD 85M)…
High Scalability — Lessons Learned Running Presto at Meta Scale — A Meta-authored operational retrospective (Neerad Somanchi, Production Engineer; Philip Bell, Developer Advocate) published by High Scalability,…
High Scalability — Gossip protocol explained — A canonical textbook-style explainer of the gossip protocol (a.k.a. epidemic protocol) as the peer-to-peer alternative to centralized state-management in large distributed systems…
Zalando — Rendering Engine Tales: Road to Concurrent React — Zalando's Rendering Engine (RE), the universal-rendering framework that serves the Fashion Store website, is being migrated to React 18's concurrent rendering.…
High Scalability — Consistent hashing algorithm — A canonical textbook-style explainer of consistent hashing as a cache / storage partitioning primitive, republished by High Scalability from systemdesign.one.…
High Scalability — Stuff The Internet Says On Scalability For December 2nd, 2022 — Todd Hoff's weekly curated link roundup on highscalability.com for the week of 2022-12-02. The companion piece to the earlier-ingested July 11, 2022 roundup,…
High Scalability — Stuff The Internet Says On Scalability For July 11th, 2022 — Todd Hoff's weekly curated link roundup on highscalability.com for the week ending 2022-07-11 — the first High Scalability article ingested into this wiki and a representative…
Zalando — Operation Based SLOs — João Oliveirinha (Zalando SRE, 2022-04-27) publishes the technical deep-dive companion to the 2021-09-20 Tracing SRE's Journey — Part II retrospective.…
Zalando — Zalando's Machine Learning Platform — Zalando's ML Platform team publishes an overview of the full ML practitioner stack serving recommender-system, size-recommendation,…
Zalando — Tracing SRE's journey in Zalando - Part III — Third and final installment of Zalando's SRE retrospective (Koutsiaris + Koutsouraki, 2021-10-14), covering the 2020 transition from a single SRE team to an SRE department.…
Zalando — Micro Frontends: from Fragments to Renderers (Part 1) — Jeremy Chone and co-authors describe Zalando's second micro-frontend generation: the move from Fragment-based Project Mosaic (2015) to the entity-based Interface Framework (IF;…
How Figma's multiplayer technology works — Figma's 2019 post (republished 2025-08-16 on HN, surfacing 6 years after original publication) is the foundational description of Multiplayer's architecture and design-choice…