SYSTEM Cited by 2 sources
GPT-OSS
GPT-OSS is OpenAI's family of open-weight LLMs released in 2025, including Mixture-of-Experts variants. First wiki mention: sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix, where it is cited alongside Qwen3, Qwen3 MoE, and Gemma3 as one of the "modern architectures + Mixture-of-Experts variants" supported by Netflix's internal Post-Training Framework.
Hosted on Databricks FMAPI (2026-05-22)
GPT-OSS 20B and 120B are served on the Databricks Foundation Model APIs (FMAPI) with implicit prompt caching enabled. The 2026-05-22 Databricks announcement names GPT-OSS as the first open-source-model rollout of the prompt-caching capability and discloses the following numbers from one of Databricks' large-scale production batch-inference pipelines:
- +2.5× per-replica input-token throughput
- 3× P50 latency reduction
- 30% cache hit ratio (described as "relatively low")
The gap between a 30% hit ratio and a 2.5× throughput gain is explained by prefill-skip economics: a cache hit skips the prefill stage completely, so on prefill-dominated workloads even a modest hit rate yields large savings per hit (Source: sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models).
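The relationship between hit ratio and throughput can be made concrete with a toy model. The sketch below is illustrative only and is not taken from the Databricks announcement. It assumes a purely prefill-bound workload (decode cost ignored), treats the hit ratio as a fraction of requests, and uses hypothetical function names and prompt lengths. Under those assumptions, throughput scales as 1 / (1 - f), where f is the fraction of input tokens whose prefill is skipped.

```python
def skipped_prefill_fraction(hit_ratio, hit_tokens, miss_tokens, cached_share=1.0):
    """Fraction of all input tokens whose prefill is skipped.

    hit_ratio    -- fraction of requests that hit the cache (assumed request-level)
    hit_tokens   -- average prompt length of requests that hit (hypothetical)
    miss_tokens  -- average prompt length of requests that miss (hypothetical)
    cached_share -- share of a hit request's prompt served from cache
                    (1.0 models a hit that skips prefill completely)
    """
    saved = hit_ratio * hit_tokens * cached_share
    total = hit_ratio * hit_tokens + (1.0 - hit_ratio) * miss_tokens
    return saved / total


def prefill_bound_speedup(skipped_fraction):
    """Amdahl-style throughput multiplier when prefill is the only cost."""
    if not 0.0 <= skipped_fraction < 1.0:
        raise ValueError("skipped_fraction must be in [0, 1)")
    return 1.0 / (1.0 - skipped_fraction)


if __name__ == "__main__":
    # Hits and misses have equal prompt lengths: f = 0.3, speedup ≈ 1.43x.
    equal = skipped_prefill_fraction(0.3, hit_tokens=1000, miss_tokens=1000)
    print(f"equal lengths: f={equal:.2f}, speedup={prefill_bound_speedup(equal):.2f}x")

    # Hits concentrated on prompts 3.5x longer than misses: f = 0.6, speedup ≈ 2.5x.
    skewed = skipped_prefill_fraction(0.3, hit_tokens=3500, miss_tokens=1000)
    print(f"long-prompt hits: f={skewed:.2f}, speedup={prefill_bound_speedup(skewed):.2f}x")
```

In this model a 30% request-level hit ratio alone gives roughly 1.43×. A figure like 2.5× requires the hits to fall on longer-than-average prompts, so that they account for about 60% of input tokens. Whether that matches the production pipeline is not stated in the source.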
Related
- sources/2026-02-13-netflix-scaling-llm-post-training-at-netflix
- sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models
- systems/qwen
- systems/gemma
- systems/llama-3-1
- concepts/mixture-of-experts
- concepts/kv-cache
- concepts/implicit-prompt-caching
- systems/netflix-post-training-framework
- systems/databricks-foundation-model-api
- systems/databricks-fmapi-prompt-caching
- companies/databricks