Meta — Fighting spam with Haskell¶
Summary¶
A 2015-06-26 Meta Engineering post (author Simon Marlow,
GHC core developer) describing Meta's two-year rewrite of
Sigma — the rule engine that proactively
identifies spam, phishing, and malware on Facebook — from its
purpose-built in-house DSL FXL to
Haskell. Post-rewrite Sigma serves >1M rps
in production. The article is the canonical statement of Meta's
five requirements for a policy-authoring language — purely
functional + strongly typed, automatic data-fetch batching /
concurrency, minutes-to-production code deploys, raw
performance, interactive development — and how Haskell plus
Meta-authored GHC extensions met them. The central technical
contribution, beyond the migration itself, is Haxl
(open-sourced on GitHub; ICFP 2014
paper "There is no fork") — a Haskell framework for
implicit concurrent data
fetching — and the associated GHC language feature Applicative
do-notation, which the compiler uses to automatically rearrange
imperative-looking policy code into batched + overlapping fetches
without the policy author writing explicit concurrency. The post
also catalogues Meta's GHC upstream contributions: heap-management
changes reducing GC frequency on multicore, per-thread
allocation limits that safely terminate
runaway requests via asynchronous exceptions,
a garbage-collector fix to support safe unloading of hot-swapped
compiled code, and a finalizer-lifecycle fix for clean shutdowns. Two
operational framings deserve wiki canonicalisation: the
"source code in the repository is the code running in Sigma"
continuous-deployment posture (minutes from commit to fleet) and
Haskell sandwiched between two layers of C++ (mature Thrift server
above; existing C++ service clients below, wrapped as
Haxl data sources via FFI) — a pragmatic
embedded
functional runtime pattern distinct from "rewrite the whole stack in
Haskell." Performance headline: 20–30% throughput improvement over
FXL on a typical workload, up to 3× on specific request types,
enabled in part by Meta-initiated bug fixes (a years-old GC bug
causing crashes every few hours
and a nasty aeson JSON-parsing bug
whose corner cases "tend to crop up all the time" at Facebook scale).
Key takeaways¶
- Sigma is a rule engine for proactive abuse detection, not a post-hoc classifier. Every user interaction on Facebook — status updates, likes, clicks, messages — triggers Sigma to evaluate a set of policies specific to that interaction type before the interaction takes effect. "Bad content detected by Sigma is removed automatically so that it doesn't show up in your News Feed" — the engine is in the request path and must respond before the action completes. Post-rewrite Sigma serves more than one million requests per second (Source; systems/sigma-meta).
- Policies are continuously deployed from the repository. The operational invariant is explicit: "At all times, the source code in the repository is the code running in Sigma, allowing us to move quickly to deploy policies in response to new abuses." This directly motivates the language requirements — type-correctness as a gate ("we don't allow code to be checked into the repository unless it is type-correct") and minutes-scale code deploys as a non-negotiable (Source; patterns/rule-engine-with-continuous-policy-deploy).
- Five requirements drove the language choice. (1) Purely functional + strongly typed — policies can't interact, can't crash Sigma, are isolable for unit testing; (2) automatic batching + overlapping of data fetches — most policies fetch external data; concurrency must be implicit so policy authors don't have to reason about it; (3) push code to production in minutes; (4) performance competitive with C++ (FXL's slow interpreter forced perf-critical logic into C++ in Sigma itself, slowing roll-out); (5) interactive development against real data. Haskell met (1), (3)–(5) out of the box; (2) required Meta to build Haxl + contribute Applicative do-notation to GHC (Source; concepts/purely-functional-policy-language).
- FXL's rejection is a load-bearing datum. The predecessor DSL "was not ideal for expressing the growing scale and complexity of Facebook policies. It lacked certain abstraction facilities, such as user-defined data types and modules, and its implementation, based on an interpreter, was slower than we wanted." Rather than improve FXL, Meta migrated. Canonical wiki cautionary tale on when in-house DSLs stop paying their cost — complexity growth outruns the expressivity budget of the DSL, and interpreter performance caps hardware utilisation (Source; systems/fxl-meta).
- Haxl: implicit concurrent data fetching as a framework. Haxl (GitHub; ICFP 2014 paper) lets a Haskell program express multiple data fetches as a pure functional computation; the framework automatically batches calls to the same data source and overlaps calls to distinct data sources, with the programmer writing no explicit concurrency constructs. Meta's motivation is that Haskell's native concurrency primitives are explicit (forkIO, MVar), which is the wrong abstraction layer for authors of anti-abuse policies whose job is spam logic, not scheduling (Source; systems/haxl, concepts/implicit-concurrent-data-fetching).
- Applicative do-notation: the compiler half of Haxl. For Haxl's batching/overlapping to work on statement sequences that look imperative, the compiler must recognise which statements are genuinely sequential (a later statement uses an earlier statement's result) vs. independent (and therefore parallelisable). Meta "designed and implemented Applicative do-notation in GHC" (GHC wiki: ApplicativeDo) to do this rearrangement automatically at compile time. Canonical wiki instance of compiler-language co-design in service of a production-system concurrency property (Source; systems/ghc, concepts/implicit-concurrent-data-fetching).
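The batching/overlap property that ApplicativeDo gives Haxl can be shown with a toy, base-only stand-in (this is not Haxl's actual API — Fetch, fetch, and twoFriendCounts are illustrative). The type only counts "rounds" of fetching: independent statements desugar to <*> and share a round; dependent statements use >>= and add rounds.

```haskell
{-# LANGUAGE ApplicativeDo #-}

-- Toy stand-in for Haxl's Fetch monad: tracks how many rounds of data
-- fetching a computation needs, alongside its result.
newtype Fetch a = Fetch { runFetch :: (Int, a) }

instance Functor Fetch where
  fmap f (Fetch (n, a)) = Fetch (n, f a)

instance Applicative Fetch where
  pure a = Fetch (0, a)
  -- Independent fetches overlap: rounds take the max, they don't add.
  Fetch (m, f) <*> Fetch (n, a) = Fetch (max m n, f a)

instance Monad Fetch where
  -- Dependent fetches are sequential: rounds add up.
  Fetch (m, a) >>= k = let Fetch (n, b) = k a in Fetch (m + n, b)

-- One round per primitive fetch.
fetch :: a -> Fetch a
fetch a = Fetch (1, a)

-- Imperative-looking policy code: the two fetches are independent, so
-- ApplicativeDo desugars this with <*> and both land in a single round.
twoFriendCounts :: Fetch Int
twoFriendCounts = do
  x <- fetch 10
  y <- fetch 20
  pure (x + y)

main :: IO ()
main = print (runFetch twoFriendCounts)  -- (1,30): one round, not two
```

Without the ApplicativeDo pragma the same do-block desugars with >>= and costs two rounds — which is exactly the rearrangement the post credits the compiler with doing automatically.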
- Hot-code swapping of compiled policies. Sigma loads freshly compiled policy code into a running process, serves new requests on the new code, and lets in-flight requests finish on the old code before discarding it. Three primitives make it work: short-lived requests so no switch-during-request is needed, persistent-state code is never changed so state-layer invariants hold, and the GHC garbage collector detects when old code is no longer in use and triggers safe unloading. Loading uses GHC's built-in runtime linker (principle: "we could use the system dynamic linker"). Canonical wiki instance of GC-assisted safe unload of hot-swapped code (Source; concepts/hot-code-swapping).
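The swap discipline described above can be sketched with a mutable code pointer (a deliberately simplified, base-only sketch — Policy, serveRequest, and swapPolicy are illustrative names; real Sigma swaps whole compiled object-code segments via GHC's runtime linker, with the GC detecting when old code is unreachable):

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.List (isInfixOf)

-- Stand-in for a loaded, compiled policy module.
type Policy = String -> Bool

-- Each short-lived request snapshots the policy at entry and runs to
-- completion on that version, even if a swap happens mid-flight.
serveRequest :: IORef Policy -> String -> IO Bool
serveRequest ref input = do
  policy <- readIORef ref  -- in-flight requests keep the old code
  pure (policy input)

-- A swap installs new code for *new* requests only; the old closure
-- becomes garbage once the last in-flight request finishes.
swapPolicy :: IORef Policy -> Policy -> IO ()
swapPolicy = writeIORef

main :: IO ()
main = do
  ref <- newIORef (const False)        -- v1: flag nothing
  r1  <- serveRequest ref "spammy"
  swapPolicy ref ("spam" `isInfixOf`)  -- hot swap to v2: flag spam
  r2  <- serveRequest ref "spammy"
  print (r1, r2)                       -- (False,True)
```

The three enabling conditions from the post map directly onto this sketch: requests are short-lived (so no request straddles a swap), state-layer code never changes (only the Policy value is replaced), and reachability-based GC decides when the old version can be discarded.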
- Architecture: Haskell sandwiched between two C++ layers. Above: the mature C++ Thrift server handles the request frontend ("in principle, Haskell can act as a thrift server, but the C++ thrift server is more mature and performant"). Below: existing C++ service clients for Facebook-internal services are wrapped as Haxl data sources via Haskell's Foreign Function Interface rather than rewritten. A compile-time C++ name-demangler "avoid[s] the intermediate C layer" for most calls, since Haskell's FFI is designed to call C, not C++. Canonical wiki instance of the embedded-functional-runtime-in-a-C++-service pattern — get Haskell's safety + concurrency model where it matters, without rewriting Thrift transport or client libraries (Source).
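The FFI half of the sandwich is mechanically simple; the hard part the post describes is reaching C++ symbols (hence the name demangler), but the binding itself looks like binding any C symbol. A minimal sketch, with libc's sqrt standing in for a wrapped service-client call:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CDouble (..))

-- Haskell's FFI targets the C calling convention. C++ service clients
-- are reached either through extern "C" shims or, as in the post,
-- symbols resolved by a compile-time C++ name demangler. Here a plain
-- libc function stands in for such a client:
foreign import ccall unsafe "math.h sqrt"
  c_sqrt :: CDouble -> CDouble

main :: IO ()
main = print (c_sqrt 9.0)  -- 3.0
```

In the Haxl-data-source framing, each such imported client call gets wrapped in a data-source implementation so the framework can batch and overlap it like any other fetch.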
- Performance: 20–30% throughput, up to 3× on specific types. Measured on the 25 most common Sigma request types (≈95% of typical workload), Haskell was up to 3× faster than FXL on certain requests, with a 20–30% overall throughput improvement on a typical workload mix — "we can serve 20 percent to 30 percent more traffic with the same hardware". Achieved via per-request automatic memoization of top-level computations via source-to-source translation (shared values computed once even when referenced by multiple policies), heap-management changes to GHC reducing GC frequency on multicore (Meta runs ≥ 64 MB allocation area per core, up from GHC's frugal defaults), selective FFI marshaling (fetch only the fields a policy needs), and a fix for a one-in-a-million aeson JSON-parsing bug that "tend[s] to crop up all the time" at Facebook scale (Source).
- Allocation limits: per-thread memory caps for resource isolation. Meta added allocation limits to GHC — a cap on the memory a single thread can allocate before it is terminated. When a pathological request consumes outsized memory ("pathological performance of an algorithm on certain rare inputs"), the runtime asynchronously terminates only that request — resources (network connections, etc.) are released safely through Haskell's asynchronous-exception machinery, and other in-flight requests on the same machine are unaffected. A graph in the post "tracks the maximum live memory across various groups of machines in the Sigma fleet" — large live-memory spikes disappear after allocation-limit rollout. Canonical wiki instance of a per-thread allocation cap as cooperative scheduler-level backpressure (Source; concepts/allocation-limit).
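This feature is now standard in GHC's base library (setAllocationCounter and enableAllocationLimit in GHC.Conc, upstream since base 4.8). A sketch of using it to contain one request's blast radius — the cap value, helper name, and pathological workload below are illustrative, not Meta's:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (AllocationLimitExceeded (..), catch)
import GHC.Conc (enableAllocationLimit, setAllocationCounter)

-- Run one "request" on its own thread under a per-thread allocation cap.
-- Exceeding the cap delivers an asynchronous AllocationLimitExceeded
-- exception to that thread only; other requests are unaffected, and
-- resources are released by the normal exception machinery.
runRequestWithCap :: Int -> IO () -> IO String
runRequestWithCap capBytes request = do
  done <- newEmptyMVar
  _ <- forkIO $ do
    setAllocationCounter (fromIntegral capBytes)
    enableAllocationLimit
    outcome <-
      (request >> pure "ok") `catch`
        \(_ :: AllocationLimitExceeded) -> pure "killed by allocation limit"
    putMVar done outcome
  takeMVar done

main :: IO ()
main = do
  -- Pathological request: allocates far beyond the 10 MB cap.
  r <- runRequestWithCap (10 * 1024 * 1024) $
         print (sum [1 .. (10 ^ 8 :: Integer)])
  putStrLn r
```

Note the cap is on cumulative allocation, not live memory — a cheap proxy that the runtime can check at every heap-allocation point, which is what makes it viable as request-scoped backpressure.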
- Interactive development via GHCi against production data sources. Facebook engineers develop policies "interactively, testing code against real data as they go" via a customised GHCi. To make this work Meta "had to make our build system link all the C++ dependencies of our code into a shared library that GHCi could load" and "customized the GHCi front end to implement some of our own commands and streamline the desired workflows." Result: "developers can load their code from source in a few seconds" and "test against real production data sources" (Source).
- Package management: Cabal → Stackage. Directly using packages from Hackage combined with Meta's internal build tools produced "a yak-shaving exercise involving a cascade of updates to other packages, often with an element of trial and error to find the right version combinations." The root cause: "the system of version dependencies in Cabal relies too much on package authors getting it right, which is hard to ensure." Meta switched to Stackage — a curated set of package versions "known to work together" — removing the trial-and-error step from dependency bumps. Canonical wiki datum on the cost of SemVer-reliant ecosystems vs. curated-set ecosystems (Source).
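In today's tooling the curated-set workflow amounts to pinning one Stackage snapshot for the whole build; a minimal stack.yaml sketch (snapshot name illustrative — pick a current LTS):

```yaml
# stack.yaml — pin every dependency to one Stackage snapshot whose
# package versions are curated to be "known to work together".
resolver: lts-22.0
packages:
  - .
```

Bumping dependencies then means moving to a newer snapshot as a unit, rather than solving per-package version bounds by trial and error.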
- GHC bug contributions. Two explicitly named upstream fixes: (a) a GC bug "that was causing our Sigma processes to crash every few hours. The bug had gone undetected in GHC for several years" — canonical wiki instance of long-latency compiler bugs flushed out by hyperscale workloads; (b) a finalizer-lifecycle fix for "crashes during process shutdown". Post-fix: "we haven't seen any crashes in either the Haskell runtime or the Haskell code itself across our whole fleet" (Source).
Architectural numbers + operational notes (from source)¶
- Sigma throughput: "more than one million requests per second" (post-rewrite, production).
- Workload profile: 25 most common request types ≈ 95% of typical workload (basis for the perf comparison).
- Perf delta vs FXL: up to 3× faster on specific request types; 20–30% overall throughput improvement on typical mix.
- GHC allocation-area size at Meta: ≥ 64 MB per core (vs. GHC's default — explicitly "frugal").
- Rewrite duration: "two-year-long major redesign" (covering FXL → Haskell migration end-to-end).
- Request characteristics used by hot-swap: short-lived requests (no need to switch running requests to new code).
- Language properties used: purely functional, strongly typed, mature optimising compiler (GHC), interactive environment (GHCi), rich library ecosystem (Hackage/Stackage), active developer community.
- Integration model: Haskell between two C++ layers — C++ Thrift server above; C++ service-client code wrapped as Haxl data sources via FFI below.
- FFI detail: compile-time C++ name demangler removes the need for C-shim layers for most C++ calls.
- Fleet-level diagnostic from the allocation-limits rollout: max-live-memory spikes on fleet-monitoring graphs disappear after enabling allocation limits on a request previously exhibiting resource-intensive outliers.
- aeson performance bug: disclosed only as "a nasty performance bug"; Meta cites Bryan O'Sullivan's post-fix write-up. Not quantified in the Meta post.
- No disclosures on: exact Meta GHC fork vs. upstream timeline, Sigma fleet size, per-machine request rate, specific policy counts, per-policy latency targets, cold-start / code-load latency numbers.
Systems extracted¶
New wiki pages:
- systems/sigma-meta — Meta's anti-abuse rule engine; 1M+ rps post-rewrite; runs on Haskell between two C++ layers.
- systems/haxl — Meta's Haskell framework for implicit concurrent data fetching with automatic batching + overlapping. Open-source on GitHub; ICFP 2014 paper.
- systems/ghc — the Glasgow Haskell Compiler and runtime, with Meta-contributed features (Applicative do-notation; heap-management changes; per-thread allocation limits; finalizer fix; GC fix for unloading hot-swapped code; GC fix for years-undetected crash).
- systems/haskell — the Haskell language, canonical-named here as the production-policy language at Meta Sigma.
- systems/stackage — the curated Haskell package set Meta moved to after Cabal/Hackage direct use produced version-dependency yak-shaves.
- systems/fxl-meta — the predecessor in-house Facebook DSL Haskell replaced at Sigma (interpreted; no user-defined types or modules; too slow).
Existing pages reinforced: none (this is the first Haskell / Meta-runtime source on the wiki).
Concepts + patterns extracted¶
New concept pages:
- concepts/hot-code-swapping — live-code-reload primitive with three enabling conditions (short-lived requests; persistent state's code never changes; GC-assisted detection of when old code is no longer in use so it can be safely unloaded). Meta's GC-detection implementation cited directly.
- concepts/implicit-concurrent-data-fetching — the abstraction Haxl + Applicative do-notation together realise: the programmer writes pure functional code; the framework + compiler together batch same-source fetches and overlap independent fetches, with no explicit concurrency constructs. Canonical industrial instance.
- concepts/allocation-limit — per-thread memory allocation cap enforced by the runtime. On cap-exceeded, an asynchronous exception terminates only the offending thread; other threads are unaffected; released resources (sockets, files) cleaned up by standard exception machinery. Cooperative-scheduler sibling of OS-level RLIMITs; request-scoped blast-radius containment.
- concepts/purely-functional-policy-language — language-level property set Meta explicitly ties to its operational posture: policies cannot interact with each other, cannot crash the engine, and are isolable for unit testing. Type-correctness gate in the repo. Canonical wiki statement of functional-programming-as-production-safety-primitive for rule engines.
Existing concept reinforced: none directly (the existing concepts/garbage-collection page scopes to storage-layer GC; Sigma's reachability-based GC use is runtime-level and is documented in concepts/hot-code-swapping itself).
New pattern pages:
- patterns/rule-engine-with-continuous-policy-deploy — source-in-repo is code-in-production posture for anti-abuse rule engines. Minutes from commit to fleet. Requires: type-correct-or-rejected repo gate, hot-code-swapping runtime, performance within the request-path budget, and interactive testing against production data. Canonical wiki instance = Meta Sigma.
- patterns/embedded-functional-runtime-in-cpp-service — Haskell (or any managed functional runtime) sandwiched between two C++ layers — mature C++ transport on top, existing C++ client libraries below wrapped as data sources via FFI. Get the safety + concurrency model of the functional runtime exactly where it's needed (the request-evaluation layer) without rewriting the surrounding ecosystem. Canonical wiki instance = Meta Sigma.
Caveats¶
- Dated 2015-06-26. Post content is specific to Sigma in 2015; Meta has not (as of 2024–2026 corpus) published a Sigma follow-up at comparable depth. The Haskell + Haxl foundations remain valid (Haxl is still on facebook/Haxl and the ICFP 2014 paper is canonical), but current Sigma scale, fleet size, and any subsequent language evolution are not disclosed in this source.
- No absolute scale beyond the 1M rps headline. Fleet size, per-machine throughput, memory footprints, cold-start-from-load-new-compiled-code latency, policy count, and per-policy latency targets are not disclosed.
- The "graph of live memory before/after allocation limits" in the original post is referenced verbatim but the graph itself is an image — no axis numbers transcribed into the post text, so the wiki cannot cite absolute MB or duration figures.
- The aeson bug is described qualitatively only ("one-in-a-million corner cases"); the Meta post defers to Bryan O'Sullivan's post-fix write-up for detail.
- Applicative do-notation implementation credit is Meta's ("we also designed and implemented Applicative do-notation in GHC"), but the feature is since upstream and standard in modern GHC — the wiki reflects the 2015 provenance.
- FXL is a stub — Meta never published a standalone FXL design paper; this source's one paragraph is the canonical description.
- Policy data model (how policies compose; how contradictory policies are handled; how policies are versioned across fleet) is not described in this post. The article is a language/runtime piece, not an engine-semantics piece.
Source¶
- Original: https://engineering.fb.com/2015/06/26/security/fighting-spam-with-haskell/
- Raw markdown:
raw/meta/2024-12-22-fighting-spam-with-haskell-at-meta-2015-890c8301.md
Related¶
- companies/meta — this is the earliest-published (2015) Meta Engineering post currently ingested; the language/runtime axis of Meta's stack, orthogonal to the 2023–2024 posts on data warehouse, GenAI training, privacy, source control, and fleet maintenance.
- systems/hhvm — Meta's canonical Hack/PHP VM. HHVM is to Meta's web tier what Sigma/Haskell is to anti-abuse: a purpose-specific managed runtime optimised for a dominant in-house workload. Different languages, adjacent pattern.