DoLa (Decoding by Contrasting Layers)¶
DoLa (Decoding by Contrasting Layers, arXiv:2309.03883, code at voidism/DoLa) is a factuality-decoding method that improves LLM factual accuracy by contrasting the next-token distribution from a mature transformer layer against the distribution from an earlier layer, then decoding from the contrast rather than from the final layer alone. Before SLED, DoLa was the best-performing factuality-decoding method in the category (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
The DoLa and SLED lineage share a structural premise: intermediate transformer layers contain signal that the final layer sometimes overrides in favour of training-data frequency bias. DoLa extracts that signal via a pairwise contrast between one premature and one mature layer; SLED extracts it via a weighted average across all layers. This wiki entry is scoped to the context surfaced by the 2025-09-17 Google Research SLED post, which treats DoLa as its principal comparator.
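The shared primitive both methods build on (per-layer vocabulary logits, i.e. early-exit logits) can be sketched in a few lines. This is a toy illustration, not code from the DoLa or SLED repositories; the function and variable names are invented here.

```python
def early_exit_logits(hidden_states, lm_head):
    """Project each layer's hidden state through the shared LM head
    (the final projection matrix) to get per-layer vocabulary logits."""
    def matvec(m, v):
        return [sum(w * x for w, x in zip(row, v)) for row in m]
    return [matvec(lm_head, h) for h in hidden_states]

# Toy numbers only: 2 layers, d_model=2, vocab=3 (not a real model).
lm_head = [[1.0, 0.0],   # shape (vocab=3, d_model=2)
           [0.0, 1.0],
           [1.0, 1.0]]
hiddens = [[0.2, 0.1],   # one hidden vector per transformer layer
           [0.5, -0.3]]
per_layer = early_exit_logits(hiddens, lm_head)  # one logit vector per layer
```

DoLa contrasts two of these per-layer distributions; SLED averages over all of them, which is where its extra decoding-time cost comes from.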
Role in the SLED comparison¶
The SLED blog post positions DoLa as the prior state of the art among competing decoding methods, i.e. the baseline SLED is measured against. Two headline numbers from the comparison (sourced from the blog; the paper is authoritative):
- SLED improves accuracy by up to 16 percentage points over both the base model and DoLa on "two challenging datasets" (not individually named in the blog) (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
- SLED's decoding-time overhead is only ~4% higher than DoLa's, attributed to SLED computing per-layer LM-head projections across every layer instead of contrasting a pair (Source: sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers).
The composability claim the SLED post makes ("SLED can be flexibly integrated with other factuality decoding methods") explicitly includes DoLa-style contrast; the two methods are not architecturally exclusive.
What the SLED source discloses about DoLa¶
The Google Research SLED post names DoLa once as the prior SOTA and links to the GitHub repo and arXiv paper, but does not reproduce DoLa's mechanism. Per the linked arXiv paper's abstract (referenced for context, not reconstructed as claims here), DoLa's rule is to amplify the logit differences between a mature and a premature layer, with no additional decoding-time machinery beyond the contrast. Concrete mechanism detail lives in the paper; this wiki page stops at what the SLED source confirms.
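A minimal sketch of that contrast rule, assuming plain log-softmax arithmetic: score each token by the mature layer's log-probability minus the premature layer's. The paper's refinements (dynamic premature-layer selection, masking of low-probability tokens) are omitted, and all names here are illustrative.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def dola_contrast(mature_logits, premature_logits):
    """Score each token by the difference of log-probabilities between
    the mature and the premature layer, then decode from these scores."""
    return [a - b for a, b in zip(log_softmax(mature_logits),
                                  log_softmax(premature_logits))]

# Toy logits: the premature layer is uniform, so the contrast keeps the
# mature layer's ranking; where the layers disagree, the gap is amplified.
mature = [2.0, 1.0, 0.5]
premature = [1.0, 1.0, 1.0]
scores = dola_contrast(mature, premature)
next_token = scores.index(max(scores))  # greedy pick over contrasted scores
```

The intuition matches the lineage premise above: tokens whose probability mass only appears in later layers (factual recall) are boosted relative to tokens the model favoured from the start (frequency bias).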
Relationship to other wiki primitives¶
- systems/sled — the SOTA successor as of NeurIPS 2024; SLED generalises "contrast two layers" to "weighted average over all layers".
- concepts/factuality-decoding — the category DoLa instantiates alongside SLED.
- concepts/early-exit-logits — the shared primitive: vocabulary distributions from intermediate transformer layers, obtained by applying the final projection matrix to intermediate hidden states.
- concepts/logits — the underlying scores DoLa contrasts.
- concepts/llm-hallucination — the failure mode DoLa remediates.
- concepts/llm-decoding-step — the insertion point; DoLa is a modification of the final decoding phase.
Seen in¶
- sources/2025-09-17-google-sled-making-llms-more-accurate-by-using-all-of-their-layers — named as the prior best factuality-decoding baseline; the ~4% decoding-time and up-to-16-percentage-point accuracy comparison anchor for SLED.