CONCEPT Cited by 1 source

XML canonicalisation

XML canonicalisation (canonical XML; C14N) is the process of converting an XML document or fragment to a byte-exact normal form so that logically-equivalent XML documents with different byte-level representations (whitespace, attribute order, namespace declarations, comments) produce the same bytes. Canonicalisation is required by XML-DSig because cryptographic hashes are computed over bytes, and XML has many valid byte encodings of the same logical tree.

W3C specs: Canonical XML 1.0 (2001) and Exclusive XML Canonicalization 1.0 (2002). The exclusive variant is designed for signed fragments that might be moved between documents.

Why canonicalisation is needed

Two XML documents can be logically identical but byte-different:

<!-- Doc 1 -->
<Assertion xmlns:saml="urn:...saml..." saml:id="a"/>

<!-- Doc 2 (re-serialised) -->
<Assertion xmlns:saml="urn:...saml..." saml:id='a' ></Assertion>

<!-- Doc 3 (re-serialised differently) -->
<Assertion xmlns:saml="urn:...saml..." saml:id="a" ></Assertion>

To an XML parser all three are the same tree. To a byte-level hash they are three different inputs. Canonicalisation defines a deterministic many-to-one mapping that takes any of them to the same bytes, so the hash is invariant under byte-level re-serialisation.
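
The effect can be checked with a short Python sketch. It hashes the three serialisations above, copied verbatim, before and after canonicalisation. It relies on the standard library's `xml.etree.ElementTree.canonicalize`, which implements Canonical XML 2.0 rather than the 1.0 or exclusive 1.0 algorithms usually named in SAML signatures; that substitution is an assumption made only to keep the example free of third-party dependencies.

```python
import hashlib
import xml.etree.ElementTree as ET

# The three serialisations from the example above, copied verbatim
# (the elided namespace URI is kept exactly as written).
docs = [
    '<Assertion xmlns:saml="urn:...saml..." saml:id="a"/>',
    "<Assertion xmlns:saml=\"urn:...saml...\" saml:id='a' ></Assertion>",
    '<Assertion xmlns:saml="urn:...saml..." saml:id="a" ></Assertion>',
]


def sha256_hex(text):
    """Hash the UTF-8 bytes of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Distinct hashes over the bytes as received.
raw_digests = {sha256_hex(d) for d in docs}
# Distinct hashes over the canonical form (C14N 2.0, see the note above).
canonical_digests = {sha256_hex(ET.canonicalize(d)) for d in docs}

print(len(raw_digests))        # 3
print(len(canonical_digests))  # 1
```

The three raw digests collapse to a single digest once each document is canonicalised, which is the property a signature scheme over XML depends on.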

Typical C14N rules:

UTF-8 encoding.
Line endings normalised to #xA.
Empty elements serialised as <foo></foo> not <foo/>.
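Attributes sorted and double-quoted (namespace declarations first, then attributes ordered by namespace URI, then local name).
Whitespace in character content preserved, including whitespace-only text between elements; whitespace inside start and end tags normalised.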
Comments included (or excluded — this is a canonicalisation-method parameter chosen by the signature's <CanonicalizationMethod>).
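
Several of these rules can be observed with a small Python sketch. The input string is hypothetical and chosen to exercise attribute sorting, empty-element expansion, whitespace preservation, and the comment switch. The standard library's `ET.canonicalize` implements Canonical XML 2.0, so it is only an approximation of the 1.0 rules listed here; on this input the two are expected to agree.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: unsorted attributes, a comment, an empty-element tag,
# and whitespace-only text between the child elements.
source = '<root b="2" a="1"><!-- note --> <empty/> </root>'

# Comments are dropped by default; with_comments=True keeps them.
without_comments = ET.canonicalize(source)
with_comments = ET.canonicalize(source, with_comments=True)

print(without_comments)  # <root a="1" b="2"> <empty></empty> </root>
print(with_comments)     # <root a="1" b="2"><!-- note --> <empty></empty> </root>
```

The two outputs differ only by the comment, so a signer and a verifier that disagree on the comment setting would compute different digests for the same document.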

Role in XML-DSig verification

XML-DSig uses canonicalisation twice in a single signature:

Over <SignedInfo> — the <SignedInfo> element is canonicalised, then signed; the <SignatureValue> is the signature over those bytes. Canonicalisation method named in <CanonicalizationMethod> child of <SignedInfo>.
Over each referenced element — each <Reference URI="..."> points to an element whose canonicalised bytes must hash to the value inside <DigestValue>. Per-reference canonicalisation specified by <Transforms>/<Transform> (e.g. http://www.w3.org/2001/10/xml-exc-c14n#).

The verifier must perform both canonicalisations itself, on the document it received; otherwise neither the signature nor the digests will verify.
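
The two passes can be sketched in Python as follows. This is a toy model under several loudly stated assumptions, not an XML-DSig implementation: it uses HMAC-SHA256 with a hypothetical shared key in place of RSA or ECDSA, the standard library's C14N 2.0 in place of the algorithm named by `<CanonicalizationMethod>`, no namespaces, and a detached `<Signature>` sibling so that no enveloped-signature transform is needed. The helper names (`c14n`, `find_by_id`, `sign`, `verify`) are invented for illustration.

```python
import base64
import copy
import hashlib
import hmac
import xml.etree.ElementTree as ET

KEY = b"demo-shared-secret"  # hypothetical; real SAML signatures use RSA or ECDSA keys


def c14n(element):
    """Canonicalise one element subtree to bytes (C14N 2.0, an assumption)."""
    clone = copy.copy(element)
    clone.tail = None  # text after the end tag is not part of the element
    return ET.canonicalize(ET.tostring(clone, encoding="unicode")).encode("utf-8")


def digest_b64(element):
    return base64.b64encode(hashlib.sha256(c14n(element)).digest())


def mac_b64(element, key):
    return base64.b64encode(hmac.new(key, c14n(element), hashlib.sha256).digest())


def find_by_id(root, ref):
    """Resolve a same-document URI such as '#a' to exactly one element."""
    wanted = ref.get("URI", "").lstrip("#")
    matches = [el for el in root.iter() if el.get("ID") == wanted]
    return matches[0] if len(matches) == 1 else None


def sign(root, key):
    signed_info = root.find("./Signature/SignedInfo")
    for ref in signed_info.findall("Reference"):
        ref.find("DigestValue").text = digest_b64(find_by_id(root, ref)).decode("ascii")
    root.find("./Signature/SignatureValue").text = mac_b64(signed_info, key).decode("ascii")


def verify(root, key):
    signed_info = root.find("./Signature/SignedInfo")
    if signed_info is None:
        return False
    # Pass 1: canonicalise <SignedInfo> and check <SignatureValue> against those bytes.
    claimed = root.findtext("./Signature/SignatureValue", default="").encode("utf-8")
    if not hmac.compare_digest(mac_b64(signed_info, key), claimed):
        return False
    # Pass 2: canonicalise each referenced element and compare with <DigestValue>.
    for ref in signed_info.findall("Reference"):
        target = find_by_id(root, ref)
        claimed = ref.findtext("DigestValue", default="").encode("utf-8")
        if target is None or not hmac.compare_digest(digest_b64(target), claimed):
            return False
    return True


doc = ET.fromstring(
    '<Response><Assertion ID="a"><Subject>alice</Subject></Assertion>'
    '<Signature><SignedInfo><Reference URI="#a"><DigestValue/></Reference>'
    '</SignedInfo><SignatureValue/></Signature></Response>'
)
sign(doc, KEY)
ok_before = verify(doc, KEY)
doc.find("./Assertion/Subject").text = "mallory"
ok_after = verify(doc, KEY)
print(ok_before, ok_after)  # True False
```

In this sketch the `<Reference>` elements are read from the same `signed_info` object whose canonical bytes were checked in pass 1, and a reference that resolves to zero or several elements is rejected rather than guessed at.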

Why canonicalisation is attack-relevant

Two facets make canonicalisation a recurring XSW substrate:

Canonicalisation is a second parse. Many implementations use one XML parser to locate the element and a second to canonicalise it (or hand a string off to a different C14N library). If the locator and the canonicaliser return different elements for the same lookup, you have a parser differential inside the signature path.
Canonicalisation rules embed choices. Whether comments are included, whether namespaces are inherited from ancestors (inclusive vs exclusive C14N), whether whitespace is preserved — these choices are picked by the signature itself via the algorithm URI, but implementations handle the edge cases differently. This is how historic whitespace-injection and comment-injection XSW variants worked.
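
The second facet can be illustrated with a comment-exclusion example in Python. The addresses are hypothetical, and this is a generic illustration of the comment-injection idea, not a reproduction of the ruby-saml CVEs. It again assumes the standard library's C14N 2.0 stands in for the comment-free canonicalisation a signature would name, and uses `xml.dom.minidom` as a stand-in for a consumer that reads only the first text node.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical values: the signer issued an assertion for this identity.
signed = "<NameID>alice@example.com.attacker.example</NameID>"
# An attacker inserts an empty comment; under comment-free canonicalisation
# the signed bytes do not change.
tampered = "<NameID>alice@example.com<!---->.attacker.example</NameID>"

same_canonical_form = ET.canonicalize(signed) == ET.canonicalize(tampered)
differs_with_comments = (
    ET.canonicalize(signed, with_comments=True)
    != ET.canonicalize(tampered, with_comments=True)
)

# A consumer that reads only the first text node sees a different identity.
first_text_node = minidom.parseString(tampered).documentElement.firstChild.data

print(same_canonical_form)    # True
print(differs_with_comments)  # True
print(first_text_node)        # alice@example.com
```

The digest still matches because the canonical form ignores the comment, while the text-extraction code stops at it. The signed value and the consumed value therefore disagree even though no cryptographic check fails.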

ruby-saml added systems/nokogiri specifically because systems/rexml lacked canonicalisation support. The library's architecture assumes "REXML locates, Nokogiri canonicalises", and the 2025 CVEs exploit the gap where REXML's idea of where the <SignedInfo> is and Nokogiri's idea differ.

Structural implication

A correct verifier canonicalises exactly the bytes it has already cryptographically located. After the signature verification succeeds on the canonicalised <SignedInfo>, subsequent uses of <SignedInfo> (to find <Reference>, to find <DigestValue>) must operate on those bytes — not on bytes re-obtained by re-querying the document. See patterns/single-parser-for-security-boundaries.
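
A minimal Python sketch of this discipline follows. It assumes HMAC-SHA256 with a hypothetical shared key in place of a real public-key signature, and the function name `references_from_verified_bytes` is invented for illustration. The point it shows is that the `<Reference>` data is parsed from the verified bytes themselves, not from a second lookup in the original document.

```python
import hashlib
import hmac
import xml.etree.ElementTree as ET


def references_from_verified_bytes(canonical_signed_info, signature, key):
    """Return (URI, DigestValue) pairs parsed from the verified bytes only."""
    expected = hmac.new(key, canonical_signed_info, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature over <SignedInfo> does not verify")
    # Parse the exact bytes that were just verified; never re-query the
    # original document for <Reference> or <DigestValue>.
    signed_info = ET.fromstring(canonical_signed_info)
    return [
        (ref.get("URI"), ref.findtext("DigestValue"))
        for ref in signed_info.findall("Reference")
    ]


key = b"demo-shared-secret"  # hypothetical key
canonical = ET.canonicalize(
    '<SignedInfo><Reference URI="#a">'
    "<DigestValue>ZGlnZXN0</DigestValue></Reference></SignedInfo>"
).encode("utf-8")
signature = hmac.new(key, canonical, hashlib.sha256).digest()

refs = references_from_verified_bytes(canonical, signature, key)
print(refs)  # [('#a', 'ZGlnZXN0')]
```

Because the function's only input is the byte string that passed the signature check, there is no second parser or second query whose view of the document could diverge from what was verified.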

Seen in

sources/2025-03-15-github-sign-in-as-anyone-bypassing-saml-sso-authentication-with-parser-differentials: the canonicalisation step in ruby-saml uses systems/nokogiri while element location uses systems/rexml; this two-parser gap is the root cause of the XML signature wrapping that the CVEs exploit.

Related

concepts/xml-signature-wrapping — the attack class that most frequently exploits canonicalisation misuse.
concepts/parser-differential — the underlying class when canonicalisation is split across parsers.
systems/saml-protocol — the spec mandating XML-DSig and thus canonicalisation.