PATTERN Cited by 2 sources
IAM role attribution from S3 access logs¶
Problem¶
When attributing data access in an AWS-native data lake, the naive attribution keys all break down:
- Source IPs are ephemeral (NAT, EC2 instance churn) and meaningless to data owners.
- Compute instance identities are short-lived and don't roll up to teams or services.
- Application-level identifiers require invasive instrumentation on every code path that reads the bucket.
The result: data owners can't tell who is reading their data, only that someone is reading it. Stakeholder discovery, cost attribution, and consumer-pattern fingerprinting all need a stable, human-meaningful entity coordinate.
Pattern¶
Use the IAM principal in the SAL requester field as the entity
coordinate.
S3 Server Access Logs (SAL)
record the IAM identity that issued each request in the
requester field — typically the ARN of an assumed IAM role. With
a deployment naming convention that links role names to services or
teams, this gives a stable, human-meaningful coordinate that
survives instance churn, NAT, and reassignment.
Canonical shape (Yelp 2026-05-21 + 2025-09-26)¶
1. Extract the IAM role from requester¶
The requester field is an IAM principal ARN (for assumed roles,
the format is
arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>).
Yelp's 2025-09-26 SAL post canonicalised the extraction: a regex over requester that pulls out the <role-name> segment.
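Yelp's own expression is not shown on this page, so the following is only a minimal sketch of what such an extraction could look like, assuming the assumed-role ARN format given above. The pattern, the `extract_role` name, and the sample ARN are illustrative assumptions, not Yelp's code.

```python
import re
from typing import Optional

# Assumed pattern (not Yelp's): partition, 12-digit account ID,
# role name, then session name.
ASSUMED_ROLE_RE = re.compile(
    r"^arn:aws[a-z-]*:sts::(?P<account>\d{12}):"
    r"assumed-role/(?P<role>[^/]+)/(?P<session>.+)$"
)


def extract_role(requester: str) -> Optional[str]:
    """Return the role name for assumed-role requesters, else None."""
    match = ASSUMED_ROLE_RE.match(requester)
    return match.group("role") if match else None


# Hypothetical sample values for illustration only.
print(extract_role("arn:aws:sts::123456789012:assumed-role/prod-search-indexer/i-0abc123"))
print(extract_role("-"))  # unauthenticated requests are logged as "-"
```

Returning None for anything that is not an assumed-role ARN keeps IAM users, canonical IDs, and unauthenticated requests out of the role coordinate rather than mislabelling them.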
2. Use it as the GROUP BY coordinate¶
The 2026-05-21 partition-access-visualisation aggregation:
SELECT
COUNT(1) AS ct,
requester AS iam_role,
...
FROM s3_server_access_logs_compacted
GROUP BY iam_role, ...
(The post uses requester directly as iam_role; the regex
extraction step from 2025-09-26 may be applied either pre-compaction
or in the SELECT.)
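The same grouping can be sketched outside SQL. The snippet below is an illustration under assumptions: each log row is a dict, `partition` is a hypothetical stand-in for the grouping columns elided in the query above, and `role_of` is a simplified helper rather than the 2025-09-26 regex.

```python
from collections import Counter


def role_of(requester: str) -> str:
    """Reduce an assumed-role ARN to its role name; pass other values through."""
    parts = requester.split("/")
    if len(parts) >= 3 and parts[0].endswith(":assumed-role"):
        return parts[1]
    return requester


def count_by_role(rows):
    """Equivalent of COUNT(1) ... GROUP BY iam_role, partition."""
    return Counter((role_of(r["requester"]), r["partition"]) for r in rows)


# Hypothetical rows for illustration only.
rows = [
    {"requester": "arn:aws:sts::123456789012:assumed-role/prod-search-indexer/s1",
     "partition": "dt=2026-01-01"},
    {"requester": "arn:aws:sts::123456789012:assumed-role/prod-search-indexer/s2",
     "partition": "dt=2026-01-01"},
    {"requester": "arn:aws:sts::123456789012:assumed-role/prod-ads-etl/s9",
     "partition": "dt=2026-01-02"},
]
for (role, partition), ct in count_by_role(rows).items():
    print(ct, role, partition)
```

Note that two different session names under the same role fall into one group, which is the point of reducing requester to the role before grouping.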
3. Adopt a role-naming convention that maps roles to teams¶
"Since Yelp runs on AWS, our entities are IAM roles, with role names linked to services or teams." — Yelp 2026-05-21
The convention is the load-bearing organisational discipline that
makes the pattern useful: a data owner sees prod-search-indexer
and knows immediately which team to talk to. Without this mapping,
the IAM-role coordinate is just an opaque string.
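How a convention turns a role name into an owner can be sketched as a lookup. Everything here is hypothetical: the source does not describe Yelp's actual convention, so the `<env>-<service>-<component>` layout, the `OWNERS` table, and the team names are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical ownership table: service segment -> owning team.
OWNERS = {
    "search": "search-infra",
    "ads": "ads-data",
}


def owner_for(role_name: str) -> Optional[str]:
    """Resolve a role name of the assumed form <env>-<service>-<component>."""
    parts = role_name.split("-")
    if len(parts) < 2:
        return None  # does not follow the assumed convention
    return OWNERS.get(parts[1])


print(owner_for("prod-search-indexer"))
print(owner_for("legacyrole"))  # no convention, so no owner can be derived
```

A role that does not follow the convention resolves to None, which is the opaque-string case described above.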
(Sources: sources/2026-05-21-yelp-how-partition-access-visualizations-reduced-our-data-lake-s3-cost-by-33, sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
Why IAM roles are the right coordinate¶
- Stable — roles are configured at infrastructure level, not per-instance. A role lives across deploys, scale-outs, and instance replacement.
- Human-meaningful — when role names follow a naming convention, the role IS the entity name.
- Granular enough — services and teams typically have distinct roles, so attribution at role grain ≈ attribution at service / team grain.
- Already there — SAL captures requester automatically; no instrumentation required.
- Cross-account-friendly — assumed-role ARNs include the account ID, so cross-account reads remain attributable.
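The cross-account property follows from the ARN layout, where the account ID is the fifth colon-separated field. A minimal sketch, assuming the caller knows the bucket owner's account ID; the function names and sample account IDs are hypothetical.

```python
from typing import Optional


def requester_account(requester: str) -> Optional[str]:
    """Return the account ID embedded in an ARN-shaped requester, else None."""
    parts = requester.split(":")
    if len(parts) >= 6 and parts[0] == "arn" and parts[4].isdigit():
        return parts[4]
    return None


def is_cross_account(requester: str, bucket_owner_account: str) -> bool:
    """True when the requester's account differs from the bucket owner's."""
    account = requester_account(requester)
    return account is not None and account != bucket_owner_account


# Hypothetical account IDs for illustration only.
arn = "arn:aws:sts::210987654321:assumed-role/prod-ads-etl/session-1"
print(requester_account(arn))
print(is_cross_account(arn, "123456789012"))
```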
Use cases (canonical)¶
- Cost attribution — group by role, sum the cost of S3 reads attributable to each. Yelp's 2025-09-26 disclosure: "to find the IAM role generating the call volume."
- Stakeholder discovery — answer "who consumes this table" for a data owner.
- Consumer-pattern fingerprinting — combined with partition + event-time grouping, the role becomes the colour axis on a partition-access-pattern visualisation.
- Incident response — slice by role to size the blast radius of a compromised credential.
- Default Access Retention exemptions — the IAM gate (per patterns/iam-policy-gated-cold-tier-access) operates at the role level: temporary access is granted to a specific role through a PR-approved Terraform change for a cost-acknowledged read.
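As an illustration of the incident-response use case, slicing by role amounts to filtering on the role's ARN prefix within a time window and collecting what was touched. This is a sketch under assumptions: rows are dicts with hypothetical `requester`, `time`, `bucket`, and `key` fields, and `blast_radius` is an illustrative name.

```python
from datetime import datetime


def blast_radius(rows, role_arn_prefix, start, end):
    """Map bucket -> set of keys requested by the role in [start, end)."""
    touched = {}
    for r in rows:
        if r["requester"].startswith(role_arn_prefix) and start <= r["time"] < end:
            touched.setdefault(r["bucket"], set()).add(r["key"])
    return touched


# Hypothetical rows for illustration only.
prefix = "arn:aws:sts::123456789012:assumed-role/prod-search-indexer/"
rows = [
    {"requester": prefix + "s1", "time": datetime(2026, 1, 1, 10),
     "bucket": "lake", "key": "t/dt=2026-01-01/a.parquet"},
    {"requester": prefix + "s2", "time": datetime(2026, 1, 1, 11),
     "bucket": "lake", "key": "t/dt=2026-01-01/b.parquet"},
    {"requester": prefix + "s1", "time": datetime(2026, 1, 3, 9),
     "bucket": "lake", "key": "t/dt=2026-01-03/c.parquet"},
]
hit = blast_radius(rows, prefix, datetime(2026, 1, 1), datetime(2026, 1, 2))
print({bucket: sorted(keys) for bucket, keys in hit.items()})
```

Including the trailing slash in the prefix avoids matching a different role whose name merely starts with the same characters.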
Limitations¶
- Federation collapses identities. When multiple users assume the same role (e.g. an SSO-mapped engineer role), the role name alone cannot tell them apart; individual user attribution requires the session-name component of the assumed-role ARN.
- Application-level pooling collapses identities. A service that pools all reads under one role makes the role the coarsest available coordinate even when the underlying business owners are distinct.
- Out-of-band reads break attribution. A read through a pre-signed URL is logged under the principal that signed the URL, not the party that fetched it, and an unauthenticated request is logged with requester set to -; in both cases attribution to the actual reader is lost.
- Role-naming discipline is required. Without it, roles are opaque strings and the human-meaningfulness claim collapses.
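The federation limitation can be partly worked around by grouping on the session name as well as the role. The sketch below assumes session names carry a per-user identifier, which depends on how the identity provider is configured; the function names and sample values are hypothetical.

```python
from collections import Counter
from typing import Optional, Tuple


def role_and_session(requester: str) -> Optional[Tuple[str, str]]:
    """Split an assumed-role ARN into (role, session); None for anything else."""
    _, sep, tail = requester.partition(":assumed-role/")
    if not sep:
        return None
    role, slash, session = tail.partition("/")
    return (role, session) if slash else None


def sessions_per_role(requesters):
    """Count requests per (role, session) and tally unattributable requesters."""
    counts, unattributed = Counter(), 0
    for requester in requesters:
        parsed = role_and_session(requester)
        if parsed is None:
            unattributed += 1
        else:
            counts[parsed] += 1
    return counts, unattributed


# Hypothetical requester values for illustration only.
base = "arn:aws:sts::123456789012:assumed-role/sso-engineer/"
counts, unattributed = sessions_per_role(
    [base + "alice@example.com", base + "alice@example.com",
     base + "bob@example.com", "-"]
)
print(counts.most_common())
print(unattributed)
```

The unattributed tally gives a rough measure of how much traffic falls outside role-based attribution altogether.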
Seen in¶
- sources/2026-05-21-yelp-how-partition-access-visualizations-reduced-our-data-lake-s3-cost-by-33 — canonical disclosure of role colour-coding for partition-access visualisation.
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale — canonical disclosure of the regex extraction from requester and the cost-attribution query shape.
Related¶
- systems/aws-iam — the identity primitive.
- concepts/s3-server-access-logs — the substrate.
- concepts/granular-usage-attribution — the meta-concept.
- concepts/partition-access-pattern — the visualisation that uses role as colour axis.
- systems/yelp-s3-sal-pipeline — canonical Yelp substrate.
- systems/yelp-partition-access-visualization — canonical consumer.
- patterns/access-pattern-visualization-for-data-stewardship — the visualisation pattern that uses this attribution.
- patterns/iam-policy-gated-cold-tier-access — IAM-gate primitive that operates at the role level.