PATTERN
Object tagging for lifecycle expiration¶
Problem¶
You need to delete millions of individual S3 objects on a schedule that varies per-object — for example, source raw-text logs immediately after they have been compacted into Parquet, or unused objects identified by an access-based retention job. Three naive approaches don't scale:
- Per-object `DELETE` API — at millions of objects per day, request rate is the bottleneck. Even batched `DeleteObjects` (up to 1,000 keys per request) tops out at account-wide write TPS limits.
- Bucket-level lifecycle TTL — too coarse. You can't say "delete this object in 7 days, that object in 90."
- A lifecycle policy per prefix — modifying the lifecycle policy each time would require constant configuration churn and hit policy size limits.
Yelp's verbatim framing:
"That's the only scalable way to delete per object without needing to modify lifecycle policy each time or issuing delete API calls." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
Pattern¶
Decouple "which objects to delete" from "when to delete them":
- Apply a tag (e.g. `expire=true`) to each object that should be deleted. S3 supports up to 10 tags per object, each a key-value pair.
- Configure the bucket's lifecycle policy with a rule that expires objects carrying the tag: `Filter: { Tag: { Key: "expire", Value: "true" } }` plus `Expiration: { Days: N }`.
- Let AWS do the actual deletion asynchronously.

The tagging step is the scalable operation — per-object `PutObjectTagging`, or batched via S3 Batch Operations.
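The lifecycle side can be sketched as follows. This is a minimal illustration, not Yelp's configuration: the rule ID and the 7-day window are placeholder choices, and `s3_client` is assumed to be a boto3 S3 client.

```python
# Lifecycle rule: expire any object tagged expire=true, N days after creation.
# Rule ID and the 7-day window are illustrative, not taken from the source.
EXPIRE_RULE = {
    "ID": "expire-tagged-objects",
    "Status": "Enabled",
    "Filter": {"Tag": {"Key": "expire", "Value": "true"}},
    "Expiration": {"Days": 7},  # grace window before AWS deletes the object
}


def apply_expire_rule(s3_client, bucket: str) -> None:
    """Attach the tag-based expiration rule to a bucket.

    Note: PutBucketLifecycleConfiguration replaces the bucket's entire
    lifecycle configuration, so merge EXPIRE_RULE into any existing rules
    in real use instead of overwriting them.
    """
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [EXPIRE_RULE]},
    )
```

Once this rule is in place, deletion reduces to tagging: anything that gains `expire=true` is removed by AWS N days later, with no further action from the producing job.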
Shape at two scales¶
Low-volume buckets: direct tagging¶
For buckets that produce hundreds to a few thousand objects per compaction window, issue `PutObjectTagging` calls directly from the compaction job. This avoids the fixed $0.25 per-bucket-per-job S3 Batch Operations fee.
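The direct path is a loop, sketched here under the assumption that `s3_client` is a boto3 S3 client and the tag key/value match the lifecycle rule:

```python
def tag_for_expiration(s3_client, bucket: str, keys: list[str]) -> None:
    """Mark each compacted-away source object for the lifecycle rule to expire.

    One PutObjectTagging request per key — fine at hundreds to a few
    thousand objects per window, where the flat $0.25 Batch Operations
    fee isn't worth paying.
    """
    tagging = {"TagSet": [{"Key": "expire", "Value": "true"}]}
    for key in keys:
        # Note: PutObjectTagging replaces the object's entire tag set,
        # so fetch-and-merge first if other tags must survive.
        s3_client.put_object_tagging(Bucket=bucket, Key=key, Tagging=tagging)
```

Because re-tagging is a no-op (see "Why tag-then-expire" below), this loop can be retried wholesale on failure without bookkeeping.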
High-volume buckets: S3 Batch Operations¶
For buckets producing millions of objects per window, build an S3 Batch Operations manifest (a CSV of `bucket,key` rows) and submit a `PutObjectTagging` job. Batch Operations parallelises the tagging and writes per-object success/failure to an S3 completion report.
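Job submission goes through the S3 Control API. A sketch, assuming `s3control` is a boto3 S3 Control client; the ARNs, priority, and the no-confirmation setting are placeholders, not values from the source:

```python
def submit_tagging_job(s3control, account_id: str, manifest_arn: str,
                       manifest_etag: str, role_arn: str,
                       report_bucket_arn: str) -> str:
    """Submit a Batch Operations job tagging every object in the manifest."""
    resp = s3control.create_job(
        AccountId=account_id,
        ConfirmationRequired=False,  # start without manual approval
        Operation={"S3PutObjectTagging": {
            "TagSet": [{"Key": "expire", "Value": "true"}],
        }},
        Manifest={
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],  # must match the CSV columns
            },
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
        # Per-object success/failure lands in this bucket as a CSV report.
        Report={
            "Bucket": report_bucket_arn,
            "Format": "Report_CSV_20180820",
            "Enabled": True,
            "ReportScope": "AllTasks",
        },
        Priority=10,
        RoleArn=role_arn,
    )
    return resp["JobId"]
```

The `RoleArn` must grant `s3:PutObjectTagging` on the target bucket plus read access to the manifest; the `ETag` pins the job to one exact manifest version.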
Gotchas from Yelp's 2025-09-26 post:
- Athena query results include a header row — Batch Operations interprets it as a bucket name, causing job failures. "To work around this, we recreate manifest files in memory without headers."
- Object keys in manifests must be URL-encoded, equivalent to `quote_plus(key, safe="/")`.
- The flat $0.25 per-bucket-per-job fee dominates for low-volume buckets — hence the two-scale dispatch rule above.
- Batch Operations does not support `Delete` as an action — which is why the indirect tag-then-expire pattern exists. `PutObjectTagging` is the load-bearing supported action here.
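The first two gotchas can be handled in one place when the manifest is built. A sketch — the helper name is mine, not Yelp's, but the headerless format and the `quote_plus(key, safe="/")` encoding are straight from the source:

```python
from urllib.parse import quote_plus


def build_manifest(bucket: str, keys: list[str]) -> str:
    """Build a Batch Operations CSV manifest body in memory.

    Deliberately headerless: a header row would be misread by Batch Ops
    as a bucket name and fail the job. Keys are URL-encoded, with '/'
    kept literal, per quote_plus(key, safe="/").
    """
    lines = (f"{bucket},{quote_plus(key, safe='/')}" for key in keys)
    return "\n".join(lines) + "\n"
```

Writing this string to S3 and recording the returned ETag yields exactly the `Spec`/`Location` pair the `create_job` call expects.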
Why tag-then-expire beats direct-delete¶
- Scalable: lifecycle runs async and at AWS's own pace; your job only has to issue the tag, not the delete.
- Auditable: tagged objects are visible (pre-expiration) in the inventory; you can revert by removing the tag before the expiration day.
- Idempotent: re-tagging an already-tagged object is a no-op.
- Recoverable: if a bug in the compaction job emits bad tags, you have a window (the lifecycle's grace period) to detect and untag before actual deletion.
Composition¶
Common stacks:
- Log compaction: patterns/raw-to-columnar-log-compaction compacts raw → Parquet, then tags raw for expiration.
- Access-based retention: patterns/s3-access-based-retention tags unused objects based on joining inventory ⋈ SAL.
- Post-incident cleanup: sweep tool identifies compromise-era objects by tag, lifecycle deletes them.
Reverse-direction variant¶
The opposite pattern — tag to prevent deletion — also applies: apply `keep=true` at ingest, and invert the lifecycle rule to "expire unless tagged". This is the safer default for systems where the failure mode of "forgot to tag" is worse than "deleted prematurely".
Seen in¶
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale — canonical wiki instance. Yelp tags compacted source SAL objects for lifecycle expiration; composes with S3 Batch Operations `PutObjectTagging` for high-volume buckets and direct tagging for low-volume buckets. The $0.25 per-bucket-per-job fee drives the two-scale dispatch rule; Batch Operations' lack of a `Delete` action is what makes this pattern necessary in the first place.
Related¶
- systems/aws-s3 — parent service.
- systems/s3-batch-operations — the high-volume tagging primitive.
- patterns/s3-access-based-retention — a higher-level retention scheme built on this pattern.
- patterns/raw-to-columnar-log-compaction — common composer.
- concepts/s3-server-access-logs — a common subject of this deletion.