[ENH]: Dead letter queuing for compaction jobs #5023


Merged · 1 commit · Jul 20, 2025

Conversation

@tanujnay112 (Contributor) commented Jul 2, 2025

Description of changes

This change adds a dead letter queueing system to the compaction scheduler. If a compaction job on a collection fails max_failure_count times, the collection is moved to a dead set; while it is in this set, it will not be scheduled for compaction. As of this change, the only way to clear the set is to restart the compaction process.

  • Improvements & Bug fixes
    • Added a failing_jobs map in the CompactionManager to help keep track of jobs that have failed on consecutive attempts.
    • Added a dead_jobs set in the CompactionManager to record "dead" jobs.
  • New functionality
    • Described above.
    • Added a metric compactor_dead_jobs_count to track the size of the dead jobs set.
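The failure-tracking flow described above can be sketched as follows. This is a minimal illustration, not the actual Chroma code: the name DeadLetterTracker and the plain integer collection id are assumptions (the real scheduler uses CollectionUuid and stores the maps directly on the Scheduler).

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for CollectionUuid.
type CollectionId = u64;

struct DeadLetterTracker {
    max_failure_count: u32,
    failing_jobs: HashMap<CollectionId, u32>, // consecutive failure counts
    dead_jobs: HashSet<CollectionId>,         // collections excluded from scheduling
}

impl DeadLetterTracker {
    fn new(max_failure_count: u32) -> Self {
        Self {
            max_failure_count,
            failing_jobs: HashMap::new(),
            dead_jobs: HashSet::new(),
        }
    }

    // A success clears any accumulated failures for the collection.
    fn record_success(&mut self, id: CollectionId) {
        self.failing_jobs.remove(&id);
    }

    // A failure bumps the consecutive count; at the threshold the
    // collection moves into the dead set and is no longer scheduled.
    fn record_failure(&mut self, id: CollectionId) {
        let count = self.failing_jobs.entry(id).or_insert(0);
        *count += 1;
        if *count >= self.max_failure_count {
            self.failing_jobs.remove(&id);
            self.dead_jobs.insert(id);
        }
    }

    fn is_dead(&self, id: CollectionId) -> bool {
        self.dead_jobs.contains(&id)
    }
}
```

Note that only consecutive failures count: a single success resets the tally, so intermittent failures never push a collection into the dead set.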

Test plan

Added a test in scheduler.rs.

Also manually tested by injecting failures in certain compaction jobs and tracking the dead set size metric locally.

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?


@github-actions bot commented Jul 2, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (readability, modularity, intuitiveness)?

@tanujnay112 tanujnay112 changed the title more testing pending [ENH]: Dead letter queuing for compaction jobs Jul 2, 2025
@tanujnay112 tanujnay112 force-pushed the 06-29-add_dead_letter_queue branch 4 times, most recently from 622adc1 to 6757078 Compare July 3, 2025 21:37
@tanujnay112 tanujnay112 force-pushed the 06-29-add_dead_letter_queue branch from 6757078 to ce3d50a Compare July 7, 2025 18:01
@@ -205,6 +205,7 @@ impl ChromaError for CompactionError {

#[derive(Debug)]
pub struct CompactionResponse {
#[allow(dead_code)]
@tanujnay112 (Contributor, Author) commented:
Not using this in CompactionManager anymore as I pull it out into CompactionTaskCompletion so I can also associate collection ids with compaction errors. I'm thinking we still keep this around for debugging?
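The idea of pairing a collection id with its compaction outcome, in the spirit of the CompactionTaskCompletion mentioned above, could be sketched like this. All names and type shapes here are illustrative assumptions, not the actual Chroma definitions:

```rust
// Hypothetical completion record tying a collection id to its outcome.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CollectionId(u64);

#[derive(Debug)]
enum CompactionOutcome {
    Success,
    // String is a stand-in for Box<dyn ChromaError>.
    Failure(String),
}

#[derive(Debug)]
struct CompactionTaskCompletion {
    collection_id: CollectionId,
    outcome: CompactionOutcome,
}

impl CompactionTaskCompletion {
    // Lets the scheduler route the collection to success or failure handling.
    fn failed(&self) -> bool {
        matches!(self.outcome, CompactionOutcome::Failure(_))
    }
}
```

Carrying the id alongside the error is what allows per-collection failure counting; a bare error type would not say which collection to penalize.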

@tanujnay112 tanujnay112 requested a review from sanketkedia July 7, 2025 18:05
@tanujnay112 tanujnay112 marked this pull request as ready for review July 7, 2025 18:05

Add Dead Letter Queuing for Compaction Jobs with Failure Tracking

This PR introduces a dead letter queue for the compaction scheduler. Collections whose compaction jobs fail more than a configurable number of times (max_failure_count) are moved to a 'dead set', preventing further compaction attempts until the process restarts. The system now tracks failing jobs, records dead jobs, increments a new metric (compactor_dead_jobs_count), and exposes this functionality through extended data structures, job lifecycle, and enhanced test cases.

Key Changes

• Added failing_jobs HashMap and dead_jobs HashSet to track failure counts and dead collections in the Scheduler.
• Extended configuration (CompactorConfig) to support a configurable max_failure_count (with default).
• Refactored job completion logic to succeed, fail, or permanently kill jobs and collections based on consecutive failures.
• Introduced metric reporting for dead jobs count using OpenTelemetry.
• Modified compaction manager to record successes and failures distinctly, associating errors and success with specific collection IDs.
• Updated tests to cover dead letter logic and ensured correctness across scheduler edge cases.
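The configurable threshold mentioned above could look roughly like this. The field name max_failure_count comes from the PR, but the default value of 3 and the surrounding struct shape are assumptions for illustration:

```rust
// Hypothetical sketch of CompactorConfig gaining a max_failure_count field
// with a default (the real config is deserialized from the worker's config).
fn default_max_failure_count() -> u32 {
    3 // assumed default, not taken from the PR
}

#[derive(Debug, Clone)]
struct CompactorConfig {
    // After this many consecutive failures, a collection's jobs are
    // moved to the dead set and skipped by the scheduler.
    max_failure_count: u32,
}

impl Default for CompactorConfig {
    fn default() -> Self {
        Self {
            max_failure_count: default_max_failure_count(),
        }
    }
}
```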

Affected Areas

• rust/worker/src/compactor/scheduler.rs
• rust/worker/src/compactor/compaction_manager.rs
• rust/worker/src/compactor/config.rs
• Configuration handling (CompactorConfig)
• Cargo.toml (dependency addition: opentelemetry)
• Execution and orchestration logic

This summary was automatically generated by @propel-code-bot

@@ -541,7 +646,7 @@ mod tests {
let jobs = jobs.collect::<Vec<&CompactionJob>>();
assert_eq!(jobs.len(), 1);
assert_eq!(jobs[0].collection_id, collection_uuid_2,);
scheduler.complete_collection(collection_uuid_2);
scheduler.succeed_collection(collection_uuid_2);
Collaborator commented:
nit - maybe mark_collection_as_succeeded style naming is a bit less clunky

@HammadB (Collaborator) left a comment:
Will you be adding the dashboards / triggers for killed jobs count ?

@@ -49,14 +49,19 @@ use tracing::Instrument;
use tracing::Span;
use uuid::Uuid;

type BoxedFuture =
Pin<Box<dyn Future<Output = Result<CompactionResponse, Box<dyn ChromaError>>> + Send>>;
type CompactionOutput = Result<CompactionResponse, Box<dyn ChromaError>>;
Collaborator commented:
good cleanup ty
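The aliases in the diff above name the repeated Pin<Box<dyn Future<...>>> signature once so job-returning functions stay readable. A self-contained sketch with stand-in payload types (u32/String instead of CompactionResponse/Box<dyn ChromaError>); the no-op waker is only scaffolding so the example runs without an async runtime:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Stand-ins for CompactionResponse / Box<dyn ChromaError>.
type CompactionOutput = Result<u32, String>;
type BoxedFuture = Pin<Box<dyn Future<Output = CompactionOutput> + Send>>;

// With the aliases, job constructors read cleanly.
fn make_job(n: u32) -> BoxedFuture {
    Box::pin(async move { Ok::<u32, String>(n * 2) })
}

// A no-op waker so the (immediately ready) future can be polled once.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Polls the boxed job a single time; Some(..) if it completed.
fn poll_once(fut: &mut BoxedFuture) -> Option<CompactionOutput> {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(out) => Some(out),
        Poll::Pending => None,
    }
}
```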

@tanujnay112 tanujnay112 merged commit bca1e26 into main Jul 20, 2025
61 checks passed