[ENH]: Dead letter queuing for compaction jobs #5023


Merged · 1 commit · Jul 20, 2025

Conversation

@tanujnay112 (Contributor) commented Jul 2, 2025

Description of changes

This change adds a dead letter queueing system to the compaction scheduler. If a compaction job on a collection fails max_failure_count times, the collection is moved to a dead set; while it is in this set, it will not be scheduled for compaction. As of this change, the only way to clear the set is to restart the compaction process.

  • Improvements & Bug fixes
    • Added a failing_jobs map in the CompactionManager to help keep track of jobs that have failed on consecutive attempts.
    • Added a dead_jobs set in the CompactionManager to record "dead" jobs.
  • New functionality
    • Described above.
    • Added a metric compactor_dead_jobs_count to track the size of the dead jobs set.
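The failure-tracking flow described above can be sketched as follows. This is a minimal illustration, not the actual Chroma code: the name DeadLetterTracker and the plain integer collection id are assumptions (the real scheduler uses CollectionUuid and stores the maps directly on the Scheduler).

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for CollectionUuid.
type CollectionId = u64;

struct DeadLetterTracker {
    max_failure_count: u32,
    failing_jobs: HashMap<CollectionId, u32>, // consecutive failure counts
    dead_jobs: HashSet<CollectionId>,         // collections excluded from scheduling
}

impl DeadLetterTracker {
    fn new(max_failure_count: u32) -> Self {
        Self {
            max_failure_count,
            failing_jobs: HashMap::new(),
            dead_jobs: HashSet::new(),
        }
    }

    // A success clears any accumulated failures for the collection.
    fn record_success(&mut self, id: CollectionId) {
        self.failing_jobs.remove(&id);
    }

    // A failure bumps the consecutive count; at the threshold the
    // collection moves into the dead set and is no longer scheduled.
    fn record_failure(&mut self, id: CollectionId) {
        let count = self.failing_jobs.entry(id).or_insert(0);
        *count += 1;
        if *count >= self.max_failure_count {
            self.failing_jobs.remove(&id);
            self.dead_jobs.insert(id);
        }
    }

    fn is_dead(&self, id: CollectionId) -> bool {
        self.dead_jobs.contains(&id)
    }
}
```

Note that only consecutive failures count: a single success resets the tally, so intermittent failures never push a collection into the dead set.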

Test plan

Added a test in scheduler.rs.

Also manually tested by injecting failures in certain compaction jobs and tracking the dead set size metric locally.

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?


@github-actions bot commented Jul 2, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (readability, modularity, intuitiveness)?

@tanujnay112 tanujnay112 changed the title more testing pending [ENH]: Dead letter queuing for compaction jobs Jul 2, 2025
@tanujnay112 tanujnay112 force-pushed the 06-29-add_dead_letter_queue branch 4 times, most recently from 622adc1 to 6757078 Compare July 3, 2025 21:37
@tanujnay112 tanujnay112 force-pushed the 06-29-add_dead_letter_queue branch from 6757078 to ce3d50a Compare July 7, 2025 18:01
@@ -205,6 +205,7 @@ impl ChromaError for CompactionError {

#[derive(Debug)]
pub struct CompactionResponse {
#[allow(dead_code)]
@tanujnay112 (Contributor, Author) commented:
Not using this in CompactionManager anymore as I pull it out into CompactionTaskCompletion so I can also associate collection ids with compaction errors. I'm thinking we still keep this around for debugging?
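The idea of pairing a collection id with its compaction outcome, in the spirit of the CompactionTaskCompletion mentioned above, could be sketched like this. All names and type shapes here are illustrative assumptions, not the actual Chroma definitions:

```rust
// Hypothetical completion record tying a collection id to its outcome.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CollectionId(u64);

#[derive(Debug)]
enum CompactionOutcome {
    Success,
    // String is a stand-in for Box<dyn ChromaError>.
    Failure(String),
}

#[derive(Debug)]
struct CompactionTaskCompletion {
    collection_id: CollectionId,
    outcome: CompactionOutcome,
}

impl CompactionTaskCompletion {
    // Lets the scheduler route the collection to success or failure handling.
    fn failed(&self) -> bool {
        matches!(self.outcome, CompactionOutcome::Failure(_))
    }
}
```

Carrying the id alongside the error is what allows per-collection failure counting; a bare error type would not say which collection to penalize.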

@tanujnay112 tanujnay112 requested a review from sanketkedia July 7, 2025 18:05
@tanujnay112 tanujnay112 marked this pull request as ready for review July 7, 2025 18:05

Add Dead Letter Queuing for Compaction Jobs with Failure Tracking

This PR introduces a dead letter queue for the compaction scheduler. Collections whose compaction jobs fail more than a configurable number of times (max_failure_count) are moved to a 'dead set', preventing further compaction attempts until the process restarts. The system now tracks failing jobs, records dead jobs, increments a new metric (compactor_dead_jobs_count), and exposes this functionality through extended data structures, job lifecycle, and enhanced test cases.

Key Changes

• Added failing_jobs HashMap and dead_jobs HashSet to track failure counts and dead collections in the Scheduler.
• Extended configuration (CompactorConfig) to support a configurable max_failure_count (with default).
• Refactored job completion logic to succeed, fail, or permanently kill jobs and collections based on consecutive failures.
• Introduced metric reporting for dead jobs count using OpenTelemetry.
• Modified compaction manager to record successes and failures distinctly, associating errors and success with specific collection IDs.
• Updated tests to cover dead letter logic and ensured correctness across scheduler edge cases.
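The configurable threshold mentioned above could look roughly like this. The field name max_failure_count comes from the PR, but the default value of 3 and the surrounding struct shape are assumptions for illustration:

```rust
// Hypothetical sketch of CompactorConfig gaining a max_failure_count field
// with a default (the real config is deserialized from the worker's config).
fn default_max_failure_count() -> u32 {
    3 // assumed default, not taken from the PR
}

#[derive(Debug, Clone)]
struct CompactorConfig {
    // After this many consecutive failures, a collection's jobs are
    // moved to the dead set and skipped by the scheduler.
    max_failure_count: u32,
}

impl Default for CompactorConfig {
    fn default() -> Self {
        Self {
            max_failure_count: default_max_failure_count(),
        }
    }
}
```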

Affected Areas

• rust/worker/src/compactor/scheduler.rs
• rust/worker/src/compactor/compaction_manager.rs
• rust/worker/src/compactor/config.rs
• Configuration handling (CompactorConfig)
• Cargo.toml (dependency addition: opentelemetry)
• Execution and orchestration logic

This summary was automatically generated by @propel-code-bot

@@ -541,7 +646,7 @@ mod tests {
let jobs = jobs.collect::<Vec<&CompactionJob>>();
assert_eq!(jobs.len(), 1);
assert_eq!(jobs[0].collection_id, collection_uuid_2,);
scheduler.complete_collection(collection_uuid_2);
scheduler.succeed_collection(collection_uuid_2);
Collaborator commented:
nit - maybe mark_collection_as_succeeded style naming is a bit less clunky

@HammadB (Collaborator) left a comment:
Will you be adding the dashboards / triggers for killed jobs count ?

@@ -49,14 +49,19 @@ use tracing::Instrument;
use tracing::Span;
use uuid::Uuid;

type BoxedFuture =
Pin<Box<dyn Future<Output = Result<CompactionResponse, Box<dyn ChromaError>>> + Send>>;
type CompactionOutput = Result<CompactionResponse, Box<dyn ChromaError>>;
Collaborator commented:
good cleanup ty
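The aliases in the diff above name the repeated Pin<Box<dyn Future<...>>> signature once so job-returning functions stay readable. A self-contained sketch with stand-in payload types (u32/String instead of CompactionResponse/Box<dyn ChromaError>); the no-op waker is only scaffolding so the example runs without an async runtime:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Stand-ins for CompactionResponse / Box<dyn ChromaError>.
type CompactionOutput = Result<u32, String>;
type BoxedFuture = Pin<Box<dyn Future<Output = CompactionOutput> + Send>>;

// With the aliases, job constructors read cleanly.
fn make_job(n: u32) -> BoxedFuture {
    Box::pin(async move { Ok::<u32, String>(n * 2) })
}

// A no-op waker so the (immediately ready) future can be polled once.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Polls the boxed job a single time; Some(..) if it completed.
fn poll_once(fut: &mut BoxedFuture) -> Option<CompactionOutput> {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(out) => Some(out),
        Poll::Pending => None,
    }
}
```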

@tanujnay112 tanujnay112 merged commit bca1e26 into main Jul 20, 2025
61 checks passed