
Conversation

@mina-parham (Contributor) commented Dec 17, 2025

To get logs, I now read slurm-{job-id}.out directly: sacct waits for Slurm to release the nodes, which took a long time in some experiments, while reading the file shows the log as soon as it's written. It also avoids the sacct error "Slurm accounting storage is disabled", which happened on the AWS Slurm cluster I worked on.

We also fixed the Slurm provider to detect completed jobs by scanning for slurm-*.out log files when accounting is disabled (sacct didn't work because accounting storage is off, and squeue returns nothing, which is why tasks were stuck in LAUNCHING), enabling job statuses to update from LAUNCHING to COMPLETE.
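Roughly, the detection logic looks like this (a simplified sketch rather than the exact code; `run_ssh_command` and `known_job_ids` are placeholder names, not the provider's real API):

```python
import re

# Hypothetical sketch: run_ssh_command and known_job_ids are placeholders,
# not the provider's actual API.
def find_completed_job_ids(run_ssh_command, ssh_user, known_job_ids):
    # List slurm-<jobid>.out files written in the last day in the user's home dir.
    find_command = (
        f"find /home/{ssh_user} -maxdepth 1 -name 'slurm-*.out' "
        f"-type f -mtime -1 2>/dev/null"
    )
    output = run_ssh_command(find_command)

    completed = set()
    for line in output.splitlines():
        # Paths look like /home/<user>/slurm-1234.out; pull out the job ID.
        match = re.search(r"slurm-(\d+)\.out$", line.strip())
        if not match:
            continue
        job_id = match.group(1)
        # Only consider jobs we launched ourselves, so other users' jobs on
        # the same cluster are never touched.
        if job_id in known_job_ids:
            completed.add(job_id)
    return completed
```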

@codecov-commenter commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 0% with 51 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| api/transformerlab/compute_providers/slurm.py | 0.00% | 47 Missing ⚠️ |
| api/transformerlab/routers/experiment/jobs.py | 0.00% | 4 Missing ⚠️ |


@deep1401 previously approved these changes Dec 17, 2025
@deep1401 dismissed their stale review December 17, 2025 21:45

Dismissing my review since new changes seem to have been added and I'm not sure they're related to the provider log fix.

@deep1401 (Member) left a comment:

I have a concern about marking non-lab-SDK jobs as COMPLETE.

active_job_count += 1

# Step 2: Find completed jobs by scanning log files
find_command = f"find /home/{self.ssh_user} -maxdepth 1 -name 'slurm-*.out' -type f -mtime -1 2>/dev/null"
@deep1401 (Member) commented:

I don't completely understand this one, but maybe you could help?
I'm just thinking of the case where multiple people are using the same Slurm cluster: how do we decide which of two jobs in the LAUNCHING state to mark as complete if both were launched on the same cluster? The easier solution, I think, would be to store the Slurm job ID in the job data (I think we already do this in the provider launch result, right?) and then mark only that particular job as complete.
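Something like this is what I'm imagining (just a sketch; the `job_data["slurm_job_id"]` field and the job dict shape are assumptions, not our actual schema):

```python
# Sketch: complete only the specific jobs whose stored Slurm job ID has a
# matching finished log file, instead of everything stuck in LAUNCHING.
# The "slurm_job_id" field name is an assumption about the launch result.
def mark_completed_jobs(launching_jobs, completed_slurm_ids):
    for job in launching_jobs:
        slurm_id = str(job.get("job_data", {}).get("slurm_job_id", ""))
        if slurm_id and slurm_id in completed_slurm_ids:
            # Only this job changes state; other users' jobs are unaffected.
            job["status"] = "COMPLETE"
    return launching_jobs
```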

@mina-parham (Contributor, author) replied:

Correct me if I’m wrong, but when two people launch two jobs on the same cluster, each job should have a different job_id, right? For example, Mina runs a job on a cluster with provider ID 4, and Deep runs one on provider ID 5. So when marking jobs as complete, it should only complete Mina’s job and not affect anything else, since Mina couldn’t have a job with a provider job ID of 5 anyway (I’m assuming this is how the code already works; otherwise, we’d have another problem here).

Does what you’re suggesting mean we need to change the logic already implemented in check-state and list_jobs? The check-state command gets the list of jobs and marks them as complete. My implementation tries to fix the part where the squeue command isn’t working and returns empty.
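For context, the flow I'm describing looks roughly like this (a sketch with made-up helper names like `query_squeue` and `scan_log_files`, not the real implementation):

```python
# Sketch of the check-state fallback: trust squeue when it reports jobs, and
# only fall back to the log-file scan when squeue comes back empty and sacct
# is unavailable. query_squeue / scan_log_files are illustrative names only.
def check_state(provider, launching_jobs):
    running_ids = provider.query_squeue()  # was returning empty on this cluster
    if running_ids:
        return launching_jobs  # normal path: squeue still reports the jobs

    # Accounting storage is disabled, so sacct can't give the final state;
    # fall back to scanning slurm-*.out files for the jobs we launched.
    known_ids = {str(j["job_data"]["slurm_job_id"]) for j in launching_jobs}
    completed = provider.scan_log_files(known_ids)
    for job in launching_jobs:
        if str(job["job_data"]["slurm_job_id"]) in completed:
            job["status"] = "COMPLETE"
    return launching_jobs
```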
