
Conversation

@mina-parham (Contributor) commented Dec 17, 2025

To get logs, I now read slurm-{job-id}.out directly: sacct waits for Slurm to release the nodes, which took a long time in some experiments, while reading the file shows the log as soon as it's written. It also avoids the sacct error "Slurm accounting storage is disabled", which happened on the AWS Slurm cluster I worked on.

We also fixed the Slurm provider to detect completed jobs by scanning for slurm-*.out log files when accounting is disabled (sacct didn't work because accounting storage is off, and squeue returns nothing, which is why tasks were stuck in LAUNCHING), enabling job statuses to update from LAUNCHING to COMPLETE.
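Roughly, the detection logic looks like this (a simplified sketch rather than the exact code; `run_ssh_command` and `known_job_ids` are placeholder names, not the provider's real API):

```python
import re

# Hypothetical sketch: run_ssh_command and known_job_ids are placeholders,
# not the provider's actual API.
def find_completed_job_ids(run_ssh_command, ssh_user, known_job_ids):
    # List slurm-<jobid>.out files written in the last day in the user's home dir.
    find_command = (
        f"find /home/{ssh_user} -maxdepth 1 -name 'slurm-*.out' "
        f"-type f -mtime -1 2>/dev/null"
    )
    output = run_ssh_command(find_command)

    completed = set()
    for line in output.splitlines():
        # Paths look like /home/<user>/slurm-1234.out; pull out the job ID.
        match = re.search(r"slurm-(\d+)\.out$", line.strip())
        if not match:
            continue
        job_id = match.group(1)
        # Only consider jobs we launched ourselves, so other users' jobs on
        # the same cluster are never touched.
        if job_id in known_job_ids:
            completed.add(job_id)
    return completed
```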

@codecov-commenter commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 0% with 51 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| api/transformerlab/compute_providers/slurm.py | 0.00% | 47 Missing ⚠️ |
| api/transformerlab/routers/experiment/jobs.py | 0.00% | 4 Missing ⚠️ |


@deep1401 previously approved these changes Dec 17, 2025
@deep1401 dismissed their stale review December 17, 2025 21:45

Dismissing my review since new changes seem to have been added and I'm not sure they're related to the provider log fix.

@deep1401 (Member) left a comment:

I have a concern about marking non-lab-SDK jobs as COMPLETE.

active_job_count += 1

# Step 2: Find completed jobs by scanning log files
find_command = f"find /home/{self.ssh_user} -maxdepth 1 -name 'slurm-*.out' -type f -mtime -1 2>/dev/null"
@deep1401 (Member) commented:

I don't completely understand this one, but maybe you could help?
I'm just thinking of the case where multiple people are using the same Slurm cluster: how do we decide which of two jobs in the LAUNCHING state to mark as complete if both were launched on the same cluster? The easier solution, I think, would be to store the Slurm job ID in the job data (I think we already do this in the provider launch result, right?) and then mark only that particular job as complete.
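Something like this is what I'm imagining (just a sketch; the `job_data["slurm_job_id"]` field and the job dict shape are assumptions, not our actual schema):

```python
# Sketch: complete only the specific jobs whose stored Slurm job ID has a
# matching finished log file, instead of everything stuck in LAUNCHING.
# The "slurm_job_id" field name is an assumption about the launch result.
def mark_completed_jobs(launching_jobs, completed_slurm_ids):
    for job in launching_jobs:
        slurm_id = str(job.get("job_data", {}).get("slurm_job_id", ""))
        if slurm_id and slurm_id in completed_slurm_ids:
            # Only this job changes state; other users' jobs are unaffected.
            job["status"] = "COMPLETE"
    return launching_jobs
```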

@mina-parham (Contributor, author) replied:

Correct me if I’m wrong, but when two people launch two jobs on the same cluster, each job should have a different job_id, right? For example, Mina runs a job on a cluster with provider ID 4, and Deep runs one on provider ID 5. So when marking jobs as complete, it should only complete Mina’s job and not affect anything else, since Mina couldn’t have a job with a provider job ID of 5 anyway (I’m assuming this is how the code already works; otherwise, we’d have another problem here).

Does what you’re suggesting mean we need to change the logic already implemented in check-state and list_jobs? The check-state command gets the list of jobs and marks them as complete. My implementation tries to fix the part where the squeue command isn’t working and returns empty.
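For context, the flow I'm describing looks roughly like this (a sketch with made-up helper names like `query_squeue` and `scan_log_files`, not the real implementation):

```python
# Sketch of the check-state fallback: trust squeue when it reports jobs, and
# only fall back to the log-file scan when squeue comes back empty and sacct
# is unavailable. query_squeue / scan_log_files are illustrative names only.
def check_state(provider, launching_jobs):
    running_ids = provider.query_squeue()  # was returning empty on this cluster
    if running_ids:
        return launching_jobs  # normal path: squeue still reports the jobs

    # Accounting storage is disabled, so sacct can't give the final state;
    # fall back to scanning slurm-*.out files for the jobs we launched.
    known_ids = {str(j["job_data"]["slurm_job_id"]) for j in launching_jobs}
    completed = provider.scan_log_files(known_ids)
    for job in launching_jobs:
        if str(job["job_data"]["slurm_job_id"]) in completed:
            job["status"] = "COMPLETE"
    return launching_jobs
```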
