Fix/provider log #1092
base: main
Conversation
Dismissing my review since new changes seem to be added and I'm not sure they're related to the provider log fix
deep1401 left a comment
I have a concern about marking non-lab-SDK jobs as COMPLETE.
active_job_count += 1

# Step 2: Find completed jobs by scanning log files
find_command = f"find /home/{self.ssh_user} -maxdepth 1 -name 'slurm-*.out' -type f -mtime -1 2>/dev/null"
I don't completely understand this one, but maybe you could help?
I'm just thinking of this in terms of multiple people using the same Slurm cluster: how do we avoid marking two jobs in LAUNCHING state as complete if both were launched on the same cluster? The easier solution, I think, would be to store the Slurm job id in the job data (I think we already do this in the provider launch result, right?) and then mark only that particular job as complete, as in the sketch below.
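Something like this is what I have in mind (a rough sketch only; the `slurm_job_id` key and the dict shape are assumptions, not the actual schema):

```python
# Sketch: mark only our own LAUNCHING jobs whose stored Slurm id is known to
# be finished. "slurm_job_id" is a hypothetical key name; use whatever the
# provider launch result actually stores.

def mark_completed_jobs(launching_jobs, completed_slurm_ids):
    """launching_jobs: list of dicts like {"id": ..., "status": ..., "job_data": {...}}
    completed_slurm_ids: set of Slurm job ids (strings) detected as finished
    """
    for job in launching_jobs:
        slurm_id = job.get("job_data", {}).get("slurm_job_id")
        if slurm_id is None:
            # Not launched through this provider (or the id wasn't recorded):
            # leave it alone rather than marking it COMPLETE.
            continue
        if str(slurm_id) in completed_slurm_ids:
            job["status"] = "COMPLETE"
    return launching_jobs
```

That way a scan of the shared cluster can never flip someone else's job, because only jobs whose recorded id shows up as finished get touched.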
Correct me if I’m wrong, but when two people launch two jobs on the same cluster, each job should have a different job_id, right? For example, Mina runs a job on a cluster with provider ID 4, and Deep runs one on provider ID 5. So when marking jobs as complete, it should only complete Mina’s job and not affect anything else, since Mina couldn’t have a job with a provider job ID of 5 anyway (I’m assuming this is how the code already works; otherwise, we’d have another problem here).
Does what you’re suggesting mean we need to change the logic already implemented in check-state and list_jobs? The check-state command gets the list of jobs and marks them as complete. My implementation tries to fix the part where the squeue command isn’t working and returns empty.
To get the log, I now read slurm-{job-id}.out directly, since sacct waits for Slurm to release the nodes, and in some experiments that took a long time. This way we show the log as soon as it's written, and we also avoid the sacct error `Slurm accounting storage is disabled`, which happened on the AWS Slurm cluster I worked on. We fixed the Slurm provider to detect completed jobs by scanning for slurm-*.out log files when accounting is disabled (sacct didn't work because of the storage setting, and squeue returned empty, which is why tasks got stuck in LAUNCHING), enabling job statuses to update from LAUNCHING to COMPLETE.
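Here's roughly the flow I mean, as a sketch only; `run_ssh_command` is a stand-in for whatever SSH helper the provider actually uses, and the paths are assumptions:

```python
import re

# Fallback sketch: when squeue returns nothing and sacct fails with
# "Slurm accounting storage is disabled", infer completion from recent
# slurm-<id>.out files and read the log straight from that file.

SLURM_OUT_RE = re.compile(r"slurm-(\d+)\.out$")

def completed_ids_from_logs(run_ssh_command, ssh_user):
    """Return the set of Slurm job ids that have a recent slurm-<id>.out file."""
    find_cmd = (
        f"find /home/{ssh_user} -maxdepth 1 -name 'slurm-*.out' "
        f"-type f -mtime -1 2>/dev/null"
    )
    output = run_ssh_command(find_cmd)  # stand-in for the provider's SSH helper
    ids = set()
    for line in output.splitlines():
        match = SLURM_OUT_RE.search(line.strip())
        if match:
            ids.add(match.group(1))
    return ids

def get_job_log(run_ssh_command, ssh_user, slurm_job_id):
    """Read slurm-<id>.out directly instead of waiting for sacct."""
    return run_ssh_command(f"cat /home/{ssh_user}/slurm-{slurm_job_id}.out")
```

Combined with the stored-job-id matching above, the scan only completes jobs whose recorded id actually shows up in a log file.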