Properly return PENDING status for docker image batch jobs/fine tune jobs #318
Conversation
@@ -97,14 +99,30 @@ async def list_jobs(
    logger.exception("Got an exception when trying to list the Jobs")
    raise EndpointResourceInfraException from exc

    core_client = get_kubernetes_core_client()
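(For context on the core client line above: the usual way to recover a Job's pods, and hence whether it is really running or still pending, is to list pods by the job-name label that the Job controller puts on every pod it creates. Below is a minimal sketch using the official Kubernetes Python client; the function name and client setup are illustrative, not the gateway's actual code.)

    from kubernetes import client, config

    def list_pods_for_job(job_name: str, namespace: str) -> list:
        # The Job controller labels each pod it creates with job-name=<job name>,
        # so a label selector is enough to find all pods belonging to one Job.
        config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
        core_client = client.CoreV1Api()
        pod_list = core_client.list_namespaced_pod(
            namespace=namespace,
            label_selector=f"job-name={job_name}",
        )
        return pod_list.items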
Apparently this fn is used in the list docker image batch jobs API call for some reason. Feels wrong to me; not sure how we didn't catch this before, but oh well.
Could you link to the code that is the issue? Is the problem that we are breaking abstraction layers?
model_engine_server/domain/use_cases/batch_job_use_cases.py:ListDockerImageBatchJobV1UseCase.execute
This feels like we're breaking abstraction layers, to me at least (e.g. batch jobs and cron jobs should be different), although I guess that broken abstraction gets propagated through the API as well, since ListBatchJobs has a trigger_id parameter.
Can we add some unit tests for these gateways? We could mock out the k8s layer, since we're at the end of the line anyway.
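(A hedged sketch of what such a test might look like, with the Kubernetes core client mocked out so no real cluster is needed; job_status_from_pods below is an illustrative stand-in for the gateway's actual status mapping, not its real code.)

    from unittest.mock import MagicMock

    from kubernetes.client import V1Pod, V1PodStatus

    def job_status_from_pods(active_count: int, pods: list) -> str:
        # Illustrative stand-in: an active Job whose pods are all still Pending
        # should be reported as PENDING rather than RUNNING.
        if active_count > 0 and pods and all(p.status.phase == "Pending" for p in pods):
            return "PENDING"
        return "RUNNING" if active_count > 0 else "UNKNOWN"

    def test_active_job_with_only_pending_pods_is_pending():
        # Mock the k8s core client so the test stays at the unit level.
        fake_core_client = MagicMock()
        fake_core_client.list_namespaced_pod.return_value.items = [
            V1Pod(status=V1PodStatus(phase="Pending"))
        ]
        pods = fake_core_client.list_namespaced_pod("my-namespace").items
        assert job_status_from_pods(active_count=1, pods=pods) == "PENDING"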
@@ -94,10 +97,27 @@ def _parse_job_status_from_k8s_obj(job: V1Job) -> BatchJobStatus:
    if status.ready is not None and status.ready > 0:
        return BatchJobStatus.RUNNING  # empirically this doesn't happen
    if status.active is not None and status.active > 0:
        return BatchJobStatus.RUNNING  # TODO this might be a mix of pending and running
    for pod in pods:
Might be worth leaving a comment here on why a single Job resource can have multiple pods, to clarify the logic.
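(For reference: a single Job can own several pods at once, since parallelism > 1 runs pods concurrently and backoffLimit retries create replacement pods after failures, so old and new pods can briefly coexist. Below is a hedged sketch of how the pod phases might be folded into the status decision; it mirrors the idea in the diff above, not the exact implementation.)

    from enum import Enum

    from kubernetes.client import V1Job, V1Pod

    class BatchJobStatus(str, Enum):
        PENDING = "PENDING"
        RUNNING = "RUNNING"

    def status_for_active_job(job: V1Job, pods: list) -> BatchJobStatus:
        # Intended for the case where job.status.active > 0: "active" only means
        # the Job controller has created pods; those pods may still be Pending
        # (waiting for a node, pulling an image, etc.), so inspect them directly.
        status = job.status
        if status.ready is not None and status.ready > 0:
            return BatchJobStatus.RUNNING
        for pod in pods:
            if pod.status is not None and pod.status.phase == "Running":
                return BatchJobStatus.RUNNING
        # No pod has started running yet, so the job is effectively PENDING.
        return BatchJobStatus.PENDING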
Before this change, jobs would immediately be reported as RUNNING even if no pods had been allocated, or if the pods were still PENDING.

Testing:
Started the gateway on a devbox, made requests to get/list docker image batch jobs on our clusters, and saw that a job previously reported as RUNNING now correctly showed as PENDING.
Also added unit tests.