Properly return PENDING status for docker image batch jobs/fine tune jobs #318
Conversation
@@ -97,14 +99,30 @@ async def list_jobs(
    logger.exception("Got an exception when trying to list the Jobs")
    raise EndpointResourceInfraException from exc

    core_client = get_kubernetes_core_client()
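(For context on the core client line above: the usual way to recover a Job's pods, and hence whether it is really running or still pending, is to list pods by the job-name label that the Job controller puts on every pod it creates. Below is a minimal sketch using the official Kubernetes Python client; the function name and client setup are illustrative, not the gateway's actual code.)

    from kubernetes import client, config

    def list_pods_for_job(job_name: str, namespace: str) -> list:
        # The Job controller labels each pod it creates with job-name=<job name>,
        # so a label selector is enough to find all pods belonging to one Job.
        config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
        core_client = client.CoreV1Api()
        pod_list = core_client.list_namespaced_pod(
            namespace=namespace,
            label_selector=f"job-name={job_name}",
        )
        return pod_list.items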
Apparently this fn is used in the list docker image batch jobs API call for some reason. Feels wrong to me; not sure how we didn't catch this before, but oh well.
Could you link to the code that is the issue? Is the problem that we are breaking abstraction layers?
model_engine_server/domain/use_cases/batch_job_use_cases.py:ListDockerImageBatchJobV1UseCase.execute
This feels like we're breaking abstraction layers, to me at least (e.g. batch jobs and cron jobs should be different), although I guess that broken abstraction gets propagated through the API as well, since ListBatchJobs has a trigger_id parameter.
Can we add some unit tests for these gateways? We could mock out the k8s layer, since we're at the end of the line anyway.
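(A hedged sketch of what such a test might look like, with the Kubernetes core client mocked out so no real cluster is needed; job_status_from_pods below is an illustrative stand-in for the gateway's actual status mapping, not its real code.)

    from unittest.mock import MagicMock

    from kubernetes.client import V1Pod, V1PodStatus

    def job_status_from_pods(active_count: int, pods: list) -> str:
        # Illustrative stand-in: an active Job whose pods are all still Pending
        # should be reported as PENDING rather than RUNNING.
        if active_count > 0 and pods and all(p.status.phase == "Pending" for p in pods):
            return "PENDING"
        return "RUNNING" if active_count > 0 else "UNKNOWN"

    def test_active_job_with_only_pending_pods_is_pending():
        # Mock the k8s core client so the test stays at the unit level.
        fake_core_client = MagicMock()
        fake_core_client.list_namespaced_pod.return_value.items = [
            V1Pod(status=V1PodStatus(phase="Pending"))
        ]
        pods = fake_core_client.list_namespaced_pod("my-namespace").items
        assert job_status_from_pods(active_count=1, pods=pods) == "PENDING"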
@@ -94,10 +97,27 @@ def _parse_job_status_from_k8s_obj(job: V1Job) -> BatchJobStatus:
    if status.ready is not None and status.ready > 0:
        return BatchJobStatus.RUNNING  # empirically this doesn't happen
    if status.active is not None and status.active > 0:
        return BatchJobStatus.RUNNING  # TODO this might be a mix of pending and running
    for pod in pods:
Might be worth leaving a comment here on why a single Job resource can have multiple pods, to clarify the logic.
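(For reference: a single Job can own several pods at once, since parallelism > 1 runs pods concurrently and backoffLimit retries create replacement pods after failures, so old and new pods can briefly coexist. Below is a hedged sketch of how the pod phases might be folded into the status decision; it mirrors the idea in the diff above, not the exact implementation.)

    from enum import Enum

    from kubernetes.client import V1Job, V1Pod

    class BatchJobStatus(str, Enum):
        PENDING = "PENDING"
        RUNNING = "RUNNING"

    def status_for_active_job(job: V1Job, pods: list) -> BatchJobStatus:
        # Intended for the case where job.status.active > 0: "active" only means
        # the Job controller has created pods; those pods may still be Pending
        # (waiting for a node, pulling an image, etc.), so inspect them directly.
        status = job.status
        if status.ready is not None and status.ready > 0:
            return BatchJobStatus.RUNNING
        for pod in pods:
            if pod.status is not None and pod.status.phase == "Running":
                return BatchJobStatus.RUNNING
        # No pod has started running yet, so the job is effectively PENDING.
        return BatchJobStatus.PENDING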
Before this change, jobs would immediately be reported as RUNNING even if no pods had been allocated, or if the pods were still PENDING.

Testing:
Started the gateway on a devbox, made requests to get/list docker image batch jobs on our clusters, and saw that a job previously reported as RUNNING now correctly showed as PENDING.
Also added unit tests.