Conversation

@seanshi-scale (Contributor) commented Oct 11, 2023

Before, jobs would immediately be reported as RUNNING even if no pods had been allocated, or if all pods were still PENDING.

Testing:
Started the gateway on a devbox, made requests to get/list docker image batch jobs on our clusters, and saw that a job previously shown as RUNNING correctly changed to PENDING.
Also added unit tests.
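For context on the bug: on a V1Job, status.active counts pods that have been created but not yet terminated, which includes pods still in the Pending phase, so treating any nonzero active count as RUNNING was optimistic. A minimal sketch of the old behavior (simplified for illustration, not the exact code in this PR):

```python
from kubernetes.client import V1Job


def old_parse_status(job: V1Job) -> str:
    # "active" counts non-terminated pods, Pending ones included, so a job
    # whose pods were all unscheduled or Pending still reported RUNNING here.
    if job.status.active is not None and job.status.active > 0:
        return "RUNNING"
    return "PENDING"
```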

@seanshi-scale seanshi-scale self-assigned this Oct 11, 2023
@seanshi-scale seanshi-scale marked this pull request as ready for review October 11, 2023 23:58
@@ -97,14 +99,30 @@ async def list_jobs(
logger.exception("Got an exception when trying to list the Jobs")
raise EndpointResourceInfraException from exc

core_client = get_kubernetes_core_client()
@seanshi-scale (Contributor Author) commented:

Apparently this fn is used in the list docker image batch jobs API call for some reason, which feels wrong to me. Not sure how we didn't catch this before, but oh well.

Contributor replied:

Could you link to the code that is the issue? Is the problem that we are breaking abstraction layers?

@seanshi-scale (Contributor Author) replied:

model_engine_server/domain/use_cases/batch_job_use_cases.py:ListDockerImageBatchJobV1UseCase.execute

This feels like we're breaking abstraction layers, to me at least (e.g. batch jobs and cron jobs should be different), although I guess that broken abstraction gets propagated through the API as well (since ListBatchJobs has a trigger_id parameter).

@yixu34 yixu34 (Member) left a comment:

Can we add some unit tests for these gateways? We could mock out the k8s layer, since we're at the end of the line anyway.
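For illustration, a minimal sketch of what such a test could look like, exercising the status parser directly with fake k8s objects. The import path is an assumption, and the two-argument parser signature is inferred from the `for pod in pods` loop in the diff below:

```python
from types import SimpleNamespace

# Assumed import path; adjust to the module that actually defines these.
from model_engine_server.infra.gateways.live_docker_image_batch_job_gateway import (
    BatchJobStatus,
    _parse_job_status_from_k8s_obj,
)


def _fake_job(active=None, ready=None, succeeded=None, failed=None):
    # Hypothetical helper: stands in for a V1Job, exposing only the
    # status fields the parser reads.
    return SimpleNamespace(
        status=SimpleNamespace(
            active=active, ready=ready, succeeded=succeeded, failed=failed
        )
    )


def _fake_pod(phase):
    # Hypothetical helper: stands in for a V1Pod with just status.phase.
    return SimpleNamespace(status=SimpleNamespace(phase=phase))


def test_active_job_with_only_pending_pods_reports_pending():
    job = _fake_job(active=1)
    pods = [_fake_pod("Pending")]
    assert _parse_job_status_from_k8s_obj(job, pods) == BatchJobStatus.PENDING
```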

@@ -94,10 +97,27 @@ def _parse_job_status_from_k8s_obj(job: V1Job) -> BatchJobStatus:
if status.ready is not None and status.ready > 0:
return BatchJobStatus.RUNNING # empirically this doesn't happen
if status.active is not None and status.active > 0:
return BatchJobStatus.RUNNING # TODO this might be a mix of pending and running
for pod in pods:
Contributor commented:

Might be worth leaving a comment here on why a single Job resource could have multiple pods, to clarify the logic here.
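For reference, the Job controller can create multiple pods for one Job, both from retries under backoffLimit and from parallelism > 1, so the status check has to look at every pod the Job owns. A rough sketch of that lookup using the standard `job-name` label the Job controller applies (sync client shown for brevity; not necessarily this PR's exact query):

```python
from kubernetes import client


def pods_for_job(core: client.CoreV1Api, namespace: str, job_name: str):
    # A Job may own several pods (retries under backoffLimit, or
    # parallelism > 1), so list everything carrying the Job's label.
    return core.list_namespaced_pod(
        namespace=namespace,
        label_selector=f"job-name={job_name}",
    ).items


def any_pod_running(pods) -> bool:
    # Pod phases are Pending/Running/Succeeded/Failed/Unknown; only a pod
    # that is actually Running should move the job status past PENDING.
    return any(pod.status.phase == "Running" for pod in pods)
```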

@seanshi-scale seanshi-scale merged commit d30a1a5 into main Oct 12, 2023
@seanshi-scale seanshi-scale deleted the seanshi/dibjb-pending-actually branch October 12, 2023 21:56