
Conversation

@dmchoiboi (Collaborator) commented May 14, 2024

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Raise max GPU memory utilization to 0.95 for 70B models in an attempt to address OOM issues.

https://linear.app/scale-epd/issue/MLI-2309/use-095-gpu-memory-utilization-for-70b-models
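
For context, a minimal sketch of the kind of change involved, assuming the utilization fraction is chosen from the model name; the helper, constants, and size check below are hypothetical, not the PR's actual code:

```python
# Illustrative sketch only -- names, defaults, and the model-name check are assumptions.
DEFAULT_GPU_MEMORY_UTILIZATION = 0.9       # assumed previous default
LARGE_MODEL_GPU_MEMORY_UTILIZATION = 0.95  # raised to reduce OOMs for 70B models


def get_gpu_memory_utilization(model_name: str) -> float:
    """Pick the vLLM gpu_memory_utilization fraction based on model size."""
    if "70b" in model_name.lower():
        return LARGE_MODEL_GPU_MEMORY_UTILIZATION
    return DEFAULT_GPU_MEMORY_UTILIZATION
```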

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

Published a test Docker image for batch_inference and tested with an API request against a local gateway: job ft-cp21h54gfe6g02mlqikg.

@dmchoiboi dmchoiboi requested a review from yunfeng-scale May 14, 2024 16:59
@dmchoiboi dmchoiboi force-pushed the michaelchoi/mli-2309-use-095-gpu-memory-utilization-for-70b-models branch from 5a2bb87 to a59cf19 Compare May 14, 2024 18:03
@dmchoiboi dmchoiboi force-pushed the michaelchoi/mli-2309-use-095-gpu-memory-utilization-for-70b-models branch from bc50329 to 1e17ab4 Compare May 14, 2024 20:00
@yunfeng-scale (Contributor)

lgtm

@@ -2198,6 +2199,27 @@ async def execute(self, user: User, request: ModelDownloadRequest) -> ModelDownl
return ModelDownloadResponse(urls=urls)


@dataclass
class VLLMEngineArgs:
Member

Hm I know this is by no means the main offender, but implementation specifics like vLLM aren't supposed to go into the use case layer. Granted, that'd require another layer, which I suspect @yunfeng-scale would find perfunctory 😁

Collaborator Author

I guess I could just call it LLMEngineArgs. It seems that right now we only support batch inference with vLLM, so we could do a proper abstraction when we decide we need to support a different engine?

Member

Yeah I think this is ok for now.

Contributor

Oh 😅 you had a good point: the current code structure does not completely fit into clean architecture. In that sense, we might want to move all this framework-specific code to another layer.
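
For readers following along, a rough sketch of the kind of engine-args dataclass discussed above; the fields and defaults are assumptions based on the PR summary, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Optional


# Illustrative only: field names and defaults are assumptions, not the PR's code.
@dataclass
class VLLMEngineArgs:
    # vLLM-specific knobs grouped in one place, e.g. 0.95 for 70B models
    gpu_memory_utilization: Optional[float] = None
    max_model_len: Optional[int] = None  # hypothetical additional field
```

Keeping engine-specific settings in a dataclass like this makes it easier to later move them out of the use case layer, as suggested in the thread.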

@dmchoiboi dmchoiboi merged commit fbe7417 into main May 15, 2024
@dmchoiboi dmchoiboi deleted the michaelchoi/mli-2309-use-095-gpu-memory-utilization-for-70b-models branch May 15, 2024 02:46