Increase max gpu utilization for 70b models #517
Conversation
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py (outdated review thread, resolved)
Force-pushed from 5a2bb87 to a59cf19
Force-pushed from bc50329 to 1e17ab4
lgtm
@@ -2198,6 +2199,27 @@ async def execute(self, user: User, request: ModelDownloadRequest) -> ModelDownl
         return ModelDownloadResponse(urls=urls)


+@dataclass
+class VLLMEngineArgs:
Hm I know this is by no means the main offender, but implementation specifics like vLLM aren't supposed to go into the use case layer. Granted, that'd require another layer, which I suspect @yunfeng-scale would find perfunctory 😁
I guess I could just call it LLMEngineArgs. It seems right now we only support batch inference w/ vLLM, so we could try to do a proper abstraction when we decide we need to support it for a different engine?
Yeah I think this is ok for now.
oh 😅 you had a good point, the current code structure doesn't completely fit into clean architecture. In that sense we might want to move all this framework-specific code to another layer
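For illustration only, here is a minimal sketch of the kind of separation the thread is gesturing at: a generic `LLMEngineArgs` container that the use case layer can depend on, with the vLLM-specific translation kept in an infrastructure/inference layer. Aside from `VLLMEngineArgs` (which appears in the diff), every name and field below is an assumption, not the repo's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LLMEngineArgs:
    """Engine-agnostic knobs the use case layer is allowed to know about.

    Hypothetical sketch: field names and defaults are illustrative.
    """
    gpu_memory_utilization: float = 0.9
    max_model_len: Optional[int] = None


def to_vllm_cli_args(args: LLMEngineArgs) -> List[str]:
    """Map the generic args onto vLLM-specific CLI flags.

    In the layering discussed above, this mapping would live outside the
    use case layer so that layer never names vLLM directly.
    """
    flags = [f"--gpu-memory-utilization={args.gpu_memory_utilization}"]
    if args.max_model_len is not None:
        flags.append(f"--max-model-len={args.max_model_len}")
    return flags
```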
Pull Request Summary
What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.
Raise max GPU memory utilization to 0.95 for 70B models in an attempt to address OOM issues.
https://linear.app/scale-epd/issue/MLI-2309/use-095-gpu-memory-utilization-for-70b-models
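As a rough illustration of the change described above (a sketch only; the helper name, the substring check, and the 0.9 default are assumptions, not the PR's actual diff):

```python
# Hypothetical sketch of bumping GPU memory utilization for large models.
def get_gpu_memory_utilization(model_name: str) -> float:
    # 70B-class models are the ones hitting OOM, so leave them less headroom.
    if "70b" in model_name.lower():
        return 0.95
    return 0.9
```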
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.
Published a test Docker image for batch_inference. Tested with an API request against the local gateway: job ft-cp21h54gfe6g02mlqikg