Defer model loading to support GPU selection, multi-GPU inference, and shorter import time #2600
Existing problems
Right now, any small tool built on top of the GPT-SoVITS inference module effectively has to import inference_webui.py. But the models are loaded onto the GPU at import time, which causes several problems: the inference GPU cannot be chosen, multi-GPU inference is impossible, and import itself is slow.
Changes
This PR solves the problems above by consolidating all model-loading logic into a single function (def load_models(device_override):). A device_override parameter is also added to load_models and get_tts_wav to select the inference device. device_override should be a CUDA device string such as "cuda:0"/"cuda:1"; it defaults to None, in which case the existing logic applies (module-level globals, currently "cuda" or "cpu"). No other function or file names were changed.
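A minimal sketch of the loading pattern described above, assuming the module-level-globals design the PR mentions. Only the names load_models and device_override come from the PR; the placeholder nn.Linear modules stand in for the real SoVITS/GPT checkpoint loading:

```python
import torch
import torch.nn as nn

# Module-level state that used to be populated at import time.
# After this PR it stays empty until load_models() is called explicitly.
device = "cuda" if torch.cuda.is_available() else "cpu"
vq_model = None
t2s_model = None

def load_models(device_override=None):
    """Load all inference models onto one device.

    device_override: CUDA device string such as "cuda:0" or "cuda:1".
    None keeps the module-level default ("cuda" or "cpu").
    """
    global device, vq_model, t2s_model
    if device_override is not None:
        device = device_override
    # Placeholders for the real SoVITS / GPT model construction.
    vq_model = nn.Linear(4, 4).to(device)
    t2s_model = nn.Linear(4, 4).to(device)
```

Importing the module now costs nothing GPU-wise; the first load_models() call decides where the weights live.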
Compatibility
inference_cli and inference_gui received minor edits for compatibility with this PR. inference_webui_fast was left unchanged, since it does not appear to reuse the inference_webui logic.
Usage and future optimization
Single-GPU inference only requires adding one line, load_models(), before inference; nothing else changes. For multi-GPU inference, start one process per GPU, and have each process call load_models() + get_tts_wav() with its corresponding CUDA device string (see the sketch below). For compatibility, and given @XXXXRT666's note that "inference_webui is not read via import", this PR takes the smallest possible change. Follow-up work should clean up the code structure, including splitting out the inference logic, with the goal of running it under threading to reduce CPU memory usage.
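A hedged sketch of the one-process-per-GPU pattern described above. load_models, get_tts_wav, device_override, and the inference_webui module name come from this PR; the worker layout and the call arguments are illustrative:

```python
import multiprocessing as mp

def worker(cuda_dev, texts):
    # Each process owns one GPU and loads its own copy of the models.
    import inference_webui as iw
    iw.load_models(device_override=cuda_dev)
    for text in texts:
        # Illustrative call: the real get_tts_wav takes more arguments
        # (reference audio, prompt text, languages, ...).
        iw.get_tts_wav(text, device_override=cuda_dev)

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required when child processes use CUDA
    jobs = [("cuda:0", ["text for GPU 0"]), ("cuda:1", ["text for GPU 1"])]
    procs = [mp.Process(target=worker, args=job) for job in jobs]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Each process pays the full model footprint in RAM, which is exactly the cost the proposed threading refactor would remove.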