When I run a 30B model in 4-bit with --chat or --cai-chat, my generation speed is 4-6 times slower. The attached screenshots show the difference.


Is this a bug or a feature?
My system specs: 12700K, 32 GB RAM (also tried with 64 GB, same issue), 3090.
OS: Manjaro Linux, CUDA 11.7.
To sum up: with --chat or --cai-chat it generates at 3 it/s even with an empty context. Without those flags it runs at a stable 20 it/s.
musicurgy