## Changes
- Optimize chat streaming by updating only the last message during streaming and by restoring the dynamic UI update speed. These changes keep streaming smooth even at 100k tokens of context.
- Add a CUDA 12.8 installation option for RTX 50XX NVIDIA Blackwell support (ExLlamaV2/V3 and Transformers) (#7011). Thanks @okazaki10
- Make UI settings persistent. Any value you change, including sliders in the Parameters tab, chat mode, character, character description fields, etc., now gets automatically saved to `user_data/settings.yaml`. If you close the UI and launch it again, the values will be where you left them. The Model tab is the exception, since it's managed by command-line flags and its own "Save settings" menu. A minimal sketch of the autosave behavior appears after this list.
- Make the dark theme darker and more aesthetic.
- Add support for .docx attachments.
- Add 🗑️ buttons for easily deleting individual past chats.
- Add new buttons: "Restore preset", "Neutralize samplers", "Restore character".
- Reorganize the Parameters tab with parameters that get saved to presets on the left and everything else on the right.
- Add Qwen3 presets (Thinking and No Thinking), and make `Qwen3 - Thinking` the new default preset. If you update a portable install manually by moving `user_data`, you will not have these files; download them from here if you are interested.
- Add the model name to each message's metadata, and show it in the UI when hovering over the date/time for a message.
- Scroll up automatically to show the whole editing area when editing a message.
- Add an option to turn long pasted text into an attachment automatically. This is disabled by default and can be enabled in the Session tab.
- Extract the text of web search results with formatting instead of putting all the text on a single line.
- Show llama.cpp prompt processing progress on a single line.
- Add informative tooltips when hovering over the file upload icon and the web search checkbox.
- Several small UI optimizations.
- Several small UI style improvements.
- Use `user_data/cache/gradio` for Gradio temporary files instead of the system's temporary folder.
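
As referenced above, here is a minimal sketch of the settings autosave, assuming PyYAML and a flat key/value layout; the function name, keys, and merge logic are illustrative, not the project's actual implementation:

```python
from pathlib import Path

import yaml  # PyYAML

SETTINGS_PATH = Path("user_data/settings.yaml")

def autosave_setting(key: str, value) -> None:
    """Merge a single changed UI value into settings.yaml (hypothetical helper)."""
    settings = {}
    if SETTINGS_PATH.exists():
        settings = yaml.safe_load(SETTINGS_PATH.read_text()) or {}
    settings[key] = value  # e.g. "temperature", "character", "mode"
    SETTINGS_PATH.parent.mkdir(parents=True, exist_ok=True)
    SETTINGS_PATH.write_text(yaml.safe_dump(settings, sort_keys=False))

# Called after each UI change; on the next launch the UI reads the file back
# and restores every value, e.g.:
# autosave_setting("temperature", 0.7)
```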
## Bug fixes
- Filter out failed web search downloads from attachments.
- Remove quotes from LLM-generated web search queries.
- Fix the progress bar not appearing in the UI when downloading a model.
- Fix the text for a sent message reappearing in the input area when the page is reloaded.
- Fix selecting the next chat on the list when deleting a chat with an active search.
- Fix light/dark theme persistence across page reloads.
- Re-highlight code blocks when switching light/dark themes to fix styling issues.
- Stop the llama.cpp model during graceful shutdown to avoid an error message (#7042). Thanks @leszekhanusz
- Check `.attention.head_count` if `.attention.head_count_kv` doesn't exist for the VRAM calculation (#7048); a sketch of this fallback appears after this list. Thanks @miriameng
- Fix failure when --nowebui is called without --api (#7055). Thanks @miriameng
- Fix "continue" and "Start reply with" when using translation extensions (#6944). Thanks @mykeehu
- Load JS and CSS sources in UTF-8 (#7059). Thanks @LawnMauer
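
As referenced above, a sketch of the `.attention.head_count` fallback from the VRAM-calculation fix, assuming GGUF metadata loaded into a plain dict; the key names follow the GGUF convention (`<arch>.attention.head_count[_kv]`), but the function itself is hypothetical:

```python
def get_kv_head_count(metadata: dict, arch: str = "llama") -> int:
    """Return the KV head count used when estimating KV-cache VRAM.

    Models without grouped-query attention may omit head_count_kv from
    their GGUF metadata; for those, the KV head count equals head_count.
    """
    kv_key = f"{arch}.attention.head_count_kv"
    q_key = f"{arch}.attention.head_count"
    return metadata.get(kv_key, metadata.get(q_key))

# Example: a GGUF without the _kv key falls back to head_count.
print(get_kv_head_count({"llama.attention.head_count": 32}))  # 32
```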
## Backend updates
- Bump llama.cpp to ggml-org/llama.cpp@2bb0467
- Bump ExLlamaV3 to 0.0.3
- Bump ExLlamaV2 to 0.3.1
## Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
1. Download and unzip the latest version.
2. Replace the `user_data` folder with the one from your existing install. All your settings and models will carry over.
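
For illustration, the same update expressed as a small Python snippet; the folder names are hypothetical, and the copy simply carries your old `user_data` into the new install:

```python
import shutil
from pathlib import Path

old_install = Path("text-generation-webui-old")  # your existing install
new_install = Path("text-generation-webui-new")  # freshly unzipped version

# Discard the empty user_data shipped with the new build, then copy yours over.
shutil.rmtree(new_install / "user_data", ignore_errors=True)
shutil.copytree(old_install / "user_data", new_install / "user_data")
```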