Releases: oobabooga/text-generation-webui
v3.5
Changes
- Optimize chat streaming by only updating the last message during streaming and adding back the dynamic UI update speed. These changes make streaming smooth even at 100k tokens of context.
- Add a CUDA 12.8 installation option for RTX 50XX NVIDIA Blackwell support (ExLlamaV2/V3 and Transformers) (#7011). Thanks @okazaki10
- Make UI settings persistent. Any value you change, including sliders in the Parameters tab, chat mode, character, character description fields, etc., now gets automatically saved to `user_data/settings.yaml`. If you close the UI and launch it again, the values will be where you left them. The Model tab is left as an exception since it's managed by command-line flags and its own "Save settings" menu. (A small sketch for inspecting the saved file follows this list.)
- Make the dark theme darker and more aesthetic.
- Add support for .docx attachments.
- Add 🗑️ buttons for easily deleting individual past chats.
- Add new buttons: "Restore preset", "Neutralize samplers", "Restore character".
- Reorganize the Parameters tab with parameters that get saved to presets on the left and everything else on the right.
- Add Qwen3 presets (Thinking and No Thinking), and make `Qwen3 - Thinking` the new default preset. If you update a portable install manually by moving `user_data`, you will not have these files; download them from here if you are interested.
- Add the model name to each message's metadata, and show it in the UI when hovering the date/time for a message.
- Scroll up automatically to show the whole editing area when editing a message.
- Add an option to turn long pasted text into an attachment automatically. This is disabled by default and can be enabled in the Session tab.
- Extract the text of web searches with formatting instead of putting all text on a single line.
- Show llama.cpp prompt processing progress on a single line.
- Add informative tooltips when hovering the file upload icon and the web search checkbox.
- Several small UI optimizations.
- Several small UI style improvements.
- Use `user_data/cache/gradio` for Gradio temporary files instead of the system's temporary folder.
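The settings file mentioned above is plain YAML, so it can be inspected or backed up outside the UI. A minimal sketch, assuming PyYAML is available; the keys printed are whatever the UI happens to have saved, not a fixed schema:

```python
# Minimal sketch: inspect the automatically saved UI settings.
# Assumes PyYAML is installed; the keys are whatever the UI saved,
# not a guaranteed schema.
from pathlib import Path

import yaml

settings_path = Path("user_data/settings.yaml")

if settings_path.exists():
    settings = yaml.safe_load(settings_path.read_text(encoding="utf-8")) or {}
    for key, value in sorted(settings.items()):
        print(f"{key}: {value!r}")
else:
    print("No saved settings yet; change something in the UI first.")
```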
Bug fixes
- Filter out failed web search downloads from attachments.
- Remove quotes from LLM-generated web search queries.
- Fix the progress bar for downloading a model not appearing in the UI.
- Fix the text for a sent message reappearing in the input area when the page is reloaded.
- Fix selecting the next chat on the list when deleting a chat with an active search.
- Fix light/dark theme persistence across page reloads.
- Re-highlight code blocks when switching light/dark themes to fix styling issues.
- Stop llama.cpp model during graceful shutdown to avoid an error message (#7042). Thanks @leszekhanusz
- Check .attention.head_count if .attention.head_count_kv doesn't exist for VRAM calculation (#7048). Thanks @miriameng
- Fix failure when --nowebui is called without --api (#7055). Thanks @miriameng
- Fix "Continue" and "Start reply with" when using translation extensions (#6944). Thanks @mykeehu
- Load JS and CSS sources in UTF-8 (#7059). Thanks @LawnMauer
Backend updates
- Bump llama.cpp to ggml-org/llama.cpp@2bb0467
- Bump ExLlamaV3 to 0.0.3
- Bump ExLlamaV2 to 0.3.1
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.4.1

Changes
- Add attachments support (text files, PDF documents) (#7005).
  - This is not RAG. The attachment gets fully added to the prompt! (A conceptual sketch follows this list.)
- Add a web search feature (#7023). The search query is generated by the LLM based on your input, and the search is performed using DuckDuckGo.
- Add date/time to chat messages (#7003)
- Add message version navigation (#6947). Thanks @Th-Underscore.
- This is equivalent to the "swipes" in SillyTavern. Press left/right to navigate versions, press right while at the latest reply version to generate a new version.
- Add footer buttons for editing messages (#7019). Thanks @Th-Underscore.
- Add a "Branch here" footer button to chat messages (#6967). Thanks @Madrawn
- Add a token counter to the chat tab (counts input + history, including attachments)
- Make the dark theme darker
- Improve the light theme
- Improve the style of thinking blocks
- Add back `max_updates_second` to resolve a UI performance issue when streaming very fast (~200 tokens/second)
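To illustrate the "not RAG" note on attachments above: conceptually, the attachment's text is placed into the prompt in full rather than chunked and retrieved. This is only a sketch of that idea; the function name and delimiters are made up and this is not the web UI's actual code:

```python
# Conceptual sketch only: "not RAG" means the attachment's full text is
# inlined into the prompt. The function name and delimiters below are
# illustrative, not the web UI's actual implementation.
from pathlib import Path


def build_prompt_with_attachment(user_message: str, attachment_path: str) -> str:
    attachment_text = Path(attachment_path).read_text(encoding="utf-8", errors="replace")
    return (
        f"Attachment ({Path(attachment_path).name}):\n"
        f"{attachment_text}\n\n"
        f"User: {user_message}"
    )


# Tiny self-contained demo.
Path("notes.txt").write_text("Example attachment contents.", encoding="utf-8")
print(build_prompt_with_attachment("Summarize this file.", "notes.txt"))
```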
Bug fixes
- Close response generator when stopping API generation (#7014). Thanks @djholtby
- Fix the chat area height when "Show controls" is unchecked
- Remove unnecessary js that was causing scrolling issues during streaming
- Fix loading `Llama-3_3-Nemotron-Super-49B-v1` and similar models
- Fix Dockerfile for AMD and Intel (#6995). Thanks @TheGameratorT
- Fix 'Start reply with' (new in v3.4.1)
- Fix exllamav3_hf models failing to unload (new in v3.4.1)
Backend updates
- Bump llama.cpp to ggml-org/llama.cpp@b7a1746
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.4
v3.3.2: Patch release
- More robust VRAM calculation
  - The updated formula better handles edge cases (`DeepSeek-R1` and `Mistral-22B-v0.2`), providing more accurate results.
- UI: Use total (not free) VRAM for layers calculation when a model is loaded.
- Fix KeyError: 'gpu_layers' when loading existing model settings (#6991). Thanks, @mamei16.
v3.3.1: Patch release
- Only add a blank space to streaming messages in instruct mode, keeping the chat/chat-instruct styles as before.
- Some fixes to the GPU layers slider:
  - Honor saved settings
  - Fix the maximum being set to the saved value
  - Add backward compatibility with saved `n_gpu_layers` values (now called `gpu_layers`)
v3.3

Changes
- Estimate the VRAM for GGUF models using a statistical model + autoset `gpu-layers` on NVIDIA GPUs (#6980). (A rough sketch of the idea follows this list.)
  - When you select a GGUF model in the UI, you will see an estimate for its VRAM usage, and the number of layers will be set based on the available (free, not total) VRAM on your system.
  - If you change `ctx-size` or `cache-type` in the UI, the number of layers will be recalculated and updated in real time.
  - If you load a model through the command line with e.g. `--model model.gguf --ctx-size 32768 --cache-type q4_0`, the number of GPU layers will also be automatically calculated, without the need to set `--gpu-layers`.
  - It works even with multipart GGUF models or systems with multiple GPUs.
- Greatly simplify the Model tab by splitting settings between "Main options" and "Other options", where "Other options" is in a closed accordion by default.
- Tools support for the OpenAI compatible API (#6827). Thanks, @jkrauss82. (An example request follows this list.)
- Dynamic Chat Message UI update speed (#6952). This is a major UI optimization in Chat mode that renders `max_updates_second` obsolete. Thanks, @mamei16, for the very clever idea.
- Optimize the Chat tab JavaScript, reducing its CPU usage (#6948).
- Add the `top_n_sigma` sampler to the llama.cpp loader.
- Streamline the UI in portable builds: hide things that do not work, such as training, only show the llama.cpp loader, and do not include extensions that do not work. The latter should reduce the build sizes.
- Invert user/assistant message colors in instruct mode to make assistant messages darker and more readable.
- Improve the light theme colors.
- Add a minimum height to the streaming reply to prevent constant scrolling during chat streaming, similar to how ChatGPT and Claude work.
- Show the list of files if the user tries to download an entire GGUF repository instead of a specific file.
- llama.cpp: Handle short arguments in `--extra-flags`, like `ot`.
- Save the chat history right after sending a message and periodically during streaming to prevent losing messages.
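A back-of-the-envelope sketch of the layer auto-setting idea described above. This is not the statistical model the web UI actually uses; every number below is a placeholder assumption for illustration only:

```python
# Back-of-the-envelope sketch of "how many layers fit in free VRAM".
# NOT the statistical model used by the web UI; the per-layer and KV-cache
# figures below are placeholder assumptions.

def estimate_gpu_layers(free_vram_mib: float,
                        n_layers: int,
                        layer_size_mib: float,
                        kv_cache_mib: float,
                        overhead_mib: float = 512.0) -> int:
    """Return how many transformer layers should fit on the GPU."""
    budget = free_vram_mib - kv_cache_mib - overhead_mib
    if budget <= 0:
        return 0
    return max(0, min(n_layers, int(budget // layer_size_mib)))


# Example: 12 GiB free, a 48-layer model with ~180 MiB per quantized layer,
# and ~1.5 GiB reserved for the KV cache at the chosen ctx-size/cache-type.
print(estimate_gpu_layers(free_vram_mib=12 * 1024,
                          n_layers=48,
                          layer_size_mib=180.0,
                          kv_cache_mib=1536.0))
```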
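For the tools item above, a minimal request sketch using the `openai` Python client. It assumes the OpenAI-compatible API is enabled and reachable at the usual default local address (adjust the URL if yours differs); the tool definition itself is a toy example:

```python
# Minimal sketch: send a tool definition to the OpenAI-compatible API.
# Assumes the API is enabled and listening locally; the port shown is the
# usual default and is an assumption. The tool is a toy example.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="whatever-is-loaded",  # the web UI serves the currently loaded model
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)
print(response.choices[0].message)
```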
Bug fixes
- API: Fix llama.cpp continuing to generate in the background after cancelling the request, improve disconnect detection, fix deadlock on simultaneous requests.
- Fix `typical_p` in the llama.cpp sampler priority.
- Fix manual random seeds in llama.cpp.
- Add a retry mechanism when using the `/internal/logits` API endpoint with the llama.cpp loader to fix random failures.
- Ensure environment isolation in portable builds to avoid conflicts.
- docker: Fix app UID typo in docker composes (#6957 and #6958). Thanks, @enovikov11.
- Docker fix for NVIDIA (#6964). Thanks, @phokur.
- SuperboogaV2: Minor update to avoid JSON serialization errors (#6945). Thanks, @alirezagsm.
- Fix model config loading in shared.py for Python 3.13 (#6961). Thanks, @Downtown-Case.
Backend updates
- llama.cpp: Update to ggml-org/llama.cpp@c6a2c9e.
- ExLlamaV3: Update to turboderp-org/exllamav3@a905cff.
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.2
Changes
- Add an option to enable/disable thinking for Qwen3 models (and all future models with this feature). You can find it as a checkbox under Parameters > `enable_thinking`.
  - By default, thinking is enabled.
  - This works directly with the Jinja2 template. (A small sketch follows this list.)
- Make `<think>` UI blocks closed by default.
- Set `max_updates_second` to 12 by default. This prevents a CPU bottleneck when reasoning models generate extremely long replies at ~50 tokens/second.
- Find a new API port automatically if the default one is taken.
- Make `--verbose` print the `llama-server` launch command to the console.
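Because the `enable_thinking` toggle above goes through the Jinja2 chat template, the same switch can be exercised outside the UI. A small sketch with the `transformers` tokenizer; the model name is just an example, and passing `enable_thinking` through `apply_chat_template` is assumed to be supported by that model's template:

```python
# Sketch: the enable_thinking switch is a variable consumed by the model's
# Jinja2 chat template. The model name and the enable_thinking kwarg are
# assumptions for illustration; adjust to the model you actually use.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
messages = [{"role": "user", "content": "Hello!"}]

with_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
without_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# With thinking disabled, Qwen3-style templates typically emit an empty
# <think></think> block so the model skips the reasoning step.
print(with_thinking)
print(without_thinking)
```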
Bug fixes
- Fix ExLlamaV3_HF leaking memory, especially for long prompts/conversations.
- Fix the `streaming_llm` UI checkbox not being interactive.
- Fix the `max_updates_second` UI parameter not working.
- Fix getting the llama.cpp token probabilities for `Qwen3-30B-A3B` through the API.
- Fix CFG with ExLlamaV2_HF.
Backend updates
- llama.cpp: Update to ggml-org/llama.cpp@3e168be
- ExLlamaV3: Update to turboderp-org/exllamav3@4724b86.
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.

Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.1

Changes
- Add speculative decoding to the llama.cpp loader.
  - In tests with `google_gemma-3-27b-it-Q8_0.gguf` using `google_gemma-3-1b-it-Q4_K_M.gguf` as the draft model (both fully offloaded to GPU), the text generation speed went from 24.17 to 45.61 tokens/second (+88.7%).
  - Speed improvements vary by setup and prompt. Previous tests of mine showed increases of +64% and +34% in tokens/second for different combinations of models.
  - I highly recommend trying this feature.
- Add speculative decoding to the non-HF ExLlamaV2 loader (#6899).
- Prevent llama.cpp defaults from locking up consumer hardware (#6870). This change should provide a slight increase in text generation speed in most cases when using llama.cpp. Thanks, @Matthew-Jenkins.
- llama.cpp: Add a `--extra-flags` parameter for passing additional flags to `llama-server`, such as `override-tensor=exps=CPU`, which is useful for MoE models.
- llama.cpp: Add StreamingLLM (`--streaming-llm`). This prevents complete prompt reprocessing when the context length is filled, making it especially useful for role-playing scenarios.
  - This is called `--cache-reuse` in llama.cpp. You can learn more about it here: ggml-org/llama.cpp#9866
- llama.cpp: Add prompt processing progress messages.
- ExLlamaV3: Add KV cache quantization (#6903).
- Add Vulkan portable builds (see below). These should work on AMD and Intel Arc cards on both Windows and Linux.
- UI:
  - Add a collapsible thinking block to messages with `<think>` steps.
  - Make 'instruct' the default chat mode.
  - Add a greeting when the web UI launches in instruct mode with an empty chat history.
  - Make the model menu display only part 00001 of multipart GGUF files.
- Make `llama-cpp-binaries` wheels compatible with any Python >= 3.7 (useful for manually installing the requirements under `requirements/portable/`).
- Add a universal `--ctx-size` flag to specify context size across all loaders.
- Implement host header validation when using the UI / API on localhost (which is the default).
  - This is an important security improvement. It is recommended that you update your local install to the latest version.
  - Credits to security researcher Laurian Duma for discovering this issue and reaching out by email.
- Restructure the project to have all user data in `text-generation-webui/user_data`, including models, characters, presets, and saved settings.
  - This was done to make it possible to update portable installs in the future by just moving the `user_data` folder.
  - It has the additional benefit of making the repository more organized.
  - This is a breaking change. You will need to manually move your models from `models` to `user_data/models`, your presets from `presets` to `user_data/presets`, etc., after this update. (A hedged migration sketch follows this list.)
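For the breaking change above, a hedged migration sketch. It assumes you run it from the text-generation-webui root, covers only the folders named in this list, and is not an official migration script; back up first:

```python
# Hedged migration sketch for the user_data restructure. Run from the
# text-generation-webui root. Back up first; the folder list below covers
# the common cases named above and may not match every install.
import shutil
from pathlib import Path

root = Path(".")
user_data = root / "user_data"
user_data.mkdir(exist_ok=True)

for name in ["models", "characters", "presets"]:
    src = root / name
    dst = user_data / name
    if src.is_dir() and not dst.exists():
        shutil.move(str(src), str(dst))
        print(f"moved {src} -> {dst}")
```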
Bug fixes
- Fix an issue where portable installations ignored the CMD_FLAGS.txt file.
- extensions/superboogav2: existing embedding check bug fix (#6898). Thanks, @ZiyaCu.
- ExLlamaV2_HF: Add another `torch.cuda.synchronize()` call to prevent errors during text generation.
- Fix the Notebook tab not loading its default prompt.
Backend updates
- llama.cpp: Update to ggml-org/llama.cpp@295354e
- ExLlamaV3: Update to turboderp-org/exllamav3@de83084.
- ExLlamaV2: Update to version 0.2.9.
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation. Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
v3.0
Changes
- Portable zip builds for `text-generation-webui` + `llama.cpp`! You can now download a fully self-contained (~700 MB) version of the web UI with built-in `llama.cpp` support. No installation required.
  - Available for Windows, Linux, and macOS with builds for `cuda12.4`, `cuda11.7`, `cpu`, macOS `arm64`, and macOS `x86_64`.
  - No Miniconda, no `torch`, no downloads after unzipping.
  - Comes bundled with a portable Python from `astral-sh/python-build-standalone`.
  - Web UI opens automatically in the browser; the API starts by default on `localhost` without the need to use `--api`. (A quick way to check the local API from a script follows this list.)
  - All the compilation workflows are public, open-source, and executed on GitHub.
  - Fully private as always: no telemetry, no CDN resources, no remote requests.
- Make llama.cpp the default loader in the project.
- Add support for llama-cpp builds from https://github.com/ggml-org/llama.cpp (#6862). Thanks, @Matthew-Jenkins.
- Add back the `--model-menu` flag.
- Remove the `--gpu-memory` flag, and reuse the `--gpu-split` EXL2 flag for Transformers.
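Since portable builds start the API on localhost by default (see the list above), you can sanity-check it from a script with plain HTTP requests. A minimal sketch using `requests`; the port shown is an assumption based on the usual default and may need adjusting:

```python
# Minimal sketch: confirm the locally started OpenAI-compatible API responds.
# The port below is the usual default and is an assumption; adjust it if your
# install uses a different one.
import requests

base_url = "http://127.0.0.1:5000/v1"

models = requests.get(f"{base_url}/models", timeout=10).json()
print(models)

completion = requests.post(
    f"{base_url}/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=120,
).json()
print(completion["choices"][0]["message"]["content"])
```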
Backend updates
- llama.cpp: Bump to commit ggml-org/llama.cpp@2016f07
v2.8.1
🔧 Bug fixes
This release fixes several issues with the new llama.cpp loader, especially on Windows. Thanks everyone for the feedback.
- Fix the poor performance of the new llama.cpp loader on Windows. It was caused by using `localhost` for requests instead of `127.0.0.1`. It's a lot faster now.
- Fix the new llama.cpp loader failing to unload models.
- Fix using the API without streaming or without 'sampler_priority' when using the new llama.cpp loader.