
Releases: oobabooga/text-generation-webui

v3.5

11 Jun 02:15
1e96dcf

Changes

  • Optimize chat streaming by only updating the last message during streaming and adding back the dynamic UI update speed. These changes keep streaming smooth even at 100k tokens of context.
  • Add a CUDA 12.8 installation option for RTX 50XX NVIDIA Blackwell support (ExLlamaV2/V3 and Transformers) (#7011). Thanks @okazaki10
  • Make UI settings persistent. Any value you change, including sliders in the Parameters tab, chat mode, character, character description fields, etc., now gets automatically saved to user_data/settings.yaml. If you close the UI and launch it again, the values will be where you left them. The Model tab is the exception, since it's managed by command-line flags and its own "Save settings" menu. (A short sketch of editing this file by hand appears after this list.)
  • Make the dark theme darker and more aesthetic.
  • Add support for .docx attachments.
  • Add 🗑️ buttons for easily deleting individual past chats.
  • Add new buttons: "Restore preset", "Neutralize samplers", "Restore character".
  • Reorganize the Parameters tab with parameters that get saved to presets on the left and everything else on the right.
  • Add Qwen3 presets (Thinking and No Thinking), and make Qwen3 - Thinking the new default preset. If you update a portable install manually by moving user_data, you will not have these files; download them from here if you are interested.
  • Add the model name to each message's metadata, and show it in the UI when hovering the date/time for a message.
  • Scroll up automatically to show the whole editing area when editing a message.
  • Add an option to turn long pasted text into an attachment automatically. This is disabled by default and can be enabled in the Session tab.
  • Extract the text of web search results with formatting preserved instead of putting all the text on a single line.
  • Show llama.cpp prompt processing progress on a single line.
  • Add informative tooltips when hovering the file upload icon and the web search checkbox.
  • Several small UI optimizations.
  • Several small UI style improvements.
  • Use user_data/cache/gradio for Gradio temporary files instead of the system's temporary folder.
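
As an aside, the persisted values can also be inspected or pre-seeded by editing user_data/settings.yaml by hand. The sketch below only illustrates that idea; the keys it uses are made-up examples and may not match the real schema.

```python
# Minimal sketch: read and tweak the persisted UI settings file.
# The path matches the release note above; the keys used here are
# made-up examples, not an authoritative schema.
from pathlib import Path

import yaml  # PyYAML

settings_path = Path("user_data/settings.yaml")

settings = {}
if settings_path.exists():
    settings = yaml.safe_load(settings_path.read_text()) or {}

settings["dark_theme"] = True            # example key
settings["preset"] = "Qwen3 - Thinking"  # example key

settings_path.parent.mkdir(parents=True, exist_ok=True)
settings_path.write_text(yaml.safe_dump(settings, sort_keys=False))
```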

Bug fixes

  • Filter out failed web search downloads from attachments.
  • Remove quotes from LLM-generated web search queries.
  • Fix the progress bar for downloading a model not appearing in the UI.
  • Fix the text for a sent message reappearing in the input area when the page is reloaded.
  • Fix selecting the next chat on the list when deleting a chat with an active search.
  • Fix light/dark theme persistence across page reloads.
  • Re-highlight code blocks when switching light/dark themes to fix styling issues.
  • Stop llama.cpp model during graceful shutdown to avoid an error message (#7042). Thanks @leszekhanusz
  • Check .attention.head_count if .attention.head_count_kv doesn't exist for VRAM calculation (#7048); a sketch of this fallback appears after this list. Thanks @miriameng
  • Fix failure when --nowebui is called without --api (#7055). Thanks @miriameng
  • Fix 'Continue' and 'Start reply with' when using translation extensions (#6944). Thanks @mykeehu
  • Load JS and CSS sources in UTF-8 (#7059). Thanks @LawnMauer
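
For context on the head-count fix above, the lookup it describes boils down to a simple metadata fallback; the sketch below is illustrative only, with a made-up metadata dict.

```python
# Illustrative sketch of the fallback described in the fix above:
# prefer <arch>.attention.head_count_kv and fall back to
# <arch>.attention.head_count. The metadata dict is a made-up example.
def kv_head_count(metadata: dict, arch: str = "llama") -> int:
    kv_key = f"{arch}.attention.head_count_kv"
    fallback_key = f"{arch}.attention.head_count"
    return metadata.get(kv_key, metadata.get(fallback_key, 0))

example = {"llama.attention.head_count": 32}  # no *_kv key present
print(kv_head_count(example))  # -> 32, then used for the VRAM estimate
```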

Backend updates


Portable builds

Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.

Which version to download:

  • Windows/Linux:

    • NVIDIA GPU: Use cuda12.4 for newer GPUs or cuda11.7 for older GPUs and systems with older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • CPU only: Use cpu builds.
  • Mac:

    • Apple Silicon: Use macos-arm64.
    • Intel CPU: Use macos-x86_64.

Updating a portable install:

  1. Download and unzip the latest version.
  2. Replace the user_data folder in the new version with the one from your existing install. All your settings and models will carry over.

v3.4.1

31 May 02:12
ae61c1a

Changes

  • Add attachments support (text files, PDF documents) (#7005).
    • This is not RAG. The attachment gets fully added to the prompt!
  • Add a web search feature (#7023). The search query is generated by the LLM based on your input, and the search is performed using DuckDuckGo. (A rough sketch of the search step appears after this list.)
  • Add date/time to chat messages (#7003)
  • Add message version navigation (#6947). Thanks @Th-Underscore.
    • This is equivalent to the "swipes" in SillyTavern. Press left/right to navigate versions, press right while at the latest reply version to generate a new version.
  • Add footer buttons for editing messages (#7019). Thanks @Th-Underscore.
  • Add a "Branch here" footer button to chat messages (#6967). Thanks @Madrawn
  • Add a token counter to the chat tab (counts input + history, including attachments)
  • Make the dark theme darker
  • Improve the light theme
  • Improve the style of thinking blocks
  • Add back max_updates_second to resolve a UI performance issue when streaming very fast (~200 tokens/second)
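
As a rough illustration of the web search flow described above (not the project's actual implementation), the search step could look like the sketch below, assuming the third-party duckduckgo_search package and an already-generated query string.

```python
# Hedged sketch of the search step only: the LLM has already produced a
# query string; here we just fetch DuckDuckGo results for it. Assumes the
# third-party duckduckgo_search package; not the project's actual code.
from duckduckgo_search import DDGS

query = "latest llama.cpp release notes"  # would normally come from the LLM

with DDGS() as ddgs:
    results = ddgs.text(query, max_results=5)

for result in results:
    print(result["title"], result["href"])
```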

Bug fixes

  • Close response generator when stopping API generation (#7014). Thanks @djholtby
  • Fix the chat area height when "Show controls" is unchecked
  • Remove unnecessary js that was causing scrolling issues during streaming
  • Fix loading Llama-3_3-Nemotron-Super-49B-v1 and similar models
  • Fix Dockerfile for AMD and Intel (#6995). Thanks @TheGameratorT
  • Fix 'Start reply with' (new in v3.4.1)
  • Fix exllamav3_hf models failing to unload (new in v3.4.1)

Backend updates


Portable builds

Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.

Choosing the right build:

  • Windows/Linux:

    • NVIDIA GPU: Use cuda12.4 for newer GPUs or cuda11.7 for older GPUs and systems with older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • CPU only: Use cpu builds.
  • Mac:

    • Apple Silicon: Use macos-arm64.
    • Intel CPU: Use macos-x86_64.

Updating a portable install:

  1. Download and unzip the latest version.
  2. Replace the user_data folder in the new version with the one from your existing install. All your settings and models will carry over.

v3.4

29 May 22:10
af1eef1

v3.3.2: Patch release

17 May 15:05
e859573
  • More robust VRAM calculation
    • The updated formula better handles edge cases such as DeepSeek-R1 and Mistral-22B-v0.2, providing more accurate results.
  • UI: Use total (not free) VRAM for layers calculation when a model is loaded.
  • Fix KeyError: 'gpu_layers' when loading existing model settings (#6991). Thanks, @mamei16.

v3.3.1: Patch release

17 May 01:31
17c29fa
  • Only add a blank space to streaming messages in instruct mode, keeping the chat/chat-instruct styles as before.
  • Some fixes to the GPU layers slider:
    • Honor saved settings
    • Fix the maximum being set to the saved value
    • Add backward compatibility with saved n_gpu_layers values (now it's called gpu_layers)

v3.3

16 May 20:14
dc30945

Changes

  • Estimate the VRAM for GGUF models using a statistical model + autoset gpu-layers on NVIDIA GPUs (#6980).
    • When you select a GGUF model in the UI, you will see an estimate for its VRAM usage, and the number of layers will be set based on the available (free, not total) VRAM on your system.
    • If you change ctx-size or cache-type in the UI, the number of layers will be recalculated and updated in real time.
    • If you load a model through the command line with e.g. --model model.gguf --ctx-size 32768 --cache-type q4_0, the number of GPU layers will also be automatically calculated, without the need to set --gpu-layers.
    • It works even with multipart GGUF models and systems with multiple GPUs. (A rough, illustrative sketch of this kind of estimate appears after this list.)
  • Greatly simplify the Model tab by splitting settings between "Main options" and "Other options", where "Other options" is in a closed accordion by default.
  • Tools support for the OpenAI-compatible API (#6827). (A request sketch appears after this list.) Thanks, @jkrauss82.
  • Dynamic Chat Message UI update speed (#6952). This is a major UI optimization in Chat mode that renders max_updates_second obsolete. Thanks, @mamei16 for the very clever idea.
  • Optimize the Chat tab JavaScript, reducing its CPU usage (#6948).
  • Add the top_n_sigma sampler to the llama.cpp loader.
  • Streamline the UI in portable builds: hide features that do not work (such as training), only show the llama.cpp loader, and do not include extensions that do not work. The latter should reduce the build sizes.
  • Invert user/assistant message colors in instruct mode to make assistant messages darker and more readable.
  • Improve the light theme colors.
  • Add a minimum height to the streaming reply to prevent constant scrolling during chat streaming, similar to how ChatGPT and Claude work.
  • Show the list of files if the user tries to download an entire GGUF repository instead of a specific file.
  • llama.cpp: Handle short arguments in --extra-flags, like ot.
  • Save the chat history right after sending a message and periodically during streaming to prevent losing messages.
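
About the VRAM estimate above: the statistical model lives in the project code and is not reproduced here. The sketch below is only a back-of-the-envelope approximation (uniform layer size plus a rough KV-cache term) meant to show the kind of calculation involved; every constant in it is an assumption.

```python
# Back-of-the-envelope sketch, NOT the project's statistical model:
# split the file size evenly across layers, add a rough per-layer KV-cache
# term, and pick the largest layer count that fits in free VRAM.
def autoset_gpu_layers(file_size_gib: float, n_layers: int, ctx_size: int,
                       kv_heads: int, head_dim: int, cache_bytes_per_elem: float,
                       free_vram_gib: float) -> int:
    weights_per_layer = file_size_gib / n_layers
    kv_per_layer = 2 * ctx_size * kv_heads * head_dim * cache_bytes_per_elem / 1024**3
    best = 0
    for layers in range(n_layers + 1):
        if layers * (weights_per_layer + kv_per_layer) <= free_vram_gib:
            best = layers
    return best

# Example: a ~20 GiB GGUF with 48 layers, 32768 ctx, q4_0 cache
# (~0.56 bytes per element), and 16 GiB of free VRAM.
print(autoset_gpu_layers(20.0, 48, 32768, 8, 128, 0.56, free_vram_gib=16.0))
```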

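Regarding the tools support above, requests go through the standard OpenAI-style chat completions schema. The example below is a sketch that assumes the API is running on its default local address and port, with a made-up weather tool.

```python
# Sketch of a tool-enabled request against the OpenAI-compatible API.
# Assumes the API is listening on its default local address and port
# (http://127.0.0.1:5000); the "get_weather" tool is a made-up example.
import requests

payload = {
    "messages": [{"role": "user", "content": "What's the weather in Lisbon?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

response = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"])
```
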
Bug fixes

  • API: Fix llama.cpp continuing to generate in the background after cancelling the request, improve disconnect detection, fix deadlock on simultaneous requests.
  • Fix typical_p in the llama.cpp sampler priority.
  • Fix manual random seeds in llama.cpp.
  • Add a retry mechanism when using the /internal/logits API endpoint with the llama.cpp loader to fix random failures. (An illustrative client-side retry loop appears after this list.)
  • Ensure environment isolation in portable builds to avoid conflicts.
  • docker: Fix the app UID typo in the Docker Compose files (#6957 and #6958). Thanks, @enovikov11.
  • Docker fix for NVIDIA (#6964). Thanks, @phokur.
  • SuperboogaV2: Minor update to avoid JSON serialization errors (#6945). Thanks, @alirezagsm.
  • Fix model config loading in shared.py for Python 3.13 (#6961). Thanks, @Downtown-Case.
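
The retry mechanism above lives inside the loader, but the same idea can be applied defensively on the client side; the snippet below is an illustrative retry loop with an assumed base URL and request body.

```python
# Illustrative client-side retry loop around the logits endpoint mentioned
# above. The base URL, request body, and retry policy are assumptions, not
# project code.
import time

import requests


def get_logits(prompt: str, retries: int = 3,
               base_url: str = "http://127.0.0.1:5000"):
    last_error = None
    for attempt in range(retries):
        try:
            r = requests.post(f"{base_url}/v1/internal/logits",
                              json={"prompt": prompt}, timeout=30)
            r.raise_for_status()
            return r.json()
        except requests.RequestException as error:
            last_error = error
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"logits request failed after {retries} attempts") from last_error
```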

Backend updates


Portable builds

Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.

Choosing the right build:

  • Windows/Linux:

    • NVIDIA GPU: Use cuda12.4 for newer GPUs or cuda11.7 for older GPUs and systems with older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • CPU only: Use cpu builds.
  • Mac:

    • Apple Silicon: Use macos-arm64.
    • Intel CPU: Use macos-x86_64.

Updating a portable install:

  1. Download and unzip the latest version.
  2. Replace the user_data folder in the new version with the one from your existing install. All your settings and models will carry over.

v3.2

01 May 03:18
a41da1e

Changes

  • Add an option to enable/disable thinking for Qwen3 models (and all future models with this feature). You can find it as a checkbox under Parameters > enable_thinking.
    • By default, thinking is enabled.
    • This works directly with the Jinja2 template. (A sketch of the equivalent call outside the UI appears after this list.)
  • Make <think> UI blocks closed by default.
  • Set max_updates_second to 12 by default. This prevents CPU bottlenecking when reasoning models generate extremely long replies at 50 tokens/second.
  • Find a new API port automatically if the default one is taken. (An illustrative probe loop appears after this list.)
  • Make --verbose print the llama-server launch command to the console.
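
For reference, the same enable_thinking switch can be exercised outside the UI through the model's chat template; the sketch below is a hedged example using the Transformers tokenizer for a Qwen3 model.

```python
# Sketch of the same enable_thinking switch outside the web UI, using the
# Transformers chat template of a Qwen3 model. The model name is just an
# example; the flag is forwarded to the Jinja2 template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
messages = [{"role": "user", "content": "Explain KV-cache quantization briefly."}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # same effect as unchecking the UI box
)
print(prompt)
```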

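The automatic port fallback can be pictured as a simple probe loop; the sketch below is illustrative only, not the project's implementation.

```python
# Illustrative port probe, not the project's actual implementation:
# start at the default API port and walk upward until a free one is found.
import socket


def find_free_port(start: int = 5000, attempts: int = 20) -> int:
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port
            except OSError:
                continue  # port already taken, try the next one
    raise RuntimeError("no free port found in the probed range")


print(find_free_port())
```
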
Bug fixes

  • Fix ExLlamaV3_HF leaking memory, especially for long prompts/conversations.
  • Fix the streaming_llm UI checkbox not being interactive.
  • Fix the max_updates_second UI parameter not working.
  • Fix getting the llama.cpp token probabilities for Qwen3-30B-A3B through the API.
  • Fix CFG with ExLlamaV2_HF.

Backend updates


Portable builds

Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.

Choosing the right build:

  • Windows/Linux:

    • NVIDIA GPU: Use cuda12.4 for newer GPUs or cuda11.7 for older GPUs and systems with older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • CPU only: Use cpu builds.
  • Mac:

    • Apple Silicon: Use macos-arm64.
    • Intel CPU: Use macos-x86_64.

Updating a portable install:

  1. Download and unzip the latest version.
  2. Replace the user_data folder in the new version with the one from your existing install. All your settings and models will carry over.

v3.1

27 Apr 03:03
9bb9ce0

Changes

  • Add speculative decoding to the llama.cpp loader.
    • In tests with google_gemma-3-27b-it-Q8_0.gguf using google_gemma-3-1b-it-Q4_K_M.gguf as the draft model (both fully offloaded to GPU), the text generation speed went from 24.17 to 45.61 tokens/second (+88.7%).
    • Speed improvements vary by setup and prompt. Previous tests of mine showed increases of +64% and +34% in tokens/second for different combinations of models.
    • I highly recommend trying this feature. (An illustrative sketch of the underlying llama-server launch appears after this list.)
  • Add speculative decoding to the non-HF ExLlamaV2 loader (#6899).
  • Prevent llama.cpp defaults from locking up consumer hardware (#6870). This change should provide a slight increase in text generation speed in most cases when using llama.cpp. Thanks, @Matthew-Jenkins.
  • llama.cpp: Add a --extra-flags parameter for passing additional flags to llama-server, such as override-tensor=exps=CPU, which is useful for MoE models.
  • llama.cpp: Add StreamingLLM (--streaming-llm). This prevents complete prompt reprocessing when the context length is filled, making it especially useful for role-playing scenarios.
  • llama.cpp: Add prompt processing progress messages.
  • ExLlamaV3: Add KV cache quantization (#6903).
  • Add Vulkan portable builds (see below). These should work on AMD and Intel Arc cards on both Windows and Linux.
  • UI:
    • Add a collapsible thinking block to messages with <think> steps.
    • Make 'instruct' the default chat mode.
    • Add a greeting when the web UI launches in instruct mode with an empty chat history.
    • Make the model menu display only part 00001 of multipart GGUF files.
  • Make llama-cpp-binaries wheels compatible with any Python >= 3.7 (useful for manually installing the requirements under requirements/portable/).
  • Add a universal --ctx-size flag to specify the context size across all loaders.
  • Implement host header validation when using the UI / API on localhost (which is the default).
    • This is an important security improvement. It is recommended that you update your local install to the latest version. (A generic illustration of the check appears after this list.)
    • Credits to security researcher Laurian Duma for discovering this issue and reaching out by email.
  • Restructure the project to have all user data on text-generation-webui/user_data, including models, characters, presets, and saved settings.
    • This was done to make it possible to update portable installs in the future by just moving the user_data folder.
    • It has the additional benefit of making the repository more organized.
    • This is a breaking change. You will need to manually move your models from models to user_data/models, your presets from presets to user_data/presets, and so on, after this update.
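
To make the speculative decoding item above more concrete, one way to picture what the loader does is to launch llama-server directly with a main and a draft model; the sketch below uses placeholder paths and assumes a build with draft-model support.

```python
# Hedged sketch of what the loader does under the hood: launch llama-server
# with both a main and a draft model. Paths are placeholders, and the flags
# assume a llama-server build with draft-model (speculative decoding) support.
import subprocess

cmd = [
    "llama-server",
    "--model", "user_data/models/google_gemma-3-27b-it-Q8_0.gguf",
    "--model-draft", "user_data/models/google_gemma-3-1b-it-Q4_K_M.gguf",
    "--gpu-layers", "99",        # fully offload the main model
    "--gpu-layers-draft", "99",  # fully offload the draft model
    "--ctx-size", "8192",
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```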

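On the host header validation item, the core of the check is rejecting requests whose Host header is not an allowed local name; the snippet below is a generic illustration, not the project's code.

```python
# Generic illustration of host header validation (not the project's code):
# reject requests whose Host header is not an allowed local name, which
# helps block DNS-rebinding-style requests from malicious web pages.
ALLOWED_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0"}


def is_allowed_host(host_header: str) -> bool:
    host = host_header.split(":", 1)[0].lower()  # strip any port suffix
    return host in ALLOWED_HOSTS


assert is_allowed_host("127.0.0.1:7860")
assert not is_allowed_host("evil.example.com")
```
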
Bug fixes

  • Fix an issue where portable installations ignored the CMD_FLAGS.txt file.
  • extensions/superboogav2: existing embedding check bug fix (#6898). Thanks, @ZiyaCu.
  • ExLlamaV2_HF: Add another torch.cuda.synchronize() call to prevent errors during text generation.
  • Fix the Notebook tab not loading its default prompt.

Backend updates


Portable builds

Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation. Just download the right version for your system, unzip, and run.

Choosing the right build:

  • Windows/Linux:

    • NVIDIA GPU: Use cuda12.4 for newer GPUs or cuda11.7 for older GPUs and systems with older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • CPU only: Use cpu builds.
  • Mac:

    • Apple Silicon: Use macos-arm64.
    • Intel CPU: Use macos-x86_64.

v3.0

22 Apr 15:11
a778270

Changes

  • Portable zip builds for text-generation-webui + llama.cpp! You can now download a fully self-contained (~700 MB) version of the web UI with built-in llama.cpp support. No installation required.
    • Available for Windows, Linux, and macOS with builds for cuda12.4, cuda11.7, cpu, macOS arm64 and macOS x86_64.
    • No Miniconda, no torch, no downloads after unzipping.
    • Comes bundled with a portable Python from astral-sh/python-build-standalone.
    • Web UI opens automatically in the browser; API starts by default on localhost without the need to use --api.
    • All the compilation workflows are public, open-source, and executed on GitHub.
    • Fully private as always — no telemetry, no CDN resources, no remote requests.
  • Make llama.cpp the default loader in the project.
  • Add support for llama-cpp builds from https://github.com/ggml-org/llama.cpp (#6862). Thanks, @Matthew-Jenkins.
  • Add back the --model-menu flag.
  • Remove the --gpu-memory flag, and reuse the --gpu-split EXL2 flag for Transformers.

Backend updates

v2.8.1

20 Apr 00:57
c19b995

🔧 Bug fixes

This release fixes several issues with the new llama.cpp loader, especially on Windows. Thanks everyone for the feedback.

  • Fix the poor performance of the new llama.cpp loader on Windows. It was caused by using localhost for requests instead of 127.0.0.1. It's a lot faster now. (A small latency-comparison sketch follows this list.)
  • Fix the new llama.cpp loader failing to unload models.
  • Fix using the API without streaming or without 'sampler_priority' when using the new llama.cpp loader.
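
On the localhost fix above: resolving "localhost" can take a slower path than connecting to 127.0.0.1 directly, especially on Windows. The snippet below is an illustrative way to compare the two against a running local instance, assuming the default API port.

```python
# Illustrative latency comparison between "localhost" and "127.0.0.1"
# against a locally running API (default port and /v1/models path assumed).
# Run only with the server up; numbers vary by OS and network stack.
import time

import requests


def average_latency(url: str, n: int = 5) -> float:
    start = time.perf_counter()
    for _ in range(n):
        requests.get(url, timeout=5)
    return (time.perf_counter() - start) / n


for host in ("127.0.0.1", "localhost"):
    avg = average_latency(f"http://{host}:5000/v1/models")
    print(f"{host}: {avg * 1000:.1f} ms per request")
```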