Releases: bkataru/igllama
v0.3.11 — Strip residual </think> tokens
Fix
- Strip `</think>` from generated output when `--no-think` is active — the prompt prefills `<think>\n\n</think>`, but models sometimes still emit `</think>` as the first token(s). Both streaming and non-streaming paths now detect and strip this prefix, preventing `JSON_ERR` in downstream consumers like the powerglide trial harness.
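The check itself is simple; here is a Python sketch of the behavior (illustrative only — igllama's actual implementation is in Zig, and the exact whitespace handling is an assumption):

```python
# Illustrative sketch, not igllama's Zig source; whitespace handling is assumed.
RESIDUAL = "</think>"

def strip_residual_think(text: str) -> str:
    """Drop a leading </think> tag left over despite the --no-think prefill."""
    stripped = text.lstrip()
    if stripped.startswith(RESIDUAL):
        return stripped[len(RESIDUAL):].lstrip()
    return text

# Streaming needs one extra step: the first chunks are buffered until the output
# is long enough to decide whether it starts with the tag, then flushed.
```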
Verified
- All tests pass
- Discovered during the powerglide 9B T01–T17 trial (17/17 pass, but with recoverable `JSON_ERR` noise from `</think>` leaking)
v0.3.10 — Accurate Usage Token Counts
What's New
Accurate `usage.prompt_tokens` and `usage.completion_tokens` in non-streaming responses
Previously, the non-streaming `/v1/chat/completions` handler returned hardcoded zeros:
"usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}Now it returns real counts:
"usage": {"prompt_tokens": 42, "completion_tokens": 55, "total_tokens": 97}Implementation
`handleCompletion()` now returns a `CompletionResult` struct:
- `prompt_tokens` — from `tokenizer.getTokens().len` after prompt tokenization
- `completion_tokens` — incremented per generated token in the decode loop
- `total_tokens` — sum of the above
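For example, a client can now derive throughput straight from a response (a minimal sketch; the host, port, and model id are assumptions):

```python
# Minimal throughput check against a local igllama server (host/model assumed).
import time
import requests

t0 = time.time()
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "local", "messages": [{"role": "user", "content": "Hello"}]},
).json()
elapsed = time.time() - t0

usage = resp["usage"]
print(f"{usage['prompt_tokens']} prompt + {usage['completion_tokens']} completion "
      f"= {usage['total_tokens']} tokens, ~{usage['completion_tokens'] / elapsed:.2f} tok/s")
```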
Notes
- Streaming handler was already reporting `total_tokens` separately (unchanged)
- This enables downstream clients (powerglide bench harness, OpenAI SDK consumers) to accurately measure throughput
v0.3.9 — Grammar-Constrained json_mode
What's New
Grammar-constrained `json_mode` via `response_format`
When clients send `"response_format": {"type": "json_object"}` in a chat completion request, igllama now:
- Loads the built-in `JSON_GRAMMAR` GBNF grammar as a comptime constant
- Wires it into the llama.cpp sampler chain via `llama_sampler_init_grammar()`
- Constrains every token selection to valid JSON — no post-processing needed
Both streaming and non-streaming handlers support `json_mode`.
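A request exercising the new mode might look like this (a sketch; host, port, and model id are assumptions):

```python
# json_mode request sketch (host/model assumed); any OpenAI-compatible client works.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Give three primes as a JSON object."}],
        "response_format": {"type": "json_object"},  # enables grammar-constrained sampling
    },
).json()
print(resp["choices"][0]["message"]["content"])  # valid JSON, no post-processing needed
```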
Fixed
- Streaming `json_mode` use-after-free — the streaming handler previously called `loadGrammar(allocator, "json")` then `defer allocator.free(gs)` inside the if-block, freeing the grammar string while the sampler still held a pointer. Replaced with the direct `JSON_GRAMMAR` comptime constant — no allocation, no lifetime issue.
Notes
- `json_mode` is silently ignored if the model's vocabulary doesn't support grammar sampling
- Works best with the `--no-think` flag to suppress reasoning preamble
v0.3.8 - Qwen 3.5 Small Benchmarks & Website Refactor
- Qwen 3.5 Small Series Benchmarks: Comprehensive triage and benchmarks for 0.8B, 2B, 4B, and 9B models
- Website Reorganization: Refactored benchmark showcase into structured subpages (`showcase/qwen35-small` and `showcase/qwen35-35b`)
- UX Improvements: Added backlinks to showcase subpages and a new "Learn More" navigation hub on the homepage
- History Cleanup: Pruned large build artifacts from `gh-pages` and corrected author information
v0.3.7 - Documentation Consolidation & llama.cpp Bump
- Documentation Fix: Restored truncated `philosophy.smd` content
- Bumped llama.cpp submodule to gguf-v0.18.0 (Vulkan AMD partial offload improvements, CUDA grid fixes)
- Added `website/.gitignore` to suppress Zine build output files from `git status`
- Documentation consolidation pass: development-notes, showcase, api.smd, version strings
v0.3.6 - OpenAI-Compatible Streaming Fix
What's New in v0.3.6
Bug Fix: Streaming Response Compatibility
Forge code and other strict OpenAI-compatible clients (using Rust serde parsers) previously crashed with:
```
Failed to parse provider response: data did not match any variant of untagged enum Response
```
Root cause: igllama's SSE stream was missing required OpenAI spec fields:
- Initial role chunk — OpenAI requires `delta: {"role":"assistant","content":""}` as the first SSE chunk before any content
- `model` and `created` fields — required in every SSE chunk
- Final stop chunk — `delta: {}, finish_reason: "stop"` must be sent before `data: [DONE]`
Fix: All three issues resolved. Streaming and non-streaming responses now match the OpenAI spec exactly.
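The required chunk sequence, sketched as a Python generator (field shapes follow the OpenAI streaming spec; the id, model, and timestamp values here are placeholders):

```python
# Shapes per the OpenAI streaming spec; id/model/created values are made up.
import json
import time

def sse(payload: dict) -> str:
    return f"data: {json.dumps(payload)}\n\n"

base = {"id": "chatcmpl-0", "object": "chat.completion.chunk",
        "created": int(time.time()), "model": "local"}  # model/created in EVERY chunk

def chunks(text_pieces):
    # 1. Initial role chunk, before any content.
    yield sse({**base, "choices": [{"index": 0,
               "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]})
    # 2. Content deltas.
    for piece in text_pieces:
        yield sse({**base, "choices": [{"index": 0,
                   "delta": {"content": piece}, "finish_reason": None}]})
    # 3. Final stop chunk, then the sentinel.
    yield sse({**base, "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]})
    yield "data: [DONE]\n\n"
```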
Tested With
- forge code CLI: `forge --prompt` and `forge --agent sage` subagent delegation both work
- No `<think>` blocks: `--no-think` confirmed working across all forge agents
- Response latency on AMD EPYC-Rome (16-core, no GPU): ~50–90 s (system prompt prefill; no reasoning overhead)
Recommended Launch Command
```
igllama api model.gguf \
  --threads 8 --threads-batch 16 \
  --mlock --ctx-size 8192 \
  --no-think
```

Merged PRs
- #71 — fix: make streaming response fully OpenAI-compatible (v0.3.6)
v0.3.5 - Thinking Mode Suppression (--no-think)
What's New in v0.3.5
New Feature: --no-think Flag
Qwen3.5 and similar reasoning models produce <think>...</think> chain-of-thought blocks before every response. On CPU hardware these blocks can run 200+ tokens long, adding 30–90 seconds of latency before the actual answer.
`--no-think` suppresses this entirely:
```
igllama api model.gguf --no-think
```

How it works: igllama pre-fills an empty `<think>\n\n</think>` block on the assistant turn. The model treats the reasoning phase as already complete and jumps directly to the answer — the standard llama.cpp technique for disabling Qwen3-style chain-of-thought.
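As a sketch of the technique (Qwen's ChatML-style turn markers; the helper name is hypothetical, and igllama does this inside its Zig chat-template code, not Python):

```python
# Hypothetical illustration of the prefill trick.
def assistant_prefix(no_think: bool) -> str:
    prefix = "<|im_start|>assistant\n"  # ChatML-style assistant turn marker
    if no_think:
        # An already-closed, empty think block: the model sees reasoning as
        # finished and generates the answer immediately.
        prefix += "<think>\n\n</think>\n\n"
    return prefix
```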
Timing comparison (AMD EPYC-Rome, Qwen3.5-35B-A3B):
| Mode | Simple query latency |
|---|---|
| Default (thinking on) | ~45 s |
| `--no-think` | ~9 s |
Recommended full launch command for CPU servers:
```
igllama api model.gguf \
  --threads 8 --threads-batch 16 \
  --mlock --ctx-size 8192 \
  --no-think
```

Merged PRs
- #70 — feat: add --no-think flag to suppress Qwen3 reasoning blocks (v0.3.5)
v0.3.4 - CPU Thread Tuning & Performance Optimization
What's New in v0.3.4
Performance Improvements
New `igllama api` flags for CPU thread tuning:
- `--threads` / `-t` — Set generation thread count (GEMV, memory-bandwidth bound). Default: all cores. Optimal: set to your CPU's memory channel count.
- `--threads-batch` / `-tb` — Set prefill thread count (GEMM, compute-parallel). Default: all cores.
- `--mlock` — Pin model weights in physical RAM, preventing OS paging. Critical for consistent throughput when model size approaches available RAM.
Benchmark results on AMD EPYC-Rome 16-core (no GPU):
| Config | Generation speed |
|---|---|
| Previous default (4-thread cap) | 4.40 tok/s |
| `--threads 8 --threads-batch 16` | 5.56 tok/s (+26%) |
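To reproduce a sweep like this on other hardware, a rough harness might look like the following (a sketch; the port, model id, fixed load wait, and `max_tokens` support are assumptions; poll `/health` in real use):

```python
# Rough thread-sweep harness (port, model id, and load wait are assumptions).
import subprocess
import time
import requests

def tok_per_s(threads: int, n_tokens: int = 128) -> float:
    server = subprocess.Popen(
        ["igllama", "api", "model.gguf",
         "--threads", str(threads), "--threads-batch", "16",
         "--mlock", "--ctx-size", "8192"])
    try:
        time.sleep(30)  # crude: wait for model load instead of polling /health
        t0 = time.time()
        requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={"model": "local", "max_tokens": n_tokens,
                  "messages": [{"role": "user", "content": "Count upward slowly."}]})
        # Elapsed time includes prefill, so this slightly understates generation speed.
        return n_tokens / (time.time() - t0)
    finally:
        server.terminate()

for n in (4, 8, 12, 16):
    print(f"--threads {n}: {tok_per_s(n):.2f} tok/s")
```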
Recommended launch command for CPU-only servers:
```
igllama api model.gguf --threads 8 --threads-batch 16 --mlock --ctx-size 8192
```

Bug Fixes
- Thread cap removed: Previous code hardcoded `@min(cpu_threads, 4)` for generation threads, leaving cores idle on multi-core systems. Now defaults to all available cores (tune with `--threads`).
- Large prompt crash fixed: `n_batch` is now set equal to `ctx_size`, preventing `GGML_ASSERT(n_tokens_all <= cparams.n_batch)` crashes when prompts exceed 2048 tokens (e.g., when using AI coding assistants with large system prompts).
Documentation & Website
- Benchmark showcase updated with real EPYC-Rome measurements and thread sweep data
- API docs updated with new flags and CPU performance tuning guide
- CLI reference updated with new `api` command flags
- Installation guide GPU build flags corrected (`-Dmetal=true`, `-Dcuda=true`)
- Qwen3.5 quickstart updated to use `igllama pull` instead of pip, with optimal server command
- Qwen3.5 case study expanded with thread optimization analysis and benchmarks
Merged PRs
- #69 — feat: CPU thread tuning, mlock, n_batch fix, docs update (v0.3.4)
v0.3.3 - Fix Qwen model ID in quickstart docs
Bug Fixes
- Fix broken `igllama pull` example in quickstart docs (#67) — replaced the non-existent `Qwen/Qwen3.5-35B-A3B-GGUF` HuggingFace repo with the correct `unsloth/Qwen3.5-35B-A3B-GGUF` across all 7 occurrences in the quickstart guide (`pull`, `run`, `chat` examples and expected output blocks).
v0.3.2 - Qwen3.5-35B-A3B Support
- Qwen3.5-35B-A3B GGUF support verified and documented
- Chat template auto-detection expanded to 12+ formats
- Session auto-save and resume for interactive chat
- Grammar-constrained generation (GBNF) added to `chat` and `run` commands
- OpenAI-compatible API server (`igllama api`) implemented with `/v1/chat/completions`, `/v1/embeddings`, `/health`
igllama api) implemented with/v1/chat/completions,/v1/embeddings,/health - Import command for local GGUF files