
Releases: bkataru/igllama

v0.3.11 — Strip residual </think> tokens

06 Mar 09:53
8ee067c


Fix

  • Strip </think> from generated output when --no-think is active. The prompt prefills <think>\n\n</think>, but models sometimes still emit </think> as the first token(s). Both streaming and non-streaming paths now detect and strip this prefix, preventing JSON_ERR in downstream consumers such as the powerglide trial harness.
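A minimal Zig sketch of the non-streaming case (the helper name and exact trimming behavior are assumptions; the streaming path has to run the same check incrementally, buffering early bytes until the prefix is ruled in or out):

```zig
const std = @import("std");

/// Hypothetical helper: drop a leaked "</think>" prefix, plus surrounding
/// whitespace, from a finished completion. Text without the leak is
/// returned untouched.
fn stripResidualThink(output: []const u8) []const u8 {
    const tag = "</think>";
    const trimmed = std.mem.trimLeft(u8, output, " \t\n");
    if (std.mem.startsWith(u8, trimmed, tag)) {
        return std.mem.trimLeft(u8, trimmed[tag.len..], " \t\n");
    }
    return output; // no residual tag: leave the text as-is
}

test "strips leaked </think> prefix" {
    try std.testing.expectEqualStrings(
        "{\"ok\":true}",
        stripResidualThink("</think>\n\n{\"ok\":true}"),
    );
}
```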

Verified

  • All tests pass
  • Discovered during powerglide 9B T01-T17 trial (17/17 pass, but with recoverable JSON_ERR noise from </think> leaking)

v0.3.10 — Accurate Usage Token Counts

05 Mar 08:21


What's New

Accurate usage.prompt_tokens and usage.completion_tokens in non-streaming responses

Previously, the non-streaming /v1/chat/completions handler returned hardcoded zeros:

"usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}

Now it returns real counts:

"usage": {"prompt_tokens": 42, "completion_tokens": 55, "total_tokens": 97}

Implementation

handleCompletion() now returns a CompletionResult struct:

  • prompt_tokens — from tokenizer.getTokens().len after prompt tokenization
  • completion_tokens — incremented per generated token in the decode loop
  • total_tokens — sum of the above
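A rough Zig sketch of that shape (field types and the helper method are assumptions; the release only specifies what each count measures):

```zig
// Sketch only: igllama's actual definition may differ.
const CompletionResult = struct {
    text: []const u8,
    prompt_tokens: usize, // tokenizer.getTokens().len after prompt tokenization
    completion_tokens: usize, // incremented once per token in the decode loop

    fn totalTokens(self: CompletionResult) usize {
        return self.prompt_tokens + self.completion_tokens;
    }
};
```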

Notes

  • Streaming handler was already reporting total_tokens separately (unchanged)
  • This enables downstream clients (powerglide bench harness, OpenAI SDK consumers) to accurately measure throughput

v0.3.9 — Grammar-Constrained json_mode

05 Mar 08:20


What's New

Grammar-constrained json_mode via response_format

When clients send "response_format": {"type": "json_object"} in a chat completion request, igllama now:

  1. Loads the built-in JSON_GRAMMAR GBNF grammar as a comptime constant
  2. Wires it into the llama.cpp sampler chain via llama_sampler_init_grammar()
  3. Constrains every token selection to valid JSON — no post-processing needed

Both streaming and non-streaming handlers support json_mode.
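For example, a request opting into json_mode looks like this (localhost:8080 is an assumed listen address, not necessarily igllama's default):

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Give me three primes as a JSON object."}],
    "response_format": {"type": "json_object"}
  }'
```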

Fixed

  • Streaming json_mode use-after-free — the streaming handler previously called loadGrammar(allocator, "json") then defer allocator.free(gs) inside the if-block, freeing the grammar string while the sampler still held a pointer. Replaced with the direct JSON_GRAMMAR comptime constant — no allocation, no lifetime issue.
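Roughly, the change looks like this (variable names and the c-imported binding are assumptions based on the description above):

```zig
// Before (buggy): gs is freed when the if-block exits, but the sampler
// created by llama_sampler_init_grammar still holds a pointer into it.
if (json_mode) {
    const gs = try loadGrammar(allocator, "json");
    defer allocator.free(gs);
    c.llama_sampler_chain_add(chain, c.llama_sampler_init_grammar(vocab, gs.ptr, "root"));
}

// After (fixed): a comptime constant lives in static memory for the whole
// program, so there is nothing to free and nothing to dangle.
if (json_mode) {
    c.llama_sampler_chain_add(chain, c.llama_sampler_init_grammar(vocab, JSON_GRAMMAR.ptr, "root"));
}
```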

Notes

  • json_mode is silently ignored if the model's vocabulary doesn't support grammar sampling
  • Works best with the --no-think flag to suppress reasoning preamble

v0.3.8 - Qwen 3.5 Small Benchmarks & Website Refactor

03 Mar 23:38


  • Qwen 3.5 Small Series Benchmarks: Comprehensive triage and benchmarks for 0.8B, 2B, 4B, and 9B models
  • Website Reorganization: Refactored benchmark showcase into structured subpages (showcase/qwen35-small and showcase/qwen35-35b)
  • UX Improvements: Added backlinks to showcase subpages and a new "Learn More" navigation hub on the homepage
  • History Cleanup: Pruned large build artifacts from gh-pages and corrected author information

v0.3.7 - Documentation Consolidation & llama.cpp Bump

02 Mar 11:40


  • Documentation Fix: Restored truncated philosophy.smd content
  • Bumped llama.cpp submodule to gguf-v0.18.0 (Vulkan AMD partial offload improvements, CUDA grid fixes)
  • Added website/.gitignore to suppress Zine build output files from git status
  • Documentation consolidation pass: development-notes, showcase, api.smd, version strings

v0.3.6 - OpenAI-Compatible Streaming Fix

02 Mar 11:16


What's New in v0.3.6

Bug Fix: Streaming Response Compatibility

Forge code and other strict OpenAI-compatible clients (using Rust serde parsers) previously crashed with:

Failed to parse provider response: data did not match any variant of untagged enum Response

Root cause: igllama's SSE stream was missing required OpenAI spec fields:

  1. Initial role chunk — OpenAI requires delta: {"role":"assistant","content":""} as the first SSE chunk before any content
  2. model and created fields — Required in every SSE chunk
  3. Final stop chunk — delta: {} with finish_reason: "stop" must be sent before data: [DONE]

Fix: All three issues resolved. Streaming and non-streaming responses now match the OpenAI spec exactly.
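A spec-conformant stream now looks like this (the id, created, and model values are illustrative):

```
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1740900000,"model":"model.gguf","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1740900000,"model":"model.gguf","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1740900000,"model":"model.gguf","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```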

Tested With

  • forge code CLI: forge --prompt and forge --agent sage subagent delegation both work
  • No <think> blocks: --no-think confirmed working across all forge agents
  • Response latency on AMD EPYC-Rome (16-core, no GPU): ~50–90s (system prompt prefill; no reasoning overhead)

Recommended Launch Command

igllama api model.gguf \
  --threads 8 --threads-batch 16 \
  --mlock --ctx-size 8192 \
  --no-think

Merged PRs

  • #71 — fix: make streaming response fully OpenAI-compatible (v0.3.6)

v0.3.5 - Thinking Mode Suppression (--no-think)

02 Mar 10:35


What's New in v0.3.5

New Feature: --no-think Flag

Qwen3.5 and similar reasoning models produce <think>...</think> chain-of-thought blocks before every response. On CPU hardware these blocks can run 200+ tokens long, adding 30–90 seconds of latency before the actual answer.

--no-think suppresses this entirely:

igllama api model.gguf --no-think

How it works: igllama pre-fills an empty <think>\n\n</think> block on the assistant turn. The model treats the reasoning phase as already complete and jumps directly to the answer — the standard llama.cpp technique for disabling Qwen3-style chain-of-thought.
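With Qwen's ChatML-style template, the rendered prompt ends with the pre-filled empty reasoning block, roughly like this (the exact template text is the model's; shown here for illustration):

```
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
<think>

</think>
```

Generation then begins immediately after the closed block, which is also why v0.3.11 has to strip the occasional stray </think> that models re-emit anyway.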

Timing comparison (AMD EPYC-Rome, Qwen3.5-35B-A3B):

| Mode | Simple query latency |
| --- | --- |
| Default (thinking on) | ~45 s |
| --no-think | ~9 s |

Recommended full launch command for CPU servers:

igllama api model.gguf \
  --threads 8 --threads-batch 16 \
  --mlock --ctx-size 8192 \
  --no-think

Merged PRs

  • #70 — feat: add --no-think flag to suppress Qwen3 reasoning blocks (v0.3.5)

v0.3.4 - CPU Thread Tuning & Performance Optimization

02 Mar 07:08


What's New in v0.3.4

Performance Improvements

New igllama api flags for CPU thread tuning:

  • --threads / -t — Set generation thread count (GEMV, memory-bandwidth bound). Default: all cores. Optimal: set to your CPU's memory channel count.
  • --threads-batch / -tb — Set prefill thread count (GEMM, compute-parallel). Default: all cores.
  • --mlock — Pin model weights in physical RAM, preventing OS paging. Critical for consistent throughput when model size approaches available RAM.

Benchmark results on AMD EPYC-Rome 16-core (no GPU):

| Config | Generation speed |
| --- | --- |
| Previous default (4-thread cap) | 4.40 tok/s |
| --threads 8 --threads-batch 16 | 5.56 tok/s (+26%) |

Recommended launch command for CPU-only servers:

igllama api model.gguf --threads 8 --threads-batch 16 --mlock --ctx-size 8192

Bug Fixes

  • Thread cap removed: Previous code hardcoded @min(cpu_threads, 4) for generation threads, leaving cores idle on multi-core systems. Now defaults to all available cores (tune with --threads).
  • Large prompt crash fixed: n_batch is now set equal to ctx_size, preventing GGML_ASSERT(n_tokens_all <= cparams.n_batch) crashes when prompts exceed 2048 tokens (e.g., when using AI coding assistants with large system prompts).
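In terms of llama.cpp's context parameters, the n_batch fix amounts to something like this (a sketch assuming c-imported bindings; variable names are illustrative, and the context-creation call differs across llama.cpp versions):

```zig
var cparams = c.llama_context_default_params();
cparams.n_ctx = ctx_size;
// n_batch must be at least as large as the biggest prompt handed to a
// single decode call; tying it to the context size removes the 2048 cap.
cparams.n_batch = ctx_size;
const ctx = c.llama_init_from_model(model, cparams);
```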

Documentation & Website

  • Benchmark showcase updated with real EPYC-Rome measurements and thread sweep data
  • API docs updated with new flags and CPU performance tuning guide
  • CLI reference updated with new api command flags
  • Installation guide GPU build flags corrected (-Dmetal=true, -Dcuda=true)
  • Qwen3.5 quickstart updated to use igllama pull instead of pip, with optimal server command
  • Qwen3.5 case study expanded with thread optimization analysis and benchmarks

Merged PRs

  • #69 — feat: CPU thread tuning, mlock, n_batch fix, docs update (v0.3.4)

v0.3.3 - Fix Qwen model ID in quickstart docs

02 Mar 05:15


Bug Fixes

  • Fix broken igllama pull example in quickstart docs (#67) — replaced the non-existent Qwen/Qwen3.5-35B-A3B-GGUF HuggingFace repo with the correct unsloth/Qwen3.5-35B-A3B-GGUF across all 7 occurrences in the quickstart guide (pull, run, chat examples and expected output blocks).

Documentation

  • Update README version footer to v0.3.3 (#68) — reflects the v0.3.3 release and PR #67.

Full Changelog

v0.3.2...v0.3.3

v0.3.2 - Qwen3.5-35B-A3B Support

26 Feb 00:45


  • Qwen3.5-35B-A3B GGUF support verified and documented
  • Chat template auto-detection expanded to 12+ formats
  • Session auto-save and resume for interactive chat
  • Grammar-constrained generation (GBNF) added to chat and run commands
  • OpenAI-compatible API server (igllama api) implemented with /v1/chat/completions, /v1/embeddings, /health
  • Import command for local GGUF files