Releases: bkataru/igllama
v0.3.11 — Strip residual </think> tokens
Fix
- Strip `</think>` from generated output when `--no-think` is active — the prompt prefills `<think>\n\n</think>`, but models sometimes still emit `</think>` as the first token(s). Both streaming and non-streaming paths now detect and strip this prefix, preventing `JSON_ERR` in downstream consumers like the powerglide trial harness.
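The check itself is simple; here is a Python sketch of the behavior (illustrative only — igllama's actual implementation is in Zig, and the exact whitespace handling is an assumption):

```python
# Illustrative sketch, not igllama's Zig source; whitespace handling is assumed.
RESIDUAL = "</think>"

def strip_residual_think(text: str) -> str:
    """Drop a leading </think> tag left over despite the --no-think prefill."""
    stripped = text.lstrip()
    if stripped.startswith(RESIDUAL):
        return stripped[len(RESIDUAL):].lstrip()
    return text

# Streaming needs one extra step: the first chunks are buffered until the output
# is long enough to decide whether it starts with the tag, then flushed.
```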
Verified
- All tests pass
- Discovered during the powerglide 9B T01–T17 trial (17/17 pass, but with recoverable `JSON_ERR` noise from `</think>` leaking)
v0.3.10 — Accurate Usage Token Counts
What's New
Accurate `usage.prompt_tokens` and `usage.completion_tokens` in non-streaming responses
Previously, the non-streaming `/v1/chat/completions` handler returned hardcoded zeros:
"usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}Now it returns real counts:
"usage": {"prompt_tokens": 42, "completion_tokens": 55, "total_tokens": 97}Implementation
`handleCompletion()` now returns a `CompletionResult` struct:
- `prompt_tokens` — from `tokenizer.getTokens().len` after prompt tokenization
- `completion_tokens` — incremented per generated token in the decode loop
- `total_tokens` — sum of the above
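For example, a client can now derive throughput straight from a response (a minimal sketch; the host, port, and model id are assumptions):

```python
# Minimal throughput check against a local igllama server (host/model assumed).
import time
import requests

t0 = time.time()
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "local", "messages": [{"role": "user", "content": "Hello"}]},
).json()
elapsed = time.time() - t0

usage = resp["usage"]
print(f"{usage['prompt_tokens']} prompt + {usage['completion_tokens']} completion "
      f"= {usage['total_tokens']} tokens, ~{usage['completion_tokens'] / elapsed:.2f} tok/s")
```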
Notes
- Streaming handler was already reporting `total_tokens` separately (unchanged)
- This enables downstream clients (powerglide bench harness, OpenAI SDK consumers) to accurately measure throughput
v0.3.9 — Grammar-Constrained json_mode
What's New
Grammar-constrained `json_mode` via `response_format`
When clients send `"response_format": {"type": "json_object"}` in a chat completion request, igllama now:
- Loads the built-in `JSON_GRAMMAR` GBNF grammar as a comptime constant
- Wires it into the llama.cpp sampler chain via `llama_sampler_init_grammar()`
- Constrains every token selection to valid JSON — no post-processing needed
Both streaming and non-streaming handlers support `json_mode`.
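A request exercising the new mode might look like this (a sketch; host, port, and model id are assumptions):

```python
# json_mode request sketch (host/model assumed); any OpenAI-compatible client works.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Give three primes as a JSON object."}],
        "response_format": {"type": "json_object"},  # enables grammar-constrained sampling
    },
).json()
print(resp["choices"][0]["message"]["content"])  # valid JSON, no post-processing needed
```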
Fixed
- Streaming `json_mode` use-after-free — the streaming handler previously called `loadGrammar(allocator, "json")` then `defer allocator.free(gs)` inside the if-block, freeing the grammar string while the sampler still held a pointer. Replaced with the direct `JSON_GRAMMAR` comptime constant — no allocation, no lifetime issue.
Notes
- `json_mode` is silently ignored if the model's vocabulary doesn't support grammar sampling
- Works best with the `--no-think` flag to suppress reasoning preamble
v0.3.8 - Qwen 3.5 Small Benchmarks & Website Refactor
- Qwen 3.5 Small Series Benchmarks: Comprehensive triage and benchmarks for 0.8B, 2B, 4B, and 9B models
- Website Reorganization: Refactored benchmark showcase into structured subpages (`showcase/qwen35-small` and `showcase/qwen35-35b`)
- UX Improvements: Added backlinks to showcase subpages and a new "Learn More" navigation hub on the homepage
- History Cleanup: Pruned large build artifacts from `gh-pages` and corrected author information
v0.3.7 - Documentation Consolidation & llama.cpp Bump
- Documentation Fix: Restored truncated `philosophy.smd` content
- Bumped llama.cpp submodule to gguf-v0.18.0 (Vulkan AMD partial offload improvements, CUDA grid fixes)
- Added `website/.gitignore` to suppress Zine build output files from `git status`
- Documentation consolidation pass: development-notes, showcase, api.smd, version strings
v0.3.6 - OpenAI-Compatible Streaming Fix
What's New in v0.3.6
Bug Fix: Streaming Response Compatibility
Forge code and other strict OpenAI-compatible clients (using Rust serde parsers) previously crashed with:
```
Failed to parse provider response: data did not match any variant of untagged enum Response
```
Root cause: igllama's SSE stream was missing required OpenAI spec fields:
- Initial role chunk — OpenAI requires `delta: {"role":"assistant","content":""}` as the first SSE chunk before any content
- `model` and `created` fields — required in every SSE chunk
- Final stop chunk — `delta: {}, finish_reason: "stop"` must be sent before `data: [DONE]`
Fix: All three issues resolved. Streaming and non-streaming responses now match the OpenAI spec exactly.
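The required chunk sequence, sketched as a Python generator (field shapes follow the OpenAI streaming spec; the id, model, and timestamp values here are placeholders):

```python
# Shapes per the OpenAI streaming spec; id/model/created values are made up.
import json
import time

def sse(payload: dict) -> str:
    return f"data: {json.dumps(payload)}\n\n"

base = {"id": "chatcmpl-0", "object": "chat.completion.chunk",
        "created": int(time.time()), "model": "local"}  # model/created in EVERY chunk

def chunks(text_pieces):
    # 1. Initial role chunk, before any content.
    yield sse({**base, "choices": [{"index": 0,
               "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]})
    # 2. Content deltas.
    for piece in text_pieces:
        yield sse({**base, "choices": [{"index": 0,
                   "delta": {"content": piece}, "finish_reason": None}]})
    # 3. Final stop chunk, then the sentinel.
    yield sse({**base, "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]})
    yield "data: [DONE]\n\n"
```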
Tested With
- forge code CLI: `forge --prompt` and `forge --agent sage` subagent delegation both work
- No `<think>` blocks: `--no-think` confirmed working across all forge agents
- Response latency on AMD EPYC-Rome (16-core, no GPU): ~50–90 s (system prompt prefill; no reasoning overhead)
Recommended Launch Command
```
igllama api model.gguf \
  --threads 8 --threads-batch 16 \
  --mlock --ctx-size 8192 \
  --no-think
```

Merged PRs
- #71 — fix: make streaming response fully OpenAI-compatible (v0.3.6)
v0.3.5 - Thinking Mode Suppression (--no-think)
What's New in v0.3.5
New Feature: --no-think Flag
Qwen3.5 and similar reasoning models produce <think>...</think> chain-of-thought blocks before every response. On CPU hardware these blocks can run 200+ tokens long, adding 30–90 seconds of latency before the actual answer.
`--no-think` suppresses this entirely:
```
igllama api model.gguf --no-think
```

How it works: igllama pre-fills an empty `<think>\n\n</think>` block on the assistant turn. The model treats the reasoning phase as already complete and jumps directly to the answer — the standard llama.cpp technique for disabling Qwen3-style chain-of-thought.
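As a sketch of the technique (Qwen's ChatML-style turn markers; the helper name is hypothetical, and igllama does this inside its Zig chat-template code, not Python):

```python
# Hypothetical illustration of the prefill trick.
def assistant_prefix(no_think: bool) -> str:
    prefix = "<|im_start|>assistant\n"  # ChatML-style assistant turn marker
    if no_think:
        # An already-closed, empty think block: the model sees reasoning as
        # finished and generates the answer immediately.
        prefix += "<think>\n\n</think>\n\n"
    return prefix
```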
Timing comparison (AMD EPYC-Rome, Qwen3.5-35B-A3B):
| Mode | Simple query latency |
|---|---|
| Default (thinking on) | ~45 s |
| `--no-think` | ~9 s |
Recommended full launch command for CPU servers:
```
igllama api model.gguf \
  --threads 8 --threads-batch 16 \
  --mlock --ctx-size 8192 \
  --no-think
```

Merged PRs
- #70 — feat: add --no-think flag to suppress Qwen3 reasoning blocks (v0.3.5)
v0.3.4 - CPU Thread Tuning & Performance Optimization
What's New in v0.3.4
Performance Improvements
New `igllama api` flags for CPU thread tuning:
- `--threads` / `-t` — Set generation thread count (GEMV, memory-bandwidth bound). Default: all cores. Optimal: set to your CPU's memory channel count.
- `--threads-batch` / `-tb` — Set prefill thread count (GEMM, compute-parallel). Default: all cores.
- `--mlock` — Pin model weights in physical RAM, preventing OS paging. Critical for consistent throughput when model size approaches available RAM.
Benchmark results on AMD EPYC-Rome 16-core (no GPU):
| Config | Generation speed |
|---|---|
| Previous default (4-thread cap) | 4.40 tok/s |
| `--threads 8 --threads-batch 16` | 5.56 tok/s (+26%) |
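To reproduce a sweep like this on other hardware, a rough harness might look like the following (a sketch; the port, model id, fixed load wait, and `max_tokens` support are assumptions; poll `/health` in real use):

```python
# Rough thread-sweep harness (port, model id, and load wait are assumptions).
import subprocess
import time
import requests

def tok_per_s(threads: int, n_tokens: int = 128) -> float:
    server = subprocess.Popen(
        ["igllama", "api", "model.gguf",
         "--threads", str(threads), "--threads-batch", "16",
         "--mlock", "--ctx-size", "8192"])
    try:
        time.sleep(30)  # crude: wait for model load instead of polling /health
        t0 = time.time()
        requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={"model": "local", "max_tokens": n_tokens,
                  "messages": [{"role": "user", "content": "Count upward slowly."}]})
        # Elapsed time includes prefill, so this slightly understates generation speed.
        return n_tokens / (time.time() - t0)
    finally:
        server.terminate()

for n in (4, 8, 12, 16):
    print(f"--threads {n}: {tok_per_s(n):.2f} tok/s")
```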
Recommended launch command for CPU-only servers:
```
igllama api model.gguf --threads 8 --threads-batch 16 --mlock --ctx-size 8192
```

Bug Fixes
- Thread cap removed: Previous code hardcoded `@min(cpu_threads, 4)` for generation threads, leaving cores idle on multi-core systems. Now defaults to all available cores (tune with `--threads`).
- Large prompt crash fixed: `n_batch` is now set equal to `ctx_size`, preventing `GGML_ASSERT(n_tokens_all <= cparams.n_batch)` crashes when prompts exceed 2048 tokens (e.g., when using AI coding assistants with large system prompts).
Documentation & Website
- Benchmark showcase updated with real EPYC-Rome measurements and thread sweep data
- API docs updated with new flags and CPU performance tuning guide
- CLI reference updated with new `api` command flags
- Installation guide GPU build flags corrected (`-Dmetal=true`, `-Dcuda=true`)
- Qwen3.5 quickstart updated to use `igllama pull` instead of pip, with optimal server command
- Qwen3.5 case study expanded with thread optimization analysis and benchmarks
Merged PRs
- #69 — feat: CPU thread tuning, mlock, n_batch fix, docs update (v0.3.4)
v0.3.3 - Fix Qwen model ID in quickstart docs
Bug Fixes
- Fix broken `igllama pull` example in quickstart docs (#67) — replaced the non-existent `Qwen/Qwen3.5-35B-A3B-GGUF` HuggingFace repo with the correct `unsloth/Qwen3.5-35B-A3B-GGUF` across all 7 occurrences in the quickstart guide (`pull`, `run`, `chat` examples and expected output blocks).
v0.3.2 - Qwen3.5-35B-A3B Support
- Qwen3.5-35B-A3B GGUF support verified and documented
- Chat template auto-detection expanded to 12+ formats
- Session auto-save and resume for interactive chat
- Grammar-constrained generation (GBNF) added to `chat` and `run` commands
- OpenAI-compatible API server (`igllama api`) implemented with `/v1/chat/completions`, `/v1/embeddings`, `/health`
igllama api) implemented with/v1/chat/completions,/v1/embeddings,/health - Import command for local GGUF files