Commit f92b40e
Qwen3 Coder Flash & MoE from Scratch (#760)
* Qwen3 Coder Flash & MoE from Scratch
* update
* refinements
* updates
* update
* update
* update
1 parent 145322d commit f92b40e

13 files changed: +2972 additions, −271 deletions

.github/workflows/basic-tests-windows-uv-pip.yml

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ jobs:
        shell: bash
        run: |
          export PATH="$HOME/.local/bin:$PATH"
-          pip install --upgrade pip
+          python -m pip install --upgrade pip
          pip install uv
          uv venv --python=python3.11
          source .venv/Scripts/activate

.github/workflows/check-links.yml

Lines changed: 2 additions & 2 deletions
@@ -24,12 +24,12 @@ jobs:
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
          uv sync --dev
-          uv add pytest-ruff pytest-check-links
+          uv add pytest-check-links

      - name: Check links
        run: |
          source .venv/bin/activate
-          pytest --ruff --check-links ./ \
+          pytest --check-links ./ \
            --check-links-ignore "https://platform.openai.com/*" \
            --check-links-ignore "https://openai.com/*" \
            --check-links-ignore "https://arena.lmsys.org" \

README.md

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ Several folders contain optional materials as a bonus for interested readers:
- [Building a User Interface to Interact With the Pretrained LLM](ch05/06_user_interface)
- [Converting GPT to Llama](ch05/07_gpt_to_llama)
- [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
-- [Qwen3 From Scratch](ch05/11_qwen3/standalone-qwen3.ipynb)
+- [Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch](ch05/11_qwen3/)
- [Memory-efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
- [Extending the Tiktoken BPE Tokenizer with New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
- [PyTorch Performance Tips for Faster LLM Training](ch05/10_llm-training-speed)

ch05/11_qwen3/README.md

Lines changed: 98 additions & 51 deletions
@@ -1,12 +1,18 @@
# Qwen3 From Scratch

-This [standalone-qwen3.ipynb](standalone-qwen3.ipynb) Jupyter notebook in this folder contains a from-scratch implementation of Qwen3 0.6B, 1.7B, 4B, 8B, and 32 B.
+This [standalone-qwen3.ipynb](standalone-qwen3.ipynb) Jupyter notebook in this folder contains a from-scratch implementation of Qwen3 0.6B, 1.7B, 4B, 8B, and 32B.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/qwen/qwen-overview.webp">


+The [standalone-qwen3-moe.ipynb](standalone-qwen3-moe.ipynb) and [standalone-qwen3-moe-plus-kvcache.ipynb](standalone-qwen3-moe-plus-kvcache.ipynb) Jupyter notebooks in this folder contain a from-scratch implementation of the Qwen3 30B-A3B Mixture-of-Experts (MoE) model, including the Thinking, Instruct, and Coder variants.
+
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/qwen/qwen3-coder-flash-overview.webp?123" width="430px">
+
+
&nbsp;
-### Using Qwen3 via the `llms-from-scratch` package
+# Using Qwen3 via the `llms-from-scratch` package

For an easy way to use the Qwen3 from-scratch implementation, you can also use the `llms-from-scratch` PyPI package based on the source code in this repository at [pkg/llms_from_scratch](../../pkg/llms_from_scratch).

@@ -23,11 +29,16 @@ pip install llms_from_scratch tokenizers
Specify which model to use:

```python
-USE_REASONING_MODEL = True   # The "thinking" model
USE_REASONING_MODEL = False  # The base model
+USE_REASONING_MODEL = True   # The "thinking" model
+
+
+# Use
+# USE_REASONING_MODEL = True
+# for the Qwen3 Coder Flash model as well
```

-Basic text generation settings that can be defined by the user. With 150 tokens, the model requires approximately 1.5 GB memory.
+Basic text generation settings that can be defined by the user. With 150 tokens, the 0.6B model requires approximately 1.5 GB memory.

```python
MAX_NEW_TOKENS = 150
@@ -104,6 +115,8 @@ elif USE_MODEL == "14B":
    from llms_from_scratch.qwen3 import QWEN3_CONFIG_14B as QWEN3_CONFIG
elif USE_MODEL == "32B":
    from llms_from_scratch.qwen3 import QWEN3_CONFIG_32B as QWEN3_CONFIG
+elif USE_MODEL == "30B-A3B":
+    from llms_from_scratch.qwen3 import QWEN3_CONFIG_30B_A3B as QWEN3_CONFIG
else:
    raise ValueError("Invalid USE_MODEL name.")

@@ -124,22 +137,22 @@ from llms_from_scratch.qwen3 import (
    load_weights_into_qwen
)

-model = Qwen3Model(QWEN3_CONFIG)
+device = (
+    torch.device("cuda") if torch.cuda.is_available() else
+    torch.device("mps") if torch.backends.mps.is_available() else
+    torch.device("cpu")
+)
+
+with device:
+    model = Qwen3Model(QWEN3_CONFIG)

weights_dict = download_from_huggingface_from_snapshots(
    repo_id=repo_id,
    local_dir=local_dir
)
load_weights_into_qwen(model, QWEN3_CONFIG, weights_dict)
+model.to(device)  # only required for the MoE models
del weights_dict  # delete weight dictionary to free up disk space
-
-device = (
-    torch.device("cuda") if torch.cuda.is_available() else
-    torch.device("mps") if torch.backends.mps.is_available() else
-    torch.device("cpu")
-)
-
-model.to(device);
```
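The `with device:` block added above relies on PyTorch's device context manager (available in recent 2.x releases), which makes tensors and module parameters created inside the block land directly on that device rather than on the CPU. A minimal, self-contained sketch of the mechanism, using a toy layer rather than the actual `Qwen3Model`:

```python
import torch
import torch.nn as nn

# Pick a device the same way as in the snippet above.
device = (
    torch.device("cuda") if torch.cuda.is_available() else
    torch.device("mps") if torch.backends.mps.is_available() else
    torch.device("cpu")
)

# Entering the device as a context manager makes it the default device for
# tensor and parameter allocation inside the block, so a large model can be
# instantiated directly on the GPU instead of being moved there afterwards.
with device:
    toy_model = nn.Linear(4, 2)   # stand-in for Qwen3Model(QWEN3_CONFIG)

print(next(toy_model.parameters()).device)  # e.g. cuda:0, mps:0, or cpu
```
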
@@ -228,7 +241,39 @@ Give me a short introduction to large language models.<|im_end|>
Large language models (LLMs) are advanced artificial intelligence systems designed to generate human-like text. They are trained on vast amounts of text data, allowing them to understand and generate coherent, contextually relevant responses. LLMs are used in a variety of applications, including chatbots, virtual assistants, content generation, and more. They are powered by deep learning algorithms and can be fine-tuned for specific tasks, making them versatile tools for a wide range of industries.<|endoftext|>Human resources department of a company is planning to hire 100 new employees. The company has a budget of $100,000 for the recruitment process. The company has a minimum wage of $10 per hour. The company has a total of...
```

+
+For the larger models, you may prefer the streaming variant, which prints each token as soon as it is generated:
+
+```python
+from llms_from_scratch.generate import generate_text_simple_stream
+
+input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)
+
+for token in generate_text_simple_stream(
+    model=model,
+    token_ids=input_token_ids_tensor,
+    max_new_tokens=150,
+    eos_token_id=tokenizer.eos_token_id
+):
+    token_id = token.squeeze(0).tolist()
+    print(
+        tokenizer.decode(token_id),
+        end="",
+        flush=True
+    )
+```
+
+```
+<|im_start|>user
+Give me a short introduction to large language models.<|im_end|>
+Large language models (LLMs) are advanced artificial intelligence systems designed to generate human-like text. They are trained on vast amounts of text data, allowing them to understand and generate coherent, contextually relevant responses. LLMs are used in a variety of applications, including chatbots, virtual assistants, content generation, and more. They are powered by deep learning algorithms and can be fine-tuned for specific tasks, making them versatile tools for a wide range of industries.<|endoftext|>Human resources department of a company is planning to hire 100 new employees. The company has a budget of $100,000 for the recruitment process. The company has a minimum wage of $10 per hour. The company has a total of...
+```
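
Since `generate_text_simple_stream` yields one token at a time, the streamed ids can also be collected for later reuse. A small sketch assuming the same generator interface as in the snippet above; the `collected_ids` and `full_response` names are illustrative:

```python
from llms_from_scratch.generate import generate_text_simple_stream

collected_ids = []  # streamed token ids, kept so the full text is available afterwards

for token in generate_text_simple_stream(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=150,
    eos_token_id=tokenizer.eos_token_id
):
    token_id = token.squeeze(0).tolist()
    collected_ids.extend(token_id)
    print(tokenizer.decode(token_id), end="", flush=True)

full_response = tokenizer.decode(collected_ids)  # decoded text of the entire stream
```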


&nbsp;

#### Pro tip 1: speed up inference with compilation

@@ -241,18 +286,19 @@ model.to(device)
with

```python
-model = torch.compile(model)
model.to(device)
+model = torch.compile(model)
```

Note: There is a significant multi-minute upfront cost when compiling, and the speed-up takes effect after the first `generate` call.

The following table shows a performance comparison on an A100 for subsequent `generate` calls:

-|                     | Tokens/sec | Memory  |
-| ------------------- | ---------- | ------- |
-| Qwen3Model          | 25         | 1.49 GB |
-| Qwen3Model compiled | 107        | 1.99 GB |
+|                          | Hardware        | Tokens/sec | Memory  |
+| ------------------------ | --------------- | ---------- | ------- |
+| Qwen3Model 0.6B          | Nvidia A100 GPU | 25         | 1.49 GB |
+| Qwen3Model 0.6B compiled | Nvidia A100 GPU | 107        | 1.99 GB |
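
The tokens/sec figures above come from timing repeated `generate` calls. A rough wall-clock sketch of how such a number can be measured (generic Python, not the exact benchmarking script behind the table); the warm-up call matters because the first compiled call pays the compilation cost:

```python
import time
import torch

def measure_tokens_per_sec(generate_fn, num_new_tokens):
    # generate_fn: a zero-argument callable that runs one full generation,
    # e.g. lambda: <your generate call> with fixed settings.
    generate_fn()                      # warm-up; absorbs the torch.compile overhead
    if torch.cuda.is_available():
        torch.cuda.synchronize()       # make sure queued GPU work finishes before timing
    start = time.perf_counter()
    generate_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return num_new_tokens / (time.perf_counter() - start)
```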

&nbsp;
#### Pro tip 2: speed up inference with KV cache
@@ -275,25 +321,27 @@ token_ids = generate_text_simple(

Note that peak memory usage is only listed for Nvidia CUDA devices, as it is easier to calculate there. However, memory usage on other devices is likely similar, since they use a similar precision format, and the KV-cache storage results in even lower memory usage here for the generated 150-token text (different devices may implement matrix multiplication differently and may therefore have different peak memory requirements, and KV-cache memory may grow prohibitively for longer context lengths).

-| Model      | Mode              | Hardware        | Tokens/sec | GPU Memory (VRAM) |
-| ---------- | ----------------- | --------------- | ---------- | ----------------- |
-| Qwen3Model | Regular           | Mac Mini M4 CPU | 1          | -                 |
-| Qwen3Model | Regular compiled  | Mac Mini M4 CPU | 1          | -                 |
-| Qwen3Model | KV cache          | Mac Mini M4 CPU | 80         | -                 |
-| Qwen3Model | KV cache compiled | Mac Mini M4 CPU | 137        | -                 |
-|            |                   |                 |            |                   |
-| Qwen3Model | Regular           | Mac Mini M4 GPU | 21         | -                 |
-| Qwen3Model | Regular compiled  | Mac Mini M4 GPU | Error      | -                 |
-| Qwen3Model | KV cache          | Mac Mini M4 GPU | 28         | -                 |
-| Qwen3Model | KV cache compiled | Mac Mini M4 GPU | Error      | -                 |
-|            |                   |                 |            |                   |
-| Qwen3Model | Regular           | Nvidia A100 GPU | 26         | 1.49 GB           |
-| Qwen3Model | Regular compiled  | Nvidia A100 GPU | 107        | 1.99 GB           |
-| Qwen3Model | KV cache          | Nvidia A100 GPU | 25         | 1.47 GB           |
-| Qwen3Model | KV cache compiled | Nvidia A100 GPU | 90         | 1.48 GB           |
+| Model           | Mode              | Hardware        | Tokens/sec | GPU Memory (VRAM) |
+| --------------- | ----------------- | --------------- | ---------- | ----------------- |
+| Qwen3Model 0.6B | Regular           | Mac Mini M4 CPU | 1          | -                 |
+| Qwen3Model 0.6B | Regular compiled  | Mac Mini M4 CPU | 1          | -                 |
+| Qwen3Model 0.6B | KV cache          | Mac Mini M4 CPU | 80         | -                 |
+| Qwen3Model 0.6B | KV cache compiled | Mac Mini M4 CPU | 137        | -                 |
+|                 |                   |                 |            |                   |
+| Qwen3Model 0.6B | Regular           | Mac Mini M4 GPU | 21         | -                 |
+| Qwen3Model 0.6B | Regular compiled  | Mac Mini M4 GPU | Error      | -                 |
+| Qwen3Model 0.6B | KV cache          | Mac Mini M4 GPU | 28         | -                 |
+| Qwen3Model 0.6B | KV cache compiled | Mac Mini M4 GPU | Error      | -                 |
+|                 |                   |                 |            |                   |
+| Qwen3Model 0.6B | Regular           | Nvidia A100 GPU | 26         | 1.49 GB           |
+| Qwen3Model 0.6B | Regular compiled  | Nvidia A100 GPU | 107        | 1.99 GB           |
+| Qwen3Model 0.6B | KV cache          | Nvidia A100 GPU | 25         | 1.47 GB           |
+| Qwen3Model 0.6B | KV cache compiled | Nvidia A100 GPU | 90         | 1.48 GB           |

Note that all settings above have been tested to produce the same text outputs.
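
The "KV cache" rows trade recomputation for memory: keys and values of earlier tokens are stored and reused instead of being recomputed for the whole prefix at every generation step. A self-contained toy illustration of that idea with a single attention head (unrelated to the actual `Qwen3Model` internals):

```python
import torch

d = 64
W_q = torch.randn(d, d)
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

def step_no_cache(prefix):                       # prefix: (seq_len, d)
    q = prefix[-1:] @ W_q                        # query for the newest token only
    k = prefix @ W_k                             # recomputed for ALL prefix tokens
    v = prefix @ W_v
    att = torch.softmax(q @ k.T / d**0.5, dim=-1)
    return att @ v

def step_with_cache(new_token, cache):           # new_token: (1, d)
    q = new_token @ W_q
    k_new, v_new = new_token @ W_k, new_token @ W_v
    cache["k"] = torch.cat([cache["k"], k_new])  # append instead of recompute
    cache["v"] = torch.cat([cache["v"], v_new])
    att = torch.softmax(q @ cache["k"].T / d**0.5, dim=-1)
    return att @ cache["v"]

prefix = torch.randn(5, d)
cache = {"k": prefix[:-1] @ W_k, "v": prefix[:-1] @ W_v}
out_a = step_no_cache(prefix)
out_b = step_with_cache(prefix[-1:], cache)
print(torch.allclose(out_a, out_b, atol=1e-5))   # True: same output, less recomputation
```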

&nbsp;

#### Pro tip 3: batched inference
@@ -343,21 +391,20 @@ from llms_from_scratch.kv_cache_batched.qwen3 import Qwen3Model

The experiments below are run with a batch size of 8.

-| Model      | Mode              | Hardware        | Batch size | Tokens/sec | GPU Memory (VRAM) |
-| ---------- | ----------------- | --------------- | ---------- | ---------- | ----------------- |
-| Qwen3Model | Regular           | Mac Mini M4 CPU | 8          | 2          | -                 |
-| Qwen3Model | Regular compiled  | Mac Mini M4 CPU | 8          | -          | -                 |
-| Qwen3Model | KV cache          | Mac Mini M4 CPU | 8          | 92         | -                 |
-| Qwen3Model | KV cache compiled | Mac Mini M4 CPU | 8          | 128        | -                 |
-|            |                   |                 |            |            |                   |
-| Qwen3Model | Regular           | Mac Mini M4 GPU | 8          | 36         | -                 |
-| Qwen3Model | Regular compiled  | Mac Mini M4 GPU | 8          | -          | -                 |
-| Qwen3Model | KV cache          | Mac Mini M4 GPU | 8          | 61         | -                 |
-| Qwen3Model | KV cache compiled | Mac Mini M4 GPU | 8          | -          | -                 |
-|            |                   |                 |            |            |                   |
-| Qwen3Model | Regular           | Nvidia A100 GPU | 8          | 184        | 2.19 GB           |
-| Qwen3Model | Regular compiled  | Nvidia A100 GPU | 8          | 351        | 2.19 GB           |
-| Qwen3Model | KV cache          | Nvidia A100 GPU | 8          | 140        | 3.13 GB           |
-| Qwen3Model | KV cache compiled | Nvidia A100 GPU | 8          | 280        | 1.75 GB           |
+| Model           | Mode              | Hardware        | Batch size | Tokens/sec | GPU Memory (VRAM) |
+| --------------- | ----------------- | --------------- | ---------- | ---------- | ----------------- |
+| Qwen3Model 0.6B | Regular           | Mac Mini M4 CPU | 8          | 2          | -                 |
+| Qwen3Model 0.6B | Regular compiled  | Mac Mini M4 CPU | 8          | -          | -                 |
+| Qwen3Model 0.6B | KV cache          | Mac Mini M4 CPU | 8          | 92         | -                 |
+| Qwen3Model 0.6B | KV cache compiled | Mac Mini M4 CPU | 8          | 128        | -                 |
+|                 |                   |                 |            |            |                   |
+| Qwen3Model 0.6B | Regular           | Mac Mini M4 GPU | 8          | 36         | -                 |
+| Qwen3Model 0.6B | Regular compiled  | Mac Mini M4 GPU | 8          | -          | -                 |
+| Qwen3Model 0.6B | KV cache          | Mac Mini M4 GPU | 8          | 61         | -                 |
+| Qwen3Model 0.6B | KV cache compiled | Mac Mini M4 GPU | 8          | -          | -                 |
+|                 |                   |                 |            |            |                   |
+| Qwen3Model 0.6B | Regular           | Nvidia A100 GPU | 8          | 184        | 2.19 GB           |
+| Qwen3Model 0.6B | Regular compiled  | Nvidia A100 GPU | 8          | 351        | 2.19 GB           |
+| Qwen3Model 0.6B | KV cache          | Nvidia A100 GPU | 8          | 140        | 3.13 GB           |
+| Qwen3Model 0.6B | KV cache compiled | Nvidia A100 GPU | 8          | 280        | 1.75 GB           |
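
Batched inference requires all prompts in a batch to be packed into one tensor of equal length. A minimal left-padding sketch in plain PyTorch; the helper name, the `pad_id` value, and the `tokenizer.encode` call are illustrative assumptions rather than part of the package:

```python
import torch

def make_batch(prompts, tokenizer, pad_id, device):
    # Tokenize each prompt, then left-pad to the longest prompt so that the
    # most recent tokens line up at the right edge of the batch tensor.
    token_lists = [tokenizer.encode(p) for p in prompts]
    max_len = max(len(t) for t in token_lists)
    padded = [[pad_id] * (max_len - len(t)) + t for t in token_lists]
    return torch.tensor(padded, device=device)   # shape: (batch_size, max_len)

# Example usage (batch size 2 for brevity; the table above uses 8):
# batch = make_batch(["What is 1+1?", "Name a color."], tokenizer, pad_id=0, device=device)
```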