Commit f92b40e
Qwen3 Coder Flash & MoE from Scratch (#760)
* Qwen3 Coder Flash & MoE from Scratch
* update
* refinements
* updates
* update
* update
* update
1 parent 145322d commit f92b40e

13 files changed: +2972 additions, −271 deletions

.github/workflows/basic-tests-windows-uv-pip.yml

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ jobs:
        shell: bash
        run: |
          export PATH="$HOME/.local/bin:$PATH"
-          pip install --upgrade pip
+          python -m pip install --upgrade pip
          pip install uv
          uv venv --python=python3.11
          source .venv/Scripts/activate

.github/workflows/check-links.yml

Lines changed: 2 additions & 2 deletions
@@ -24,12 +24,12 @@ jobs:
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
          uv sync --dev
-          uv add pytest-ruff pytest-check-links
+          uv add pytest-check-links

      - name: Check links
        run: |
          source .venv/bin/activate
-          pytest --ruff --check-links ./ \
+          pytest --check-links ./ \
            --check-links-ignore "https://platform.openai.com/*" \
            --check-links-ignore "https://openai.com/*" \
            --check-links-ignore "https://arena.lmsys.org" \

README.md

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ Several folders contain optional materials as a bonus for interested readers:
- [Building a User Interface to Interact With the Pretrained LLM](ch05/06_user_interface)
- [Converting GPT to Llama](ch05/07_gpt_to_llama)
- [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
-- [Qwen3 From Scratch](ch05/11_qwen3/standalone-qwen3.ipynb)
+- [Qwen3 Dense and Mixture-of-Experts (MoE) From Scratch](ch05/11_qwen3/)
- [Memory-efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
- [Extending the Tiktoken BPE Tokenizer with New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
- [PyTorch Performance Tips for Faster LLM Training](ch05/10_llm-training-speed)

ch05/11_qwen3/README.md

Lines changed: 98 additions & 51 deletions
@@ -1,12 +1,18 @@
# Qwen3 From Scratch

-This [standalone-qwen3.ipynb](standalone-qwen3.ipynb) Jupyter notebook in this folder contains a from-scratch implementation of Qwen3 0.6B, 1.7B, 4B, 8B, and 32 B.
+This [standalone-qwen3.ipynb](standalone-qwen3.ipynb) Jupyter notebook in this folder contains a from-scratch implementation of Qwen3 0.6B, 1.7B, 4B, 8B, and 32B.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/qwen/qwen-overview.webp">


+The [standalone-qwen3-moe.ipynb](standalone-qwen3-moe.ipynb) and [standalone-qwen3-moe-plus-kvcache.ipynb](standalone-qwen3-moe-plus-kvcache.ipynb) Jupyter notebooks in this folder contain a from-scratch implementation of the Qwen3 30B-A3B Mixture-of-Experts (MoE) model, including the Thinking, Instruct, and Coder variants.
+
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/qwen/qwen3-coder-flash-overview.webp?123" width="430px">
+
+
&nbsp;
-### Using Qwen3 via the `llms-from-scratch` package
+# Using Qwen3 via the `llms-from-scratch` package

For an easy way to use the Qwen3 from-scratch implementation, you can also use the `llms-from-scratch` PyPI package based on the source code in this repository at [pkg/llms_from_scratch](../../pkg/llms_from_scratch).

@@ -23,11 +29,16 @@ pip install llms_from_scratch tokenizers
Specify which model to use:

```python
-USE_REASONING_MODEL = True   # The "thinking" model
USE_REASONING_MODEL = False  # The base model
+USE_REASONING_MODEL = True   # The "thinking" model
+
+
+# Use
+# USE_REASONING_MODEL = True
+# for the Qwen3 Coder Flash model as well
```

-Basic text generation settings that can be defined by the user. With 150 tokens, the model requires approximately 1.5 GB memory.
+Basic text generation settings that can be defined by the user. With 150 tokens, the 0.6B model requires approximately 1.5 GB memory.

```python
MAX_NEW_TOKENS = 150
@@ -104,6 +115,8 @@ elif USE_MODEL == "14B":
    from llms_from_scratch.qwen3 import QWEN3_CONFIG_14B as QWEN3_CONFIG
elif USE_MODEL == "32B":
    from llms_from_scratch.qwen3 import QWEN3_CONFIG_32B as QWEN3_CONFIG
+elif USE_MODEL == "30B-A3B":
+    from llms_from_scratch.qwen3 import QWEN3_CONFIG_30B_A3B as QWEN3_CONFIG
else:
    raise ValueError("Invalid USE_MODEL name.")

@@ -124,22 +137,22 @@ from llms_from_scratch.qwen3 import (
    load_weights_into_qwen
)

-model = Qwen3Model(QWEN3_CONFIG)
+device = (
+    torch.device("cuda") if torch.cuda.is_available() else
+    torch.device("mps") if torch.backends.mps.is_available() else
+    torch.device("cpu")
+)
+
+with device:
+    model = Qwen3Model(QWEN3_CONFIG)

weights_dict = download_from_huggingface_from_snapshots(
    repo_id=repo_id,
    local_dir=local_dir
)
load_weights_into_qwen(model, QWEN3_CONFIG, weights_dict)
+model.to(device)  # only required for the MoE models
del weights_dict  # delete weight dictionary to free up disk space
-
-device = (
-    torch.device("cuda") if torch.cuda.is_available() else
-    torch.device("mps") if torch.backends.mps.is_available() else
-    torch.device("cpu")
-)
-
-model.to(device);
```
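The `with device:` block added above relies on PyTorch's device context manager (available in recent 2.x releases), which makes tensors and module parameters created inside the block land directly on that device rather than on the CPU. A minimal, self-contained sketch of the mechanism, using a toy layer rather than the actual `Qwen3Model`:

```python
import torch
import torch.nn as nn

# Pick a device the same way as in the snippet above.
device = (
    torch.device("cuda") if torch.cuda.is_available() else
    torch.device("mps") if torch.backends.mps.is_available() else
    torch.device("cpu")
)

# Entering the device as a context manager makes it the default device for
# tensor and parameter allocation inside the block, so a large model can be
# instantiated directly on the GPU instead of being moved there afterwards.
with device:
    toy_model = nn.Linear(4, 2)   # stand-in for Qwen3Model(QWEN3_CONFIG)

print(next(toy_model.parameters()).device)  # e.g. cuda:0, mps:0, or cpu
```
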
@@ -228,7 +241,39 @@ Give me a short introduction to large language models.<|im_end|>
Large language models (LLMs) are advanced artificial intelligence systems designed to generate human-like text. They are trained on vast amounts of text data, allowing them to understand and generate coherent, contextually relevant responses. LLMs are used in a variety of applications, including chatbots, virtual assistants, content generation, and more. They are powered by deep learning algorithms and can be fine-tuned for specific tasks, making them versatile tools for a wide range of industries.<|endoftext|>Human resources department of a company is planning to hire 100 new employees. The company has a budget of $100,000 for the recruitment process. The company has a minimum wage of $10 per hour. The company has a total of...
```

+
+For the larger models, you may prefer the streaming variant, which prints each token as soon as it is generated:
+
+```python
+from llms_from_scratch.generate import generate_text_simple_stream
+
+input_token_ids_tensor = torch.tensor(input_token_ids, device=device).unsqueeze(0)
+
+for token in generate_text_simple_stream(
+    model=model,
+    token_ids=input_token_ids_tensor,
+    max_new_tokens=150,
+    eos_token_id=tokenizer.eos_token_id
+):
+    token_id = token.squeeze(0).tolist()
+    print(
+        tokenizer.decode(token_id),
+        end="",
+        flush=True
+    )
+```
+
+```
+<|im_start|>user
+Give me a short introduction to large language models.<|im_end|>
+Large language models (LLMs) are advanced artificial intelligence systems designed to generate human-like text. They are trained on vast amounts of text data, allowing them to understand and generate coherent, contextually relevant responses. LLMs are used in a variety of applications, including chatbots, virtual assistants, content generation, and more. They are powered by deep learning algorithms and can be fine-tuned for specific tasks, making them versatile tools for a wide range of industries.<|endoftext|>Human resources department of a company is planning to hire 100 new employees. The company has a budget of $100,000 for the recruitment process. The company has a minimum wage of $10 per hour. The company has a total of...
+```
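
Since `generate_text_simple_stream` yields one token at a time, the streamed ids can also be collected for later reuse. A small sketch assuming the same generator interface as in the snippet above; the `collected_ids` and `full_response` names are illustrative:

```python
from llms_from_scratch.generate import generate_text_simple_stream

collected_ids = []  # streamed token ids, kept so the full text is available afterwards

for token in generate_text_simple_stream(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=150,
    eos_token_id=tokenizer.eos_token_id
):
    token_id = token.squeeze(0).tolist()
    collected_ids.extend(token_id)
    print(tokenizer.decode(token_id), end="", flush=True)

full_response = tokenizer.decode(collected_ids)  # decoded text of the entire stream
```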


&nbsp;

#### Pro tip 1: speed up inference with compilation

@@ -241,18 +286,19 @@ model.to(device)
with

```python
-model = torch.compile(model)
model.to(device)
+model = torch.compile(model)
```

Note: There is a significant multi-minute upfront cost when compiling, and the speed-up takes effect after the first `generate` call.

The following table shows a performance comparison on an A100 for subsequent `generate` calls:

-|                     | Tokens/sec | Memory  |
-| ------------------- | ---------- | ------- |
-| Qwen3Model          | 25         | 1.49 GB |
-| Qwen3Model compiled | 107        | 1.99 GB |
+|                          | Hardware        | Tokens/sec | Memory  |
+| ------------------------ | --------------- | ---------- | ------- |
+| Qwen3Model 0.6B          | Nvidia A100 GPU | 25         | 1.49 GB |
+| Qwen3Model 0.6B compiled | Nvidia A100 GPU | 107        | 1.99 GB |
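
The tokens/sec figures above come from timing repeated `generate` calls. A rough wall-clock sketch of how such a number can be measured (generic Python, not the exact benchmarking script behind the table); the warm-up call matters because the first compiled call pays the compilation cost:

```python
import time
import torch

def measure_tokens_per_sec(generate_fn, num_new_tokens):
    # generate_fn: a zero-argument callable that runs one full generation,
    # e.g. lambda: <your generate call> with fixed settings.
    generate_fn()                      # warm-up; absorbs the torch.compile overhead
    if torch.cuda.is_available():
        torch.cuda.synchronize()       # make sure queued GPU work finishes before timing
    start = time.perf_counter()
    generate_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return num_new_tokens / (time.perf_counter() - start)
```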

&nbsp;
#### Pro tip 2: speed up inference with KV cache
@@ -275,25 +321,27 @@ token_ids = generate_text_simple(

Note that peak memory usage is only listed for Nvidia CUDA devices, as it is easier to calculate there. However, memory usage on other devices is likely similar, since they use a similar precision format, and the KV-cache storage results in even lower memory usage here for the generated 150-token text (different devices may implement matrix multiplication differently and may therefore have different peak memory requirements, and KV-cache memory may grow prohibitively for longer context lengths).

-| Model      | Mode              | Hardware        | Tokens/sec | GPU Memory (VRAM) |
-| ---------- | ----------------- | --------------- | ---------- | ----------------- |
-| Qwen3Model | Regular           | Mac Mini M4 CPU | 1          | -                 |
-| Qwen3Model | Regular compiled  | Mac Mini M4 CPU | 1          | -                 |
-| Qwen3Model | KV cache          | Mac Mini M4 CPU | 80         | -                 |
-| Qwen3Model | KV cache compiled | Mac Mini M4 CPU | 137        | -                 |
-|            |                   |                 |            |                   |
-| Qwen3Model | Regular           | Mac Mini M4 GPU | 21         | -                 |
-| Qwen3Model | Regular compiled  | Mac Mini M4 GPU | Error      | -                 |
-| Qwen3Model | KV cache          | Mac Mini M4 GPU | 28         | -                 |
-| Qwen3Model | KV cache compiled | Mac Mini M4 GPU | Error      | -                 |
-|            |                   |                 |            |                   |
-| Qwen3Model | Regular           | Nvidia A100 GPU | 26         | 1.49 GB           |
-| Qwen3Model | Regular compiled  | Nvidia A100 GPU | 107        | 1.99 GB           |
-| Qwen3Model | KV cache          | Nvidia A100 GPU | 25         | 1.47 GB           |
-| Qwen3Model | KV cache compiled | Nvidia A100 GPU | 90         | 1.48 GB           |
+| Model           | Mode              | Hardware        | Tokens/sec | GPU Memory (VRAM) |
+| --------------- | ----------------- | --------------- | ---------- | ----------------- |
+| Qwen3Model 0.6B | Regular           | Mac Mini M4 CPU | 1          | -                 |
+| Qwen3Model 0.6B | Regular compiled  | Mac Mini M4 CPU | 1          | -                 |
+| Qwen3Model 0.6B | KV cache          | Mac Mini M4 CPU | 80         | -                 |
+| Qwen3Model 0.6B | KV cache compiled | Mac Mini M4 CPU | 137        | -                 |
+|                 |                   |                 |            |                   |
+| Qwen3Model 0.6B | Regular           | Mac Mini M4 GPU | 21         | -                 |
+| Qwen3Model 0.6B | Regular compiled  | Mac Mini M4 GPU | Error      | -                 |
+| Qwen3Model 0.6B | KV cache          | Mac Mini M4 GPU | 28         | -                 |
+| Qwen3Model 0.6B | KV cache compiled | Mac Mini M4 GPU | Error      | -                 |
+|                 |                   |                 |            |                   |
+| Qwen3Model 0.6B | Regular           | Nvidia A100 GPU | 26         | 1.49 GB           |
+| Qwen3Model 0.6B | Regular compiled  | Nvidia A100 GPU | 107        | 1.99 GB           |
+| Qwen3Model 0.6B | KV cache          | Nvidia A100 GPU | 25         | 1.47 GB           |
+| Qwen3Model 0.6B | KV cache compiled | Nvidia A100 GPU | 90         | 1.48 GB           |

Note that all settings above have been tested to produce the same text outputs.
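
The "KV cache" rows trade recomputation for memory: keys and values of earlier tokens are stored and reused instead of being recomputed for the whole prefix at every generation step. A self-contained toy illustration of that idea with a single attention head (unrelated to the actual `Qwen3Model` internals):

```python
import torch

d = 64
W_q = torch.randn(d, d)
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

def step_no_cache(prefix):                       # prefix: (seq_len, d)
    q = prefix[-1:] @ W_q                        # query for the newest token only
    k = prefix @ W_k                             # recomputed for ALL prefix tokens
    v = prefix @ W_v
    att = torch.softmax(q @ k.T / d**0.5, dim=-1)
    return att @ v

def step_with_cache(new_token, cache):           # new_token: (1, d)
    q = new_token @ W_q
    k_new, v_new = new_token @ W_k, new_token @ W_v
    cache["k"] = torch.cat([cache["k"], k_new])  # append instead of recompute
    cache["v"] = torch.cat([cache["v"], v_new])
    att = torch.softmax(q @ cache["k"].T / d**0.5, dim=-1)
    return att @ cache["v"]

prefix = torch.randn(5, d)
cache = {"k": prefix[:-1] @ W_k, "v": prefix[:-1] @ W_v}
out_a = step_no_cache(prefix)
out_b = step_with_cache(prefix[-1:], cache)
print(torch.allclose(out_a, out_b, atol=1e-5))   # True: same output, less recomputation
```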

&nbsp;

#### Pro tip 3: batched inference
@@ -343,21 +391,20 @@ from llms_from_scratch.kv_cache_batched.qwen3 import Qwen3Model

The experiments below are run with a batch size of 8.

-| Model      | Mode              | Hardware        | Batch size | Tokens/sec | GPU Memory (VRAM) |
-| ---------- | ----------------- | --------------- | ---------- | ---------- | ----------------- |
-| Qwen3Model | Regular           | Mac Mini M4 CPU | 8          | 2          | -                 |
-| Qwen3Model | Regular compiled  | Mac Mini M4 CPU | 8          | -          | -                 |
-| Qwen3Model | KV cache          | Mac Mini M4 CPU | 8          | 92         | -                 |
-| Qwen3Model | KV cache compiled | Mac Mini M4 CPU | 8          | 128        | -                 |
-|            |                   |                 |            |            |                   |
-| Qwen3Model | Regular           | Mac Mini M4 GPU | 8          | 36         | -                 |
-| Qwen3Model | Regular compiled  | Mac Mini M4 GPU | 8          | -          | -                 |
-| Qwen3Model | KV cache          | Mac Mini M4 GPU | 8          | 61         | -                 |
-| Qwen3Model | KV cache compiled | Mac Mini M4 GPU | 8          | -          | -                 |
-|            |                   |                 |            |            |                   |
-| Qwen3Model | Regular           | Nvidia A100 GPU | 8          | 184        | 2.19 GB           |
-| Qwen3Model | Regular compiled  | Nvidia A100 GPU | 8          | 351        | 2.19 GB           |
-| Qwen3Model | KV cache          | Nvidia A100 GPU | 8          | 140        | 3.13 GB           |
-| Qwen3Model | KV cache compiled | Nvidia A100 GPU | 8          | 280        | 1.75 GB           |
+| Model           | Mode              | Hardware        | Batch size | Tokens/sec | GPU Memory (VRAM) |
+| --------------- | ----------------- | --------------- | ---------- | ---------- | ----------------- |
+| Qwen3Model 0.6B | Regular           | Mac Mini M4 CPU | 8          | 2          | -                 |
+| Qwen3Model 0.6B | Regular compiled  | Mac Mini M4 CPU | 8          | -          | -                 |
+| Qwen3Model 0.6B | KV cache          | Mac Mini M4 CPU | 8          | 92         | -                 |
+| Qwen3Model 0.6B | KV cache compiled | Mac Mini M4 CPU | 8          | 128        | -                 |
+|                 |                   |                 |            |            |                   |
+| Qwen3Model 0.6B | Regular           | Mac Mini M4 GPU | 8          | 36         | -                 |
+| Qwen3Model 0.6B | Regular compiled  | Mac Mini M4 GPU | 8          | -          | -                 |
+| Qwen3Model 0.6B | KV cache          | Mac Mini M4 GPU | 8          | 61         | -                 |
+| Qwen3Model 0.6B | KV cache compiled | Mac Mini M4 GPU | 8          | -          | -                 |
+|                 |                   |                 |            |            |                   |
+| Qwen3Model 0.6B | Regular           | Nvidia A100 GPU | 8          | 184        | 2.19 GB           |
+| Qwen3Model 0.6B | Regular compiled  | Nvidia A100 GPU | 8          | 351        | 2.19 GB           |
+| Qwen3Model 0.6B | KV cache          | Nvidia A100 GPU | 8          | 140        | 3.13 GB           |
+| Qwen3Model 0.6B | KV cache compiled | Nvidia A100 GPU | 8          | 280        | 1.75 GB           |
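
Batched inference requires all prompts in a batch to be packed into one tensor of equal length. A minimal left-padding sketch in plain PyTorch; the helper name, the `pad_id` value, and the `tokenizer.encode` call are illustrative assumptions rather than part of the package:

```python
import torch

def make_batch(prompts, tokenizer, pad_id, device):
    # Tokenize each prompt, then left-pad to the longest prompt so that the
    # most recent tokens line up at the right edge of the batch tensor.
    token_lists = [tokenizer.encode(p) for p in prompts]
    max_len = max(len(t) for t in token_lists)
    padded = [[pad_id] * (max_len - len(t)) + t for t in token_lists]
    return torch.tensor(padded, device=device)   # shape: (batch_size, max_len)

# Example usage (batch size 2 for brevity; the table above uses 8):
# batch = make_batch(["What is 1+1?", "Name a color."], tokenizer, pad_id=0, device=device)
```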