ch05/11_qwen3/README.md (98 additions, 51 deletions)
# Qwen3 From Scratch

This [standalone-qwen3.ipynb](standalone-qwen3.ipynb) Jupyter notebook in this folder contains a from-scratch implementation of Qwen3 0.6B, 1.7B, 4B, 8B, and 32B.

The [standalone-qwen3-moe.ipynb](standalone-qwen3-moe.ipynb) and [standalone-qwen3-moe-plus-kvcache.ipynb](standalone-qwen3-moe-plus-kvcache.ipynb) Jupyter notebooks in this folder contain a from-scratch implementation of the 30B-A3B Mixture-of-Experts (MoE) model, including the Thinking, Instruct, and Coder model variants.

## Using Qwen3 via the `llms-from-scratch` package
For an easy way to use the Qwen3 from-scratch implementation, you can also use the `llms-from-scratch` PyPI package based on the source code in this repository at [pkg/llms_from_scratch](../../pkg/llms_from_scratch).
The tail end of the weight-loading snippet looks like this:

```python
# ...
model.to(device)   # only required for the MoE models
del weights_dict   # delete weight dictionary to free up disk space
```

Here, `device` is the usual CUDA/MPS/CPU fallback chain:

```python
device = (
    torch.device("cuda") if torch.cuda.is_available() else
    torch.device("mps") if torch.backends.mps.is_available() else
    torch.device("cpu")
)
```
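The generation step itself is not part of this excerpt. Purely to illustrate the overall tokenize → generate → decode flow, here is a minimal sketch in which `tokenizer` and `generate` are hypothetical placeholders for the tokenizer and generation helper that the full README sets up:

```python
import torch

prompt = "Give me a short introduction to large language models."

# 'tokenizer' and 'generate' are hypothetical placeholders standing in for the
# objects defined earlier in the full README; only the overall flow is shown.
input_token_ids = tokenizer.encode(prompt)
input_ids = torch.tensor(input_token_ids, device=device).unsqueeze(0)

output_ids = generate(model=model, token_ids=input_ids, max_new_tokens=150)
print(tokenizer.decode(output_ids.squeeze(0).tolist()))
```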
Running generation with the loaded model produces output like the following:

```
Give me a short introduction to large language models.<|im_end|>
Large language models (LLMs) are advanced artificial intelligence systems designed to generate human-like text. They are trained on vast amounts of text data, allowing them to understand and generate coherent, contextually relevant responses. LLMs are used in a variety of applications, including chatbots, virtual assistants, content generation, and more. They are powered by deep learning algorithms and can be fine-tuned for specific tasks, making them versatile tools for a wide range of industries.<|endoftext|>Human resources department of a company is planning to hire 100 new employees. The company has a budget of $100,000 for the recruitment process. The company has a minimum wage of $10 per hour. The company has a total of...
```
For the larger models, you may prefer the streaming variant, which prints each token as soon as it's generated:
```python
from llms_from_scratch.generate import generate_text_simple_stream
# ...
```

This prints the same response, but token by token as it is generated:

```
Give me a short introduction to large language models.<|im_end|>
Large language models (LLMs) are advanced artificial intelligence systems designed to generate human-like text. They are trained on vast amounts of text data, allowing them to understand and generate coherent, contextually relevant responses. LLMs are used in a variety of applications, including chatbots, virtual assistants, content generation, and more. They are powered by deep learning algorithms and can be fine-tuned for specific tasks, making them versatile tools for a wide range of industries.<|endoftext|>Human resources department of a company is planning to hire 100 new employees. The company has a budget of $100,000 for the recruitment process. The company has a minimum wage of $10 per hour. The company has a total of...
```
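The streaming call itself is not shown in this excerpt. As a rough illustration of how such a streaming generator is typically consumed, the sketch below assumes `generate_text_simple_stream` yields one token-ID tensor at a time; the argument names, and the `input_token_ids`, `tokenizer`, `model`, and `device` objects, are hypothetical stand-ins for what the full README defines, not the package's documented interface:

```python
import torch

from llms_from_scratch.generate import generate_text_simple_stream

# Hypothetical usage sketch: argument names and the per-token yield behavior
# are assumptions, not taken from the package documentation.
input_ids = torch.tensor(input_token_ids, device=device).unsqueeze(0)

for token in generate_text_simple_stream(
    model=model,
    token_ids=input_ids,
    max_new_tokens=150,
):
    # Decode and print each token as soon as it is produced
    print(tokenizer.decode(token.squeeze(0).tolist()), end="", flush=True)
```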
#### Pro tip 1: speed up inference with compilation

Replace

```python
model.to(device)
```

with

```python
model.to(device)
model = torch.compile(model)
```
Note: There is a significant multi-minute upfront cost when compiling, and the speed-up takes effect after the first `generate` call.
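In practice, this means you typically want a throwaway warm-up call before timing anything. A minimal sketch, where `generate`, `model`, and `input_ids` are hypothetical placeholders for the generation helper and inputs set up earlier:

```python
import time

# First call triggers compilation and is slow (hypothetical 'generate' helper).
_ = generate(model=model, token_ids=input_ids, max_new_tokens=1)

# Subsequent calls benefit from the compiled kernels.
start = time.time()
output_ids = generate(model=model, token_ids=input_ids, max_new_tokens=150)
print(f"Generation took {time.time() - start:.2f} s")
```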
The following table shows a performance comparison on an A100 for subsequent `generate` calls:
Note that peak memory usage is listed only for Nvidia CUDA devices, as it is easier to measure there. Memory usage on other devices is likely similar, since they use a comparable precision format, and the KV-cache storage results in even lower memory usage here for the generated 150-token text. (However, different devices may implement matrix multiplication differently, which can lead to different peak memory requirements, and KV-cache memory may grow prohibitively for longer context lengths.)
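On CUDA devices, peak memory numbers like these can be read directly from PyTorch. A minimal sketch, where the generation call is again a placeholder for whichever generation function you use:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the generate call you want to measure here ...

peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
print(f"Peak GPU memory: {peak_gb:.2f} GB")
```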