
Optimize tensor.slice() #1381


Merged

merged 3 commits into huggingface:main on Jul 30, 2025

Conversation

@Honry (Contributor) commented Jul 30, 2025

The performance of executing `tensor.slice()` is super poor, especially for the 'logits' tensor with large dimensions.

```
const logits = outputs.logits.slice(null, -1, null);
```

This is because the current implementation of the `slice` method manually iterates through each element and computes its index, which is very time consuming when the tensor shape is large.

For cases like `slice(null, -1, null)`, where the slicing operation is contiguous along certain dimensions, this can be optimized with a bulk copy using `TypedArray.subarray()` and `TypedArray.set()`.
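To illustrate the idea, here is a minimal sketch (a hypothetical helper, not the PR's actual implementation). It assumes a row-major flat `TypedArray` holding a `[batch, seq, vocab]` tensor, in which case `slice(null, -1, null)` keeps exactly one contiguous run of `vocab` elements per batch item:

```
// Minimal sketch of the bulk-copy idea (hypothetical helper, not the PR's code).
// Assumes a row-major flat TypedArray holding a [batch, seq, vocab] tensor.
function sliceLastToken(data, [batch, seq, vocab]) {
  const out = new data.constructor(batch * vocab);
  for (let b = 0; b < batch; ++b) {
    // Offset of the last sequence position for this batch item.
    const start = (b * seq + (seq - 1)) * vocab;
    // One contiguous run of `vocab` elements: copy it in bulk with
    // subarray()/set() instead of iterating element by element.
    out.set(data.subarray(start, start + vocab), b * vocab);
  }
  return out; // flat data for shape [batch, 1, vocab]
}

// Tiny example: shape [2, 2, 3] keeps [4, 5, 6] and [10, 11, 12].
const data = new Float32Array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]);
console.log(sliceLastToken(data, [2, 2, 3]));
```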
@Honry (Contributor, Author) commented Jul 30, 2025

@xenova, PTAL, thanks!

On my test device, this improves the execution time of `const logits = outputs.logits.slice(null, -1, null);` from 11 ms to 0.3 ms for the Qwen model (logits shape: `[batch_size, sequence_length, 151936]`).
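A wall-clock measurement like the one quoted can be taken with something along these lines (a hypothetical sketch; the actual harness, device, and model setup are not part of this PR, and `outputs` is assumed to hold the model's forward-pass result as in the snippet above):

```
// Hypothetical timing sketch around the call being optimized.
const t0 = performance.now();
const logits = outputs.logits.slice(null, -1, null);
const t1 = performance.now();
console.log(`tensor.slice() took ${(t1 - t0).toFixed(2)} ms`);
```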

cc/ @ibelem

@xenova (Collaborator) commented Jul 30, 2025

Oh wow, this looks like a great PR! Running tests and reviewing now!

Thanks @Honry


@xenova (Collaborator) commented Jul 30, 2025

Very impressive: ran this with qwen3 and got a pretty substantial TPS increase 🙌

Before: 45.087279331416916 tokens per second
After: 50.945617704855884 tokens per second

+13% speed boost to generate 512 tokens 🔥

Merging now!

@xenova xenova merged commit bd2449e into huggingface:main Jul 30, 2025
4 checks passed
xenova added a commit that referenced this pull request Jul 30, 2025
* ONNX Runtime improvements (experimental native webgpu; fix iOS) (#1231)

* customize the wasm paths

* update implementation

* allow using 'webgpu' in nodejs binding

* update version of onnxruntime-node

* Upgrade onnxruntime-web to same version as onnxruntime-node

* Update list of supported devices

---------

Co-authored-by: Joshua Lochner <[email protected]>

* customize the wasm paths (#1250)

* customize the wasm paths

* update implementation

* [internal] Add is_decoder option to session retrieval for preferred output location

* Update tests

* Formatting

* Bump ort versions

* Bump onnxruntime-node version

* Bump versions

* Bump ORT versions

* Bump versions

* Only check webgpu fp16 for non-node environments

* Fix

* Assume node supports webgpu

* Update ORT node support comment

* Relax test strictness

* Update conversion script versions

* Downgrade onnxslim

* cleanup

* Update package-lock.json

* Update onnxruntime versions

* Update post-build script

* Use built-in session release function

* Call garbage collection after each tokenizer test

* Do not double-throw error

* Fix race-condition in build process with file removal

* Update versions

* Bump jinja version

* [version] Update to 3.6.3

* Bump jinja version to support new features

* [version] Update to 3.6.3

* Add support for LFM2 models (#1367)

* Use prefix in lfm2 output location (#1369)

* Update package-lock.json

* Run `npm audit fix`

* Add special tokens in text-generation pipeline if tokenizer requires (#1370)

* Add special tokens in text-generation pipeline if tokenizer requires

* Fix logits processors tests

* Update bundles.test.js

* Update comment

* Formatting

* Add support for ModernBERT Decoder (#1371)

* Use from/to buffer instead of string

Actually fixes #1343

* Add support for Voxtral (#1373)

* Support longform voxtral processing (#1375)

* [version] Update to 3.7.0

* Add support for Arcee (#1377)

* Optimize tensor.slice() (#1381)

* Optimize tensor.slice()

The performance of executing `tensor.slice()` is super poor, especially for
the 'logits' tensor with large dimensions.

```
const logits = outputs.logits.slice(null, -1, null);
```

This is because the current implementation of the `slice` method manually iterates
through each element and computes its index, which is very time consuming when
the tensor shape is large.

For cases like `slice(null, -1, null)`, where the slicing operation is
contiguous along certain dimensions, this can be optimized with a bulk copy
using `TypedArray.subarray()` and `TypedArray.set()`.

* nit

* Add a few more tensor slice unit tests

---------

Co-authored-by: Joshua Lochner <[email protected]>

---------

Co-authored-by: Yulong Wang <[email protected]>
Co-authored-by: Wanming Lin <[email protected]>