[AINode] Preliminary version of concurrent inference #15884
Description
This PR adds a request-pooling engine for multi-request inference on time-series models such as TimerXL. The change set introduces three core Python modules (requestpool.py, request.py, and utils.py) plus a self-contained benchmark harness (guarded by if __name__ == "__main__":) that compares pooled vs. baseline generation speed and numerical fidelity.
Design
Overlap multiple user requests on one device: every 15 ms, when no batch is currently in flight, RequestPool.step() gathers all ready requests and feeds them to the model in a single forward pass.
Handle variable sequence lengths: each tensor type is left-padded to the batch max_len, which preserves causal semantics while enabling torch.cat (a minimal sketch follows this list).
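
A minimal sketch of the left-padding step, assuming 1-D series tensors; the helper name left_pad_and_batch is illustrative and not part of the PR, but the real RequestPool applies the same idea per tensor type before the batched forward pass.

```python
import torch

def left_pad_and_batch(tensors, pad_value=0.0):
    """Left-pad each sequence to the batch max_len and stack into one input."""
    max_len = max(t.shape[0] for t in tensors)
    padded = []
    for t in tensors:
        pad_len = max_len - t.shape[0]
        if pad_len > 0:
            pad = torch.full((pad_len,) + tuple(t.shape[1:]), pad_value,
                             dtype=t.dtype, device=t.device)
            t = torch.cat([pad, t], dim=0)  # padding on the left keeps causality intact
        padded.append(t)
    return torch.stack(padded, dim=0)  # [batch, max_len, ...]

# Three requests with different history lengths -> one [3, 192] batch
batch = left_pad_and_batch([torch.randn(96), torch.randn(192), torch.randn(64)])
print(batch.shape)  # torch.Size([3, 192])
```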
Behavior & configuration
RequestPool.add_request truncates inputs whose length is not an exact multiple of config.input_token_len, keeping model state aligned;
oversized write attempts are silently clipped to max_new_steps (both rules are sketched below).
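
A hedged sketch of the two clamping rules above, assuming a trailing time axis; the helper names and exact clipping semantics are illustrative, not the actual add_request implementation.

```python
import torch

def truncate_to_token_multiple(inputs: torch.Tensor, input_token_len: int) -> torch.Tensor:
    """Drop the oldest points so the length is an exact multiple of input_token_len."""
    usable = (inputs.shape[-1] // input_token_len) * input_token_len
    return inputs[..., inputs.shape[-1] - usable:]

def clip_to_max_new_steps(requested_steps: int, max_new_steps: int) -> int:
    """Silently clamp an oversized request to the configured ceiling."""
    return min(requested_steps, max_new_steps)

series = torch.randn(1, 250)                          # 250 is not a multiple of 96
print(truncate_to_token_multiple(series, 96).shape)   # torch.Size([1, 192])
print(clip_to_max_new_steps(10_000, 720))             # 720
```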
Class & method organization
RequestPool
Public API: add_request, run_inference (starts the inference loop), step (one scheduling pass plus a single batched forward pass).
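
A hypothetical driver for this API; the import path follows the module listing at the end of this description, while the constructor and argument names are assumptions for illustration, not the exact signatures.

```python
import torch
from ainode.core.inference.requestpool import RequestPool

pool = RequestPool()                 # assumed: model/config wiring happens in the constructor

series = torch.randn(1, 192)         # one incoming time series
req_id = pool.add_request(series)    # enqueue a request (return value assumed)

# Either let the pool drive itself ...
# pool.run_inference()               # starts the scheduling/inference loop
# ... or advance it manually, one batched forward pass at a time:
pool.step()                          # schedule ready requests + one forward pass
```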
Request
id, chunk_size, …, state, cur_step_idx, output_tensor
write_step_output writes into a pre-allocated, fixed-size buffer in place; there is no Python-side reallocation after generation starts.
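
A simplified sketch of that pre-allocated buffer pattern; field and method names mirror the listing above, but the shapes, dtype, and class body are assumptions rather than the actual Request code.

```python
import torch

class RequestSketch:
    """Illustrative stand-in for Request; the real class carries more fields and state."""
    def __init__(self, req_id: int, chunk_size: int, max_new_steps: int):
        self.id = req_id
        self.chunk_size = chunk_size
        self.cur_step_idx = 0
        # One fixed buffer allocated up front; never resized during generation.
        self.output_tensor = torch.zeros(max_new_steps * chunk_size)

    def write_step_output(self, step_output: torch.Tensor) -> None:
        """Copy one decoding step into its slot in place; no reallocation."""
        start = self.cur_step_idx * self.chunk_size
        self.output_tensor[start:start + self.chunk_size] = step_output
        self.cur_step_idx += 1

req = RequestSketch(req_id=0, chunk_size=96, max_new_steps=4)
req.write_step_output(torch.randn(96))
print(req.cur_step_idx, req.output_tensor.shape)  # 1 torch.Size([384])
```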
utils
split_moe_output slices a Moe[Causal]LMOutputWithPast into per-request output objects.
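
Conceptually, the split looks like the sketch below, which only handles the logits field and uses an assumed helper name; the real split_moe_output also distributes the other fields of the MoE output object.

```python
import torch

def split_batched_logits(logits: torch.Tensor, request_ids):
    """logits: [batch, seq_len, dim] -> {request_id: [seq_len, dim]} slices."""
    return {rid: logits[i] for i, rid in enumerate(request_ids)}

per_request = split_batched_logits(torch.randn(3, 96, 16), ["r0", "r1", "r2"])
print(per_request["r1"].shape)  # torch.Size([96, 16])
```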
Key changed/added classes (or packages if there are too many classes) in this PR
ainode.core.inference.requestpool.RequestPool
ainode.core.inference.request.Request
ainode.core.inference.utils.split_moe_output