[Feature Request] D3D12 External Resource Interop API for Plugin EPs #26821

@nieubank

Describe the feature request

ONNX Runtime: D3D12 External Resource Import

1. Problem statement

Windows graphics/video apps increasingly hold data in GPU-resident D3D12 resources (captured frames, render targets, compute buffers). Today, using those resources as ONNX Runtime inputs/outputs typically requires extra copies (GPU→CPU→GPU) or EP-specific private hooks.

This proposal defines a generic, EP-agnostic ORT public API that lets an application:

  • provide D3D12-shared resources/heaps as zero-copy tensor storage, and
  • perform explicit GPU↔GPU synchronization between the app’s D3D12 work and the EP’s compute stream,
    without sharing command queues and without exposing GPU virtual addresses.

2. Requirements (MVP)

  • Zero-copy tensor views over imported external GPU allocations.
  • D3D12 sharing via Win32 shared handles:
    • committed ID3D12Resource (buffer) shared handles
    • ID3D12Heap shared handles
  • Explicit synchronization using imported external semaphores:
    • D3D12 timeline fence shared handle + 64-bit value wait/signal
  • EP capability discovery (CanImport*) with clear failure: ORT_NOT_IMPLEMENTED.
  • Device identity matching so apps can select an EP device compatible with a given D3D12 adapter.
  • Fits existing ORT primitives: OrtEpDevice, OrtEpFactory, OrtSyncStream, opaque handles, explicit Release*.
  • Output tensors supported (write access) when EP supports it.

3. Non-goals (MVP)

  • No queue sharing requirement — app and EP keep independent queues/streams.
  • No GPU virtual addresses in the public contract.
  • No implicit fallback copies — apps choose behavior; EPs return NOT_IMPLEMENTED when unsupported.
  • Texture/surface-native tensors are vNext — MVP targets resources that can be treated as dense, linearly-addressable tensor storage (D3D12 buffers / heaps).
  • Alternatives to Windows D3D12 — This proposal targets Windows D3D12 shared handles exclusively. The extensible OrtExternalMemoryHandleType and OrtExternalSemaphoreType enums accommodate additional handle types in future proposals without breaking API compatibility.

4. Design overview

The public API introduces:

  • OrtExternalResourceImporter: A capability object for external resource interop (memory + semaphore operations). Named following the agent-noun pattern (Allocator, Importer).
  • OrtExternalMemoryHandle: EP-imported view of a shared external allocation.
  • OrtExternalSemaphoreHandle: EP-imported view of a shared external semaphore.

ORT routes calls to the selected plugin EP via a single extension to OrtEpFactory:

  • CreateExternalResourceImporterForDevice() — returns OrtExternalResourceImporterImpl* (same pattern as CreateSyncStreamForDevice).

The capability object bundles all memory and semaphore operations. EPs that don't support external resources return null or ORT_NOT_IMPLEMENTED from the factory method.

Key point: the application only passes opaque OS handles (HANDLE) plus sizes/offsets and synchronization values. The EP may internally map these to CUDA/HIP/etc, but ORT does not expose those internals.

5. Public API (onnxruntime_c_api.h)

5.1 Types

typedef enum OrtExternalMemoryHandleType {
  ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE = 0,  /* shared HANDLE from CreateSharedHandle(resource) */
  ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP     = 1,  /* shared HANDLE from CreateSharedHandle(heap) */
} OrtExternalMemoryHandleType;

typedef enum OrtExternalMemoryAccessMode {
  ORT_EXTERNAL_MEMORY_ACCESS_READ_WRITE = 0,
  ORT_EXTERNAL_MEMORY_ACCESS_READ_ONLY  = 1,
  ORT_EXTERNAL_MEMORY_ACCESS_WRITE_ONLY = 2,
} OrtExternalMemoryAccessMode;

#define ORT_EXTERNAL_MEMORY_DESCRIPTOR_VERSION 1
typedef struct OrtExternalMemoryDescriptor {
  uint32_t version;                        /* Must be ORT_EXTERNAL_MEMORY_DESCRIPTOR_VERSION */
  OrtExternalMemoryHandleType handle_type;
  void* native_handle;      /* Windows HANDLE */
  size_t size_bytes;        /* total bytes in allocation */
  size_t offset_bytes;      /* base offset into allocation */
  OrtExternalMemoryAccessMode access_mode;
} OrtExternalMemoryDescriptor;

ORT_RUNTIME_CLASS(ExternalMemoryHandle);
/* EP-owned implementation type; lives in the plugin EP API header */
ORT_RUNTIME_CLASS(ExternalMemoryHandleImpl);

typedef enum OrtExternalSemaphoreType {
  ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE = 0, /* shared HANDLE from CreateSharedHandle(fence) */
} OrtExternalSemaphoreType;

#define ORT_EXTERNAL_SEMAPHORE_DESCRIPTOR_VERSION 1
typedef struct OrtExternalSemaphoreDescriptor {
  uint32_t version;        /* Must be ORT_EXTERNAL_SEMAPHORE_DESCRIPTOR_VERSION */
  OrtExternalSemaphoreType type;
  void* native_handle;  /* Windows HANDLE */
} OrtExternalSemaphoreDescriptor;

ORT_RUNTIME_CLASS(ExternalSemaphoreHandle);
/* EP-owned implementation type; lives in the plugin EP API header */
ORT_RUNTIME_CLASS(ExternalSemaphoreHandleImpl);

/* Capability object for external resource interop (agent-noun pattern like Allocator) */
ORT_RUNTIME_CLASS(ExternalResourceImporter);
ORT_RUNTIME_CLASS(ExternalResourceImporterImpl);

#define ORT_EXTERNAL_TENSOR_DESCRIPTOR_VERSION 1
typedef struct OrtExternalTensorDescriptor {
  uint32_t version;        /* Must be ORT_EXTERNAL_TENSOR_DESCRIPTOR_VERSION */
  ONNXTensorElementDataType element_type;
  const int64_t* shape;
  size_t rank;
  size_t offset_bytes; /* optional: view offset within imported memory (default 0) */
} OrtExternalTensorDescriptor;
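
Because a tensor created from OrtExternalTensorDescriptor is a view over imported memory, an application may want to confirm that the view fits before calling into ORT. The following is a minimal sketch of such a bounds check, not part of the proposed API. The function name TensorViewFits and its parameters are hypothetical, and it assumes the usable range is size_bytes minus offset_bytes of the memory descriptor, which the proposal does not state explicitly.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Returns true when a dense tensor with the given shape fits inside the
// imported range, starting at view_offset bytes into it.
// Assumption: usable_bytes == size_bytes - offset_bytes of the memory descriptor.
bool TensorViewFits(const int64_t* shape, size_t rank, size_t element_size,
                    size_t view_offset, size_t usable_bytes) {
  if (element_size == 0) return false;  // unknown element size: treat as unsupported
  size_t total = element_size;
  for (size_t i = 0; i < rank; ++i) {
    if (shape[i] < 0) return false;  // symbolic dims cannot alias fixed storage
    const uint64_t dim = static_cast<uint64_t>(shape[i]);
    // Reject shapes whose byte size would overflow size_t.
    if (dim != 0 && total > std::numeric_limits<size_t>::max() / dim) return false;
    total *= static_cast<size_t>(dim);
  }
  if (view_offset > usable_bytes) return false;
  return total <= usable_bytes - view_offset;
}
```

For the {1, 3, 224, 224} float input used later in this proposal, the view needs 602112 bytes, so any non-zero view offset requires a correspondingly larger import.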

5.2 Functions

The public API mirrors the consolidated EP plugin pattern. ORT wraps OrtExternalResourceImporterImpl in an opaque OrtExternalResourceImporter handle.

/* Create the external resource importer for a specific EP device */
ORT_API2_STATUS(CreateExternalResourceImporterForDevice,
  _In_ const OrtEpDevice* ep_device,
  _Outptr_ OrtExternalResourceImporter** out_importer);

ORT_API(void, ReleaseExternalResourceImporter,
  _Frees_ptr_opt_ OrtExternalResourceImporter* importer);

/* Memory operations */
ORT_API2_STATUS(ExternalResourceImporter_CanImportMemory,
  _In_ const OrtExternalResourceImporter* importer,
  _In_ OrtExternalMemoryHandleType handle_type,
  _Out_ bool* out_supported);

ORT_API2_STATUS(ExternalResourceImporter_ImportMemory,
  _In_ OrtExternalResourceImporter* importer,
  _In_ const OrtExternalMemoryDescriptor* desc,
  _Outptr_ OrtExternalMemoryHandle** out_handle);

ORT_API(void, ReleaseExternalMemoryHandle,
  _Frees_ptr_opt_ OrtExternalMemoryHandle* handle);

ORT_API2_STATUS(ExternalResourceImporter_CreateTensorFromMemory,
  _In_ OrtExternalResourceImporter* importer,
  _In_ const OrtExternalMemoryHandle* mem_handle,
  _In_ const OrtExternalTensorDescriptor* tensor_desc,
  _In_opt_ const OrtMemoryInfo* tensor_location,
  _Outptr_ OrtValue** out_tensor);

/* Semaphore operations */
ORT_API2_STATUS(ExternalResourceImporter_CanImportSemaphore,
  _In_ const OrtExternalResourceImporter* importer,
  _In_ OrtExternalSemaphoreType type,
  _Out_ bool* out_supported);

ORT_API2_STATUS(ExternalResourceImporter_ImportSemaphore,
  _In_ OrtExternalResourceImporter* importer,
  _In_ const OrtExternalSemaphoreDescriptor* desc,
  _Outptr_ OrtExternalSemaphoreHandle** out_handle);

ORT_API(void, ReleaseExternalSemaphoreHandle,
  _Frees_ptr_opt_ OrtExternalSemaphoreHandle* handle);

ORT_API2_STATUS(ExternalResourceImporter_WaitSemaphore,
  _In_ OrtExternalResourceImporter* importer,
  _In_ OrtExternalSemaphoreHandle* semaphore_handle,
  _In_ OrtSyncStream* stream,
  _In_ uint64_t value);

ORT_API2_STATUS(ExternalResourceImporter_SignalSemaphore,
  _In_ OrtExternalResourceImporter* importer,
  _In_ OrtExternalSemaphoreHandle* semaphore_handle,
  _In_ OrtSyncStream* stream,
  _In_ uint64_t value);

/* Session device query for outputs (mirrors SessionGetEpDeviceForInputs) */
ORT_API2_STATUS(SessionGetEpDeviceForOutputs, _In_ const OrtSession* session,
                _Out_writes_(num_outputs) const OrtEpDevice** outputs_ep_devices,
                _In_ size_t num_outputs);

/* Associate an OrtSyncStream with RunOptions for async Run */
ORT_API2_STATUS(RunOptions_SetSyncStream,
  _Inout_ OrtRunOptions* run_options,
  _In_ OrtSyncStream* stream);

5.3 Session and RunOptions extensions

SessionGetEpDeviceForOutputs: Mirrors the existing SessionGetEpDeviceForInputs (ORT 1.23). Returns the EP device assigned to each output, enabling applications to validate that outputs will be placed on the expected device for external resource sharing.

RunOptions_SetSyncStream: Associates an OrtSyncStream with RunOptions. When Run() or RunWithBinding() is called, the EP uses this stream for execution, enabling proper synchronization with imported external semaphores. This approach:

  • Works with both Run() and RunWithBinding() — no IOBinding requirement
  • Follows the existing RunOptions pattern (RunOptions_SetRunTag, etc.)
  • Allows different Run calls to use different streams for concurrent inference
  • Integrates cleanly with the external semaphore wait/signal pattern

5.4 Device identity (adapter matching)

ORT already discovers Windows GPU devices and populates OrtHardwareDevice metadata during device discovery.

In particular, ORT’s Windows device discovery currently emits a "LUID" metadata entry for GPU/NPU devices (a 64-bit value serialized as a decimal string). Applications can read it via HardwareDevice_Metadata() and compare it to their D3D12 adapter LUID.

To avoid string parsing in client code, a small convenience API (e.g., HardwareDevice_GetAdapterLuidLowHigh) can be added later without changing the interop design.
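
Until such a convenience API exists, client code has to parse the decimal string itself. The sketch below shows one way to do it. AdapterLuid is a portable stand-in for the Win32 LUID struct, ParseLuidDecimal and SameAdapter are hypothetical helper names, and the split into low and high 32-bit halves is an assumption about how the 64-bit value is packed.

```cpp
#include <charconv>
#include <cstdint>
#include <cstring>
#include <optional>
#include <system_error>

// Portable stand-in for the Win32 LUID struct (DWORD LowPart; LONG HighPart).
struct AdapterLuid {
  uint32_t low_part;
  int32_t high_part;
};

// Parses the decimal "LUID" metadata string.
// Assumption: the low 32 bits of the serialized value correspond to LowPart.
std::optional<AdapterLuid> ParseLuidDecimal(const char* text) {
  if (text == nullptr) return std::nullopt;
  const char* end = text + std::strlen(text);
  uint64_t value = 0;
  const auto result = std::from_chars(text, end, value, 10);
  // Reject empty input, signs, trailing characters, and overflow.
  if (result.ec != std::errc{} || result.ptr != end) return std::nullopt;
  return AdapterLuid{static_cast<uint32_t>(value & 0xFFFFFFFFu),
                     static_cast<int32_t>(static_cast<uint32_t>(value >> 32))};
}

// LUIDs are compared field by field.
bool SameAdapter(const AdapterLuid& a, const AdapterLuid& b) {
  return a.low_part == b.low_part && a.high_part == b.high_part;
}
```

On Windows the same comparison would be made against the LowPart and HighPart fields of the value returned by ID3D12Device::GetAdapterLuid().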

6. EP plugin API (onnxruntime_ep_c_api.h)

6.1 Capability object pattern

The existing OrtEpFactory uses a "create capability object" pattern for optional features:

Factory Method               Returns               Operations bundled in object
───────────────────────────────────────────────────────────────────────────────
CreateAllocator()            OrtAllocator*         Alloc, Free
CreateDataTransfer()         OrtDataTransferImpl*  Copy, CanCopy
CreateSyncStreamForDevice()  OrtSyncStreamImpl*    Flush, GetHandle

Following this pattern, the API adds one factory method that returns a capability object:

/* Add to OrtEpFactory (version-gated) */
OrtStatus*(ORT_API_CALL* CreateExternalResourceImporterForDevice)(
  _In_ OrtEpFactory* this_ptr,
  _In_ const OrtMemoryDevice* memory_device,
  _Outptr_ OrtExternalResourceImporterImpl** out_importer);

6.2 OrtExternalResourceImporterImpl interface

The returned OrtExternalResourceImporterImpl owns operations for memory and semaphore interop:

struct OrtExternalResourceImporterImpl {
  /* ──────────────── Memory operations (stream-independent) ──────────────── */
  
  bool(ORT_API_CALL* CanImportMemory)(
    _In_ const OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalMemoryHandleType handle_type);

  OrtStatus*(ORT_API_CALL* ImportMemory)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ const OrtExternalMemoryDescriptor* desc,
    _Outptr_ OrtExternalMemoryHandleImpl** out_handle);

  void(ORT_API_CALL* ReleaseMemory)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalMemoryHandleImpl* handle);

  OrtStatus*(ORT_API_CALL* CreateTensorFromMemory)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ const OrtExternalMemoryHandleImpl* mem_handle,
    _In_ const OrtExternalTensorDescriptor* tensor_desc,
    _Outptr_ OrtValue** out_tensor);

  /* ──────────────── Semaphore operations (require stream) ──────────────── */
  
  bool(ORT_API_CALL* CanImportSemaphore)(
    _In_ const OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalSemaphoreType type);

  OrtStatus*(ORT_API_CALL* ImportSemaphore)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ const OrtExternalSemaphoreDescriptor* desc,
    _Outptr_ OrtExternalSemaphoreHandleImpl** out_handle);

  void(ORT_API_CALL* ReleaseSemaphore)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalSemaphoreHandleImpl* handle);

  OrtStatus*(ORT_API_CALL* WaitSemaphore)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalSemaphoreHandleImpl* handle,
    _In_ OrtSyncStream* stream,
    _In_ uint64_t value);

  OrtStatus*(ORT_API_CALL* SignalSemaphore)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalSemaphoreHandleImpl* handle,
    _In_ OrtSyncStream* stream,
    _In_ uint64_t value);

  /* ──────────────── Release the capability object itself ──────────────── */
  
  void(ORT_API_CALL* Release)(
    _In_ OrtExternalResourceImporterImpl* this_ptr);
};

6.3 Dependency handling

The consolidated design makes dependencies explicit:

Dependency                             How it's handled
──────────────────────────────────────────────────────────────────────────────
Semaphore wait/signal requires stream  WaitSemaphore/SignalSemaphore take OrtSyncStream*; EP can return ORT_NOT_IMPLEMENTED if !IsStreamAware()
Memory import is stream-independent    ImportMemory / CreateTensorFromMemory don't take a stream; usable with sync Run()
EP doesn't support external resources  CreateExternalResourceImporterForDevice returns null or ORT_NOT_IMPLEMENTED

Capability discovery then reduces to three checks, in order:

  1. Check CreateExternalResourceImporterForDevice != null — EP has the feature
  2. Call CanImportMemory(D3D12_RESOURCE) — EP supports this memory type
  3. Call CanImportSemaphore(D3D12_FENCE) — EP supports fence sync (implies stream-aware)
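
The three checks above can be sketched as a single probe that tells the application which path to take. This is an illustration only: ImporterCaps, MemoryType, InteropSupport, and ProbeInterop are hypothetical stand-ins, not types from the proposal, and the function pointers merely mimic the CanImport* queries.

```cpp
// Hypothetical stand-ins for the proposal's enums and importer query surface.
enum class MemoryType { kD3D12Resource, kD3D12Heap };
enum class InteropSupport { kNone, kMemoryOnly, kMemoryAndFence };

struct ImporterCaps {
  bool (*can_import_memory)(MemoryType);
  bool (*can_import_fence)();
};

// Walks the checks in order and reports the strongest supported level.
InteropSupport ProbeInterop(const ImporterCaps* importer, MemoryType type) {
  // Step 1: no importer means the EP lacks the feature entirely.
  if (importer == nullptr) return InteropSupport::kNone;
  // Step 2: the requested memory handle type must be importable.
  if (importer->can_import_memory == nullptr || !importer->can_import_memory(type)) {
    return InteropSupport::kNone;
  }
  // Step 3: without fence import, only the stream-independent path remains.
  if (importer->can_import_fence == nullptr || !importer->can_import_fence()) {
    return InteropSupport::kMemoryOnly;
  }
  return InteropSupport::kMemoryAndFence;
}
```

A kMemoryOnly result corresponds to the row in the table above where memory import is usable with synchronous Run() but GPU-side fence synchronization is not available.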

6.4 ORT routing layer (public API → EP factory)

The ORT core provides the glue between the public C API and the EP factory. This is analogous to how CreateSyncStreamForEpDevice (public) routes to OrtEpFactory::CreateSyncStreamForDevice (EP plugin).

Type mapping:

Public API types          →  EP Plugin types
─────────────────────────────────────────────
OrtEpDevice*              →  OrtMemoryDevice*  (extracted by ORT)
OrtExternalResourceImporter*  →  OrtExternalResourceImporterImpl*  (wrapped by ORT)
OrtExternalMemoryHandle*      →  OrtExternalMemoryHandleImpl*  (wrapped by ORT)
OrtExternalSemaphoreHandle*   →  OrtExternalSemaphoreHandleImpl*  (wrapped by ORT)

Call flow for CreateExternalResourceImporterForDevice:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Client Application                                                         │
│  ─────────────────                                                          │
│  OrtApi->CreateExternalResourceImporterForDevice(epDevice, &importer)       │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  ORT Core (onnxruntime_c_api.cc)                                            │
│  ───────────────────────────────                                            │
│  1. Extract OrtMemoryDevice* from OrtEpDevice*                              │
│  2. Look up OrtEpFactory* for the EP that owns this device                  │
│  3. Check factory->CreateExternalResourceImporterForDevice != nullptr       │
│  4. Call factory->CreateExternalResourceImporterForDevice(memoryDevice,     │
│                                                           &implPtr)         │
│  5. Wrap OrtExternalResourceImporterImpl* in OrtExternalResourceImporter*   │
│  6. Return wrapped handle to client                                         │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  EP Factory (e.g., NvTensorRtRtxEpFactory)                                  │
│  ─────────────────────────────────────────                                  │
│  CreateExternalResourceImporterForDevice(memoryDevice, &implPtr):           │
│    1. Validate memoryDevice matches EP's supported devices                  │
│    2. Create NvTrtRtxExternalResourceImporterImpl instance                  │
│    3. Initialize CUDA context for the device                                │
│    4. Return impl pointer                                                   │
└─────────────────────────────────────────────────────────────────────────────┘

Call flow for memory/semaphore operations:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Client Application                                                         │
│  ─────────────────                                                          │
│  OrtApi->ExternalResourceImporter_ImportMemory(importer, &desc, &memHandle) │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  ORT Core                                                                   │
│  ────────                                                                   │
│  1. Unwrap OrtExternalResourceImporter* → OrtExternalResourceImporterImpl*  │
│  2. Call impl->ImportMemory(impl, desc, &implHandle)                        │
│  3. Wrap OrtExternalMemoryHandleImpl* in OrtExternalMemoryHandle*           │
│  4. Return wrapped handle to client                                         │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  EP Implementation (e.g., NvTrtRtxExternalResourceImporterImpl)             │
│  ──────────────────────────────────────────────────────────────             │
│  ImportMemory(desc, &implHandle):                                           │
│    1. Map ORT handle type → CUDA handle type                                │
│    2. Call cuImportExternalMemory() with D3D12 shared handle                │
│    3. Call cuExternalMemoryGetMappedBuffer() to get device pointer          │
│    4. Store CUexternalMemory + CUdeviceptr in impl handle                   │
│    5. Return impl handle                                                    │
└─────────────────────────────────────────────────────────────────────────────┘

Key design points:

  • ORT owns the wrapper lifetime: The public handles (OrtExternalResourceImporter*, etc.) are ORT-allocated wrappers that hold a pointer to the EP-allocated impl.
  • EP owns the impl lifetime: When the client calls ReleaseExternalResourceImporter, ORT calls impl->Release(impl) to let the EP clean up.
  • Type safety via opaque handles: Clients cannot accidentally pass an impl pointer; only ORT can unwrap the public handle.
  • EP-agnostic public API: The public API has no EP-specific types; all EP details are hidden behind the impl interface.
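
The wrapper and impl ownership split described in these points could look roughly like the sketch below. ImporterImpl and ImporterWrapper are simplified, hypothetical stand-ins for OrtExternalResourceImporterImpl and the ORT-side wrapper. The release_count field exists only so the behaviour can be observed.

```cpp
// Hypothetical stand-in for an EP-owned impl exposing a Release callback.
struct ImporterImpl {
  void (*Release)(ImporterImpl* self);
  int* release_count;  // observation hook for this sketch only
};

// ORT-side wrapper: ORT allocates and frees the wrapper itself, and hands the
// impl back to the EP through its Release callback when the wrapper goes away.
class ImporterWrapper {
 public:
  explicit ImporterWrapper(ImporterImpl* impl) : impl_(impl) {}
  ~ImporterWrapper() {
    if (impl_ != nullptr) impl_->Release(impl_);
  }
  // Non-copyable so the impl is released exactly once.
  ImporterWrapper(const ImporterWrapper&) = delete;
  ImporterWrapper& operator=(const ImporterWrapper&) = delete;

 private:
  ImporterImpl* impl_;  // never exposed to clients
};
```

Keeping impl_ private is what gives clients an opaque handle: they can only hold the wrapper, and only ORT code can reach the EP object behind it.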

7. Justification (fit with existing ORT patterns)

The proposed API extends OrtEpFactory with one optional factory method, following the same "create capability object" pattern used for existing optional EP features:

Existing OrtEpFactory capability           Pattern
──────────────────────────────────────────────────────────────────────────────
CreateAllocator()                          Returns OrtAllocator* with bundled Alloc/Free operations
CreateDataTransfer()                       Returns OrtDataTransferImpl* with bundled Copy operations
CreateSyncStreamForDevice()                Returns OrtSyncStreamImpl* with bundled stream operations

Proposed capability                        Pattern
──────────────────────────────────────────────────────────────────────────────
CreateExternalResourceImporterForDevice()  Returns OrtExternalResourceImporterImpl* with bundled memory/semaphore import operations

Key alignments:

  • Agent-noun naming: Importer follows the same pattern as Allocator.
  • ForDevice suffix: Matches CreateSyncStreamForDevice; resources are inherently device-scoped.
  • Single factory entry point: Check one pointer for capability, not 8+.
  • Capability object owns operations: All memory and semaphore ops are methods on the returned object.
  • EP returns *Impl objects: ORT wraps them in public handles (same as OrtSyncStreamImpl → OrtSyncStream).
  • Explicit release: Release method on the capability object; separate Release*Handle for imported resources.
  • Dependency is explicit: Semaphore wait/signal take OrtSyncStream*; EP can return ORT_NOT_IMPLEMENTED if not stream-aware.

8. Client calling pattern (MVP code)

Below is concrete C++ code demonstrating the MVP usage flow. Error handling is abbreviated for clarity, and the descriptor initializers use C++20 designated-initializer syntax.

8.1 Setup (once per EP device)

The OrtExternalResourceImporter is created per EP device (not per session). Multiple sessions using the same EP device can share the same importer. Imported memory and semaphore handles can be reused across sessions.

// ─────────────────────────────────────────────────────────────────────────────
// 1. App creates D3D12 resources with sharing enabled
// ─────────────────────────────────────────────────────────────────────────────
ComPtr<ID3D12Resource> inputBuffer, outputBuffer;
ComPtr<ID3D12Fence> fence;
HANDLE inputHandle, outputHandle, fenceHandle;

// Create buffers with D3D12_HEAP_FLAG_SHARED
CreateD3D12Buffer(d3d12Device, inputSize, D3D12_HEAP_FLAG_SHARED, &inputBuffer);
CreateD3D12Buffer(d3d12Device, outputSize, D3D12_HEAP_FLAG_SHARED, &outputBuffer);
d3d12Device->CreateFence(0, D3D12_FENCE_FLAG_SHARED, IID_PPV_ARGS(&fence));

// Create shared handles (NT handles)
d3d12Device->CreateSharedHandle(inputBuffer.Get(), nullptr, GENERIC_ALL, nullptr, &inputHandle);
d3d12Device->CreateSharedHandle(outputBuffer.Get(), nullptr, GENERIC_ALL, nullptr, &outputHandle);
d3d12Device->CreateSharedHandle(fence.Get(), nullptr, GENERIC_ALL, nullptr, &fenceHandle);

// ─────────────────────────────────────────────────────────────────────────────
// 2. Find an ORT EP device matching the D3D12 adapter
// ─────────────────────────────────────────────────────────────────────────────
const OrtEpDevice* epDevice = nullptr;
const OrtEpDevice* const* devices = nullptr;
size_t numDevices = 0;
OrtApi->GetEpDevices(env, &devices, &numDevices);

LUID adapterLuid = d3d12Device->GetAdapterLuid();
for (size_t i = 0; i < numDevices; ++i) {
    const OrtHardwareDevice* hw = OrtApi->EpDevice_Device(devices[i]);
    const OrtKeyValuePairs* metadata = OrtApi->HardwareDevice_Metadata(hw);
    const char* luidStr = OrtApi->GetKeyValue(metadata, "LUID");
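    if (!luidStr) continue;
    // ParseLuid is an app-defined helper; LUID has no operator==, so compare both fields.
    LUID luid = ParseLuid(luidStr);
    if (luid.LowPart == adapterLuid.LowPart && luid.HighPart == adapterLuid.HighPart) {
        epDevice = devices[i];
        break;
    }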
}

// ─────────────────────────────────────────────────────────────────────────────
// 3. Create the external resource importer for this device
// ─────────────────────────────────────────────────────────────────────────────
OrtExternalResourceImporter* importer = nullptr;
OrtStatus* status = OrtApi->CreateExternalResourceImporterForDevice(epDevice, &importer);
if (status != nullptr) {
    // EP doesn't support external resource interop
    // Fall back to copy-based path
    OrtApi->ReleaseStatus(status);
}

// ─────────────────────────────────────────────────────────────────────────────
// 4. Check capabilities
// ─────────────────────────────────────────────────────────────────────────────
bool canImportD3D12Resource = false, canImportFence = false;
OrtApi->ExternalResourceImporter_CanImportMemory(importer, ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE, &canImportD3D12Resource);
OrtApi->ExternalResourceImporter_CanImportSemaphore(importer, ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE, &canImportFence);

if (!canImportD3D12Resource || !canImportFence) {
    // EP supports external resources but not this handle type
}

// ─────────────────────────────────────────────────────────────────────────────
// 5. Import memory and semaphore
// ─────────────────────────────────────────────────────────────────────────────
OrtExternalMemoryDescriptor inputDesc = {
    .version = ORT_EXTERNAL_MEMORY_DESCRIPTOR_VERSION,
    .handle_type = ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE,
    .native_handle = inputHandle,
    .size_bytes = inputSize,
    .offset_bytes = 0,
    .access_mode = ORT_EXTERNAL_MEMORY_ACCESS_READ_ONLY
};
OrtExternalMemoryDescriptor outputDesc = {
    .version = ORT_EXTERNAL_MEMORY_DESCRIPTOR_VERSION,
    .handle_type = ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE,
    .native_handle = outputHandle,
    .size_bytes = outputSize,
    .offset_bytes = 0,
    .access_mode = ORT_EXTERNAL_MEMORY_ACCESS_WRITE_ONLY
};
OrtExternalSemaphoreDescriptor semDesc = {
    .version = ORT_EXTERNAL_SEMAPHORE_DESCRIPTOR_VERSION,
    .type = ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE,
    .native_handle = fenceHandle
};

OrtExternalMemoryHandle* inputMem = nullptr;
OrtExternalMemoryHandle* outputMem = nullptr;
OrtExternalSemaphoreHandle* sem = nullptr;

OrtApi->ExternalResourceImporter_ImportMemory(importer, &inputDesc, &inputMem);
OrtApi->ExternalResourceImporter_ImportMemory(importer, &outputDesc, &outputMem);
OrtApi->ExternalResourceImporter_ImportSemaphore(importer, &semDesc, &sem);

// ─────────────────────────────────────────────────────────────────────────────
// 6. Create tensors aliasing the imported memory
// ─────────────────────────────────────────────────────────────────────────────
int64_t inputShape[] = {1, 3, 224, 224};
int64_t outputShape[] = {1, 1000};

OrtExternalTensorDescriptor inputTensorDesc = {
    .version = ORT_EXTERNAL_TENSOR_DESCRIPTOR_VERSION,
    .element_type = ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT,
    .shape = inputShape,
    .rank = 4,
    .offset_bytes = 0
};
OrtExternalTensorDescriptor outputTensorDesc = {
    .version = ORT_EXTERNAL_TENSOR_DESCRIPTOR_VERSION,
    .element_type = ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT,
    .shape = outputShape,
    .rank = 2,
    .offset_bytes = 0
};

OrtValue* inputTensor = nullptr;
OrtValue* outputTensor = nullptr;
OrtApi->ExternalResourceImporter_CreateTensorFromMemory(importer, inputMem, &inputTensorDesc, nullptr, &inputTensor);
OrtApi->ExternalResourceImporter_CreateTensorFromMemory(importer, outputMem, &outputTensorDesc, nullptr, &outputTensor);

// ─────────────────────────────────────────────────────────────────────────────
// 7. Create sync stream for this EP device
// ─────────────────────────────────────────────────────────────────────────────
OrtSyncStream* stream = nullptr;
OrtApi->CreateSyncStreamForEpDevice(epDevice, nullptr, &stream);

8.2 Session setup and device validation

// ─────────────────────────────────────────────────────────────────────────────
// Create session with the EP device
// ─────────────────────────────────────────────────────────────────────────────
OrtSessionOptions* sessionOptions = nullptr;
OrtApi->CreateSessionOptions(&sessionOptions);

const OrtEpDevice* deviceArray[] = {epDevice};
OrtApi->SessionOptionsAppendExecutionProvider_V2(sessionOptions, env, deviceArray, 1, nullptr, nullptr, 0);

OrtSession* session = nullptr;
OrtApi->CreateSession(env, modelPath, sessionOptions, &session);

// ─────────────────────────────────────────────────────────────────────────────
// Validate device assignment for inputs and outputs
// ─────────────────────────────────────────────────────────────────────────────
size_t numInputs = 0, numOutputs = 0;
OrtApi->SessionGetInputCount(session, &numInputs);
OrtApi->SessionGetOutputCount(session, &numOutputs);

// Validate inputs are assigned to expected EP device
std::vector<const OrtEpDevice*> inputDevices(numInputs);
OrtApi->SessionGetEpDeviceForInputs(session, inputDevices.data(), numInputs);
for (size_t i = 0; i < numInputs; ++i) {
    if (inputDevices[i] != epDevice) {
        // Input is not on expected device - may need fallback
    }
}

// Validate outputs are assigned to expected EP device
std::vector<const OrtEpDevice*> outputDevices(numOutputs);
OrtApi->SessionGetEpDeviceForOutputs(session, outputDevices.data(), numOutputs);
for (size_t i = 0; i < numOutputs; ++i) {
    if (outputDevices[i] != epDevice) {
        // Output is not on expected device - may need fallback
    }
}

8.3 Per-frame execution with Run() (no IOBinding)

This example shows the simple Run() pattern with RunOptions_SetSyncStream.

// ─────────────────────────────────────────────────────────────────────────────
// Per-frame: D3D12 upload → ORT inference → D3D12 consume
// ─────────────────────────────────────────────────────────────────────────────
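// currentFrame is assumed to start at 1: the fence was created with initial value 0,
// so a wait on fenceValue 0 would be satisfied before the upload finishes.
uint64_t fenceValue = currentFrame * 2;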

// App: copy input data to inputBuffer via D3D12 command list
d3d12CommandQueue->ExecuteCommandLists(1, &uploadCmdList);
d3d12CommandQueue->Signal(fence.Get(), fenceValue);

// ORT: wait for D3D12 upload to complete
OrtApi->ExternalResourceImporter_WaitSemaphore(importer, sem, stream, fenceValue);

// Set up RunOptions with the sync stream
OrtRunOptions* runOptions = nullptr;
OrtApi->CreateRunOptions(&runOptions);
OrtApi->RunOptions_SetSyncStream(runOptions, stream);

// Run inference using simple Run() API (no IOBinding required)
const char* inputNames[] = {"input"};
const char* outputNames[] = {"output"};
const OrtValue* inputs[] = {inputTensor};
OrtValue* outputs[] = {outputTensor};  // Pre-allocated external tensor

OrtApi->Run(session, runOptions, inputNames, inputs, 1, outputNames, 1, outputs);

// Signal completion so D3D12 can consume the output
OrtApi->ExternalResourceImporter_SignalSemaphore(importer, sem, stream, fenceValue + 1);

// App: wait for ORT inference to complete, then use outputBuffer
fence->SetEventOnCompletion(fenceValue + 1, waitEvent);
WaitForSingleObject(waitEvent, INFINITE);
// ... read back or further process outputBuffer

OrtApi->ReleaseRunOptions(runOptions);
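
The per-frame fence values in this flow need to be strictly increasing and greater than the fence's initial value of 0, otherwise a wait can complete before the matching signal. One way to make that explicit is a small helper that derives both values from a zero-based frame index. This is a sketch of an alternative numbering, not the scheme used in the listing above. FrameFenceValues and FenceValuesForFrame are hypothetical names.

```cpp
#include <cstdint>

struct FrameFenceValues {
  uint64_t upload_done;     // signaled by the D3D12 queue, waited on by the EP stream
  uint64_t inference_done;  // signaled by the EP stream, waited on by the app
};

// frame_index is zero-based. Values start at 1 because the shared fence is
// created with an initial value of 0.
constexpr FrameFenceValues FenceValuesForFrame(uint64_t frame_index) {
  return {2 * frame_index + 1, 2 * frame_index + 2};
}
```

With this numbering, frame 0 waits on 1 and signals 2, frame 1 waits on 3 and signals 4, and so on, so no wait value ever equals a value the fence has already reached.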

8.4 Per-frame execution with IOBinding (alternative)

This example shows the IOBinding pattern, which may be preferred for complex input/output scenarios.

// ─────────────────────────────────────────────────────────────────────────────
// Setup IOBinding (can be reused across frames)
// ─────────────────────────────────────────────────────────────────────────────
OrtIoBinding* ioBinding = nullptr;
OrtApi->CreateIoBinding(session, &ioBinding);
OrtApi->BindInput(ioBinding, "input", inputTensor);
OrtApi->BindOutput(ioBinding, "output", outputTensor);

// ─────────────────────────────────────────────────────────────────────────────
// Per-frame: D3D12 upload → ORT inference → D3D12 consume
// ─────────────────────────────────────────────────────────────────────────────
// currentFrame is assumed to start at 1 so that fenceValue exceeds the fence's initial value (0).
uint64_t fenceValue = currentFrame * 2;

// App: copy input data to inputBuffer via D3D12 command list
d3d12CommandQueue->ExecuteCommandLists(1, &uploadCmdList);
d3d12CommandQueue->Signal(fence.Get(), fenceValue);

// ORT: wait for D3D12 upload to complete
OrtApi->ExternalResourceImporter_WaitSemaphore(importer, sem, stream, fenceValue);

// Set up RunOptions with the sync stream
OrtRunOptions* runOptions = nullptr;
OrtApi->CreateRunOptions(&runOptions);
OrtApi->RunOptions_SetSyncStream(runOptions, stream);

// Run inference using IOBinding
OrtApi->RunWithBinding(session, runOptions, ioBinding);

// Signal completion so D3D12 can consume the output
OrtApi->ExternalResourceImporter_SignalSemaphore(importer, sem, stream, fenceValue + 1);

// App: wait for ORT inference to complete, then use outputBuffer
fence->SetEventOnCompletion(fenceValue + 1, waitEvent);
WaitForSingleObject(waitEvent, INFINITE);

OrtApi->ReleaseRunOptions(runOptions);

8.5 Cleanup

// Release dependents before what they depend on:
// tensor views before memory handles, imported handles before the importer.
OrtApi->ReleaseIoBinding(ioBinding);  // If using IOBinding
OrtApi->ReleaseValue(outputTensor);
OrtApi->ReleaseValue(inputTensor);
OrtApi->ReleaseExternalSemaphoreHandle(sem);
OrtApi->ReleaseExternalMemoryHandle(outputMem);
OrtApi->ReleaseExternalMemoryHandle(inputMem);
OrtApi->ReleaseSyncStream(stream);
OrtApi->ReleaseExternalResourceImporter(importer);
OrtApi->ReleaseSession(session);
OrtApi->ReleaseSessionOptions(sessionOptions);

// Close Win32 handles
CloseHandle(fenceHandle);
CloseHandle(outputHandle);
CloseHandle(inputHandle);

9. Ownership & lifetime

  • The app owns the underlying D3D12 resource/heap/fence and their shared handles.
  • ORT owns OrtExternalMemoryHandle / OrtExternalSemaphoreHandle wrappers and EP-side imports.
  • OrtValue tensors created from external memory are views and remain valid only while the underlying external memory handle remains valid.
  • OrtExternalResourceImporter is stateless — it provides import capabilities for a device but holds no per-session state. Multiple sessions can independently import resources and use separate streams/semaphores concurrently.

10. EP implementation expectations

10.1 NvTensorRtRtx (CUDA)

NvTensorRtRtx is a distinct EP (not the built-in CUDA EP). It uses CUDA Driver APIs internally for interop.

Handle types (from CUexternalMemoryHandleType_enum / CUexternalSemaphoreHandleType_enum):

| ORT Handle Type | CUDA Handle Type |
| --- | --- |
| ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE | CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE (value 5) |
| ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP | CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP (value 4) |
| ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE | CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_D3D12_FENCE (value 4) |

API mapping:

| OrtExternalResourceImporterImpl method | CUDA Driver API(s) |
| --- | --- |
| ImportMemory | cuImportExternalMemory() + cuExternalMemoryGetMappedBuffer() |
| ReleaseMemory | cuDestroyExternalMemory() + cuMemFree() (mapped buffer) |
| ImportSemaphore | cuImportExternalSemaphore() |
| ReleaseSemaphore | cuDestroyExternalSemaphore() |
| WaitSemaphore | cuWaitExternalSemaphoresAsync() |
| SignalSemaphore | cuSignalExternalSemaphoresAsync() |

Implementation notes:

  • For D3D12 resources, set CUDA_EXTERNAL_MEMORY_DEDICATED flag in CUDA_EXTERNAL_MEMORY_HANDLE_DESC::flags.
  • The CUDA driver does not take ownership of the Win32 HANDLE; the application must keep it valid until import completes.
  • Wait/signal use CUDA_EXTERNAL_SEMAPHORE_WAIT_PARAMS::params::fence::value / CUDA_EXTERNAL_SEMAPHORE_SIGNAL_PARAMS::params::fence::value for the 64-bit timeline fence value.
  • The CUstream is obtained from OrtSyncStream native handle.
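
The handle-type translation and the dedicated-flag rule above can be sketched as two small functions. This is an illustration under stated assumptions: `OrtMemHandleKind` is a stand-in for the proposal's enum (its real values are not specified here), and the CUDA constants are local mirrors of the values listed in the table so the sketch builds without `cuda.h`.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the proposal's memory handle enum; the enumerator values
// here are placeholders, not the values ORT would assign.
enum class OrtMemHandleKind { kD3D12Resource, kD3D12Heap };

// Local mirrors of the CUDA driver enum values quoted in the table above.
constexpr uint32_t kCuMemHandleD3D12Heap = 4;
constexpr uint32_t kCuMemHandleD3D12Resource = 5;
constexpr uint32_t kCuSemHandleD3D12Fence = 4;

// Translate the ORT-side handle kind to the CUDA handle type value.
constexpr uint32_t ToCudaMemoryHandleType(OrtMemHandleKind kind) {
  return kind == OrtMemHandleKind::kD3D12Resource ? kCuMemHandleD3D12Resource
                                                  : kCuMemHandleD3D12Heap;
}

// Per the implementation notes, D3D12 resources are imported with the
// dedicated flag set; this sketch assumes heaps are imported without it.
constexpr bool NeedsDedicatedFlag(OrtMemHandleKind kind) {
  return kind == OrtMemHandleKind::kD3D12Resource;
}
```

An EP's ImportMemory would use these results to fill in the handle type and flags of the import descriptor before calling the driver.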

Note: the CUDA EP (separate from NvTensorRtRtx) could implement the same optional factory entry points in the future.

10.2 MiGraphX (HIP)

HIP has full native support for D3D12 external memory and semaphore interop on Windows. The HIP runtime API (hip_runtime_api.h) defines:

Handle types (from hipExternalMemoryHandleType_enum / hipExternalSemaphoreHandleType_enum):

| ORT Handle Type | HIP Handle Type |
| --- | --- |
| ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE | hipExternalMemoryHandleTypeD3D12Resource (value 5) |
| ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP | hipExternalMemoryHandleTypeD3D12Heap (value 4) |
| ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE | hipExternalSemaphoreHandleTypeD3D12Fence (value 4) |

API mapping:

| OrtExternalResourceImporterImpl method | HIP API(s) |
| --- | --- |
| ImportMemory | hipImportExternalMemory() + hipExternalMemoryGetMappedBuffer() |
| ReleaseMemory | hipDestroyExternalMemory() |
| ImportSemaphore | hipImportExternalSemaphore() |
| ReleaseSemaphore | hipDestroyExternalSemaphore() |
| WaitSemaphore | hipWaitExternalSemaphoresAsync() |
| SignalSemaphore | hipSignalExternalSemaphoresAsync() |

Implementation notes:

  • HIP external semaphore functions are documented as "currently not supported on Linux", which implies Windows is the supported path.
  • MiGraphX already has HIP stream integration in this repo; the new entry points would use the existing hipStream_t obtained from OrtSyncStream.
  • The implementation pattern is effectively identical to the CUDA path (HIP mirrors the CUDA driver API semantics for external resource interop).

Therefore, MiGraphX can and should implement CreateExternalResourceImporterForDevice using HIP today.

10.3 OpenVINO GPU (OpenCL / D3D11)

OpenVINO's GPU plugin has a mature Remote Tensor API for memory sharing with native APIs. On Windows, it supports:

  • D3D11 surfaces: ID3D11Buffer and ID3D11Texture2D via ov::intel_gpu::ocl::D3DContext
  • OpenCL interop: cl_mem, cl_context, cl_command_queue via ov::intel_gpu::ocl::ClContext
  • NV12 video surfaces: Direct consumption of hardware video decoder output

D3D12 → D3D11 import path:

D3D11.1 added ID3D11Device1::OpenSharedResource1() which can import NT handles created by D3D12's CreateSharedHandle(). This means OpenVINO can implement the proposed API:

D3D12 app                           ORT API                        OpenVINO EP
───────────────────────────────────────────────────────────────────────────────
CreateSharedHandle() ───► ExternalResourceImporter_ImportMemory() ───► OpenSharedResource1()
     │                                                                      │
     └─► NT HANDLE ────────────────────────────────────────────────────────►└─► ID3D11Buffer

Requirements:

  • D3D12 resource created with D3D12_HEAP_FLAG_SHARED
  • Same GPU (matching adapter LUID)
  • OpenVINO's internal D3D11 device is D3D11.1+
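
The "same GPU" requirement comes down to comparing adapter LUIDs. A minimal, portable sketch of that comparison follows. `AdapterLuid` mirrors the layout of the Win32 `LUID` struct so the example builds without Windows headers, and `FindAdapterByLuid` is a hypothetical helper. How an app obtains the LUID from ORT device metadata is not shown, since its encoding is not specified in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable mirror of the Win32 LUID layout: 32-bit unsigned low part,
// 32-bit signed high part.
struct AdapterLuid {
  uint32_t low_part;
  int32_t high_part;
};

// Pack both halves into one 64-bit key so LUIDs compare as a single value.
constexpr uint64_t PackLuid(AdapterLuid luid) {
  return (static_cast<uint64_t>(static_cast<uint32_t>(luid.high_part)) << 32) |
         luid.low_part;
}

constexpr std::size_t kNoMatch = static_cast<std::size_t>(-1);

// Returns the index of the first candidate with the same LUID as `target`,
// or kNoMatch if no candidate refers to the same adapter.
inline std::size_t FindAdapterByLuid(const std::vector<AdapterLuid>& candidates,
                                     AdapterLuid target) {
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    if (PackLuid(candidates[i]) == PackLuid(target)) return i;
  }
  return kNoMatch;
}
```

Both halves must match: two adapters that share a low part but differ in the high part are different devices.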

Implementation sketch:

// Inside OpenVINO's OrtExternalResourceImporterImpl
OrtStatus* OpenVINO_ImportMemory(
    OrtExternalResourceImporterImpl* this_ptr,
    const OrtExternalMemoryDescriptor* desc,
    OrtExternalMemoryHandleImpl** out) {
  
  // OpenVINO already has an ID3D11Device for its D3DContext
  ID3D11Device1* d3d11Device = GetOpenVINOD3D11Device();
  
  ID3D11Buffer* d3d11Buffer = nullptr;
  HRESULT hr = d3d11Device->OpenSharedResource1(
      (HANDLE)desc->native_handle,
      IID_PPV_ARGS(&d3d11Buffer));
  
  if (FAILED(hr)) return ORT_MAKE_STATUS(/* ... */);
  
  // Now use d3d11Buffer with existing OpenVINO D3DContext::create_tensor()
  // ...
  return nullptr;  // a null OrtStatus* signals success
}

Synchronization consideration: D3D12 timeline fences can also be imported to D3D11 via ID3D11Device5::OpenSharedFence(), enabling fence-based sync if OpenVINO's D3D11 device is 11.4+.

Conclusion: OpenVINO can implement this D3D12 external resource API using D3D11's shared resource import. The key insight is that D3D11.1+ can consume D3D12 shared handles—there's no "D3D12-on-D3D11" layer needed, just the standard shared handle import path.

10.4 Vulkan-based EPs (future considerations)

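Vulkan can import D3D12 external memory and semaphores through its Win32 external-handle extensions. The base external memory and external semaphore functionality is core in Vulkan 1.1, but the Win32 handle extensions are still device extensions that must be enabled explicitly. This means Vulkan-based EPs can implement the proposed D3D12 interop API.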

Handle types (from VkExternalMemoryHandleTypeFlagBits / VkExternalSemaphoreHandleTypeFlagBits):

| ORT Handle Type | Vulkan Handle Type | Vulkan Value |
| --- | --- | --- |
| ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE | VK_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE_BIT | 0x00000040 |
| ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP | VK_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP_BIT | 0x00000020 |
| ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE | VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_D3D12_FENCE_BIT | 0x00000008 |

API mapping:

| OrtExternalResourceImporterImpl method | Vulkan API(s) |
| --- | --- |
| ImportMemory | vkAllocateMemory with VkImportMemoryWin32HandleInfoKHR in pNext chain |
| ReleaseMemory | vkFreeMemory |
| ImportSemaphore | vkImportSemaphoreWin32HandleKHR |
| ReleaseSemaphore | vkDestroySemaphore |
| WaitSemaphore | Queue submit with a timeline wait semaphore (GPU-side wait); vkWaitSemaphores only for a host-side wait |
| SignalSemaphore | Queue submit with a timeline signal semaphore (GPU-side signal); vkSignalSemaphore only for a host-side signal |

Required Vulkan extensions and features:

  • VK_KHR_external_memory_win32: import Win32 handles as VkDeviceMemory (device extension, not promoted to core)
  • VK_KHR_external_semaphore_win32: import Win32 fence handles as VkSemaphore (device extension, not promoted to core)
  • VK_KHR_timeline_semaphore: required for D3D12 fence interop (64-bit values); promoted to core in Vulkan 1.2

Implementation notes:

  • D3D12 fences are timeline semaphores; Vulkan requires VK_SEMAPHORE_TYPE_TIMELINE for interop.
  • The Win32 HANDLE is passed via VkImportMemoryWin32HandleInfoKHR::handle / VkImportSemaphoreWin32HandleInfoKHR::handle.
  • The D3D12 adapter LUID must match the Vulkan physical device's VkPhysicalDeviceIDProperties::deviceLUID, which is only meaningful when deviceLUIDValid is VK_TRUE (same GPU requirement).
  • After memory import, bind it to a VkBuffer via vkBindBufferMemory to get usable storage.
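
The timeline semantics that both D3D12 fences and Vulkan timeline semaphores rely on can be illustrated with a toy host-side model: a 64-bit counter that only moves forward, where a wait for value v is satisfied once the counter is at least v. `TimelineModel` is a hypothetical teaching aid, not a D3D12 or Vulkan object, and it performs no GPU synchronization.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a timeline fence/semaphore payload (illustration only).
class TimelineModel {
 public:
  explicit TimelineModel(uint64_t initial = 0) : value_(initial) {}

  // Vulkan requires each signaled timeline value to be strictly greater
  // than the current one; the model enforces the same rule and reports
  // whether the signal was accepted.
  bool Signal(uint64_t value) {
    if (value <= value_) return false;
    value_ = value;
    return true;
  }

  // A wait for `value` is satisfied once the counter reaches or passes it.
  bool IsSatisfied(uint64_t value) const { return value_ >= value; }

  uint64_t value() const { return value_; }

 private:
  uint64_t value_;
};
```

Because a wait is satisfied by any counter value at or above its target, one signal can release several pending waits, which is why a single imported fence is enough to order both directions of the app/EP handoff.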

Applicability to ORT EPs:

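| EP | Vulkan-based? | Can implement D3D12 interop? |
| --- | --- | --- |
| WebGPU (Dawn backend) | Only when Dawn is configured to use its Vulkan backend; Dawn's default Windows backend is D3D12 | ✅ Yes on the Vulkan backend, via Vulkan external memory extensions |
| Future Vulkan EP | ✅ Native Vulkan | ✅ Yes, direct implementation |
| DirectML | ❌ D3D12-native | N/A (D3D12 is native) |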

Conclusion: Vulkan-based ORT EPs can implement the proposed D3D12 external resource API using Vulkan's external memory/semaphore extensions. The API design is fully compatible.

11. Future considerations

  • Texture/surface interop: Define a clean path for non-linear/strided/image-backed tensors

Why not D3D11 handle types in the API?

The API only defines D3D12 handle types. Adding D3D11-specific types (ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D11_RESOURCE) is out of scope because:

| D3D12 handles (in scope) | D3D11 handles (out of scope) |
| --- | --- |
| App creates CreateSharedHandle() → standalone NT handle | App would need to export via IDXGIResource1::CreateSharedHandle() |
| Any API can import directly (CUDA, HIP, Vulkan, D3D11) | Only D3D11 devices can import D3D11 handles |
| Timeline fences are self-contained | Keyed mutexes require device-level coordination |

The key insight: D3D12 shared handles are the universal currency. They can be imported by:

  • CUDA (cuImportExternalMemory with D3D12_RESOURCE)
  • HIP (hipImportExternalMemory with D3D12Resource)
  • Vulkan (vkAllocateMemory with VK_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE_BIT)
  • D3D11 (ID3D11Device1::OpenSharedResource1)

So even EPs that internally use D3D11 (like OpenVINO) can consume D3D12 handles via OpenSharedResource1(). There's no need for separate D3D11 handle types.

For applications that only have D3D11 resources:

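  1. Create the D3D11 device on top of a D3D12 device with D3D11On12CreateDevice(), which yields an ID3D11On12Device
  2. Get the underlying D3D12 resource and create a shared handle with ID3D12Device::CreateSharedHandle()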
  3. Use this API with the D3D12 handle

12. Requirements traceability

| Requirement | Addressed by |
| --- | --- |
| Zero-copy tensor views | ExternalResourceImporter_CreateTensorFromMemory creates a view backed by imported GPU memory; no copy occurs. |
| D3D12 sharing (resource + heap) | OrtExternalMemoryHandleType enum supports D3D12_RESOURCE and D3D12_HEAP. |
| Explicit synchronization | ExternalResourceImporter_ImportSemaphore + WaitSemaphore/SignalSemaphore with 64-bit fence value on OrtSyncStream. |
| EP capability discovery | CreateExternalResourceImporterForDevice returns null/NOT_IMPLEMENTED if EP doesn't support; CanImportMemory/CanImportSemaphore for specific types. |
| Device identity matching | ORT already populates "LUID" metadata in OrtHardwareDevice; apps match to D3D12 adapter LUID. SessionGetEpDeviceForOutputs validates output placement. |
| Fits existing ORT primitives | Single CreateExternalResourceImporterForDevice factory method returns capability object; uses OrtSyncStream; opaque handles with Release*. |
| Output tensors (write access) | OrtExternalMemoryAccessMode enum includes WRITE_ONLY and READ_WRITE. |
| Async Run integration | RunOptions_SetSyncStream associates a stream with Run for use with Run() or RunWithBinding(). |
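
Because there are no implicit fallback copies, the application has to turn the capability answers into an explicit choice of data path. A minimal sketch of that decision follows. `ImportCapabilities`, `InputPath`, and `ChooseInputPath` are hypothetical app-side names, and the sketch assumes the zero-copy path needs both memory and semaphore import. An app that synchronizes on the CPU instead could relax that.

```cpp
#include <cassert>

// Hypothetical app-side summary of what CanImportMemory / CanImportSemaphore
// reported for the handle types the app intends to use.
struct ImportCapabilities {
  bool can_import_memory;
  bool can_import_semaphore;
};

enum class InputPath {
  kZeroCopyImport,  // import D3D12 memory + fence, no staging copy
  kStagedCopy,      // app-chosen copy path (GPU -> CPU -> GPU)
  kUnsupported      // app opted out of copies and the EP cannot import
};

// The EP never falls back silently; the app decides whether a copy-based
// path is acceptable when import is unavailable.
constexpr InputPath ChooseInputPath(ImportCapabilities caps,
                                    bool allow_copy_fallback) {
  return (caps.can_import_memory && caps.can_import_semaphore)
             ? InputPath::kZeroCopyImport
             : (allow_copy_fallback ? InputPath::kStagedCopy
                                    : InputPath::kUnsupported);
}
```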

Describe scenario use case

Scenario / Use Case

Importing D3D12 resources that are already resident on a particular inferencing device is a commonly requested scenario. See #26543 for an example of a similar implementation proposal.

Concrete examples:

  1. Video/media pipelines: A video conferencing app decodes frames via a hardware video decoder (D3D12/D3D11). Today, to run background blur or super-resolution via ORT, it must copy GPU→CPU→GPU. With this API, the decoded frame stays on the GPU and is consumed directly by the EP.

  2. Game/render integration: A game uses D3D12 for rendering and wants ML-based upscaling (DLSS-style) or denoising. The render target is already in VRAM; copying to CPU and back adds latency and bandwidth overhead.

  3. Multi-engine composition: Apps like video editors or creative tools use D3D12 compute for some operations and ORT for others. Resources ping-pong between engines; zero-copy sharing eliminates redundant transfers.

  4. Real-time latency-sensitive workloads: AR/VR applications where every millisecond matters. GPU→CPU→GPU copies can add 2-5ms of latency per frame, breaking real-time constraints.

Current workarounds are unsatisfactory:

  • EP-specific private hooks (not portable across EPs)
  • Copy-based paths (defeats the purpose of GPU residency)
  • DML-only binding (limited to one EP, no cross-EP pattern)
