[Feature Request] D3D12 External Resource Interop API for Plugin EPs

### Describe the feature request

# ONNX Runtime: D3D12 External Resource Import

## 1. Problem statement
Windows graphics/video apps increasingly hold data in GPU-resident D3D12 resources (captured frames, render targets, compute buffers). Today, using those resources as ONNX Runtime inputs/outputs typically requires extra copies (GPU→CPU→GPU) or EP-specific private hooks.

This proposal defines a **generic, EP-agnostic ORT public API** that lets an application:
- provide D3D12-shared resources/heaps as **zero-copy tensor storage**, and
- perform **explicit GPU↔GPU synchronization** between the app’s D3D12 work and the EP’s compute stream,
without sharing command queues and without exposing GPU virtual addresses.

## 2. Requirements (MVP)
- **Zero-copy tensor views** over imported external GPU allocations.
- **D3D12 sharing** via Win32 shared handles:
  - committed `ID3D12Resource` (buffer) shared handles
  - `ID3D12Heap` shared handles
- **Explicit synchronization** using imported external semaphores:
  - D3D12 timeline fence shared handle + 64-bit value wait/signal
- **EP capability discovery** (`CanImport*`) with clear failure: `ORT_NOT_IMPLEMENTED`.
- **Device identity matching** so apps can select an EP device compatible with a given D3D12 adapter.
- **Fits existing ORT primitives**: `OrtEpDevice`, `OrtEpFactory`, `OrtSyncStream`, opaque handles, explicit `Release*`.
- **Output tensors supported** (write access) when EP supports it.

## 3. Non-goals (MVP)
- **No queue sharing requirement** — app and EP keep independent queues/streams.
- **No GPU virtual addresses** in the public contract.
- **No implicit fallback copies**: apps choose behavior; EPs return `ORT_NOT_IMPLEMENTED` when unsupported.
- **Texture/surface-native tensors** are **vNext** — MVP targets resources that can be treated as dense, linearly-addressable tensor storage (D3D12 buffers / heaps).
- **Alternatives to Windows D3D12** — This proposal targets Windows D3D12 shared handles exclusively. The extensible `OrtExternalMemoryHandleType` and `OrtExternalSemaphoreType` enums accommodate additional handle types in future proposals without breaking API compatibility.

## 4. Design overview
The public API introduces:
- `OrtExternalResourceImporter`: A **capability object** for external resource interop (memory + semaphore operations). Named following the agent-noun pattern (`Allocator`, `Importer`).
- `OrtExternalMemoryHandle`: EP-imported view of a shared external allocation.
- `OrtExternalSemaphoreHandle`: EP-imported view of a shared external semaphore.

ORT routes calls to the selected plugin EP via a single extension to `OrtEpFactory`:
- `CreateExternalResourceImporterForDevice()` — returns `OrtExternalResourceImporterImpl*` (same pattern as `CreateSyncStreamForDevice`).

The capability object bundles all memory and semaphore operations. EPs that don't support external resources either leave the factory function pointer null or return `ORT_NOT_IMPLEMENTED` from it.

**Key point**: the application only passes **opaque OS handles** (`HANDLE`) plus sizes/offsets and synchronization values. The EP may internally map these to CUDA/HIP/etc, but ORT does not expose those internals.

## 5. Public API (onnxruntime_c_api.h)
### 5.1 Types
```c
typedef enum OrtExternalMemoryHandleType {
  ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE = 0,  /* shared HANDLE from CreateSharedHandle(resource) */
  ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP     = 1,  /* shared HANDLE from CreateSharedHandle(heap) */
} OrtExternalMemoryHandleType;

typedef enum OrtExternalMemoryAccessMode {
  ORT_EXTERNAL_MEMORY_ACCESS_READ_WRITE = 0,
  ORT_EXTERNAL_MEMORY_ACCESS_READ_ONLY  = 1,
  ORT_EXTERNAL_MEMORY_ACCESS_WRITE_ONLY = 2,
} OrtExternalMemoryAccessMode;

#define ORT_EXTERNAL_MEMORY_DESCRIPTOR_VERSION 1
typedef struct OrtExternalMemoryDescriptor {
  uint32_t version;                        /* Must be ORT_EXTERNAL_MEMORY_DESCRIPTOR_VERSION */
  OrtExternalMemoryHandleType handle_type;
  void* native_handle;      /* Windows HANDLE */
  size_t size_bytes;        /* total bytes in allocation */
  size_t offset_bytes;      /* base offset into allocation */
  OrtExternalMemoryAccessMode access_mode;
} OrtExternalMemoryDescriptor;

ORT_RUNTIME_CLASS(ExternalMemoryHandle);
/* EP-owned implementation type; lives in the plugin EP API header */
ORT_RUNTIME_CLASS(ExternalMemoryHandleImpl);

typedef enum OrtExternalSemaphoreType {
  ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE = 0, /* shared HANDLE from CreateSharedHandle(fence) */
} OrtExternalSemaphoreType;

#define ORT_EXTERNAL_SEMAPHORE_DESCRIPTOR_VERSION 1
typedef struct OrtExternalSemaphoreDescriptor {
  uint32_t version;        /* Must be ORT_EXTERNAL_SEMAPHORE_DESCRIPTOR_VERSION */
  OrtExternalSemaphoreType type;
  void* native_handle;  /* Windows HANDLE */
} OrtExternalSemaphoreDescriptor;

ORT_RUNTIME_CLASS(ExternalSemaphoreHandle);
/* EP-owned implementation type; lives in the plugin EP API header */
ORT_RUNTIME_CLASS(ExternalSemaphoreHandleImpl);

/* Capability object for external resource interop (agent-noun pattern like Allocator) */
ORT_RUNTIME_CLASS(ExternalResourceImporter);
ORT_RUNTIME_CLASS(ExternalResourceImporterImpl);

#define ORT_EXTERNAL_TENSOR_DESCRIPTOR_VERSION 1
typedef struct OrtExternalTensorDescriptor {
  uint32_t version;        /* Must be ORT_EXTERNAL_TENSOR_DESCRIPTOR_VERSION */
  ONNXTensorElementDataType element_type;
  const int64_t* shape;
  size_t rank;
  size_t offset_bytes; /* optional: view offset within imported memory (default 0) */
} OrtExternalTensorDescriptor;
```

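Because `OrtExternalTensorDescriptor::offset_bytes` selects a view inside an imported allocation, an implementation would presumably need to confirm that the view fits before creating the tensor. The following is a minimal standalone sketch of such a bounds check, not part of the proposed API: `TensorViewFits` and its parameters are hypothetical names, and the element size is passed in directly rather than derived from `ONNXTensorElementDataType`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: returns true when a dense tensor of the given shape,
// starting view_offset_bytes into an imported allocation of
// allocation_size_bytes, lies entirely inside that allocation.
// Negative dimensions and size overflow are rejected rather than wrapped.
inline bool TensorViewFits(const int64_t* shape, size_t rank,
                           size_t element_size_bytes,
                           size_t view_offset_bytes,
                           size_t allocation_size_bytes) {
  if (element_size_bytes == 0 || (rank > 0 && shape == nullptr)) return false;
  bool empty = false;
  for (size_t i = 0; i < rank; ++i) {
    if (shape[i] < 0) return false;  // unknown dims cannot alias fixed memory
    if (shape[i] == 0) empty = true;
  }
  if (view_offset_bytes > allocation_size_bytes) return false;
  if (empty) return true;  // zero elements always fit

  size_t total = element_size_bytes;  // rank 0 is a scalar: one element
  for (size_t i = 0; i < rank; ++i) {
    const uint64_t dim = static_cast<uint64_t>(shape[i]);
    if (dim > std::numeric_limits<size_t>::max() / total) return false;
    total *= static_cast<size_t>(dim);
  }
  return total <= allocation_size_bytes - view_offset_bytes;
}
```

For example, a float tensor of shape `{1, 3, 224, 224}` needs 602112 bytes, so it fits an allocation of exactly that size only when the view offset is 0.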
### 5.2 Functions

The public API mirrors the consolidated EP plugin pattern. ORT wraps `OrtExternalResourceImporterImpl` in an opaque `OrtExternalResourceImporter` handle.

```c
/* Create the external resource importer for a specific EP device */
ORT_API2_STATUS(CreateExternalResourceImporterForDevice,
  _In_ const OrtEpDevice* ep_device,
  _Outptr_ OrtExternalResourceImporter** out_importer);

ORT_API(void, ReleaseExternalResourceImporter,
  _Frees_ptr_opt_ OrtExternalResourceImporter* importer);

/* Memory operations */
ORT_API2_STATUS(ExternalResourceImporter_CanImportMemory,
  _In_ const OrtExternalResourceImporter* importer,
  _In_ OrtExternalMemoryHandleType handle_type,
  _Out_ bool* out_supported);

ORT_API2_STATUS(ExternalResourceImporter_ImportMemory,
  _In_ OrtExternalResourceImporter* importer,
  _In_ const OrtExternalMemoryDescriptor* desc,
  _Outptr_ OrtExternalMemoryHandle** out_handle);

ORT_API(void, ReleaseExternalMemoryHandle,
  _Frees_ptr_opt_ OrtExternalMemoryHandle* handle);

ORT_API2_STATUS(ExternalResourceImporter_CreateTensorFromMemory,
  _In_ OrtExternalResourceImporter* importer,
  _In_ const OrtExternalMemoryHandle* mem_handle,
  _In_ const OrtExternalTensorDescriptor* tensor_desc,
  _In_opt_ const OrtMemoryInfo* tensor_location,
  _Outptr_ OrtValue** out_tensor);

/* Semaphore operations */
ORT_API2_STATUS(ExternalResourceImporter_CanImportSemaphore,
  _In_ const OrtExternalResourceImporter* importer,
  _In_ OrtExternalSemaphoreType type,
  _Out_ bool* out_supported);

ORT_API2_STATUS(ExternalResourceImporter_ImportSemaphore,
  _In_ OrtExternalResourceImporter* importer,
  _In_ const OrtExternalSemaphoreDescriptor* desc,
  _Outptr_ OrtExternalSemaphoreHandle** out_handle);

ORT_API(void, ReleaseExternalSemaphoreHandle,
  _Frees_ptr_opt_ OrtExternalSemaphoreHandle* handle);

ORT_API2_STATUS(ExternalResourceImporter_WaitSemaphore,
  _In_ OrtExternalResourceImporter* importer,
  _In_ OrtExternalSemaphoreHandle* semaphore_handle,
  _In_ OrtSyncStream* stream,
  _In_ uint64_t value);

ORT_API2_STATUS(ExternalResourceImporter_SignalSemaphore,
  _In_ OrtExternalResourceImporter* importer,
  _In_ OrtExternalSemaphoreHandle* semaphore_handle,
  _In_ OrtSyncStream* stream,
  _In_ uint64_t value);

/* Session device query for outputs (mirrors SessionGetEpDeviceForInputs) */
ORT_API2_STATUS(SessionGetEpDeviceForOutputs, _In_ const OrtSession* session,
                _Out_writes_(num_outputs) const OrtEpDevice** outputs_ep_devices,
                _In_ size_t num_outputs);

/* Associate an OrtSyncStream with RunOptions for async Run */
ORT_API2_STATUS(RunOptions_SetSyncStream,
  _Inout_ OrtRunOptions* run_options,
  _In_ OrtSyncStream* stream);
```

### 5.3 Session and RunOptions extensions

**`SessionGetEpDeviceForOutputs`**: Mirrors the existing `SessionGetEpDeviceForInputs` (ORT 1.23). Returns the EP device assigned to each output, enabling applications to validate that outputs will be placed on the expected device for external resource sharing.

**`RunOptions_SetSyncStream`**: Associates an `OrtSyncStream` with `RunOptions`. When `Run()` or `RunWithBinding()` is called, the EP uses this stream for execution, enabling proper synchronization with imported external semaphores. This approach:
- Works with both `Run()` and `RunWithBinding()` — no IOBinding requirement
- Follows the existing `RunOptions` setter pattern (`RunOptionsSetRunTag`, etc.)
- Allows different Run calls to use different streams for concurrent inference
- Integrates cleanly with the external semaphore wait/signal pattern

### 5.4 Device identity (adapter matching)
ORT already discovers Windows GPU devices and populates `OrtHardwareDevice` metadata during device discovery.

In particular, ORT’s Windows device discovery currently emits a `"LUID"` metadata entry for GPU/NPU devices (a 64-bit value serialized as a decimal string). Applications can read it via `HardwareDevice_Metadata()` and compare it to their D3D12 adapter `LUID`.

To avoid string parsing in client code, a small convenience API (e.g., `HardwareDevice_GetAdapterLuidLowHigh`) can be added later without changing the interop design.

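Until such a convenience API exists, client code has to parse the `"LUID"` string itself. A minimal standalone sketch of that comparison follows. `AdapterLuid` is a stand-in for the Windows `LUID` struct so the example compiles without `<windows.h>`, `PackLuid` and `LuidMatches` are hypothetical helper names, and the packing (HighPart in the upper 32 bits) is an assumption that should be checked against ORT's device discovery code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Stand-in for the Windows LUID struct (DWORD LowPart; LONG HighPart).
struct AdapterLuid {
  uint32_t LowPart;
  int32_t HighPart;
};

// Assumption: the serialized 64-bit value holds HighPart in the upper 32 bits.
inline uint64_t PackLuid(const AdapterLuid& luid) {
  return (static_cast<uint64_t>(static_cast<uint32_t>(luid.HighPart)) << 32) |
         luid.LowPart;
}

// Hypothetical helper: true when the decimal "LUID" metadata string names the
// same adapter. Rejects null, empty, and out-of-range strings, and strings
// with trailing non-numeric characters.
inline bool LuidMatches(const char* luid_str, const AdapterLuid& adapter) {
  if (luid_str == nullptr || *luid_str == '\0') return false;
  char* end = nullptr;
  errno = 0;
  const unsigned long long parsed = std::strtoull(luid_str, &end, 10);
  if (errno != 0 || *end != '\0') return false;
  return static_cast<uint64_t>(parsed) == PackLuid(adapter);
}
```

In an application, the `AdapterLuid` fields would be filled from `ID3D12Device::GetAdapterLuid()` and the string from the `"LUID"` metadata entry.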
## 6. EP plugin API (onnxruntime_ep_c_api.h)

### 6.1 Capability object pattern

The existing `OrtEpFactory` uses a "**create capability object**" pattern for optional features:

| Factory Method | Returns | Operations bundled in object |
|---------------|---------|------------------------------|
| `CreateAllocator()` | `OrtAllocator*` | Alloc, Free |
| `CreateDataTransfer()` | `OrtDataTransferImpl*` | Copy, CanCopy |
| `CreateSyncStreamForDevice()` | `OrtSyncStreamImpl*` | Flush, GetHandle |

Following this pattern, the API adds **one** factory method that returns a capability object:

```c
/* Add to OrtEpFactory (version-gated) */
OrtStatus*(ORT_API_CALL* CreateExternalResourceImporterForDevice)(
  _In_ OrtEpFactory* this_ptr,
  _In_ const OrtMemoryDevice* memory_device,
  _Outptr_ OrtExternalResourceImporterImpl** out_importer);
```

### 6.2 OrtExternalResourceImporterImpl interface

The returned `OrtExternalResourceImporterImpl` owns operations for memory and semaphore interop:

```c
struct OrtExternalResourceImporterImpl {
  /* ──────────────── Memory operations (stream-independent) ──────────────── */
  
  bool(ORT_API_CALL* CanImportMemory)(
    _In_ const OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalMemoryHandleType handle_type);

  OrtStatus*(ORT_API_CALL* ImportMemory)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ const OrtExternalMemoryDescriptor* desc,
    _Outptr_ OrtExternalMemoryHandleImpl** out_handle);

  void(ORT_API_CALL* ReleaseMemory)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalMemoryHandleImpl* handle);

  OrtStatus*(ORT_API_CALL* CreateTensorFromMemory)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ const OrtExternalMemoryHandleImpl* mem_handle,
    _In_ const OrtExternalTensorDescriptor* tensor_desc,
    _Outptr_ OrtValue** out_tensor);

  /* ──────────────── Semaphore operations (require stream) ──────────────── */
  
  bool(ORT_API_CALL* CanImportSemaphore)(
    _In_ const OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalSemaphoreType type);

  OrtStatus*(ORT_API_CALL* ImportSemaphore)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ const OrtExternalSemaphoreDescriptor* desc,
    _Outptr_ OrtExternalSemaphoreHandleImpl** out_handle);

  void(ORT_API_CALL* ReleaseSemaphore)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalSemaphoreHandleImpl* handle);

  OrtStatus*(ORT_API_CALL* WaitSemaphore)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalSemaphoreHandleImpl* handle,
    _In_ OrtSyncStream* stream,
    _In_ uint64_t value);

  OrtStatus*(ORT_API_CALL* SignalSemaphore)(
    _In_ OrtExternalResourceImporterImpl* this_ptr,
    _In_ OrtExternalSemaphoreHandleImpl* handle,
    _In_ OrtSyncStream* stream,
    _In_ uint64_t value);

  /* ──────────────── Release the capability object itself ──────────────── */
  
  void(ORT_API_CALL* Release)(
    _In_ OrtExternalResourceImporterImpl* this_ptr);
};
```

### 6.3 Dependency handling

The consolidated design makes dependencies explicit:

| Dependency | How it's handled |
|------------|------------------|
| Semaphore wait/signal requires stream | `WaitSemaphore`/`SignalSemaphore` take `OrtSyncStream*`; EP can return `ORT_NOT_IMPLEMENTED` if `!IsStreamAware()` |
| Memory import is stream-independent | `ImportMemory` / `CreateTensorFromMemory` don't take a stream; usable with sync `Run()` |
| EP doesn't support external resources | `CreateExternalResourceImporterForDevice` returns null or `ORT_NOT_IMPLEMENTED` |

**Capability checks, in order:**
1. Check that the factory's `CreateExternalResourceImporterForDevice` pointer is non-null: the EP has the feature
2. Call `CanImportMemory(D3D12_RESOURCE)`: the EP supports this memory type
3. Call `CanImportSemaphore(D3D12_FENCE)`: the EP supports fence sync (implies stream-aware)

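The three checks can be sketched as a single probe function. This is a simplified standalone illustration, not the proposed headers: `EpFactory`, `ImporterImpl`, `InteropLevel`, and `ProbeInterop` are hypothetical stand-ins with trimmed signatures, and the demo importers are static objects so the sketch skips the `Release` call that real code would need.

```cpp
#include <cassert>

// Simplified stand-ins for the proposed types; only the members used here.
enum MemoryHandleType { kD3D12Resource = 0, kD3D12Heap = 1 };
enum SemaphoreType { kD3D12Fence = 0 };

struct ImporterImpl {
  bool (*CanImportMemory)(const ImporterImpl*, MemoryHandleType);
  bool (*CanImportSemaphore)(const ImporterImpl*, SemaphoreType);
};

struct EpFactory {
  // Left null by EPs that do not implement the feature.
  ImporterImpl* (*CreateExternalResourceImporterForDevice)(EpFactory*);
};

enum class InteropLevel { kNone, kMemoryOnly, kMemoryAndFence };

// Hypothetical probe that follows the three checks in order.
inline InteropLevel ProbeInterop(EpFactory* factory) {
  // 1. Does the EP expose the factory method at all?
  if (factory == nullptr ||
      factory->CreateExternalResourceImporterForDevice == nullptr) {
    return InteropLevel::kNone;
  }
  ImporterImpl* importer =
      factory->CreateExternalResourceImporterForDevice(factory);
  if (importer == nullptr) return InteropLevel::kNone;
  // 2. Can it import committed D3D12 resources?
  if (!importer->CanImportMemory(importer, kD3D12Resource)) {
    return InteropLevel::kNone;
  }
  // 3. Can it import D3D12 fences for GPU-side synchronization?
  if (!importer->CanImportSemaphore(importer, kD3D12Fence)) {
    return InteropLevel::kMemoryOnly;
  }
  return InteropLevel::kMemoryAndFence;
}

// Demo EPs: one with full support, one that imports memory but not fences.
inline bool AnyMemory(const ImporterImpl*, MemoryHandleType) { return true; }
inline bool AnyFence(const ImporterImpl*, SemaphoreType) { return true; }
inline bool NoFence(const ImporterImpl*, SemaphoreType) { return false; }
inline ImporterImpl* MakeFull(EpFactory*) {
  static ImporterImpl impl{AnyMemory, AnyFence};
  return &impl;
}
inline ImporterImpl* MakeMemoryOnly(EpFactory*) {
  static ImporterImpl impl{AnyMemory, NoFence};
  return &impl;
}
```

An application could treat `kMemoryOnly` as a signal to keep zero-copy tensors but synchronize on the CPU, and `kNone` as a signal to use its copy-based path.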
### 6.4 ORT routing layer (public API → EP factory)

The ORT core provides the glue between the public C API and the EP factory. This is analogous to how `CreateSyncStreamForEpDevice` (public) routes to `OrtEpFactory::CreateSyncStreamForDevice` (EP plugin).

**Type mapping:**
```
Public API types          →  EP Plugin types
─────────────────────────────────────────────
OrtEpDevice*              →  OrtMemoryDevice*  (extracted by ORT)
OrtExternalResourceImporter*  →  OrtExternalResourceImporterImpl*  (wrapped by ORT)
OrtExternalMemoryHandle*      →  OrtExternalMemoryHandleImpl*  (wrapped by ORT)
OrtExternalSemaphoreHandle*   →  OrtExternalSemaphoreHandleImpl*  (wrapped by ORT)
```

**Call flow for `CreateExternalResourceImporterForDevice`:**

```
┌─────────────────────────────────────────────────────────────────────────────┐
│  Client Application                                                         │
│  ─────────────────                                                          │
│  OrtApi->CreateExternalResourceImporterForDevice(epDevice, &importer)       │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  ORT Core (onnxruntime_c_api.cc)                                            │
│  ───────────────────────────────                                            │
│  1. Extract OrtMemoryDevice* from OrtEpDevice*                              │
│  2. Look up OrtEpFactory* for the EP that owns this device                  │
│  3. Check factory->CreateExternalResourceImporterForDevice != nullptr       │
│  4. Call factory->CreateExternalResourceImporterForDevice(memoryDevice,     │
│                                                           &implPtr)         │
│  5. Wrap OrtExternalResourceImporterImpl* in OrtExternalResourceImporter*   │
│  6. Return wrapped handle to client                                         │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  EP Factory (e.g., NvTensorRtRtxEpFactory)                                  │
│  ─────────────────────────────────────────                                  │
│  CreateExternalResourceImporterForDevice(memoryDevice, &implPtr):           │
│    1. Validate memoryDevice matches EP's supported devices                  │
│    2. Create NvTrtRtxExternalResourceImporterImpl instance                  │
│    3. Initialize CUDA context for the device                                │
│    4. Return impl pointer                                                   │
└─────────────────────────────────────────────────────────────────────────────┘
```

**Call flow for memory/semaphore operations:**

```
┌─────────────────────────────────────────────────────────────────────────────┐
│  Client Application                                                         │
│  ─────────────────                                                          │
│  OrtApi->ExternalResourceImporter_ImportMemory(importer, &desc, &memHandle) │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  ORT Core                                                                   │
│  ────────                                                                   │
│  1. Unwrap OrtExternalResourceImporter* → OrtExternalResourceImporterImpl*  │
│  2. Call impl->ImportMemory(impl, desc, &implHandle)                        │
│  3. Wrap OrtExternalMemoryHandleImpl* in OrtExternalMemoryHandle*           │
│  4. Return wrapped handle to client                                         │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  EP Implementation (e.g., NvTrtRtxExternalResourceImporterImpl)             │
│  ──────────────────────────────────────────────────────────────             │
│  ImportMemory(desc, &implHandle):                                           │
│    1. Map ORT handle type → CUDA handle type                                │
│    2. Call cuImportExternalMemory() with D3D12 shared handle                │
│    3. Call cuExternalMemoryGetMappedBuffer() to get device pointer          │
│    4. Store CUexternalMemory + CUdeviceptr in impl handle                   │
│    5. Return impl handle                                                    │
└─────────────────────────────────────────────────────────────────────────────┘
```

**Key design points:**
- **ORT owns the wrapper lifetime**: The public handles (`OrtExternalResourceImporter*`, etc.) are ORT-allocated wrappers that hold a pointer to the EP-allocated impl.
- **EP owns the impl lifetime**: When the client calls `ReleaseExternalResourceImporter`, ORT calls `impl->Release(impl)` to let the EP clean up.
- **Type safety via opaque handles**: Clients cannot accidentally pass an impl pointer; only ORT can unwrap the public handle.
- **EP-agnostic public API**: The public API has no EP-specific types; all EP details are hidden behind the impl interface.

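The wrapper/impl split in these design points can be illustrated with a small standalone sketch. `Importer`, `ImporterImpl`, and `ReleaseImporter` are simplified stand-ins rather than the proposed ORT types, and the `release_count` field exists only to make the routed call observable.

```cpp
#include <cassert>

// Stand-in for the EP-owned impl: a C-style struct of function pointers.
struct ImporterImpl {
  void (*Release)(ImporterImpl* this_ptr);
  int* release_count;  // demo-only bookkeeping
};

// Stand-in for the ORT-owned public wrapper. Clients would only ever hold an
// opaque pointer to this type; the impl pointer is never handed out.
class Importer {
 public:
  explicit Importer(ImporterImpl* impl) : impl_(impl) {}
  ~Importer() {
    // Let the EP clean up its own object.
    if (impl_ != nullptr && impl_->Release != nullptr) impl_->Release(impl_);
  }
  Importer(const Importer&) = delete;
  Importer& operator=(const Importer&) = delete;

 private:
  ImporterImpl* impl_;
};

// What a public release entry point could look like: destroying the wrapper
// routes to the EP's Release. Passing null is a no-op.
inline void ReleaseImporter(Importer* importer) { delete importer; }
```

Keeping the impl pointer private to the wrapper is what prevents clients from passing an EP object where a public handle is expected.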

## 7. Justification (fit with existing ORT patterns)

The proposed API extends `OrtEpFactory` with **one** optional factory method, following the same "create capability object" pattern used for existing optional EP features:

| Existing OrtEpFactory capability | Pattern |
|----------------------------------|--------|
| `CreateAllocator()` | Returns `OrtAllocator*` with bundled Alloc/Free operations |
| `CreateDataTransfer()` | Returns `OrtDataTransferImpl*` with bundled Copy operations |
| `CreateSyncStreamForDevice()` | Returns `OrtSyncStreamImpl*` with bundled stream operations |

| Proposed capability | Pattern |
|---------------------|--------|
| `CreateExternalResourceImporterForDevice()` | Returns `OrtExternalResourceImporterImpl*` with bundled memory/semaphore import operations |

**Key alignments**:
- **Agent-noun naming**: `Importer` follows the same pattern as `Allocator`.
- **ForDevice suffix**: Matches `CreateSyncStreamForDevice`; resources are inherently device-scoped.
- **Single factory entry point**: Check one pointer for capability, not 8+.
- **Capability object owns operations**: All memory and semaphore ops are methods on the returned object.
- **EP returns `*Impl` objects**: ORT wraps them in public handles (same as `OrtSyncStreamImpl` → `OrtSyncStream`).
- **Explicit release**: `Release` method on the capability object; separate `Release*Handle` for imported resources.
- **Dependency is explicit**: Semaphore wait/signal take `OrtSyncStream*`; EP can return `ORT_NOT_IMPLEMENTED` if not stream-aware.

## 8. Client calling pattern (MVP code)

Below is concrete C++ code demonstrating the MVP usage flow. It uses C++20 designated initializers; error handling is abbreviated for clarity, and helpers such as `CreateD3D12Buffer` are app-defined.

### 8.1 Setup (once per EP device)

The `OrtExternalResourceImporter` is created per EP device (not per session). Multiple sessions using the same EP device can share the same importer. Imported memory and semaphore handles can be reused across sessions.

```cpp
// ─────────────────────────────────────────────────────────────────────────────
// 1. App creates D3D12 resources with sharing enabled
// ─────────────────────────────────────────────────────────────────────────────
ComPtr<ID3D12Resource> inputBuffer, outputBuffer;
ComPtr<ID3D12Fence> fence;
HANDLE inputHandle, outputHandle, fenceHandle;

// Create buffers with D3D12_HEAP_FLAG_SHARED
CreateD3D12Buffer(d3d12Device, inputSize, D3D12_HEAP_FLAG_SHARED, &inputBuffer);
CreateD3D12Buffer(d3d12Device, outputSize, D3D12_HEAP_FLAG_SHARED, &outputBuffer);
d3d12Device->CreateFence(0, D3D12_FENCE_FLAG_SHARED, IID_PPV_ARGS(&fence));

// Create shared handles (NT handles)
d3d12Device->CreateSharedHandle(inputBuffer.Get(), nullptr, GENERIC_ALL, nullptr, &inputHandle);
d3d12Device->CreateSharedHandle(outputBuffer.Get(), nullptr, GENERIC_ALL, nullptr, &outputHandle);
d3d12Device->CreateSharedHandle(fence.Get(), nullptr, GENERIC_ALL, nullptr, &fenceHandle);

// ─────────────────────────────────────────────────────────────────────────────
// 2. Find an ORT EP device matching the D3D12 adapter
// ─────────────────────────────────────────────────────────────────────────────
const OrtEpDevice* epDevice = nullptr;
const OrtEpDevice* const* devices = nullptr;
size_t numDevices = 0;
OrtApi->GetEpDevices(env, &devices, &numDevices);

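LUID adapterLuid = d3d12Device->GetAdapterLuid();
for (size_t i = 0; i < numDevices; ++i) {
    const OrtHardwareDevice* hw = OrtApi->EpDevice_Device(devices[i]);
    const OrtKeyValuePairs* metadata = OrtApi->HardwareDevice_Metadata(hw);
    const char* luidStr = OrtApi->GetKeyValue(metadata, "LUID");
    if (!luidStr) continue;
    // ParseLuid is an app-defined helper (decimal string -> LUID).
    // LUID has no operator==, so compare its fields.
    const LUID candidate = ParseLuid(luidStr);
    if (candidate.LowPart == adapterLuid.LowPart && candidate.HighPart == adapterLuid.HighPart) {
        epDevice = devices[i];
        break;
    }
}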

// ─────────────────────────────────────────────────────────────────────────────
// 3. Create the external resource importer for this device
// ─────────────────────────────────────────────────────────────────────────────
OrtExternalResourceImporter* importer = nullptr;
OrtStatus* status = OrtApi->CreateExternalResourceImporterForDevice(epDevice, &importer);
if (status != nullptr) {
    // EP doesn't support external resource interop
    // Fall back to copy-based path
    OrtApi->ReleaseStatus(status);
}

// ─────────────────────────────────────────────────────────────────────────────
// 4. Check capabilities
// ─────────────────────────────────────────────────────────────────────────────
bool canImportD3D12Resource = false, canImportFence = false;
OrtApi->ExternalResourceImporter_CanImportMemory(importer, ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE, &canImportD3D12Resource);
OrtApi->ExternalResourceImporter_CanImportSemaphore(importer, ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE, &canImportFence);

if (!canImportD3D12Resource || !canImportFence) {
    // EP supports external resources but not this handle type
}

// ─────────────────────────────────────────────────────────────────────────────
// 5. Import memory and semaphore
// ─────────────────────────────────────────────────────────────────────────────
OrtExternalMemoryDescriptor inputDesc = {
    .version = ORT_EXTERNAL_MEMORY_DESCRIPTOR_VERSION,
    .handle_type = ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE,
    .native_handle = inputHandle,
    .size_bytes = inputSize,
    .offset_bytes = 0,
    .access_mode = ORT_EXTERNAL_MEMORY_ACCESS_READ_ONLY
};
OrtExternalMemoryDescriptor outputDesc = {
    .version = ORT_EXTERNAL_MEMORY_DESCRIPTOR_VERSION,
    .handle_type = ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE,
    .native_handle = outputHandle,
    .size_bytes = outputSize,
    .offset_bytes = 0,
    .access_mode = ORT_EXTERNAL_MEMORY_ACCESS_WRITE_ONLY
};
OrtExternalSemaphoreDescriptor semDesc = {
    .version = ORT_EXTERNAL_SEMAPHORE_DESCRIPTOR_VERSION,
    .type = ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE,
    .native_handle = fenceHandle
};

OrtExternalMemoryHandle* inputMem = nullptr;
OrtExternalMemoryHandle* outputMem = nullptr;
OrtExternalSemaphoreHandle* sem = nullptr;

OrtApi->ExternalResourceImporter_ImportMemory(importer, &inputDesc, &inputMem);
OrtApi->ExternalResourceImporter_ImportMemory(importer, &outputDesc, &outputMem);
OrtApi->ExternalResourceImporter_ImportSemaphore(importer, &semDesc, &sem);

// ─────────────────────────────────────────────────────────────────────────────
// 6. Create tensors aliasing the imported memory
// ─────────────────────────────────────────────────────────────────────────────
int64_t inputShape[] = {1, 3, 224, 224};
int64_t outputShape[] = {1, 1000};

OrtExternalTensorDescriptor inputTensorDesc = {
    .version = ORT_EXTERNAL_TENSOR_DESCRIPTOR_VERSION,
    .element_type = ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT,
    .shape = inputShape,
    .rank = 4,
    .offset_bytes = 0
};
OrtExternalTensorDescriptor outputTensorDesc = {
    .version = ORT_EXTERNAL_TENSOR_DESCRIPTOR_VERSION,
    .element_type = ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT,
    .shape = outputShape,
    .rank = 2,
    .offset_bytes = 0
};

OrtValue* inputTensor = nullptr;
OrtValue* outputTensor = nullptr;
OrtApi->ExternalResourceImporter_CreateTensorFromMemory(importer, inputMem, &inputTensorDesc, nullptr, &inputTensor);
OrtApi->ExternalResourceImporter_CreateTensorFromMemory(importer, outputMem, &outputTensorDesc, nullptr, &outputTensor);

// ─────────────────────────────────────────────────────────────────────────────
// 7. Create sync stream for this EP device
// ─────────────────────────────────────────────────────────────────────────────
OrtSyncStream* stream = nullptr;
OrtApi->CreateSyncStreamForEpDevice(epDevice, nullptr, &stream);
```

### 8.2 Session setup and device validation

```cpp
// ─────────────────────────────────────────────────────────────────────────────
// Create session with the EP device
// ─────────────────────────────────────────────────────────────────────────────
OrtSessionOptions* sessionOptions = nullptr;
OrtApi->CreateSessionOptions(&sessionOptions);

const OrtEpDevice* deviceArray[] = {epDevice};
OrtApi->SessionOptionsAppendExecutionProvider_V2(sessionOptions, env, deviceArray, 1, nullptr, nullptr, 0);

OrtSession* session = nullptr;
OrtApi->CreateSession(env, modelPath, sessionOptions, &session);

// ─────────────────────────────────────────────────────────────────────────────
// Validate device assignment for inputs and outputs
// ─────────────────────────────────────────────────────────────────────────────
size_t numInputs = 0, numOutputs = 0;
OrtApi->SessionGetInputCount(session, &numInputs);
OrtApi->SessionGetOutputCount(session, &numOutputs);

// Validate inputs are assigned to expected EP device
std::vector<const OrtEpDevice*> inputDevices(numInputs);
OrtApi->SessionGetEpDeviceForInputs(session, inputDevices.data(), numInputs);
for (size_t i = 0; i < numInputs; ++i) {
    if (inputDevices[i] != epDevice) {
        // Input is not on expected device - may need fallback
    }
}

// Validate outputs are assigned to expected EP device
std::vector<const OrtEpDevice*> outputDevices(numOutputs);
OrtApi->SessionGetEpDeviceForOutputs(session, outputDevices.data(), numOutputs);
for (size_t i = 0; i < numOutputs; ++i) {
    if (outputDevices[i] != epDevice) {
        // Output is not on expected device - may need fallback
    }
}
```

### 8.3 Per-frame execution with Run() (no IOBinding)

This example shows the simple `Run()` pattern with `RunOptions_SetSyncStream`.

```cpp
// ─────────────────────────────────────────────────────────────────────────────
// Per-frame: D3D12 upload → ORT inference → D3D12 consume
// ─────────────────────────────────────────────────────────────────────────────
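// Fence values must exceed the fence's initial value (0), even when
// currentFrame starts at 0: frame N uses 2N+1 (upload done) and 2N+2 (inference done).
uint64_t fenceValue = currentFrame * 2 + 1;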

// App: copy input data to inputBuffer via D3D12 command list
d3d12CommandQueue->ExecuteCommandLists(1, &uploadCmdList);
d3d12CommandQueue->Signal(fence.Get(), fenceValue);

// ORT: wait for D3D12 upload to complete
OrtApi->ExternalResourceImporter_WaitSemaphore(importer, sem, stream, fenceValue);

// Set up RunOptions with the sync stream
OrtRunOptions* runOptions = nullptr;
OrtApi->CreateRunOptions(&runOptions);
OrtApi->RunOptions_SetSyncStream(runOptions, stream);

// Run inference using simple Run() API (no IOBinding required)
const char* inputNames[] = {"input"};
const char* outputNames[] = {"output"};
const OrtValue* inputs[] = {inputTensor};
OrtValue* outputs[] = {outputTensor};  // Pre-allocated external tensor

OrtApi->Run(session, runOptions, inputNames, inputs, 1, outputNames, 1, outputs);

// Signal completion so D3D12 can consume the output
OrtApi->ExternalResourceImporter_SignalSemaphore(importer, sem, stream, fenceValue + 1);

// App: wait for ORT inference to complete, then use outputBuffer
fence->SetEventOnCompletion(fenceValue + 1, waitEvent);
WaitForSingleObject(waitEvent, INFINITE);
// ... read back or further process outputBuffer

OrtApi->ReleaseRunOptions(runOptions);
```

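A D3D12 timeline fence only advances, and a wait on a value at or below the current value completes immediately. One way to keep the two per-frame values strictly increasing and above an initial fence value of 0 is sketched below; `FrameFenceValues` and `FenceValuesForFrame` are hypothetical names, and the sketch assumes frame indices start at 0.

```cpp
#include <cassert>
#include <cstdint>

struct FrameFenceValues {
  uint64_t upload_done;     // signaled by the D3D12 queue, waited on by ORT
  uint64_t inference_done;  // signaled by the ORT stream, waited on by the app
};

// Assumes the fence was created with initial value 0 and frame_index starts
// at 0, so the first value ever signaled is 1.
constexpr FrameFenceValues FenceValuesForFrame(uint64_t frame_index) {
  return FrameFenceValues{frame_index * 2 + 1, frame_index * 2 + 2};
}
```

With this scheme frame 0 uses values 1 and 2, frame 1 uses 3 and 4, and no wait can be satisfied by the fence's initial value.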
### 8.4 Per-frame execution with IOBinding (alternative)

This example shows the IOBinding pattern, which may be preferred for complex input/output scenarios.

```cpp
// ─────────────────────────────────────────────────────────────────────────────
// Setup IOBinding (can be reused across frames)
// ─────────────────────────────────────────────────────────────────────────────
OrtIoBinding* ioBinding = nullptr;
OrtApi->CreateIoBinding(session, &ioBinding);
OrtApi->BindInput(ioBinding, "input", inputTensor);
OrtApi->BindOutput(ioBinding, "output", outputTensor);

// ─────────────────────────────────────────────────────────────────────────────
// Per-frame: D3D12 upload → ORT inference → D3D12 consume
// ─────────────────────────────────────────────────────────────────────────────
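// Fence values must exceed the fence's initial value (0), even when
// currentFrame starts at 0: frame N uses 2N+1 (upload done) and 2N+2 (inference done).
uint64_t fenceValue = currentFrame * 2 + 1;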

// App: copy input data to inputBuffer via D3D12 command list
d3d12CommandQueue->ExecuteCommandLists(1, &uploadCmdList);
d3d12CommandQueue->Signal(fence.Get(), fenceValue);

// ORT: wait for D3D12 upload to complete
OrtApi->ExternalResourceImporter_WaitSemaphore(importer, sem, stream, fenceValue);

// Set up RunOptions with the sync stream
OrtRunOptions* runOptions = nullptr;
OrtApi->CreateRunOptions(&runOptions);
OrtApi->RunOptions_SetSyncStream(runOptions, stream);

// Run inference using IOBinding
OrtApi->RunWithBinding(session, runOptions, ioBinding);

// Signal completion so D3D12 can consume the output
OrtApi->ExternalResourceImporter_SignalSemaphore(importer, sem, stream, fenceValue + 1);

// App: wait for ORT inference to complete, then use outputBuffer
fence->SetEventOnCompletion(fenceValue + 1, waitEvent);
WaitForSingleObject(waitEvent, INFINITE);

OrtApi->ReleaseRunOptions(runOptions);
```

### 8.5 Cleanup

```cpp
// Release in reverse order of creation
OrtApi->ReleaseIoBinding(ioBinding);  // If using IOBinding
OrtApi->ReleaseSession(session);
OrtApi->ReleaseSessionOptions(sessionOptions);
OrtApi->ReleaseSyncStream(stream);
OrtApi->ReleaseValue(outputTensor);   // Tensor views go before the memory they alias
OrtApi->ReleaseValue(inputTensor);
OrtApi->ReleaseExternalSemaphoreHandle(sem);
OrtApi->ReleaseExternalMemoryHandle(outputMem);
OrtApi->ReleaseExternalMemoryHandle(inputMem);
OrtApi->ReleaseExternalResourceImporter(importer);

// Close Win32 handles
CloseHandle(fenceHandle);
CloseHandle(outputHandle);
CloseHandle(inputHandle);
```
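
Reverse-order release is easy to get wrong once error paths are involved. One possible way to enforce it is a small scope guard that runs cleanup callbacks last-in, first-out. This is only a sketch: `ReleaseStack` and `DemoReleaseOrder` are hypothetical, and the demo logs names instead of calling the real `Release*` functions.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: runs registered cleanup callbacks in reverse
// registration order when it goes out of scope.
class ReleaseStack {
 public:
  ReleaseStack() = default;
  ReleaseStack(const ReleaseStack&) = delete;
  ReleaseStack& operator=(const ReleaseStack&) = delete;
  ~ReleaseStack() { ReleaseAll(); }

  void Push(std::function<void()> release) {
    releasers_.push_back(std::move(release));
  }

  void ReleaseAll() {
    while (!releasers_.empty()) {
      std::function<void()> release = std::move(releasers_.back());
      releasers_.pop_back();
      release();
    }
  }

 private:
  std::vector<std::function<void()>> releasers_;
};

// Demo with stand-ins for the real Release* calls: each callback only
// records a name so the resulting order can be inspected.
inline std::vector<std::string> DemoReleaseOrder() {
  std::vector<std::string> log;
  {
    ReleaseStack stack;
    stack.Push([&log] { log.push_back("importer"); });
    stack.Push([&log] { log.push_back("inputMem"); });
    stack.Push([&log] { log.push_back("inputTensor"); });
  }  // Destructor runs callbacks: inputTensor, inputMem, importer.
  return log;
}
```

In application code, each callback would wrap one of the calls shown above, for example a lambda that calls `OrtApi->ReleaseValue(inputTensor)`, pushed immediately after the object is created.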

## 9. Ownership & lifetime
- The app **owns** the underlying D3D12 resource/heap/fence and their shared handles.
- ORT **owns** `OrtExternalMemoryHandle` / `OrtExternalSemaphoreHandle` wrappers and EP-side imports.
- `OrtValue` tensors created from external memory are **views** and remain valid only while the underlying external memory handle remains valid.
- `OrtExternalResourceImporter` is **stateless** — it provides import capabilities for a device but holds no per-session state. Multiple sessions can independently import resources and use separate streams/semaphores concurrently.
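
The view relationship between a tensor and its external memory handle can be modeled with a non-owning reference. The sketch below is purely illustrative: `ImportedMemory` and `TensorView` are hypothetical stand-ins, and the proposal does not state that ORT detects a released handle. In the real API the application is responsible for releasing tensors before the memory handles they view.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in for an imported allocation (models OrtExternalMemoryHandle).
struct ImportedMemory {
  size_t size_bytes;
};

// Stand-in for a tensor created from external memory: it observes the
// allocation but does not keep it alive.
class TensorView {
 public:
  explicit TensorView(const std::shared_ptr<ImportedMemory>& mem) : mem_(mem) {}

  // True only while the underlying allocation still exists.
  bool IsValid() const { return !mem_.expired(); }

 private:
  std::weak_ptr<ImportedMemory> mem_;
};
```

Releasing the allocation first leaves the view dangling, which is why the cleanup sequence in section 8.5 releases tensors before memory handles.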

## 10. EP implementation expectations
### 10.1 NvTensorRtRtx (CUDA)
NvTensorRtRtx is a distinct EP (not the built-in CUDA EP). It uses CUDA Driver APIs internally for interop.

**Handle types** (from `CUexternalMemoryHandleType_enum` / `CUexternalSemaphoreHandleType_enum`):
| ORT Handle Type | CUDA Handle Type |
|-----------------|------------------|
| `ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE` | `CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE` (value 5) |
| `ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP` | `CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP` (value 4) |
| `ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE` | `CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_D3D12_FENCE` (value 4) |

**API mapping**:
| OrtExternalResourceImporterImpl method | CUDA Driver API(s) |
|-----------------------------------------|-------------------|
| `ImportMemory` | `cuImportExternalMemory()` + `cuExternalMemoryGetMappedBuffer()` |
| `ReleaseMemory` | `cuMemFree()` (mapped buffer), then `cuDestroyExternalMemory()` |
| `ImportSemaphore` | `cuImportExternalSemaphore()` |
| `ReleaseSemaphore` | `cuDestroyExternalSemaphore()` |
| `WaitSemaphore` | `cuWaitExternalSemaphoresAsync()` |
| `SignalSemaphore` | `cuSignalExternalSemaphoresAsync()` |

**Implementation notes**:
- For committed D3D12 resources (`CU_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE`), set the `CUDA_EXTERNAL_MEMORY_DEDICATED` flag in `CUDA_EXTERNAL_MEMORY_HANDLE_DESC::flags`.
- The CUDA driver does **not** take ownership of the Win32 HANDLE; the application must keep it valid until import completes.
- Wait/signal use `CUDA_EXTERNAL_SEMAPHORE_WAIT_PARAMS::params::fence::value` / `CUDA_EXTERNAL_SEMAPHORE_SIGNAL_PARAMS::params::fence::value` for the 64-bit timeline fence value.
- The `CUstream` is obtained from the native handle of the `OrtSyncStream`.

Note: the CUDA EP (separate from NvTensorRtRtx) could implement the same optional factory entry points in the future.
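
The handle-type table and the dedicated-allocation note can be combined into one translation step inside an EP. The sketch below avoids including `cuda.h` so that it stays self-contained: the enum `ExternalMemoryHandleType` and struct `CudaImportParams` are hypothetical, the numeric handle-type values are the ones listed in the table above, and `0x1` is the value of `CUDA_EXTERNAL_MEMORY_DEDICATED` in the CUDA driver headers. A real EP would use the CUDA enumerators directly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical local mirror of the proposal's memory handle types.
enum class ExternalMemoryHandleType { D3D12Resource, D3D12Heap };

// Values copied from the table above (CUexternalMemoryHandleType).
constexpr int kCuHandleTypeD3D12Heap = 4;
constexpr int kCuHandleTypeD3D12Resource = 5;
// Value of CUDA_EXTERNAL_MEMORY_DEDICATED.
constexpr unsigned kCudaExternalMemoryDedicated = 0x1;

// Hypothetical carrier for the fields an EP would copy into
// CUDA_EXTERNAL_MEMORY_HANDLE_DESC.
struct CudaImportParams {
  int handle_type;
  unsigned flags;
};

// Committed resources need the dedicated flag; heaps do not.
constexpr CudaImportParams ToCudaImportParams(ExternalMemoryHandleType type) {
  return type == ExternalMemoryHandleType::D3D12Resource
             ? CudaImportParams{kCuHandleTypeD3D12Resource, kCudaExternalMemoryDedicated}
             : CudaImportParams{kCuHandleTypeD3D12Heap, 0u};
}
```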

### 10.2 MiGraphX (HIP)
HIP **has full native support** for D3D12 external memory and semaphore interop on Windows. The HIP runtime API (`hip_runtime_api.h`) defines:

**Handle types** (from `hipExternalMemoryHandleType_enum` / `hipExternalSemaphoreHandleType_enum`):
| ORT Handle Type | HIP Handle Type |
|-----------------|-----------------|
| `ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE` | `hipExternalMemoryHandleTypeD3D12Resource` (value 5) |
| `ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP` | `hipExternalMemoryHandleTypeD3D12Heap` (value 4) |
| `ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE` | `hipExternalSemaphoreHandleTypeD3D12Fence` (value 4) |

**API mapping**:
| OrtExternalResourceImporterImpl method | HIP API(s) |
|-----------------------------------------|------------|
| `ImportMemory` | `hipImportExternalMemory()` + `hipExternalMemoryGetMappedBuffer()` |
| `ReleaseMemory` | `hipDestroyExternalMemory()` |
| `ImportSemaphore` | `hipImportExternalSemaphore()` |
| `ReleaseSemaphore` | `hipDestroyExternalSemaphore()` |
| `WaitSemaphore` | `hipWaitExternalSemaphoresAsync()` |
| `SignalSemaphore` | `hipSignalExternalSemaphoresAsync()` |

**Implementation notes**:
- HIP external semaphore functions are documented as "currently not supported on Linux", which implies **Windows is the supported path**.
- MiGraphX already has HIP stream integration in this repo; the new entry points would use the existing `hipStream_t` obtained from `OrtSyncStream`.
- The implementation pattern is effectively identical to the CUDA path (HIP mirrors the CUDA driver API semantics for external resource interop).

Therefore, MiGraphX **can and should** implement `CreateExternalResourceImporterForDevice` using HIP today.

### 10.3 OpenVINO GPU (OpenCL / D3D11)
OpenVINO's GPU plugin has a mature **Remote Tensor API** for memory sharing with native APIs. On Windows, it supports:

- **D3D11 surfaces**: `ID3D11Buffer` and `ID3D11Texture2D` via `ov::intel_gpu::ocl::D3DContext`
- **OpenCL interop**: `cl_mem`, `cl_context`, `cl_command_queue` via `ov::intel_gpu::ocl::ClContext`
- **NV12 video surfaces**: Direct consumption of hardware video decoder output

**D3D12 → D3D11 import path**:

D3D11.1 added `ID3D11Device1::OpenSharedResource1()` which can import NT handles created by D3D12's `CreateSharedHandle()`. This means OpenVINO **can** implement the proposed API:

```
D3D12 app                           ORT API                        OpenVINO EP
───────────────────────────────────────────────────────────────────────────────
CreateSharedHandle() ───► ExternalResourceImporter_ImportMemory() ───► OpenSharedResource1()
     │                                                                      │
     └─► NT HANDLE ────────────────────────────────────────────────────────►└─► ID3D11Buffer
```

**Requirements**:
- D3D12 resource created with `D3D12_HEAP_FLAG_SHARED`
- Same GPU (matching adapter LUID)
- OpenVINO's internal D3D11 device is D3D11.1+
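
The same-GPU requirement comes down to comparing two adapter LUIDs. A minimal sketch of that comparison follows. `AdapterLuid` is a hypothetical struct that mirrors the layout of the Win32 `LUID` (`LowPart`, `HighPart`) so the example compiles without Windows headers. In an application, one value would come from `ID3D12Device::GetAdapterLuid()` and the other from the EP device's `"LUID"` metadata, whose exact encoding is not assumed here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the Win32 LUID layout (DWORD LowPart; LONG HighPart).
struct AdapterLuid {
  uint32_t low_part;
  int32_t high_part;
};

// Packs both halves into one 64-bit value for comparison or logging.
constexpr uint64_t PackLuid(AdapterLuid luid) {
  return (static_cast<uint64_t>(static_cast<uint32_t>(luid.high_part)) << 32) |
         static_cast<uint64_t>(luid.low_part);
}

// Two devices refer to the same adapter only if both halves match.
constexpr bool SameAdapter(AdapterLuid a, AdapterLuid b) {
  return PackLuid(a) == PackLuid(b);
}
```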

**Implementation sketch**:
```cpp
// Inside OpenVINO's OrtExternalResourceImporterImpl
OrtStatus* OpenVINO_ImportMemory(
    OrtExternalResourceImporterImpl* this_ptr,
    const OrtExternalMemoryDescriptor* desc,
    OrtExternalMemoryHandleImpl** out) {
  
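  // OpenVINO already has an ID3D11Device for its D3DContext.
  // GetOpenVINOD3D11Device() is a placeholder for however the EP exposes it.
  ID3D11Device1* d3d11Device = GetOpenVINOD3D11Device();

  // Open the NT handle that the app created with ID3D12Device::CreateSharedHandle().
  ID3D11Buffer* d3d11Buffer = nullptr;
  HRESULT hr = d3d11Device->OpenSharedResource1(
      (HANDLE)desc->native_handle,
      IID_PPV_ARGS(&d3d11Buffer));

  if (FAILED(hr)) return ORT_MAKE_STATUS(/* ... */);

  // Now use d3d11Buffer with existing OpenVINO D3DContext::create_tensor()
  // and store the resulting handle in *out.
  // ...

  return nullptr;  // A null OrtStatus* indicates success.
}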
```

**Synchronization consideration**: D3D12 timeline fences can also be imported to D3D11 via `ID3D11Device5::OpenSharedFence()`, enabling fence-based sync if OpenVINO's D3D11 device is 11.4+.

**Conclusion**: OpenVINO **can implement** this D3D12 external resource API using D3D11's shared resource import. The key insight is that D3D11.1+ can consume D3D12 shared handles—there's no "D3D12-on-D3D11" layer needed, just the standard shared handle import path.

### 10.4 Vulkan-based EPs (future considerations)
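Vulkan has **native support** for importing D3D12 external memory and semaphores through the `VK_KHR_external_memory_win32` and `VK_KHR_external_semaphore_win32` device extensions, which build on the external memory/semaphore functionality promoted to core in Vulkan 1.1. This means Vulkan-based EPs can implement the proposed D3D12 interop API wherever the driver exposes these extensions.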

**Handle types** (from `VkExternalMemoryHandleTypeFlagBits` / `VkExternalSemaphoreHandleTypeFlagBits`):
| ORT Handle Type | Vulkan Handle Type | Vulkan Value |
|-----------------|-------------------|--------------|
| `ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE` | `VK_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE_BIT` | 0x00000040 |
| `ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP` | `VK_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_HEAP_BIT` | 0x00000020 |
| `ORT_EXTERNAL_SEMAPHORE_D3D12_FENCE` | `VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_D3D12_FENCE_BIT` | 0x00000008 |

**API mapping**:
| OrtExternalResourceImporterImpl method | Vulkan API(s) |
|-----------------------------------------|---------------|
| `ImportMemory` | `vkAllocateMemory` with `VkImportMemoryWin32HandleInfoKHR` in pNext chain |
| `ReleaseMemory` | `vkFreeMemory` |
| `ImportSemaphore` | `vkImportSemaphoreWin32HandleKHR` |
| `ReleaseSemaphore` | `vkDestroySemaphore` |
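| `WaitSemaphore` | Queue submit with a timeline wait semaphore (`VkTimelineSemaphoreSubmitInfo`); `vkWaitSemaphores` is a host-side wait only |
| `SignalSemaphore` | Queue submit with a timeline signal semaphore (`VkTimelineSemaphoreSubmitInfo`); `vkSignalSemaphore` is a host-side signal only |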

**Required Vulkan extensions**:
- `VK_KHR_external_memory_win32`: import Win32 handles as `VkDeviceMemory` (device extension, not promoted to core)
- `VK_KHR_external_semaphore_win32`: import Win32 fence handles as `VkSemaphore` (device extension, not promoted to core)
- `VK_KHR_timeline_semaphore`: used for D3D12 fence interop with 64-bit values (promoted to core in Vulkan 1.2)
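
Which of these extensions must be enabled explicitly depends on the device's API version: the two Win32 import extensions are never part of core, while timeline semaphores are core from Vulkan 1.2. The sketch below shows one way an EP might build that list. `MakeApiVersion` and `RequiredDeviceExtensions` are hypothetical helpers, the version encoding matches `VK_MAKE_API_VERSION(0, major, minor, 0)`, and the sketch assumes a Vulkan 1.1 or newer device.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Same bit layout as VK_MAKE_API_VERSION(0, major, minor, 0).
constexpr uint32_t MakeApiVersion(uint32_t major, uint32_t minor) {
  return (major << 22) | (minor << 12);
}

// Hypothetical helper. Assumes api_version >= 1.1, where the
// platform-independent external memory/semaphore functionality is core.
inline std::vector<std::string> RequiredDeviceExtensions(uint32_t api_version) {
  // The Win32 import extensions are always explicit device extensions.
  std::vector<std::string> extensions = {
      "VK_KHR_external_memory_win32",
      "VK_KHR_external_semaphore_win32",
  };
  // Timeline semaphores became core in Vulkan 1.2.
  if (api_version < MakeApiVersion(1, 2)) {
    extensions.push_back("VK_KHR_timeline_semaphore");
  }
  return extensions;
}
```

Independently of the extension list, the `timelineSemaphore` device feature still has to be enabled at device creation.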

**Implementation notes**:
- D3D12 fences carry monotonically increasing 64-bit values; create the Vulkan semaphore with `VK_SEMAPHORE_TYPE_TIMELINE` so wait/signal values map directly.
- The Win32 HANDLE is passed via `VkImportMemoryWin32HandleInfoKHR::handle` / `VkImportSemaphoreWin32HandleInfoKHR::handle`.
- The D3D12 adapter and the Vulkan physical device must be the same GPU: compare the adapter LUID with `VkPhysicalDeviceIDProperties::deviceLUID` (valid when `deviceLUIDValid` is `VK_TRUE`).
- After memory import, bind it to a `VkBuffer` via `vkBindBufferMemory` to get usable storage.

**Applicability to ORT EPs**:
| EP | Vulkan-based? | Can implement D3D12 interop? |
|----|---------------|------------------------------|
| WebGPU (Dawn backend) | ⚠️ Only when Dawn runs on its Vulkan backend (Dawn defaults to D3D12 on Windows) | ✅ Yes in that configuration, via Vulkan external memory extensions |
| Future Vulkan EP | ✅ Native Vulkan | ✅ Yes, direct implementation |
| DirectML | ❌ D3D12-native | N/A (D3D12 is native) |

**Conclusion**: Vulkan-based ORT EPs **can implement** the proposed D3D12 external resource API using Vulkan's external memory/semaphore extensions. The API design is fully compatible.

## 11. Future considerations
- **Texture/surface interop**: Define a clean path for non-linear/strided/image-backed tensors

### Why not D3D11 handle types in the API?

The API only defines D3D12 handle types. Adding D3D11-specific types (`ORT_EXTERNAL_MEMORY_HANDLE_TYPE_D3D11_RESOURCE`) is **out of scope** because:

| D3D12 handles (in scope) | D3D11 handles (out of scope) |
|--------------------------|------------------------------|
| App calls `ID3D12Device::CreateSharedHandle()` → standalone NT handle | App would need to export via `IDXGIResource1::CreateSharedHandle()` |
| The compute APIs considered here can import directly (CUDA, HIP, Vulkan, D3D11) | Import support is narrower and varies by API |
| Timeline fences are self-contained | Keyed mutexes require device-level coordination |

The key insight: **D3D12 shared handles are the universal currency**. They can be imported by:
- CUDA (`cuImportExternalMemory` with `D3D12_RESOURCE`)
- HIP (`hipImportExternalMemory` with `D3D12Resource`)
- Vulkan (`vkAllocateMemory` with `VK_EXTERNAL_MEMORY_HANDLE_TYPE_D3D12_RESOURCE_BIT`)
- D3D11 (`ID3D11Device1::OpenSharedResource1`)

So even EPs that internally use D3D11 (like OpenVINO) can consume D3D12 handles via `OpenSharedResource1()`. There's no need for separate D3D11 handle types.

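For applications that only have D3D11 resources:
1. Create the D3D11 device on top of a D3D12 device with `D3D11On12CreateDevice()` and query it for `ID3D11On12Device`
2. Get the underlying D3D12 resource (for example via `ID3D11On12Device2::UnwrapUnderlyingResource()`) and create a shared handle
3. Use this API with the D3D12 handle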

## 12. Requirements traceability
| Requirement | Addressed by |
|-------------|-------------|
| Zero-copy tensor views | `ExternalResourceImporter_CreateTensorFromMemory` creates a view backed by imported GPU memory; no copy occurs. |
| D3D12 sharing (resource + heap) | `OrtExternalMemoryHandleType` enum supports `D3D12_RESOURCE` and `D3D12_HEAP`. |
| Explicit synchronization | `ExternalResourceImporter_ImportSemaphore` + `WaitSemaphore`/`SignalSemaphore` with 64-bit fence value on `OrtSyncStream`. |
| EP capability discovery | `CreateExternalResourceImporterForDevice` returns null/`ORT_NOT_IMPLEMENTED` if the EP doesn't support external resources; `CanImportMemory`/`CanImportSemaphore` report support for specific handle types. |
| Device identity matching | ORT already populates `"LUID"` metadata in `OrtHardwareDevice`; apps match to D3D12 adapter LUID. `SessionGetEpDeviceForOutputs` validates output placement. |
| Fits existing ORT primitives | Single `CreateExternalResourceImporterForDevice` factory method returns capability object; uses `OrtSyncStream`; opaque handles with `Release*`. |
| Output tensors (write access) | `OrtExternalMemoryAccessMode` enum includes `WRITE_ONLY` and `READ_WRITE`. |
| Async Run integration | `RunOptions_SetSyncStream` associates a stream with Run for use with `Run()` or `RunWithBinding()`. |

---


### Describe scenario use case

## Scenario / Use Case

Importing D3D12 resources that are already resident on a particular inferencing device is a commonly requested scenario. See #26543 for an example of a similar implementation proposal.

**Concrete examples:**

1. **Video/media pipelines** — A video conferencing app decodes frames via hardware video decoder (D3D12/D3D11). Today, to run background blur or super-resolution via ORT, it must copy GPU→CPU→GPU. With this API, the decoded frame stays on GPU and is directly consumed by the EP.

2. **Game/render integration** — A game uses D3D12 for rendering and wants ML-based upscaling (DLSS-style) or denoising. The render target is already in VRAM; copying to CPU and back adds latency and bandwidth overhead.

3. **Multi-engine composition** — Apps like video editors or creative tools use D3D12 compute for some operations and ORT for others. Resources ping-pong between engines; zero-copy sharing eliminates redundant transfers.

4. **Real-time latency-sensitive workloads** — AR/VR applications where every millisecond matters. GPU→CPU→GPU copies can add 2-5ms of latency per frame, breaking real-time constraints.

**Current workarounds are unsatisfactory:**
- EP-specific private hooks (not portable across EPs)
- Copy-based paths (defeats the purpose of GPU residency)
- DML-only binding (limited to one EP, no cross-EP pattern)

[Feature Request] D3D12 External Resource Interop API for Plugin EPs #26821
