Fixes for non-standard Arm SoC PCIe integrations #972
Signed-off-by: Mario Bălănică <[email protected]>
Arm platforms have historically had issues (corruption, bus errors) with non-Device MMIO mappings. Unlike DMA coherency, there's no way to check for this at runtime. Therefore, in the absence of better chipset info, disable WC iomaps by default. Signed-off-by: Mario Bălănică <[email protected]>
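The default-off policy above can be sketched as a tiny decision helper. This is a toy model, not driver code; all names here (`pick_mmio_map_mode`, `chipset_known_wc_safe`) are illustrative, assuming only the rule stated in the commit message:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (not driver code): default MMIO mappings to strongly-ordered
 * Device memory, and only hand out a Write-Combining mapping when the
 * chipset is positively known to handle WC correctly. */
typedef enum { MMIO_MAP_DEVICE, MMIO_MAP_WRITE_COMBINE } mmio_map_mode;

static mmio_map_mode pick_mmio_map_mode(bool chipset_known_wc_safe)
{
    /* There is no runtime probe for WC safety (unlike DMA coherency),
     * so absent positive chipset information, assume the worst. */
    return chipset_known_wc_safe ? MMIO_MAP_WRITE_COMBINE : MMIO_MAP_DEVICE;
}
```

The point of the shape is that WC is opt-in per known-good chipset, rather than opt-out on known-bad ones.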
Do not set the `DMA_ATTR_SKIP_CPU_SYNC` flag on dma_map_*() calls even for memory marked as "uncached". On Arm, we always allocate cacheable pages and then use aliased (vmap) uncached mappings when necessary. Without explicit flushing right after allocation, previous stale data in these backing pages could be evicted at any point and end up clobbering memory that was already written through the aliased mapping. Note that no flushing will be performed on cache-coherent hardware. This is not an issue in the unmap path since no further writes are made to the cached mappings. Signed-off-by: Mario Bălănică <[email protected]>
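The aliasing hazard described above can be simulated in userspace. This is a deliberately simplified model (one "cache line", one backing word), assuming nothing beyond the commit message: a stale dirty line left over from before the allocation can be evicted after the driver has already written new data through the uncached alias, clobbering it; flushing right after allocation removes the hazard.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (not kernel code) of the cached/uncached aliasing hazard.
 * "backing" is the physical page; cache_data/cache_dirty model a CPU
 * cache line still holding stale pre-allocation data for that page. */
struct page_model {
    int  backing;      /* what the uncached (vmap) alias reads/writes */
    int  cache_data;   /* stale contents cached before the allocation */
    bool cache_dirty;  /* dirty lines may be evicted at any time      */
};

/* Flush right after allocation: write back and retire the stale line. */
static void flush_after_alloc(struct page_model *p)
{
    if (p->cache_dirty)
        p->backing = p->cache_data;
    p->cache_dirty = false;
}

/* Write through the uncached alias, bypassing the cache entirely. */
static void uncached_write(struct page_model *p, int v)
{
    p->backing = v;
}

/* A later, arbitrary eviction of a still-dirty line writes stale data
 * over the backing page - this is the corruption described above. */
static void maybe_evict(struct page_model *p)
{
    if (p->cache_dirty) {
        p->backing = p->cache_data;
        p->cache_dirty = false;
    }
}
```

Without `flush_after_alloc()`, the eviction overwrites the value written via the uncached alias; with it, the write survives. On cache-coherent hardware the real flush is a no-op, matching the commit's note.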
Support both CPU-side flushing (to device) and invalidation (from device) for cached memory descriptors. The previous logic was entirely broken:

- `dma_sync_*_for_device()` in `nv_dma_cache_invalidate()` actually performed flushing (cleaning) rather than invalidation, since the direction argument is ignored on ARM64. The correct API variant for invalidation is `dma_sync_*_for_cpu()`.
- `flush_cache_all()` was removed from the ARM64 kernel a long time ago, because there's no reliable way to flush all cache lines on this architecture.

This notably fixes `cliresCtrlCmdOsUnixFlushUserCache_IMPL()` and will also be needed in other places where cached memory is used. However, paths calling `memdescMapInternal()`/`memdescUnmapInternal()` in streaming DMA fashion should be fine, as these functions now properly handle synchronization.

Signed-off-by: Mario Bălănică <[email protected]>
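The clean-vs-invalidate distinction can be sketched with a toy cache-line model. This is an illustration of the streaming-DMA semantics described above, not the kernel's implementation: `*_for_device` cleans (writes back) dirty CPU data so the device sees it, while `*_for_cpu` invalidates so the CPU re-reads device-written data.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one cache line versus DRAM, illustrating why the
 * _for_cpu/_for_device variant (not the direction argument) picks
 * the maintenance operation on arm64. */
struct line_model {
    int  memory;  /* DRAM contents (what the device sees and writes) */
    int  cache;   /* CPU cache line contents                         */
    bool valid;   /* line present in the cache                       */
    bool dirty;   /* line modified by the CPU                        */
};

/* dma_sync_*_for_device(): clean - make CPU writes visible to the device. */
static void sync_for_device(struct line_model *l)
{
    if (l->valid && l->dirty) {
        l->memory = l->cache;
        l->dirty = false;
    }
}

/* dma_sync_*_for_cpu(): invalidate - make device writes visible to the CPU. */
static void sync_for_cpu(struct line_model *l)
{
    l->valid = false;   /* next CPU read refills from DRAM */
    l->dirty = false;   /* any stale dirty data is discarded */
}

/* CPU read: hits the cache if the line is valid, else refills from DRAM. */
static int cpu_read(struct line_model *l)
{
    if (!l->valid) {
        l->cache = l->memory;
        l->valid = true;
    }
    return l->cache;
}
```

In this model, calling `sync_for_device()` where an invalidation was intended (the old `nv_dma_cache_invalidate()` behavior) leaves the stale line valid, so the CPU keeps reading pre-DMA data.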
Repurpose `NV_MEMORY_DEFAULT` to hand out either cached or uncached CPU mappings based on hardware cache-coherency support. This type should be preferred over `NV_MEMORY_CACHED`, unless there's a good reason not to:

- explicit cache maintenance is done where necessary (which does not seem to be the case for most allocations so far).
- there are certain memory requirements (e.g. atomics usually need cached memory on Arm).

Most `NV_MEMORY_CACHED` allocations are replaced with this default type, except in cases where I've seen cache maintenance, or where uncached memory caused issues. There are some remaining cached allocations (e.g. imported from user memory, RUSD) that I haven't looked into - it's unclear whether those are subject to DMA coherency issues.

In practice, everything I've tested (games, benchmarks, monitoring tools, CUDA) appears to work fine now on a non-coherent system (RK3588-based).

Signed-off-by: Mario Bălănică <[email protected]>
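The resolution rule can be sketched as follows. This is a hypothetical shape, assuming only what the commit message states; `resolve_cpu_mapping` and the enum names are illustrative, not the driver's actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of how a "default" memory type could resolve to a
 * concrete CPU mapping attribute based on hardware cache coherency. */
enum mem_type { MEM_UNCACHED, MEM_CACHED, MEM_DEFAULT };

static enum mem_type resolve_cpu_mapping(enum mem_type requested,
                                         bool hw_cache_coherent)
{
    if (requested != MEM_DEFAULT)
        return requested;   /* an explicit request wins */
    /* Cached mappings are only safe by default when the device is
     * cache-coherent with the CPU; otherwise fall back to uncached. */
    return hw_cache_coherent ? MEM_CACHED : MEM_UNCACHED;
}
```

The design point is that callers no longer encode a coherency assumption: they say "default" and the platform decides, while code that genuinely needs cached memory (e.g. for atomics) can still ask for it explicitly.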
@mariobalanica - Thanks for posting the PR. I'm trying to test it on a CM5 with an Nvidia A4000, which is detected via:
Driver install: download the 580.95.05 aarch64 driver from https://www.nvidia.com/en-us/drivers/unix/ and install it without kernel modules (so as not to overwrite the ones we just built). After that completes and I reboot, the module does not load. Attempting to load it manually:
EDIT: I was not updating the module database. Now the module loads, but I still wind up with the classic 'RmInitAdapter failed!' error:
Is the kernel built with a non-4K page size?
@mariobalanica - Ah yes... this is the Pi default kernel, 16K page size. I can try switching to a 4K kernel.
@mariobalanica - Okay, building off the 4K kernel gets me a loaded driver. However, displays plugged into the DisplayPort ports don't seem to get any signal. But nvidia-smi does work. Vulkan info: I tried compiling … so I switched to Vulkan to see if acceleration is working. And it's definitely accelerated... NICE! More info here: geerlingguy/raspberry-pi-pcie-devices#692 (comment). Do you know if there are any other tricks to getting display output? On other cards where Mesa wasn't happy but the drivers worked, I would at least get output with a flashing cursor, and I could press Alt + F2 to get to a console. Here I'm not even seeing that...
Not sure, I didn't have to do anything special to get HDMI out. What does `dmesg` show?
From dmesg during boot: |
Any chance to support the 1070 Ti?
I'm fairly certain the 1070 doesn't have the GSP firmware required to work with the open-gpu-kernel-modules — see, similarly, the 750 Ti: geerlingguy/raspberry-pi-pcie-devices#26 (comment)
Thanks!
So do we have to stick to a 4K page size to use this?
I've now tested an RTX A4000, 750 Ti, and 3080 Ti on a Raspberry Pi CM5 running Debian Trixie. The 750 Ti (referenced above) doesn't have GSP firmware so can't work with the open driver. For the other two, the behavior was the same:
I tested everything on both cards on a separate Intel x86 system on my bench, and both HDMI and DisplayPort outputs worked fine on that setup (running Ubuntu 25.10, using the same version of the driver that I've installed on the Pi setup).
@geerlingguy, do you have a RK3588 board you can test this card with? It's not going to work just yet as I still have to push some firmware changes (likely by the end of this week), but I'm curious whether you can reproduce the issue there with a mainline kernel.
@pj1976, the open driver variant only supports newer GSP-capable cards - so no, not directly. If this patchset gets merged, there's a chance the fixes could also land in the proprietary driver. But it's also possible that the legacy firmware code has similar issues, and since we have no public sources for that, it would require special attention from NVIDIA. For older cards, you can try nouveau. YMMV though, as nouveau does no reclocking for older cards, so you're probably not going to get a lot of performance out of it.
@Lewiscowles1986, the driver's memory manager layer complained about the page size. I'd recommend opening another issue for that.
Note that I have tested the open and proprietary drivers with a few cards on the Thelio Astra, with a 64K page size, and that worked fine, so I wonder if there are special cases for Ampere in Nvidia's drivers...
Looks like it just doesn't currently support a 16K page size: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/580.95.05/src/nvidia/src/kernel/gpu/mem_mgr/mem_mgr.c#L1935
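The linked check amounts to a page-size gate. The sketch below is illustrative, not the driver's code, and assumes the behavior observed in this thread: 4K and 64K kernels work, 16K is rejected.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative page-size gate: accept only OS page sizes the GPU MMU
 * can mirror. Assumed set per this thread: 4K and 64K work, 8K/16K
 * are rejected. The real check lives in mem_mgr.c. */
static bool os_page_size_supported(unsigned long page_size)
{
    switch (page_size) {
    case 4096:    /* 4K  */
    case 65536:   /* 64K */
        return true;
    default:      /* 8K, 16K, ... */
        return false;
    }
}
```

This also explains the earlier exchange: the Pi's default 16K kernel fails this gate, while the Thelio Astra's 64K kernel passes it.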
Do you think it could be as easy as … ? Surely not? I'm assuming the same patch could wrap things in `#ifndef` guards and define the new defines within them. Not asking for it to be part of your patch, but this kind of thing always surprises and interests me.
It's not. The GPU does not support a 16K (or 8K) page size. That said, the driver works on a 16K system (M2 Ultra running Linux) with the following patch: There is some log splat, but the card works, including display for a desktop session and glmark2/vkmark.
Any hope to resurrect old Jetson Nanos with a standard distro thanks to this? Thanks!
The latest EDK2 firmware build (https://github.com/edk2-porting/edk2-rk3588/actions) enables full support for NVIDIA cards on RK3588, without any kernel/DT patches.
@Darkhub, this PR does not enable GPUs that aren't already supported by the open driver variant. Jetson Nano is the older Maxwell architecture. See: #19
Should this work with an RTX 5000 Quadro?
Do all Nvidia cards work, including data center cards?
@itsanirudhsrinivasan not all cards will work; this is for a specific generation or a few generations. That information is contained above in this thread: #972 (comment). Also, https://github.com/NVIDIA/open-gpu-kernel-modules?tab=readme-ov-file#compatible-gpus lists compatible GPUs.
Oh, thanks @Lewiscowles1986
Hi everyone, I have an RPi 5 running a simple media server with docker-compose. I'm considering connecting an NVIDIA RTX 3080 to it, powered by a 620W PSU, to handle 4K transcoding. Is this setup even possible at the moment? Thanks!
How does this patch work? I don't know much about GPU acceleration at the kernel level, but I'm still curious.
This patchset attempts to address a number of limitations present in commonly available Arm SoCs:
Tested on RK3588 (has all issues above) and CIX P1 (no issues, SBSA-compliant) with an RTX 3050 8 GB.
Most things I've tried (Steam games, benchmarks, monitoring tools, CUDA) work fine now.
See mariobalanica/arm-pcie-gpu-patches#2 for related discussion and demos of the driver running.
Side note: there's currently no Arm userspace release for driver version 580.105.08, so you'll need to stick with 580.95.05.