
Conversation

ruiling
Contributor

@ruiling ruiling commented Nov 5, 2024

The major behaviors of the max memory clause schedule strategy include:

  1. Try to cluster memory instructions more aggressively.
  2. Try to schedule long-latency loads earlier than short-latency
     instructions (a condensed sketch of these two rules follows below).

I tested locally against about 470 real shaders and got the following perf
changes (counting only changes over +/-10%):
About 15 shaders improved by 10%~40%.
Only 3 shaders regressed by ~10%.

(This was tested together with another change that increases the maximum clustered dwords from 8 to 32.)
I will make a follow-up change to make that threshold configurable.
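
To make the intent of the two rules easier to see in isolation, below is a minimal, self-contained C++ sketch. The Candidate and preferTry names and the toy latency values are invented for illustration; the real logic is the GCNSchedStrategy::tryCandidate() change in the patch below, which is where the 10x latency ratio comes from.

// Toy model of the two AMDGPU-specific ordering rules described above.
// The type and function names here are illustrative only and do not
// correspond to the actual LLVM scheduler types in the patch.
#include <cstdio>

struct Candidate {
  bool IsNextInCluster; // would extend the current memory clause
  bool MayLoad;         // mayLoad() on the underlying instruction
  unsigned Latency;     // modeled instruction latency in cycles
};

// Returns true if Try should be picked over Best.
static bool preferTry(const Candidate &Try, const Candidate &Best) {
  // Rule 1: keep clusterable memory ops together so they form one clause;
  // this check runs before the generic stall-latency heuristic.
  if (Try.IsNextInCluster != Best.IsNextInCluster)
    return Try.IsNextInCluster;
  // Rule 2: issue a load much earlier when its latency dwarfs the other
  // candidate's (the patch uses a 10x ratio); stores are excluded via
  // the MayLoad flag.
  bool TryLong = Try.MayLoad && Try.Latency > 10 * Best.Latency;
  bool BestLong = Best.MayLoad && Best.Latency > 10 * Try.Latency;
  if (TryLong != BestLong)
    return TryLong;
  return false; // otherwise fall through to the generic heuristics
}

int main() {
  Candidate ImageSample{/*IsNextInCluster=*/true, /*MayLoad=*/true, /*Latency=*/200};
  Candidate ShortValu{/*IsNextInCluster=*/false, /*MayLoad=*/false, /*Latency=*/4};
  std::printf("pick image_sample first: %s\n",
              preferTry(ImageSample, ShortValu) ? "yes" : "no");
  return 0;
}

The ordering matters: because the cluster check runs before the stall-latency heuristic, back-to-back image_sample instructions can form a single clause even when the generic heuristics would interleave them with shorter ALU work.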

This is a motivating example that drives us to do better at grouping
image sample instructions.
The AMDGPU-specific version mainly includes two major differences:
1. Try to cluster memory instructions more aggressively.
2. Try to schedule long-latency loads earlier than short-latency
   instructions.

I tested locally against about 470 real shaders and got the following perf
changes (counting only changes over +/-10%):
About 15 shaders improved by 10%~40%.
Only 3 shaders regressed by ~10%.

(This was tested together with another change that increases the maximum clustered dwords from 8 to 32.)
I will make a follow-up change to make that threshold configurable.
@llvmbot
Member

llvmbot commented Nov 5, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Ruiling, Song (ruiling)

Changes

The AMDGPU-specific version mainly includes two major differences:

  1. Try to cluster memory instructions more aggressively.
  2. Try to schedule long-latency loads earlier than short-latency
     instructions.

I tested locally against about 470 real shaders and got the following perf
changes (counting only changes over +/-10%):
About 15 shaders improved by 10%~40%.
Only 3 shaders regressed by ~10%.

(This was tested together with another change that increases the maximum clustered dwords from 8 to 32.)
I will make a follow-up change to make that threshold configurable.


Patch is 38.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/114957.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+134)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+3)
  • (added) llvm/test/CodeGen/AMDGPU/group-image-instructions.ll (+488)
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 57f517bfba0ebb..37802d335fb9fd 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -63,6 +63,10 @@ static cl::opt<bool> GCNTrackers(
     cl::desc("Use the AMDGPU specific RPTrackers during scheduling"),
     cl::init(false));
 
+static cl::opt<bool> UseAMDGPUScheduleHeuristic(
+    "amdgpu-use-amdgpu-schedule-heuristic", cl::Hidden,
+    cl::desc("Use AMDGPU specific schedule heuristic "), cl::init(false));
+
 const unsigned ScheduleMetrics::ScaleFactor = 100;
 
 GCNSchedStrategy::GCNSchedStrategy(const MachineSchedContext *C)
@@ -311,6 +315,136 @@ void GCNSchedStrategy::initCandidate(SchedCandidate &Cand, SUnit *SU,
   }
 }
 
+/// AMDGPU specific implementation, which is largely copy-pasted from the
+/// generic version, with some modifications to better hide memory latency.
+/// Major differences from the generic version:
+/// 1. Prioritize clustered operations before the stall latency heuristic.
+/// 2. Prioritize long-latency loads before the stall latency heuristic.
+///
+/// \param Cand provides the policy and current best candidate.
+/// \param TryCand refers to the next SUnit candidate, otherwise uninitialized.
+/// \param Zone describes the scheduled zone that we are extending, or nullptr
+///             if Cand is from a different zone than TryCand.
+/// \return \c true if TryCand is better than Cand (Reason is NOT NoCand)
+bool GCNSchedStrategy::tryCandidate(SchedCandidate &Cand,
+                                    SchedCandidate &TryCand,
+                                    SchedBoundary *Zone) const {
+  if (!UseAMDGPUScheduleHeuristic)
+    return GenericScheduler::tryCandidate(Cand, TryCand, Zone);
+
+  // Initialize the candidate if needed.
+  if (!Cand.isValid()) {
+    TryCand.Reason = NodeOrder;
+    return true;
+  }
+
+  // Bias PhysReg Defs and copies to their uses and defined respectively.
+  if (tryGreater(biasPhysReg(TryCand.SU, TryCand.AtTop),
+                 biasPhysReg(Cand.SU, Cand.AtTop), TryCand, Cand, PhysReg))
+    return TryCand.Reason != NoCand;
+
+  // Avoid exceeding the target's limit.
+  if (DAG->isTrackingPressure() &&
+      tryPressure(TryCand.RPDelta.Excess, Cand.RPDelta.Excess, TryCand, Cand,
+                  RegExcess, TRI, DAG->MF))
+    return TryCand.Reason != NoCand;
+
+  // Avoid increasing the max critical pressure in the scheduled region.
+  if (DAG->isTrackingPressure() &&
+      tryPressure(TryCand.RPDelta.CriticalMax, Cand.RPDelta.CriticalMax,
+                  TryCand, Cand, RegCritical, TRI, DAG->MF))
+    return TryCand.Reason != NoCand;
+
+  // AMDGPU-specific: We prioritize clustered instructions as we would get more
+  // benefit from clausing these memory instructions.
+  const SUnit *CandNextClusterSU =
+      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
+  const SUnit *TryCandNextClusterSU =
+      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
+  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
+                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+    return TryCand.Reason != NoCand;
+
+  // We only compare a subset of features when comparing nodes between
+  // Top and Bottom boundary. Some properties are simply incomparable, in many
+  // other instances we should only override the other boundary if something
+  // is a clear good pick on one boundary. Skip heuristics that are more
+  // "tie-breaking" in nature.
+  bool SameBoundary = Zone != nullptr;
+  if (SameBoundary) {
+    // For loops that are acyclic path limited, aggressively schedule for
+    // latency. Within a single cycle, whenever CurrMOps > 0, allow normal
+    // heuristics to take precedence.
+    if (Rem.IsAcyclicLatencyLimited && !Zone->getCurrMOps() &&
+        tryLatency(TryCand, Cand, *Zone))
+      return TryCand.Reason != NoCand;
+
+    // AMDGPU-specific: Prioritize long-latency memory load instructions in
+    // top-bottom order to hide more latency. The mayLoad check is used
+    // to exclude store-like instructions, which we do not want to schedule
+    // too early.
+    bool TryMayLoad =
+        TryCand.SU->isInstr() && TryCand.SU->getInstr()->mayLoad();
+    bool CandMayLoad = Cand.SU->isInstr() && Cand.SU->getInstr()->mayLoad();
+
+    if (TryMayLoad || CandMayLoad) {
+      bool TryLongLatency =
+          TryCand.SU->Latency > 10 * Cand.SU->Latency && TryMayLoad;
+      bool CandLongLatency =
+          10 * TryCand.SU->Latency < Cand.SU->Latency && CandMayLoad;
+
+      if (tryGreater(Zone->isTop() ? TryLongLatency : CandLongLatency,
+                     Zone->isTop() ? CandLongLatency : TryLongLatency, TryCand,
+                     Cand, Stall))
+        return TryCand.Reason != NoCand;
+    }
+    // Prioritize instructions that read unbuffered resources by stall cycles.
+    if (tryLess(Zone->getLatencyStallCycles(TryCand.SU),
+                Zone->getLatencyStallCycles(Cand.SU), TryCand, Cand, Stall))
+      return TryCand.Reason != NoCand;
+  }
+
+  if (SameBoundary) {
+    // Weak edges are for clustering and other constraints.
+    if (tryLess(getWeakLeft(TryCand.SU, TryCand.AtTop),
+                getWeakLeft(Cand.SU, Cand.AtTop), TryCand, Cand, Weak))
+      return TryCand.Reason != NoCand;
+  }
+
+  // Avoid increasing the max pressure of the entire region.
+  if (DAG->isTrackingPressure() &&
+      tryPressure(TryCand.RPDelta.CurrentMax, Cand.RPDelta.CurrentMax, TryCand,
+                  Cand, RegMax, TRI, DAG->MF))
+    return TryCand.Reason != NoCand;
+
+  if (SameBoundary) {
+    // Avoid critical resource consumption and balance the schedule.
+    TryCand.initResourceDelta(DAG, SchedModel);
+    if (tryLess(TryCand.ResDelta.CritResources, Cand.ResDelta.CritResources,
+                TryCand, Cand, ResourceReduce))
+      return TryCand.Reason != NoCand;
+    if (tryGreater(TryCand.ResDelta.DemandedResources,
+                   Cand.ResDelta.DemandedResources, TryCand, Cand,
+                   ResourceDemand))
+      return TryCand.Reason != NoCand;
+
+    // Avoid serializing long latency dependence chains.
+    // For acyclic path limited loops, latency was already checked above.
+    if (!RegionPolicy.DisableLatencyHeuristic && TryCand.Policy.ReduceLatency &&
+        !Rem.IsAcyclicLatencyLimited && tryLatency(TryCand, Cand, *Zone))
+      return TryCand.Reason != NoCand;
+
+    // Fall through to original instruction order.
+    if ((Zone->isTop() && TryCand.SU->NodeNum < Cand.SU->NodeNum) ||
+        (!Zone->isTop() && TryCand.SU->NodeNum > Cand.SU->NodeNum)) {
+      TryCand.Reason = NodeOrder;
+      return true;
+    }
+  }
+
+  return false;
+}
+
 // This function is mostly cut and pasted from
 // GenericScheduler::pickNodeFromQueue()
 void GCNSchedStrategy::pickNodeFromQueue(SchedBoundary &Zone,
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
index 64d517038f90e0..addb05922cee66 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
@@ -41,6 +41,9 @@ raw_ostream &operator<<(raw_ostream &OS, const GCNSchedStageID &StageID);
 /// heuristics to determine excess/critical pressure sets.
 class GCNSchedStrategy : public GenericScheduler {
 protected:
+  bool tryCandidate(SchedCandidate &Cand, SchedCandidate &TryCand,
+                    SchedBoundary *Zone) const override;
+
   SUnit *pickNodeBidirectional(bool &IsTopNode);
 
   void pickNodeFromQueue(SchedBoundary &Zone, const CandPolicy &ZonePolicy,
diff --git a/llvm/test/CodeGen/AMDGPU/group-image-instructions.ll b/llvm/test/CodeGen/AMDGPU/group-image-instructions.ll
new file mode 100644
index 00000000000000..8644cd3cc1ef85
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/group-image-instructions.ll
@@ -0,0 +1,488 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -amdgpu-use-amdgpu-schedule-heuristic=true -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX11 %s
+
+define amdgpu_ps void @group_image_sample(i32 inreg noundef %globalTable, i32 inreg noundef %userdata6, i32 inreg noundef %userdata7, i32 inreg noundef %userdata8, i32 inreg noundef %PrimMask, <2 x float> noundef %PerspInterpSample, <2 x float> noundef %PerspInterpCenter, <2 x float> noundef %PerspInterpCentroid) #2 {
+; GFX11-LABEL: group_image_sample:
+; GFX11:       ; %bb.0: ; %.entry
+; GFX11-NEXT:    s_mov_b64 s[16:17], exec
+; GFX11-NEXT:    s_wqm_b64 exec, exec
+; GFX11-NEXT:    s_mov_b32 m0, s4
+; GFX11-NEXT:    s_getpc_b64 s[4:5]
+; GFX11-NEXT:    s_mov_b32 s0, s1
+; GFX11-NEXT:    s_mov_b32 s6, s3
+; GFX11-NEXT:    s_mov_b32 s1, s5
+; GFX11-NEXT:    s_mov_b32 s3, s5
+; GFX11-NEXT:    s_mov_b32 s7, s5
+; GFX11-NEXT:    s_load_b128 s[12:15], s[0:1], 0x0
+; GFX11-NEXT:    s_load_b128 s[8:11], s[2:3], 0x0
+; GFX11-NEXT:    s_load_b256 s[0:7], s[6:7], 0x0
+; GFX11-NEXT:    s_mov_b64 s[18:19], exec
+; GFX11-NEXT:    s_wqm_b64 exec, exec
+; GFX11-NEXT:    lds_param_load v2, attr0.y wait_vdst:15
+; GFX11-NEXT:    lds_param_load v3, attr0.x wait_vdst:15
+; GFX11-NEXT:    s_mov_b64 exec, s[18:19]
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_clause 0x3
+; GFX11-NEXT:    s_buffer_load_b64 s[18:19], s[12:15], 0x10
+; GFX11-NEXT:    s_buffer_load_b64 s[20:21], s[12:15], 0x20
+; GFX11-NEXT:    s_buffer_load_b64 s[22:23], s[12:15], 0x30
+; GFX11-NEXT:    s_buffer_load_b64 s[24:25], s[12:15], 0x40
+; GFX11-NEXT:    v_interp_p10_f32 v4, v2, v0, v2 wait_exp:1
+; GFX11-NEXT:    v_interp_p10_f32 v0, v3, v0, v3 wait_exp:0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    v_interp_p2_f32 v45, v2, v1, v4 wait_exp:7
+; GFX11-NEXT:    v_interp_p2_f32 v44, v3, v1, v0 wait_exp:7
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    v_add_f32_e32 v0, s18, v44
+; GFX11-NEXT:    v_add_f32_e32 v1, s19, v45
+; GFX11-NEXT:    v_add_f32_e32 v8, s20, v44
+; GFX11-NEXT:    v_add_f32_e32 v9, s21, v45
+; GFX11-NEXT:    v_add_f32_e32 v16, s24, v44
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[4:7], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[8:11], v[8:9], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    v_add_f32_e32 v0, s22, v44
+; GFX11-NEXT:    v_add_f32_e32 v1, s23, v45
+; GFX11-NEXT:    v_add_f32_e32 v17, s25, v45
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[12:15], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[16:19], v[16:17], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_clause 0x3
+; GFX11-NEXT:    s_buffer_load_b64 s[18:19], s[12:15], 0x50
+; GFX11-NEXT:    s_buffer_load_b64 s[20:21], s[12:15], 0x60
+; GFX11-NEXT:    s_buffer_load_b64 s[22:23], s[12:15], 0x70
+; GFX11-NEXT:    s_buffer_load_b64 s[24:25], s[12:15], 0x80
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_add_f32_e32 v0, s18, v44
+; GFX11-NEXT:    v_add_f32_e32 v1, s19, v45
+; GFX11-NEXT:    v_add_f32_e32 v24, s20, v44
+; GFX11-NEXT:    v_add_f32_e32 v25, s21, v45
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[20:23], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[24:27], v[24:25], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_clause 0x7
+; GFX11-NEXT:    s_buffer_load_b64 s[18:19], s[12:15], 0x90
+; GFX11-NEXT:    s_buffer_load_b64 s[20:21], s[12:15], 0xa0
+; GFX11-NEXT:    s_buffer_load_b64 s[26:27], s[12:15], 0xb0
+; GFX11-NEXT:    s_buffer_load_b64 s[28:29], s[12:15], 0xc0
+; GFX11-NEXT:    s_buffer_load_b64 s[30:31], s[12:15], 0xd0
+; GFX11-NEXT:    s_buffer_load_b64 s[34:35], s[12:15], 0xe0
+; GFX11-NEXT:    s_buffer_load_b64 s[36:37], s[12:15], 0xf0
+; GFX11-NEXT:    s_buffer_load_b64 s[12:13], s[12:15], 0x100
+; GFX11-NEXT:    v_add_f32_e32 v0, s22, v44
+; GFX11-NEXT:    v_add_f32_e32 v1, s23, v45
+; GFX11-NEXT:    v_add_f32_e32 v28, s24, v44
+; GFX11-NEXT:    v_add_f32_e32 v29, s25, v45
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_add_f32_e32 v30, s18, v44
+; GFX11-NEXT:    v_add_f32_e32 v31, s19, v45
+; GFX11-NEXT:    v_add_f32_e32 v32, s20, v44
+; GFX11-NEXT:    v_add_f32_e32 v33, s21, v45
+; GFX11-NEXT:    v_add_f32_e32 v34, s26, v44
+; GFX11-NEXT:    v_add_f32_e32 v35, s27, v45
+; GFX11-NEXT:    v_add_f32_e32 v36, s28, v44
+; GFX11-NEXT:    v_add_f32_e32 v37, s29, v45
+; GFX11-NEXT:    v_add_f32_e32 v38, s30, v44
+; GFX11-NEXT:    v_add_f32_e32 v39, s31, v45
+; GFX11-NEXT:    v_add_f32_e32 v40, s34, v44
+; GFX11-NEXT:    v_add_f32_e32 v41, s35, v45
+; GFX11-NEXT:    v_add_f32_e32 v42, s36, v44
+; GFX11-NEXT:    v_add_f32_e32 v43, s37, v45
+; GFX11-NEXT:    v_add_f32_e32 v44, s12, v44
+; GFX11-NEXT:    v_add_f32_e32 v45, s13, v45
+; GFX11-NEXT:    s_waitcnt vmcnt(4)
+; GFX11-NEXT:    v_add_f32_e32 v46, v8, v4
+; GFX11-NEXT:    v_add_f32_e32 v47, v9, v5
+; GFX11-NEXT:    v_add_f32_e32 v48, v10, v6
+; GFX11-NEXT:    v_add_f32_e32 v49, v11, v7
+; GFX11-NEXT:    s_and_b64 exec, exec, s[16:17]
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[4:7], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[8:11], v[28:29], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(5)
+; GFX11-NEXT:    v_add_f32_e32 v0, v12, v46
+; GFX11-NEXT:    v_add_f32_e32 v1, v13, v47
+; GFX11-NEXT:    v_add_f32_e32 v46, v14, v48
+; GFX11-NEXT:    v_add_f32_e32 v47, v15, v49
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[12:15], v[30:31], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[28:31], v[32:33], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(6)
+; GFX11-NEXT:    v_add_f32_e32 v0, v16, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v17, v1
+; GFX11-NEXT:    v_add_f32_e32 v46, v18, v46
+; GFX11-NEXT:    v_add_f32_e32 v47, v19, v47
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[16:19], v[34:35], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[32:35], v[36:37], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(7)
+; GFX11-NEXT:    v_add_f32_e32 v0, v20, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v21, v1
+; GFX11-NEXT:    v_add_f32_e32 v46, v22, v46
+; GFX11-NEXT:    v_add_f32_e32 v47, v23, v47
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[20:23], v[38:39], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[36:39], v[40:41], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(8)
+; GFX11-NEXT:    v_add_f32_e32 v0, v24, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v25, v1
+; GFX11-NEXT:    v_add_f32_e32 v46, v26, v46
+; GFX11-NEXT:    v_add_f32_e32 v47, v27, v47
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[24:27], v[42:43], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[40:43], v[44:45], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(9)
+; GFX11-NEXT:    v_add_f32_e32 v0, v4, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v5, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v6, v46
+; GFX11-NEXT:    v_add_f32_e32 v5, v7, v47
+; GFX11-NEXT:    s_waitcnt vmcnt(8)
+; GFX11-NEXT:    v_add_f32_e32 v0, v8, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v9, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v10, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v11, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(7)
+; GFX11-NEXT:    v_add_f32_e32 v0, v12, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v13, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v14, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v15, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(6)
+; GFX11-NEXT:    v_add_f32_e32 v0, v28, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v29, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v30, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v31, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(5)
+; GFX11-NEXT:    v_add_f32_e32 v0, v16, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v17, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v18, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v19, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(4)
+; GFX11-NEXT:    v_add_f32_e32 v0, v32, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v33, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v34, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v35, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(3)
+; GFX11-NEXT:    v_add_f32_e32 v0, v20, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v21, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v22, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v23, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(2)
+; GFX11-NEXT:    v_add_f32_e32 v0, v36, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v37, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v38, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v39, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(1)
+; GFX11-NEXT:    v_add_f32_e32 v0, v24, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v25, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v26, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v27, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-NEXT:    v_add_f32_e32 v0, v40, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v41, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v42, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v43, v5
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    v_cvt_pk_rtz_f16_f32_e32 v0, v0, v1
+; GFX11-NEXT:    v_cvt_pk_rtz_f16_f32_e32 v1, v4, v5
+; GFX11-NEXT:    exp mrt0 v0, v1, off, off done
+; GFX11-NEXT:    s_endpgm
+.entry:
+  %0 = call i64 @llvm.amdgcn.s.getpc()
+  %1 = and i64 %0, -4294967296
+  %2 = zext i32 %userdata6 to i64
+  %3 = or disjoint i64 %1, %2
+  %4 = inttoptr i64 %3 to ptr addrspace(4)
+  %5 = load <4 x i32>, ptr addrspace(4) %4, align 16
+  %6 = zext i32 %userdata7 to i64
+  %7 = or disjoint i64 %1, %6
+  %8 = inttoptr i64 %7 to ptr addrspace(4)
+  %9 = load <4 x i32>, ptr addrspace(4) %8, align 4, !invariant.load !0
+  %10 = zext i32 %userdata8 to i64
+  %11 = or disjoint i64 %1, %10
+  %12 = inttoptr i64 %11 to ptr addrspace(4)
+  %13 = load <8 x i32>, ptr addrspace(4) %12, align 4, !invariant.load !0
+  %14 = call float @llvm.amdgcn.lds.param.load(i32 1, i32 0, i32 %PrimMask)
+  %PerspInterpCenter.i1 = extractelement <2 x float> %PerspInterpCenter, i64 1
+  %PerspInterpCenter.i0 = extractelement <2 x float> %PerspInterpCenter, i64 0
+  %15 = call float @llvm.amdgcn.interp.inreg.p10(float %14, float %PerspInterpCenter.i0, float %14)
+  %16 = call float @llvm.amdgcn.interp.inreg.p2(float %14, float %PerspInterpCenter.i1, float %15)
+  %17 = call float @llvm.amdgcn.lds.param.load(i32 0, i32 0, i32 %PrimMask)
+  %18 = call float @llvm.amdgcn.interp.inreg.p10(float %17, float %PerspInterpCenter.i0, float %17)
+  %19 = call float @llvm.amdgcn.interp.inreg.p2(float %17, float %PerspInterpCenter.i1, float %18)
+  %20 = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %5, i32 16, i32 0), !invariant.load !0
+  %21 = shufflevector <2 x i32> %20, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
+  %22 = bitcast <4 x i32> %21 to <4 x float>
+  %.i0 = extractelement <4 x float> %22, i64 0
+  %.i1 = extractelement <4 x float> %22, i64 1
+  %.i03 = fadd reassoc nnan nsz arcp contract afn float %.i0, %19
+  %.i14 = fadd reassoc nnan nsz arcp contract afn float %.i1, %16
+  %23 = call reassoc nnan nsz arcp contract afn <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float %.i03, float %.i14, <8 x i32> %13, <4 x i32> %9, i1 false, i32 0, i32 0)
+  %.i010 = extractelement <4 x float> %23, i64 0
+  %.i113 = extractelement <4 x float> %23, i64 1
+  %.i215 = extractelement <4 x float> %23, i64 2
+  %.i317 = extractelement <4 x float> %23, i64 3
+  %24 = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %5, i32 32, i32 0), !invariant.load !0
+  %25 = shufflevector <2 x i32> %24, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
+  %26 = bitcast <4 x i32> %25 to <4 x float>
+  %.i05 = extractelement <4 x float> %26, i64 0
+  %.i16 = extractelement <4 x float> %26, i64 1
+  %.i07 = fadd reassoc nnan nsz arcp contract afn float %.i05, %19
+  %.i18 = fadd reassoc nnan nsz arcp contract afn float %.i16, %16
+  %27 = call reassoc nnan nsz arcp contract afn <4 x float> @llvm.a...
[truncated]

@arsenm arsenm requested review from kerbowa and jrbyrnes November 5, 2024 15:38
@ruiling ruiling changed the title from "[AMDGPU] Add AMDGPU specific tryCandidate()" to "[AMDGPU] Add MaxMemoryClauseSchedStrategy" on Nov 26, 2024
The graphics frontend cannot set a command-line option per function.
Add a function attribute to allow the frontend to select the strategy
per function.
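
As a rough illustration of what such a per-function override might look like in IR, here is a hypothetical sketch; the attribute name "amdgpu-sched-strategy" and the value "max-memory-clause" are assumptions based on the renamed strategy and are not copied from this patch.

; Hypothetical IR sketch: the attribute name and value below are assumed,
; not taken from the diff shown in this PR.
define amdgpu_ps void @per_function_override() #0 {
entry:
  ret void
}

attributes #0 = { "amdgpu-sched-strategy"="max-memory-clause" }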
@ruiling
Contributor Author

ruiling commented Dec 3, 2024

ping

Contributor

@arsenm arsenm left a comment


lgtm with nits

@ruiling ruiling merged commit b33c807 into llvm:main Dec 9, 2024
8 checks passed