
Conversation

ruiling
Contributor

@ruiling ruiling commented Nov 5, 2024

The major behaviors of the max memory clause schedule strategy include:

  1. Try to cluster memory instructions more aggressively.
  2. Try to schedule long-latency loads earlier than short-latency
     instructions (a condensed sketch of these two rules follows below).

I tested locally against about 470 real shaders and got the following perf
changes (counting only changes over +/-10%):
About 15 shaders improved by 10%~40%.
Only 3 shaders regressed by ~10%.

(This was tested together with another change that increases the maximum clustered dwords from 8 to 32.)
I will make a follow-up change to make that threshold configurable.
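
To make the intent of the two rules easier to see in isolation, below is a minimal, self-contained C++ sketch. The Candidate and preferTry names and the toy latency values are invented for illustration; the real logic is the GCNSchedStrategy::tryCandidate() change in the patch below, which is where the 10x latency ratio comes from.

// Toy model of the two AMDGPU-specific ordering rules described above.
// The type and function names here are illustrative only and do not
// correspond to the actual LLVM scheduler types in the patch.
#include <cstdio>

struct Candidate {
  bool IsNextInCluster; // would extend the current memory clause
  bool MayLoad;         // mayLoad() on the underlying instruction
  unsigned Latency;     // modeled instruction latency in cycles
};

// Returns true if Try should be picked over Best.
static bool preferTry(const Candidate &Try, const Candidate &Best) {
  // Rule 1: keep clusterable memory ops together so they form one clause;
  // this check runs before the generic stall-latency heuristic.
  if (Try.IsNextInCluster != Best.IsNextInCluster)
    return Try.IsNextInCluster;
  // Rule 2: issue a load much earlier when its latency dwarfs the other
  // candidate's (the patch uses a 10x ratio); stores are excluded via
  // the MayLoad flag.
  bool TryLong = Try.MayLoad && Try.Latency > 10 * Best.Latency;
  bool BestLong = Best.MayLoad && Best.Latency > 10 * Try.Latency;
  if (TryLong != BestLong)
    return TryLong;
  return false; // otherwise fall through to the generic heuristics
}

int main() {
  Candidate ImageSample{/*IsNextInCluster=*/true, /*MayLoad=*/true, /*Latency=*/200};
  Candidate ShortValu{/*IsNextInCluster=*/false, /*MayLoad=*/false, /*Latency=*/4};
  std::printf("pick image_sample first: %s\n",
              preferTry(ImageSample, ShortValu) ? "yes" : "no");
  return 0;
}

The ordering matters: because the cluster check runs before the stall-latency heuristic, back-to-back image_sample instructions can form a single clause even when the generic heuristics would interleave them with shorter ALU work.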

This is a motivating example that drives us to do better at grouping
image sample instructions.
The AMDGPU-specific version mainly includes two major differences:
1. Try to cluster memory instructions more aggressively.
2. Try to schedule long-latency loads earlier than short-latency
   instructions.

I tested locally against about 470 real shaders and got the following perf
changes (counting only changes over +/-10%):
About 15 shaders improved by 10%~40%.
Only 3 shaders regressed by ~10%.

(This was tested together with another change that increases the maximum clustered dwords from 8 to 32.)
I will make a follow-up change to make that threshold configurable.
@llvmbot
Member

llvmbot commented Nov 5, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Ruiling, Song (ruiling)

Changes

The AMDGPU-specific version mainly includes two major differences:

  1. Try to cluster memory instructions more aggressively.
  2. Try to schedule long-latency loads earlier than short-latency
     instructions.

I tested locally against about 470 real shaders and got the following perf
changes (counting only changes over +/-10%):
About 15 shaders improved by 10%~40%.
Only 3 shaders regressed by ~10%.

(This was tested together with another change that increases the maximum clustered dwords from 8 to 32.)
I will make a follow-up change to make that threshold configurable.


Patch is 38.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/114957.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+134)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+3)
  • (added) llvm/test/CodeGen/AMDGPU/group-image-instructions.ll (+488)
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 57f517bfba0ebb..37802d335fb9fd 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -63,6 +63,10 @@ static cl::opt<bool> GCNTrackers(
     cl::desc("Use the AMDGPU specific RPTrackers during scheduling"),
     cl::init(false));
 
+static cl::opt<bool> UseAMDGPUScheduleHeuristic(
+    "amdgpu-use-amdgpu-schedule-heuristic", cl::Hidden,
+    cl::desc("Use AMDGPU specific schedule heuristic "), cl::init(false));
+
 const unsigned ScheduleMetrics::ScaleFactor = 100;
 
 GCNSchedStrategy::GCNSchedStrategy(const MachineSchedContext *C)
@@ -311,6 +315,136 @@ void GCNSchedStrategy::initCandidate(SchedCandidate &Cand, SUnit *SU,
   }
 }
 
+/// AMDGPU specific implementation, which is largely copy-pasted from the
+/// generic version, with some modifications to better hide memory latency.
+/// Major differences from the generic version:
+/// 1. Prioritize clustered operations before the stall latency heuristic.
+/// 2. Prioritize long-latency loads before the stall latency heuristic.
+///
+/// \param Cand provides the policy and current best candidate.
+/// \param TryCand refers to the next SUnit candidate, otherwise uninitialized.
+/// \param Zone describes the scheduled zone that we are extending, or nullptr
+///             if Cand is from a different zone than TryCand.
+/// \return \c true if TryCand is better than Cand (Reason is NOT NoCand)
+bool GCNSchedStrategy::tryCandidate(SchedCandidate &Cand,
+                                    SchedCandidate &TryCand,
+                                    SchedBoundary *Zone) const {
+  if (!UseAMDGPUScheduleHeuristic)
+    return GenericScheduler::tryCandidate(Cand, TryCand, Zone);
+
+  // Initialize the candidate if needed.
+  if (!Cand.isValid()) {
+    TryCand.Reason = NodeOrder;
+    return true;
+  }
+
+  // Bias PhysReg Defs and copies to their uses and defined respectively.
+  if (tryGreater(biasPhysReg(TryCand.SU, TryCand.AtTop),
+                 biasPhysReg(Cand.SU, Cand.AtTop), TryCand, Cand, PhysReg))
+    return TryCand.Reason != NoCand;
+
+  // Avoid exceeding the target's limit.
+  if (DAG->isTrackingPressure() &&
+      tryPressure(TryCand.RPDelta.Excess, Cand.RPDelta.Excess, TryCand, Cand,
+                  RegExcess, TRI, DAG->MF))
+    return TryCand.Reason != NoCand;
+
+  // Avoid increasing the max critical pressure in the scheduled region.
+  if (DAG->isTrackingPressure() &&
+      tryPressure(TryCand.RPDelta.CriticalMax, Cand.RPDelta.CriticalMax,
+                  TryCand, Cand, RegCritical, TRI, DAG->MF))
+    return TryCand.Reason != NoCand;
+
+  // AMDGPU-specific: We prioritize clustered instructions as we would get more
+  // benefit from clausing these memory instructions.
+  const SUnit *CandNextClusterSU =
+      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
+  const SUnit *TryCandNextClusterSU =
+      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
+  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
+                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+    return TryCand.Reason != NoCand;
+
+  // We only compare a subset of features when comparing nodes between
+  // Top and Bottom boundary. Some properties are simply incomparable, in many
+  // other instances we should only override the other boundary if something
+  // is a clear good pick on one boundary. Skip heuristics that are more
+  // "tie-breaking" in nature.
+  bool SameBoundary = Zone != nullptr;
+  if (SameBoundary) {
+    // For loops that are acyclic path limited, aggressively schedule for
+    // latency. Within a single cycle, whenever CurrMOps > 0, allow normal
+    // heuristics to take precedence.
+    if (Rem.IsAcyclicLatencyLimited && !Zone->getCurrMOps() &&
+        tryLatency(TryCand, Cand, *Zone))
+      return TryCand.Reason != NoCand;
+
+    // AMDGPU-specific: Prioritize long-latency memory load instructions in
+    // top-bottom order to hide more latency. The mayLoad check is used
+    // to exclude store-like instructions, which we do not want to schedule
+    // too early.
+    bool TryMayLoad =
+        TryCand.SU->isInstr() && TryCand.SU->getInstr()->mayLoad();
+    bool CandMayLoad = Cand.SU->isInstr() && Cand.SU->getInstr()->mayLoad();
+
+    if (TryMayLoad || CandMayLoad) {
+      bool TryLongLatency =
+          TryCand.SU->Latency > 10 * Cand.SU->Latency && TryMayLoad;
+      bool CandLongLatency =
+          10 * TryCand.SU->Latency < Cand.SU->Latency && CandMayLoad;
+
+      if (tryGreater(Zone->isTop() ? TryLongLatency : CandLongLatency,
+                     Zone->isTop() ? CandLongLatency : TryLongLatency, TryCand,
+                     Cand, Stall))
+        return TryCand.Reason != NoCand;
+    }
+    // Prioritize instructions that read unbuffered resources by stall cycles.
+    if (tryLess(Zone->getLatencyStallCycles(TryCand.SU),
+                Zone->getLatencyStallCycles(Cand.SU), TryCand, Cand, Stall))
+      return TryCand.Reason != NoCand;
+  }
+
+  if (SameBoundary) {
+    // Weak edges are for clustering and other constraints.
+    if (tryLess(getWeakLeft(TryCand.SU, TryCand.AtTop),
+                getWeakLeft(Cand.SU, Cand.AtTop), TryCand, Cand, Weak))
+      return TryCand.Reason != NoCand;
+  }
+
+  // Avoid increasing the max pressure of the entire region.
+  if (DAG->isTrackingPressure() &&
+      tryPressure(TryCand.RPDelta.CurrentMax, Cand.RPDelta.CurrentMax, TryCand,
+                  Cand, RegMax, TRI, DAG->MF))
+    return TryCand.Reason != NoCand;
+
+  if (SameBoundary) {
+    // Avoid critical resource consumption and balance the schedule.
+    TryCand.initResourceDelta(DAG, SchedModel);
+    if (tryLess(TryCand.ResDelta.CritResources, Cand.ResDelta.CritResources,
+                TryCand, Cand, ResourceReduce))
+      return TryCand.Reason != NoCand;
+    if (tryGreater(TryCand.ResDelta.DemandedResources,
+                   Cand.ResDelta.DemandedResources, TryCand, Cand,
+                   ResourceDemand))
+      return TryCand.Reason != NoCand;
+
+    // Avoid serializing long latency dependence chains.
+    // For acyclic path limited loops, latency was already checked above.
+    if (!RegionPolicy.DisableLatencyHeuristic && TryCand.Policy.ReduceLatency &&
+        !Rem.IsAcyclicLatencyLimited && tryLatency(TryCand, Cand, *Zone))
+      return TryCand.Reason != NoCand;
+
+    // Fall through to original instruction order.
+    if ((Zone->isTop() && TryCand.SU->NodeNum < Cand.SU->NodeNum) ||
+        (!Zone->isTop() && TryCand.SU->NodeNum > Cand.SU->NodeNum)) {
+      TryCand.Reason = NodeOrder;
+      return true;
+    }
+  }
+
+  return false;
+}
+
 // This function is mostly cut and pasted from
 // GenericScheduler::pickNodeFromQueue()
 void GCNSchedStrategy::pickNodeFromQueue(SchedBoundary &Zone,
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
index 64d517038f90e0..addb05922cee66 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
@@ -41,6 +41,9 @@ raw_ostream &operator<<(raw_ostream &OS, const GCNSchedStageID &StageID);
 /// heuristics to determine excess/critical pressure sets.
 class GCNSchedStrategy : public GenericScheduler {
 protected:
+  bool tryCandidate(SchedCandidate &Cand, SchedCandidate &TryCand,
+                    SchedBoundary *Zone) const override;
+
   SUnit *pickNodeBidirectional(bool &IsTopNode);
 
   void pickNodeFromQueue(SchedBoundary &Zone, const CandPolicy &ZonePolicy,
diff --git a/llvm/test/CodeGen/AMDGPU/group-image-instructions.ll b/llvm/test/CodeGen/AMDGPU/group-image-instructions.ll
new file mode 100644
index 00000000000000..8644cd3cc1ef85
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/group-image-instructions.ll
@@ -0,0 +1,488 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -amdgpu-use-amdgpu-schedule-heuristic=true -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX11 %s
+
+define amdgpu_ps void @group_image_sample(i32 inreg noundef %globalTable, i32 inreg noundef %userdata6, i32 inreg noundef %userdata7, i32 inreg noundef %userdata8, i32 inreg noundef %PrimMask, <2 x float> noundef %PerspInterpSample, <2 x float> noundef %PerspInterpCenter, <2 x float> noundef %PerspInterpCentroid) #2 {
+; GFX11-LABEL: group_image_sample:
+; GFX11:       ; %bb.0: ; %.entry
+; GFX11-NEXT:    s_mov_b64 s[16:17], exec
+; GFX11-NEXT:    s_wqm_b64 exec, exec
+; GFX11-NEXT:    s_mov_b32 m0, s4
+; GFX11-NEXT:    s_getpc_b64 s[4:5]
+; GFX11-NEXT:    s_mov_b32 s0, s1
+; GFX11-NEXT:    s_mov_b32 s6, s3
+; GFX11-NEXT:    s_mov_b32 s1, s5
+; GFX11-NEXT:    s_mov_b32 s3, s5
+; GFX11-NEXT:    s_mov_b32 s7, s5
+; GFX11-NEXT:    s_load_b128 s[12:15], s[0:1], 0x0
+; GFX11-NEXT:    s_load_b128 s[8:11], s[2:3], 0x0
+; GFX11-NEXT:    s_load_b256 s[0:7], s[6:7], 0x0
+; GFX11-NEXT:    s_mov_b64 s[18:19], exec
+; GFX11-NEXT:    s_wqm_b64 exec, exec
+; GFX11-NEXT:    lds_param_load v2, attr0.y wait_vdst:15
+; GFX11-NEXT:    lds_param_load v3, attr0.x wait_vdst:15
+; GFX11-NEXT:    s_mov_b64 exec, s[18:19]
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_clause 0x3
+; GFX11-NEXT:    s_buffer_load_b64 s[18:19], s[12:15], 0x10
+; GFX11-NEXT:    s_buffer_load_b64 s[20:21], s[12:15], 0x20
+; GFX11-NEXT:    s_buffer_load_b64 s[22:23], s[12:15], 0x30
+; GFX11-NEXT:    s_buffer_load_b64 s[24:25], s[12:15], 0x40
+; GFX11-NEXT:    v_interp_p10_f32 v4, v2, v0, v2 wait_exp:1
+; GFX11-NEXT:    v_interp_p10_f32 v0, v3, v0, v3 wait_exp:0
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    v_interp_p2_f32 v45, v2, v1, v4 wait_exp:7
+; GFX11-NEXT:    v_interp_p2_f32 v44, v3, v1, v0 wait_exp:7
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-NEXT:    v_add_f32_e32 v0, s18, v44
+; GFX11-NEXT:    v_add_f32_e32 v1, s19, v45
+; GFX11-NEXT:    v_add_f32_e32 v8, s20, v44
+; GFX11-NEXT:    v_add_f32_e32 v9, s21, v45
+; GFX11-NEXT:    v_add_f32_e32 v16, s24, v44
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[4:7], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[8:11], v[8:9], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    v_add_f32_e32 v0, s22, v44
+; GFX11-NEXT:    v_add_f32_e32 v1, s23, v45
+; GFX11-NEXT:    v_add_f32_e32 v17, s25, v45
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[12:15], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[16:19], v[16:17], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_clause 0x3
+; GFX11-NEXT:    s_buffer_load_b64 s[18:19], s[12:15], 0x50
+; GFX11-NEXT:    s_buffer_load_b64 s[20:21], s[12:15], 0x60
+; GFX11-NEXT:    s_buffer_load_b64 s[22:23], s[12:15], 0x70
+; GFX11-NEXT:    s_buffer_load_b64 s[24:25], s[12:15], 0x80
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_add_f32_e32 v0, s18, v44
+; GFX11-NEXT:    v_add_f32_e32 v1, s19, v45
+; GFX11-NEXT:    v_add_f32_e32 v24, s20, v44
+; GFX11-NEXT:    v_add_f32_e32 v25, s21, v45
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[20:23], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[24:27], v[24:25], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_clause 0x7
+; GFX11-NEXT:    s_buffer_load_b64 s[18:19], s[12:15], 0x90
+; GFX11-NEXT:    s_buffer_load_b64 s[20:21], s[12:15], 0xa0
+; GFX11-NEXT:    s_buffer_load_b64 s[26:27], s[12:15], 0xb0
+; GFX11-NEXT:    s_buffer_load_b64 s[28:29], s[12:15], 0xc0
+; GFX11-NEXT:    s_buffer_load_b64 s[30:31], s[12:15], 0xd0
+; GFX11-NEXT:    s_buffer_load_b64 s[34:35], s[12:15], 0xe0
+; GFX11-NEXT:    s_buffer_load_b64 s[36:37], s[12:15], 0xf0
+; GFX11-NEXT:    s_buffer_load_b64 s[12:13], s[12:15], 0x100
+; GFX11-NEXT:    v_add_f32_e32 v0, s22, v44
+; GFX11-NEXT:    v_add_f32_e32 v1, s23, v45
+; GFX11-NEXT:    v_add_f32_e32 v28, s24, v44
+; GFX11-NEXT:    v_add_f32_e32 v29, s25, v45
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_add_f32_e32 v30, s18, v44
+; GFX11-NEXT:    v_add_f32_e32 v31, s19, v45
+; GFX11-NEXT:    v_add_f32_e32 v32, s20, v44
+; GFX11-NEXT:    v_add_f32_e32 v33, s21, v45
+; GFX11-NEXT:    v_add_f32_e32 v34, s26, v44
+; GFX11-NEXT:    v_add_f32_e32 v35, s27, v45
+; GFX11-NEXT:    v_add_f32_e32 v36, s28, v44
+; GFX11-NEXT:    v_add_f32_e32 v37, s29, v45
+; GFX11-NEXT:    v_add_f32_e32 v38, s30, v44
+; GFX11-NEXT:    v_add_f32_e32 v39, s31, v45
+; GFX11-NEXT:    v_add_f32_e32 v40, s34, v44
+; GFX11-NEXT:    v_add_f32_e32 v41, s35, v45
+; GFX11-NEXT:    v_add_f32_e32 v42, s36, v44
+; GFX11-NEXT:    v_add_f32_e32 v43, s37, v45
+; GFX11-NEXT:    v_add_f32_e32 v44, s12, v44
+; GFX11-NEXT:    v_add_f32_e32 v45, s13, v45
+; GFX11-NEXT:    s_waitcnt vmcnt(4)
+; GFX11-NEXT:    v_add_f32_e32 v46, v8, v4
+; GFX11-NEXT:    v_add_f32_e32 v47, v9, v5
+; GFX11-NEXT:    v_add_f32_e32 v48, v10, v6
+; GFX11-NEXT:    v_add_f32_e32 v49, v11, v7
+; GFX11-NEXT:    s_and_b64 exec, exec, s[16:17]
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[4:7], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[8:11], v[28:29], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(5)
+; GFX11-NEXT:    v_add_f32_e32 v0, v12, v46
+; GFX11-NEXT:    v_add_f32_e32 v1, v13, v47
+; GFX11-NEXT:    v_add_f32_e32 v46, v14, v48
+; GFX11-NEXT:    v_add_f32_e32 v47, v15, v49
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[12:15], v[30:31], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[28:31], v[32:33], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(6)
+; GFX11-NEXT:    v_add_f32_e32 v0, v16, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v17, v1
+; GFX11-NEXT:    v_add_f32_e32 v46, v18, v46
+; GFX11-NEXT:    v_add_f32_e32 v47, v19, v47
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[16:19], v[34:35], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[32:35], v[36:37], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(7)
+; GFX11-NEXT:    v_add_f32_e32 v0, v20, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v21, v1
+; GFX11-NEXT:    v_add_f32_e32 v46, v22, v46
+; GFX11-NEXT:    v_add_f32_e32 v47, v23, v47
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[20:23], v[38:39], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[36:39], v[40:41], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(8)
+; GFX11-NEXT:    v_add_f32_e32 v0, v24, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v25, v1
+; GFX11-NEXT:    v_add_f32_e32 v46, v26, v46
+; GFX11-NEXT:    v_add_f32_e32 v47, v27, v47
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    image_sample v[24:27], v[42:43], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    image_sample v[40:43], v[44:45], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT:    s_waitcnt vmcnt(9)
+; GFX11-NEXT:    v_add_f32_e32 v0, v4, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v5, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v6, v46
+; GFX11-NEXT:    v_add_f32_e32 v5, v7, v47
+; GFX11-NEXT:    s_waitcnt vmcnt(8)
+; GFX11-NEXT:    v_add_f32_e32 v0, v8, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v9, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v10, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v11, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(7)
+; GFX11-NEXT:    v_add_f32_e32 v0, v12, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v13, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v14, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v15, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(6)
+; GFX11-NEXT:    v_add_f32_e32 v0, v28, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v29, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v30, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v31, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(5)
+; GFX11-NEXT:    v_add_f32_e32 v0, v16, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v17, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v18, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v19, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(4)
+; GFX11-NEXT:    v_add_f32_e32 v0, v32, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v33, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v34, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v35, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(3)
+; GFX11-NEXT:    v_add_f32_e32 v0, v20, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v21, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v22, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v23, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(2)
+; GFX11-NEXT:    v_add_f32_e32 v0, v36, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v37, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v38, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v39, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(1)
+; GFX11-NEXT:    v_add_f32_e32 v0, v24, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v25, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v26, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v27, v5
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-NEXT:    v_add_f32_e32 v0, v40, v0
+; GFX11-NEXT:    v_add_f32_e32 v1, v41, v1
+; GFX11-NEXT:    v_add_f32_e32 v4, v42, v4
+; GFX11-NEXT:    v_add_f32_e32 v5, v43, v5
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT:    v_cvt_pk_rtz_f16_f32_e32 v0, v0, v1
+; GFX11-NEXT:    v_cvt_pk_rtz_f16_f32_e32 v1, v4, v5
+; GFX11-NEXT:    exp mrt0 v0, v1, off, off done
+; GFX11-NEXT:    s_endpgm
+.entry:
+  %0 = call i64 @llvm.amdgcn.s.getpc()
+  %1 = and i64 %0, -4294967296
+  %2 = zext i32 %userdata6 to i64
+  %3 = or disjoint i64 %1, %2
+  %4 = inttoptr i64 %3 to ptr addrspace(4)
+  %5 = load <4 x i32>, ptr addrspace(4) %4, align 16
+  %6 = zext i32 %userdata7 to i64
+  %7 = or disjoint i64 %1, %6
+  %8 = inttoptr i64 %7 to ptr addrspace(4)
+  %9 = load <4 x i32>, ptr addrspace(4) %8, align 4, !invariant.load !0
+  %10 = zext i32 %userdata8 to i64
+  %11 = or disjoint i64 %1, %10
+  %12 = inttoptr i64 %11 to ptr addrspace(4)
+  %13 = load <8 x i32>, ptr addrspace(4) %12, align 4, !invariant.load !0
+  %14 = call float @llvm.amdgcn.lds.param.load(i32 1, i32 0, i32 %PrimMask)
+  %PerspInterpCenter.i1 = extractelement <2 x float> %PerspInterpCenter, i64 1
+  %PerspInterpCenter.i0 = extractelement <2 x float> %PerspInterpCenter, i64 0
+  %15 = call float @llvm.amdgcn.interp.inreg.p10(float %14, float %PerspInterpCenter.i0, float %14)
+  %16 = call float @llvm.amdgcn.interp.inreg.p2(float %14, float %PerspInterpCenter.i1, float %15)
+  %17 = call float @llvm.amdgcn.lds.param.load(i32 0, i32 0, i32 %PrimMask)
+  %18 = call float @llvm.amdgcn.interp.inreg.p10(float %17, float %PerspInterpCenter.i0, float %17)
+  %19 = call float @llvm.amdgcn.interp.inreg.p2(float %17, float %PerspInterpCenter.i1, float %18)
+  %20 = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %5, i32 16, i32 0), !invariant.load !0
+  %21 = shufflevector <2 x i32> %20, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
+  %22 = bitcast <4 x i32> %21 to <4 x float>
+  %.i0 = extractelement <4 x float> %22, i64 0
+  %.i1 = extractelement <4 x float> %22, i64 1
+  %.i03 = fadd reassoc nnan nsz arcp contract afn float %.i0, %19
+  %.i14 = fadd reassoc nnan nsz arcp contract afn float %.i1, %16
+  %23 = call reassoc nnan nsz arcp contract afn <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float %.i03, float %.i14, <8 x i32> %13, <4 x i32> %9, i1 false, i32 0, i32 0)
+  %.i010 = extractelement <4 x float> %23, i64 0
+  %.i113 = extractelement <4 x float> %23, i64 1
+  %.i215 = extractelement <4 x float> %23, i64 2
+  %.i317 = extractelement <4 x float> %23, i64 3
+  %24 = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %5, i32 32, i32 0), !invariant.load !0
+  %25 = shufflevector <2 x i32> %24, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
+  %26 = bitcast <4 x i32> %25 to <4 x float>
+  %.i05 = extractelement <4 x float> %26, i64 0
+  %.i16 = extractelement <4 x float> %26, i64 1
+  %.i07 = fadd reassoc nnan nsz arcp contract afn float %.i05, %19
+  %.i18 = fadd reassoc nnan nsz arcp contract afn float %.i16, %16
+  %27 = call reassoc nnan nsz arcp contract afn <4 x float> @llvm.a...
[truncated]

@arsenm arsenm requested review from kerbowa and jrbyrnes November 5, 2024 15:38
@ruiling ruiling changed the title from "[AMDGPU] Add AMDGPU specific tryCandidate()" to "[AMDGPU] Add MaxMemoryClauseSchedStrategy" on Nov 26, 2024
The graphics frontend cannot set a command-line option per function.
Add a function attribute to allow the frontend to select the strategy
per function.
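
As a rough illustration of what such a per-function override might look like in IR, here is a hypothetical sketch; the attribute name "amdgpu-sched-strategy" and the value "max-memory-clause" are assumptions based on the renamed strategy and are not copied from this patch.

; Hypothetical IR sketch: the attribute name and value below are assumed,
; not taken from the diff shown in this PR.
define amdgpu_ps void @per_function_override() #0 {
entry:
  ret void
}

attributes #0 = { "amdgpu-sched-strategy"="max-memory-clause" }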
@ruiling
Contributor Author

ruiling commented Dec 3, 2024

ping

Contributor

@arsenm arsenm left a comment


lgtm with nits

@ruiling ruiling merged commit b33c807 into llvm:main Dec 9, 2024
8 checks passed