[AMDGPU] Add MaxMemoryClauseSchedStrategy #114957
Conversation
This is a motivating example that drives us to do better on grouping image sample instructions.

The AMDGPU-specific version mainly includes two major differences:
1. Try to cluster memory instructions more aggressively.
2. Try to schedule long-latency loads earlier than short-latency instructions.

I tested locally against about 470 real shaders and got the following perf changes (only counting changes over +/-10%): about 15 shaders improved by 10%~40%, and only 3 shaders dropped by ~10%. (This was tested together with another change which increases the maximum clustered dwords from 8 to 32; I will make another change to make that threshold configurable.)
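For anyone who wants to try this locally: the new behavior is opt-in behind the hidden flag added by this patch, and the RUN line of the new test below shows a minimal invocation (the -mcpu value is simply the target the test uses):

; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -amdgpu-use-amdgpu-schedule-heuristic=true -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX11 %s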
@llvm/pr-subscribers-backend-amdgpu

Author: Ruiling, Song (ruiling)

Patch is 38.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/114957.diff 3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 57f517bfba0ebb..37802d335fb9fd 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -63,6 +63,10 @@ static cl::opt<bool> GCNTrackers(
cl::desc("Use the AMDGPU specific RPTrackers during scheduling"),
cl::init(false));
+static cl::opt<bool> UseAMDGPUScheduleHeuristic(
+ "amdgpu-use-amdgpu-schedule-heuristic", cl::Hidden,
+ cl::desc("Use AMDGPU specific schedule heuristic "), cl::init(false));
+
const unsigned ScheduleMetrics::ScaleFactor = 100;
GCNSchedStrategy::GCNSchedStrategy(const MachineSchedContext *C)
@@ -311,6 +315,136 @@ void GCNSchedStrategy::initCandidate(SchedCandidate &Cand, SUnit *SU,
}
}
+/// AMDGPU specific implementation, which is largely copy-pasted from the
+/// generic version, with some modifications to better hide memory latency.
+/// Major differences from the generic version:
+/// 1. Prioritize clustered operations before the stall latency heuristic.
+/// 2. Prioritize long-latency loads before the stall latency heuristic.
+///
+/// \param Cand provides the policy and current best candidate.
+/// \param TryCand refers to the next SUnit candidate, otherwise uninitialized.
+/// \param Zone describes the scheduled zone that we are extending, or nullptr
+/// if Cand is from a different zone than TryCand.
+/// \return \c true if TryCand is better than Cand (Reason is NOT NoCand)
+bool GCNSchedStrategy::tryCandidate(SchedCandidate &Cand,
+ SchedCandidate &TryCand,
+ SchedBoundary *Zone) const {
+ if (!UseAMDGPUScheduleHeuristic)
+ return GenericScheduler::tryCandidate(Cand, TryCand, Zone);
+
+ // Initialize the candidate if needed.
+ if (!Cand.isValid()) {
+ TryCand.Reason = NodeOrder;
+ return true;
+ }
+
+ // Bias PhysReg Defs and copies to their uses and defined respectively.
+ if (tryGreater(biasPhysReg(TryCand.SU, TryCand.AtTop),
+ biasPhysReg(Cand.SU, Cand.AtTop), TryCand, Cand, PhysReg))
+ return TryCand.Reason != NoCand;
+
+ // Avoid exceeding the target's limit.
+ if (DAG->isTrackingPressure() &&
+ tryPressure(TryCand.RPDelta.Excess, Cand.RPDelta.Excess, TryCand, Cand,
+ RegExcess, TRI, DAG->MF))
+ return TryCand.Reason != NoCand;
+
+ // Avoid increasing the max critical pressure in the scheduled region.
+ if (DAG->isTrackingPressure() &&
+ tryPressure(TryCand.RPDelta.CriticalMax, Cand.RPDelta.CriticalMax,
+ TryCand, Cand, RegCritical, TRI, DAG->MF))
+ return TryCand.Reason != NoCand;
+
+ // AMDGPU-specific: We prioritize clustered instructions as we would get more
+ // benefit from clausing these memory instructions.
+ const SUnit *CandNextClusterSU =
+ Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
+ const SUnit *TryCandNextClusterSU =
+ TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
+ if (tryGreater(TryCand.SU == TryCandNextClusterSU,
+ Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
+ return TryCand.Reason != NoCand;
+
+ // We only compare a subset of features when comparing nodes between
+ // Top and Bottom boundary. Some properties are simply incomparable, in many
+ // other instances we should only override the other boundary if something
+ // is a clear good pick on one boundary. Skip heuristics that are more
+ // "tie-breaking" in nature.
+ bool SameBoundary = Zone != nullptr;
+ if (SameBoundary) {
+ // For loops that are acyclic path limited, aggressively schedule for
+ // latency. Within an single cycle, whenever CurrMOps > 0, allow normal
+ // heuristics to take precedence.
+ if (Rem.IsAcyclicLatencyLimited && !Zone->getCurrMOps() &&
+ tryLatency(TryCand, Cand, *Zone))
+ return TryCand.Reason != NoCand;
+
+ // AMDGPU-specific: Prioritize long latency memory load instructions in
+ // top-bottom order to hide more latency. The mayLoad check is used
+ // to exclude store-like instructions, which we do not want to schedule
+ // too early.
+ bool TryMayLoad =
+ TryCand.SU->isInstr() && TryCand.SU->getInstr()->mayLoad();
+ bool CandMayLoad = Cand.SU->isInstr() && Cand.SU->getInstr()->mayLoad();
+
+ if (TryMayLoad || CandMayLoad) {
+ bool TryLongLatency =
+ TryCand.SU->Latency > 10 * Cand.SU->Latency && TryMayLoad;
+ bool CandLongLatency =
+ 10 * TryCand.SU->Latency < Cand.SU->Latency && CandMayLoad;
+
+ if (tryGreater(Zone->isTop() ? TryLongLatency : CandLongLatency,
+ Zone->isTop() ? CandLongLatency : TryLongLatency, TryCand,
+ Cand, Stall))
+ return TryCand.Reason != NoCand;
+ }
+ // Prioritize instructions that read unbuffered resources by stall cycles.
+ if (tryLess(Zone->getLatencyStallCycles(TryCand.SU),
+ Zone->getLatencyStallCycles(Cand.SU), TryCand, Cand, Stall))
+ return TryCand.Reason != NoCand;
+ }
+
+ if (SameBoundary) {
+ // Weak edges are for clustering and other constraints.
+ if (tryLess(getWeakLeft(TryCand.SU, TryCand.AtTop),
+ getWeakLeft(Cand.SU, Cand.AtTop), TryCand, Cand, Weak))
+ return TryCand.Reason != NoCand;
+ }
+
+ // Avoid increasing the max pressure of the entire region.
+ if (DAG->isTrackingPressure() &&
+ tryPressure(TryCand.RPDelta.CurrentMax, Cand.RPDelta.CurrentMax, TryCand,
+ Cand, RegMax, TRI, DAG->MF))
+ return TryCand.Reason != NoCand;
+
+ if (SameBoundary) {
+ // Avoid critical resource consumption and balance the schedule.
+ TryCand.initResourceDelta(DAG, SchedModel);
+ if (tryLess(TryCand.ResDelta.CritResources, Cand.ResDelta.CritResources,
+ TryCand, Cand, ResourceReduce))
+ return TryCand.Reason != NoCand;
+ if (tryGreater(TryCand.ResDelta.DemandedResources,
+ Cand.ResDelta.DemandedResources, TryCand, Cand,
+ ResourceDemand))
+ return TryCand.Reason != NoCand;
+
+ // Avoid serializing long latency dependence chains.
+ // For acyclic path limited loops, latency was already checked above.
+ if (!RegionPolicy.DisableLatencyHeuristic && TryCand.Policy.ReduceLatency &&
+ !Rem.IsAcyclicLatencyLimited && tryLatency(TryCand, Cand, *Zone))
+ return TryCand.Reason != NoCand;
+
+ // Fall through to original instruction order.
+ if ((Zone->isTop() && TryCand.SU->NodeNum < Cand.SU->NodeNum) ||
+ (!Zone->isTop() && TryCand.SU->NodeNum > Cand.SU->NodeNum)) {
+ TryCand.Reason = NodeOrder;
+ return true;
+ }
+ }
+
+ return false;
+}
+
// This function is mostly cut and pasted from
// GenericScheduler::pickNodeFromQueue()
void GCNSchedStrategy::pickNodeFromQueue(SchedBoundary &Zone,
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
index 64d517038f90e0..addb05922cee66 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
@@ -41,6 +41,9 @@ raw_ostream &operator<<(raw_ostream &OS, const GCNSchedStageID &StageID);
/// heuristics to determine excess/critical pressure sets.
class GCNSchedStrategy : public GenericScheduler {
protected:
+ bool tryCandidate(SchedCandidate &Cand, SchedCandidate &TryCand,
+ SchedBoundary *Zone) const override;
+
SUnit *pickNodeBidirectional(bool &IsTopNode);
void pickNodeFromQueue(SchedBoundary &Zone, const CandPolicy &ZonePolicy,
diff --git a/llvm/test/CodeGen/AMDGPU/group-image-instructions.ll b/llvm/test/CodeGen/AMDGPU/group-image-instructions.ll
new file mode 100644
index 00000000000000..8644cd3cc1ef85
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/group-image-instructions.ll
@@ -0,0 +1,488 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -amdgpu-use-amdgpu-schedule-heuristic=true -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX11 %s
+
+define amdgpu_ps void @group_image_sample(i32 inreg noundef %globalTable, i32 inreg noundef %userdata6, i32 inreg noundef %userdata7, i32 inreg noundef %userdata8, i32 inreg noundef %PrimMask, <2 x float> noundef %PerspInterpSample, <2 x float> noundef %PerspInterpCenter, <2 x float> noundef %PerspInterpCentroid) #2 {
+; GFX11-LABEL: group_image_sample:
+; GFX11: ; %bb.0: ; %.entry
+; GFX11-NEXT: s_mov_b64 s[16:17], exec
+; GFX11-NEXT: s_wqm_b64 exec, exec
+; GFX11-NEXT: s_mov_b32 m0, s4
+; GFX11-NEXT: s_getpc_b64 s[4:5]
+; GFX11-NEXT: s_mov_b32 s0, s1
+; GFX11-NEXT: s_mov_b32 s6, s3
+; GFX11-NEXT: s_mov_b32 s1, s5
+; GFX11-NEXT: s_mov_b32 s3, s5
+; GFX11-NEXT: s_mov_b32 s7, s5
+; GFX11-NEXT: s_load_b128 s[12:15], s[0:1], 0x0
+; GFX11-NEXT: s_load_b128 s[8:11], s[2:3], 0x0
+; GFX11-NEXT: s_load_b256 s[0:7], s[6:7], 0x0
+; GFX11-NEXT: s_mov_b64 s[18:19], exec
+; GFX11-NEXT: s_wqm_b64 exec, exec
+; GFX11-NEXT: lds_param_load v2, attr0.y wait_vdst:15
+; GFX11-NEXT: lds_param_load v3, attr0.x wait_vdst:15
+; GFX11-NEXT: s_mov_b64 exec, s[18:19]
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: s_clause 0x3
+; GFX11-NEXT: s_buffer_load_b64 s[18:19], s[12:15], 0x10
+; GFX11-NEXT: s_buffer_load_b64 s[20:21], s[12:15], 0x20
+; GFX11-NEXT: s_buffer_load_b64 s[22:23], s[12:15], 0x30
+; GFX11-NEXT: s_buffer_load_b64 s[24:25], s[12:15], 0x40
+; GFX11-NEXT: v_interp_p10_f32 v4, v2, v0, v2 wait_exp:1
+; GFX11-NEXT: v_interp_p10_f32 v0, v3, v0, v3 wait_exp:0
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT: v_interp_p2_f32 v45, v2, v1, v4 wait_exp:7
+; GFX11-NEXT: v_interp_p2_f32 v44, v3, v1, v0 wait_exp:7
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)
+; GFX11-NEXT: v_add_f32_e32 v0, s18, v44
+; GFX11-NEXT: v_add_f32_e32 v1, s19, v45
+; GFX11-NEXT: v_add_f32_e32 v8, s20, v44
+; GFX11-NEXT: v_add_f32_e32 v9, s21, v45
+; GFX11-NEXT: v_add_f32_e32 v16, s24, v44
+; GFX11-NEXT: s_clause 0x1
+; GFX11-NEXT: image_sample v[4:7], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: image_sample v[8:11], v[8:9], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: v_add_f32_e32 v0, s22, v44
+; GFX11-NEXT: v_add_f32_e32 v1, s23, v45
+; GFX11-NEXT: v_add_f32_e32 v17, s25, v45
+; GFX11-NEXT: s_clause 0x1
+; GFX11-NEXT: image_sample v[12:15], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: image_sample v[16:19], v[16:17], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: s_clause 0x3
+; GFX11-NEXT: s_buffer_load_b64 s[18:19], s[12:15], 0x50
+; GFX11-NEXT: s_buffer_load_b64 s[20:21], s[12:15], 0x60
+; GFX11-NEXT: s_buffer_load_b64 s[22:23], s[12:15], 0x70
+; GFX11-NEXT: s_buffer_load_b64 s[24:25], s[12:15], 0x80
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: v_add_f32_e32 v0, s18, v44
+; GFX11-NEXT: v_add_f32_e32 v1, s19, v45
+; GFX11-NEXT: v_add_f32_e32 v24, s20, v44
+; GFX11-NEXT: v_add_f32_e32 v25, s21, v45
+; GFX11-NEXT: s_clause 0x1
+; GFX11-NEXT: image_sample v[20:23], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: image_sample v[24:27], v[24:25], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: s_clause 0x7
+; GFX11-NEXT: s_buffer_load_b64 s[18:19], s[12:15], 0x90
+; GFX11-NEXT: s_buffer_load_b64 s[20:21], s[12:15], 0xa0
+; GFX11-NEXT: s_buffer_load_b64 s[26:27], s[12:15], 0xb0
+; GFX11-NEXT: s_buffer_load_b64 s[28:29], s[12:15], 0xc0
+; GFX11-NEXT: s_buffer_load_b64 s[30:31], s[12:15], 0xd0
+; GFX11-NEXT: s_buffer_load_b64 s[34:35], s[12:15], 0xe0
+; GFX11-NEXT: s_buffer_load_b64 s[36:37], s[12:15], 0xf0
+; GFX11-NEXT: s_buffer_load_b64 s[12:13], s[12:15], 0x100
+; GFX11-NEXT: v_add_f32_e32 v0, s22, v44
+; GFX11-NEXT: v_add_f32_e32 v1, s23, v45
+; GFX11-NEXT: v_add_f32_e32 v28, s24, v44
+; GFX11-NEXT: v_add_f32_e32 v29, s25, v45
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: v_add_f32_e32 v30, s18, v44
+; GFX11-NEXT: v_add_f32_e32 v31, s19, v45
+; GFX11-NEXT: v_add_f32_e32 v32, s20, v44
+; GFX11-NEXT: v_add_f32_e32 v33, s21, v45
+; GFX11-NEXT: v_add_f32_e32 v34, s26, v44
+; GFX11-NEXT: v_add_f32_e32 v35, s27, v45
+; GFX11-NEXT: v_add_f32_e32 v36, s28, v44
+; GFX11-NEXT: v_add_f32_e32 v37, s29, v45
+; GFX11-NEXT: v_add_f32_e32 v38, s30, v44
+; GFX11-NEXT: v_add_f32_e32 v39, s31, v45
+; GFX11-NEXT: v_add_f32_e32 v40, s34, v44
+; GFX11-NEXT: v_add_f32_e32 v41, s35, v45
+; GFX11-NEXT: v_add_f32_e32 v42, s36, v44
+; GFX11-NEXT: v_add_f32_e32 v43, s37, v45
+; GFX11-NEXT: v_add_f32_e32 v44, s12, v44
+; GFX11-NEXT: v_add_f32_e32 v45, s13, v45
+; GFX11-NEXT: s_waitcnt vmcnt(4)
+; GFX11-NEXT: v_add_f32_e32 v46, v8, v4
+; GFX11-NEXT: v_add_f32_e32 v47, v9, v5
+; GFX11-NEXT: v_add_f32_e32 v48, v10, v6
+; GFX11-NEXT: v_add_f32_e32 v49, v11, v7
+; GFX11-NEXT: s_and_b64 exec, exec, s[16:17]
+; GFX11-NEXT: s_clause 0x1
+; GFX11-NEXT: image_sample v[4:7], v[0:1], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: image_sample v[8:11], v[28:29], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: s_waitcnt vmcnt(5)
+; GFX11-NEXT: v_add_f32_e32 v0, v12, v46
+; GFX11-NEXT: v_add_f32_e32 v1, v13, v47
+; GFX11-NEXT: v_add_f32_e32 v46, v14, v48
+; GFX11-NEXT: v_add_f32_e32 v47, v15, v49
+; GFX11-NEXT: s_clause 0x1
+; GFX11-NEXT: image_sample v[12:15], v[30:31], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: image_sample v[28:31], v[32:33], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: s_waitcnt vmcnt(6)
+; GFX11-NEXT: v_add_f32_e32 v0, v16, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v17, v1
+; GFX11-NEXT: v_add_f32_e32 v46, v18, v46
+; GFX11-NEXT: v_add_f32_e32 v47, v19, v47
+; GFX11-NEXT: s_clause 0x1
+; GFX11-NEXT: image_sample v[16:19], v[34:35], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: image_sample v[32:35], v[36:37], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: s_waitcnt vmcnt(7)
+; GFX11-NEXT: v_add_f32_e32 v0, v20, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v21, v1
+; GFX11-NEXT: v_add_f32_e32 v46, v22, v46
+; GFX11-NEXT: v_add_f32_e32 v47, v23, v47
+; GFX11-NEXT: s_clause 0x1
+; GFX11-NEXT: image_sample v[20:23], v[38:39], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: image_sample v[36:39], v[40:41], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: s_waitcnt vmcnt(8)
+; GFX11-NEXT: v_add_f32_e32 v0, v24, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v25, v1
+; GFX11-NEXT: v_add_f32_e32 v46, v26, v46
+; GFX11-NEXT: v_add_f32_e32 v47, v27, v47
+; GFX11-NEXT: s_clause 0x1
+; GFX11-NEXT: image_sample v[24:27], v[42:43], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: image_sample v[40:43], v[44:45], s[0:7], s[8:11] dmask:0xf dim:SQ_RSRC_IMG_2D
+; GFX11-NEXT: s_waitcnt vmcnt(9)
+; GFX11-NEXT: v_add_f32_e32 v0, v4, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v5, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v6, v46
+; GFX11-NEXT: v_add_f32_e32 v5, v7, v47
+; GFX11-NEXT: s_waitcnt vmcnt(8)
+; GFX11-NEXT: v_add_f32_e32 v0, v8, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v9, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v10, v4
+; GFX11-NEXT: v_add_f32_e32 v5, v11, v5
+; GFX11-NEXT: s_waitcnt vmcnt(7)
+; GFX11-NEXT: v_add_f32_e32 v0, v12, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v13, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v14, v4
+; GFX11-NEXT: v_add_f32_e32 v5, v15, v5
+; GFX11-NEXT: s_waitcnt vmcnt(6)
+; GFX11-NEXT: v_add_f32_e32 v0, v28, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v29, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v30, v4
+; GFX11-NEXT: v_add_f32_e32 v5, v31, v5
+; GFX11-NEXT: s_waitcnt vmcnt(5)
+; GFX11-NEXT: v_add_f32_e32 v0, v16, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v17, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v18, v4
+; GFX11-NEXT: v_add_f32_e32 v5, v19, v5
+; GFX11-NEXT: s_waitcnt vmcnt(4)
+; GFX11-NEXT: v_add_f32_e32 v0, v32, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v33, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v34, v4
+; GFX11-NEXT: v_add_f32_e32 v5, v35, v5
+; GFX11-NEXT: s_waitcnt vmcnt(3)
+; GFX11-NEXT: v_add_f32_e32 v0, v20, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v21, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v22, v4
+; GFX11-NEXT: v_add_f32_e32 v5, v23, v5
+; GFX11-NEXT: s_waitcnt vmcnt(2)
+; GFX11-NEXT: v_add_f32_e32 v0, v36, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v37, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v38, v4
+; GFX11-NEXT: v_add_f32_e32 v5, v39, v5
+; GFX11-NEXT: s_waitcnt vmcnt(1)
+; GFX11-NEXT: v_add_f32_e32 v0, v24, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v25, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v26, v4
+; GFX11-NEXT: v_add_f32_e32 v5, v27, v5
+; GFX11-NEXT: s_waitcnt vmcnt(0)
+; GFX11-NEXT: v_add_f32_e32 v0, v40, v0
+; GFX11-NEXT: v_add_f32_e32 v1, v41, v1
+; GFX11-NEXT: v_add_f32_e32 v4, v42, v4
+; GFX11-NEXT: v_add_f32_e32 v5, v43, v5
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT: v_cvt_pk_rtz_f16_f32_e32 v0, v0, v1
+; GFX11-NEXT: v_cvt_pk_rtz_f16_f32_e32 v1, v4, v5
+; GFX11-NEXT: exp mrt0 v0, v1, off, off done
+; GFX11-NEXT: s_endpgm
+.entry:
+ %0 = call i64 @llvm.amdgcn.s.getpc()
+ %1 = and i64 %0, -4294967296
+ %2 = zext i32 %userdata6 to i64
+ %3 = or disjoint i64 %1, %2
+ %4 = inttoptr i64 %3 to ptr addrspace(4)
+ %5 = load <4 x i32>, ptr addrspace(4) %4, align 16
+ %6 = zext i32 %userdata7 to i64
+ %7 = or disjoint i64 %1, %6
+ %8 = inttoptr i64 %7 to ptr addrspace(4)
+ %9 = load <4 x i32>, ptr addrspace(4) %8, align 4, !invariant.load !0
+ %10 = zext i32 %userdata8 to i64
+ %11 = or disjoint i64 %1, %10
+ %12 = inttoptr i64 %11 to ptr addrspace(4)
+ %13 = load <8 x i32>, ptr addrspace(4) %12, align 4, !invariant.load !0
+ %14 = call float @llvm.amdgcn.lds.param.load(i32 1, i32 0, i32 %PrimMask)
+ %PerspInterpCenter.i1 = extractelement <2 x float> %PerspInterpCenter, i64 1
+ %PerspInterpCenter.i0 = extractelement <2 x float> %PerspInterpCenter, i64 0
+ %15 = call float @llvm.amdgcn.interp.inreg.p10(float %14, float %PerspInterpCenter.i0, float %14)
+ %16 = call float @llvm.amdgcn.interp.inreg.p2(float %14, float %PerspInterpCenter.i1, float %15)
+ %17 = call float @llvm.amdgcn.lds.param.load(i32 0, i32 0, i32 %PrimMask)
+ %18 = call float @llvm.amdgcn.interp.inreg.p10(float %17, float %PerspInterpCenter.i0, float %17)
+ %19 = call float @llvm.amdgcn.interp.inreg.p2(float %17, float %PerspInterpCenter.i1, float %18)
+ %20 = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %5, i32 16, i32 0), !invariant.load !0
+ %21 = shufflevector <2 x i32> %20, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
+ %22 = bitcast <4 x i32> %21 to <4 x float>
+ %.i0 = extractelement <4 x float> %22, i64 0
+ %.i1 = extractelement <4 x float> %22, i64 1
+ %.i03 = fadd reassoc nnan nsz arcp contract afn float %.i0, %19
+ %.i14 = fadd reassoc nnan nsz arcp contract afn float %.i1, %16
+ %23 = call reassoc nnan nsz arcp contract afn <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(i32 15, float %.i03, float %.i14, <8 x i32> %13, <4 x i32> %9, i1 false, i32 0, i32 0)
+ %.i010 = extractelement <4 x float> %23, i64 0
+ %.i113 = extractelement <4 x float> %23, i64 1
+ %.i215 = extractelement <4 x float> %23, i64 2
+ %.i317 = extractelement <4 x float> %23, i64 3
+ %24 = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %5, i32 32, i32 0), !invariant.load !0
+ %25 = shufflevector <2 x i32> %24, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
+ %26 = bitcast <4 x i32> %25 to <4 x float>
+ %.i05 = extractelement <4 x float> %26, i64 0
+ %.i16 = extractelement <4 x float> %26, i64 1
+ %.i07 = fadd reassoc nnan nsz arcp contract afn float %.i05, %19
+ %.i18 = fadd reassoc nnan nsz arcp contract afn float %.i16, %16
+ %27 = call reassoc nnan nsz arcp contract afn <4 x float> @llvm.a...
[truncated]
The graphics frontend cannot set a command-line option per function. Add a function attribute to allow the frontend to enable this per function.
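As a sketch of what that per-function control could look like (the attribute name here is purely hypothetical and is not defined by this patch), the frontend would tag individual functions in the IR:

; Hypothetical per-function opt-in; the attribute name "amdgpu-max-memory-clause-sched"
; is illustrative only and is not introduced by this patch.
define amdgpu_ps void @use_max_clause_sched() #0 {
  ret void
}
attributes #0 = { "amdgpu-max-memory-clause-sched"="true" }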
ping
lgtm with nits
The major behaviors of the max memory clause schedule strategy include:
1. Try to cluster memory instructions more aggressively.
2. Try to schedule long-latency loads earlier than short-latency instructions.
I tested locally against about 470 real shaders and got the following perf changes (only counting changes over +/-10%):
About 15 shaders improved by 10%~40%.
Only 3 shaders dropped by ~10%.
(This was tested together with another change which increases the maximum clustered dwords from 8 to 32. I will make another change to make that threshold configurable.)