Extend kernel-info to emit PGO-based FLOP count #110586

Draft · wants to merge 184 commits into main

Commits (184):
530eb98
Add profiling functions to libomptarget
EthanLuisMcDonough Dec 16, 2023
fb067d4
Fix PGO instrumentation for GPU targets
EthanLuisMcDonough Dec 16, 2023
7a0e0ef
Change global visibility on GPU targets
EthanLuisMcDonough Dec 21, 2023
fddc079
Make names global public on GPU
EthanLuisMcDonough Dec 23, 2023
e9db03c
Read and print GPU device PGO globals
EthanLuisMcDonough Dec 29, 2023
aa83bd2
Merge branch 'main' into gpuprof
EthanLuisMcDonough Dec 29, 2023
e468760
Fix rebase bug
EthanLuisMcDonough Jan 3, 2024
ec18ce9
Refactor portions to be more idiomatic
EthanLuisMcDonough Jan 3, 2024
0872556
Reformat DeviceRTL prof functions
EthanLuisMcDonough Jan 3, 2024
94f47f3
Merge branch 'main' into gpuprof
EthanLuisMcDonough Jan 3, 2024
62f31d1
Style changes + catch name error
EthanLuisMcDonough Jan 9, 2024
0c4bbeb
Add GPU PGO test
EthanLuisMcDonough Jan 18, 2024
c7ae2a7
Fix PGO test formatting
EthanLuisMcDonough Jan 18, 2024
9e66bfb
Merge branch 'main' into gpuprof
EthanLuisMcDonough Jan 19, 2024
8bb2207
Refactor visibility logic
EthanLuisMcDonough Jan 19, 2024
9f13943
Add LLVM instrumentation support
EthanLuisMcDonough Jan 24, 2024
b28d4a9
Merge branch 'main' into gpuprof
EthanLuisMcDonough Jan 24, 2024
23d7fe2
Merge branch 'main' into gpuprof
EthanLuisMcDonough Feb 14, 2024
0606f0d
Use explicit addrspace instead of unqual
EthanLuisMcDonough Feb 14, 2024
23f75b2
Merge branch 'main' into gpuprof
EthanLuisMcDonough Feb 15, 2024
c1f9be3
Remove redundant namespaces
EthanLuisMcDonough Feb 16, 2024
721dac6
Merge branch 'main' into gpuprof
EthanLuisMcDonough Feb 16, 2024
6a3ae40
Clang format
EthanLuisMcDonough Feb 16, 2024
6866862
Use getAddrSpaceCast
EthanLuisMcDonough Feb 16, 2024
62a5ee1
Revert "Use getAddrSpaceCast"
EthanLuisMcDonough Feb 27, 2024
052394f
Revert "Use getAddrSpaceCast"
EthanLuisMcDonough Feb 27, 2024
612d5a5
Write PGO
EthanLuisMcDonough Mar 1, 2024
b8c9163
Fix tests
EthanLuisMcDonough Mar 14, 2024
e572452
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Mar 14, 2024
4568c42
Fix arguments
EthanLuisMcDonough Mar 14, 2024
d86b101
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Mar 19, 2024
1fc4cb9
Add GPU prof flags
EthanLuisMcDonough Mar 19, 2024
849b244
Fix elf obj file
EthanLuisMcDonough Mar 19, 2024
55bd8d2
Add GPU use profile option
EthanLuisMcDonough Mar 19, 2024
7231080
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Apr 6, 2024
4ebbb45
Add more addrspace casts for GPU targets
EthanLuisMcDonough May 7, 2024
4be80e5
Merge branch 'main' into gpuprof
EthanLuisMcDonough May 7, 2024
b2fe222
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough May 7, 2024
7770b37
Fix params
EthanLuisMcDonough May 7, 2024
702d170
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough May 7, 2024
619fb69
Resolve merge conflict
EthanLuisMcDonough May 7, 2024
f6a1545
Merge branch 'main' into gpuprof
EthanLuisMcDonough May 9, 2024
92260d8
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough May 9, 2024
58491a7
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough May 9, 2024
6267c2a
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough May 11, 2024
3f08ae9
Have test read from profraw instead of dump
EthanLuisMcDonough May 11, 2024
09f2b39
Remove debug dump
EthanLuisMcDonough May 11, 2024
1dbde8e
Merge branch 'main' into gpuprof
EthanLuisMcDonough May 13, 2024
1278989
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough May 13, 2024
ff8f233
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough May 13, 2024
ed2a289
Merge branch 'main' into gpuprof_ptrcastfix
EthanLuisMcDonough May 13, 2024
aa895a1
Fix elf obj file
EthanLuisMcDonough Mar 19, 2024
2031e49
Add more addrspace casts for GPU targets
EthanLuisMcDonough May 7, 2024
5de6082
Merge branch 'gpuprof_ptrcastfix' into gpuprofwrite
EthanLuisMcDonough May 13, 2024
3e43a18
Merge branch 'gpuprof_ptrcastfix' into gpuprofdriver
EthanLuisMcDonough May 13, 2024
be6524b
Have test read from profraw instead of dump
EthanLuisMcDonough May 13, 2024
000deed
Merge branch 'gpuprofwrite' into gpuprofdriver
EthanLuisMcDonough May 13, 2024
e266cc7
Fix GPU PGO names
EthanLuisMcDonough May 17, 2024
2ba27e8
Merge branch 'main' into gpuprof
EthanLuisMcDonough May 21, 2024
c754f7f
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough May 24, 2024
2b8eb29
Fix PGO test format
EthanLuisMcDonough May 25, 2024
67f3009
Refactor profile writer
EthanLuisMcDonough May 25, 2024
1cec247
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough May 25, 2024
cee07bc
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough May 27, 2024
e8ad132
Fix refactor bug
EthanLuisMcDonough May 27, 2024
9e23b08
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough May 28, 2024
1e8fafc
Merge branch 'gpuprofwrite' into gpuprofdriver
EthanLuisMcDonough May 28, 2024
79bf08e
Check for level in test case
EthanLuisMcDonough May 28, 2024
4c9f814
Make requested clang-format change
EthanLuisMcDonough May 28, 2024
e187f5a
Merge branch 'gpuprofwrite' into gpuprofdriver
EthanLuisMcDonough May 28, 2024
cfe1660
Check for version global on GPU
EthanLuisMcDonough May 30, 2024
5bf4376
Add host/device combination test
EthanLuisMcDonough May 31, 2024
2530137
Add PGO dump debug option
EthanLuisMcDonough May 31, 2024
9cddcf4
Merge branch 'main' into gpuprof
EthanLuisMcDonough Jun 1, 2024
53d6309
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough Jun 1, 2024
f9138fb
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Jun 1, 2024
344e357
Tighten PGO test requirements
EthanLuisMcDonough Jun 1, 2024
2f75142
Tighten PGO test requirements
EthanLuisMcDonough Jun 1, 2024
79ceacb
Tighten PGO test requirements
EthanLuisMcDonough Jun 1, 2024
ff0dd62
Add note about PGO debug flag
EthanLuisMcDonough Jun 1, 2024
0b9cc35
Fix clang format
EthanLuisMcDonough Jun 4, 2024
bf5dbd6
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Jun 23, 2024
488cb4a
Apply requested formatting changes
EthanLuisMcDonough Jun 26, 2024
b90c015
Add memop function shim to DeviceRTL
EthanLuisMcDonough Jun 26, 2024
dc90a5c
Merge branch 'gpuprof' into gpuprofwrite
EthanLuisMcDonough Jun 27, 2024
c68c6e2
Make requested changes
EthanLuisMcDonough Jun 27, 2024
ca52c58
Only dump counters if PGODump flag is set
EthanLuisMcDonough Jun 27, 2024
0da7627
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough Aug 10, 2024
ee4431a
Update requirements
EthanLuisMcDonough Aug 10, 2024
90a6e30
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Aug 10, 2024
f9a24e3
Update test requirements
EthanLuisMcDonough Aug 10, 2024
efe70ad
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough Aug 10, 2024
fb699b6
Merge changes
EthanLuisMcDonough Aug 10, 2024
6eb137e
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Aug 10, 2024
5a671f6
[KernelInfo] Implement new LLVM IR pass for GPU code analysis
jdenny-ornl Aug 12, 2024
a7656de
Move docs to KernelInfo.rst
jdenny-ornl Aug 12, 2024
d92856e
Move conditional outside registration call
jdenny-ornl Aug 12, 2024
5727284
Merge changes
EthanLuisMcDonough Aug 12, 2024
6ac3f41
Use llvm::SmallString
jdenny-ornl Aug 12, 2024
6367ad7
Use TTI.getFlatAddressSpace for addrspace(0)
jdenny-ornl Aug 12, 2024
78446bb
Avoid repetition between amdgpu and nvptx tests
jdenny-ornl Aug 12, 2024
fede524
Use named values in tests
jdenny-ornl Aug 12, 2024
4c30b8a
Say flat address space instead of addrspace(0)
jdenny-ornl Aug 13, 2024
33f0d4d
Cache the flat address space
jdenny-ornl Aug 13, 2024
a2a512c
Link KernelInfo.rst from Passes.rst
jdenny-ornl Aug 13, 2024
de04ac4
Don't filter out cpus
jdenny-ornl Aug 13, 2024
ec5d2bd
Include less in header
jdenny-ornl Aug 16, 2024
c06b905
Removed unused comparison operators
jdenny-ornl Aug 16, 2024
d83d22a
Remove redundant null check
jdenny-ornl Aug 16, 2024
1649cf8
Move KernelInfo to KernelInfo.cpp, remove KernelInfoAnalysis
jdenny-ornl Aug 16, 2024
1a3c0ae
Use printAsOperand not getName to identify instruction
jdenny-ornl Aug 16, 2024
ea89a81
Use printAsOperand to report indirect callee
jdenny-ornl Aug 16, 2024
8da602b
Report inline assembly calls
jdenny-ornl Aug 16, 2024
45114fd
Use llvm::SmallString
jdenny-ornl Aug 16, 2024
eea139c
Use llvm::SmallString
jdenny-ornl Aug 16, 2024
8bf6e4e
getKernelInfo -> emitKernelInfo because return is unused
jdenny-ornl Aug 16, 2024
d2ee05d
Merge branch 'main' into kernel-info-pr
jdenny-ornl Aug 21, 2024
10e6c48
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough Sep 4, 2024
9b865f4
Merge branch 'main' into kernel-info-pr
jdenny-ornl Sep 5, 2024
39979f7
Merge branch 'main' into kernel-info-pr
jdenny-ornl Sep 12, 2024
62d494d
Clean up launch bounds
jdenny-ornl Sep 13, 2024
e4d3fca
Merge branch 'main' into kernel-info-pr
jdenny-ornl Sep 16, 2024
94d90d1
Adjust forEachLaunchBound param
jdenny-ornl Sep 16, 2024
762a217
Reuse Function::getFnAttributeAsParsedInteger
jdenny-ornl Sep 16, 2024
df66a3d
Move forEachLaunchBound to TargetTransformInfo
jdenny-ornl Sep 16, 2024
5488764
Merge branch 'main' into kernel-info-pr
jdenny-ornl Sep 26, 2024
3f63d53
forEachLaunchBound -> collectLaunchBounds
jdenny-ornl Sep 26, 2024
0658a21
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Sep 27, 2024
f5d9f55
Rebase updates
EthanLuisMcDonough Sep 28, 2024
e246227
Hack offload tests to find built llvm-profdata
jdenny-ornl Sep 27, 2024
3b6ce07
Merge branch 'main' into kernel-info-pr
jdenny-ornl Sep 28, 2024
feeaa37
Remove redundant private
jdenny-ornl Sep 28, 2024
557dd16
Merge branch 'pr-94268-fixup' into kernel-info-pgo
jdenny-ornl Sep 28, 2024
d2847b0
Extend kernel-info to emit PGO-based FLOP count
jdenny-ornl Sep 30, 2024
0672e2c
Merge branch 'main' into kernel-info-pgo
jdenny-ornl Oct 3, 2024
e04b933
Improve some kernel-info instruction remarks
jdenny-ornl Oct 3, 2024
b9b95a2
Merge branch 'main' into kernel-info-pr
jdenny-ornl Oct 11, 2024
116f1c9
Remove todos, as requested
jdenny-ornl Oct 11, 2024
2094465
Combine registerFullLinkTimeOptimizationLastEPCallback calls
jdenny-ornl Oct 11, 2024
39bce7c
collectLaunchBounds -> collectKernelLaunchBounds
jdenny-ornl Oct 11, 2024
14345cf
Spell kernel-info properties like their IR attributes
jdenny-ornl Oct 11, 2024
ad393d2
Replace -kernel-info-end-lto with -no-kernel-info-end-lto
jdenny-ornl Oct 11, 2024
d3beccf
Apply clang-format
jdenny-ornl Oct 11, 2024
5a4b873
Avoid auto, as requested
jdenny-ornl Oct 14, 2024
571181b
For function name, use debug info or keep @
jdenny-ornl Oct 14, 2024
cfda91d
Merge branch 'kernel-info-pr' into kernel-info-pgo
jdenny-ornl Oct 15, 2024
a5ce547
Use anonymous namespace
jdenny-ornl Oct 16, 2024
4d60911
Remove currently unused capabilities, as requested
jdenny-ornl Oct 16, 2024
0c30e7c
Rename test files without LLVM IR to .test
jdenny-ornl Oct 16, 2024
f5a6fbd
Regenerate OpenMP tests from current clang
jdenny-ornl Oct 17, 2024
baad223
Include LLVM value name in alloca report
jdenny-ornl Oct 17, 2024
28a5bcb
Merge branch 'kernel-info-pr' into kernel-info-pgo
jdenny-ornl Oct 18, 2024
1d0a961
Add llvm-profdata substitution to offload tests
EthanLuisMcDonough Oct 25, 2024
0ac2d5f
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough Oct 25, 2024
c6b34ad
Prepend target prefix to basename
EthanLuisMcDonough Oct 28, 2024
94ed55b
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough Nov 15, 2024
86f9683
Merge branch 'main' into kernel-info-pr
jdenny-ornl Nov 27, 2024
c9aebce
Update expected amdgpu-max-num-workgroups default values
jdenny-ornl Nov 27, 2024
d460ffa
Merge branch 'kernel-info-pr' into kernel-info-pgo
jdenny-ornl Nov 27, 2024
e690e2a
Update llvm-profdata test fix
jdenny-ornl Nov 27, 2024
e80f7ff
Prepend target prefix to basename
EthanLuisMcDonough Oct 28, 2024
26f5428
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough Dec 10, 2024
d9e864e
Merge branch 'main' into gpuprofwrite
EthanLuisMcDonough Dec 27, 2024
3f80999
Merge branch 'gpuprofwrite' into gpuprofdriver
EthanLuisMcDonough Dec 27, 2024
8982f8f
Merge branch 'main' into kernel-info-pr
jdenny-ornl Dec 27, 2024
517c018
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Dec 27, 2024
151bfb3
Regenerate OpenMP tests from current clang
jdenny-ornl Dec 27, 2024
2438390
Merge branch 'kernel-info-pr' into kernel-info-pgo
jdenny-ornl Dec 27, 2024
6b9cfdd
Merge branch 'pr-94268' into kernel-info-pgo
jdenny-ornl Dec 27, 2024
eb3eb14
Merge branch 'main' into kernel-info-pgo
jdenny-ornl Jan 29, 2025
51d1d91
Merge branch 'main' into gpuprofdriver
EthanLuisMcDonough Feb 11, 2025
182cbaf
Merge branch 'pr-94268' into kernel-info-pgo
jdenny-ornl Feb 19, 2025
8acd057
Merge branch 'main' into kernel-info-pgo
jdenny-ornl Apr 9, 2025
ffcc50d
Update KernelInfo.rst for upstream PGO GPU interface changes
jdenny-ornl Apr 9, 2025
84a0c45
Merge branch 'main' into kernel-info-pgo
jdenny-ornl May 6, 2025
8149708
Extend to intrinsics (e.g., @llvm.fmuladd.*)
jdenny-ornl May 10, 2025
002f393
Use `-Xarch_device -fprofile-update=atomic` in example
jdenny-ornl May 10, 2025
6e7208e
Also report floating point bytes moved from profile
jdenny-ornl May 13, 2025
47d6b9a
Update KernelInfo.rst example
jdenny-ornl May 13, 2025
8daf984
Use getTypeStoreSize not getTypeAllocSize for bytes moved
jdenny-ornl May 13, 2025
e1c50c2
Remove an answered todo
jdenny-ornl May 13, 2025
57a9848
Adjust flops for some instructions based on amdgpu hw counters
jdenny-ornl May 28, 2025
30e7101
KernelInfo.rst: Drop unnecessary -Xarch_device
jdenny-ornl May 29, 2025
c965a95
Merge branch 'main' into kernel-info-pgo
jdenny-ornl Jun 10, 2025
53 changes: 53 additions & 0 deletions llvm/docs/KernelInfo.rst
@@ -61,3 +61,56 @@ behavior so you can position ``kernel-info`` explicitly:
$ opt -disable-output test-openmp-nvptx64-nvidia-cuda-sm_70.bc \
-pass-remarks=kernel-info -no-kernel-info-end-lto \
-passes='module(kernel-info),lto<O2>'

PGO
===

Profile data gathered with LLVM's PGO implementation for GPUs can augment the
info reported by kernel-info. In particular, kernel-info can estimate the
number of floating point operations executed and the number of floating point
bytes moved.

For example, the following computes 2\ :sup:`4`\ , so we expect 4 fmul
instructions to execute at run time, and we expect a load and store for ``x``:

.. code-block:: shell

$ cat test.c
#include <stdio.h>
#include <stdlib.h>
__attribute__((noinline))
double test(double x, int n) {
double res = 1;
for (int i = 0; i < n; ++i)
res *= x;
return res;
}
int main(int argc, char *argv[]) {
double x = atof(argv[1]);
unsigned n = atoi(argv[2]);
#pragma omp target map(tofrom:x)
x = test(x, n);
printf("%f\n", x);
return 0;
}

$ clang -O1 -g -fopenmp --offload-arch=native test.c -o test \
-fprofile-generate -fprofile-update=atomic

$ LLVM_PROFILE_FILE=test.profraw ./test 2 4
16.000000

$ llvm-profdata merge -output=test.profdata *.profraw

$ clang -O1 -g -fopenmp --offload-arch=native test.c -foffload-lto \
-Rpass=kernel-info -fprofile-use=test.profdata | \
grep "test.c:.*Floating\|double"
test.c:14:14: in artificial function '__omp_offloading_34_1c64d55_main_l13', double 'load' ('%11') moved 8 fp bytes
test.c:14:7: in artificial function '__omp_offloading_34_1c64d55_main_l13', double 'store' moved 8 fp bytes
test.c:13:0: in artificial function '__omp_offloading_34_1c64d55_main_l13', ProfileFloatingPointOpCount = 0
test.c:13:0: in artificial function '__omp_offloading_34_1c64d55_main_l13', ProfileFloatingPointBytesMoved = 16
test.c:7:11: in function 'test', double 'fmul' ('%9') executed 4 flops
test.c:4:0: in function 'test', ProfileFloatingPointOpCount = 4
test.c:4:0: in function 'test', ProfileFloatingPointBytesMoved = 0

While ``-fprofile-update=atomic`` is not required for the simple example above,
it can be critical while profiling parallel code.
219 changes: 201 additions & 18 deletions llvm/lib/Analysis/KernelInfo.cpp
@@ -31,7 +31,8 @@ namespace {

/// Data structure holding function info for kernels.
class KernelInfo {
void updateForBB(const BasicBlock &BB, OptimizationRemarkEmitter &ORE);
void updateForBB(const BasicBlock &BB, BlockFrequencyInfo &BFI,
OptimizationRemarkEmitter &ORE);

public:
static void emitKernelInfo(Function &F, FunctionAnalysisManager &FAM,
@@ -73,10 +74,120 @@ class KernelInfo {

/// Number of flat address space memory accesses (via load, store, etc.).
int64_t FlatAddrspaceAccesses = 0;

/// Estimate of the number of floating point operations typically executed
/// based on any available profile data. If no profile data is available, the
/// count is zero.
uint64_t ProfileFloatingPointOpCount = 0;

/// Estimate of the number of bytes of floating point memory typically moved
/// (e.g., load or store) based on any available profile data. If no profile
/// data is available, the count is zero. LLVM memory access operations
/// (e.g., llvm.memcpy.*, cmpxchg) that are always encoded as operating on
/// integer types and never on floating point types are not included.
uint64_t ProfileFloatingPointBytesMoved = 0;
};

} // end anonymous namespace

// For the purposes of KernelInfo::ProfileFloatingPointOpCount, should the
// specified Instruction be considered a floating point operation? If so,
// return the floating point type and a multiplier for its FLOP count.
// Otherwise, return std::nullopt.
//
// TODO: Does this correctly identify floating point operations we care about?
// For example, we skip phi even when it returns a floating point value, and
// load is covered by KernelInfo::ProfileFloatingPointBytesMoved instead. Is
// there anything missing that should be covered here? Is there anything else
// that we should exclude? For example, at least for AMD GPU, there are
// floating point instruction patterns (e.g., fmul with one operand in some
// category of immediate) that lower to instructions that do not trigger AMD's
// floating point hardware counters. Should we somehow query target-specific
// lowering to exclude such cases?
static std::optional<std::pair<Type *, unsigned>>
getFloatingPointOp(const Instruction &I) {
if (const AtomicRMWInst *At = dyn_cast<AtomicRMWInst>(&I)) {
if (At->isFloatingPointOperation())
return std::make_pair(At->getType(), 1);
return std::nullopt;
}
if (const CastInst *CI = dyn_cast<CastInst>(&I)) {
Type *SrcTy = CI->getSrcTy();
Type *DestTy = CI->getDestTy();
// For AMD GPU, conversions between fp and integer types where either is not
// 64-bit lower to instructions that do not trigger AMD's floating point
// hardware counters. TODO: Is that true for all archs, all non-64-bit
// floating point types, and all non-64-bit integer types? On AMD GPU, we
// have checked 64 vs. 32 and 32 vs. 32 so far.
if (SrcTy->getScalarSizeInBits() != 64 ||
DestTy->getScalarSizeInBits() != 64)
return std::nullopt;
// For AMD GPU, uitofp and sitofp lower to FADD instructions. TODO: Is that
// true for all archs?
if (isa<UIToFPInst>(I) || isa<SIToFPInst>(I))
return std::make_pair(DestTy, 1);
// For AMD GPU, fptoui and fptosi lower to FMA instructions. Thus, as for
// FMA instructions below, we multiply by 2. TODO: Is that true for all
// archs?
if (isa<FPToUIInst>(I) || isa<FPToSIInst>(I))
return std::make_pair(SrcTy, 2);
return std::nullopt;
}
Type *Ty = I.getType();
if (!Ty->isFPOrFPVectorTy())
return std::nullopt;
if (I.isBinaryOp() || I.isUnaryOp()) {
switch (I.getOpcode()) {
// For AMD GPU, fneg lowers to instructions that do not trigger AMD's
// floating point hardware counters. TODO: Is that true for all archs and
// all floating point types? On AMD GPU, we have checked 64 bit.
case Instruction::FNeg:
return std::nullopt;
// This multiplier is based on AMD hardware fp counters for fdiv:
// - SQ_INSTS_VALU_FMA_F64 = 6*2
// - SQ_INSTS_VALU_MUL_F64 = 1
// - SQ_INSTS_VALU_TRANS_F64 = 1
// TODO: Is that true for all archs and all floating point types? On AMD
// GPU, we have checked 64 bit. Moreover, this is surely brittle. What if
// the implementation changes?
case Instruction::FDiv:
return std::make_pair(Ty, 14);
}
return std::make_pair(Ty, 1);
}
if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(&I)) {
switch (II->getIntrinsicID()) {
// For AMD GPU, these lower to instructions that do not trigger AMD's
// floating point hardware counters. TODO: Is that true for all archs and
// all floating point types? On AMD GPU, we have checked 64 bit.
case Intrinsic::copysign:
case Intrinsic::fabs:
case Intrinsic::floor:
case Intrinsic::ldexp:
case Intrinsic::minnum:
case Intrinsic::rint:
return std::nullopt;
// For FMA instructions, we mimic AMD's rocprofiler-compute, which
// multiplies SQ_INSTS_VALU_FMA_* counts by 2.
case Intrinsic::fmuladd:
case Intrinsic::fma:
return std::make_pair(Ty, 2);
// This multiplier is based on AMD hardware fp counters for this intrinsic:
// - SQ_INSTS_VALU_FMA_F64 = 7*2
// - SQ_INSTS_VALU_MUL_F64 = 2
// - SQ_INSTS_VALU_TRANS_F64 = 1
// TODO: Is that true for all archs and all floating point types? On AMD
// GPU, we have checked 64 bit. Moreover, this is surely brittle. What if
// the implementation changes?
case Intrinsic::sqrt:
return std::make_pair(Ty, 17);
default:
return std::make_pair(Ty, 1);
}
}
return std::nullopt;
}

static void identifyCallee(OptimizationRemark &R, const Module *M,
const Value *V, StringRef Kind = "") {
SmallString<100> Name; // might be function name or asm expression
@@ -100,6 +211,19 @@ static void identifyFunction(OptimizationRemark &R, const Function &F) {
identifyCallee(R, F.getParent(), &F, "function");
}

static void identifyInstruction(OptimizationRemark &R, const Instruction &I) {
if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(&I))
R << "'" << II->getCalledFunction()->getName() << "' call";
else
R << "'" << I.getOpcodeName() << "'";
if (!I.getType()->isVoidTy()) {
SmallString<20> Name;
raw_svector_ostream OS(Name);
I.printAsOperand(OS, /*PrintType=*/false, I.getModule());
R << " ('" << Name << "')";
}
}

static void remarkAlloca(OptimizationRemarkEmitter &ORE, const Function &Caller,
const AllocaInst &Alloca,
TypeSize::ScalarTy StaticSize) {
@@ -153,33 +277,69 @@ static void remarkCall(OptimizationRemarkEmitter &ORE, const Function &Caller,

static void remarkFlatAddrspaceAccess(OptimizationRemarkEmitter &ORE,
const Function &Caller,
const Instruction &Inst) {
const Instruction &I) {
ORE.emit([&] {
OptimizationRemark R(DEBUG_TYPE, "FlatAddrspaceAccess", &I);
R << "in ";
identifyFunction(R, Caller);
R << ", ";
identifyInstruction(R, I);
R << " accesses memory in flat address space";
return R;
});
}

static void
remarkFloatingPointOp(OptimizationRemarkEmitter &ORE, const Function &Caller,
const Instruction &I, Type *Ty, unsigned Multiplier,
std::optional<uint64_t> BlockProfileCount,
std::optional<uint64_t> BytesMoved = std::nullopt) {
ORE.emit([&] {
OptimizationRemark R(DEBUG_TYPE, "FlatAddrspaceAccess", &Inst);
OptimizationRemark R(DEBUG_TYPE,
BytesMoved ? "ProfileFloatingPointBytesMoved"
: "ProfileFloatingPointOpCount",
&I);
R << "in ";
identifyFunction(R, Caller);
if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(&Inst)) {
R << ", '" << II->getCalledFunction()->getName() << "' call";
R << ", ";
SmallString<10> TyName;
raw_svector_ostream OS(TyName);
Ty->print(OS);
R << TyName << " ";
identifyInstruction(R, I);
if (BlockProfileCount) {
if (BytesMoved)
R << " moved " << itostr(*BytesMoved * *BlockProfileCount)
<< " fp bytes";
else
R << " executed " << utostr(*BlockProfileCount) << " flops";
if (Multiplier != 1)
R << " x " << utostr(Multiplier);
} else {
R << ", '" << Inst.getOpcodeName() << "' instruction";
R << " has no profile data";
}
if (!Inst.getType()->isVoidTy()) {
SmallString<20> Name;
raw_svector_ostream OS(Name);
Inst.printAsOperand(OS, /*PrintType=*/false, Caller.getParent());
R << " ('" << Name << "')";
}
R << " accesses memory in flat address space";
return R;
});
}

void KernelInfo::updateForBB(const BasicBlock &BB,
void KernelInfo::updateForBB(const BasicBlock &BB, BlockFrequencyInfo &BFI,
OptimizationRemarkEmitter &ORE) {
const Function &F = *BB.getParent();
const Module &M = *F.getParent();
const DataLayout &DL = M.getDataLayout();
// TODO: Is AllowSynthetic what we want?
std::optional<uint64_t> BlockProfileCount =
BFI.getBlockProfileCount(&BB, /*AllowSynthetic=*/true);
for (const Instruction &I : BB.instructionsWithoutDebug()) {
auto HandleFloatingPointBytesMoved = [&]() {
Type *Ty = I.getAccessType();
if (!Ty || !Ty->isFPOrFPVectorTy())
return;
TypeSize::ScalarTy Size = DL.getTypeStoreSize(Ty).getFixedValue();
ProfileFloatingPointBytesMoved += BlockProfileCount.value_or(0) * Size;
remarkFloatingPointOp(ORE, F, I, Ty, /*Multiplier=*/1, BlockProfileCount,
Size);
};
if (const AllocaInst *Alloca = dyn_cast<AllocaInst>(&I)) {
++Allocas;
TypeSize::ScalarTy StaticSize = 0;
@@ -237,38 +397,58 @@
remarkFlatAddrspaceAccess(ORE, F, I);
}
}
// llvm.memcpy.*, llvm.memset.*, etc. are encoded as operating on
// integer types not floating point types, so
// HandleFloatingPointBytesMoved is useless here.
}
} else if (const LoadInst *Load = dyn_cast<LoadInst>(&I)) {
if (Load->getPointerAddressSpace() == FlatAddrspace) {
++FlatAddrspaceAccesses;
remarkFlatAddrspaceAccess(ORE, F, I);
}
HandleFloatingPointBytesMoved();
} else if (const StoreInst *Store = dyn_cast<StoreInst>(&I)) {
if (Store->getPointerAddressSpace() == FlatAddrspace) {
++FlatAddrspaceAccesses;
remarkFlatAddrspaceAccess(ORE, F, I);
}
HandleFloatingPointBytesMoved();
} else if (const AtomicRMWInst *At = dyn_cast<AtomicRMWInst>(&I)) {
if (At->getPointerAddressSpace() == FlatAddrspace) {
++FlatAddrspaceAccesses;
remarkFlatAddrspaceAccess(ORE, F, I);
}
HandleFloatingPointBytesMoved();
} else if (const AtomicCmpXchgInst *At = dyn_cast<AtomicCmpXchgInst>(&I)) {
if (At->getPointerAddressSpace() == FlatAddrspace) {
++FlatAddrspaceAccesses;
remarkFlatAddrspaceAccess(ORE, F, I);
}
// cmpxchg is encoded as operating on integer types not floating point
// types, so HandleFloatingPointBytesMoved is useless here.
}
if (auto Op = getFloatingPointOp(I)) {
Type *Ty;
unsigned Multiplier;
std::tie(Ty, Multiplier) = *Op;
ProfileFloatingPointOpCount += Multiplier * BlockProfileCount.value_or(0);
remarkFloatingPointOp(ORE, F, I, Ty, Multiplier, BlockProfileCount);
}
}
}

static void remarkProperty(OptimizationRemarkEmitter &ORE, const Function &F,
StringRef Name, int64_t Value) {
static std::string toString(bool Val) { return itostr(Val); }
static std::string toString(int64_t Val) { return itostr(Val); }
static std::string toString(uint64_t Val) { return utostr(Val); }

template <typename T>
void remarkProperty(OptimizationRemarkEmitter &ORE, const Function &F,
StringRef Name, T Val) {
ORE.emit([&] {
OptimizationRemark R(DEBUG_TYPE, Name, &F);
R << "in ";
identifyFunction(R, F);
R << ", " << Name << " = " << itostr(Value);
R << ", " << Name << " = " << toString(Val);
return R;
});
}
@@ -284,6 +464,7 @@ void KernelInfo::emitKernelInfo(Function &F, FunctionAnalysisManager &FAM,
TargetMachine *TM) {
KernelInfo KI;
TargetTransformInfo &TheTTI = FAM.getResult<TargetIRAnalysis>(F);
BlockFrequencyInfo &BFI = FAM.getResult<BlockFrequencyAnalysis>(F);
KI.FlatAddrspace = TheTTI.getFlatAddressSpace();

// Record function properties.
@@ -296,7 +477,7 @@ void KernelInfo::emitKernelInfo(Function &F, FunctionAnalysisManager &FAM,

auto &ORE = FAM.getResult<OptimizationRemarkEmitterAnalysis>(F);
for (const auto &BB : F)
KI.updateForBB(BB, ORE);
KI.updateForBB(BB, BFI, ORE);

#define REMARK_PROPERTY(PROP_NAME) \
remarkProperty(ORE, F, #PROP_NAME, KI.PROP_NAME)
@@ -312,6 +493,8 @@
REMARK_PROPERTY(InlineAssemblyCalls);
REMARK_PROPERTY(Invokes);
REMARK_PROPERTY(FlatAddrspaceAccesses);
REMARK_PROPERTY(ProfileFloatingPointOpCount);
REMARK_PROPERTY(ProfileFloatingPointBytesMoved);
#undef REMARK_PROPERTY
}
