[AMDGPU][clang][CodeGen][opt] Add late-resolved feature identifying predicates #134016

Open: wants to merge 67 commits into base: main

Conversation

AlexVlx (Contributor) commented Apr 2, 2025

This change adds two semi-magical builtins for AMDGPU:

  • __builtin_amdgcn_processor_is, which is similar in observable behaviour to __builtin_cpu_is, except that it is never "evaluated" at run time;
  • __builtin_amdgcn_is_invocable, which is behaviourally similar to __has_builtin, except that it is not a macro (i.e. it is not evaluated at preprocessing time).

Neither of these is constexpr, although when compiling for concrete (i.e. gfxXXX / gfxXXX-generic) targets they do get evaluated in Clang, so they shouldn't tear the AST too badly, if at all, for multi-pass compilation cases like HIP. They can only be used in specific contexts (as arguments to control structures).
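
For illustration, the basic usage shape is roughly the following (condensed from the documentation examples added by the patch):

  // Valid only as a (possibly negated / combined) predicate of a control
  // structure, e.g.:
  if (__builtin_amdgcn_processor_is("gfx1201") ||
      __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var))
    __builtin_amdgcn_s_sleep_var(x);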

The motivation for adding these is two-fold:

  • as a nice-to-have, they provide an AST-visible way to incorporate architecture-specific code, rather than having to rely on macros and the preprocessor, which burn in the choice quite early;
  • as a must-have, they allow featureful AMDGCN flavoured SPIR-V to be produced, where target-specific capability is guarded and chosen or discarded when finalising compilation for a concrete target.

I've tried to keep the overall footprint of the change small. The changes to Sema are a bit unpleasant, but there was a strong desire to have Clang validate these, and to constrain their uses, and this was the most compact solution I could come up with (suggestions welcome).

In the end, I will note that there is nothing actually AMDGPU specific here, so it is possible that in the future, assuming interest from other targets / users, we'd just promote them to generic intrinsics.

@llvmbot llvmbot added clang Clang issues not falling into any other category backend:AMDGPU clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen IR generation bugs: mangling, exceptions, etc. labels Apr 2, 2025
llvmbot (Member) commented Apr 2, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-clang

Author: Alex Voicu (AlexVlx)

Changes

Patch is 59.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/134016.diff

17 Files Affected:

  • (modified) clang/docs/LanguageExtensions.rst (+110)
  • (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+5)
  • (modified) clang/include/clang/Basic/DiagnosticSemaKinds.td (+10)
  • (modified) clang/lib/Basic/Targets/SPIR.cpp (+4)
  • (modified) clang/lib/Basic/Targets/SPIR.h (+4)
  • (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+29)
  • (modified) clang/lib/Sema/SemaExpr.cpp (+157)
  • (added) clang/test/CodeGen/amdgpu-builtin-cpu-is.c (+65)
  • (added) clang/test/CodeGen/amdgpu-builtin-is-invocable.c (+64)
  • (added) clang/test/CodeGen/amdgpu-feature-builtins-invalid-use.cpp (+43)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+9)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp (+207)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2-1)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/CodeGen/AMDGPU/amdgpu-expand-feature-predicates-unfoldable.ll (+28)
  • (added) llvm/test/CodeGen/AMDGPU/amdgpu-expand-feature-predicates.ll (+359)
diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index 3b8a9cac6587a..8a7cb75af13e5 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -4920,6 +4920,116 @@ If no address spaces names are provided, all address spaces are fenced.
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup", "local")
   __builtin_amdgcn_fence(__ATOMIC_SEQ_CST, "workgroup", "local", "global")
 
+__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``__builtin_amdgcn_processor_is`` and ``__builtin_amdgcn_is_invocable`` provide
+a functional mechanism for programmatically querying:
+
+* the identity of the current target processor;
+* the capability of the current target processor to invoke a particular builtin.
+
+**Syntax**:
+
+.. code-block:: c
+
+  // When used as the predicate for a control structure
+  bool __builtin_amdgcn_processor_is(const char*);
+  bool __builtin_amdgcn_is_invocable(builtin_name);
+  // Otherwise
+  void __builtin_amdgcn_processor_is(const char*);
+  void __builtin_amdgcn_is_invocable(void);
+
+**Example of use**:
+
+.. code-block:: c++
+
+  if (__builtin_amdgcn_processor_is("gfx1201") ||
+      __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var))
+    __builtin_amdgcn_s_sleep_var(x);
+
+  if (!__builtin_amdgcn_processor_is("gfx906"))
+    __builtin_amdgcn_s_wait_event_export_ready();
+  else if (__builtin_amdgcn_processor_is("gfx1010") ||
+           __builtin_amdgcn_processor_is("gfx1101"))
+    __builtin_amdgcn_s_ttracedata_imm(1);
+
+  while (__builtin_amdgcn_processor_is("gfx1101")) *p += x;
+
+  do { *p -= x; } while (__builtin_amdgcn_processor_is("gfx1010"));
+
+  for (; __builtin_amdgcn_processor_is("gfx1201"); ++*p) break;
+
+  if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready))
+    __builtin_amdgcn_s_wait_event_export_ready();
+  else if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_ttracedata_imm))
+    __builtin_amdgcn_s_ttracedata_imm(1);
+
+  do {
+    *p -= x;
+  } while (__builtin_amdgcn_is_invocable(__builtin_amdgcn_global_load_tr_b64_i32));
+
+  for (; __builtin_amdgcn_is_invocable(__builtin_amdgcn_permlane64); ++*p) break;
+
+**Description**:
+
+When used as the predicate value of the following control structures:
+
+.. code-block:: c++
+
+  if (...)
+  while (...)
+  do { } while (...)
+  for (...)
+
+be it directly, or as arguments to logical operators such as ``!, ||, &&``, the
+builtins return a boolean value that:
+
+* indicates whether the current target matches the argument; the argument MUST
+  be a string literal and a valid AMDGPU target
+* indicates whether the builtin function passed as the argument can be invoked
+  by the current target; the argument MUST be either a generic or AMDGPU
+  specific builtin name
+
+Outside of these contexts, the builtins have a ``void`` returning signature
+which prevents their misuse.
+
+**Example of invalid use**:
+
+.. code-block:: c++
+
+  void kernel(int* p, int x, bool (*pfn)(bool), const char* str) {
+    if (__builtin_amdgcn_processor_is("not_an_amdgcn_gfx_id")) return;
+    else if (__builtin_amdgcn_processor_is(str)) __builtin_trap();
+
+    bool a = __builtin_amdgcn_processor_is("gfx906");
+    const bool b = !__builtin_amdgcn_processor_is("gfx906");
+    const bool c = !__builtin_amdgcn_processor_is("gfx906");
+    bool d = __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    bool e = !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    const auto f =
+        !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready)
+        || __builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    const auto g =
+        !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_wait_event_export_ready)
+        || !__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var);
+    __builtin_amdgcn_processor_is("gfx1201")
+      ? __builtin_amdgcn_s_sleep_var(x) : __builtin_amdgcn_s_sleep(42);
+    if (pfn(__builtin_amdgcn_processor_is("gfx1200")))
+      __builtin_amdgcn_s_sleep_var(x);
+
+    if (__builtin_amdgcn_is_invocable("__builtin_amdgcn_s_sleep_var")) return;
+    else if (__builtin_amdgcn_is_invocable(x)) __builtin_trap();
+  }
+
+When invoked while compiling for a concrete target, the builtins are evaluated
+early by Clang, and never produce any CodeGen effects / have no observable
+side-effects in IR. Conversely, when compiling for AMDGCN flavoured SPIR-V,
+which is an abstract target, a series of predicate values are implicitly
+created. These predicates get resolved when finalizing the compilation process
+for a concrete target, and shall reflect the latter's identity and features.
+Thus, it is possible to author high-level code, in e.g. HIP, that is target
+adaptive in a dynamic fashion, contrary to macro based mechanisms.
 
 ARM/AArch64 Language Extensions
 -------------------------------
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 44ef404aee72f..5d01a7e75f7e7 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -346,6 +346,11 @@ BUILTIN(__builtin_amdgcn_endpgm, "v", "nr")
 BUILTIN(__builtin_amdgcn_get_fpenv, "WUi", "n")
 BUILTIN(__builtin_amdgcn_set_fpenv, "vWUi", "n")
 
+// These are special FE only builtins intended for forwarding the requirements
+// to the ME.
+BUILTIN(__builtin_amdgcn_processor_is, "vcC*", "nctu")
+BUILTIN(__builtin_amdgcn_is_invocable, "v", "nctu")
+
 //===----------------------------------------------------------------------===//
 // R600-NI only builtins.
 //===----------------------------------------------------------------------===//
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 5e45482584946..45f0f9eb88e55 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -13054,4 +13054,14 @@ def err_acc_decl_for_routine
 // AMDGCN builtins diagnostics
 def err_amdgcn_global_load_lds_size_invalid_value : Error<"invalid size value">;
 def note_amdgcn_global_load_lds_size_valid_value : Note<"size must be %select{1, 2, or 4|1, 2, 4, 12 or 16}0">;
+def err_amdgcn_processor_is_arg_not_literal
+    : Error<"the argument to __builtin_amdgcn_processor_is must be a string "
+            "literal">;
+def err_amdgcn_processor_is_arg_invalid_value
+    : Error<"the argument to __builtin_amdgcn_processor_is must be a valid "
+            "AMDGCN processor identifier; '%0' is not valid">;
+def err_amdgcn_is_invocable_arg_invalid_value
+    : Error<"the argument to __builtin_amdgcn_is_invocable must be either a "
+            "target agnostic builtin or an AMDGCN target specific builtin; `%0`"
+            " is not valid">;
 } // end of sema component.
diff --git a/clang/lib/Basic/Targets/SPIR.cpp b/clang/lib/Basic/Targets/SPIR.cpp
index 5b5f47f9647a2..eb43d9b0be283 100644
--- a/clang/lib/Basic/Targets/SPIR.cpp
+++ b/clang/lib/Basic/Targets/SPIR.cpp
@@ -152,3 +152,7 @@ void SPIRV64AMDGCNTargetInfo::setAuxTarget(const TargetInfo *Aux) {
     Float128Format = DoubleFormat;
   }
 }
+
+bool SPIRV64AMDGCNTargetInfo::isValidCPUName(StringRef CPU) const {
+  return AMDGPUTI.isValidCPUName(CPU);
+}
diff --git a/clang/lib/Basic/Targets/SPIR.h b/clang/lib/Basic/Targets/SPIR.h
index 78505d66d6f2f..7aa13cbeb89fd 100644
--- a/clang/lib/Basic/Targets/SPIR.h
+++ b/clang/lib/Basic/Targets/SPIR.h
@@ -432,6 +432,10 @@ class LLVM_LIBRARY_VISIBILITY SPIRV64AMDGCNTargetInfo final
   }
 
   bool hasInt128Type() const override { return TargetInfo::hasInt128Type(); }
+
+  // This is only needed for validating arguments passed to
+  // __builtin_amdgcn_processor_is
+  bool isValidCPUName(StringRef Name) const override;
 };
 
 } // namespace targets
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index b56b739094ff3..7b1a3815144b4 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -284,6 +284,18 @@ void CodeGenFunction::AddAMDGPUFenceAddressSpaceMMRA(llvm::Instruction *Inst,
   Inst->setMetadata(LLVMContext::MD_mmra, MMRAMetadata::getMD(Ctx, MMRAs));
 }
 
+static Value *GetOrInsertAMDGPUPredicate(CodeGenFunction &CGF, Twine Name) {
+  auto PTy = IntegerType::getInt1Ty(CGF.getLLVMContext());
+
+  auto P = cast<GlobalVariable>(
+      CGF.CGM.getModule().getOrInsertGlobal(Name.str(), PTy));
+  P->setConstant(true);
+  P->setExternallyInitialized(true);
+
+  return CGF.Builder.CreateLoad(RawAddress(P, PTy, CharUnits::One(),
+                                           KnownNonNull));
+}
+
 Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
                                               const CallExpr *E) {
   llvm::AtomicOrdering AO = llvm::AtomicOrdering::SequentiallyConsistent;
@@ -585,6 +597,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
     llvm::Value *Env = EmitScalarExpr(E->getArg(0));
     return Builder.CreateCall(F, {Env});
   }
+  case AMDGPU::BI__builtin_amdgcn_processor_is: {
+    assert(CGM.getTriple().isSPIRV() &&
+           "__builtin_amdgcn_processor_is should never reach CodeGen for "
+             "concrete targets!");
+    StringRef Proc = cast<clang::StringLiteral>(E->getArg(0))->getString();
+    return GetOrInsertAMDGPUPredicate(*this, "llvm.amdgcn.is." + Proc);
+  }
+  case AMDGPU::BI__builtin_amdgcn_is_invocable: {
+    assert(CGM.getTriple().isSPIRV() &&
+           "__builtin_amdgcn_is_invocable should never reach CodeGen for "
+           "concrete targets!");
+    auto FD = cast<FunctionDecl>(
+      cast<DeclRefExpr>(E->getArg(0))->getReferencedDeclOfCallee());
+    StringRef RF =
+        getContext().BuiltinInfo.getRequiredFeatures(FD->getBuiltinID());
+    return GetOrInsertAMDGPUPredicate(*this, "llvm.amdgcn.has." + RF);
+  }
   case AMDGPU::BI__builtin_amdgcn_read_exec:
     return EmitAMDGCNBallotForExec(*this, E, Int64Ty, Int64Ty, false);
   case AMDGPU::BI__builtin_amdgcn_read_exec_lo:
diff --git a/clang/lib/Sema/SemaExpr.cpp b/clang/lib/Sema/SemaExpr.cpp
index 7cc8374e69d73..24f5262ab3cf4 100644
--- a/clang/lib/Sema/SemaExpr.cpp
+++ b/clang/lib/Sema/SemaExpr.cpp
@@ -6541,6 +6541,22 @@ ExprResult Sema::BuildCallExpr(Scope *Scope, Expr *Fn, SourceLocation LParenLoc,
   if (Result.isInvalid()) return ExprError();
   Fn = Result.get();
 
+  // The __builtin_amdgcn_is_invocable builtin is special, and will be resolved
+  // later, when we check boolean conditions, for now we merely forward it
+  // without any additional checking.
+  if (Fn->getType() == Context.BuiltinFnTy && ArgExprs.size() == 1 &&
+      ArgExprs[0]->getType() == Context.BuiltinFnTy) {
+    auto FD = cast<FunctionDecl>(Fn->getReferencedDeclOfCallee());
+
+    if (FD->getName() == "__builtin_amdgcn_is_invocable") {
+      auto FnPtrTy = Context.getPointerType(FD->getType());
+      auto R = ImpCastExprToType(Fn, FnPtrTy, CK_BuiltinFnToFnPtr).get();
+      return CallExpr::Create(Context, R, ArgExprs, Context.VoidTy,
+                              ExprValueKind::VK_PRValue, RParenLoc,
+                              FPOptionsOverride());
+    }
+  }
+
   if (CheckArgsForPlaceholders(ArgExprs))
     return ExprError();
 
@@ -13234,6 +13250,20 @@ inline QualType Sema::CheckBitwiseOperands(ExprResult &LHS, ExprResult &RHS,
   return InvalidOperands(Loc, LHS, RHS);
 }
 
+static inline bool IsAMDGPUPredicateBI(Expr *E) {
+  if (!E->getType()->isVoidType())
+    return false;
+
+  if (auto CE = dyn_cast<CallExpr>(E)) {
+    if (auto BI = CE->getDirectCallee())
+      if (BI->getName() == "__builtin_amdgcn_processor_is" ||
+          BI->getName() == "__builtin_amdgcn_is_invocable")
+        return true;
+  }
+
+  return false;
+}
+
 // C99 6.5.[13,14]
 inline QualType Sema::CheckLogicalOperands(ExprResult &LHS, ExprResult &RHS,
                                            SourceLocation Loc,
@@ -13329,6 +13359,9 @@ inline QualType Sema::CheckLogicalOperands(ExprResult &LHS, ExprResult &RHS,
   // The following is safe because we only use this method for
   // non-overloadable operands.
 
+  if (IsAMDGPUPredicateBI(LHS.get()) && IsAMDGPUPredicateBI(RHS.get()))
+    return Context.VoidTy;
+
   // C++ [expr.log.and]p1
   // C++ [expr.log.or]p1
   // The operands are both contextually converted to type bool.
@@ -15576,6 +15609,38 @@ static bool isOverflowingIntegerType(ASTContext &Ctx, QualType T) {
   return Ctx.getIntWidth(T) >= Ctx.getIntWidth(Ctx.IntTy);
 }
 
+static Expr *ExpandAMDGPUPredicateBI(ASTContext &Ctx, CallExpr *CE) {
+  if (!CE->getBuiltinCallee())
+    return CXXBoolLiteralExpr::Create(Ctx, false, Ctx.BoolTy, CE->getExprLoc());
+
+  if (Ctx.getTargetInfo().getTriple().isSPIRV()) {
+    CE->setType(Ctx.getLogicalOperationType());
+    return CE;
+  }
+
+  bool P = false;
+  auto &TI = Ctx.getTargetInfo();
+
+  if (CE->getDirectCallee()->getName() == "__builtin_amdgcn_processor_is") {
+    auto GFX = dyn_cast<StringLiteral>(CE->getArg(0)->IgnoreParenCasts());
+    auto TID = TI.getTargetID();
+    if (GFX && TID) {
+      auto N = GFX->getString();
+      P = TI.isValidCPUName(GFX->getString()) && TID->find(N) == 0;
+    }
+  } else {
+    auto FD = cast<FunctionDecl>(CE->getArg(0)->getReferencedDeclOfCallee());
+
+    StringRef RF = Ctx.BuiltinInfo.getRequiredFeatures(FD->getBuiltinID());
+    llvm::StringMap<bool> CF;
+    Ctx.getFunctionFeatureMap(CF, FD);
+
+    P = Builtin::evaluateRequiredTargetFeatures(RF, CF);
+  }
+
+  return CXXBoolLiteralExpr::Create(Ctx, P, Ctx.BoolTy, CE->getExprLoc());
+}
+
 ExprResult Sema::CreateBuiltinUnaryOp(SourceLocation OpLoc,
                                       UnaryOperatorKind Opc, Expr *InputExpr,
                                       bool IsAfterAmp) {
@@ -15753,6 +15818,8 @@ ExprResult Sema::CreateBuiltinUnaryOp(SourceLocation OpLoc,
         // Vector logical not returns the signed variant of the operand type.
         resultType = GetSignedVectorType(resultType);
         break;
+      } else if (IsAMDGPUPredicateBI(InputExpr)) {
+        break;
       } else {
         return ExprError(Diag(OpLoc, diag::err_typecheck_unary_expr)
                          << resultType << Input.get()->getSourceRange());
@@ -20469,6 +20536,88 @@ void Sema::DiagnoseEqualityWithExtraParens(ParenExpr *ParenE) {
     }
 }
 
+static bool ValidateAMDGPUPredicateBI(Sema &Sema, CallExpr *CE) {
+  if (CE->getDirectCallee()->getName() == "__builtin_amdgcn_processor_is") {
+    auto GFX = dyn_cast<StringLiteral>(CE->getArg(0)->IgnoreParenCasts());
+    if (!GFX) {
+      Sema.Diag(CE->getExprLoc(),
+                diag::err_amdgcn_processor_is_arg_not_literal);
+      return false;
+    }
+    auto N = GFX->getString();
+    if (!Sema.getASTContext().getTargetInfo().isValidCPUName(N) &&
+        (!Sema.getASTContext().getAuxTargetInfo() ||
+         !Sema.getASTContext().getAuxTargetInfo()->isValidCPUName(N))) {
+      Sema.Diag(CE->getExprLoc(),
+                diag::err_amdgcn_processor_is_arg_invalid_value) << N;
+      return false;
+    }
+  } else {
+    auto Arg = CE->getArg(0);
+    if (!Arg || Arg->getType() != Sema.getASTContext().BuiltinFnTy) {
+      Sema.Diag(CE->getExprLoc(),
+                diag::err_amdgcn_is_invocable_arg_invalid_value) << Arg;
+      return false;
+    }
+  }
+
+  return true;
+}
+
+static Expr *MaybeHandleAMDGPUPredicateBI(Sema &Sema, Expr *E, bool &Invalid) {
+  if (auto UO = dyn_cast<UnaryOperator>(E)) {
+    auto SE = dyn_cast<CallExpr>(UO->getSubExpr());
+    if (IsAMDGPUPredicateBI(SE)) {
+      assert(
+        UO->getOpcode() == UnaryOperator::Opcode::UO_LNot &&
+        "__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable "
+          "can only be used as operands of logical ops!");
+
+      if (!ValidateAMDGPUPredicateBI(Sema, SE)) {
+        Invalid = true;
+        return nullptr;
+      }
+
+      UO->setSubExpr(ExpandAMDGPUPredicateBI(Sema.getASTContext(), SE));
+      UO->setType(Sema.getASTContext().getLogicalOperationType());
+
+      return UO;
+    }
+  }
+  if (auto BO = dyn_cast<BinaryOperator>(E)) {
+    auto LHS = dyn_cast<CallExpr>(BO->getLHS());
+    auto RHS = dyn_cast<CallExpr>(BO->getRHS());
+    if (IsAMDGPUPredicateBI(LHS) && IsAMDGPUPredicateBI(RHS)) {
+      assert(
+          BO->isLogicalOp() &&
+          "__builtin_amdgcn_processor_is and __builtin_amdgcn_is_invocable "
+            "can only be used as operands of logical ops!");
+
+      if (!ValidateAMDGPUPredicateBI(Sema, LHS) ||
+          !ValidateAMDGPUPredicateBI(Sema, RHS)) {
+        Invalid = true;
+        return nullptr;
+      }
+
+      BO->setLHS(ExpandAMDGPUPredicateBI(Sema.getASTContext(), LHS));
+      BO->setRHS(ExpandAMDGPUPredicateBI(Sema.getASTContext(), RHS));
+      BO->setType(Sema.getASTContext().getLogicalOperationType());
+
+      return BO;
+    }
+  }
+  if (auto CE = dyn_cast<CallExpr>(E))
+    if (IsAMDGPUPredicateBI(CE)) {
+      if (!ValidateAMDGPUPredicateBI(Sema, CE)) {
+        Invalid = true;
+        return nullptr;
+      }
+      return ExpandAMDGPUPredicateBI(Sema.getASTContext(), CE);
+    }
+
+  return nullptr;
+}
+
 ExprResult Sema::CheckBooleanCondition(SourceLocation Loc, Expr *E,
                                        bool IsConstexpr) {
   DiagnoseAssignmentAsCondition(E);
@@ -20480,6 +20629,14 @@ ExprResult Sema::CheckBooleanCondition(SourceLocation Loc, Expr *E,
   E = result.get();
 
   if (!E->isTypeDependent()) {
+    if (E->getType()->isVoidType()) {
+      bool IsInvalidPredicate = false;
+      if (auto BIC = MaybeHandleAMDGPUPredicateBI(*this, E, IsInvalidPredicate))
+        return BIC;
+      else if (IsInvalidPredicate)
+        return ExprError();
+    }
+
     if (getLangOpts().CPlusPlus)
       return CheckCXXBooleanCondition(E, IsConstexpr); // C++ 6.4p4
 
diff --git a/clang/test/CodeGen/amdgpu-builtin-cpu-is.c b/clang/test/CodeGen/amdgpu-builtin-cpu-is.c
new file mode 100644
index 0000000000000..6e261d9f5d239
--- /dev/null
+++ b/clang/test/CodeGen/amdgpu-builtin-cpu-is.c
@@ -0,0 +1,65 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals all --version 5
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx900 -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCN-GFX900 %s
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx1010 -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCN-GFX1010 %s
+// RUN: %clang_cc1 -triple spirv64-amd-amdhsa -emit-llvm %s -o - | FileCheck --check-prefix=AMDGCNSPIRV %s
+
+// Test that, depending on triple and, if applicable, target-cpu, one of three
+// things happens:
+//    1) for gfx900 we emit a call to trap (concrete target, matches)
+//    2) for gfx1010 we emit an empty kernel (concrete target, does not match)
+//    3) for AMDGCNSPIRV we emit llvm.amdgcn.is.gfx900 as a bool global, and
+//       load from it to provide the condition a br (abstract target)
+//.
+// AMDGCN-GFX900: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 600
+//.
+// AMDGCN-GFX1010: @__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 600
+//.
+// AMDGCNSPIRV: @llvm.amdgcn.is.gfx900 = external addrspace(1) externally_initialized constant i1
+//.
+// AMDGCN-GFX900-LABEL: define dso_local void @foo(
+// AMDGCN-GFX900-SAME: ) #[[ATTR0:[0-9]+]] {
+// AMDGCN-GFX900-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-GFX900-NEXT:    call void @llvm.trap()
+// AMDGCN-GFX900-NEXT:    ret void
+//
+// AMDGCN-GFX1010-LABEL: define dso_local void @foo(
+// AMDGCN-GFX1010-SAME: ) #[[ATTR0:[0-9]+]] {
+// AMDGCN-GFX1010-NEXT:  [[ENTRY:.*:]]
+// AMDGCN-GFX1010-NEXT:    ret void
+//
+// AMDGCNSPIRV-LABEL: define spir_func void @foo(
+// AMDGCNSPIRV-SAME: ) addrspace(4) #[[ATTR0:[0-9]+]] {
+// AMDGCNSPIRV-NEXT:  [[ENTRY:.*:]]
+// AMDGCNSPIRV-NEXT:    [[TMP0:%.*]] = load i1, ptr addrspace(1) @llvm.amdgcn.is.gfx900, align 1
+// AMDGCNSPIRV-NEXT:    br i1 [[TMP0]], label %[[IF_THEN:.*]], label %[[IF_END:.*]]
+// AMDGCNSPIRV:       [[IF_THEN]]:
+// AMDGCNSPIRV-NEXT:    call addrspace(4) void @llvm.trap()
+// AMDGCNSPIRV-NEXT:    br label %[[IF_END]]
+// AMDGCNSPIRV:       [[IF_END]]:
+// AMDGCNSPIRV-NEXT:    ret void
+//
+void foo() {
+    if (__builtin_cpu_is("gfx90...
[truncated]

llvmbot (Member) commented Apr 2, 2025

@llvm/pr-subscribers-clang-codegen


@AlexVlx AlexVlx added SPIR-V SPIR-V language support llvm:transforms and removed clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Apr 2, 2025

github-actions bot commented Apr 2, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Apr 2, 2025
jhuber6 (Contributor) left a comment


Very cool. In general I'm a fan of being able to use LLVM-IR as a more general target. We already hack around these things in practice, so I think it's only beneficial to formalize it in a more correct way, even if LLVM-IR wasn't 'strictly' intended to be this kind of serialization format.

// AMDGCNSPIRV-NEXT: ret void
//
void foo() {
if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_permlanex16))
Contributor

Is this intended to handle builtins that require certain target features to be set?

Contributor Author

Yes.

Contributor

Could we get a test? Something simple like +dpp?

Contributor Author

Could we get a test? Something simple like +dpp?

Sure, but if possible, could you clarify what you would like to be tested / what you expect to see, so that we avoid churning.

Contributor

The issue with how the ROCm device libs do it is that certain builtins require target features to be used; they hack around this with __attribute__((target)). I just want to know that you can call a builtin that requires the +dpp feature without that.
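
A minimal sketch of the kind of test being asked for, assuming __builtin_amdgcn_update_dpp is one of the builtins gated on the dpp target feature (the specific builtin and arguments here are illustrative only):

  // Call a feature-gated builtin without __attribute__((target("dpp"))),
  // guarded solely by the new predicate builtin.
  void bar(int *p, int x) {
    if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_update_dpp))
      *p = __builtin_amdgcn_update_dpp(0, x, 0x111, 0xf, 0xf, false);
    else
      *p = x;
  }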

shiltian (Contributor) left a comment

This is worth a release note item.

AlexVlx (Contributor Author) commented Apr 2, 2025

This is worth a release note item.

Indeed! I botched moving the changes from my internal scratchpad, and the rel notes got lost; fixing.

AlexVlx (Contributor Author) commented Jun 30, 2025

Alex, can you talk about why your design decides to check for specific builtins rather than building out the set of features supported by __builtin_cpu_supports?

I went into it a bit above without having seen your question (race condition I guess:) ), but to have it in one spot:

  • AMDGPU features are a bit volatile and subject to disruptive change, sadly (we should be better about this but it's going to be a marathon, and it's not entirely under our - LLVM compiler team - control);
  • We don't really document the features / they are formulated in a way that makes sense for the BE, and maybe for a compiler dev, but would be extremely confusing for a user - for example, note that we have about a dozen DOT-related features, which aren't always inclusive of each other, so you cannot actually infer that DOTn implies DOTn-1;
  • Conversely, the builtins devs reach for most often implement some specific capability, i.e. they just mirror an ISA instruction that they want to use (e.g. mfma / wmma), and these are documented via the ISA docs we publish, so having a per-builtin check seemed to match common usage and benefited from what is already in place, as opposed to depending on hypothetical long-pole changes.

Now, this is specific to AMDGPU, I don't want to speculate too much about how other targets deal with this - which is another reason for which these are target builtins rather than going for something more generic.
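
To make the contrast concrete, a condensed sketch (reusing builtins from the documentation examples in the patch) of how user code names the builtin itself rather than a backend feature string:

  if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var))
    __builtin_amdgcn_s_sleep_var(x);  // capability named via the builtin
  else
    __builtin_amdgcn_s_sleep(42);     // fallback for targets without it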

@MrSidims
Copy link
Contributor

Let me add my few cents here.

In the case where the target features are known during clang codegen, lowering is easy: you just skip generating the bodies of the if statements that don't match. If you want some kind of "runtime" (actual runtime, or SPIR-V compilation-time) detection, it's not clear what the LLVM IR should look like: we only support specifying target features on a per-function level. But we can look at that separately.

Let me try to answer this question without introducing a new builtin in clang (at first). In SPIR-V there is the specialization constant, which AFAIK doesn't have a direct LLVM IR counterpart.
Some SPIR-V pseudo-code would look like this:

%int = OpTypeInt 32 1
%runtime_known_hw_id = OpSpecConstant %int 0 // global var
%hw_id_that_supports_feature = OpConstant %int 42

kernel void foo(...) {
/* ... */
%cmp = OpIEqual %bool %runtime_known_hw_id %hw_id_that_supports_feature
if (%cmp = true) {
/* some feature */
} else {
/* other feature */
}

At runtime, when such a SPIR-V module is JIT compiled, the OpSpecConstant materializes, so DCE (or rather some variation of DCE that is forced to run even at optnone) will be able to reason about the %cmp result and remove the dead branch, so we won't get an unsupported feature at codegen.

The problem is: how to generate such SPIR-V from clang. My understanding is that the new builtin should eventually be lowered (by the SPIR-V backend?) to a construct like the one in the pseudo-code, though that is not what is currently happening. And I believe that the existing __builtin_cpu_supports is not a good match for such lowering.

AlexVlx (Contributor Author) commented Jun 30, 2025

(Quoting MrSidims' comment above in full.)

This is one possible implementation indeed, for a workflow that goes from SPIR-V to ISA, or chooses to do the DCE in SPIR-V. Due to having to compose with an existing mature toolchain, rather than starting fresh, we have a slightly different flow where we reverse translate to LLVM IR and "resume" compilation from that point. Hence the implicitly inserted, never-to-be-emitted globals, which, coupled with the dedicated predicate expansion pass, play the role the spec constants play in your example. Something similar could be added to e.g. spirv-opt. Thank you for the example, it is helpful.
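To make that flow a bit more concrete, here is a rough sketch (the builtin is the one proposed in this PR; the target name and everything else are merely illustrative):

// Illustrative source; nothing target-specific is decided here when targeting amdgcnspirv.
void f() {
  if (__builtin_amdgcn_processor_is("gfx906")) {
    // gfx906-only path
  } else {
    // generic fallback
  }
}
// For concrete gfxXXX targets the guard is already resolved in Clang. For
// amdgcnspirv the condition survives into SPIR-V (via the implicitly inserted,
// never-emitted globals mentioned above, standing in for spec constants); at
// finalisation time the module is reverse translated to LLVM IR and the
// predicate expansion pass folds the guard for the now-known concrete target,
// removing the dead arm before any other optimisation runs.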

@jhuber6
Contributor

jhuber6 commented Jun 30, 2025

There's a high likelihood that I'll need something similar for my GPU libraries, so I'd prefer something not explicitly tied to SPIR-V.

@rjmccall
Contributor

rjmccall commented Jun 30, 2025

An intrinsic seems like the right IR model for CPU recognition, even for targets that don't specifically need to late-resolve it. That should be much easier for passes to optimize based on CPU settings than directly emitting the compiler-rt reference in the frontend. I know that generating IR with conservative target options and then bumping the target CPU in a pass is something various people have been interested in, so late optimization is specifically worth planning for here.

We do have a theoretical problem with guaranteeing that non-matching code isn't emitted, because LLVM IR doesn't promise to leave a code sequence like this alone:

  %0 = call i1 @llvm.compiler_supports(...)
  br i1 %0, label %foo, label %bar

LLVM could theoretically complicate this by e.g. introducing a PHI or an or. But that's a general LLVM problem that any lowering would have to deal with.

@AlexVlx
Contributor Author

AlexVlx commented Jun 30, 2025

An intrinsic seems like the right IR model for CPU recognition, even for targets that don't specifically need to late-resolve it. That should be much easier for passes to optimize based on CPU settings than directly emitting the compiler-rt reference in the frontend. I know that generating IR with conservative target options and then bumping the target CPU in a pass is something various people have been interested in, so late optimization is specifically worth planning for here.

We do have a theoretical problem with guaranteeing that non-matching code isn't emitted, because LLVM IR doesn't promise to leave a code sequence like this alone:

  %0 = call i1 @llvm.compiler_supports(...)
  br i1 %0, label %foo, label %bar

LLVM could theoretically complicate this by e.g. introducing a PHI or an or. But that's a general LLVM problem that any lowering would have to deal with.

The solution we went with here (for our use case) is to just run the predicate expansion pass over pristine Clang-generated IR, before any other optimisation. I think @nikic suggested an alternative based on callbr, but that would be somewhat challenging to represent in SPIR-V, which is important to us; then again, this could just be an implementation detail of how cpu_is gets lowered, I guess? I.e., since we know we're only ever going to deal with this early, we can just leave the call in place, knowing no optimisation will have had a chance to complicate things; conversely, other targets could go with callbr etc.

@efriedma-quic
Collaborator

Whilst I am thankful for the feedback I think it is somewhat unfortunate that we could not have a shared discussion about this, since I think that there are some core misunderstandings that seem to recur, which makes forward progress one way or the other difficult.

We didn't really say much on the call itself; we just spent a minute while we were going through controversial RFCs/PRs, to call this out as something that needed attention.

If you think this topic would benefit from a meeting, we can organize one... but maybe a 1-on-1 chat would be better to start with, just to make sure we're on the same page.

The front-end cannot generate accurate diagnostics for the actual interesting case where the target is abstract (AMDGCNSPIRV, or the generic target @jhuber6 uses in GPU libc, if we extend things in that direction), because there isn't enough information - we only know what the concrete target is, and hence what features are available, at a point in time which is sequenced after the front-end has finished processing (could be run-time JIT for SPIR-V, could be bit code linking in a completely different compilation for GPU libc etc.);

If you have a construct like the following:

if (__builtin_amdgcn_processor_is("gfx9000")) {
  some_gfx9000_specific_intrinsic();
}

some_gfx9000_specific_intrinsic();

We can tell, statically, that the first call is correctly guarded by an if statement: it's guaranteed it will never run on a non-gfx9000 processor. The second call, on the other hand, is not. So we can add a frontend rule: the first call is legal, the second is not. Obviously the error has false positives, in the sense that we can't actually prove the second call is incorrect at runtime... but that's fine, probably.

What I don't want is that we end up with, essentially, the same constraint, but enforced by the backend.

There is no watertight mechanism here in the presence of indirect function calls / pointers to functions

Sure; we can't stop people from calling arbitrary pointers.

We do have a theoretical problem with guaranteeing that non-matching code isn't emitted, because LLVM IR doesn't promise to leave a code sequence like this alone:

There are ways to solve this: for example, we can make llvm.compiler.supports produce a token, and staple that token onto the intrinsics using a bundle. Making this work requires that IRGen knows which intrinsic calls are actually impacted...

I care less about exactly how we solve this because we can adjust the solution later. Whatever we expose in the frontend is much harder to change later.

@AlexVlx
Contributor Author

AlexVlx commented Jun 30, 2025

If you think this topic would benefit from a meeting, we can organize one... but maybe a 1-on-1 chat would be better to start with, just to make sure we're on the same page.

Definitely, more than happy to have a 1-on-1 (2-on-1 even, since I think @AaronBallman also suggested something along these lines as well :) ).

The front-end cannot generate accurate diagnostics for the actual interesting case where the target is abstract (AMDGCNSPIRV, or the generic target @jhuber6 uses in GPU libc, if we extend things in that direction), because there isn't enough information - we only know what the concrete target is, and hence what features are available, at a point in time which is sequenced after the front-end has finished processing (could be run-time JIT for SPIR-V, could be bit code linking in a completely different compilation for GPU libc etc.);

If you have a construct like the following:

if (__builtin_amdgcn_processor_is("gfx9000")) {
  some_gfx9000_specific_intrinsic();
}

some_gfx9000_specific_intrinsic();

We can tell, statically, that the first call is correctly guarded by an if statement: it's guaranteed it will never run on a non-gfx9000 processor. The second call, on the other hand, is not. So we can add a frontend rule: the first call is legal, the second is not. Obviously the error has false positives, in the sense that we can't actually prove the second call is incorrect at runtime... but that's fine, probably.

I will note that on concrete targets, what is being proposed already works as described, by virtue of it being an error to call a builtin that is not available. Having said that, this gives me some trepidation and I think it can end up working against users. Consider the following case:

void foo() {
  if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_gfx9000_specific_intrinsic))
    __builtin_amdgcn_gfx9000_specific_intrinsic();
}

void bar() {
  if (__builtin_amdgcn_processor_is("gfx9000"))
    foo();

  foo();
}

We've just made the call to foo() illegal on anything that is not gfx9000, but that builtin / intrinsic could exist in 8999 other gfx versions. These don't always form binary, mutually exclusive structures. So I think I disagree with the "that's fine, probably".

What I don't want is that we end up with, essentially, the same constraint, but enforced by the backend.

Could you please detail why? Ultimately the BE still gets to decide on the legality of things that tweak it pretty intrinsically, even if said things come from otherwise linguistically correct constructs which have passed FE analysis. Also, we'd never really reach the BE; we're just sliding in immediately after Clang, before optimisation, so there's still enough info to provide a useful error message. Furthermore, this might be a better point to check anyway, as linking in bitcode could / should have already occurred, so what would otherwise have been external symbols that impact viability would now be satisfied.

I care less about exactly how we solve this because we can adjust the solution later. Whatever we expose in the frontend is much harder to change later.

Between making the wrong choice and going with something that's user-adverse early on, then trying to build increasingly complicated mechanisms to make it work, I would prefer we just left these as target-specific, low-level builtins returning bool, with no convenient Sema guards / errors. This solves actual problems we are aware of / matches uses we have already seen in practice. Target builtins with no safety handles are supposed to be volatile and unstable specialist tools, so they don't encumber the FE in the same way. At the risk of being repetitive, there's already functionality upstream that works along the same lines, so there is a precedent. Furthermore, if we derive a superior generic design, it'd hopefully stand on its own merits and would allow and warrant migration.

@efriedma-quic
Collaborator

Definitely, more than happy to have a 1-on-1 (2-on-1 even, since I think @AaronBallman also suggested something along these lines as well :) ).

Please email me with some times that will work for you.

We've just made the call to foo() illegal on anything that is not gfx9000

I... don't think I'm suggesting this? The fact that foo() is called from a __builtin_amdgcn_processor_is block shouldn't imply anything about other calls to foo().

What I'm basically suggesting is just exposing SPIR-V specialization constants as a C construct. Your example SPIR-V was something like:

%cmp = OpIEqual %bool %runtime_known_hw_id %hw_id_that_supports_feature
if (%cmp == true) {
/* some feature */
} else {
/* other feature */
}

We want to come up with a corresponding C construct that's guaranteed to compile to valid SPIR-V. My suggestion is something like:

if (__runtime_known_hw_id_eq("hw_id_that_supports_feature")) {
  /* some feature */
}

In the body of the if statement, you can use whatever intrinsics are legal on hw_id_that_supports_feature.

we're just sliding in immediately after Clang, before optimisation

Isn't doing checks immediately after IR generation basically the same as checking the AST, just on a slightly different representation?

@AlexVlx
Contributor Author

AlexVlx commented Jul 2, 2025

Definitely, more than happy to have a 1-on-1 (2-on-1 even, since I think @AaronBallman also suggested something along these lines as well :) ).

Please email me with some times that will work for you.

We've just made the call to foo() illegal on anything that is not gfx9000

I... don't think I'm suggesting this? The fact that foo() is called from a __builtin_amdgcn_processor_is block shouldn't imply anything about other calls to foo().

Perhaps I am misunderstanding, in which case I apologise. I started from: "We can tell, statically, that the first call is correctly guarded by an if statement: it's guaranteed it will never run on a non-gfx9000 processor. The second call, on the other hand, is not. So we can add a frontend rule: the first call is legal, the second is not." I'm saying we cannot really infer anything about the legality of a naked call to a builtin either, at this point, because the builtin might be available on many processors / processors other than gfx9000. We can develop the argument to say "well, fine, what we actually meant here is is_invocable, rather than processor_is, and then things work out", but the corollary to that appears to be that if you ever use the predicate on a builtin, you must touch every other use of that builtin within at least the same function, and relate it to the predicate evaluation.

What I'm basically suggesting is just exposing SPIR-V specialization constants as a C construct. Your example SPIR-V was something like:

%cmp = OpIEqual %bool %runtime_known_hw_id %hw_id_that_supports_feature
if (%cmp == true) {
/* some feature */
} else {
/* other feature */
}

We want to come up with a corresponding C construct that's guaranteed to compile to valid SPIR-V. My suggestion is something like:

if (__runtime_known_hw_id_eq("hw_id_that_supports_feature")) {
  /* some feature */
}

I'm confused as to what is different versus what this PR does, which does generate valid SPIR-V / LLVM IR. Perhaps there is an underlying assumption that there is some construct that makes the otherwise dead block still contain valid code, and there really isn't. There's an example I provided above where what is guarded is (static) finite resource allocation, not just the use of an intrinsic; we'd not know in the FE which is correct, we cannot choose between the two until we know the target at JIT / finalisation time (so before executing the code), and we cannot generate executable code with both allocation requests live, as the finite resource gets exhausted. So the only place where we can meaningfully deal with this is in the ME / over IR, before hitting the BE. We should be careful to avoid focusing on the processor_is / hw_id aspect, as this leads to brittle code that has to constantly grow additional identity checks via || disjunction.
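To illustrate the resource allocation case with a minimal HIP-flavoured sketch (the target name, the sizes and the use() helper are all made up):

__device__ void use(char *);

__global__ void k() {
  if (__builtin_amdgcn_processor_is("gfx9000")) {
    __shared__ char lds[160 * 1024]; // illustrative size for the hypothetical gfx9000
    use(lds);
  } else {
    __shared__ char lds[64 * 1024];  // illustrative conservative size for everything else
    use(lds);
  }
}
// Until the concrete target is known, neither allocation can be discarded, and
// keeping both live exhausts the (finite) LDS budget; only the predicate
// expansion at finalisation time can drop the dead arm and its allocation.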

In the body of the if statement, you can use whatever intrinsics are legal on hw_id_that_supports_feature.

we're just sliding in immediately after Clang, before optimisation

Isn't doing checks immediately after IR generation basically the same as checking the AST, just on a slightly different representation?

Not in this case. There are at least two aspects that make a difference:

  • linking in bitcode, which can allow more extensive analysis than what you can do per TU in the AST - this is minor, however please note the conversation above about having to be conservative around external symbols, and the risks of leaving them around;
  • lack of information when generating the AST, when dealing with abstract targets like SPIRV (more specifically, AMDGCN flavoured SPIRV, for the purposes of this PR)
    • the FE targets amdgcnspirv, which is generic across all concrete AMDGPU targets (union of features);
    • the predicates proposed here offer customisation points for which the resolution is deferred to the point where the target is known;
    • we only know the concrete target when we are finalising, which happens at a completely different time-point, on possibly a different machine;
    • we cannot time-travel to inform the AST about this, but we can compose generic IR with target IR, and lower it as target IR (this is already how various flavours of device / offload libs work, so it's hardly novel).

None of the above matters for concrete targets, where we just resolve everything in the AST already, because we have full information in the FE.

@rjmccall
Contributor

rjmccall commented Jul 2, 2025

I think Eli is suggesting something like the rule for @available:

  • If a builtin is unconditionally available, you can always use it without any diagnostics.
  • If a builtin is only available on specific subtargets, you can only use it in a block that's guarded by an if (some_intrinsic(...)) condition that will only pass if you're running on one of those subtargets.

So it's not that adding a check for the builtin will suddenly cause un-checked calls to it to fail, it's that you have to have such a check to use it in the first place, which makes sense because it's not always available on the target.

Note that the @available model also includes an attribute for marking your own functions as conditionally available, in which case (1) the entire body of the function is considered guarded but (2) you have to guard calls to it. That might be a necessary enhancement for your feature, too.
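Transposed to the builtins proposed here, that rule would look roughly like this (the intrinsic name is hypothetical and the diagnostics are invented for illustration):

void use_feature() {
  __builtin_amdgcn_gfx9000_specific_intrinsic();   // diagnosed: unguarded use of a conditionally available builtin

  if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_gfx9000_specific_intrinsic))
    __builtin_amdgcn_gfx9000_specific_intrinsic(); // fine: the use is guarded
}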

@AlexVlx
Contributor Author

AlexVlx commented Jul 2, 2025

I think Eli is suggesting something like the rule for @available:

  • If a builtin is unconditionally available, you can always use it without any diagnostics.
  • If a builtin is only available on specific subtargets, you can only use it in a block that's guarded by an if (some_intrinsic(...)) condition that will only pass if you're running on one of those subtargets.

So it's not that adding a check for the builtin will suddenly cause un-checked calls to it to fail, it's that you have to have such a check to use it in the first place, which makes sense because it's not always available on the target.

This is interesting, and I had / have looked at @available (albeit I am far from being a scholar on the topic). It makes logical sense, but I expect users will simply ignore it, since it is very restrictive if we go with "you have to have a check to use a builtin". It's not as if all builtin uses present in user code today are guarded by __has_builtin or by a check against an architecture macro. I will also note that, as far as I can see, __builtin_available, which we also provide for C & C++, at most warns (https://gcc.godbolt.org/z/Msrrn4x9v), with the warning being opt-in. It also does not appear to generate any sort of special IR construct; it's just sugar over a call to a runtime-provided interface, AFAICT. Furthermore, unlike for __builtin_available, there's no immediately apparent way to provide a useful warning here:

  • if we're compiling for concrete we already know which builtins are available / what target is present, so whether something is legal or not is fully determined;
  • conversely, for the abstract case we are targeting a generic target which has all the features, so at most we could be somewhat spammy and say "be sure to wrap this in a __builtin_amdgcn_is_invocable call";
  • this only covers a subset of cases, since there are also e.g. per-target resource allocation choices, so we would have to hoist even more architecture details, such as the size of shared memory, into Clang in order to warn;
  • this'd probably balloon into a non-trivial amount of checking (the Sema footprint for @available is not exactly petite), we'd still at most get to warn, and we would still have to run the IR pass, which is actually in a position to correctly diagnose an error state.

If the added warning is considered worthwhile, I can look at adding that, but I think that should be a separate patch / conversation, since it'd mess with established builtin behaviour (as mentioned, one can reach for an unguarded builtin today without any additional diagnostics / invitation to touch __has_builtin, and there are examples where builtins the FE believes work are actually not available on a target; see, for example, the math ones).

Note that the @available model also includes an attribute for marking your own functions as conditionally available, in which case (1) the entire body of the function is considered guarded but (2) you have to guard calls to it. That might be a necessary enhancement for your feature, too.

Unless I am missing something obvious, this brings us dangerously close to re-inventing the language subsetting / restrictions that are already present in single-source(-ish) offload languages. It ends up being __device__ / restrict(amp) / omp declare target in slightly different clothes. I don't think that is a desirable effect here. At least for our use case (which is what these are trying to support), we need a mechanism that just works with what is already there, and can be directly used from C / C++, with existing C / C++ codebases, i.e. it has to work with what our library authors have, using idioms they are familiar with.

@rjmccall
Contributor

rjmccall commented Jul 2, 2025

So your users today are building for generic AMDGPU but using builtins that are only available on a specific processor release? Presumably those builtin calls are protected somehow, since otherwise their programs would be crashing in the backend or, worse, at runtime. Are they managing that in some way that the language could theoretically leverage at all, or in practice is it too varied and ad hoc?

@jhuber6
Contributor

jhuber6 commented Jul 2, 2025

So your users today are building for generic AMDGPU but using builtins that are only available on a specific processor release? Presumably that code is protected somehow and their programs are not simply crashing at runtime. Is that something you'd be able to leverage at all, or is it completely ad hoc?

This is basically how the NVIDIA device library and ROCm device library already work. In the latter case we just accept that globalopt,dce is required to clean that up after injecting the library code into the user's application. Part of this formalizes that.

@rjmccall
Contributor

rjmccall commented Jul 2, 2025

Right, but the code still contains some kind of check to decide which version of the library function to call. Maybe it's implicit by multiversioned functions or something, but there's something in the source code, because if the user just writes a kernel that does nothing but call a builtin that's not always available, you've got a problem. My question is just what those checks look like and whether that's something we can reasonably leverage in the language design here to help programmers not make that kind of mistake.

And if these checks are all done in the library, the library can of course just be annotated.

@AlexVlx
Contributor Author

AlexVlx commented Jul 2, 2025

So your users today are building for generic AMDGPU but using builtins that are only available on a specific processor release? Presumably that code is protected somehow and their programs are not simply crashing at runtime. Is that something you'd be able to leverage at all, or is it completely ad hoc?

They do crash at run time, except not in the way one would expect - they fail when finalising / JIT-ing from SPIRV, which is still a compiler failure, except it's a BE / ISEL one. But yes, this is a current problem (which this is addressing). Here's an example (there are others):

  • client code uses some builtins that are only available on RDNA (GFX10+), guarded by #if __has_builtin(something_something);
  • when targeting AMDGCN-flavoured SPIRV (amdgcnspirv), the union of builtins is available, since we don't know what the concrete target will end up being, and we want to maximally leverage features, so the check is true and the RDNA builtin ends up in SPIRV;
  • the compiled library / executable gets executed on a CDNA machine;
  • depending on the nature of the intrinsic, a JIT-time error ensues.

What we would like to do is to allow people to handle these cases with a linear translation from the above into if (__builtin_amdgcn_is_invocable(something_something)), which would then lead to code that works everywhere with maximum capability (we don't have to reduce things to a common subset), without having to be linear in targets. I mention the latter because our device libs (which are not upstream) deal with this via a different, convoluted mechanism, since there was no amdgcnspirv / generic target at the time, which requires generating separate bitcode per target; that is not long-term viable as we get more and more targets.
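Concretely, the intended translation is roughly the following (something_something stands in for an RDNA-only builtin):

void use_rdna_feature() {
  // Today: the preprocessor burns the choice in when the TU is compiled, so for
  // amdgcnspirv (union of features) the RDNA-only path is unconditionally kept.
#if __has_builtin(something_something)
  something_something();
#endif

  // With this PR: the choice is deferred until the concrete target is known.
  if (__builtin_amdgcn_is_invocable(something_something))
    something_something();
}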

@efriedma-quic
Collaborator

we only know the concrete target when we are finalising, which happens at a completely different time-point, on possibly a different machine;

This is precisely why we want the frontend diagnostic: if we diagnose the bad cases in the frontend, later stages are guaranteed to lower correctly. If we diagnose later, you don't know you messed up until you get crash reports from your users.

conversely, for the abstract case we are targeting a generic target which has all the features, so at most we could be somewhat spammy and say "be sure to wrap this in a __builtin_amdgcn_is_invocable call";

I prefer to think of it as a generic target which has none of the features.

Yes, you might have to take some time to annotate your code, but once you have the annotations, it catches a lot of potential mistakes.


In case nobody else has brought it up, we currently do the following on Arm, which is conceptually similar:

#include <arm_sve.h>
__attribute__((target("sve"))) void f() {
  svbool_t x = svptrue_b8(); // Allowed
}
void f2() {
  svbool_t x = svptrue_b8(); // Error: SVE isn't enabled.
}

@jhuber6
Contributor

jhuber6 commented Jul 2, 2025

In case nobody else has brought it up, we currently do the following on Arm, which is conceptually similar:

#include <arm_sve.h>
__attribute__((target("sve"))) void f() {
  svbool_t x = svptrue_b8(); // Allowed
}
void f2() {
  svbool_t x = svptrue_b8(); // Error: SVE isn't enabled.
}

The ROCm Device Libs use the target attribute but it mostly just serves to make the compiler shut up about the incompatible builtin the function contains. It also means that we need to propagate attributes for it to be correct during linking.

@AlexVlx
Contributor Author

AlexVlx commented Jul 2, 2025

This is precisely why we want the frontend diagnostic: if we diagnose the bad cases in the frontend, later stages are guaranteed to lower correctly. If we diagnose later, you don't know you messed up until you get crash reports from your users.

I guess I am lost as to what one can diagnose. The upper bound would be the warning __builtin_available generates. You simply cannot unambiguously establish if something is or isn't there until you get the concrete target, so we have to (and do) accept that some errors will come from users.

I prefer to think of it as a generic target which has none of the features.

This is an interesting view but does not address our uses / interests, which is why we defined it as described, as union rather than intersection.

Yes, you might have to take some time to annotate your code, but once you have the annotations, it catches a lot of potential mistakes.

But not all, and it does so at the cost of yet another annotation (we have quite a few already). Furthermore, I feel that what is getting lost in translation is that the scope of these is not just functions. You can opt to guard ASM blocks, which might have target-specific constraints, or resource allocation (as I mentioned). Extracting all of those into separate, annotated functions pretty much guarantees limited interest - very few users are going to start rewriting their working code to extract builtin / asm / resource allocation uses into separate named functions with attributes. Even if they did, you still do not know the concrete target, and therefore cannot unambiguously reason about the legality of this or that call (which is why even __builtin_available warns, since it doesn't actually know what environment it'll execute on). So we still have to defer things to finalisation time, and will still get user error reports. Furthermore, as @jhuber6 points out, going with attributes means we lose the ability to reason about things locally, at an instruction boundary, and instead have to start relying on things like the Attributor (which doesn't actually run at O0, for example).

@AlexVlx
Contributor Author

AlexVlx commented Jul 3, 2025

@efriedma-quic was kind enough to have a call where we discussed this a bit more. I'll update tomorrow with a potential way forward, for the group's consideration.

@AlexVlx
Contributor Author

AlexVlx commented Jul 3, 2025

@efriedma-quic was kind enough to have a call where we discussed this a bit more. I'll update tomorrow with a potential way forward, for the group's consideration.

Following up, here's a possible approach to making progress, broken down into phases (@efriedma-quic can correct me if I am misrepresenting any of these):

  1. Have what is proposed here as an initial step, with the addition that we issue warnings on unguarded uses of builtins / ASM (similar to what __builtin_available / @available do), and that we clean up non-extern functions that become unreachable as a consequence of predicate expansion (i.e. foo can only be called from within this module, and it was only being called from a predicate-guarded block, which was removed);
  2. Add attribute based checking for predicate guarded areas:
    • Functions can be annotated either with the existing target attribute or with a new target_can_invoke (name up for bike-shedding) attribute;
    • Within a predicate-guarded scope, if we encounter contradictions, e.g. calling a target("gfx9000") function, or a target_can_invoke(builtin_only_on_gfx9000) one, within a __builtin_amdgcn_processor_is("gfx8999") block, that is an error (a rough sketch follows this list);
    • This should reward users that go through the effort of annotating their functions, making it much harder to write bugs
    • I'm not entirely sure how to do this well yet (nested guarded regions, where to track the currently active guard, etc.), and it probably needs a bit more design, which is why it's a separate phase;
    • It is a pre-requisite for any attempt at making these general, rather than target specific
  3. In relation to generalisation, if we go in that direction (i.e. other targets are interested / we think there's merit in hoisting these into generic Clang builtins), we will have to look at whether or not we want a different IR representation (possibly / probably along the lines of what has been discussed here), for cases where a target must run some potentially disruptive optimisations beforehand and cannot just do the expansion right after Clang.
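As for the phase 2 sketch mentioned above, it could look something like this (attribute spelling and diagnostics entirely subject to bike-shedding; all names are hypothetical):

// Hypothetical annotation: callers must guarantee the named builtin is invocable.
__attribute__((target_can_invoke(builtin_only_on_gfx9000)))
void helper();

void caller() {
  if (__builtin_amdgcn_processor_is("gfx8999"))
    helper();   // phase 2 would diagnose this: it contradicts the active guard

  if (__builtin_amdgcn_is_invocable(builtin_only_on_gfx9000))
    helper();   // fine: the active guard satisfies the annotation
}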

@JonChesterfield
Collaborator

The frontend tradeoffs here are complicated and already under discussion so I'm going to skip over that aspect.

This looks like a layer of stuff which can be built on top of an llvm intrinsic that guarantees branch folding before instruction selection. The problems with the rocm device libs having invalid code on branches that the compiler is meant to strip but doesn't at O0 would be solvable by leaving calls to that raw intrinsic in place.

We can probably do that as an intrinsic returning bool passed to the branch, where a target hook is called on it to resolve to true/false/report-error, as part of a simple simplify-cfg style pass. Essentially just force the evaluation of the intrinsic and then promise to delete dead branches.

I'd like that intrinsic anyway for language runtime hackery. A bool llvm.name_tbd(...) sort of prototype, where whichever backend is live goes grovelling through the arguments / metadata / whatever to make the decision. That we could also implement these front-end Sema-style things on top of it seems great.
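For concreteness, a rough sketch of the shape such a pass could take, using standard LLVM utilities (this is not the pass in this PR; llvm.name_tbd and the target hook are placeholders, and error reporting is elided):

#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/Local.h"

using namespace llvm;

// Force-evaluate every call to the placeholder predicate intrinsic via a
// target-provided hook, then fold the now-constant branches and drop the
// unreachable arms, so this works even at O0.
static bool expandPredicates(Function &F, bool (*TargetHook)(const CallInst &)) {
  bool Changed = false;
  for (BasicBlock &BB : make_early_inc_range(F)) {
    for (Instruction &I : make_early_inc_range(BB)) {
      auto *CI = dyn_cast<CallInst>(&I);
      if (!CI || !CI->getCalledFunction() ||
          CI->getCalledFunction()->getName() != "llvm.name_tbd")
        continue;
      bool R = TargetHook(*CI); // true / false; a real pass could also report an error here
      CI->replaceAllUsesWith(ConstantInt::getBool(F.getContext(), R));
      CI->eraseFromParent();
      Changed = true;
    }
    // Turn `br i1 true/false, ...` into an unconditional branch.
    Changed |= ConstantFoldTerminator(&BB, /*DeleteDeadConditions=*/true);
  }
  if (Changed)
    removeUnreachableBlocks(F);
  return Changed;
}

Running something along these lines immediately after Clang would give the "force the evaluation, then delete dead branches" behaviour described above, even at O0.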

Is there an RFC associated with this that would be a better place to put that thought? If I implement it, are we game for rebasing this on said general purpose lowering intrinsic?
