
[AutoBump] Merge with 08195f31 (Jan 23) (18) #556


Merged
283 commits merged into bump_to_7e622b61 from bump_to_08195f31 on Jun 26, 2025

Conversation

jorickert

No description provided.

HanKuanChen and others added 30 commits January 23, 2025 09:47
This cannot happen. Also simplify the LaneBitmask check from !none
to any.
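
As a side note, a minimal sketch of the simplification, assuming LLVM's `llvm::LaneBitmask` (its `none()`/`any()` accessors are real; the wrapper function is made up for illustration):

```cpp
#include "llvm/MC/LaneBitmask.h"

// Illustrative only: the two forms are equivalent; the second states the
// intent ("does this mask cover any lane?") directly.
static bool coversAnyLane(llvm::LaneBitmask LM) {
  // Before: return !LM.none();
  return LM.any();
}
```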
This was allocating tiny helper classes for every instruction
visited. We can just dispatch over the cases in the visitor
function instead.
…#123587)

This commit completed four tasks:
- Add `-mrelax/-mno-relax` options support for LoongArch in clang
driver.
- Print error for `-gsplit-dwarf` with LoongArch linker relaxation
(`-mrelax`).
- Pass `-X` to linker to discard a plethora of `.L` symbols due to
linker relaxation.
- Forward `--no-relax` option to linker.
)

This is looking at operand 0 of a REG_SEQUENCE, which can never
have a subregister index.
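
A hedged sketch of the invariant relied on here (`isRegSequence()` and `getSubReg()` are existing MachineInstr/MachineOperand accessors; the helper itself is hypothetical):

```cpp
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"

// Hypothetical check: operand 0 of a REG_SEQUENCE is the full-register def,
// so it never carries a subregister index and this always returns false.
static bool regSequenceDefHasSubReg(const llvm::MachineInstr &MI) {
  return MI.isRegSequence() && MI.getOperand(0).getSubReg() != 0;
}
```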
…ing (llvm#123474)

`handleIntrinsicByApplyingToShadow` (introduced in
llvm#114490) requires that the
intrinsic supports integer-ish operands; this is not the case for all
intrinsics. This patch generalizes the function to bitcast the shadow
arguments to be the same type as the original intrinsic, thus
guaranteeing that the intrinsic exists. Additionally, it casts the
computed shadow to be an appropriate shadow type.

This function assumes that the intrinsic will handle arbitrary
bit-patterns (for example, if the intrinsic accepts floats for var1, we
assume that it works normally even if inputs are NaNs etc.).
…lvm#123911)

We already bail if the user is tied in checkUsers, which is true for all
passthrus. Remove the check in getOperandLog2EEW so that it only worries
about computing the OperandInfo, and leaves the passthru correctness to
checkUsers.
…124046)

Reverts llvm#123853

The introduction of `reflect-error.ll` surfaced a bug with the use of
`report_fatal_error` in `SPIRVInstructionSelector` that was propagated
into the PR. This has caused a build-bot breakage, and the work to solve
the underlying issue is tracked here:
llvm#124045. We can re-apply this
commit when the underlying issue is resolved.
…lvm#123701)

Currently, an `emitc.switch` with an `emitc.expression` argument wouldn't emit
its argument to cpp. This patch fixes it.
…ng it LLVM dialect (llvm#123840)

With these changes, CUF atomic operations are handled as cudadevice
intrinsics and are converted straight to the LLVM dialect with the
`llvm.atomicrmw` operation.

I am only submitting changes for `atomicadd` to gather feedback. If we
are to proceed with these changes I will add support for all other
applicable atomic operations following this pattern.
Increases the minimum CAS size from 16 bits to 32 bits, for better SASS
codegen.

When atomics are emulated using atom.cas.b16, the generated SASS
includes two (nested) emulation loops. When emulated using an atom.cas.b32
loop, the SASS has only a single emulation loop. Using a 32-bit CAS thus
results in better codegen.
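
A plain-C++ illustration of the idea, not the NVPTX lowering itself: a 16-bit atomic add emulated with a single 32-bit compare-and-swap loop on the containing word (the function and its half-word addressing are assumptions of the sketch):

```cpp
#include <atomic>
#include <cstdint>

// Emulate a 16-bit fetch_add using one 32-bit CAS loop; returns the old value.
static uint16_t atomic_add_u16(std::atomic<uint32_t> &word, bool high_half,
                               uint16_t val) {
  const uint32_t shift = high_half ? 16 : 0;
  uint32_t old = word.load(std::memory_order_relaxed);
  for (;;) {
    const uint16_t cur = uint16_t(old >> shift);
    const uint32_t desired = (old & ~(0xFFFFu << shift)) |
                             (uint32_t(uint16_t(cur + val)) << shift);
    // On failure, `old` is refreshed and we retry: a single emulation loop.
    if (word.compare_exchange_weak(old, desired, std::memory_order_relaxed))
      return cur;
  }
}
```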
Looking at the slowest lit tests, I'm seeing these four tests take two to
eight minutes. Test coverage on Linux should be sufficient, given that the
functionality is not really useful on Windows at all.

This was observed when hacking on the new premerge in a Windows VM.
…vm#116771)

Two options for clang:
  -mno-scq:                Disable the sc.q instruction.
  -mscq:                   Enable the sc.q instruction.
The default is -mno-scq.
…lvm#123881)

This extension adds eight 48-bit load/store instructions.

The current spec can be found at:
https://github.com/quic/riscv-unified-db/releases/latest

This patch adds assembler-only support.

---------

Co-authored-by: Harsh Chandel <[email protected]>
https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
proposes to partition static data sections.

This patch introduces a codegen pass that records jump table hotness in the
in-memory state (machine jump table info and entries). Target lowering and
the asm printer consume this state and produce a `.hot` section suffix. The
follow-up PR llvm#122215 implements those changes.

---------

Co-authored-by: Ellis Hoag <[email protected]>
…#118656)

This patch is an extension to llvm#115128.

After profiling the LLVM test-suite, I see a lot of loop nests with depth
greater than `MaxLoopNestDepth`, which is 10. Exiting early for them saves
compile time, as it avoids computing DependenceInfo and CacheCost.

Please see 'bound-max-depth' branch on compile-time-tracker.
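
A minimal sketch of the early-exit idea, assuming the `MaxLoopNestDepth` limit of 10 mentioned above (the helper and its argument type are illustrative, not the pass's actual code):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Analysis/LoopInfo.h"

constexpr unsigned MaxLoopNestDepth = 10;

// Bail out before computing DependenceInfo and CacheCost for loop nests
// that are too deep to be worth considering.
static bool isLoopNestWorthConsidering(llvm::ArrayRef<llvm::Loop *> LoopNest) {
  return LoopNest.size() <= MaxLoopNestDepth;
}
```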
Fixes llvm#113191

Issue: [flang][OpenMP] Runtime segfault when an allocatable variable is
used with copyin

Root cause: The value of the threadprivate variable is not being copied
from the primary thread to the other threads within a parallel region.
As a result, a null pointer is accessed inside the parallel region,
which causes the segfault.

Fix: When allocatables are used with the copyin clause, we need to ensure
that, on entry to any parallel region, each thread's copy of the variable
first acquires the allocation status of the primary thread's copy, before
the value of the primary thread's threadprivate variable is copied to the
threadprivate variable of each other member of the team.
When a `try_table`'s catch clause's destination has a return type, as in
the case of `catch` with a concrete tag, `catch_ref`, and `catch_all_ref`,
the generated code can be invalid. For example:
```wasm
block exnref
  try_table (catch_all_ref 0)
    ...
  end_try_table
end_block
... use exnref ...
```

This code is not valid because the block's body type is not exnref. So
we add an unreachable after the 'end_try_table' to make the code valid
here:
```wasm
block exnref
  try_table (catch_all_ref 0)
    ...
  end_try_table
  unreachable                    ;; Newly added
end_block
```
Because 'unreachable' is a terminator we also need to split the BB.

---

We need to handle the same thing for unwind mismatch handling. In the
code below, we create a "trampoline BB" that will be the destination for
the nested `try_table`~`end_try_table` added to fix an unwind mismatch:
```wasm
try_table (catch ... )
  block exnref
    ...
    try_table (catch_all_ref N)
      some code
    end_try_table
    ...
  end_block                      ;; Trampoline BB
  throw_ref
end_try_table
```
While the `block` added for the trampoline BB has the return type
`exnref`, its body, which contains the nested `try_table` and other
code, doesn't have the `exnref` return type. Most of the time this isn't a
problem because the block's body ends with something like `br`
or `return`, but that may not always be the case, especially when there
is a loop. So we add an `unreachable` to make the code valid here too:
```wasm
try_table (catch ... )
  block exnref
    ...
    try_table (catch_all_ref N)
      some code
    end_try_table
    ...
    unreachable                  ;; Newly added
  end_block                      ;; Trampoline BB
  throw_ref
end_try_table
```
In this case we just append the `unreachable` at the end of the layout
predecessor BB. (This was tricky to do in the first (non-mismatch) case
because there `end_try_table` and `end_block` were added in the
beginning of an EH pad in `placeTryTableMarker` and moving
`end_try_table` and the new `unreachable` to the previous BB caused
other problems.)

---

This change adds many `unreachable`s to the output, but the tests check for
the added `unreachable`s in only a few places, to verify that this is
working. The FileCheck lines in `exception.ll` and `cfg-stackify-eh.ll` are
already heavily redacted to leave only the important control-flow
instructions, so I don't think it's worth checking for `unreachable`s
everywhere.
Resubmit; the previous PR had compilation issues.
davemgreen and others added 28 commits January 24, 2025 09:51
Most of this is miscompiling with +fullfp16 and should be disabled for GISel.
Assign register banks to virtual registers. This does not use the generic
RegBankSelect. After register bank selection, all register operands of
G_ instructions have LLTs and register banks exclusively. If they had a
register class, reassign the appropriate register bank.

Assign register banks using machine uniformity analysis:
Sgpr - uniform values and some lane masks
Vgpr - divergent, non-S1 values
Vcc  - divergent S1 values (lane masks)

AMDGPURegBankSelect does not consider the available instructions and, in
some cases, G_ instructions with some register bank assignment can't be
inst-selected. This is solved in RegBankLegalize.

Exceptions when uniformity analysis does not work:
S32/S64 lane masks:
- need to end up with sgpr register class after instruction selection
- In most cases Uniformity analysis declares them as uniform
  (forced by tablegen) resulting in sgpr S32/S64 reg bank
- When Uniformity analysis declares them as divergent (some phis),
  use intrinsic lane mask analyzer to still assign sgpr register bank
temporal divergence copy:
- COPY to vgpr with implicit use of $exec inside of the cycle
- this copy is declared as uniform by uniformity analysis
- make sure that assigned bank is vgpr
Note: uniformity analysis does not consider that registers with vgpr def
are divergent (you can have uniform value in vgpr).
- TODO: implicit use of $exec could be implemented as indicator
  that instruction is divergent
…m#123900)

... for the dynamic blocks created for operator new calls. This way we
get the type of memory allocated right. As a side-effect, the
diagnostics now point to the std::allocator calls, which is an
improvement.
)

As part of the "RemoveDIs" project, BasicBlock::iterator now carries a
debug-info bit that's needed when getFirstNonPHI and similar feed into
instruction insertion positions. Call-sites where that's necessary were
updated a year ago, but to ensure some type safety we'd like to
have all calls to moveBefore use iterators.

This patch adds a (guaranteed dereferenceable) iterator-taking
moveBefore, and changes a bunch of call-sites where it's obviously safe
to change to use it by just calling getIterator() on an instruction
pointer. A follow-up patch will contain less-obviously-safe changes.

We'll eventually deprecate and remove the instruction-pointer
insertBefore, but not before adding concise documentation of what
considerations are needed (very few).
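
A sketch of the call-site pattern described above (the wrapper function is illustrative): pass an iterator obtained via `getIterator()` rather than a raw `Instruction *`, so the debug-info bit travels with the position.

```cpp
#include "llvm/IR/Instruction.h"

// Move I immediately before InsertPt using the iterator-taking overload.
static void moveBeforeInsertPt(llvm::Instruction *I,
                               llvm::Instruction *InsertPt) {
  // Before: I->moveBefore(InsertPt);
  I->moveBefore(InsertPt->getIterator());
}
```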
When generating `arm_neon.h`, NeonEmitter outputs code that
violates strict aliasing rules (C23 6.5 Expressions #7,
C++23 7.2.1 Value category [basic.lval] #11), for example:

    bfloat16_t __reint = __p0;
    uint32_t __reint1 = (uint32_t)(*(uint16_t *) &__reint) << 16;
    __ret = *(float32_t *) &__reint1;

This patch fixes the offending code by replacing it with
a call to `__builtin_bit_cast`.
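
A hedged sketch of what the bit-cast-based form looks like (types simplified; the real code is generated into `arm_neon.h` by NeonEmitter):

```cpp
#include <cstdint>

using float32_t = float;
using bfloat16_t = __bf16; // assumption for this sketch: clang's bfloat type

static float32_t bf16_bits_to_f32(bfloat16_t p0) {
  // Value-based bit casts instead of pointer punning, so no strict-aliasing
  // violation: take the bf16 bits, widen into the high half, reinterpret.
  const uint32_t bits = (uint32_t)__builtin_bit_cast(uint16_t, p0) << 16;
  return __builtin_bit_cast(float32_t, bits);
}
```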
This patch adds SM and PTX versions for SM
101, 120 and their arch-accelerated variants.

All these are supported in cuda-12.8.
sm120/120a requires ptx8.7 and the rest require ptx8.6.

Signed-off-by: Durgadoss R <[email protected]>
Lower G_ instructions that can't be inst-selected with register bank
assignment from AMDGPURegBankSelect based on uniformity analysis.
- Lower instruction to perform it on assigned register bank
- Put uniform value in vgpr because SALU instruction is not available
- Execute divergent instruction in SALU - "waterfall loop"

Given the LLTs on all operands after the legalizer, some register bank
assignments require lowering while others do not.
Note: cases where all register bank assignments would require lowering
are lowered in the legalizer.

AMDGPURegBankLegalize goals:
- Define Rules: when and how to perform lowering
- The goal of defining Rules is to provide a brief, high-level, table-like
  overview of how to lower generic instructions based on available
  target features and uniformity info (uniform vs divergent).
- Fast search of Rules; depends on how complicated Rule.Predicate is
- For some opcodes there would be too many Rules that are essentially
  all the same, just for different combinations of types and banks.
  For those, write a custom function that handles all cases.
- Rules are made from enum IDs that correspond to each operand.
  Names of IDs are meant to give a brief description of what lowering does
  for each operand or the whole instruction.
- AMDGPURegBankLegalizeHelper implements the lowering algorithms

Since this is the first patch that actually enables -new-reg-bank-select,
here is a summary of the regression tests that were added earlier:
- if an instruction is uniform, always select a SALU instruction if available
- eliminate back to back vgpr to sgpr to vgpr copies of uniform values
- fast rules: small differences for standard and vector instruction
- enabling Rule based on target feature - salu_float
- how to specify lowering algorithm - vgpr S64 AND to S32
- on G_TRUNC in reg, it is up to user to deal with truncated bits
  G_TRUNC in reg is treated as no-op.
- dealing with truncated high bits - ABS S16 to S32
- sgpr S1 phi lowering
- new opcodes for vcc-to-scc and scc-to-vcc copies
- lowering for vgprS1-to-vcc copy (formally this is vgpr-to-vcc G_TRUNC)
- S1 zext and sext lowering to select
- uniform and divergent S1 AND(OR and XOR) lowering - inst-selected into
  SALU instruction
- divergent phi with uniform inputs
- divergent instruction with temporal divergent use, source instruction
  is defined as uniform(AMDGPURegBankSelect) - missing temporal
  divergence lowering
- uniform phi, because of undef incoming, is assigned to vgpr. Will be
  fixed in AMDGPURegBankSelect via another fix in machine uniformity
  analysis.
…vm#117939)

Canonicalize gathers/scatters with contiguous (i.e. [0, 1, 2, ...])
offsets into vector masked load/store ops.
…#123958)

`TimerGroup` doesn't need to be a field of `ClangTidyProfiling`;
we can construct it locally during destruction.
…23454)

Skip header files before registering AST matchers; this avoids matching
lots of AST nodes when linting header files.
Add IDs for bit widths that cover multiple LLTs: B32, B64, etc.
"Predicate" wrapper class for bool predicate functions, used to
write pretty rules. Predicates can be combined using &&, || and !.
Lowering for splitting and widening loads.
Write rules for loads so as not to change existing MIR tests from the old
regbankselect.
…l coroutine clones (llvm#118628)

Summary:
CoroCloner, by calling into CloneFunctionInto, does a lot of repeated
work priming DIFinder and building a list of common module-level debug
info metadata. For programs compiled with full debug info this can get
very expensive.

This diff builds the data once and shares it between all clones.

Anecdata for a sample cpp source file compiled with full debug info:

|                 | Baseline | IdentityMD set | Prebuilt CommonDI (cur.) |
|-----------------|----------|----------------|--------------------------|
| CoroSplitPass   | 306ms    | 221ms          | 68ms                     |
| CoroCloner      | 101ms    | 72ms           | 0.5ms                    |
| CollectCommonDI | -        | -              | 63ms                     |
| Speed up        | 1x       | 1.4x           | 4.5x                     |

Note that CollectCommonDebugInfo happens once *per coroutine* rather than per clone.

Test Plan:
ninja check-llvm-unit
ninja check-llvm

Compiled a sample internal source file, checked time trace output for scope timings.
…2866)

Change the existing code for G_PHI to match what the LLVM-IR version does
via PHINode::hasConstantOrUndefValue. This is not safe for a regular PHI,
since it may appear with an undef operand and getVRegDef can fail.
Most notably, this improves the number of values that can be allocated
to sgpr in AMDGPURegBankSelect.
A common case here is phis that appear in structurize-cfg lowering
for cycles with multiple exits:
the undef incoming value comes from the block that reached the cycle-exit
condition; if the other incoming value is uniform, keep the phi uniform,
despite the fact that it joins values from a pair of blocks entered
via a divergent conditional branch.
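
For reference, the LLVM-IR helper being mirrored, `PHINode::hasConstantOrUndefValue`, is an existing API; a trivial usage sketch (the wrapper is illustrative, and the G_PHI version walks machine operands via getVRegDef instead):

```cpp
#include "llvm/IR/Instructions.h"

// True when every non-undef incoming value of the phi is the same value,
// which is what lets the analysis treat the phi as trivially uniform.
static bool isSameValueOrUndefPhi(const llvm::PHINode &Phi) {
  return Phi.hasConstantOrUndefValue();
}
```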
This is the behavior expected by DWARF. It also requires some fixups to
algorithms which were storing the addresses of some objects (Blocks and
Variables) relative to the beginning of the function.

There are plenty of things that still don't work in this setup, but
this change is sufficient for the expression evaluator to correctly
recognize the entry point of a function in this case.
…llvm#123745)

Add the following workflows:

- `fullbuild` on aarch64 ubuntu
- `overlay` on windows 2025
- `overlay` on aarch64 ubuntu

`ccache` variant is used on `aarch64` due to
hendrikmuhs/ccache-action#279
…ot (llvm#121463)

In handleMFLOSlot, we may get LastInstInFunction returned as true from
getNextMachineInstr while IInSlot is null, which would trigger an assert.
So we need to skip this case.

Fix llvm#118223.
With the removal of mlir-vulkan-runner (as part of llvm#73457) in
e7e3c45, mlir-cpu-runner is now the
only runner for all CPU and GPU targets, and the "cpu" name has been
misleading for some time already. This commit renames it to mlir-runner.
[AutoBump] Merge with fixes of eb206e9 (Jan 24) (21)
[AutoBump] Merge with b4e81fd (Jan 24) (20)
[AutoBump] Merge with fixes of 8388040 (Jan 23) (19)
@jorickert jorickert merged commit 91fab1b into bump_to_7e622b61 Jun 26, 2025
10 of 11 checks passed
@jorickert jorickert deleted the bump_to_08195f31 branch June 26, 2025 08:10