Skip to content

Conversation

XYenChi
Copy link
Owner

@XYenChi XYenChi commented Aug 14, 2023

No description provided.

@XYenChi XYenChi merged commit 703bd3b into master Aug 14, 2023
@XYenChi XYenChi deleted the rvv0.7.1 branch October 7, 2023 04:07
XYenChi pushed a commit that referenced this pull request Nov 1, 2023
This patch is my proposed solution to PR rtl-optimization/91865.
Normally RTX simplification canonicalizes a ZERO_EXTEND of a ZERO_EXTEND
to a single ZERO_EXTEND, but as shown in this PR it is possible for
combine's make_compound_operation to unintentionally generate a
non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is unlikely to be
matched by the backend.

For the new test case:

const int table[2] = {1, 2};
int foo (char i) { return table[i]; }

compiling with -O2 -mlarge on msp430 we currently see:

Trying 2 -> 7:
    2: r25:HI=zero_extend(R12:QI)
      REG_DEAD R12:QI
    7: r28:PSI=sign_extend(r25:HI)#0
      REG_DEAD r25:HI
Failed to match this instruction:
(set (reg:PSI 28 [ iD.1772 ])
    (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ]))))

which results in the following code:

foo:	AND     #0xff, R12
        RLAM.A #4, R12 { RRAM.A #4, R12
        RLAM.A  #1, R12
        MOVX.W  table(R12), R12
        RETA

With this patch, we now see:

Trying 2 -> 7:
    2: r25:HI=zero_extend(R12:QI)
      REG_DEAD R12:QI
    7: r28:PSI=sign_extend(r25:HI)#0
      REG_DEAD r25:HI
Successfully matched this instruction:
(set (reg:PSI 28 [ iD.1772 ])
    (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ])))
allowing combination of insns 2 and 7
original costs 4 + 8 = 12
replacement cost 8

foo:	MOV.B   R12, R12
        RLAM.A  #1, R12
        MOVX.W  table(R12), R12
        RETA

2023-10-26  Roger Sayle  <[email protected]>
	    Richard Biener  <[email protected]>

gcc/ChangeLog
	PR rtl-optimization/91865
	* combine.cc (make_compound_operation): Avoid creating a
	ZERO_EXTEND of a ZERO_EXTEND.

gcc/testsuite/ChangeLog
	PR rtl-optimization/91865
	* gcc.target/msp430/pr91865.c: New test case.
XYenChi pushed a commit that referenced this pull request Nov 7, 2023
This patch is my proposed solution to PR rtl-optimization/91865.
Normally RTX simplification canonicalizes a ZERO_EXTEND of a ZERO_EXTEND
to a single ZERO_EXTEND, but as shown in this PR it is possible for
combine's make_compound_operation to unintentionally generate a
non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is unlikely to be
matched by the backend.

For the new test case:

const int table[2] = {1, 2};
int foo (char i) { return table[i]; }

compiling with -O2 -mlarge on msp430 we currently see:

Trying 2 -> 7:
    2: r25:HI=zero_extend(R12:QI)
      REG_DEAD R12:QI
    7: r28:PSI=sign_extend(r25:HI)#0
      REG_DEAD r25:HI
Failed to match this instruction:
(set (reg:PSI 28 [ iD.1772 ])
    (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ]))))

which results in the following code:

foo:	AND     #0xff, R12
        RLAM.A #4, R12 { RRAM.A #4, R12
        RLAM.A  #1, R12
        MOVX.W  table(R12), R12
        RETA

With this patch, we now see:

Trying 2 -> 7:
    2: r25:HI=zero_extend(R12:QI)
      REG_DEAD R12:QI
    7: r28:PSI=sign_extend(r25:HI)#0
      REG_DEAD r25:HI
Successfully matched this instruction:
(set (reg:PSI 28 [ iD.1772 ])
    (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ])))
allowing combination of insns 2 and 7
original costs 4 + 8 = 12
replacement cost 8

foo:	MOV.B   R12, R12
        RLAM.A  #1, R12
        MOVX.W  table(R12), R12
        RETA

2023-10-26  Roger Sayle  <[email protected]>
	    Richard Biener  <[email protected]>

gcc/ChangeLog
	PR rtl-optimization/91865
	* combine.cc (make_compound_operation): Avoid creating a
	ZERO_EXTEND of a ZERO_EXTEND.

gcc/testsuite/ChangeLog
	PR rtl-optimization/91865
	* gcc.target/msp430/pr91865.c: New test case.
XYenChi pushed a commit that referenced this pull request Nov 10, 2023
Hi all,

We noticed that calls to the vadcq and vsbcq intrinsics, both of
which use __builtin_arm_set_fpscr_nzcvqc to set the Carry flag in
the FPSCR, would produce the following code:

```
< r2 is the *carry input >
vmrs	r3, FPSCR_nzcvqc
bic	r3, r3, #536870912
orr	r3, r3, r2, lsl gcc-mirror#29
vmsr	FPSCR_nzcvqc, r3
```

when the MVE ACLE instead gives a different instruction sequence of:
```
< Rt is the *carry input >
VMRS Rs,FPSCR_nzcvqc
BFI Rs,Rt,gcc-mirror#29,#1
VMSR FPSCR_nzcvqc,Rs
```

the bic + orr pair is slower and it's also wrong, because, if the
*carry input is greater than 1, then we risk overwriting the top two
bits of the FPSCR register (the N and Z flags).

This turned out to be a problem in the header file and the solution was
to simply add a `& 1x0u` to the `*carry` input: then the compiler knows
that we only care about the lowest bit and can optimise to a BFI.

Ok for trunk?

Thanks,
Stam Markianos-Wright

gcc/ChangeLog:

	* config/arm/arm_mve.h (__arm_vadcq_s32): Fix arithmetic.
	(__arm_vadcq_u32): Likewise.
	(__arm_vadcq_m_s32): Likewise.
	(__arm_vadcq_m_u32): Likewise.
	(__arm_vsbcq_s32): Likewise.
	(__arm_vsbcq_u32): Likewise.
	(__arm_vsbcq_m_s32): Likewise.
	(__arm_vsbcq_m_u32): Likewise.
	* config/arm/mve.md (get_fpscr_nzcvqc): Make unspec_volatile.

gcc/testsuite/ChangeLog:
	* gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c: New.
XYenChi pushed a commit that referenced this pull request Mar 6, 2024
This patch fixes the second half of 64679.  Here we issue a wrong
"redefinition of 'int x'" for the following:

  struct Bar {
    Bar(int, int, int);
  };

  int x = 1;
  Bar bar(int(x), int(x), int{x}); // #1

cp_parser_parameter_declaration_list does pushdecl every time it sees
a named parameter, so the second "int(x)" causes the error.  That's
premature, since this turns out to be a constructor call after the
third argument!

If the first parameter is parenthesized, we can't push until we've
established we're looking at a function declaration.  Therefore this
could be fixed by some kind of lookahead.  I thought about introducing a
lightweight variant of cp_parser_parameter_declaration_list that would
not have any side effects and would return as soon as it figures out
whether it's looking at a declaration or expression.  Since that would
require fairly nontrivial changes, I wanted something simpler.

Something like delaying the pushdecl until we've reached the ')'
following the parameter-declaration-clause.  But we must push the
parameters before processing a default argument, as in:

  Bar bar(int(a), int(b), int c = sizeof(a));  // valid

Moreover, this code should still be accepted

  Bar f(int(i), decltype(i) j = 42);

so this patch stashes parameters into a vector when parsing tentatively
only when pushdecl-ing a parameter would result in a clash and an error
about redefinition/redeclaration.  The stashed parameters are pushed at
the end of a parameter-declaration-clause if it's followed by a ')', so
that we still diagnose redefining a parameter.

	PR c++/64679

gcc/cp/ChangeLog:

	* parser.cc (cp_parser_parameter_declaration_clause): Maintain
	a vector of parameters that haven't been pushed yet.  Push them at the
	end of a valid parameter-declaration-clause.
	(cp_parser_parameter_declaration_list): Take a new auto_vec parameter.
	Do not pushdecl while parsing tentatively when pushdecl-ing a parameter
	would result in a hard error.
	(cp_parser_cache_defarg): Adjust the call to
	cp_parser_parameter_declaration_list.

gcc/testsuite/ChangeLog:

	* g++.dg/parse/ambig11.C: New test.
	* g++.dg/parse/ambig12.C: New test.
	* g++.dg/parse/ambig13.C: New test.
	* g++.dg/parse/ambig14.C: New test.
XYenChi pushed a commit that referenced this pull request Mar 6, 2024
This recognises the patterns of the form:
  while (n & 1) { n >>= 1 }

Unfortunately there are currently two issues relating to this patch.

Firstly, simplify_using_initial_conditions does not recognise that
	(n != 0) and ((n & 1) == 0) implies that ((n >> 1) != 0).

This preconditions arise following the loop copy-header pass, and the
assumptions returned by number_of_iterations_exit_assumptions then
prevent final value replacement from using the niter result.

I'm not sure what is the best way to fix this - one approach could be to
modify simplify_using_initial_conditions to handle this sort of case,
but it seems that it basically wants the information that ranger could
give anway, so would something like that be a better option?

The second issue arises in the vectoriser, which is able to determine
that the niter->assumptions are always true.
When building with -march=armv8.4-a+sve -S -O3, we get this codegen:

foo (unsigned int b) {
    int c = 0;

    if (b == 0)
      return PREC;

    while (!(b & (1 << (PREC - 1)))) {
        b <<= 1;
        c++;
    }

    return c;
}

foo:
.LFB0:
        .cfi_startproc
        cmp     w0, 0
        cbz     w0, .L6
        blt     .L7
        lsl     w1, w0, 1
        clz     w2, w1
        cmp     w2, 14
        bls     .L8
        mov     x0, 0
        cntw    x3
        add     w1, w2, 1
        index   z1.s, #0, #1
        whilelo p0.s, wzr, w1
.L4:
        add     x0, x0, x3
        mov     p1.b, p0.b
        mov     z0.d, z1.d
        whilelo p0.s, w0, w1
        incw    z1.s
        b.any   .L4
        add     z0.s, z0.s, #1
        lastb   w0, p1, z0.s
        ret
        .p2align 2,,3
.L8:
        mov     w0, 0
        b       .L3
        .p2align 2,,3
.L13:
        lsl     w1, w1, 1
.L3:
        add     w0, w0, 1
        tbz     w1, gcc-mirror#31, .L13
        ret
        .p2align 2,,3
.L6:
        mov     w0, 32
        ret
        .p2align 2,,3
.L7:
        mov     w0, 0
        ret
        .cfi_endproc

In essence, the vectoriser uses the niter information to determine
exactly how many iterations of the loop it needs to run. It then uses
SVE whilelo instructions to run this number of iterations. The original
loop counter is also vectorised, despite only being used in the final
iteration, and then the final value of this counter is used as the
return value (which is the same as the number of iterations it computed
in the first place).

This vectorisation is obviously bad, and I think it exposes a latent
bug in the vectoriser, rather than being an issue caused by this
specific patch.

gcc/ChangeLog:

	* tree-ssa-loop-niter.cc (number_of_iterations_cltz): New.
	(number_of_iterations_bitcount): Add call to the above.
	(number_of_iterations_exit_assumptions): Add EQ_EXPR case for
	c[lt]z idiom recognition.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/cltz-max.c: New test.
	* gcc.dg/tree-ssa/clz-char.c: New test.
	* gcc.dg/tree-ssa/clz-int.c: New test.
	* gcc.dg/tree-ssa/clz-long-long.c: New test.
	* gcc.dg/tree-ssa/clz-long.c: New test.
	* gcc.dg/tree-ssa/ctz-char.c: New test.
	* gcc.dg/tree-ssa/ctz-int.c: New test.
	* gcc.dg/tree-ssa/ctz-long-long.c: New test.
	* gcc.dg/tree-ssa/ctz-long.c: New test.
XYenChi pushed a commit that referenced this pull request Mar 8, 2024
Hi All,

This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab,
this includes both fully masked and non-masked targets.

Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND.  This is however only checked
then the comparison would produce multiple statements.

This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit.  Before this patch the vectorizer always picked the
natural loop latch connected exit as the main exit.

After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit.  This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.

In such situations the loop is reflowed which enabled vectorization of many
other loop forms.

Concretely the kind of loops supported are of the forms:

 for (int i = 0; i < N; i++)
 {
   <statements1>
   if (<condition>)
     {
       ...
       <action>;
     }
   <statements2>
 }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.

Since this is an initial version for GCC 14 it has the following limitations and
features:

- Only fixed sized iterations and buffers are supported.  That is to say any
  vectors loaded or stored must be to statically allocated arrays with known
  sizes. N must also be known.  This limitation is because our primary target
  for this optimization is SVE.  For VLA SVE we can't easily do cross page
  iteraion checks. The result is likely to also not be beneficial. For that
  reason we punt support for variable buffers till we have First-Faulting
  support in GCC 15.
- any stores in <statements1> should not be to the same objects as in
  <condition>.  Loads are fine as long as they don't have the possibility to
  alias.  More concretely, we block RAW dependencies when the intermediate value
  can't be separated fromt the store, or the store itself can't be moved.
- Prologue peeling, alignment peelinig and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported
- Any number of loop early exits are supported.
- No support for epilogue vectorization.  The only epilogue supported is the
  scalar final one.  Peeling code supports it but the code motion code cannot
  find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.

With the help of IPA and LTO this still gets hit quite often.  During bootstrap
it hit rather frequently.  Additionally TSVC s332, s481 and s482 all pass now
since these are tests for support for early exit vectorization.

This implementation does not support completely handling the early break inside
the vector loop itself but instead supports adding checks such that if we know
that we have to exit in the current iteration then we branch to scalar code to
actually do the final VF iterations which handles all the code in <action>.

For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations.  For vector code we only case about the state of fully
performed iteration and reset the scalar code to the (partially) remaining loop.

That is to say, the first vector loop executes so long as the early exit isn't
needed.  Once the exit is taken, the scalar code will perform at most VF extra
iterations.  The exact number depending on peeling and iteration start and which
exit was taken (natural or early).   For this scalar loop, all early exits are
treated the same.

When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break.  If this is not possible we decline to vectorize.  The
analysis and code motion also takes into account that it doesn't introduce a RAW
dependency after the move of the stores.

This means that we check at the start of iterations whether we are going to exit
or not.  During the analyis phase we check whether we are allowed to do this
moving of statements.  Also note that we only move the scalar statements, but
only do so after peeling but just before we start transforming statements.

With this the vector flow no longer necessarily needs to match that of the
scalar code.  In addition most of the infrastructure is in place to support
general control flow safely, however we are punting this to GCC 15.

Codegen:

for e.g.

unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;

 }
 return ret;
}

We generate for Adv. SIMD:

test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2

and for SVE:

test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, #1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, #1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret

On the workloads this work is based on we see between 2-3x performance uplift
using this patch.

Follow up plan:
 - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
   bigger ones that cause vectorization to fail with this patch.
 - SLP support.  This is planned for GCC 15 as for majority of the cases build
   SLP itself fails.  This means I'll need to spend time in making this more
   robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling
   fallouts from this patch for a while.  So I plan to work on the surrounding
   building blocks first for the remainder of the year.

Additionally it also contains reduced cases from issues found running over
various codebases.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Also regtested with:
 -march=armv8.3-a+sve
 -march=armv8.3-a+nosve
 -march=armv9-a
 -mcpu=neoverse-v1
 -mcpu=neoverse-n2

Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.

gcc/ChangeLog:

	* tree-if-conv.cc (idx_within_array_bound): Expose.
	* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
	(vect_analyze_data_ref_dependences): Use it.
	* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
	(vect_set_loop_controls_directly,
	vect_set_loop_condition_partial_vectors,
	vect_set_loop_condition_partial_vectors_avx512,
	vect_set_loop_condition_normal): Support multiple exits.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSAA peeling for
	multiple exits.
	(slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB
	count and instead look at loop shape.
	(vect_update_ivs_after_vectorizer): Drop asserts.
	(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
	(vect_do_peeling): Support multiple exits.
	(vect_loop_versioning): Likewise.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
	early_breaks.
	(vect_analyze_loop_form): Support loop flows with more than single BB
	loop body.
	(vect_create_loop_vinfo): Support niters analysis for multiple exits.
	(vect_analyze_loop): Likewise.
	(vect_get_vect_def): New.
	(vect_create_epilog_for_reduction): Support early exit reductions.
	(vectorizable_live_operation_1): New.
	(find_connected_edge): New.
	(vectorizable_live_operation): Support early exit live operations.
	(move_early_exit_stmts): New.
	(vect_transform_loop): Use it.
	* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
	(vect_recog_bitfield_ref_pattern): Support gconds and bools.
	(vect_recog_gcond_pattern): New.
	(possible_vector_mask_operation_p): Support gcond masks.
	(vect_determine_mask_precision): Likewise.
	(vect_mark_pattern_stmts): Set gcond def type.
	(can_vectorize_live_stmts): Force early break inductions to be live.
	* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
	early breaks.
	(vect_mark_stmts_to_be_vectorized): Process gcond usage.
	(perm_mask_for_reverse): Expose.
	(vectorizable_comparison_1): New.
	(vectorizable_early_exit): New.
	(vect_analyze_stmt): Support early break and gcond.
	(vect_transform_stmt): Likewise.
	(vect_is_simple_use): Likewise.
	(vect_get_vector_types_for_stmt): Likewise.
	* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
	numbering.
	* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
	LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
	LOOP_VINFO_EARLY_BRK_VUSES): New.
	(is_loop_header_bb_p): Drop assert.
	(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
	early_break_vuses.
	(vect_iv_increment_position, perm_mask_for_reverse,
	ref_within_array_bound): New.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
XYenChi pushed a commit that referenced this pull request Mar 8, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2 and smaller, but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input #1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <[email protected]>
XYenChi pushed a commit that referenced this pull request Mar 25, 2024
This patch adjusts the costs so that we treat REG and SUBREG expressions the
same for costing.

This was motivated by bt_skip_func and bt_find_func in xz and results in nearly
a 5% improvement in the dynamic instruction count for input #2 and smaller, but
definitely visible improvements pretty much across the board.  Exceptions would
be perlbench input #1 and exchange2 which showed very small regressions.

In the bt_find_func and bt_skip_func cases we have  something like this:

> (insn 10 7 11 2 (set (reg/v:DI 136 [ x ])
>         (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))) "zz.c":6:21 387 {*zero_extendsidi2_bitmanip}
>      (nil))
> (insn 11 10 12 2 (set (reg:DI 142 [ _1 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 139 [ b ]))) "zz.c":7:23 5 {adddi3}
>      (nil))

[ ... ]> (insn 13 12 14 2 (set (reg:DI 143 [ _2 ])
>         (plus:DI (reg/v:DI 136 [ x ])
>             (reg/v:DI 141 [ c ]))) "zz.c":8:23 5 {adddi3}
>      (nil))

Note the two uses of (reg 136). The best way to handle that in combine might be
a 3->2 split.  But there's a much better approach if we look at fwprop...

(set (reg:DI 142 [ _1 ])
    (plus:DI (zero_extend:DI (subreg/s/u:SI (reg/v:DI 137 [ a ]) 0))
        (reg/v:DI 139 [ b ])))
change not profitable (cost 4 -> cost 8)

So that should be the same cost as a regular DImode addition when the ZBA
extension is enabled.  But it ends up costing more because the clause to cost
this variant isn't prepared to handle a SUBREG.  That results in the RTL above
having too high a cost and fwprop gives up.

One approach would be to replace the REG_P  with REG_P || SUBREG_P in the
costing code.  I ultimately decided against that and instead check if the
operand in question passes register_operand.

By far the most important case to handle is the DImode PLUS.  But for the sake
of consistency, I changed the other instances in riscv_rtx_costs as well.  For
those other cases we're talking about improvements in the .000001% range.

While we are into stage4, this just hits cost modeling which we've generally
agreed is still appropriate (though we were mostly talking about vector).  So
I'm going to extend that general agreement ever so slightly and include scalar
cost modeling :-)

gcc/
	* config/riscv/riscv.cc (riscv_rtx_costs): Handle SUBREG and REG
	similarly.

gcc/testsuite/

	* gcc.target/riscv/reg_subreg_costs.c: New test.

	Co-authored-by: Jivan Hakobyan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant