Skip to content

Commit 1dd7a09

Browse files
[llvm] Proofread BigEndianNEON.rst (llvm#156141)
1 parent d2fda70 commit 1dd7a09

File tree

1 file changed

+20
-20
lines changed

1 file changed

+20
-20
lines changed

llvm/docs/BigEndianNEON.rst

Lines changed: 20 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
==============================================
2-
Using ARM NEON instructions in big endian mode
2+
Using ARM NEON instructions in big-endian mode
33
==============================================
44

55
.. contents::
@@ -8,16 +8,16 @@ Using ARM NEON instructions in big endian mode
88
Introduction
99
============
1010

11-
Generating code for big endian ARM processors is for the most part straightforward. NEON loads and stores however have some interesting properties that make code generation decisions less obvious in big endian mode.
11+
Generating code for big-endian ARM processors is straightforward for the most part. NEON loads and stores, however, have some interesting properties that make code generation decisions less obvious in big-endian mode.
1212

1313
The aim of this document is to explain the problem with NEON loads and stores, and the solution that has been implemented in LLVM.
1414

15-
In this document the term "vector" refers to what the ARM ABI calls a "short vector", which is a sequence of items that can fit in a NEON register. This sequence can be 64 or 128 bits in length, and can constitute 8, 16, 32 or 64 bit items. This document refers to A64 instructions throughout, but is almost applicable to the A32/ARMv7 instruction sets also. The ABI format for passing vectors in A32 is slightly different to A64. Apart from that, the same concepts apply.
15+
In this document, the term "vector" refers to what the ARM ABI calls a "short vector", which is a sequence of items that can fit in a NEON register. This sequence can be 64 or 128 bits in length, and can constitute 8, 16, 32 or 64 bit items. This document refers to A64 instructions throughout, but is almost applicable to the A32/ARMv7 instruction sets also. The ABI format for passing vectors in A32 is slightly different to A64. Apart from that, the same concepts apply.
1616

1717
Example: C-level intrinsics -> assembly
1818
---------------------------------------
1919

20-
It may be helpful first to illustrate how C-level ARM NEON intrinsics are lowered to instructions.
20+
It may be helpful to first illustrate how C-level ARM NEON intrinsics are lowered to instructions.
2121

2222
This trivial C function takes a vector of four ints and sets the zero'th lane to the value "42"::
2323

@@ -26,7 +26,7 @@ This trivial C function takes a vector of four ints and sets the zero'th lane to
2626
return vsetq_lane_s32(42, p, 0);
2727
}
2828

29-
arm_neon.h intrinsics generate "generic" IR where possible (that is, normal IR instructions not ``llvm.arm.neon.*`` intrinsic calls). The above generates::
29+
``arm_neon.h`` intrinsics generate "generic" IR where possible (that is, normal IR instructions, not ``llvm.arm.neon.*`` intrinsic calls). The above generates::
3030

3131
define <4 x i32> @f(<4 x i32> %p) {
3232
%vset_lane = insertelement <4 x i32> %p, i32 42, i32 0
@@ -45,7 +45,7 @@ Problem
4545

4646
The main problem is how vectors are represented in memory and in registers.
4747

48-
First, a recap. The "endianness" of an item affects its representation in memory only. In a register, a number is just a sequence of bits - 64 bits in the case of AArch64 general purpose registers. Memory, however, is a sequence of addressable units of 8 bits in size. Any number greater than 8 bits must therefore be split up into 8-bit chunks, and endianness describes the order in which these chunks are laid out in memory.
48+
First, a recap. The "endianness" of an item affects its representation in memory only. In a register, a number is just a sequence of bits - 64 bits in the case of AArch64 general-purpose registers. Memory, however, is a sequence of addressable units of 8 bits in size. Any number greater than 8 bits must therefore be split up into 8-bit chunks, and endianness describes the order in which these chunks are laid out in memory.
4949

5050
A "little endian" layout has the least significant byte first (lowest in memory address). A "big endian" layout has the *most* significant byte first. This means that when loading an item from big endian memory, the lowest 8-bits in memory must go in the most significant 8-bits, and so forth.
5151

@@ -58,30 +58,30 @@ A "little endian" layout has the least significant byte first (lowest in memory
5858
Big endian vector load using ``LDR``.
5959

6060

61-
A vector is a consecutive sequence of items that are operated on simultaneously. To load a 64-bit vector, 64 bits need to be read from memory. In little endian mode, we can do this by just performing a 64-bit load - ``LDR q0, [foo]``. However if we try this in big endian mode, because of the byte swapping the lane indices end up being swapped! The zero'th item as laid out in memory becomes the n'th lane in the vector.
61+
A vector is a consecutive sequence of items that are operated on simultaneously. To load a 64-bit vector, 64 bits need to be read from memory. In little-endian mode, we can do this by just performing a 64-bit load - ``LDR q0, [foo]``. However, if we try this in big-endian mode, because of the byte swapping the lane indices end up being swapped! The zero'th item as laid out in memory becomes the n'th lane in the vector.
6262

6363
.. figure:: ARM-BE-ld1.png
6464
:align: right
6565

6666
Big endian vector load using ``LD1``. Note that the lanes retain the correct ordering.
6767

6868

69-
Because of this, the instruction ``LD1`` performs a vector load but performs byte swapping not on the entire 64 bits, but on the individual items within the vector. This means that the register content is the same as it would have been on a little endian system.
69+
Because of this, the ``LD1`` instruction performs a vector load but performs byte swapping not on the entire 64 bits, but on the individual items within the vector. This means that the register content is the same as it would have been on a little-endian system.
7070

71-
It may seem that ``LD1`` should suffice to perform vector loads on a big endian machine. However there are pros and cons to the two approaches that make it less than simple which register format to pick.
71+
It may seem that ``LD1`` should suffice to perform vector loads on a big-endian machine. However, there are pros and cons to the two approaches that make it less than simple which register format to pick.
7272

7373
There are two options:
7474

7575
1. The content of a vector register is the same *as if* it had been loaded with an ``LDR`` instruction.
7676
2. The content of a vector register is the same *as if* it had been loaded with an ``LD1`` instruction.
7777

78-
Because ``LD1 == LDR + REV`` and similarly ``LDR == LD1 + REV`` (on a big endian system), we can simulate either type of load with the other type of load plus a ``REV`` instruction. So we're not deciding which instructions to use, but which format to use (which will then influence which instruction is best to use).
78+
Because ``LD1 == LDR + REV`` and similarly ``LDR == LD1 + REV`` (on a big-endian system), we can simulate either type of load with the other type of load plus a ``REV`` instruction. So we're not deciding which instructions to use, but which format to use (which will then influence which instruction is best to use).
7979

8080
.. The 'clearer' container is required to make the following section header come after the floated
8181
images above.
8282
.. container:: clearer
8383

84-
Note that throughout this section we only mention loads. Stores have exactly the same problems as their associated loads, so have been skipped for brevity.
84+
Note that throughout this section, we only mention loads. Stores have exactly the same problems as their associated loads, so have been skipped for brevity.
8585

8686

8787
Considerations
@@ -90,7 +90,7 @@ Considerations
9090
LLVM IR Lane ordering
9191
---------------------
9292

93-
LLVM IR has first class vector types. In LLVM IR, the zero'th element of a vector resides at the lowest memory address. The optimizer relies on this property in certain areas, for example when concatenating vectors together. The intention is for arrays and vectors to have identical memory layouts - ``[4 x i8]`` and ``<4 x i8>`` should be represented the same in memory. Without this property there would be many special cases that the optimizer would have to cleverly handle.
93+
LLVM IR has first class vector types. In LLVM IR, the zero'th element of a vector resides at the lowest memory address. The optimizer relies on this property in certain areas, for example, when concatenating vectors together. The intention is for arrays and vectors to have identical memory layouts - ``[4 x i8]`` and ``<4 x i8>`` should be represented the same in memory. Without this property, there would be many special cases that the optimizer would have to cleverly handle.
9494

9595
Use of ``LDR`` would break this lane ordering property. This doesn't preclude the use of ``LDR``, but we would have to do one of two things:
9696

@@ -102,11 +102,11 @@ AAPCS
102102

103103
The ARM procedure call standard (AAPCS) defines the ABI for passing vectors between functions in registers. It states:
104104

105-
When a short vector is transferred between registers and memory it is treated as an opaque object. That is a short vector is stored in memory as if it were stored with a single ``STR`` of the entire register; a short vector is loaded from memory using the corresponding ``LDR`` instruction. On a little-endian system this means that element 0 will always contain the lowest addressed element of a short vector; on a big-endian system element 0 will contain the highest-addressed element of a short vector.
105+
When a short vector is transferred between registers and memory, it is treated as an opaque object. That is a short vector is stored in memory as if it were stored with a single ``STR`` of the entire register; a short vector is loaded from memory using the corresponding ``LDR`` instruction. On a little-endian system, this means that element 0 will always contain the lowest addressed element of a short vector; on a big-endian system element 0 will contain the highest-addressed element of a short vector.
106106

107107
-- Procedure Call Standard for the ARM 64-bit Architecture (AArch64), 4.1.2 Short Vectors
108108

109-
The use of ``LDR`` and ``STR`` as the ABI defines has at least one advantage over ``LD1`` and ``ST1``. ``LDR`` and ``STR`` are oblivious to the size of the individual lanes of a vector. ``LD1`` and ``ST1`` are not - the lane size is encoded within them. This is important across an ABI boundary, because it would become necessary to know the lane width the callee expects. Consider the following code:
109+
The use of ``LDR`` and ``STR`` as the ABI defines has at least one advantage over ``LD1`` and ``ST1``. ``LDR`` and ``STR`` are oblivious to the size of the individual lanes of a vector. ``LD1`` and ``ST1`` are not - the lane size is encoded within them. This is important across an ABI boundary because it would become necessary to know the lane width the callee expects. Consider the following code:
110110

111111
.. code-block:: c
112112
@@ -132,7 +132,7 @@ Alignment
132132

133133
In strict alignment mode, ``LDR qX`` requires its address to be 128-bit aligned, whereas ``LD1`` only requires it to be as aligned as the lane size. If we canonicalised on using ``LDR``, we'd still need to use ``LD1`` in some places to avoid alignment faults (the result of the ``LD1`` would then need to be reversed with ``REV``).
134134

135-
Most operating systems however do not run with alignment faults enabled, so this is often not an issue.
135+
Most operating systems, however, do not run with alignment faults enabled, so this is often not an issue.
136136

137137
Summary
138138
-------
@@ -156,7 +156,7 @@ Implementation
156156

157157
There are 3 parts to the implementation:
158158

159-
1. Predicate ``LDR`` and ``STR`` instructions so that they are never allowed to be selected to generate vector loads and stores. The exception is one-lane vectors [1]_ - these by definition cannot have lane ordering problems so are fine to use ``LDR``/``STR``.
159+
1. Predicate ``LDR`` and ``STR`` instructions so that they are never allowed to be selected to generate vector loads and stores. The exception is one-lane vectors [1]_; by definition, these cannot have lane ordering problems so are fine to use ``LDR``/``STR``.
160160

161161
2. Create code generation patterns for bitconverts that create ``REV`` instructions.
162162

@@ -168,9 +168,9 @@ Bitconverts
168168
.. image:: ARM-BE-bitcastfail.png
169169
:align: right
170170

171-
The main problem with the ``LD1`` solution is dealing with bitconverts (or bitcasts, or reinterpret casts). These are pseudo instructions that only change the compiler's interpretation of data, not the underlying data itself. A requirement is that if data is loaded and then saved again (called a "round trip"), the memory contents should be the same after the store as before the load. If a vector is loaded and is then bitconverted to a different vector type before storing, the round trip will currently be broken.
171+
The main problem with the ``LD1`` solution is dealing with bitconverts (or bitcasts, or reinterpret casts). These are pseudo instructions that only change the compiler's interpretation of data, not the underlying data itself. A requirement is that if data is loaded and then saved again (called a "round trip"), the memory contents should be the same after the store as before the load. If a vector is loaded and then bitconverted to a different vector type before being stored, the round trip will currently be broken.
172172

173-
Take for example this code sequence::
173+
Take this code sequence, for example::
174174

175175
%0 = load <4 x i32> %x
176176
%1 = bitcast <4 x i32> %0 to <2 x i64>
@@ -185,7 +185,7 @@ This would produce a code sequence such as that in the figure on the right. The
185185
.. image:: ARM-BE-bitcastsuccess.png
186186
:align: right
187187

188-
Conceptually this is simple - we can insert a ``REV`` undoing the ``LD1`` of type ``X`` (converting the in-register representation to the same as if it had been loaded by ``LDR``) and then insert another ``REV`` to change the representation to be as if it had been loaded by an ``LD1`` of type ``Y``.
188+
Conceptually, this is simple - we can insert a ``REV`` undoing the ``LD1`` of type ``X`` (converting the in-register representation to the same as if it had been loaded by ``LDR``) and then insert another ``REV`` to change the representation to be as if it had been loaded by an ``LD1`` of type ``Y``.
189189

190190
For the previous example, this would be::
191191

@@ -201,4 +201,4 @@ For the previous example, this would be::
201201

202202
It turns out that these ``REV`` pairs can, in almost all cases, be squashed together into a single ``REV``. For the example above, a ``REV128 4s`` + ``REV128 2d`` is actually a ``REV64 4s``, as shown in the figure on the right.
203203

204-
.. [1] One lane vectors may seem useless as a concept but they serve to distinguish between values held in general purpose registers and values held in NEON/VFP registers. For example, an ``i64`` would live in an ``x`` register, but ``<1 x i64>`` would live in a ``d`` register.
204+
.. [1] One-lane vectors may seem useless as a concept, but they serve to distinguish between values held in general-purpose registers and values held in NEON/VFP registers. For example, an ``i64`` would live in an ``x`` register, but ``<1 x i64>`` would live in a ``d`` register.

0 commit comments

Comments
 (0)