@lrzlin lrzlin commented Apr 25, 2025

This patch adds LoongArch BCJ filter support. It has been tested on real Loongson hardware and on QEMU.

It mostly follows the arm64 filter and converts BL/PCALAU12I only; other BCJ-style instructions were also tested, but they had no positive effect on redundancy. PCALAU12I's range is also limited to +/-512MB to reduce false positives.
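For reference, the BL conversion can be sketched roughly as below. This is a standalone sketch, not the patch itself, and the helper names are made up; it assumes the standard LoongArch I26 layout where BL carries a 26-bit signed word offset with imm[15:0] in bits 25:10 and imm[25:16] in bits 9:0.

```c
#include <stdint.h>

// Hypothetical helpers, not the actual patch code. BL stores a
// 26-bit word offset split across the instruction:
// imm[15:0] in bits 25:10, imm[25:16] in bits 9:0.
static uint32_t bl_get_offset(uint32_t instr)
{
	return ((instr >> 10) & 0xFFFF) | ((instr & 0x3FF) << 16);
}

static uint32_t bl_set_offset(uint32_t instr, uint32_t off)
{
	off &= 0x03FFFFFF;
	return (instr & 0xFC000000) | ((off & 0xFFFF) << 10) | (off >> 16);
}

// Convert between a PC-relative and an absolute word address so
// that calls to the same target encode to identical bytes.
// pc is the byte position of the BL instruction in the stream.
static uint32_t bl_convert(uint32_t instr, uint32_t pc, int is_encoder)
{
	const uint32_t off = bl_get_offset(instr);
	const uint32_t addr = is_encoder ? off + (pc >> 2) : off - (pc >> 2);
	return bl_set_offset(instr, addr);
}
```

The decoder applies the same function with the subtraction, so the transform round-trips for any offset (modulo the 26-bit field width).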

However, when testing the BCJ filter on binaries with debug info, the compression ratio was significantly worse than with LZMA2 alone, mostly because of many false positives on BL instructions. So I added an if-statement to the BL part: only if the previous instruction of a BL is within a selected range of instructions do we convert its immediate to an absolute address; otherwise we just ignore it. These instructions were derived from analyzing LoongArch libraries and executables.

This optimization largely improves the compression ratio when dealing with these kinds of libraries and executables. But it relies on global variables, and I'm not sure whether that is reliable in a multithreaded environment. So far, a one-hour test on a Loongson 3C6000 (16C/32T) shows no data corruption. It's still a very naive attempt, so any feedback would be much appreciated.

Also, I used 0x0C as this filter's ID, but that's just a temporary choice for testing the filter.

Related: loongson-community/discussions#78


Larhzu commented Apr 25, 2025

Here are some quick early comments.

You can use the simple pointer to safely store data between calls. See the x86 filter for an example.

If you only care about the previous four bytes, set unfiltered_max to 8 instead of 4 when calling lzma_simple_coder_init. Then look forward four bytes further, that is, instead of

uint32_t instr = read32le(buffer + i);

do this:

uint32_t instr = read32le(buffer + i + 4);

Then you have the previous instruction in buffer + i. Maybe you won't need to save any state then. At least this avoids the out-of-bounds access that currently can happen here when i == 0:

uint32_t prev_instr = read32le(buffer + i - 4);

False positives can be a problem with other filters too, but your description sounds like there is a common pattern in debug sections that causes too much trouble. A higher-level smart filter, which applies these simple filters only to code sections, was planned over 15 years ago, but it still hasn't been implemented. (Compare to 7-Zip, which has done fancy things for over two decades in the .7z format. It slightly helps that 7-Zip does archiving too, so it can seek in the input files and even sort them.) So, if false positives are a big problem now, then the filter perhaps has to handle them to some extent at least like you have done now.

The current list of previous instructions to match is quite long in the sense that the if-statement has many conditions. This is good in the beginning, but hopefully it can be simplified to a shorter "good enough" condition. The development versions of the RISC-V filter had a few too-specific conditions too that were simplified for the final version.

I recommend avoiding official Filter ID ranges until the filter is final. That way we avoid accidents where there can be files that have an official ID but cannot be decoded by stable tools. Use a random 40-bit personal ID followed by a 16-bit filter number, for example, 0x47BEC794C6xxxx where xxxx can be whatever you like. :-)

I'm not able to promise quick progress with this. Getting it merged this year and included in xz 5.10.0 early next year could be nice. I don't promise anything, things can happen faster or slower.

I suppose you tried tweaking LZMA2 options too. Since the instructions are 32 bits, I guess pb=2,lp=2,lc=2 should be the best like with ARM64. If you haven't tested those yet, it's worth doing.

Thanks!


lrzlin commented Apr 27, 2025

Thanks for your quick reply!

You can use the simple pointer to safely store data between calls. See the x86 filter for an example.

I tried setting unfiltered_max to 8 and looking forward four bytes further; unfortunately, it didn't work well and caused regressions and even data corruption, so I chose the simple pointer, which solved the out-of-bounds access and multithreading problems perfectly. I also deleted prev_pos because we don't need it anymore.

The current list of previous instructions to match is quite long in the sense that the if-statement has many conditions. This is good in the beginning, but hopefully it can be simplified to a shorter "good enough" condition. The development versions of the RISC-V filter had a few too-specific conditions too that were simplified for the final version.

I tried to get rid of some conditions, but the current previous-instruction list seems to show the best balance between code sections & debug sections; it even improves the compression ratio for some stripped .a files, for example libQt6QmlLS.a:

Compress without BCJ filter

libQt6QmlLS.a (1/1)
  100 %        86.9 MiB / 125.1 MiB = 0.694    13 MiB/s       0:09

Compress with the original BCJ filter

libQt6QmlLS.a (1/1)
  100 %        87.5 MiB / 125.1 MiB = 0.699    12 MiB/s       0:10

Compress with improved BCJ filter

libQt6QmlLS.a (1/1)
  100 %        87.0 MiB / 125.1 MiB = 0.696    13 MiB/s       0:09

But I do agree that the condition is now a bit long, which slows down compression for bigger files (~1.0 GiB). I'll discuss the debug-section issue with LoongArch developers; hopefully we'll find a simpler list in the future.

I recommend avoiding official Filter ID ranges until the filter is final. That way we avoid accidents where there can be files that have an official ID but cannot be decoded by stable tools. Use a random 40-bit personal ID followed by a 16-bit filter number, for example, 0x47BEC794C6xxxx where xxxx can be whatever you like. :-)

I've temporarily changed the ID to 0x47BEC794C61203.

I'm not able to promise quick progress with this. Getting it merged this year and included in xz 5.10.0 early next year could be nice. I don't promise anything, things can happen faster or slower.

Thanks for your help. I'm not familiar with xz's release cycle and version policy; I think we can decide the version number and Filter ID when you find it good to merge.

I suppose you tried tweaking LZMA2 options too. Since the instructions are 32 bits, I guess pb=2,lp=2,lc=2 should be the best like with ARM64. If you haven't tested those yet, it's worth doing.

Yes, when dealing with most executables and libraries, pb=2,lp=2,lc=2 does perform well. As for bigger libs with debug sections (such as the libQt6QmlLS.a I mentioned before, which is 1.6G in size unstripped), the default options show a better compression ratio.


Larhzu commented Apr 28, 2025

I tried setting unfiltered_max to 8 and looking forward four bytes further; unfortunately, it didn't work well and caused regressions and even data corruption, so I chose the simple pointer, which solved the out-of-bounds access and multithreading problems perfectly. I also deleted prev_pos because we don't need it anymore.

What you have now looks correct. :-) However, I still suspect it could be nicer without last_ins state, but this is a minor thing that can be done among the last steps of the filter design. I mention it now already because the topic is on my mind:

Did you also update the initial size &= ~(size_t)3; to

if (size < 8)
    return 0;

size -= 8;

and the for-loop condition to i <= size? See riscv.c, you just want to keep i += 4.
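As a compilable illustration of that loop shape (assuming a riscv.c-style simple filter; instead of converting anything, the body here just counts BL opcodes that have a previous word available in the buffer):

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t read32le(const uint8_t *buf)
{
	return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
			| ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}

// Sketch of the loop shape described above: reserve 8 bytes at the
// end so both the "previous" word at i and the candidate word at
// i + 4 are always in bounds, and iterate while i <= size.
static size_t count_bl_with_prev(const uint8_t *buffer, size_t size)
{
	size_t count = 0;

	if (size < 8)
		return 0;

	size -= 8;

	for (size_t i = 0; i <= size; i += 4) {
		const uint32_t prev_instr = read32le(buffer + i);
		const uint32_t instr = read32le(buffer + i + 4);

		(void)prev_instr; // a real filter would test this too

		if ((instr >> 26) == 0x15) // BL
			++count;
	}

	return count;
}
```

The first word of the buffer is only ever read as prev_instr and never as a conversion candidate, which matches the unfiltered_max idea.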

I tried to get rid of some conditions, but it seems that the previous instruction list shows best balance between code sections & debug sections

What you have now is easy to read and thus very good at this point of development. :-) If checking for multiple conditions is required for good results in the final version, then it is so even if it didn't feel great. Before one is sure, it's good to experiment if a simpler expression can be good enough (maybe it cannot).

For comparison, development versions of RISC-V filter had instruction sync to 16/32-bit instruction boundaries and S-type instructions were handled correctly. The final version doesn't have the sync and S-type instructions are partially misconverted as if they were I-type instructions. This made the code slightly smaller with negligible compression difference with real-world files.

Looking at the current conditions, conditions 5 and 6 are contiguous ranges, thus an optimized (but less readable) version can combine those two conditions: Edited: This was a basic thinking error.

-&& (prev_instr & 0xFC000000) != 0x24000000
-&& (prev_instr & 0xFC000000) != 0x28000000
+&& (prev_instr & 0xF8000000) != 0x24000000

As expected, the list of hex prefixes to accept is slightly awkward:

00150-00157
028-02F
03
18
19
24
25
26
27
28
29
2A
2B
40
41
5

The condition accepts about 31 of the possible 256 8-bit prefixes. That's about 3 bits that have to match in addition to (instr >> 26) == 0x15 which requires 6 bits to match. Replacing the prev_instr condition with a simple one like prev_instr <= 0x5F000000 would be weaker, providing only about 1.4 bits instead of 3. It's an example that is worth testing, but it wouldn't surprise me if it isn't good enough.

If debug sections are a special trouble, I wonder what are the common bit pattern(s) that cause false matches there.

it even improves compress ratio for some stripped .a files, for example: libQt6QmlLS.a

That's nice. As expected, unfiltered is still better, but a tarball with mixed files can contain static libs too. (Hopefully some day xz will be smart enough to at least enable/disable filters on a per-file basis when compressing archives, filtering only ELF files with e_type ET_EXEC and ET_DYN.)

I'm not familiar with xz's release cycle & version policy, I think we could decide the version number and Filter ID as you find it's good to merge.

Release cycle is what it happens to be. Sorry if I sounded pessimistic in my previous post, I just didn't want to set too high expectations because it's hard to predict the future. When you are happy with a version of the filter, there are more steps that have to happen before the filter is included in a stable XZ Utils release.

First, I need to study LoongArch64 instructions to verify the proposed implementation. I also should understand if all instructions worth filtering are being handled. For example, there seem to be instructions similar to PCALAU12I, but I have no idea how much they are used in real-world code.

Test files are needed to compare different filter variants. If debug symbols are a special concern, 2-3 executables or shared libraries (not static libs) are needed. If you can provide them, that would be great. Debian's LoongArch64 port seems to have some packages available, which hopefully are enough for testing non-debug files.

Once both you and I feel OK about the design, I want to discuss it with 7-Zip's developer Igor Pavlov. Test files are needed for this too.

I hope this explains a bit why this will take time.

Thanks!


Edited: Fix 0x5F to 0x5F000000.

Edited: Marked the bitmask thinking error.

@lrzlin lrzlin force-pushed the loong-bcj branch 2 times, most recently from 8e58801 to 93b4b92 on April 30, 2025 20:58

lrzlin commented Apr 30, 2025

What you have now looks correct. :-) However, I still suspect it could be nicer without last_ins state, but this is a minor thing that can be done among the last steps of the filter design. I mention it now already because the topic is on my mind:

Did you also update the initial size &= ~(size_t)3; to

if (size < 8)
    return 0;

size -= 8;

and the for-loop condition to i <= size? See riscv.c, you just want to keep i += 4.

Thanks for your advice! After updating the size &= ~(size_t)3; part, everything runs fine, and now we can just delete the last_ins state.

The condition accepts about 31 of the possible 256 8-bit prefixes. That's about 3 bits that have to match in addition to (instr >> 26) == 0x15 which requires 6 bits to match. Replacing the prev_instr condition with a simple one like prev_instr <= 0x5F000000 would be weaker, providing only about 1.4 bits instead of 3. It's an example that is worth testing, but it wouldn't surprise me if it isn't good enough.

I've tried different conditions; prev_instr <= 0x5C000000 (the last instruction of the LoongArch base instruction set) seems to work well (e.g. for libQt6QmlLS.a, from 1.6G to 212.5MB). However, for a better compression ratio, I decided to add the lower bound 0x00150000 (OR's opcode), alter the upper bound to 0x2A800000 (LD.WU's opcode), and then add B, BL, BEQZ, and BNEZ separately. In that case, we can compress libQt6QmlLS.a to 208.7MB; as a comparison, the earlier condition gives 207MB, the original one 216MB, and 206MB without the BCJ filter.

If debug sections are a special trouble, I wonder what are the common bit pattern(s) that cause false matches there.

The development system I use is AOSC, which doesn't strip libraries like other Linux distributions do, so at the beginning I thought that was why their binaries are so large: for example, 1.6GB for libQt6QmlLS.a while other distros like LoongArch Arch Linux have 99MB. However, after I stripped them, AOSC's libQt6QmlLS.a was still 125.1MB while the Arch Linux one dropped to only 5.1MB. I'll discuss this difference with the AOSC developers to find out the truth.

Edit: After talking with AOSC developers, the size difference should be blamed on LTO (Link-Time Optimization). AOSC enables it by default for better performance, but it generates .gnu.lto sections in ELF files, which GNU strip cannot deal with; llvm-strip can remove these LTO sections. Even with these sections removed, the filter still performs worse than LZMA2 alone, though better than the original filter (without conditions), which should be a common issue for all filters when dealing with .a files? I saw similar results with the x86 filter. So maybe debug sections aren't the special trouble; rather, it's false positives in all non-code sections, and the file I used for testing is just bigger than on other architectures, which makes it seem worse.

Compress without BCJ filter

xz -v libQt6QmlLS.a 
libQt6QmlLS.a (1/1)
  100 %     350.7 KiB / 7,216.4 KiB = 0.049                   0:01 

Compress with origin BCJ filter

xz --loongarch --lzma2 -v libQt6QmlLS.a 
libQt6QmlLS.a (1/1)
  100 %     410.7 KiB / 7,216.4 KiB = 0.057                   0:01 

Compress with conditional BCJ filter

xz --loongarch --lzma2 -v libQt6QmlLS.a 
libQt6QmlLS.a (1/1)
  100 %     381.7 KiB / 7,216.4 KiB = 0.053                   0:01

Release cycle is what it happens to be. Sorry if I sounded pessimistic in my previous post, I just didn't want to set too high expectations because it's hard to predict the future. When you are happy with a version of the filter, there are more steps that have to happen before the filter is included in a stable XZ Utils release.

First, I need to study LoongArch64 instructions to verify the proposed implementation. I also should understand if all instructions worth filtering are being handled. For example, there seem to be instructions similar to PCALAU12I, but I have no idea how much they are used in real-world code.

Yes, there are PCADDI & PCADDU12I, which are similar to PCALAU12I, and I've tried adding them to the filter, but they are used relatively rarely in executables and libraries, so they didn't really help increase redundancy. Also, LoongArch compilers mostly use BL and PCALAU12I for function calls, so at least for current compilers, filtering only those two should be good enough.

Test files are needed to compare different filter variants. If debug symbols are a special concern, 2-3 executables or shared libraries (not static libs) are needed. If you can provide them, that would be great. Debian's LoongArch64 port seems to have some packages available, which hopefully are enough for testing non-debug files.

Yes, we could use Debian's packages for testing non-debug files. As for debug files, AOSC should be okay for testing; they keep all the debug info in their packages, and they also use deb files, so we can easily use them for testing. Here is the package download site: https://packages.aosc.io/

Once both you and I feel OK about the design, I want to discuss it with 7-Zip's developer Igor Pavlov. Test files are needed for this too.

I hope this explains a bit why this will take time.

Thanks for your kind explanation; I'll keep working on it until we find it OK to merge.


Larhzu commented May 10, 2025

I still haven't tested the filter, but on the surface it looks fairly nice already. :-) However, there is more work to do still.

Instructions to filter

Yes, there are PCADDI & PCADDU12I, which are similar to PCALAU12I, and I've tried adding them to the filter, but they are used relatively rarely in executables and libraries, so they didn't really help increase redundancy. Also, LoongArch compilers mostly use BL and PCALAU12I for function calls, so at least for current compilers, filtering only those two should be good enough.

It would be good to understand why the instructions PCADDI, PCADDU12I, and PCADDU18I exist. Then one can estimate how much they can matter in the filter. Also, compiler options might affect how much they are used (for example, on RISC-V, -fPIC makes AUIPC common). If one only looks at existing binaries in one distro, one might miss something that matters in some other situation.

binutils' elfnn-loongarch.c is one source of information.

PCADDI

Linker relaxation of PCALAU12I + ADDI.D can result in PCADDI. See loongarch_relax_pcala_addi in the binutils source file.

If I understand correctly, PCADDI can be used to get the address of nearby (±2 MiB) functions and 4-byte-aligned global variables. Maybe it's somewhat common if there are many functions that refer to the same global variables, and the executable or shared library is small enough.

ARM64's ADR has ±1 MiB range, so it's somewhat similar to PCADDI. The ARM64 filter doesn't modify ADR instructions. So maybe PCADDI isn't worth filtering. False positives would be a problem too like they are with BL.

PCADDU12I

An FAQ says that PCADDU12I is used in PLT stubs. The binutils source file confirms this. If this was the only use of PCADDU12I then perhaps filtering it isn't important.

I read that 32-bit LA32R might lack PCALAU12I and thus such binaries use PCADDU12I. LA32R and also LA32S lack PCADDU18I, but that instruction shouldn't be useful on a 32-bit system anyway.

Full filtering of PCADDU12I might be like with RISC-V's AUIPC because the instruction can be paired with a few other instructions. Does the filter matter on LA32R? A compromise could be to filter PCADDU12I + JIRL which, based on RISC-V, could be the most common pairing. (Other pairs to PCADDU12I could be ADDI.W/D and loads and stores.) Or there could be separate LA32R filter later if needed, and ignore PCADDU12I for now.

PCADDU18I

The code models section in the psABI shows that PCADDU18I is used with the medium code model. I got an impression that LLVM recently switched to medium by default because big applications fail to link with the 128 MiB limit of the normal/small code model.

When compiling for the medium code model, object files (and thus static libs) should have many PCADDU18I + JIRL pairs. However, linker relaxation turns them into BL or B if the range is small enough. See loongarch_relax_call36 in the linked binutils source file. Thus, the PCADDU18I might not appear in executables and shared libs unless they are big enough (over 128 MiB).

If PCADDU18I is filtered, it highly likely should be filtered as PCADDU18I + JIRL pair. The JIRL contains the low bits of the address. There doesn't seem to be any other instruction that would neatly pair with PCADDU18I to calculate an address or load/store data.

Filtering PCADDU18I + JIRL shouldn't cause many false matches. 6+5 opcode bits must match. Also rj of JIRL must equal rd of PCADDU18I, which provides an extra five bits to the match requirement. Due to how the pseudo-instructions call36 and tail36 are defined, rd of JIRL should normally be the same register too, and matching this provides another five bits. Then one would have 6+5+5+5 bits to match already. One could even restrict the range to something smaller than ±128 GiB. For example, a ±8 GiB range would require 6+5+5+5+4=25 bits to match, but that might be stricter than required in practice.
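A sketch of such a pair check follows. It is my own reading of the encodings and worth double-checking against the ISA manual: PCADDU18I should have the 7-bit major opcode 0b0001111 in bits 31:25 with rd in bits 4:0, and JIRL the 6-bit opcode 0b010011 in bits 31:26 with rj in bits 9:5 and rd in bits 4:0.

```c
#include <stdint.h>

// Check for a PCADDU18I + JIRL pair where JIRL's rj is the
// register that PCADDU18I wrote. A stricter version could also
// require JIRL's rd to match (the call36/tail36 idiom).
static int is_pcaddu18i_jirl_pair(uint32_t first, uint32_t second)
{
	if ((first & 0xFE000000) != 0x1E000000)  // PCADDU18I
		return 0;

	if ((second & 0xFC000000) != 0x4C000000) // JIRL
		return 0;

	// rj of JIRL must equal rd of PCADDU18I.
	return ((second >> 5) & 0x1F) == (first & 0x1F);
}
```

The register comparison is what provides the extra five bits of match requirement mentioned above.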

Technically it's possible to use PCALAU12I + JIRL and even PCADDU12I + JIRL for long jumps too. The binutils file does mention pcalau12i + jirl in some context but maybe it's a legacy thing, I didn't investigate much.

Other thoughts about PCAxxx

Like with RISC-V's AUIPC, LoongArch's PCADDU12I and PCADDU18I cannot be perfectly filtered without the paired instruction: the same absolute address can be referenced with two consecutive relative addresses in PCADDU1xI because the paired instruction uses a sign-extended immediate. Matching the paired instruction reduces false matches considerably too, but it only works (easily) if the instructions are adjacent.

It's good to look at output from more than one toolchain. With RISC-V, the executables from the Go compiler had some small differences in filtering results compared to GCC and Clang/LLVM, so binaries of programs written in Go might be worth checking out.

Curiously, Go's link tool uses trampolines (it's called veneer in ARM64). This suggests that one won't find any PCADDU18I instructions from Go binaries even if they exceeded 128 MiB.

Summary

From my side, this is still a lot of guessing and hoping instead of actually knowing something.

  • Without looking at disassembly of binaries, I don't have a confident guess about PCADDI, but maybe it doesn't need to be filtered as you had observed already.

  • I suspect that PCADDU18I + JIRL combination is worth filtering, but one needs big (>128 MiB) executables or shared libs to find such instruction pairs.

  • PCADDU12I + some_instr hopefully is rare and thus not worth filtering outside LA32R. If it was filtered, the paired instruction should be filtered too like with PCADDU18I, but now multiple instructions are possible pairs like with RISC-V's AUIPC.

Trying to filter also B in addition to BL is worth testing, but based on the experience from other instruction sets, the results likely are mixed or negative. B is used for tail calls (which should be worth filtering) and short in-function jumps (maybe not worth filtering).


Unaligned access

From what I read, it sounds like LoongArch processors don't necessarily support fast unaligned memory access. This is true on RISC-V and ARM64 too, although I have (possibly incorrectly) an impression that most ARM64 processors have fast unaligned access. While LoongArch instructions are four-byte aligned, the filter in liblzma has to work with unaligned buffers too.

read32le is safe with unaligned input, but if unaligned access isn't fast, the compiler may generate byte-by-byte accesses. Detecting BL and PCALAU12I requires checking only one byte. It might be faster to compare only that byte, and load the rest of the bytes only when the opcode matches.

Since you have access to real LoongArch hardware, it could be useful to check if there's a speed difference. One should test the filter alone, for example, by having a gigabyte-sized buffer and repeatedly calling the filter/unfilter function directly on the buffer (without compression), comparing the speed: aligned vs. unaligned buffer, and read32le vs. byte-by-byte reads. One shouldn't worry about this too much; there's no need to spend a lot of time on this.
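The one-byte pre-check could look like this (a sketch, assuming little-endian data so the opcode bits sit in the last byte of each word; BL is (instr >> 26) == 0x15, i.e. top byte 0x54–0x57, and PCALAU12I is (instr >> 25) == 0x0D, i.e. top byte 0x1A–0x1B):

```c
#include <stdint.h>

// One-byte pre-check on the high (opcode) byte of a little-endian
// instruction word: true when the word might be BL or PCALAU12I,
// so the full 32-bit load can be skipped for everything else.
static int opcode_may_match(uint8_t top)
{
	return (top & 0xFC) == 0x54       // BL: (instr >> 26) == 0x15
			|| (top & 0xFE) == 0x1A;  // PCALAU12I: (instr >> 25) == 0x0D
}
```

In the loop this would be applied to buffer[i + 3] before calling read32le(buffer + i).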


Branch conditions

In my previous post I messed up the bitmask math, sorry. I try again with the new code:

if ((prev_instr < 0x00150000 || prev_instr > 0x2A800000)
        && (prev_instr & 0xF8000000) != 0x40000000
        && (prev_instr & 0xF8000000) != 0x50000000)
    continue;

The last two conditions can be combined to (prev_instr & 0xE8000000) != 0x40000000. One branch is avoided if the first two are combined using subtraction, but only testing tells if it makes any kind of difference (the code only runs when the BL opcode matches so it might not really matter).

if ((uint32_t)(prev_instr - 0x00150000) > 0x2A800000 - 0x00150000
        && (prev_instr & 0xE8000000) != 0x40000000)
    continue;

At this point this is a very minor silly thing, but I mentioned it anyway because it was on my mind. :-)


Static libs

In that case, we can compress libQt6QmlLS.a to 208.7MB; as a comparison, the earlier condition gives 207MB, the original one 216MB, and 206MB without the BCJ filter.

Edit: After talking with AOSC developers, the size difference should be blamed on LTO (Link-Time Optimization). AOSC enables it by default for better performance, but it generates .gnu.lto sections in ELF files, which GNU strip cannot deal with; llvm-strip can remove these LTO sections. Even with these sections removed, the filter still performs worse than LZMA2 alone, though better than the original filter (without conditions), which should be a common issue for all filters when dealing with .a files?

Static libraries (.a files) have nothing to filter. They consist of object files (.o) where the relative addresses are filled with zeros. If a package has a few tiny static libraries or objects (.o files) and also big executables and shared libraries, then the small .a and .o files might not matter much. But otherwise one should avoid filtering .a, .o, and .ko (Linux kernel modules).

The best solution is to make xz smarter and detect file types in the input stream. This has been planned for a very long time. I don't know when it will be implemented, but some version could be worth trying for the next release cycle (even a simple but kind of bad one, that is, working somewhat OK if input archive has the files sorted by type so all executables and shared libs are together).

It's also possible to make the filter relatively harmless on .a and .o files by implementing zero-skipping (the relative addresses are zeros in .a and .o and .ko). This was debated at length during the ARM64 filter design. In the end, it wasn't included. It's not much code, but if smarter filtering is implemented later, then the zero-skip feature has slightly negative value. So while zero-skip is technically simple, it kind of feels complicated at the same time. :-|

Because the stricter condition for BL helped significantly with .a files, the stricter condition likely is a good addition even if your testing was done with .a files. Matching BL alone requires only six bits to match, which isn't much. As long as the stricter condition doesn't hurt with executables and shared libraries, it is worth keeping. :-) (It even makes me wonder if the ARM64 filter could have had something like that. The false matches with BL are a problem there too.)


Sorry about the overlong post. The important thing is to figure out if there are more instructions or instruction pairs that should be filtered.


thesamesam commented May 11, 2025

Edit: After talking with AOSC developers, the size difference should be blamed on LTO (Link-Time Optimization). AOSC enables it by default for better performance, but it generates .gnu.lto sections in ELF files, which GNU strip cannot deal with; llvm-strip can remove these LTO sections. Even with these sections removed, the filter still performs worse than LZMA2 alone, though better than the original filter (without conditions), which should be a common issue for all filters when dealing with .a files?

When you build with -flto and build a static library with non-ancient GCC, the resultant foo.a is a static archive containing only LTO IR/bytecode. If you link against foo.a, it will be using LTO. As @Larhzu says, it's not object code at all. But even when it is object code, filters can't do anything here.

Nonetheless.. if you build with -flto -ffat-lto-objects, GCC will populate foo.a with both LTO IR sections and regular object code, and then strip can remove the LTO part.

GNU strip should support this more easily with upcoming Binutils 2.45, thanks to H.J. Lu (https://sourceware.org/PR21479). But you can just do strip -R .gnu.lto_* -R .gnu.debuglto_* -N __gnu_lto_v1 today which works fine (you can drop the -N ... if you're only supporting non-ancient GCC).

Clang these days supports -ffat-lto-objects too (but you have to use llvm-bitcode-strip then). The alternative, for Clang and I think GCC too, would be to recompile those objects using the driver into regular ELF objects.

TL;DR: don't distribute LTO'd static archives, make sure you build w/ -ffat-lto-objects and strip them (using the recipe above). You usually don't want LTO'd static archives because they require any consumers to have the same toolchain version.

@lrzlin lrzlin force-pushed the loong-bcj branch 2 times, most recently from d417d77 to ec4376d on May 21, 2025 17:13

lrzlin commented May 21, 2025

Thanks for your detailed post, and thanks to @thesamesam for kindly pointing out the LTO-related issue. I've told the AOSC developers to strip the LTO sections from their static libraries, which reduces the binary size hugely.

PC relative & B filtering

I've tried to add other PC relative instructions and B to the filter, here are the results:

  1. PCADDU12I is only used for PLT entries in loongarch64 binaries, so filtering it actually reduces redundancy somewhat. Though LA32R lacks PCALAU12I, the only use of that subset is for educational purposes, so I think we can safely ignore it for now.
  2. B showed mixed results: on some executables it did help, but when dealing with all the binaries in /bin it even reduced the compression ratio, so I think it's not worth filtering, as on other architectures.
  3. PCADDU18I + JIRL shows a great improvement on executables and dynamic libraries, and with zero-skipping it didn't even hurt the compression ratio of .o and .a files. However, adding zero-skipping to the other filtered instructions like BL or PCALAU12I caused data corruption, and it didn't help the ratio much either, so I added it to PCADDU18I only.
  4. PCADDI, to my surprise, may be worth filtering: while it did increase the compressed size of .a/.o files, it shows a consistent improvement on binaries, and its filtering logic is pretty simple. I'll show the data in the chart below. But compared with PCADDU18I, false positives are still an issue, and they cannot be resolved by zero-skipping due to data corruption.
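One plausible reason zero-skipping corrupts data on single instructions like BL or PCALAU12I is a simple pigeonhole problem: if the encoder leaves zero offsets alone but converts everything else, two different inputs can map to the same output bytes, so no decoder can invert the transform. A toy model (not the filter code; whether this is the exact failure mode here would need checking against the patch):

```c
#include <stdint.h>

#define OFF_MASK 0x03FFFFFFu // 26-bit offset field, as in BL

// Toy encoder with naive zero-skipping: keep a zero offset as-is,
// otherwise store offset + pc (an absolute word address).
static uint32_t zskip_encode(uint32_t off, uint32_t pc)
{
	return off == 0 ? 0 : (off + pc) & OFF_MASK;
}
```

A genuinely zero offset and a nonzero offset whose absolute address happens to wrap to zero both encode to the same field value, so the mapping is not injective.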

About the implementation of PCADDU18I + JIRL: since the address can be 38 bits long, I decided to store it split across the two instructions instead of using a standalone address like RISC-V. I also tried a big-endian address layout, but it didn't show any difference in compression ratio, so I just used the split form, which supports the full 38-bit address length.
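Splitting one address across the si20/offs16 pair needs the usual sign-extension carry fix, since the low 16-bit half is sign-extended when the pair is recombined. A standalone sketch of the split and its inverse, working in instruction words with the JIRL offset's <<2 dropped (my guess at the arithmetic, not the patch itself):

```c
#include <stdint.h>

// Recombine a 20-bit high part and 16-bit low part the way a
// PCADDU18I + JIRL pair would: value = (hi << 16) + sext16(lo),
// kept to 36 bits (20 + 16).
static uint64_t pair_combine(uint32_t hi, uint32_t lo)
{
	int64_t slo = (int64_t)(lo & 0xFFFF);
	if (lo & 0x8000)
		slo -= 0x10000; // sign-extend the 16-bit half
	return (uint64_t)(((int64_t)hi << 16) + slo) & 0xFFFFFFFFFull;
}

// Split a 36-bit value back into hi/lo; the +1 carry compensates
// for lo being sign-extended on recombination.
static void pair_split(uint64_t v, uint32_t *hi, uint32_t *lo)
{
	*lo = (uint32_t)(v & 0xFFFF);
	*hi = (uint32_t)(((v >> 16) + ((v >> 15) & 1)) & 0xFFFFF);
}
```

The encoder would split the absolute address into the two immediate fields and the decoder would recombine, subtract pc, and split the relative offset back.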

Here are the testing results; "skip" stands for zero-skipping, and SLib means static libraries with debug info but without LTO data.

| Filtered Instructions | Bin | SLib | DLib | Stripped SLib | Overall |
|---|---|---|---|---|---|
| None | 818.1MiB/0.217 | 553.9MiB/0.118 | 918.1MiB/0.229 | 276.0MiB/0.255 | 2566.1MiB/0.189 |
| Plain BL + PCALAU12I | 787.1MiB/0.209 | 585.3MiB/0.124 | 866.9MiB/0.216 | 284.6MiB/0.263 | 2523.9MiB/0.186 |
| BL/PCALAU12I | 783.7MiB/0.208 | 567.9MiB/0.121 | 865.9MiB/0.216 | 280.8MiB/0.260 | 2498.3MiB/0.184 |
| BL/PCALAU12I/ADDU18I | 757.1MiB/0.201 | 573.3MiB/0.122 | 849.0MiB/0.211 | 286.2MiB/0.265 | 2465.6MiB/0.182 |
| BL/PCALAU12I/ADDU18I (skip) | 757.0MiB/0.201 | 567.8MiB/0.121 | 849.0MiB/0.211 | 280.8MiB/0.260 | 2454.6MiB/0.181 |
| BL/PCALAU12I/ADDU18I/ADDI (skip) | 752.9MiB/0.200 | 573.8MiB/0.122 | 842.6MiB/0.210 | 282.5MiB/0.261 | 2451.8MiB/0.181 |

Based on the latest gcc/llvm toolchain and an updated AOSC OS, with all LTO sections removed from static libraries.

Branch Conditions & Static Libs

if ((uint32_t)(prev_instr - 0x00150000) > 0x2A800000 - 0x00150000
        && (prev_instr & 0xE8000000) != 0x40000000)
    continue;

The last two conditions can be combined to (prev_instr & 0xE8000000) != 0x40000000. One branch is avoided if the first two are combined using subtraction, but only testing tells if it makes any kind of difference (the code only runs when the BL opcode matches so it might not really matter).

I tried the new branch conditions and they did speed up the filter a bit. The testing results also show that even after the LTO sections were removed, the conditions still improve redundancy, so they are worth keeping.

Unaligned access

Strangely enough, on LoongArch hardware (a Loongson 3A6000, 4c/8t, running at 2.5 GHz) I couldn't find a steady relation between the filter's complexity and time consumption. When dealing with large chunks of data (like the bin set in the chart, which is ~3.7 GiB), the most complex filter (conditional BL + PCALAU12I + PCADDI + PCADDU18I/JIRL) even runs faster than no BCJ filter at all (4:35 vs. 4:53). I'm pretty confused by this; the LoongArch processor doesn't support dynamic frequency boost, so is it related to process scheduling, or does it show that unaligned access isn't a problem we need to worry about?

Zero-skipping and conditional BL filtering

It's also possible to make the filter relatively harmless on .a and .o files by implementing zero-skipping (the relative addresses are zeros in .a, .o, and .ko). This was debated at length during the ARM64 filter design. In the end, it wasn't included. It's not much code, but if smarter filtering is implemented later, then the zero-skip feature has slightly negative value. So while zero-skip is technically simple, it kind of feels complicated at the same time. :-|

After implementing PCADDU18I + JIRL and zero-skipping, the compression ratio of this filter improved further, and zero-skipping for this instruction pair didn't show any other side effects. So I'm curious about the negative value of zero-skipping: does it make the smart filtering harder to write? If we go that way, the conditional BL and the zero-skipping of PCADDU18I + JIRL could become useless and should probably be removed, but so far they do give a better compression ratio.

Sorry for the late reply; it took me quite some time to figure out the best way to implement the PCADDU18I + JIRL pair filtering. Hopefully this makes progress toward our final result.

@lrzlin (Author) commented Jul 11, 2025

@Larhzu Gentle ping. Is there any progress we could make on this?

@Larhzu (Member) commented Oct 3, 2025

Sorry for the delay. I was on an unannounced long break. Before the break I had drafted half a response, but I need to re-read things to remember the details and finish the reply. Hopefully I get it done next week.

@lrzlin (Author) commented Oct 9, 2025

> Sorry for the delay. I was on an unannounced long break. Before the break I had drafted half a response, but I need to re-read things to remember the details and finish the reply. Hopefully I get it done next week.

Welcome back! I'll be waiting for your reply and will keep improving this.

@Larhzu (Member) commented Oct 21, 2025

I'm very sorry, I haven't been able to spend much time on xz since my previous message. :-(

Instructions

If LA32R can be ignored, then ignoring PCADDU12I sounds OK. It would be the most complex thing to filter (like AUIPC in RISC-V), so it's a significant simplification. :-) I just hope that LA32R won't some day be used in some situation where filtering would make sense (microcontrollers that run code directly from ROM shouldn't need the filter).

Your comments about B match what was seen on ARM64 and RISC-V, so let's ignore B.

PCADDI requires seven bits to match. That's one bit more than the six bits BL, and false positives with BL were a problem without the extra condition that compares the previous instruction. I wonder if a similar trick is possible with PCADDI, if false positives are a problem.

In the RISC-V filter, reordering the address bits of AUIPC + instr2 and using big endian helped a tiny amount (maybe 0.2 %). It was consistent though and didn't really affect code size of a decoder-only implementation, but on the other hand it makes the code more complex to understand. I might try the trick for PCADDU18I + JIRL among the last things before the filter is final; it's possible that keeping the split encoding is the way to go.

The internal APIs of existing BCJ filter implementations only support 32-bit program counter (pc). This means that the filter will provide suboptimal results if the relative address isn't within ±2 GiB. It will still work and be helpful, it just could be a slightly better in that case.

Part of me wonders if it would make sense to limit the range in the filter so that the highest six bits of the immediate in PCADDU18I would need to be either all zeros or all ones. Then the range would be ±2 GiB. I haven't thought through whether the resulting code would be more complex or not. The ARM64 filter does something like this with ADRP.

If the range was limited, checking for JIRL's rd == $ra/$zero might not be needed. That is, the number of false positives would be low enough without that check.

Binaries that need more than 2 GiB of range are likely to remain uncommon outside special cases. Extending pc to 64 bits is easy enough, but existing filters running on 32-bit processors would then have a few useless extra instructions. It's only a silly very tiny detail, but I'm not sure right now what is best in practice.

Measuring speed

When comparing filter speed, you should compare the pure filter without any compression. When combined with LZMA/LZMA2, the speed difference of the filter variants isn't so significant; it could matter with some other compressor though.

It is expected that BCJ+LZMA2 is faster to compress and decompress than LZMA2 alone, because the compression is better. So your results make sense.

Unaligned access

I'm not sure if you understood what unaligned memory access or misaligned memory access means. If a program wants to load a 32-bit value from memory, on a 32/64-bit processor it's an aligned access if the value is stored at an address that is a multiple of four bytes. Otherwise it's unaligned. If the processor doesn't support fast unaligned access, the compiler will generate byte-by-byte accesses when it doesn't know whether the load or store is aligned.

Try the following on Godbolt with -O2 optimization:

#include <stdint.h>
#include <string.h>

uint32_t
read32ne(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

With GCC 15.1.0 it produces just ldptr.w, meaning that unaligned access is assumed to be fast:

read32ne:
        ldptr.w $r4,$r4,0
        jr      $r1

If you change the options to -O2 -mstrict-align then byte-by-byte access is used:

read32ne:
        ld.bu   $r15,$r4,0
        ld.bu   $r14,$r4,1
        ld.bu   $r13,$r4,2
        ld.bu   $r12,$r4,3
        addi.d  $r3,$r3,-16
        st.b    $r15,$r3,12
        st.b    $r14,$r3,13
        st.b    $r13,$r3,14
        st.b    $r12,$r3,15
        ldptr.w $r4,$r3,12
        addi.d  $r3,$r3,16
        jr      $r1

If the filter code is built for a processor that doesn't support fast unaligned access, it's waste of time to do the byte-by-byte access when only one of the bytes matter in the outermost conditions in the loop. For example, if ((inst >> 26) == 0x15) cares only about one byte.

By the way, the above byte-by-byte assembly doesn't seem great because it reconstructs the aligned value on stack, requiring more memory operations. There is a GCC bug report about it from 2023. The same thing happens on a few other archs.

To test the speed, one could load a big file (a hundred megabytes or more) to RAM and filter/defilter it multiple times in a loop, and measure the time. If the input buffer isn't aligned, then it might be slightly slower. That is, if you have

_Alignas(4)
static uint8_t buf[256U << 20];

and load the file to start from buf + 1, then filtering might be slower when you call loongarch_code(..., buf + 1, size).

This was quite a bit of text to explain, but it's only an implementation detail; the filtering algorithm doesn't change at all. This is about which is faster: using read32le or only reading the individual bytes when needed. The RISC-V filter does the latter because there are enough real-world RISC-V processors that don't support fast unaligned access. It makes the C code uglier but results in better output from compilers for such processors.

So this can be changed after the filter design has been finished. :-) In practice it likely won't be changed because no one will care enough to notice. If the filter is slow, then people will likely just assume that it is what it is.

From #186 I understood that most LoongArch processors support fast unaligned access and thus GCC defaults to -mno-strict-align and most distros do too. AOSC was mentioned as a distro that is different and uses -mstrict-align to support the less common LoongArch processors too.

Zero-skipping

While zero-skipping isn't a lot of code, it needs a few more characters than you have now. The if (addr == 0) in the current PCADDU18I + JIRL code will corrupt data, you just haven't managed to trigger it.

If encoder skips all zeros, it means that it will never produce 0 + pc. However, it will convert 0 - pc to zero. Because the decoder skips zeros, it won't decode it back to 0 - pc. This results in data corruption.

In addition, the decoder will decode 0 + pc to zero. Because the encoder never modified zeros, this too will result in data corruption.

One could make the encoder skip 0 - pc in addition to zeros, and similarly the decoder could also skip 0 + pc, but this would merely create new problematic values, and thus it's not the solution. (If you want to figure out the solution yourself, pause reading now.)

Correct solution: Both encoder and decoder will skip zeros without modifying them. In the encoder, 0 - pc is the problematic input. It cannot be converted to zero. Because the encoder will never convert zero to 0 + pc, it can convert 0 - pc to 0 + pc without creating any new problematic values.

The decoder can then reverse it perfectly. It will skip zeros and convert 0 + pc to 0 - pc instead of zero.

One way to think about it is that the encoder and decoder need to skip both relative zero and absolute zero.

Downsides of zero-skipping:

  1. If a popular function or data happens to be at absolute address of zero, instructions that refer to that address won't benefit from the filter.

    • If zero is the beginning of the ELF file, then this shouldn't matter, I suppose.

    • If one is filtering only the .text section, then the references to the first function won't be filtered if the start offset of the filter is zero. This wouldn't be ideal. It could be worked around with non-zero start offset, although with bad luck the new zero will then refer to some popular global variable. ;-)

  2. It adds a few more instructions to code size (not that many though).

  3. One or two more branches might affect speed (but likely not really).

  4. It's unnecessary if filter can be applied more smartly. That is, teaching xz to apply BCJ filters only to files that are worth filtering (not filtering static libs). This idea has existed for years but hasn't been implemented.

I recommend you skip testing .a, .o, and .ko files completely unless you want to test the effects of zero-skipping. Those types of files have nothing to filter, so a filter can only make compression worse with those files.

Next steps

I would like to hear if you have thoughts about false positives with PCADDI and if they can be reduced the same way as was done with BL.

After that, I need to test the filter (including a few variants) quite a lot myself. You have done great work, which means I need to do much less, but I still need to check a few things myself to feel confident that the new filter is good. Past experience is that this takes time (one reason is that the differences between some variants can be somewhat inconsistent).

There are some other FOSS topics (mostly but not only xz) in my queue which are even older than this PR and which aren't quick to finish. I feel I should prioritise those to some extent, and thus delay some of this filter work quite a bit (2-4 months), I'm sorry. If you have any ideas (like about false positives with PCADDI), I try to comment those somewhat quickly still. :-)

The filter code might not be very long, but the discussions around ARM64 and RISC-V filter designs were pretty long still. So don't be discouraged, this discussion isn't terribly long yet.

The new filter ID is 0x0C.
@lrzlin (Author) commented Oct 24, 2025

Thanks for your detailed reply! I've followed your instructions and tested the fixed filter; here are the results.

PCADDI and zero-skipping

After further testing, the same filter-and-skip strategy we used on BL also works on PCADDI: the bloat of static libraries became smaller (from 573.8 MiB to 570.4 MiB), and the compression ratio of binaries improved a little too, so it did help to shrink the overall file size. Therefore I've added it to the filter.

The zero-skipping in PCADDU18I + JIRL is now fixed and works perfectly; many thanks for pointing me to the correct way to do it. Since zero-skipping (and the previous-instruction check too) is mainly designed to improve the compression ratio on .o and .a files, it could become useless if we implement smart filtering, or if we assume the user will only apply the BCJ filter to executables and dynamic libraries. So if zero-skipping really becomes an obstruction to future improvements, I'd say we had better remove it.

Unaligned access and speed

I've tested the filter separately, without any compression algorithm. On my platform (a Loongson 3A6000, which supports unaligned memory access), when dealing with large files (~166 MiB) there is no noticeable speed difference between loongarch_code(..., buf + 1, size) and loongarch_code(..., buf, size). Moreover, as the comments in #186 say, all LoongArch desktop-level processors support fast unaligned memory access, so it should not be a problem.

PCADDU18I + JIRL

I've tried limiting the upper 6 immediate bits of PCADDU18I to all zeros or all ones, but the test results didn't change much. I believe you are right that jumps of more than 2 GiB are pretty rare in existing binaries, but adding this limitation still needs one more if statement, so keeping the JIRL rd == $ra/$zero check is better in my opinion.

After these modifications and fixes, I believe the filter is ready for your testing. Again, many thanks for your kind and detailed reply. If you are busy with other FOSS topics, that's fine; I'll keep waiting and do as much as I can.
