liblzma: Add LoongArch BCJ filter #173
|
Here are some quick early comments. You can use the If you only care about the previous four bytes, set do this: Then you have the previous instruction in False positives can be a problem with other filters too, but your description sounds like that there is a common pattern in debug sections that causes too much trouble. A higher level smart filter, that applies these simple filters only to code sections, was planned over 15 years ago, but it still hasn't been implemented. (Compare to 7-Zip which has done fancy things over two decades in the .7z format. It slightly helps that 7-Zip does archiving too, so it can seek in the input files and even sort them.) So, if false positives are a big problem now, then the filter perhaps has to handle them to some extent at least like you have done now. The current list of previous instructions to match is quite long in sense that the if-statement has many conditions. This is good in the beginning, but hopefully it can be simplified to a shorter "good enough" condition. The development versions of the RISC-V filter had a few too-specific conditions too that were simplified for the final version. I recommended avoiding official Filter ID ranges until the filter is final. That way we avoid accidents where there can be files that have official ID but cannot be decoded by stable tools. Use random 40-bit personal ID followed by 16-bit filter number, for example, 0x47BEC794C6xxxx where xxxx can be whatever you like. :-) I'm not able to promise quick progress with this. Getting it merged this year and included in xz 5.10.0 early next year could be nice. I don't promise anything, things can happen faster or slower. I suppose you tried tweaking LZMA2 options too. Since the instructions are 32 bits, I guess Thanks! |
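The temporary ID scheme suggested above (a random 40-bit personal ID followed by a 16-bit filter number) can be sketched as follows. The helper name `make_custom_filter_id` is hypothetical and is not part of liblzma; it only illustrates how the two parts combine into one 56-bit value.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a liblzma API): build a temporary custom
 * Filter ID from a random 40-bit personal ID and a 16-bit filter number.
 */
static uint64_t
make_custom_filter_id(uint64_t personal_id40, uint16_t filter_number)
{
	/* Keep only the low 40 bits so the result fits in 56 bits. */
	const uint64_t personal = personal_id40 & UINT64_C(0xFFFFFFFFFF);
	return (personal << 16) | filter_number;
}
```

With the example prefix from the comment, `make_custom_filter_id(0x47BEC794C6, 0x0001)` gives `0x47BEC794C60001`.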
|
Thanks for your quick reply!
I've tried setting
I tried to get rid of some conditions, but the previous-instruction list seems to give the best balance between code sections and debug sections. It even improves the compression ratio for some stripped .a files, for example: Compress without BCJ filter Compress with original BCJ filter Compress with improved BCJ filter But I do agree that the condition is a bit long now, which slows down compression for bigger files (1.0G~). I'll discuss the debug sections with LoongArch developers; hopefully we'll find a simpler list in the future.
I've temporarily changed the ID to
Thanks for your help. I'm not familiar with xz's release cycle and version policy, so I think we could decide the version number and Filter ID when you find it's good to merge.
Yes, when dealing with most executables & libraries, |
What you have now looks correct. :-) However, I still suspect it could be nicer without Did you also update the initial and the for-loop condition to
What you have now is easy to read and thus very good at this point of development. :-) If checking for multiple conditions is required for good results in the final version, then it is so even if it didn't feel great. Before one is sure, it's good to experiment if a simpler expression can be good enough (maybe it cannot). For comparison, development versions of RISC-V filter had instruction sync to 16/32-bit instruction boundaries and S-type instructions were handled correctly. The final version doesn't have the sync and S-type instructions are partially misconverted as if they were I-type instructions. This made the code slightly smaller with negligible compression difference with real-world files.
As expected, the list of hex prefixes to accept is slightly awkward: The condition accepts about 31 of possible 256 8-bit prefixes. That's about 3 bits that have to match in addition to If debug sections are a special trouble, I wonder what are the common bit pattern(s) that cause false matches there.
That's nice. As expected, unfiltered is still better, but a tarball with mixed files can contain static libs too. (Hopefully some day xz will be smart enough to at least enabled/disable filter per-file basis when compressing archives, filtering only ELF files with
Release cycle is what it happens to be. Sorry if I sounded pessimistic in my previous post, I just didn't want to set too high expectations because it's hard to predict the future. When you are happy with a version of the filter, there are more steps that have to happen before the filter is included in a stable XZ Utils release. First, I need to study LoongArch64 instructions to verify the proposed implementation. I also should understand if all instructions worth filtering are being handled. For example, there seem to be instructions similar to Test files are needed to compare different filter variants. If debug symbols are a special concern, 2-3 executables or shared libraries (not static libs) are needed. If you can provide them, that would be great. Debian's LoongArch64 port seems to have some packages available, which hopefully are enough for testing non-debug files. Once both you and I feel OK about the design, I want to discuss it with 7-Zip's developer Igor Pavlov. Test files are needed for this too. I hope this explains a bit why this will take time. Thanks! Edited: Fix Edited: Marked the bitmask thinking error. |
8e58801 to
93b4b92
Thanks for your advice! After updating the
I've tried different conditions, the
The development system I use is AOSC. It doesn't strip libraries the way other Linux distributions do, so at first I thought that was why its binaries are so large. For example, libQt6QmlLS.a is 1.6GB there, while on other distros like LoongArchArchLinux it is 99MB. However, after stripping, AOSC's libQt6QmlLS.a is still 125.1MB, while the ArchLinux one drops to only 5.1MB. I'll discuss this difference with AOSC developers to find out the cause. edit: After talking with AOSC developers, the size difference is due to LTO (Link Time Optimization). AOSC enables it by default for better performance, however it'll generate Compress without BCJ filter Compress with original BCJ filter Compress with conditional BCJ filter
Yes, there are
Yes, we could use Debian's packages for testing non-debug files. As for debug files, AOSC should be fine for testing: they keep all the debug info in their packages, and they also use deb files, so we can easily use them for testing. Here is the package download site: https://packages.aosc.io/
Thanks for your kind explanation. I'll keep working on it until we find it OK to merge. |
|
I still haven't tested the filter, but on the surface it looks fairly nice already. :-) However, there is more work to do still.
Instructions to filter
It would be good to understand why the instructions binutils' elfnn-loongarch.c is one source of information.
PCADDI
Linker relaxation of If I understand correctly, ARM64's
PCADDU12I
An FAQ says that I read that 32-bit LA32R might lack Full filtering of
PCADDU18I
The code models section in the psABI shows that When compiling for the medium code model, object files (and thus static libs) should have many If Filtering Technically it's possible to use
Other thoughts about PCAxxx
Like with RISC-V's It's good to look at output from more than one toolchain. With RISC-V, the executables from the Go compiler had some small differences in filtering results compared to GCC and Clang/LLVM, so binaries of programs written in Go might be worth checking out. Curiously, Go's link tool uses trampolines (it's called veneer in ARM64). This suggests that one won't find any
Summary
From my side, this is still a lot of guessing and hoping instead of actually knowing something.
Trying to filter also
Unaligned access
From what I read, it sounds like that LoongArch processors don't necessarily support fast unaligned memory access. This is true on RISC-V and ARM64 too although I have (possibly incorrectly) an impression that most ARM64 processors have fast unaligned access. While the LoongArch instructions are four-byte aligned, the filter in liblzma has to work with unaligned buffers too.
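As a minimal sketch of what "working with unaligned buffers" implies, the helpers below read and write a 32-bit little-endian instruction word one byte at a time, so they are valid at any address. The names `read32le` and `write32le` are chosen for illustration and are not claimed to match the helpers used in the PR.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from any address, aligned or not. */
static uint32_t
read32le(const uint8_t *p)
{
	return (uint32_t)p[0]
			| ((uint32_t)p[1] << 8)
			| ((uint32_t)p[2] << 16)
			| ((uint32_t)p[3] << 24);
}

/* Store a 32-bit value as little endian at any address. */
static void
write32le(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}
```

Byte-wise access is always safe but can be slower than a single 32-bit load on processors with fast unaligned access, which is the trade-off the speed test is meant to measure.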
Since you have access to real LoongArch hardware, it could be useful to check if there's a speed difference. One should test the filter alone, for example, have a gigabyte-sized buffer and repeatedly calling the filter/unfilter function directly on the buffer (without compression), and comparing the speed: aligned vs. unaligned buffer, and
Branch conditions
In my previous post I messed up the bitmask math, sorry. I try again with the new code: The last two conditions can be combined to At this point this is a very minor silly thing, but I mentioned it anyway because it was on my mind. :-)
Static libs
Static libraries (.a files) have nothing to filter. They consist of object files (.o) where the relative addresses are filled with zeros. If a package has a few tiny static libraries or objects (.o files) and also big executables and shared libraries, then the small .a and .o files might not matter much. But otherwise one should avoid filtering .a, .o, and .ko (Linux kernel modules). The best solution is to make xz smarter and detect file types in the input stream. This has been planned for a very long time. I don't know when it will be implemented, but some version could be worth trying for the next release cycle (even a simple but kind of bad one, that is, working somewhat OK if input archive has the files sorted by type so all executables and shared libs are together). It's also possible to make the filter relatively harmless on .a and .o files by implementing zero-skipping (the relative addresses are zeros in .a and .o and .ko). This was debated at length with the ARM64 filter design. In the end, it wasn't included. It's not much code, but if smarter filtering is implemented later, then the zero-skip feature has slightly negative value. So while zero-skip is technically simple, it kind of feels complicated at the same time. :-| Because the stricter condition for Sorry about the overlong post. The important thing is to figure out if there are more instructions or instruction pairs that should be filtered. |
When you build with Nonetheless.. if you build with GNU strip should support this more easily with upcoming Binutils 2.45, thanks to H.J. Lu (https://sourceware.org/PR21479). But you can just do Clang these days supports TL;DR: don't distribute LTO'd static archives, make sure you build w/ |
d417d77 to
ec4376d
|
Thanks for your detailed post, and thanks to @thesamesam for kindly pointing out the LTO-related issue. I've told the AOSC developers to strip the LTO sections from their static libraries, which reduces the binary size enormously.
PC relative & B filtering
I've tried to add other PC-relative instructions and B to the filter; here are the results:
About the implementation of Here are the test results: skip stands for zero-skipping, and SLib means static libraries with debuginfo but without LTO data.
Based on the latest gcc/llvm toolchain and an updated AOSC Linux, with all LTO sections removed from static libraries.
Branch Conditions & Static Libs
I tried the new branch conditions, and they did speed up the filter a bit. The test results also show that, even after the LTO sections were removed, the conditions still improve redundancy, so they are still worth keeping.
Unaligned access
Strangely enough, on LoongArch hardware (a Loongson 3A6000, 4c/8t, running at 2.5 GHz), I couldn't find a steady relation between the filter's complexity and its time consumption. When dealing with large chunks of data (like the bin in the chart, which is 3.7 GiB~ large), the most complex filter (
Zero-skipping and conditional BL filtering
After implementing Sorry for the late reply; it took me a long time to figure out the best way to implement |
|
@Larhzu Gentle ping. Is there any progress we could make on this? |
|
Sorry for the delay. I was on an unannounced long break. Before the break I had drafted half a response, but I need to re-read things to remember the details and finish the reply. Hopefully I get it done next week. |
Welcome back! I'll wait for your response and keep improving this. |
|
I'm very sorry, I haven't been able to spend much time on xz since my previous message. :-(
Instructions
If LA32R can be ignored, then ignoring Your comments about
In the RISC-V filter, reordering the address bits of The internal APIs of existing BCJ filter implementations only support 32-bit program counter ( Part of me wonders if it made sense to limit the range in the filter so that the highest six bits of the immediate in If the range was limited, checking for Binaries that need more than 2 GiB of range are likely to remain uncommon outside special cases. Extending
Measuring speed
When comparing filter speed, you should compare the pure filter without any compression. When combined with LZMA/LZMA2, the speed difference of the filter variants isn't so significant; it could matter with some other compressor though. It is expected that BCJ+LZMA2 is faster to compress and decompress than LZMA2 alone, because the compression is better. So your results make sense.
Unaligned access
I'm not sure if you understood what unaligned memory access or misaligned memory access means. If a program wants to load a 32-bit value from memory, on a 32/64-bit processor it's aligned access if the value is stored in an address that is a multiple of four bytes. Otherwise it's unaligned. If the processor doesn't support fast unaligned access, compiler will generate byte-by-byte access when it doesn't know if the load or store is aligned. Try the following on Godbolt with
#include <stdint.h>
#include <string.h>

uint32_t
read32ne(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}
With GCC 15.1.0 it produces just If you change the options to If the filter code is built for a processor that doesn't support fast unaligned access, it's waste of time to do the byte-by-byte access when only one of the bytes matter in the outermost conditions in the loop. For example, By the way, the above byte-by-byte assembly doesn't seem great because it reconstructs the aligned value on stack, requiring more memory operations. There is a GCC bug report about it from 2023. The same thing happens on a few other archs. To test the speed, one could load a big file (a hundred megabytes or more) to RAM and filter/defilter it multiple times in a loop, and measure the time. If the input buffer isn't aligned, then it might be slightly slower. That is, if you have
_Alignas(4)
static uint8_t buf[256U << 20];
and load the file to start from This was quite a bit of text to explain, but it's only an implementation detail; the filtering algorithm doesn't change at all. This is about which is faster: using So this can be changed after the filter design has been finished. :-) In practice it likely won't be changed because no one will care enough to notice. If the filter is slow, then people will likely just assume that it's what it is. From #186 I understood that most LoongArch processors support fast unaligned access and thus GCC defaults to
Zero-skipping
While zero-skipping isn't a lot of code, it needs a few more characters than you have now. The If encoder skips all zeros, it means that it will never produce In addition, the decoder will decode One could make the encoder skip Correct solution: Both encoder and decoder will skip zeros without modifying them. In the encoder, The decoder can then reverse it perfectly. It will skip zeros and convert One way to think about it is that the encoder and decoder need to skip both relative zero and absolute zero. Downsides of zero-skipping:
I recommend you skip testing .a, .o, and .ko files completely unless you want to test the effects of zero-skipping. Those types of files have nothing to filter, so a filter can only make compression worse with those files.
Next steps
I would like to hear if you have thoughts about false positives with After that, I need to test the filter (including a few variants) quite a lot myself. You have done great work, which means I need to do much less, but I still need to check a few things myself to feel confident that the new filter is good. Past experience is that this takes time (one reason is that the differences between some variants can be somewhat inconsistent). There are some other FOSS topics (mostly but not only xz) in my queue which are even older than this PR and which aren't quick to finish. I feel I should prioritise those to some extent, and thus delay some of this filter work quite a bit (2-4 months), I'm sorry. If you have any ideas (like about false positives with The filter code might not be very long, but the discussions around ARM64 and RISC-V filter designs were pretty long still. So don't be discouraged, this discussion isn't terribly long yet. |
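The zero-skipping discussion above asks for an encoder and decoder that both leave zeros alone yet remain exact inverses. The exact rule given in the comment is not preserved in this copy, so the following is only one possible reversible scheme, written under assumptions: a 26-bit address field (`FIELD_MASK`) and hypothetical function names `zs_encode` and `zs_decode`. The encoder never turns a non-zero field into zero; the one value that would become zero is mapped to `pc`, which no other input can produce because relative zero is skipped.

```c
#include <stdint.h>

/* Assumed 26-bit address field; chosen for illustration only. */
#define FIELD_MASK UINT32_C(0x03FFFFFF)

/* Relative -> absolute, leaving zero untouched. */
static uint32_t
zs_encode(uint32_t rel, uint32_t pc)
{
	pc &= FIELD_MASK;
	if (rel == 0)
		return 0; /* relative zero is skipped */

	const uint32_t absolute = (rel + pc) & FIELD_MASK;

	/* The sum would be zero, which the decoder skips. Emit pc
	 * instead; pc is otherwise unused because rel == 0 is skipped. */
	return absolute == 0 ? pc : absolute;
}

/* Absolute -> relative, exact inverse of zs_encode(). */
static uint32_t
zs_decode(uint32_t absolute, uint32_t pc)
{
	pc &= FIELD_MASK;
	if (absolute == 0)
		return 0; /* absolute zero is skipped */

	const uint32_t rel = (absolute - pc) & FIELD_MASK;

	/* absolute == pc marks the value that would have encoded to 0. */
	return rel == 0 ? (0 - pc) & FIELD_MASK : rel;
}
```

The special case costs one extra comparison per converted instruction, which is part of why zero-skipping is more than a one-line change.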
The new filter ID is 0x0C.
|
Thanks for your detailed reply! I've followed your instructions and tested the fixed filter; here are the results.
PCADDI and zero-skipping
After further testing, the same instruction filter-and-skip strategy we used on The zero-skipping in PCADDU18I + JIRL is now fixed and works perfectly; many thanks for pointing me to the correct way to do it. Zero-skipping (and even the previous-instruction skipping) is mainly designed to improve the compression ratio on .o and .a files, so if smart filtering is implemented, or if we assume users will only apply the BCJ filter to executables and dynamic libraries, it could be useless. So if zero-skipping really is an obstacle to future improvements, I'd say we had better delete it.
Unaligned access and speed
I've tested the filter separately, without any compression algorithm. On my platform (Loongson 3A6000, which supports unaligned memory access), when dealing with large files (166 MiB~) there is no noticeable speed difference between
PCADDU18I + JIRL
I've tried limiting the upper 6 immediate bits of After these modifications and fixes, I believe this filter is ready for your testing. Again, many thanks for your kind and detailed reply. If you are busy with other FOSS topics, that's fine; I'll keep waiting and do as much as I can. |
This patch adds LoongArch BCJ filter support. It has been tested on real Loongson hardware and in QEMU.
It is mostly modeled on the ARM64 filter and converts BL/PCALAU12I only. Other instructions were also tested, but they had no positive effect on redundancy. The range of PCALAU12I is also limited to +/-512MB to reduce false positives.
However, when the BCJ filter is tested on binaries with debug info, the compression ratio is significantly worse than with LZMA2 alone. This is mainly due to many false positives on BL instructions, so I've added an if-statement to the BL part: the immediate is converted to an absolute address only if the instruction preceding the BL is in a set of instructions we selected; otherwise the BL is ignored. This set was derived from analyzing libraries and executables built for the LoongArch architecture.
This optimization can greatly improve the compression ratio for these kinds of libraries and executables. However, it relies on global variables, and I'm not sure whether it is reliable in a multithreaded environment. So far, a one-hour test on a Loongson 3C6000 (16C/32T) shows no data corruption. It's still a very naive attempt, so it would be much appreciated if you could provide some feedback.
Also, I used 0x0C as this filter's ID, but it's just a temporary choice for testing the filter.
Related: loongson-community/discussions#78
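To illustrate the BL conversion described above, here is a minimal sketch of the relative-to-absolute step for a single instruction word. It is not the code from this patch: the function names are hypothetical, the previous-instruction check is left out, and the field layout is the standard LoongArch one (opcode 0b010101 in bits 31..26, offs[15:0] in bits 25..10, offs[25:16] in bits 9..0).

```c
#include <stdbool.h>
#include <stdint.h>

/* BL has opcode 0b010101 in bits 31..26. */
static bool
is_bl(uint32_t instr)
{
	return (instr >> 26) == 0x15;
}

/* Gather the split 26-bit offset (counted in 4-byte instructions). */
static uint32_t
bl_get_offs26(uint32_t instr)
{
	return ((instr >> 10) & 0xFFFF) | ((instr & 0x3FF) << 16);
}

/* Scatter a 26-bit offset back into the instruction word. */
static uint32_t
bl_set_offs26(uint32_t instr, uint32_t offs26)
{
	return (instr & UINT32_C(0xFC000000))
			| ((offs26 & 0xFFFF) << 10)
			| ((offs26 >> 16) & 0x3FF);
}

/*
 * Encoder: relative -> absolute. Decoder: absolute -> relative.
 * pc is the byte offset of the instruction within the stream.
 */
static uint32_t
bl_convert(uint32_t instr, uint32_t pc, bool is_encoder)
{
	const uint32_t pc_words = pc >> 2;
	uint32_t offs = bl_get_offs26(instr);

	offs = is_encoder ? offs + pc_words : offs - pc_words;
	return bl_set_offs26(instr, offs & UINT32_C(0x03FFFFFF));
}
```

Arithmetic wraps within 26 bits, so decoding with the same `pc` restores the original word for any input; calls to the same target from different places then share the same absolute field, which is what helps LZMA2 find matches.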