Description
Hi, I tried mimalloc in ClickHouse and ran into some (I hope) interesting issues.
The slowdown compared to the default jemalloc was significant: query processing is approximately two times slower than usual.
I looked around and have some questions, with examples, about why this is happening.
First of all, everything below was done on Linux x86-64.
The example:
#include <memory>

int main() {
    std::unique_ptr<int[]> a(new int[1ull << 30]);
    return 0;
}
With the standard allocator I see only one mmap and one munmap, which is pretty cool and expected because the allocation is huge.
strace -fe mmap,munmap ./test
...
mmap(NULL, 4294971392, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5c63616000
munmap(0x7f5c63616000, 4294971392) = 0
With mimalloc I see six mmap and eight munmap calls, with some rather big regions that are mapped twice.
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f24999000
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f24996000
munmap(0x7f8f24e71000, 193620) = 0
mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f24596000
munmap(0x7f8f24596000, 4194304) = 0
mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f24196000
munmap(0x7f8f24196000, 2531328) = 0
munmap(0x7f8f24800000, 1662976) = 0
mmap(NULL, 4294967504, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e243ff000
munmap(0x7f8e243ff000, 4294967504) = 0
mmap(NULL, 4299161808, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e23fff000
munmap(0x7f8e23fff000, 4096) = 0
munmap(0x7f8f24001000, 4186320) = 0
munmap(0x7f8e24000000, 4294967504) = 0
Then I read some code and saw https://github.com/microsoft/mimalloc/blob/master/src/segment.c#L290 and https://github.com/microsoft/mimalloc/blob/master/src/os.c#L284: segments want 4 MiB alignment, and it is highly unlikely for Linux to return such an address from a plain mmap syscall. So we munmap once, then take the slow path, which mmaps an over-sized region and munmaps two more times, because we don't want to keep the useless parts of the mapping around (https://github.com/microsoft/mimalloc/blob/master/src/os.c#L245). I am not sure this behavior is optimal -- at least on Linux we should not expect a 4 MiB-aligned address from mmap.
And is there any reuse of deallocated regions? I tried the code below and saw a lot of mmap and munmap calls -- many for each construction and destruction.
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

void Foo() {
    for (size_t i = 0; i < 10000; ++i) {
        std::unique_ptr<int[]> a(new int[1ull << 18]);
    }
}

int main() {
    std::vector<std::thread> thrs;
    for (size_t i = 0; i < 10; ++i) {
        thrs.emplace_back(Foo);
    }
    for (auto&& thr : thrs) {
        thr.join();
    }
    return 0;
}
With jemalloc and the default allocator this code finishes almost immediately:
time ./test
./test 0.02s user 0.00s system 205% cpu 0.011 total
With mimalloc it is extremely slow:
time LD_PRELOAD=mimalloc/build/libmimalloc.so ./test
LD_PRELOAD=mimalloc/build/libmimalloc.so ./test 0.42s user 4.62s system 164% cpu 3.055 total
And such a usage pattern is common, for example, in server applications: you accept a query, process it, then accept the next one, and memory reuse helps to avoid the syscall penalty.
So the question is -- what are the best practices of using mimalloc? :)