
Atomics for eBPF #399


Closed

kernel-patches-bot wants to merge 8 commits into bpf-next from series/389615=>bpf-next

Conversation

kernel-patches-bot

Pull request for series with
subject: Atomics for eBPF
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=389615

kernel-patches-bot and others added 8 commits November 23, 2020 09:43
The case for JITing atomics is about to get more complicated. Let's
factor out some common code to make the review and the result more
readable.

NB: the atomics code doesn't yet use the new helper; a subsequent
patch will add its use as a side effect of other changes.

Signed-off-by: Brendan Jackman <[email protected]>
The JIT case for encoding atomic ops is about to get more
complicated. In order to make the review and the resulting code
easier to follow, let's factor out some shared helpers.

Signed-off-by: Brendan Jackman <[email protected]>
A subsequent patch will add additional atomic operations. These new
operations will use the same opcode field as the existing XADD, with
the immediate field discriminating between operations.

In preparation, rename the instruction mode to BPF_ATOMIC and start
calling the zero immediate BPF_ADD.

This is possible (it doesn't break existing valid BPF progs) because
the immediate field is currently reserved MBZ (must be zero) and
BPF_ADD is zero.
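
To make the encoding concrete, here is a minimal sketch of an
instruction-builder macro under the new naming. The macro itself is
hypothetical, but the field values follow from the above: BPF_ATOMIC
reuses the mode bits of the old BPF_XADD, and BPF_ADD is zero.

    #include <linux/bpf.h>

    /* Hypothetical helper: atomically *(u64 *)(dst_reg + off) += src_reg.
     * Because BPF_ADD == 0, an old XADD instruction (imm == 0) decodes
     * identically under the new names.
     */
    #define BPF_ATOMIC_ADD64(DST, SRC, OFF)                    \
            ((struct bpf_insn) {                               \
                    .code    = BPF_STX | BPF_DW | BPF_ATOMIC,  \
                    .dst_reg = DST,                            \
                    .src_reg = SRC,                            \
                    .off     = OFF,                            \
                    .imm     = BPF_ADD })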

All uses are removed from the tree but the BPF_XADD definition is
kept around to avoid breaking builds for people including kernel
headers.

Signed-off-by: Brendan Jackman <[email protected]>
I can't find a reason why this code is in resolve_pseudo_ldimm64;
since I'll be modifying it in a subsequent commit, tidy it up.

Signed-off-by: Brendan Jackman <[email protected]>
This value can be set in bpf_insn.imm, for BPF_ATOMIC instructions,
in order to have the previous value of the atomically-modified memory
location loaded into the src register after an atomic op is carried
out.
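
A minimal sketch, assuming the hypothetical helper name below: with
BPF_FETCH set alongside BPF_ADD in the immediate, the instruction
becomes a fetch-and-add rather than the old fire-and-forget XADD.

    #include <linux/bpf.h>

    /* Sketch: *(u64 *)(dst + off) += src, and src receives the value
     * the memory location held before the add. Without BPF_FETCH, src
     * is left untouched, preserving the old XADD behaviour.
     */
    static inline struct bpf_insn bpf_atomic_fetch_add64(int dst, int src,
                                                         short off)
    {
            return (struct bpf_insn) {
                    .code    = BPF_STX | BPF_DW | BPF_ATOMIC,
                    .dst_reg = dst,
                    .src_reg = src,
                    .off     = off,
                    .imm     = BPF_ADD | BPF_FETCH,
            };
    }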

Suggested-by: Yonghong Song <[email protected]>
Signed-off-by: Brendan Jackman <[email protected]>
These are the operations that implement atomic exchange and
compare-exchange.

They are peculiarly named because of the presence of the separate
FETCH field that tells you whether the instruction writes the value
back to the src register. Neither operation is supported without
BPF_FETCH:

- BPF_CMPSET without BPF_FETCH (i.e. an atomic compare-and-set
  without knowing whether the write was successful) isn't implemented
  by the kernel, x86, or ARM. It would be a burden on the JIT and it's
  hard to imagine a use for this operation, so it's not supported.

- BPF_SET without BPF_FETCH would be a plain atomic set, which has
  pretty limited use: all it really lets you do is atomically set
  64-bit values on 32-bit CPUs. It doesn't imply any barriers.

There are two significant design decisions made for the CMPSET
instruction:

 - This operation fundamentally has 3 operands, but we only have two
   register fields, so the operand we compare against (the kernel's
   API calls it 'old') is hard-coded to be R0. x86 has a similar
   design (and A64 doesn't have this problem).

   A potential alternative might be to encode the other operand's
   register number in the immediate field.

 - The kernel's atomic_cmpxchg returns the old value, while the C11
   userspace APIs return a boolean indicating the comparison
   result. Which should BPF do? A64 returns the old value. x86 returns
   the old value in the hard-coded register (and also sets a
   flag). That means return-old-value is easier to JIT, so BPF returns
   the old value (modeled in the sketch below).
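
A userspace model of the chosen semantics, written against C11
atomics purely for illustration (nothing here is code from the
series):

    #include <stdint.h>
    #include <stdatomic.h>

    /* Model of BPF_CMPSET | BPF_FETCH: compare *addr against 'old_val'
     * (the hard-coded R0 operand), store 'new_val' on a match, and
     * always hand back the previously-held value, x86 cmpxchg style.
     */
    static uint64_t model_cmpset_fetch(_Atomic uint64_t *addr,
                                       uint64_t old_val /* R0 */,
                                       uint64_t new_val /* src_reg */)
    {
            /* On failure, C11 writes the observed value back into
             * 'old_val'; on success it already equals that value.
             */
            atomic_compare_exchange_strong(addr, &old_val, new_val);
            return old_val;
    }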

Signed-off-by: Brendan Jackman <[email protected]>
This relies on the work done by Yonghong Song in
https://reviews.llvm.org/D72184
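
For context, the kind of C that clang lowers to these instructions
once D72184 is in place looks roughly as follows; the builtin-to-opcode
mapping in the comments and the build invocation are assumptions, not
quotes from the series.

    /* Assumed build: clang -O2 -target bpf -c atomics_example.c */
    typedef unsigned long long u64;

    u64 counter;

    int atomics_example(void)
    {
            u64 old;

            /* result unused: plain BPF_ADD (the old XADD) */
            __sync_fetch_and_add(&counter, 1);

            /* result used: BPF_ADD | BPF_FETCH */
            old = __sync_fetch_and_add(&counter, 1);

            /* BPF_CMPSET | BPF_FETCH, old value returned */
            old = __sync_val_compare_and_swap(&counter, 2, 3);

            return old == 2;
    }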

Signed-off-by: Brendan Jackman <[email protected]>
@kernel-patches-bot

Master branch: 91b2db2
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=389615
version: 1

@kernel-patches-bot

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=389615 expired. Closing PR.

@kernel-patches-bot kernel-patches-bot deleted the series/389615=>bpf-next branch November 26, 2020 03:50
kernel-patches-bot pushed a commit that referenced this pull request Jun 16, 2021
Subprograms call map_poke_track(), but on program release there is no
hook to call map_poke_untrack(). The aux memory is freed on prog release
anyway, even though we still have a reference to it in the element list
of the map aux data.

So when we run map_poke_run() we end up accessing freed memory. KASAN
reports this in prog_array_map_poke_run() as shown here.

[  402.824686] ==================================================================
[  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
[  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337

[  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
[  402.824715] Call Trace:
[  402.824719]  dump_stack+0x93/0xc2
[  402.824727]  print_address_description.constprop.0+0x1a/0x140
[  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
[  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
[  402.824744]  kasan_report.cold+0x7c/0xd8
[  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
[  402.824757]  prog_array_map_poke_run+0xc2/0x34e
[  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0

The elements concerned are walked like this,

                for (i = 0; i < elem->aux->size_poke_tab; i++) {
                        poke = &elem->aux->poke_tab[i];

So the access to size_poke_tab is the 4B read, verified by checking offsets
in the KASAN dump,

[  402.825004] The buggy address belongs to the object at ffff8881905a7800
                which belongs to the cache kmalloc-1k of size 1024
[  402.825008] The buggy address is located 320 bytes inside of
                1024-byte region [ffff8881905a7800, ffff8881905a7c00)

With pahole output,

 struct bpf_prog_aux {
 ...
        /* --- cacheline 5 boundary (320 bytes) --- */
        u32                        size_poke_tab;        /*   320     4 */
 ...

To fix this, track the map references by using a per-subprogram used_maps
array and used_map_cnt value to hold references into the maps, so that when
the subprogram is released we can untrack from the correct map using the
correct aux field.

Here we are slightly less than optimal because we insert all poke entries
into the used_maps array, even duplicates. In theory we could get by with
only unique entries, but that would require an extra collect-the-maps stage
and seems unnecessary when a duplicate only costs an extra 8B. It also keeps
the logic simpler, and the untrack hook is happy to ignore entries previously
removed (see the sketch below).
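
A rough sketch of the release-side walk this describes, assuming the
per-subprogram used_maps[]/used_map_cnt pair lives on bpf_prog_aux as
outlined above (field layout per this message, not the final upstream
code):

    #include <linux/bpf.h>

    /* Untrack this (sub)program's poke entries from every map it
     * used. Duplicate entries are harmless: the untrack hook ignores
     * entries that were already removed.
     */
    static void subprog_untrack_maps(struct bpf_prog *prog)
    {
            struct bpf_prog_aux *aux = prog->aux;
            int i;

            for (i = 0; i < aux->used_map_cnt; i++) {
                    struct bpf_map *map = aux->used_maps[i];

                    if (map->ops->map_poke_untrack)
                            map->ops->map_poke_untrack(map, aux);
            }
    }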

Reported-by: Jussi Maki <[email protected]>
Signed-off-by: John Fastabend <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Jul 9, 2021
Subprograms are calling map_poke_track(), but on program release there is no
hook to call map_poke_untrack(). However, on program release, the aux memory
(and poke descriptor table) is freed even though we still have a reference to
it in the element list of the map aux data. When we run map_poke_run(), we then
end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():

  [...]
  [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
  [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
  [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
  [  402.824715] Call Trace:
  [  402.824719]  dump_stack+0x93/0xc2
  [  402.824727]  print_address_description.constprop.0+0x1a/0x140
  [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824744]  kasan_report.cold+0x7c/0xd8
  [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
  [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
  [...]

The elements concerned are walked as follows:

    for (i = 0; i < elem->aux->size_poke_tab; i++) {
           poke = &elem->aux->poke_tab[i];
    [...]

The access to size_poke_tab is a 4 byte read, verified by checking offsets
in the KASAN dump:

  [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                 which belongs to the cache kmalloc-1k of size 1024
  [  402.825008] The buggy address is located 320 bytes inside of
                 1024-byte region [ffff8881905a7800, ffff8881905a7c00)

The pahole output of bpf_prog_aux:

  struct bpf_prog_aux {
    [...]
    /* --- cacheline 5 boundary (320 bytes) --- */
    u32                        size_poke_tab;        /*   320     4 */
    [...]

In general, subprograms do not necessarily manage their own data structures.
For example, BTF func_info and linfo are just pointers to the main program
structure. This allows reference counting and cleanup to be done on the latter,
which simplifies their management a bit. The aux->poke_tab struct, however,
did not follow this logic. The initial proposed fix for this use-after-free
bug further embedded poke data tracking into the subprogram with proper
reference counting. However, Daniel and Alexei questioned why we were treating
these objects specially; I agree, it's unnecessary. The fix here removes the
per-subprogram poke table allocation and map tracking and instead simply points
the aux->poke_tab pointer at the main program's poke table. This way, map
tracking is simplified to the main program and we do not need to manage it
per subprogram.

This also means that bpf_prog_free_deferred(), which unwinds the program
reference counting and kfrees objects, needs to ensure that we don't try to
double-free the poke_tab when freeing the subprog structures. This is easily
solved by NULLing the poke_tab pointer. The second detail is to ensure that
per-subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
this, we add a pointer in the poke structure to point at the subprogram value
so JITs can easily check, while walking the poke_tab structure, whether the
current entry belongs to the current program (see the sketch below). The aux
pointer is stable and therefore suitable for such a comparison. On the
jit_subprogs() error path, we omit cleaning up the poke->aux field because
these are only ever referenced from the JIT side, but on error we will never
make it to the JIT, so it's fine to leave them dangling. Removing these
pointers would complicate the error path for no reason. However, we do need
to untrack all poke descriptors from the main program, as otherwise they
could race with the freeing of JIT memory from the subprograms. Lastly,
a748c69 ("bpf: propagate poke descriptors to subprograms") had an off-by-one
on the subprogram instruction index range check, as it was testing
'insn_idx >= subprog_start && insn_idx <= subprog_end'. However, subprog_end
is the next subprogram's start instruction, so the upper bound must be
strict: 'insn_idx < subprog_end'.
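
The ownership check reduces to a short guard in the JIT's fixup walk.
A simplified rendering, condensed from the x86
bpf_tail_call_direct_fixup() loop with the patching details elided:

    for (i = 0; i < prog->aux->size_poke_tab; i++) {
            poke = &prog->aux->poke_tab[i];

            /* Every subprogram shares the main program's poke_tab, and
             * poke->aux points at the owning program's aux. The pointer
             * is stable, so an identity comparison is a safe ownership
             * test.
             */
            if (poke->aux && poke->aux != prog->aux)
                    continue;

            /* ... patch this program's tail-call site ... */
    }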

Fixes: a748c69 ("bpf: propagate poke descriptors to subprograms")
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
kernel-patches-daemon-bpf bot pushed a commit that referenced this pull request Jul 24, 2025
Use the BPF_TRAMP_F_INDIRECT flag to detect struct_ops and emit the
proper prologue and epilogue for this case (see the sketch below).

With this patch, all of the struct_ops related testcases (except
struct_ops_multi_pages) pass on LoongArch.

The struct_ops_multi_pages testcase fails because the actual
image_pages_cnt is 40, which is bigger than MAX_TRAMP_IMAGE_PAGES.
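
As a sketch of the detection (the helper names and context type are
hypothetical, not the LoongArch sources): a struct_ops trampoline is
entered through an indirect call via a function pointer rather than
from a patched call site, so its prologue and epilogue must differ.

    /* Hypothetical shape of the JIT decision; only BPF_TRAMP_F_INDIRECT
     * is taken from the kernel's real flag set.
     */
    static void build_trampoline_prologue(struct jit_ctx *ctx, u32 flags)
    {
            if (flags & BPF_TRAMP_F_INDIRECT) {
                    /* struct_ops: entered via an indirect call, emit a
                     * plain function prologue, no call-site bookkeeping.
                     */
                    emit_struct_ops_prologue(ctx);
            } else {
                    /* fentry-style: entered from a patched call site. */
                    emit_ftrace_prologue(ctx);
            }
    }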

Before:

  $ sudo ./test_progs -t struct_ops -d struct_ops_multi_pages
  ...
  WATCHDOG: test case struct_ops_module/struct_ops_load executes for 10 seconds...

After:

  $ sudo ./test_progs -t struct_ops -d struct_ops_multi_pages
  ...
  #15      bad_struct_ops:OK
  ...
  #399     struct_ops_autocreate:OK
  ...
  #400     struct_ops_kptr_return:OK
  ...
  #401     struct_ops_maybe_null:OK
  ...
  #402     struct_ops_module:OK
  ...
  #404     struct_ops_no_cfi:OK
  ...
  #405     struct_ops_private_stack:SKIP
  ...
  #406     struct_ops_refcounted:OK
  Summary: 8/25 PASSED, 3 SKIPPED, 0 FAILED

Signed-off-by: Tiezhu Yang <[email protected]>