
Atomics for eBPF #399


Closed

kernel-patches-bot wants to merge 8 commits into bpf-next from series/389615=>bpf-next

Conversation

kernel-patches-bot

Pull request for series with
subject: Atomics for eBPF
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=389615

kernel-patches-bot and others added 8 commits November 23, 2020 09:43
The case for JITing atomics is about to get more complicated. Let's
factor out some common code to make the review and the result more
readable.

NB: the atomics code doesn't yet use the new helper; a subsequent
patch will add its use as a side effect of other changes.

Signed-off-by: Brendan Jackman <[email protected]>
The JIT case for encoding atomic ops is about to get more
complicated. In order to make the review and the resulting code
easier to follow, let's factor out some shared helpers.

Signed-off-by: Brendan Jackman <[email protected]>
A subsequent patch will add additional atomic operations. These new
operations will use the same opcode field as the existing XADD, with
the immediate field discriminating between operations.

In preparation, rename the instruction mode to BPF_ATOMIC and start
calling the zero immediate BPF_ADD.

This is possible (it doesn't break existing valid BPF progs) because
the immediate field is currently reserved MBZ (must be zero) and
BPF_ADD is zero.
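
To make the encoding concrete, here is a minimal sketch of an
instruction-builder macro under the new naming. The macro itself is
hypothetical, but the field values follow from the above: BPF_ATOMIC
reuses the mode bits of the old BPF_XADD, and BPF_ADD is zero.

    #include <linux/bpf.h>

    /* Hypothetical helper: atomically *(u64 *)(dst_reg + off) += src_reg.
     * Because BPF_ADD == 0, an old XADD instruction (imm == 0) decodes
     * identically under the new names.
     */
    #define BPF_ATOMIC_ADD64(DST, SRC, OFF)                    \
            ((struct bpf_insn) {                               \
                    .code    = BPF_STX | BPF_DW | BPF_ATOMIC,  \
                    .dst_reg = DST,                            \
                    .src_reg = SRC,                            \
                    .off     = OFF,                            \
                    .imm     = BPF_ADD })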

All uses are removed from the tree but the BPF_XADD definition is
kept around to avoid breaking builds for people including kernel
headers.

Signed-off-by: Brendan Jackman <[email protected]>
I can't find a reason why this code is in resolve_pseudo_ldimm64;
since I'll be modifying it in a subsequent commit, tidy it up.

Signed-off-by: Brendan Jackman <[email protected]>
This value can be set in bpf_insn.imm, for BPF_ATOMIC instructions,
in order to have the previous value of the atomically-modified memory
location loaded into the src register after an atomic op is carried
out.
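
A minimal sketch, assuming the hypothetical helper name below: with
BPF_FETCH set alongside BPF_ADD in the immediate, the instruction
becomes a fetch-and-add rather than the old fire-and-forget XADD.

    #include <linux/bpf.h>

    /* Sketch: *(u64 *)(dst + off) += src, and src receives the value
     * the memory location held before the add. Without BPF_FETCH, src
     * is left untouched, preserving the old XADD behaviour.
     */
    static inline struct bpf_insn bpf_atomic_fetch_add64(int dst, int src,
                                                         short off)
    {
            return (struct bpf_insn) {
                    .code    = BPF_STX | BPF_DW | BPF_ATOMIC,
                    .dst_reg = dst,
                    .src_reg = src,
                    .off     = off,
                    .imm     = BPF_ADD | BPF_FETCH,
            };
    }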

Suggested-by: Yonghong Song <[email protected]>
Signed-off-by: Brendan Jackman <[email protected]>
These are the operations that implement atomic exchange and
compare-exchange.

They are peculiarly named because of the presence of the separate
FETCH field that tells you whether the instruction writes the value
back to the src register. Neither operation is supported without
BPF_FETCH:

- BPF_CMPSET without BPF_FETCH (i.e. an atomic compare-and-set
  without knowing whether the write was successful) isn't implemented
  by the kernel, x86, or ARM. It would be a burden on the JIT and it's
  hard to imagine a use for this operation, so it's not supported.

- BPF_SET without BPF_FETCH would be a plain atomic set, which has
  pretty limited use: all it really lets you do is atomically set
  64-bit values on 32-bit CPUs. It doesn't imply any barriers.

There are two significant design decisions made for the CMPSET
instruction:

 - This operation fundamentally has 3 operands, but we only have two
   register fields, so the operand we compare against (the kernel's
   API calls it 'old') is hard-coded to be R0. x86 has a similar
   design (and A64 doesn't have this problem).

   A potential alternative might be to encode the other operand's
   register number in the immediate field.

 - The kernel's atomic_cmpxchg returns the old value, while the C11
   userspace APIs return a boolean indicating the comparison
   result. Which should BPF do? A64 returns the old value. x86 returns
   the old value in the hard-coded register (and also sets a
   flag). That means return-old-value is easier to JIT, so BPF returns
   the old value (modeled in the sketch below).
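
A userspace model of the chosen semantics, written against C11
atomics purely for illustration (nothing here is code from the
series):

    #include <stdint.h>
    #include <stdatomic.h>

    /* Model of BPF_CMPSET | BPF_FETCH: compare *addr against 'old_val'
     * (the hard-coded R0 operand), store 'new_val' on a match, and
     * always hand back the previously-held value, x86 cmpxchg style.
     */
    static uint64_t model_cmpset_fetch(_Atomic uint64_t *addr,
                                       uint64_t old_val /* R0 */,
                                       uint64_t new_val /* src_reg */)
    {
            /* On failure, C11 writes the observed value back into
             * 'old_val'; on success it already equals that value.
             */
            atomic_compare_exchange_strong(addr, &old_val, new_val);
            return old_val;
    }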

Signed-off-by: Brendan Jackman <[email protected]>
This relies on the work done by Yonghong Song in
https://reviews.llvm.org/D72184
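
For context, the kind of C that clang lowers to these instructions
once D72184 is in place looks roughly as follows; the builtin-to-opcode
mapping in the comments and the build invocation are assumptions, not
quotes from the series.

    /* Assumed build: clang -O2 -target bpf -c atomics_example.c */
    typedef unsigned long long u64;

    u64 counter;

    int atomics_example(void)
    {
            u64 old;

            /* result unused: plain BPF_ADD (the old XADD) */
            __sync_fetch_and_add(&counter, 1);

            /* result used: BPF_ADD | BPF_FETCH */
            old = __sync_fetch_and_add(&counter, 1);

            /* BPF_CMPSET | BPF_FETCH, old value returned */
            old = __sync_val_compare_and_swap(&counter, 2, 3);

            return old == 2;
    }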

Signed-off-by: Brendan Jackman <[email protected]>
@kernel-patches-bot

Master branch: 91b2db2
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=389615
version: 1

@kernel-patches-bot

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=389615 expired. Closing PR.

@kernel-patches-bot kernel-patches-bot deleted the series/389615=>bpf-next branch November 26, 2020 03:50
kernel-patches-bot pushed a commit that referenced this pull request Jun 16, 2021
Subprograms call map_poke_track(), but on program release there is no
hook to call map_poke_untrack(). The aux memory is freed on prog release
anyway, even though we still have a reference to it in the element list
of the map aux data.

So when we run map_poke_run() we end up accessing freed memory. KASAN
reports this in prog_array_map_poke_run() as shown here.

[  402.824686] ==================================================================
[  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
[  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337

[  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
[  402.824715] Call Trace:
[  402.824719]  dump_stack+0x93/0xc2
[  402.824727]  print_address_description.constprop.0+0x1a/0x140
[  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
[  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
[  402.824744]  kasan_report.cold+0x7c/0xd8
[  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
[  402.824757]  prog_array_map_poke_run+0xc2/0x34e
[  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0

The elements concerned are walked like this,

                for (i = 0; i < elem->aux->size_poke_tab; i++) {
                        poke = &elem->aux->poke_tab[i];

So the access to size_poke_tab is the 4B read, verified by checking offsets
in the KASAN dump,

[  402.825004] The buggy address belongs to the object at ffff8881905a7800
                which belongs to the cache kmalloc-1k of size 1024
[  402.825008] The buggy address is located 320 bytes inside of
                1024-byte region [ffff8881905a7800, ffff8881905a7c00)

With pahole output,

 struct bpf_prog_aux {
 ...
        /* --- cacheline 5 boundary (320 bytes) --- */
        u32                        size_poke_tab;        /*   320     4 */
 ...

To fix this, track the map references by using a per-subprogram used_maps
array and used_map_cnt value to hold references into the maps, so that when
the subprogram is released we can untrack from the correct map using the
correct aux field.

Here we are slightly less than optimal because we insert all poke entries
into the used_maps array, even duplicates. In theory we could get by with
only unique entries, but that would require an extra collect-the-maps stage
and seems unnecessary when a duplicate only costs an extra 8B. It also keeps
the logic simpler, and the untrack hook is happy to ignore entries previously
removed (see the sketch below).
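
A rough sketch of the release-side walk this describes, assuming the
per-subprogram used_maps[]/used_map_cnt pair lives on bpf_prog_aux as
outlined above (field layout per this message, not the final upstream
code):

    #include <linux/bpf.h>

    /* Untrack this (sub)program's poke entries from every map it
     * used. Duplicate entries are harmless: the untrack hook ignores
     * entries that were already removed.
     */
    static void subprog_untrack_maps(struct bpf_prog *prog)
    {
            struct bpf_prog_aux *aux = prog->aux;
            int i;

            for (i = 0; i < aux->used_map_cnt; i++) {
                    struct bpf_map *map = aux->used_maps[i];

                    if (map->ops->map_poke_untrack)
                            map->ops->map_poke_untrack(map, aux);
            }
    }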

Reported-by: Jussi Maki <[email protected]>
Signed-off-by: John Fastabend <[email protected]>
kernel-patches-bot pushed a commit that referenced this pull request Jul 9, 2021
Subprograms are calling map_poke_track(), but on program release there is no
hook to call map_poke_untrack(). However, on program release, the aux memory
(and poke descriptor table) is freed even though we still have a reference to
it in the element list of the map aux data. When we run map_poke_run(), we then
end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():

  [...]
  [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
  [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
  [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
  [  402.824715] Call Trace:
  [  402.824719]  dump_stack+0x93/0xc2
  [  402.824727]  print_address_description.constprop.0+0x1a/0x140
  [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824744]  kasan_report.cold+0x7c/0xd8
  [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
  [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
  [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
  [...]

The elements concerned are walked as follows:

    for (i = 0; i < elem->aux->size_poke_tab; i++) {
           poke = &elem->aux->poke_tab[i];
    [...]

The access to size_poke_tab is a 4 byte read, verified by checking offsets
in the KASAN dump:

  [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                 which belongs to the cache kmalloc-1k of size 1024
  [  402.825008] The buggy address is located 320 bytes inside of
                 1024-byte region [ffff8881905a7800, ffff8881905a7c00)

The pahole output of bpf_prog_aux:

  struct bpf_prog_aux {
    [...]
    /* --- cacheline 5 boundary (320 bytes) --- */
    u32                        size_poke_tab;        /*   320     4 */
    [...]

In general, subprograms do not necessarily manage their own data structures.
For example, BTF func_info and linfo are just pointers to the main program
structure. This allows reference counting and cleanup to be done on the latter,
which simplifies their management a bit. The aux->poke_tab struct, however,
did not follow this logic. The initial proposed fix for this use-after-free
bug further embedded poke data tracking into the subprogram with proper
reference counting. However, Daniel and Alexei questioned why we were treating
these objects specially; I agree, it's unnecessary. The fix here removes the
per-subprogram poke table allocation and map tracking and instead simply points
the aux->poke_tab pointer at the main program's poke table. This way, map
tracking is simplified to the main program and we do not need to manage it
per subprogram.

This also means that bpf_prog_free_deferred(), which unwinds the program
reference counting and kfrees objects, needs to ensure that we don't try to
double-free the poke_tab when freeing the subprog structures. This is easily
solved by NULLing the poke_tab pointer. The second detail is to ensure that
per-subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
this, we add a pointer in the poke structure to point at the subprogram value
so JITs can easily check, while walking the poke_tab structure, whether the
current entry belongs to the current program (see the sketch below). The aux
pointer is stable and therefore suitable for such a comparison. On the
jit_subprogs() error path, we omit cleaning up the poke->aux field because
these are only ever referenced from the JIT side, but on error we will never
make it to the JIT, so it's fine to leave them dangling. Removing these
pointers would complicate the error path for no reason. However, we do need
to untrack all poke descriptors from the main program, as otherwise they
could race with the freeing of JIT memory from the subprograms. Lastly,
a748c69 ("bpf: propagate poke descriptors to subprograms") had an off-by-one
on the subprogram instruction index range check, as it was testing
'insn_idx >= subprog_start && insn_idx <= subprog_end'. However, subprog_end
is the next subprogram's start instruction, so the upper bound must be
strict: 'insn_idx < subprog_end'.
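
The ownership check reduces to a short guard in the JIT's fixup walk.
A simplified rendering, condensed from the x86
bpf_tail_call_direct_fixup() loop with the patching details elided:

    for (i = 0; i < prog->aux->size_poke_tab; i++) {
            poke = &prog->aux->poke_tab[i];

            /* Every subprogram shares the main program's poke_tab, and
             * poke->aux points at the owning program's aux. The pointer
             * is stable, so an identity comparison is a safe ownership
             * test.
             */
            if (poke->aux && poke->aux != prog->aux)
                    continue;

            /* ... patch this program's tail-call site ... */
    }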

Fixes: a748c69 ("bpf: propagate poke descriptors to subprograms")
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
kernel-patches-daemon-bpf bot pushed a commit that referenced this pull request Jul 24, 2025
Use the BPF_TRAMP_F_INDIRECT flag to detect struct_ops and emit the
proper prologue and epilogue for this case (see the sketch below).

With this patch, all of the struct_ops related testcases (except
struct_ops_multi_pages) pass on LoongArch.

The struct_ops_multi_pages testcase fails because the actual
image_pages_cnt is 40, which is bigger than MAX_TRAMP_IMAGE_PAGES.
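
As a sketch of the detection (the helper names and context type are
hypothetical, not the LoongArch sources): a struct_ops trampoline is
entered through an indirect call via a function pointer rather than
from a patched call site, so its prologue and epilogue must differ.

    /* Hypothetical shape of the JIT decision; only BPF_TRAMP_F_INDIRECT
     * is taken from the kernel's real flag set.
     */
    static void build_trampoline_prologue(struct jit_ctx *ctx, u32 flags)
    {
            if (flags & BPF_TRAMP_F_INDIRECT) {
                    /* struct_ops: entered via an indirect call, emit a
                     * plain function prologue, no call-site bookkeeping.
                     */
                    emit_struct_ops_prologue(ctx);
            } else {
                    /* fentry-style: entered from a patched call site. */
                    emit_ftrace_prologue(ctx);
            }
    }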

Before:

  $ sudo ./test_progs -t struct_ops -d struct_ops_multi_pages
  ...
  WATCHDOG: test case struct_ops_module/struct_ops_load executes for 10 seconds...

After:

  $ sudo ./test_progs -t struct_ops -d struct_ops_multi_pages
  ...
  #15      bad_struct_ops:OK
  ...
  #399     struct_ops_autocreate:OK
  ...
  #400     struct_ops_kptr_return:OK
  ...
  #401     struct_ops_maybe_null:OK
  ...
  #402     struct_ops_module:OK
  ...
  #404     struct_ops_no_cfi:OK
  ...
  #405     struct_ops_private_stack:SKIP
  ...
  #406     struct_ops_refcounted:OK
  Summary: 8/25 PASSED, 3 SKIPPED, 0 FAILED

Signed-off-by: Tiezhu Yang <[email protected]>