dma: Automatically enable page touching on Caspian #5

dwmw2 · 2024-07-23T22:42:15Z

On systems with memory overcommit, although no more pages will be removed without the consent of virtio-balloon, not all pages are guaranteed to be present at boot time. They need to be faulted in when mapped for DMA.

Enable this by default for the relevant EC2 instance types to avoid a lot of time wasted in building separate images with the command line option.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

… avoid early OOMs This is the improved workaround to avoid early OOMs within cgroup v1 by throttling the memory reclaim given dirty/writeback pages under the GFP_NOFS allocations. Increment sleeping time exponentialy until a limit after half the number of maximum retries when writeback+dirty pages goes beyond a certain threshold before next retry occurs. This solution can not only help to prevent early OOMs on some extreme workload but also avoid unnecessary throttling on general cases. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207273 Suggested-by: Michal Hocko <[email protected]> Signed-off-by: Shaoying Xu <[email protected]>

(/sys/devices/system/memory/probe) can accept starting physical address of an entire memory block to be hot added into the kernel. This is in addition to the existing ACPI based interface. This just enables it with the required config CONFIG_ARCH_MEMORY_PROBE. Signed-off-by: Anshuman Khandual <[email protected]>

Issue: Offlining non-boot memory on arm64 via /sys/devices/system/memory/<mem_id>/state doesnt eliminate the struct page memory associated with the offlined memory. As memory is offlined, total and free memory reduce but the memory associated with struct page isnt given back and is reported as 'used' memory instead. This is because offlining via the sysfs 'state' probe doesnt remove the memmap associated with the memory to be offlined. Fix: Expose a sysfs probe that also removes memmap associated with the memory block after offlining it. Probe exposed accepts the physical address of a memory block to be removed. Signed-off-by: Rohit Wali <[email protected]>

Since commit e1c158e ("mm/memory_hotplug: remove nid parameter from remove_memory() and friends"), offline_and_remove_memory() no longer takes a node id arguments. Adapt. Signed-off-by: Frank van der Linden <[email protected]>

If it is possible to use MHP_MEMMAP_ON_MEMORY from the probe interface, which should normally be the case, do so. Signed-off-by: Frank van der Linden <[email protected]>

Add an interface to report offlined pages as free to the hypervisor. Define a new entry point for page reporting drivers, report_offline. If a driver sets it, it will be called after a range of memory has been offlined. This is done separately, and not with a memory notifier, since with memmap_on_memory, there are pages that are only freed outside of offline_pages, where the notifiers are called. Since this will be called asynchronously (e.g. not from the page reporting work queues), protect it with the page reporting mutex so that a driver can't be unloaded while calling the entry point. Signed-off-by: Frank van der Linden <[email protected]>

When reporting offlined pages through free page reporting, and memmap_on_memory is active, we don't want to touch the page structures anymore, since that will lead to a reference to the range we just offlined, as the page structures themselves reside in the range. So, we can't use sg_phys to set the dma address. Instead, if sg_page is set to NULL, assume that sg_dma_address is set already, and use it. Signed-off-by: Frank van der Linden <[email protected]>

A hack to report offlined memory ranges through virtio-balloon. Do this by registering a memory notifier callback for offlining, and then calling the normal free page reporting entry point to report the range that was just offlined. This is only active if the virtio_balloon.report_offline module parameter is set. Signed-off-by: Frank van der Linden <[email protected]>

Allows enabling page touching via a kernel command line parameter. When enabled, devices which don't have an IOMMU assigned to them will be assigned the page touching DMA map ops which ensures that any memory mapped for DMA by that devices will be accessed by the CPU to make it resident. Signed-off-by: James Gowans <[email protected]> Cc-Team: kaos-brimstone <[email protected]> Cc-Team: ec2-memo <[email protected]>

To force a page into residence, a read operation is performed on behalf of devices without an IOMMU. This functionality is required to facilitate memory overcommitted hosts. Commit 25d4ce2 ("Introduce page touching DMA ops binding") initially introduced this logic by invoking a '__raw_readl' function. This function can however read past the bounds of memory mapped for DMA. Instead, it is replaced with '__raw_readb'. This limits the length of memory read to a byte, and prevents reading past the range of mapped memory. Fixes: 25d4ce2 ("Introduce page touching DMA ops binding") Signed-off-by: Tighe Barris <[email protected]> Cc-Team: kaos-brimstone <[email protected]> Cc-Team: ec2-memo <[email protected]>

There's currently an issue with Xen and KASLR causing hibernation to break (and possibly kexec/kdump too). Until we have got to the bottom of this and fixed the root cause, let's disable KASLR at runtime when running on Xen instances so we can enable it for Nitro. This also adds a boot message to match ARM and help detect whether this test worked as expected. Signed-off-by: Benjamin Herrenschmidt <[email protected]>

Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

This provides a central place to maintain out-of-tree drivers. Renamed from VENDOR_AMAZON because the name was no longer appropriate. Signed-off-by: Munehisa Kamata <[email protected]> Reviewed-by: Cristian Gafton <[email protected]> Reviewed-by: Guru Anbalagane <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Reviewed-by: Anchal Agarwal <[email protected]> Signed-off-by: Vallish Vaidyeshwara <[email protected]>

EFA: driver version 2.1.0 (4929d24) ENA: driver version 2.8.0 (a03137e) igb_uio: add (adc137d)

Source: https://github.com/amzn/amzn-drivers/ Change Log: ## r2.8.1 release notes **New Features** * Add extended metrics mechanism support * Add conntrack customer metric to ethtool **Bug Fixes** * Fix compilation issues on SLES 15 SP4 * Fix compilation errors in RHEL 8.7, 9.0 * Configure TX rings mem policy in reset flow **Minor Changes** * Add napi_build_skb support * Add napi_consume_skb * Align ena_alloc_map_page signature * Move from strlcpy with unused retval to strscpy * Add status check for strscpy calls * Backport napi_alloc_skb usage

Source: https://github.com/amzn/amzn-drivers/ Change Log: ## r2.1.1 release notes * Fix dmabuf backport for some kernels

Squash the following 2 patches into 1 as they accomplish the same goal - setting which algorithims are availble for fips use in 6.1. not-for-upstream: testmgr config changes to enable FIPS boot The Federal Information Processing Standard (FIPS) Publication 140-2, is a computer security standard, developed by a U.S. Government and industry working group to validate the quality of cryptographic modules. Enabling FIPS mode involves the following steps: a. prelinking needs to be disabled. PRELINKING=no in /etc/sysconfig/prelink b. Install dracut-fips package # yum install dracut-fips. Installing dracut-fipes enables module signing by default and also enables scripts that do FIPS integrity verification, regardless of whether FIPS mode is on. If FIPS mode is on, and verification failure is detected, then syste will panic. c. Recreate initramfs # dracut -v -f d. Modify kernel command line to include the following option fips=1. For gaub2 based system add fips=1 to the end of the CMDLINE in /etc/default/grub and then run the following command # grub2-mkconfig -o /boot/grub2/grub.cfg e. Reboot the system. In FIPS mode, some self tests are run by dracut-fips package which is otherwise not the case for kernel not running in FIPS mode. The changes in the tests mentioned in this CR is only relevant for kernel running in FIPS mode. In this changeset, we enable/disable cryptographic algorithms in FIPS mode to make sure that we enable the tests that are supportedand disable the tests that are not supported in our kernel. Among the tests that are not supported are the SHA3 family of tests and their hmac versions. Also gcm(aesni) is disabled as the support is currently missing in the kernel. Also we should remember that, this change is not an effort to make the kernel FIPS compliant. FIPS compliance needs to be done by certified authority. This change is about adding support for FIPS mode. Running official FIPS compliance may necessiate support for additional cryptographic algorithms or remove fips_enabed flag in the tests for few algorithms as the need may arise. FIPS mode for a test is disabled by removing fips_enabled = 1 from the test description in testmgr.c. Adding support is more involved. The test needs to be implemented and pointed to in the structure used to describe the test. In FIPS mode, only the tests that are tagged with fips_enabled=1 are run and rest of the tests are ignored. So if you are not sure about an algorithm which needs to be enabled in FIPS mode, it needs to be disabled in testmgr.c. NU: because FIPS enablement is distro specific. Signed-off-by: Alakesh Haloi <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Reviewed-by: Anchal Agarwal <[email protected]> Reviewed-by: Cristian Gafton <[email protected]> Reviewed-by: Frederick Lefebvre <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Vallish Vaidyeshwara <[email protected]> enable rfc4106(gcm(aes)) for fips This alogrithim works with no additional changes required and has been requested by a customer, so enable it

To differentiate between Xen suspend, PM suspend and PM hibernation, keep track of the on-going suspend mode by mainly using a new PM notifier. Since Xen suspend doesn't have corresponding PM event, its main logic is modfied to acquire pm_mutex and set the current mode. Note that we may see deadlock if PM suspend/hibernation is interrupted by Xen suspend. PM suspend/hibernation depends on xenwatch thread to process xenbus state transactions, but the thread will sleep to wait pm_mutex which is already held by PM suspend/hibernation context in the scenario. Though, acquirng pm_mutex is still right thing to do, and we would need to modify Xen shutdown code to avoid the issue. This will be fixed by a separate patch. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Sebastian Biemueller <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> [6.1: Handle sleep flags for unlock_system_sleep()] Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Introduce simple functions which help to know the on-going suspend mode so that other Xen-related code can behave differently according to the current suspend mode. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Alakesh Haloi <[email protected]> Reviewed-by: Sebastian Biemueller <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Since commit b3e96c0 ("xen: use freeze/restore/thaw PM events for suspend/resume/chkpt"), xenbus uses PMSG_FREEZE, PMSG_THAW and PMSG_RESTORE events for Xen suspend. However, they're actually assigned to xenbus_dev_suspend(), xenbus_dev_cancel() and xenbus_dev_resume() respectively, and only suspend and resume callbacks are supported at driver level. To support PM suspend and PM hibernation, modify the bus level PM callbacks to invoke not only device driver's suspend/resume but also freeze/thaw/restore. Note that we'll use freeze/restore callbacks even for PM suspend whereas suspend/resume callbacks are normally used in the case, becausae the existing xenbus device drivers already have suspend/resume callbacks specifically designed for Xen suspend. So we can allow the device drivers to keep the existing callbacks wihtout modification. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Introduce a small function which re-uses shared page's PA allocated during guest initialization time in reserve_shared_info() and not allocate new page during resume flow. It also does the mapping of shared_info_page by calling xen_hvm_init_shared_info() to use the function. Backport Notes: We don't need this commit 8d5ce0dad4ab2a4c8c8a3c36f6fb8c46b695b053 ("x86/xen: decouple shared_info mapping from xen_hvm_init_shared_info()") here since xen_hvm_init_shared_info changed in 4.14 kernel just to do the mapping and allocation of shared page is done in a separate function. We don't need to decouple this kernel API anymore Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Sebastian Biemueller <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Add Xen PVHVM specific system core callbacks for PM suspend and hibernation support. The callbacks suspend and resume Xen primitives, like shared_info, pvclock and grant table. Note that Xen suspend can handle them in a different manner, but system core callbacks are called from the context. So if the callbacks are called from Xen suspend context, return immediately. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Add freeze and restore callbacks for PM suspend and hibernation support. The freeze handler stops a block-layer queue and disconnect the frontend from the backend while freeing ring_info and associated resources. The restore handler re-allocates ring_info and re-connect to the backedend, so the rest of the kernel can continue to use the block device transparently.Also, the handlers are used for both PM suspend and hibernation so that we can keep the existing suspend/resume callbacks for Xen suspend without modification. If a backend doesn't have commit 12ea729 ("xen/blkback: unmap all persistent grants when frontend gets disconnected"), the frontend may see massive amount of grant table warning when freeing resources. [ 36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff) [ 36.855089] xen:grant_table: WARNING: g.e. 0x112 still in use! In this case, persistent grants would need to be disabled. Ensure no reqs/rsps in rings before disconnecting. When disconnecting the frontend from the backend in blkfront_freeze(), there still may be unconsumed requests or responses in the rings, especially when the backend is backed by network-based device. If the frontend gets disconnected with such reqs/rsps remaining there, it can cause grant warnings and/or losing reqs/rsps by freeing pages afterward. This can lead resumed kernel into unrecoverable state like unexpected freeing of grant page and/or hung task due to the lost reqs or rsps. Therefore we have to ensure that there is no unconsumed requests or responses before disconnecting. Actually, the frontend just needs to wait for some amount of time so that the backend can process the requests, put responses and notify the frontend back. Timeout used here is based on some heuristic. If we somehow hit the timeout, it would mean something serious happens in the backend, the frontend will just return an error to PM core and PM suspend/hibernation will be aborted. This may be something should be fixed by the backend side, but a frontend side fix is probably still worth doing to work with broader backends. Backport Note: Unlike 4.9 kernel, blk-mq is default for 4.14 kernel and request-based mode cod eis not included in this frontend driver. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Add freeze and restore callbacks for PM suspend and hibernation support. The freeze handler simply disconnects the frotnend from the backend and frees resources associated with queues after disabling the net_device from the system. The restore handler just changes the frontend state and let the xenbus handler to re-allocate the resources and re-connect to the backend. This can be performed transparently to the rest of the system. The handlers are used for both PM suspend and hibernation so that we can keep the existing suspend/resume callbacks for Xen suspend without modification. Freezing netfront devices is normally expected to finish within a few hundred milliseconds, but it can rarely take more than 5 seconds and hit the hard coded timeout, it would depend on backend state which may be congested and/or have complex configuration. While it's rare case, longer default timeout seems a bit more reasonable here to avoid hitting the timeout. Also, make it configurable via module parameter so that we can cover broader setups than what we know currently. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Currently, steal time accounting code in scheduler expects steal clock callback to provide monotonically increasing value. If the accounting code receives a smaller value than previous one, it uses a negative value to calculate steal time and results in incorrectly updated idle and steal time accounting. This breaks userspace tools which read /proc/stat. top - 08:05:35 up 2:12, 3 users, load average: 0.00, 0.07, 0.23 Tasks: 80 total, 1 running, 79 sleeping, 0 stopped, 0 zombie Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,30100.0%id, 0.0%wa, 0.0%hi, 0.0%si,-1253874204672.0%st This can actually happen when a Xen PVHVM guest gets restored from hibernation, because such a restored guest is just a fresh domain from Xen perspective and the time information in runstate info starts over from scratch. This patch introduces xen_save_steal_clock() which saves current values in runstate info into per-cpu variables. Its couterpart, xen_restore_steal_clock(), sets offset if it found the current values in runstate info are smaller than previous ones. xen_steal_clock() is also modified to use the offset to ensure that scheduler only sees monotonically increasing number. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Save steal clock values of all present CPUs in the system core ops suspend callbacks. Also, restore a boot CPU's steal clock in the system core resume callback. For non-boot CPUs, restore after they're brought up, because runstate info for non-boot CPUs are not active until then. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Add a simple helper function to "shutdown" active PIRQs, which actually closes event channels but keeps related IRQ structures intact. PM suspend/hibernation code will rely on this. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

Close event channels allocated for devices which are backed by PIRQ and still active when suspending the system core. Normally, the devices are emulated legacy devices, e.g. PS/2 keyboard, floppy controller and etc. Without this, in PM hibernation, information about the event channel remains in hibernation image, but there is no guarantee that the same event channel numbers are assigned to the devices when restoring the system. This may cause conflict like the following and prevent some devices from being restored correctly. [ 102.330821] ------------[ cut here ]------------ [ 102.333264] WARNING: CPU: 0 PID: 2324 at drivers/xen/events/events_base.c:878 bind_evtchn_to_irq+0x88/0xf0 ... [ 102.348057] Call Trace: [ 102.348057] [<ffffffff813001df>] dump_stack+0x63/0x84 [ 102.348057] [<ffffffff81071811>] __warn+0xd1/0xf0 [ 102.348057] [<ffffffff810718fd>] warn_slowpath_null+0x1d/0x20 [ 102.348057] [<ffffffff8139a1f8>] bind_evtchn_to_irq+0x88/0xf0 [ 102.348057] [<ffffffffa00cd420>] ? blkif_copy_from_grant+0xb0/0xb0 [xen_blkfront] [ 102.348057] [<ffffffff8139a307>] bind_evtchn_to_irqhandler+0x27/0x80 [ 102.348057] [<ffffffffa00cc785>] talk_to_blkback+0x425/0xcd0 [xen_blkfront] [ 102.348057] [<ffffffff811e0c8a>] ? __kmalloc+0x1ea/0x200 [ 102.348057] [<ffffffffa00ce84d>] blkfront_restore+0x2d/0x60 [xen_blkfront] [ 102.348057] [<ffffffff813a0078>] xenbus_dev_restore+0x58/0x100 [ 102.348057] [<ffffffff813a1ff0>] ? xenbus_frontend_delayed_resume+0x20/0x20 [ 102.348057] [<ffffffff813a200e>] xenbus_dev_cond_restore+0x1e/0x30 [ 102.348057] [<ffffffff813f797e>] dpm_run_callback+0x4e/0x130 [ 102.348057] [<ffffffff813f7f17>] device_resume+0xe7/0x210 [ 102.348057] [<ffffffff813f7810>] ? pm_dev_dbg+0x80/0x80 [ 102.348057] [<ffffffff813f9374>] dpm_resume+0x114/0x2f0 [ 102.348057] [<ffffffff810c00cf>] hibernation_snapshot+0x15f/0x380 [ 102.348057] [<ffffffff810c0ac3>] hibernate+0x183/0x290 [ 102.348057] [<ffffffff810be1af>] state_store+0xcf/0xe0 [ 102.348057] [<ffffffff813020bf>] kobj_attr_store+0xf/0x20 [ 102.348057] [<ffffffff8127c88a>] sysfs_kf_write+0x3a/0x50 [ 102.348057] [<ffffffff8127c3bb>] kernfs_fop_write+0x10b/0x190 [ 102.348057] [<ffffffff81200008>] __vfs_write+0x28/0x120 [ 102.348057] [<ffffffff81200c19>] ? rw_verify_area+0x49/0xb0 [ 102.348057] [<ffffffff81200e62>] vfs_write+0xb2/0x1b0 [ 102.348057] [<ffffffff81202196>] SyS_write+0x46/0xa0 [ 102.348057] [<ffffffff81520cf7>] entry_SYSCALL_64_fastpath+0x1a/0xa9 [ 102.423005] ---[ end trace b8d6718e22e2b107 ]--- [ 102.425031] genirq: Flags mismatch irq 6. 00000000 (blkif) vs. 00000000 (floppy) Note that we don't explicitly re-allocate event channels for such devices in the resume callback. Re-allocation will occur when PM core re-enable IRQs for the devices at later point. Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

The SNAPSHOT_SET_SWAP_AREA is supposed to be used to set the hibernation offset on a running kernel to enable hibernating to a swap file. However, it doesn't actually update the swsusp_resume_block variable. As a result, the hibernation fails at the last step (after all the data is written out) in the validation of the swap signature in mark_swapfiles(). Before this patch, the command line processing was the only place where swsusp_resume_block was set. Signed-off-by: Aleksei Besogonov <[email protected]> Signed-off-by: Munehisa Kamata <[email protected]> Signed-off-by: Anchal Agarwal <[email protected]> Reviewed-by: Munehisa Kamata <[email protected]> Reviewed-by: Eduardo Valentin <[email protected]> Signed-off-by: Samuel Mendoza-Jonas <[email protected]>

commit fb1a313 upstream. Function mlx5e_rep_neigh_update() wasn't updated to accommodate rtnl lock removal from TC filter update path and properly handle concurrent encap entry insertion/deletion which can lead to following use-after-free: [23827.464923] ================================================================== [23827.469446] BUG: KASAN: use-after-free in mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.470971] Read of size 4 at addr ffff8881d132228c by task kworker/u20:6/21635 [23827.472251] [23827.472615] CPU: 9 PID: 21635 Comm: kworker/u20:6 Not tainted 5.13.0-rc3+ #5 [23827.473788] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [23827.475639] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core] [23827.476731] Call Trace: [23827.477260] dump_stack+0xbb/0x107 [23827.477906] print_address_description.constprop.0+0x18/0x140 [23827.478896] ? mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.479879] ? mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.480905] kasan_report.cold+0x7c/0xd8 [23827.481701] ? mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.482744] kasan_check_range+0x145/0x1a0 [23827.493112] mlx5e_encap_take+0x72/0x140 [mlx5_core] [23827.494054] ? mlx5e_tc_tun_encap_info_equal_generic+0x140/0x140 [mlx5_core] [23827.495296] mlx5e_rep_neigh_update+0x41e/0x5e0 [mlx5_core] [23827.496338] ? mlx5e_rep_neigh_entry_release+0xb80/0xb80 [mlx5_core] [23827.497486] ? read_word_at_a_time+0xe/0x20 [23827.498250] ? strscpy+0xa0/0x2a0 [23827.498889] process_one_work+0x8ac/0x14e0 [23827.499638] ? lockdep_hardirqs_on_prepare+0x400/0x400 [23827.500537] ? pwq_dec_nr_in_flight+0x2c0/0x2c0 [23827.501359] ? rwlock_bug.part.0+0x90/0x90 [23827.502116] worker_thread+0x53b/0x1220 [23827.502831] ? process_one_work+0x14e0/0x14e0 [23827.503627] kthread+0x328/0x3f0 [23827.504254] ? _raw_spin_unlock_irq+0x24/0x40 [23827.505065] ? __kthread_bind_mask+0x90/0x90 [23827.505912] ret_from_fork+0x1f/0x30 [23827.506621] [23827.506987] Allocated by task 28248: [23827.507694] kasan_save_stack+0x1b/0x40 [23827.508476] __kasan_kmalloc+0x7c/0x90 [23827.509197] mlx5e_attach_encap+0xde1/0x1d40 [mlx5_core] [23827.510194] mlx5e_tc_add_fdb_flow+0x397/0xc40 [mlx5_core] [23827.511218] __mlx5e_add_fdb_flow+0x519/0xb30 [mlx5_core] [23827.512234] mlx5e_configure_flower+0x191c/0x4870 [mlx5_core] [23827.513298] tc_setup_cb_add+0x1d5/0x420 [23827.514023] fl_hw_replace_filter+0x382/0x6a0 [cls_flower] [23827.514975] fl_change+0x2ceb/0x4a51 [cls_flower] [23827.515821] tc_new_tfilter+0x89a/0x2070 [23827.516548] rtnetlink_rcv_msg+0x644/0x8c0 [23827.517300] netlink_rcv_skb+0x11d/0x340 [23827.518021] netlink_unicast+0x42b/0x700 [23827.518742] netlink_sendmsg+0x743/0xc20 [23827.519467] sock_sendmsg+0xb2/0xe0 [23827.520131] ____sys_sendmsg+0x590/0x770 [23827.520851] ___sys_sendmsg+0xd8/0x160 [23827.521552] __sys_sendmsg+0xb7/0x140 [23827.522238] do_syscall_64+0x3a/0x70 [23827.522907] entry_SYSCALL_64_after_hwframe+0x44/0xae [23827.523797] [23827.524163] Freed by task 25948: [23827.524780] kasan_save_stack+0x1b/0x40 [23827.525488] kasan_set_track+0x1c/0x30 [23827.526187] kasan_set_free_info+0x20/0x30 [23827.526968] __kasan_slab_free+0xed/0x130 [23827.527709] slab_free_freelist_hook+0xcf/0x1d0 [23827.528528] kmem_cache_free_bulk+0x33a/0x6e0 [23827.529317] kfree_rcu_work+0x55f/0xb70 [23827.530024] process_one_work+0x8ac/0x14e0 [23827.530770] worker_thread+0x53b/0x1220 [23827.531480] kthread+0x328/0x3f0 [23827.532114] ret_from_fork+0x1f/0x30 [23827.532785] [23827.533147] Last potentially related work creation: [23827.534007] kasan_save_stack+0x1b/0x40 [23827.534710] kasan_record_aux_stack+0xab/0xc0 [23827.535492] kvfree_call_rcu+0x31/0x7b0 [23827.536206] mlx5e_tc_del_fdb_flow+0x577/0xef0 [mlx5_core] [23827.537305] mlx5e_flow_put+0x49/0x80 [mlx5_core] [23827.538290] mlx5e_delete_flower+0x6d1/0xe60 [mlx5_core] [23827.539300] tc_setup_cb_destroy+0x18e/0x2f0 [23827.540144] fl_hw_destroy_filter+0x1d2/0x310 [cls_flower] [23827.541148] __fl_delete+0x4dc/0x660 [cls_flower] [23827.541985] fl_delete+0x97/0x160 [cls_flower] [23827.542782] tc_del_tfilter+0x7ab/0x13d0 [23827.543503] rtnetlink_rcv_msg+0x644/0x8c0 [23827.544257] netlink_rcv_skb+0x11d/0x340 [23827.544981] netlink_unicast+0x42b/0x700 [23827.545700] netlink_sendmsg+0x743/0xc20 [23827.546424] sock_sendmsg+0xb2/0xe0 [23827.547084] ____sys_sendmsg+0x590/0x770 [23827.547850] ___sys_sendmsg+0xd8/0x160 [23827.548606] __sys_sendmsg+0xb7/0x140 [23827.549303] do_syscall_64+0x3a/0x70 [23827.549969] entry_SYSCALL_64_after_hwframe+0x44/0xae [23827.550853] [23827.551217] The buggy address belongs to the object at ffff8881d1322200 [23827.551217] which belongs to the cache kmalloc-256 of size 256 [23827.553341] The buggy address is located 140 bytes inside of [23827.553341] 256-byte region [ffff8881d1322200, ffff8881d1322300) [23827.555747] The buggy address belongs to the page: [23827.556847] page:00000000898762aa refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1d1320 [23827.558651] head:00000000898762aa order:2 compound_mapcount:0 compound_pincount:0 [23827.559961] flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff) [23827.561243] raw: 002ffff800010200 dead000000000100 dead000000000122 ffff888100042b40 [23827.562653] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 [23827.564112] page dumped because: kasan: bad access detected [23827.565439] [23827.565932] Memory state around the buggy address: [23827.566917] ffff8881d1322180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [23827.568485] ffff8881d1322200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [23827.569818] >ffff8881d1322280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [23827.571143] ^ [23827.571879] ffff8881d1322300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [23827.573283] ffff8881d1322380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [23827.574654] ================================================================== Most of the necessary logic is already correctly implemented by mlx5e_get_next_valid_encap() helper that is used in neigh stats update handler. Make the handler generic by renaming it to mlx5e_get_next_matching_encap() and use callback to test whether flow is matching instead of hardcoded check for 'valid' flag value. Implement mlx5e_get_next_valid_encap() by calling mlx5e_get_next_matching_encap() with callback that tests encap MLX5_ENCAP_ENTRY_VALID flag. Implement new mlx5e_get_next_init_encap() helper by calling mlx5e_get_next_matching_encap() with callback that tests encap completion result to be non-error and use it in mlx5e_rep_neigh_update() to safely iterate over nhe->encap_list. Remove encap completion logic from mlx5e_rep_update_flows() since the encap entries passed to this function are already guaranteed to be properly initialized by similar code in mlx5e_get_next_init_encap(). Fixes: 2a1f176 ("net/mlx5e: Refactor neigh update for concurrent execution") Signed-off-by: Vlad Buslov <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: Pratyush Yadav <[email protected]>

Evaluate _DSM Function #5, the "PCI Boot Configuration" function. If the result is 0, the OS should preserve any resource assignments made by the firmware. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Benjamin Herrenschmidt <[email protected]> [bhelgaas: commit log] Signed-off-by: Bjorn Helgaas <[email protected]> (cherry picked from commit a78cf96)

commit 8410f70 upstream. Our test report a UAF: [ 2073.019181] ================================================================== [ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168 [ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584 [ 2073.019192] [ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5 [ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 2073.019200] Call trace: [ 2073.019203] dump_backtrace+0x0/0x310 [ 2073.019206] show_stack+0x28/0x38 [ 2073.019210] dump_stack+0xec/0x15c [ 2073.019216] print_address_description+0x68/0x2d0 [ 2073.019220] kasan_report+0x238/0x2f0 [ 2073.019224] __asan_store8+0x88/0xb0 [ 2073.019229] __bfq_put_async_bfqq+0xa0/0x168 [ 2073.019233] bfq_put_async_queues+0xbc/0x208 [ 2073.019236] bfq_pd_offline+0x178/0x238 [ 2073.019240] blkcg_deactivate_policy+0x1f0/0x420 [ 2073.019244] bfq_exit_queue+0x128/0x178 [ 2073.019249] blk_mq_exit_sched+0x12c/0x160 [ 2073.019252] elevator_exit+0xc8/0xd0 [ 2073.019256] blk_exit_queue+0x50/0x88 [ 2073.019259] blk_cleanup_queue+0x228/0x3d8 [ 2073.019267] null_del_dev+0xfc/0x1e0 [null_blk] [ 2073.019274] null_exit+0x90/0x114 [null_blk] [ 2073.019278] __arm64_sys_delete_module+0x358/0x5a0 [ 2073.019282] el0_svc_common+0xc8/0x320 [ 2073.019287] el0_svc_handler+0xf8/0x160 [ 2073.019290] el0_svc+0x10/0x218 [ 2073.019291] [ 2073.019294] Allocated by task 14163: [ 2073.019301] kasan_kmalloc+0xe0/0x190 [ 2073.019305] kmem_cache_alloc_node_trace+0x1cc/0x418 [ 2073.019308] bfq_pd_alloc+0x54/0x118 [ 2073.019313] blkcg_activate_policy+0x250/0x460 [ 2073.019317] bfq_create_group_hierarchy+0x38/0x110 [ 2073.019321] bfq_init_queue+0x6d0/0x948 [ 2073.019325] blk_mq_init_sched+0x1d8/0x390 [ 2073.019330] elevator_switch_mq+0x88/0x170 [ 2073.019334] elevator_switch+0x140/0x270 [ 2073.019338] elv_iosched_store+0x1a4/0x2a0 [ 2073.019342] queue_attr_store+0x90/0xe0 [ 2073.019348] sysfs_kf_write+0xa8/0xe8 [ 2073.019351] kernfs_fop_write+0x1f8/0x378 [ 2073.019359] __vfs_write+0xe0/0x360 [ 2073.019363] vfs_write+0xf0/0x270 [ 2073.019367] ksys_write+0xdc/0x1b8 [ 2073.019371] __arm64_sys_write+0x50/0x60 [ 2073.019375] el0_svc_common+0xc8/0x320 [ 2073.019380] el0_svc_handler+0xf8/0x160 [ 2073.019383] el0_svc+0x10/0x218 [ 2073.019385] [ 2073.019387] Freed by task 72584: [ 2073.019391] __kasan_slab_free+0x120/0x228 [ 2073.019394] kasan_slab_free+0x10/0x18 [ 2073.019397] kfree+0x94/0x368 [ 2073.019400] bfqg_put+0x64/0xb0 [ 2073.019404] bfqg_and_blkg_put+0x90/0xb0 [ 2073.019408] bfq_put_queue+0x220/0x228 [ 2073.019413] __bfq_put_async_bfqq+0x98/0x168 [ 2073.019416] bfq_put_async_queues+0xbc/0x208 [ 2073.019420] bfq_pd_offline+0x178/0x238 [ 2073.019424] blkcg_deactivate_policy+0x1f0/0x420 [ 2073.019429] bfq_exit_queue+0x128/0x178 [ 2073.019433] blk_mq_exit_sched+0x12c/0x160 [ 2073.019437] elevator_exit+0xc8/0xd0 [ 2073.019440] blk_exit_queue+0x50/0x88 [ 2073.019443] blk_cleanup_queue+0x228/0x3d8 [ 2073.019451] null_del_dev+0xfc/0x1e0 [null_blk] [ 2073.019459] null_exit+0x90/0x114 [null_blk] [ 2073.019462] __arm64_sys_delete_module+0x358/0x5a0 [ 2073.019467] el0_svc_common+0xc8/0x320 [ 2073.019471] el0_svc_handler+0xf8/0x160 [ 2073.019474] el0_svc+0x10/0x218 [ 2073.019475] [ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00 which belongs to the cache kmalloc-1024 of size 1024 [ 2073.019484] The buggy address is located 552 bytes inside of 1024-byte region [ffff8000ccf63f00, ffff8000ccf64300) [ 2073.019486] The buggy address belongs to the page: [ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0 [ 2073.020123] flags: 0x7ffff0000008100(slab|head) [ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00 [ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000 [ 2073.020411] page dumped because: kasan: bad access detected [ 2073.020412] [ 2073.020414] Memory state around the buggy address: [ 2073.020420] ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020424] ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020430] ^ [ 2073.020434] ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020438] ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2073.020439] ================================================================== The same problem exist in mainline as well. This is because oom_bfqq is moved to a non-root group, thus root_group is freed earlier. Thus fix the problem by don't move oom_bfqq. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Paolo Valente <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Andrew Paniakin <[email protected]>

…terface [ Upstream commit a90b2a1 ] collect_md property on xfrm interfaces can only be set on device creation, thus xfrmi_changelink() should fail when called on such interfaces. The check to enforce this was done only in the case where the xi was returned from xfrmi_locate() which doesn't look for the collect_md interface, and thus the validation was never reached. Calling changelink would thus errornously place the special interface xi in the xfrmi_net->xfrmi hash, but since it also exists in the xfrmi_net->collect_md_xfrmi pointer it would lead to a double free when the net namespace was taken down [1]. Change the check to use the xi from netdev_priv which is available earlier in the function to prevent changes in xfrm collect_md interfaces. [1] resulting oops: [ 8.516540] kernel BUG at net/core/dev.c:12029! [ 8.516552] Oops: invalid opcode: 0000 [#1] SMP NOPTI [ 8.516559] CPU: 0 UID: 0 PID: 12 Comm: kworker/u80:0 Not tainted 6.15.0-virtme #5 PREEMPT(voluntary) [ 8.516565] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 8.516569] Workqueue: netns cleanup_net [ 8.516579] RIP: 0010:unregister_netdevice_many_notify+0x101/0xab0 [ 8.516590] Code: 90 0f 0b 90 48 8b b0 78 01 00 00 48 8b 90 80 01 00 00 48 89 56 08 48 89 32 4c 89 80 78 01 00 00 48 89 b8 80 01 00 00 eb ac 90 <0f> 0b 48 8b 45 00 4c 8d a0 88 fe ff ff 48 39 c5 74 5c 41 80 bc 24 [ 8.516593] RSP: 0018:ffffa93b8006bd30 EFLAGS: 00010206 [ 8.516598] RAX: ffff98fe4226e000 RBX: ffffa93b8006bd58 RCX: ffffa93b8006bc60 [ 8.516601] RDX: 0000000000000004 RSI: 0000000000000000 RDI: dead000000000122 [ 8.516603] RBP: ffffa93b8006bdd8 R08: dead000000000100 R09: ffff98fe4133c100 [ 8.516605] R10: 0000000000000000 R11: 00000000000003d2 R12: ffffa93b8006be00 [ 8.516608] R13: ffffffff96c1a510 R14: ffffffff96c1a510 R15: ffffa93b8006be00 [ 8.516615] FS: 0000000000000000(0000) GS:ffff98fee73b7000(0000) knlGS:0000000000000000 [ 8.516619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8.516622] CR2: 00007fcd2abd0700 CR3: 000000003aa40000 CR4: 0000000000752ef0 [ 8.516625] PKRU: 55555554 [ 8.516627] Call Trace: [ 8.516632] <TASK> [ 8.516635] ? rtnl_is_locked+0x15/0x20 [ 8.516641] ? unregister_netdevice_queue+0x29/0xf0 [ 8.516650] ops_undo_list+0x1f2/0x220 [ 8.516659] cleanup_net+0x1ad/0x2e0 [ 8.516664] process_one_work+0x160/0x380 [ 8.516673] worker_thread+0x2aa/0x3c0 [ 8.516679] ? __pfx_worker_thread+0x10/0x10 [ 8.516686] kthread+0xfb/0x200 [ 8.516690] ? __pfx_kthread+0x10/0x10 [ 8.516693] ? __pfx_kthread+0x10/0x10 [ 8.516697] ret_from_fork+0x82/0xf0 [ 8.516705] ? __pfx_kthread+0x10/0x10 [ 8.516709] ret_from_fork_asm+0x1a/0x30 [ 8.516718] </TASK> Fixes: abc340b ("xfrm: interface: support collect metadata mode") Reported-by: Lonial Con <[email protected]> Signed-off-by: Eyal Birger <[email protected]> Signed-off-by: Steffen Klassert <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

pert script tests fails with segmentation fault as below: 92: perf script tests: --- start --- test child forked, pid 103769 DB test [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.012 MB /tmp/perf-test-script.7rbftEpOzX/perf.data (9 samples) ] /usr/libexec/perf-core/tests/shell/script.sh: line 35: 103780 Segmentation fault (core dumped) perf script -i "${perfdatafile}" -s "${db_test}" --- Cleaning up --- ---- end(-1) ---- 92: perf script tests : FAILED! Backtrace pointed to : #0 0x0000000010247dd0 in maps.machine () #1 0x00000000101d178c in db_export.sample () #2 0x00000000103412c8 in python_process_event () #3 0x000000001004eb28 in process_sample_event () #4 0x000000001024fcd0 in machines.deliver_event () #5 0x000000001025005c in perf_session.deliver_event () gregkh#6 0x00000000102568b0 in __ordered_events__flush.part.0 () gregkh#7 0x0000000010251618 in perf_session.process_events () gregkh#8 0x0000000010053620 in cmd_script () gregkh#9 0x00000000100b5a28 in run_builtin () gregkh#10 0x00000000100b5f94 in handle_internal_command () gregkh#11 0x0000000010011114 in main () Further investigation reveals that this occurs in the `perf script tests`, because it uses `db_test.py` script. This script sets `perf_db_export_mode = True`. With `perf_db_export_mode` enabled, if a sample originates from a hypervisor, perf doesn't set maps for "[H]" sample in the code. Consequently, `al->maps` remains NULL when `maps__machine(al->maps)` is called from `db_export__sample`. As al->maps can be NULL in case of Hypervisor samples , use thread->maps because even for Hypervisor sample, machine should exist. If we don't have machine for some reason, return -1 to avoid segmentation fault. Reported-by: Disha Goel <[email protected]> Signed-off-by: Aditya Bodkhe <[email protected]> Reviewed-by: Adrian Hunter <[email protected]> Tested-by: Disha Goel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Suggested-by: Adrian Hunter <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>

Without the change `perf `hangs up on charaster devices. On my system it's enough to run system-wide sampler for a few seconds to get the hangup: $ perf record -a -g --call-graph=dwarf $ perf report # hung `strace` shows that hangup happens on reading on a character device `/dev/dri/renderD128` $ strace -y -f -p 2780484 strace: Process 2780484 attached pread64(101</dev/dri/renderD128>, strace: Process 2780484 detached It's call trace descends into `elfutils`: $ gdb -p 2780484 (gdb) bt #0 0x00007f5e508f04b7 in __libc_pread64 (fd=101, buf=0x7fff9df7edb0, count=0, offset=0) at ../sysdeps/unix/sysv/linux/pread64.c:25 #1 0x00007f5e52b79515 in read_file () from /<<NIX>>/elfutils-0.192/lib/libelf.so.1 #2 0x00007f5e52b25666 in libdw_open_elf () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #3 0x00007f5e52b25907 in __libdw_open_file () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #4 0x00007f5e52b120a9 in dwfl_report_elf@@ELFUTILS_0.156 () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #5 0x000000000068bf20 in __report_module (al=al@entry=0x7fff9df80010, ip=ip@entry=139803237033216, ui=ui@entry=0x5369b5e0) at util/dso.h:537 gregkh#6 0x000000000068c3d1 in report_module (ip=139803237033216, ui=0x5369b5e0) at util/unwind-libdw.c:114 gregkh#7 frame_callback (state=0x535aef10, arg=0x5369b5e0) at util/unwind-libdw.c:242 gregkh#8 0x00007f5e52b261d3 in dwfl_thread_getframes () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 gregkh#9 0x00007f5e52b25bdb in get_one_thread_cb () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 gregkh#10 0x00007f5e52b25faa in dwfl_getthreads () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 gregkh#11 0x00007f5e52b26514 in dwfl_getthread_frames () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 gregkh#12 0x000000000068c6ce in unwind__get_entries (cb=cb@entry=0x5d4620 <unwind_entry>, arg=arg@entry=0x10cd5fa0, thread=thread@entry=0x1076a290, data=data@entry=0x7fff9df80540, max_stack=max_stack@entry=127, best_effort=best_effort@entry=false) at util/thread.h:152 gregkh#13 0x00000000005dae95 in thread__resolve_callchain_unwind (evsel=0x106006d0, thread=0x1076a290, cursor=0x10cd5fa0, sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2939 gregkh#14 thread__resolve_callchain_unwind (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2920 gregkh#15 __thread__resolve_callchain (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, evsel@entry=0x7fff9df80440, sample=0x7fff9df80540, parent=parent@entry=0x7fff9df804a0, root_al=root_al@entry=0x7fff9df80440, max_stack=127, symbols=true) at util/machine.c:2970 gregkh#16 0x00000000005d0cb2 in thread__resolve_callchain (thread=<optimized out>, cursor=<optimized out>, evsel=0x7fff9df80440, sample=<optimized out>, parent=0x7fff9df804a0, root_al=0x7fff9df80440, max_stack=127) at util/machine.h:198 gregkh#17 sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fff9df804a0, evsel=evsel@entry=0x106006d0, al=al@entry=0x7fff9df80440, max_stack=max_stack@entry=127) at util/callchain.c:1127 gregkh#18 0x0000000000617e08 in hist_entry_iter__add (iter=iter@entry=0x7fff9df80480, al=al@entry=0x7fff9df80440, max_stack_depth=127, arg=arg@entry=0x7fff9df81ae0) at util/hist.c:1255 gregkh#19 0x000000000045d2d0 in process_sample_event (tool=0x7fff9df81ae0, event=<optimized out>, sample=0x7fff9df80540, evsel=0x106006d0, machine=<optimized out>) at builtin-report.c:334 gregkh#20 0x00000000005e3bb1 in perf_session__deliver_event (session=0x105ff2c0, event=0x7f5c7d735ca0, tool=0x7fff9df81ae0, file_offset=2914716832, file_path=0x105ffbf0 "perf.data") at util/session.c:1367 gregkh#21 0x00000000005e8d93 in do_flush (oe=0x105ffa50, show_progress=false) at util/ordered-events.c:245 #22 __ordered_events__flush (oe=0x105ffa50, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:324 #23 0x00000000005e1f64 in perf_session__process_user_event (session=0x105ff2c0, event=0x7f5c7d752b18, file_offset=2914835224, file_path=0x105ffbf0 "perf.data") at util/session.c:1419 #24 0x00000000005e47c7 in reader__read_event (rd=rd@entry=0x7fff9df81260, session=session@entry=0x105ff2c0, --Type <RET> for more, q to quit, c to continue without paging-- quit prog=prog@entry=0x7fff9df81220) at util/session.c:2132 #25 0x00000000005e4b37 in reader__process_events (rd=0x7fff9df81260, session=0x105ff2c0, prog=0x7fff9df81220) at util/session.c:2181 #26 __perf_session__process_events (session=0x105ff2c0) at util/session.c:2226 #27 perf_session__process_events (session=session@entry=0x105ff2c0) at util/session.c:2390 #28 0x0000000000460add in __cmd_report (rep=0x7fff9df81ae0) at builtin-report.c:1076 #29 cmd_report (argc=<optimized out>, argv=<optimized out>) at builtin-report.c:1827 #30 0x00000000004c5a40 in run_builtin (p=p@entry=0xd8f7f8 <commands+312>, argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:351 #31 0x00000000004c5d63 in handle_internal_command (argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:404 #32 0x0000000000442de3 in run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:448 #33 main (argc=<optimized out>, argv=0x7fff9df844b0) at perf.c:556 The hangup happens because nothing in` perf` or `elfutils` checks if a mapped file is easily readable. The change conservatively skips all non-regular files. Signed-off-by: Sergei Trofimovich <[email protected]> Acked-by: Namhyung Kim <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]>

Symbolize stack traces by creating a live machine. Add this functionality to dump_stack and switch dump_stack users to use it. Switch TUI to use it. Add stack traces to the child test function which can be useful to diagnose blocked code. Example output: ``` $ perf test -vv PERF_RECORD_ ... 7: PERF_RECORD_* events & perf_sample fields: 7: PERF_RECORD_* events & perf_sample fields : Running (1 active) ^C Signal (2) while running tests. Terminating tests with the same signal Internal test harness failure. Completing any started tests: : 7: PERF_RECORD_* events & perf_sample fields: ---- unexpected signal (2) ---- #0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0 #1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0 #2 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64 #3 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72 #4 0x7fc12fef1393 in __nanosleep nanosleep.c:26 #5 0x7fc12ff02d68 in __sleep sleep.c:55 gregkh#6 0x55788c63196b in test__PERF_RECORD perf-record.c:0 gregkh#7 0x55788c620fb0 in run_test_child builtin-test.c:0 gregkh#8 0x55788c5bd18d in start_command run-command.c:127 gregkh#9 0x55788c621ef3 in __cmd_test builtin-test.c:0 gregkh#10 0x55788c6225bf in cmd_test ??:0 gregkh#11 0x55788c5afbd0 in run_builtin perf.c:0 gregkh#12 0x55788c5afeeb in handle_internal_command perf.c:0 gregkh#13 0x55788c52b383 in main ??:0 gregkh#14 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74 gregkh#15 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 gregkh#16 0x55788c52b9d1 in _start ??:0 ---- unexpected signal (2) ---- #0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0 #1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0 #2 0x7fc12fea3a14 in pthread_sigmask@GLIBC_2.2.5 pthread_sigmask.c:45 #3 0x7fc12fe49fd9 in __GI___sigprocmask sigprocmask.c:26 #4 0x7fc12ff2601b in __longjmp_chk longjmp.c:36 #5 0x55788c6210c0 in print_test_result.isra.0 builtin-test.c:0 gregkh#6 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0 gregkh#7 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64 gregkh#8 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72 gregkh#9 0x7fc12fef1393 in __nanosleep nanosleep.c:26 gregkh#10 0x7fc12ff02d68 in __sleep sleep.c:55 gregkh#11 0x55788c63196b in test__PERF_RECORD perf-record.c:0 gregkh#12 0x55788c620fb0 in run_test_child builtin-test.c:0 gregkh#13 0x55788c5bd18d in start_command run-command.c:127 gregkh#14 0x55788c621ef3 in __cmd_test builtin-test.c:0 gregkh#15 0x55788c6225bf in cmd_test ??:0 gregkh#16 0x55788c5afbd0 in run_builtin perf.c:0 gregkh#17 0x55788c5afeeb in handle_internal_command perf.c:0 gregkh#18 0x55788c52b383 in main ??:0 gregkh#19 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74 gregkh#20 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 gregkh#21 0x55788c52b9d1 in _start ??:0 7: PERF_RECORD_* events & perf_sample fields : Skip (permissions) ``` Signed-off-by: Ian Rogers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]>

Calling perf top with branch filters enabled on Intel CPU's with branch counters logging (A.K.A LBR event logging [1]) support results in a segfault. $ perf top -e '{cpu_core/cpu-cycles/,cpu_core/event=0xc6,umask=0x3,frontend=0x11,name=frontend_retired_dsb_miss/}' -j any,counter ... Thread 27 "perf" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffafff76c0 (LWP 949003)] perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653 653 *width = env->cpu_pmu_caps ? env->br_cntr_width : (gdb) bt #0 perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653 #1 0x00000000005b1599 in symbol__account_br_cntr (branch=0x7fffcc3db580, evsel=0xfea2d0, offset=12, br_cntr=8) at util/annotate.c:345 #2 0x00000000005b17fb in symbol__account_cycles (addr=5658172, start=5658160, sym=0x7fffcc0ee420, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:389 #3 0x00000000005b1976 in addr_map_symbol__account_cycles (ams=0x7fffcd7b01d0, start=0x7fffcd7b02b0, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:422 #4 0x000000000068d57f in hist__account_cycles (bs=0x110d288, al=0x7fffafff6540, sample=0x7fffafff6760, nonany_branch_mode=false, total_cycles=0x0, evsel=0xfea2d0) at util/hist.c:2850 #5 0x0000000000446216 in hist_iter__top_callback (iter=0x7fffafff6590, al=0x7fffafff6540, single=true, arg=0x7fffffff9e00) at builtin-top.c:737 gregkh#6 0x0000000000689787 in hist_entry_iter__add (iter=0x7fffafff6590, al=0x7fffafff6540, max_stack_depth=127, arg=0x7fffffff9e00) at util/hist.c:1359 gregkh#7 0x0000000000446710 in perf_event__process_sample (tool=0x7fffffff9e00, event=0x110d250, evsel=0xfea2d0, sample=0x7fffafff6760, machine=0x108c968) at builtin-top.c:845 gregkh#8 0x0000000000447735 in deliver_event (qe=0x7fffffffa120, qevent=0x10fc200) at builtin-top.c:1211 gregkh#9 0x000000000064ccae in do_flush (oe=0x7fffffffa120, show_progress=false) at util/ordered-events.c:245 gregkh#10 0x000000000064d005 in __ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP, timestamp=0) at util/ordered-events.c:324 gregkh#11 0x000000000064d0ef in ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP) at util/ordered-events.c:342 gregkh#12 0x00000000004472a9 in process_thread (arg=0x7fffffff9e00) at builtin-top.c:1120 gregkh#13 0x00007ffff6e7dba8 in start_thread (arg=<optimized out>) at pthread_create.c:448 gregkh#14 0x00007ffff6f01b8c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78 The cause is that perf_env__find_br_cntr_info tries to access a null pointer pmu_caps in the perf_env struct. A similar issue exists for homogeneous core systems which use the cpu_pmu_caps structure. Fix this by populating cpu_pmu_caps and pmu_caps structures with values from sysfs when calling perf top with branch stack sampling enabled. [1], LBR event logging introduced here: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Ian Rogers <[email protected]> Signed-off-by: Thomas Falcon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]>

profile allocation is wrongly setting the number of entries on the rules vector before any ruleset is assigned. If profile allocation fails between ruleset allocation and assigning the first ruleset, free_ruleset() will be called with a null pointer resulting in an oops. [ 107.350226] kernel BUG at mm/slub.c:545! [ 107.350912] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 107.351447] CPU: 1 UID: 0 PID: 27 Comm: ksoftirqd/1 Not tainted 6.14.6-hwe-rlee287-dev+ #5 [ 107.353279] Hardware name:[ 107.350218] -QE-----------[ cutMU here ]--------- Ub--- [ 107.3502untu26] kernel BUG a 24t mm/slub.c:545.!04 P [ 107.350912]C ( Oops: invalid oi4pcode: 0000 [#1]40 PREEMPT SMP NOPFXTI + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 107.356054] RIP: 0010:__slab_free+0x152/0x340 [ 107.356444] Code: 00 4c 89 ff e8 0f ac df 00 48 8b 14 24 48 8b 4c 24 20 48 89 44 24 08 48 8b 03 48 c1 e8 09 83 e0 01 88 44 24 13 e9 71 ff ff ff <0f> 0b 41 f7 44 24 08 87 04 00 00 75 b2 eb a8 41 f7 44 24 08 87 04 [ 107.357856] RSP: 0018:ffffad4a800fbbb0 EFLAGS: 00010246 [ 107.358937] RAX: ffff97ebc2a88e70 RBX: ffffd759400aa200 RCX: 0000000000800074 [ 107.359976] RDX: ffff97ebc2a88e60 RSI: ffffd759400aa200 RDI: ffffad4a800fbc20 [ 107.360600] RBP: ffffad4a800fbc50 R08: 0000000000000001 R09: ffffffff86f02cf2 [ 107.361254] R10: 0000000000000000 R11: 0000000000000000 R12: ffff97ecc0049400 [ 107.361934] R13: ffff97ebc2a88e60 R14: ffff97ecc0049400 R15: 0000000000000000 [ 107.362597] FS: 0000000000000000(0000) GS:ffff97ecfb200000(0000) knlGS:0000000000000000 [ 107.363332] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 107.363784] CR2: 000061c9545ac000 CR3: 0000000047aa6000 CR4: 0000000000750ef0 [ 107.364331] PKRU: 55555554 [ 107.364545] Call Trace: [ 107.364761] <TASK> [ 107.364931] ? local_clock+0x15/0x30 [ 107.365219] ? srso_alias_return_thunk+0x5/0xfbef5 [ 107.365593] ? kfree_sensitive+0x32/0x70 [ 107.365900] kfree+0x29d/0x3a0 [ 107.366144] ? srso_alias_return_thunk+0x5/0xfbef5 [ 107.366510] ? local_clock_noinstr+0xe/0xd0 [ 107.366841] ? srso_alias_return_thunk+0x5/0xfbef5 [ 107.367209] kfree_sensitive+0x32/0x70 [ 107.367502] aa_free_profile.part.0+0xa2/0x400 [ 107.367850] ? rcu_do_batch+0x1e6/0x5e0 [ 107.368148] aa_free_profile+0x23/0x60 [ 107.368438] label_free_switch+0x4c/0x80 [ 107.368751] label_free_rcu+0x1c/0x50 [ 107.369038] rcu_do_batch+0x1e8/0x5e0 [ 107.369324] ? rcu_do_batch+0x157/0x5e0 [ 107.369626] rcu_core+0x1b0/0x2f0 [ 107.369888] rcu_core_si+0xe/0x20 [ 107.370156] handle_softirqs+0x9b/0x3d0 [ 107.370460] ? smpboot_thread_fn+0x26/0x210 [ 107.370790] run_ksoftirqd+0x3a/0x70 [ 107.371070] smpboot_thread_fn+0xf9/0x210 [ 107.371383] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 107.371746] kthread+0x10d/0x280 [ 107.372010] ? __pfx_kthread+0x10/0x10 [ 107.372310] ret_from_fork+0x44/0x70 [ 107.372655] ? __pfx_kthread+0x10/0x10 [ 107.372974] ret_from_fork_asm+0x1a/0x30 [ 107.373316] </TASK> [ 107.373505] Modules linked in: af_packet_diag mptcp_diag tcp_diag udp_diag raw_diag inet_diag snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore qrtr binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd ccp kvm irqbypass polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd i2c_piix4 i2c_smbus input_leds joydev sch_fq_codel msr parport_pc ppdev lp parport efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci dmi_sysfs qemu_fw_cfg ip_tables x_tables autofs4 hid_generic usbhid hid psmouse serio_raw floppy bochs pata_acpi [ 107.379086] ---[ end trace 0000000000000000 ]--- Don't set the count until a ruleset is actually allocated and guard against free_ruleset() being called with a null pointer. Reported-by: Ryan Lee <[email protected]> Fixes: 217af7e ("apparmor: refactor profile rules and attachments") Signed-off-by: John Johansen <[email protected]>