ncsi deadlock detection #54

@cyrilbur-ibm

=================================
[ INFO: inconsistent lock state ]
4.3.6 #1 Not tainted
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
ip/934 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&(&ndp->ndp_package_lock)->rlock){+.?...}, at: [<c03b0c9c>] ncsi_stop_dev+0x14/0x50
{IN-SOFTIRQ-W} state was registered at:
  [<c03bbdf8>] _raw_spin_lock+0x28/0x38
  [<c03b08b0>] ncsi_add_package+0x64/0xf0
  [<c03af978>] ncsi_rsp_handler_sp+0x80/0xe0
  [<c03afb4c>] ncsi_rcv_rsp+0xd4/0x104
  [<c0308204>] __netif_receive_skb_core+0x6c4/0x808
  [<c0309d8c>] netif_receive_skb_internal+0xb4/0x138
  [<c030a664>] napi_gro_receive+0x48/0x9c
  [<c026d784>] ftgmac100_poll+0x360/0x59c
  [<c030ad48>] net_rx_action+0xe8/0x2a0
  [<c001a524>] __do_softirq+0x108/0x26c
  [<c001a728>] do_softirq+0x48/0x70
  [<c001a818>] __local_bh_enable_ip+0xc8/0x104
  [<c030d120>] __dev_queue_xmit+0x654/0x6c4
  [<c03ae648>] ncsi_xmit_cmd+0x1d4/0x208
  [<c03b01e0>] ncsi_dev_start+0xd0/0x3c0
  [<c03b0c44>] ncsi_dev_work+0x1b8/0x1fc
  [<c002c7a4>] process_one_work+0x228/0x3cc
  [<c002d5e0>] worker_thread+0x2a4/0x3d8
  [<c0031ba0>] kthread+0xc4/0xd8
  [<c000a3ac>] ret_from_fork+0x14/0x28
irq event stamp: 2009
hardirqs last  enabled at (2009): [<c001a834>] __local_bh_enable_ip+0xe4/0x104
hardirqs last disabled at (2007): [<c001a7b4>] __local_bh_enable_ip+0x64/0x104
softirqs last  enabled at (2008): [<c0326a44>] dev_deactivate_many+0x270/0x2ac
softirqs last disabled at (2006): [<c0326a28>] dev_deactivate_many+0x254/0x2ac

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&ndp->ndp_package_lock)->rlock);
  <Interrupt>
    lock(&(&ndp->ndp_package_lock)->rlock);

 *** DEADLOCK ***

1 lock held by ip/934:
 #0:  (rtnl_mutex){+.+.+.}, at: [<c036d0c8>] devinet_ioctl+0x15c/0x6c8

stack backtrace:
CPU: 0 PID: 934 Comm: ip Not tainted 4.3.6 #1
Hardware name: ASpeed SoC
[<c000fa2c>] (unwind_backtrace) from [<c000d5fc>] (show_stack+0x10/0x14)
[<c000d5fc>] (show_stack) from [<c0072a88>] (print_usage_bug.part.11+0x220/0x288)
[<c0072a88>] (print_usage_bug.part.11) from [<c0040394>] (mark_lock+0x400/0x678)
[<c0040394>] (mark_lock) from [<c004293c>] (__lock_acquire+0xa0c/0x1a9c)
[<c004293c>] (__lock_acquire) from [<c0043dc4>] (lock_acquire+0x9c/0xbc)
[<c0043dc4>] (lock_acquire) from [<c03bbdf8>] (_raw_spin_lock+0x28/0x38)
[<c03bbdf8>] (_raw_spin_lock) from [<c03b0c9c>] (ncsi_stop_dev+0x14/0x50)
[<c03b0c9c>] (ncsi_stop_dev) from [<c026cfa0>] (ftgmac100_stop+0x1c/0x28)
[<c026cfa0>] (ftgmac100_stop) from [<c0306cb0>] (__dev_close_many+0xa0/0xc8)
[<c0306cb0>] (__dev_close_many) from [<c0306dc8>] (__dev_close+0x20/0x34)
[<c0306dc8>] (__dev_close) from [<c030da50>] (__dev_change_flags+0x8c/0x138)
[<c030da50>] (__dev_change_flags) from [<c030db14>] (dev_change_flags+0x18/0x48)
[<c030db14>] (dev_change_flags) from [<c036d298>] (devinet_ioctl+0x32c/0x6c8)
[<c036d298>] (devinet_ioctl) from [<c02f42b0>] (sock_ioctl+0x26c/0x2d0)
[<c02f42b0>] (sock_ioctl) from [<c00b5804>] (do_vfs_ioctl+0x588/0x67c)
[<c00b5804>] (do_vfs_ioctl) from [<c00b592c>] (SyS_ioctl+0x34/0x5c)
[<c00b592c>] (SyS_ioctl) from [<c000a320>] (ret_fast_syscall+0x0/0x1c)

The kernel is at 908a999, with the following patch applied so that networking works with CONFIG_LOCKDEP:

diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index b8503e1..cd496b7 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -359,12 +359,6 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
                eh->h_source[i] = 0xff;
        }

-       /* Send NCSI packet */
-       skb_get(nr->nr_cmd);
-       ret = dev_queue_xmit(nr->nr_cmd);
-       if (ret)
-               goto out;
-
        /* Start the timer for the request that might not have
         * corresponding response. I'm not sure 1 second delay
         * here is enough. Anyway, NCSI is internal network, so
@@ -373,6 +367,12 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
        nr->nr_timer_enabled = true;
        mod_timer(&nr->nr_timer, jiffies + 1 * HZ);

+       /* Send NCSI packet */
+       skb_get(nr->nr_cmd);
+       ret = dev_queue_xmit(nr->nr_cmd);
+       if (ret)
+               goto out;
+
        return 0;
 out:
        ncsi_free_req(nr, false, false);

Fix might be as simple as:

    unsigned long flags;

    /* was: spin_lock(&ndp->ndp_package_lock); */
    spin_lock_irqsave(&ndp->ndp_package_lock, flags);
    list_for_each_entry_safe(np, tmp, &ndp->ndp_packages, np_node)
        ncsi_release_package(np);
    /* was: spin_unlock(&ndp->ndp_package_lock); */
    spin_unlock_irqrestore(&ndp->ndp_package_lock, flags);

but I'll check with Gavin.
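
For anyone reading the splat above: the scenario lockdep is warning about can be modelled in plain userspace C. This is only a sketch, not kernel code. Every name in it (`model_lock`, `model_softirq`, `model_stop_dev_plain`, `model_stop_dev_masked`, and so on) is hypothetical and made up for illustration, and the "softirq" is just a function call standing in for the receive path that reaches ncsi_add_package. It shows why taking the lock in process context with softirqs still enabled can self-deadlock on one CPU, and why masking them around the critical section avoids that.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; none of these names are kernel APIs. */
struct model_lock {
	bool held;
};

static struct model_lock package_lock;  /* stands in for ndp_package_lock */
static bool softirqs_enabled = true;
static bool softirq_pending;
static int would_deadlock;              /* times the softirq found the lock held */
static int softirq_runs;                /* times the softirq ran to completion */

static void model_lock_acquire(struct model_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void model_lock_release(struct model_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Models the receive path taking the package lock in softirq context. */
static void model_softirq(void)
{
	if (package_lock.held) {
		/* A real spinlock would spin here forever on the same CPU. */
		would_deadlock++;
		return;
	}
	model_lock_acquire(&package_lock);
	softirq_runs++;
	model_lock_release(&package_lock);
}

/* Models a packet arriving: run the softirq now if allowed, else defer it. */
static void model_packet_arrives(void)
{
	if (softirqs_enabled)
		model_softirq();
	else
		softirq_pending = true;
}

/* Models the stop path using a plain lock: softirqs stay enabled. */
static void model_stop_dev_plain(void)
{
	model_lock_acquire(&package_lock);
	model_packet_arrives();         /* lands inside the critical section */
	model_lock_release(&package_lock);
}

/* Models the same path with softirqs masked while the lock is held. */
static void model_stop_dev_masked(void)
{
	softirqs_enabled = false;
	model_lock_acquire(&package_lock);
	model_packet_arrives();         /* deferred instead of run */
	model_lock_release(&package_lock);
	softirqs_enabled = true;
	if (softirq_pending) {          /* deferred work runs once unmasked */
		softirq_pending = false;
		model_softirq();
	}
}
```

Calling `model_stop_dev_plain()` bumps `would_deadlock`, which is the case lockdep flags. Calling `model_stop_dev_masked()` defers the softirq until the lock has been released, so it completes normally. Since the report is about softirq context specifically, spin_lock_bh() may also be enough in the real code, but that is a guess until the other users of the lock have been checked.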
