Comment 63 for bug 1403152

James Dempsey (jamespd) wrote:

We also backported [1] to 4.2 (linux-lts-wily) and deployed it to our production OpenStack cloud. We installed it only yesterday, and our MTBF is between two and twenty days, so it will be a while before we know whether it has made any difference.

Some details about our configuration / failure mode:

Three OpenStack "Layer 3" hosts, A, B and C (running 3.19.0-30-generic #34~14.04.1-Ubuntu), providing virtual routers, VPNs and metadata via network namespaces.

Our most recent failures occurred on hosts B and C (within 30 minutes of each other, after having been fine for weeks) while removing routers from A and re-creating them on B and C.

Our stack traces are slightly different from the ones posted above:

Dec 14 15:37:05 hostname kernel: [961050.119727] INFO: task ip:9865 blocked for more than 120 seconds.
Dec 14 15:37:05 hostname kernel: [961050.126707] Tainted: G C 3.19.0-30-generic #34~14.04.1-Ubuntu
Dec 14 15:37:05 hostname kernel: [961050.135073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 15:37:05 hostname kernel: [961050.144094] ip D ffff88097e3e3de8 0 9865 9864 0x00000000
Dec 14 15:37:05 hostname kernel: [961050.144098] ffff88097e3e3de8 ffff880e982693a0 0000000000013e80 ffff88097e3e3fd8
Dec 14 15:37:05 hostname kernel: [961050.144100] 0000000000013e80 ffff88101a8993a0 ffff880e982693a0 0000000000000000
Dec 14 15:37:05 hostname kernel: [961050.144102] ffffffff81cdb2a0 ffffffff81cdb2a4 ffff880e982693a0 00000000ffffffff
Dec 14 15:37:05 hostname kernel: [961050.144104] Call Trace:
Dec 14 15:37:05 hostname kernel: [961050.144109] [<ffffffff817b2fa9>] schedule_preempt_disabled+0x29/0x70
Dec 14 15:37:05 hostname kernel: [961050.144111] [<ffffffff817b4c95>] __mutex_lock_slowpath+0x95/0x100
Dec 14 15:37:05 hostname kernel: [961050.144115] [<ffffffff811cfd66>] ? __kmalloc+0x226/0x280
Dec 14 15:37:05 hostname kernel: [961050.144117] [<ffffffff816a14a1>] ? net_alloc_generic+0x21/0x30
Dec 14 15:37:05 hostname kernel: [961050.144120] [<ffffffff817b4d23>] mutex_lock+0x23/0x37
Dec 14 15:37:05 hostname kernel: [961050.144122] [<ffffffff816a1c75>] copy_net_ns+0x75/0x150
Dec 14 15:37:05 hostname kernel: [961050.144125] [<ffffffff810943ad>] create_new_namespaces+0xfd/0x180
Dec 14 15:37:05 hostname kernel: [961050.144127] [<ffffffff810945ba>] unshare_nsproxy_namespaces+0x5a/0xc0
Dec 14 15:37:05 hostname kernel: [961050.144130] [<ffffffff8107439b>] SyS_unshare+0x15b/0x2e0
Dec 14 15:37:05 hostname kernel: [961050.144133] [<ffffffff817b6e4d>] system_call_fastpath+0x16/0x1b
Dec 14 15:37:05 hostname kernel: [961050.144135] INFO: task ip:9896 blocked for more than 120 seconds.
Dec 14 15:37:05 hostname kernel: [961050.151109] Tainted: G C 3.19.0-30-generic #34~14.04.1-Ubuntu
Dec 14 15:37:05 hostname kernel: [961050.159558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 15:37:05 hostname kernel: [961050.168551] ip D ffff8804591cfde8 0 9896 9895 0x00000000
Dec 14 15:37:05 hostname kernel: [961050.168556] ffff8804591cfde8 ffff880814031d70 0000000000013e80 ffff8804591cffd8
Dec 14 15:37:05 hostname kernel: [961050.168558] 0000000000013e80 ffffffff81c1d4e0 ffff880814031d70 0000000000000000
Dec 14 15:37:05 hostname kernel: [961050.168560] ffffffff81cdb2a0 ffffffff81cdb2a4 ffff880814031d70 00000000ffffffff
Dec 14 15:37:05 hostname kernel: [961050.168562] Call Trace:
Dec 14 15:37:05 hostname kernel: [961050.168568] [<ffffffff817b2fa9>] schedule_preempt_disabled+0x29/0x70
Dec 14 15:37:05 hostname kernel: [961050.168571] [<ffffffff817b4c95>] __mutex_lock_slowpath+0x95/0x100
Dec 14 15:37:05 hostname kernel: [961050.168573] [<ffffffff817b4d23>] mutex_lock+0x23/0x37
Dec 14 15:37:05 hostname kernel: [961050.168577] [<ffffffff816a1c75>] copy_net_ns+0x75/0x150
Dec 14 15:37:05 hostname kernel: [961050.168581] [<ffffffff810943ad>] create_new_namespaces+0xfd/0x180
Dec 14 15:37:05 hostname kernel: [961050.168584] [<ffffffff810945ba>] unshare_nsproxy_namespaces+0x5a/0xc0
Dec 14 15:37:05 hostname kernel: [961050.168587] [<ffffffff8107439b>] SyS_unshare+0x15b/0x2e0
Dec 14 15:37:05 hostname kernel: [961050.168589] [<ffffffff817b6e4d>] system_call_fastpath+0x16/0x1b
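
For anyone wanting to compare their own traces against these, here is a minimal sketch, assuming syslog-formatted lines like the ones above, that groups call-trace frames under each "blocked for more than N seconds" report. The function name parse_hung_tasks and the SAMPLE excerpt are illustrative only and are not part of any existing tool; frames the kernel marks with "?" (unreliable) are skipped.

```python
import re

# Header line the kernel's hung-task detector prints, e.g.
# "INFO: task ip:9865 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One call-trace frame, e.g. "[<ffffffff816a1c75>] copy_net_ns+0x75/0x150".
# A leading "? " marks a frame the unwinder is not sure about.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (?P<uncertain>\? )?(?P<func>[\w.]+)\+0x")


def parse_hung_tasks(lines):
    """Return one dict per hung-task report, with its reliable frames in order."""
    reports = []
    for line in lines:
        header = HUNG_TASK_RE.search(line)
        if header:
            reports.append({
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "frames": [],
            })
            continue
        frame = FRAME_RE.search(line)
        # Attach frames to the most recent report; ignore "?" frames.
        if frame and reports and not frame.group("uncertain"):
            reports[-1]["frames"].append(frame.group("func"))
    return reports


# Short excerpt taken from the log above (hypothetical test input).
SAMPLE = """\
Dec 14 15:37:05 hostname kernel: [961050.119727] INFO: task ip:9865 blocked for more than 120 seconds.
Dec 14 15:37:05 hostname kernel: [961050.144104] Call Trace:
Dec 14 15:37:05 hostname kernel: [961050.144111] [<ffffffff817b4c95>] __mutex_lock_slowpath+0x95/0x100
Dec 14 15:37:05 hostname kernel: [961050.144115] [<ffffffff811cfd66>] ? __kmalloc+0x226/0x280
Dec 14 15:37:05 hostname kernel: [961050.144122] [<ffffffff816a1c75>] copy_net_ns+0x75/0x150
Dec 14 15:37:05 hostname kernel: [961050.144135] INFO: task ip:9896 blocked for more than 120 seconds.
Dec 14 15:37:05 hostname kernel: [961050.168577] [<ffffffff816a1c75>] copy_net_ns+0x75/0x150
"""

reports = parse_hung_tasks(SAMPLE.splitlines())
for report in reports:
    print(report["comm"], report["pid"], " <- ".join(report["frames"]))
```

Both reports in our log end up in copy_net_ns waiting on a mutex, which is the quickest thing to check when comparing against the traces earlier in this bug.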

[1] http://www.spinics.net/lists/netdev/msg351337.html

Cheers,
James Dempsey