Control node crashes while adding a default gateway to a quantum router

Bug #1260334 reported by Tim McSweeney
This bug affects 3 people
Affects: Cisco Openstack
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I installed the g.3 release using one control node and three compute nodes on UCS B200M3 blades.

I'm using two quantum routers. The control node crashed after setting a gateway on the second router:
quantum router-gateway-set router-pcrf 10.87.252.130

Setting the gateway on the first router had no issues. Both routers use the same gateway. The crash happens every time I try to set the gateway on the second router. I have rebooted the control node several times, but it always crashes about one minute after the login prompt appears. I have to reinstall OpenStack on the blade to recover from the crash.
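For clarity, the full failing sequence looks like this (the first router's name here is illustrative; both routers point at the same external gateway):

quantum router-gateway-set router-other 10.87.252.130   # first router: succeeds
quantum router-gateway-set router-pcrf 10.87.252.130    # second router: control node panics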

root@m2c1q5-ctrl:~# dpkg -l | grep openvswitch
ii openvswitch-common 1.4.0-1ubuntu1.5 Open vSwitch common components
ii openvswitch-datapath-dkms 1.4.0-1ubuntu1.5 Open vSwitch datapath module source - DKMS version
ii openvswitch-switch 1.4.0-1ubuntu1.5 Open vSwitch switch implementations
ii quantum-plugin-openvswitch 7:2013.1.4.a11.gdb5bb6a-9-cisco1 Quantum is a virtual network service for Openstack - Open vSwitch plugin
ii quantum-plugin-openvswitch-agent 7:2013.1.4.a11.gdb5bb6a-9-cisco1 Quantum is a virtual network service for Openstack - Open vSwitch plugin agent
root@m2c1q5-ctrl:~#

root@m2c1q5-ctrl:~# uname -a
Linux m2c1q5-ctrl 3.2.0-57-generic #87-Ubuntu SMP Tue Nov 12 21:35:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
root@m2c1q5-ctrl:~#

I installed two other g.3 test beds about four weeks ago and did not see this issue.

root@m2c1q5-ctrl:~# tail -f /var/log/syslog
Dec 9 17:09:23 m2c1q5-ctrl puppet-agent[1379]: (/Stage[main]/Nova::Cert/Nova::Generic_service[cert]/Service[nova-cert]) Triggered 'refresh' from 1 events
Dec 9 17:09:23 m2c1q5-ctrl puppet-agent[1379]: Finished catalog run in 139.84 seconds
Dec 9 17:17:01 m2c1q5-ctrl CRON[19722]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 9 17:17:40 m2c1q5-ctrl ovs-vsctl: 00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl -- --may-exist add-port br-int qr-5aacfb0a-d1 -- set Interface qr-5aacfb0a-d1 type=internal -- set Interface qr-5aacfb0a-d1 external-ids:iface-id=5aacfb0a-d15a-470b-beb2-be1b24564769 -- set Interface qr-5aacfb0a-d1 external-ids:iface-status=active -- set Interface qr-5aacfb0a-d1 external-ids:attached-mac=fa:16:3e:2c:9c:00
Dec 9 17:17:42 m2c1q5-ctrl kernel: <6>[ ...] ip6_tables: (C) 2000-2006 Netfilter Core Team
Dec 9 17:17:42 m2c1q5-ctrl ovs-vsctl: 00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=2 set Port qr-5aacfb0a-d1 tag=1
Dec 9 17:17:43 m2c1q5-ctrl kernel: [ 3520.116842] general protection fault: 0000 [#1] SMP
Dec 9 17:17:43 m2c1q5-ctrl kernel: [ 3520.117055] CPU 3
Dec 9 17:17:43 m2c1q5-ctrl kernel: [ 3520.117132] Modules linked in: ip6table_filter ip6_tables ipt_REDIRECT veth openvswitch(O) xt_tcpudp iptable_filter iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables vesafb 8021q joydev usbhid hid wmi [...] acpi_power_meter mac_hid lp parport enic megaraid_sas
[... the rest of the oops (registers, stack, call trace) is garbled in this syslog capture; the same oops appears cleanly in the /proc/kmsg output below ...]
Dec 9 17:17:43 m2c1q5-ctrl kernel: <0>[ 3520.405226] Kernel panic - not syncing: Fatal exception in interrupt
Dec 9 17:17:43 m2c1q5-ctrl kernel: [ 3520.410584] Pid: 28983, comm: ovs-vswitchd Tainted: G D O 3.2.0-57-generic #87-Ubuntu
Dec 9 17:17:43 m2c1q5-ctrl kernel: [ 3520.610456] [<ffffffff811beabb>] ? ep_poll+0xdb/0x380
Dec 9 17:17:43 m2c1q5-ctrl kernel: [ 3520.615579] [<ffffffff815327a9>] __sys_sendmsg+0x49/0x90

root@m2c1q5-ctrl:~# cat /proc/kmsg
<4>[ 2986.500041] init: nova-novncproxy main process (1951) killed by TERM signal
<4>[ ...] init: nova-objectstore main process (...) terminated with status 1
<4>[ 2988.212269] init: nova-consoleauth main process (2167) terminated with status 1
<4>[ 2989.115443] init: nova-scheduler main process (2390) terminated with status 1
<4>[ 2989.512955] init: nova-conductor main process (2499) terminated with status 1
<4>[ 2990.372186] init: keystone main process (2583) killed by TERM signal
<6>[ 3517.123716] device qr-5aacfb0a-d1 entered promiscuous mode
<6>[ ...] ip6_tables: (C) 2000-2006 Netfilter Core Team
<4>[ 3520.116842] general protection fault: 0000 [#1] SMP
<4>[ 3520.117055] CPU 3
<4>[ 3520.117132] Modules linked in: ip6table_filter ip6_tables ipt_REDIRECT veth openvswitch(O) xt_tcpudp iptable_filter iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables vesafb 8021q joydev usbhid hid wmi [...] acpi_power_meter mac_hid lp parport enic megaraid_sas
<4>[ ...] Pid: 28983, comm: ovs-vswitchd Tainted: G O 3.2.0-57-generic #87-Ubuntu Cisco Systems Inc UCSB-B200-M3/UCSB-B200-M3
<4>[ ...] RIP: 0010:[<ffffffffa010f748>]  [<ffffffffa010f748>] ovs_tnl_send+0x588/0xca0 [openvswitch]
<4>[ 3520.119413] RSP: 0018:ffff8817e28b5818 EFLAGS: 00010286
<4>[ 3520.119620] RAX: ffff8817deb4d624 RBX: ... RCX: ... RDX: ... RSI: ... RDI: ...
<4>[ ...] RBP: ... R08: ... R09: ... R10: ... R11: 0000000000000000 R12: ffff8817df3b3660
<4>[ 3520.120746] R13: ... R14: ffff8817deba3400 R15: 0000000000000000
<4>[ 3520.121028] FS: 00007f52aa2ec700(0000) GS:... knlGS:0000000000000000
<4>[ ...] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 3520.121573] CR2: ... CR3: ... CR4: ...
<4>[ ...] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 3520.122134] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 3520.122416] Process ovs-vswitchd (pid: 28983, threadinfo ..., task ...)
<4>[ 3520.122764] Stack:
<4>[ 3520.128787] 0000000000000001 000000000000001c 0000000000000004 ffff882fd5511000
<4>[ 3520.141506] 0000000000000000 ffff882fd55110c8 0000000000000001 0000000000000000
<4>[ ...] [remaining stack words garbled in capture]
<0>[ 3520.167480] Call Trace:
<4>[ 3520.173792] [<ffffffff8153b090>] ? skb_clone+0x40/0xb0
<4>[ 3520.180124] [<ffffffffa01108fe>] ovs_vport_send+0x1e/0x50 [openvswitch]
<4>[ ...] [<ffffffffa010726d>] do_execute_actions+0xfd/0x790 [openvswitch]
<4>[ ...] [<ffffffff811652bb>] ? __kmalloc+0x15b/0x190
<4>[ ...] [<ffffffffa010b869>] ? ovs_flow_actions_alloc+0x49/0x90 [openvswitch]
<4>[ 3520.211495] [<ffffffffa0107968>] ovs_execute_actions+0x68/0x110 [openvswitch]
<4>[ ...] [<ffffffffa0108018>] ovs_packet_cmd_execute+0x208/0x240 [openvswitch]
<4>[ ...] [<ffffffff8132b560>] ? nla_parse+0x90/0xe0
<4>[ ...] [<ffffffff81571ee5>] genl_rcv_msg+0x1d5/0x250
<4>[ 3520.249585] [<ffffffff81571d10>] ? genl_rcv+0x40/0x40
<4>[ 3520.255954] [<ffffffff815717a9>] netlink_rcv_skb+0xa9/0xd0
<4>[ 3520.262170] [<ffffffff81571cf5>] genl_rcv+0x25/0x40
<4>[ ...] [<ffffffff81571070>] netlink_unicast+0x2b0/0x300
<4>[ 3520.274386] [<ffffffff8153cbe7>] ? memcpy_fromiovec+0x67/0xb0
<4>[ 3520.280318] [<ffffffff8157139e>] netlink_sendmsg+0x2de/0x390
<4>[ 3520.286396] [<ffffffff8152ec1e>] sock_sendmsg+0x10e/0x130
<4>[ 3520.292737] [<ffffffff8118d530>] ? __pollwait+0xf0/0xf0
<4>[ 3520.298756] [<ffffffff8111b70e>] ? filemap_fault+0xee/0x3e0
<4>[ ...] [<ffffffff8117067b>] ? mem_cgroup_update_page_stat+0x2b/0x110
<4>[ ...] [<ffffffff81118efa>] ? unlock_page+0x2a/0x40
<4>[ ...] [<ffffffff8113c169>] ? __do_fault+0x439/0x550
<4>[ ...] [<ffffffff8153cee6>] ? verify_iovec+0x56/0xd0
<4>[ ...] [<ffffffff815307f6>] ___sys_sendmsg+0x396/0x3b0
<4>[ 3520.333322] [<ffffffff811409b9>] ? handle_mm_fault+0x269/0x370
<4>[ ...] [<ffffffff81665178>] ? do_page_fault+0x278/0x540
<4>[ ...] [<ffffffff811beabb>] ? ep_poll+0xdb/0x380
<4>[ ...] [<ffffffff815327a9>] __sys_sendmsg+0x49/0x90
<4>[ 3520.353918] [<ffffffff81532809>] sys_sendmsg+0x19/0x20
<4>[ ...] [<ffffffff81669b82>] system_call_fastpath+0x16/0x1b
<0>[ 3520.363966] Code: 55 b0 8d 74 0e 14 0f b6 4d cb 89 b3 c0 00 00 00 8b b8 c4 00 00 00 0f b7 75 a0 48 03 b8 d8 00 00 00 88 50 01 88 48 08 66 89 70 06 <f6> 47 06 40 74 0a f6 40 7c 01 0f 84 d0 fd ff ff 31 d2 4c 89 ee
<4>[ ...] RIP  [<ffffffffa010f748>] ovs_tnl_send+0x588/0xca0 [openvswitch]
<4>[ 3520.385256] RSP <ffff8817e28b5818>
<4>[ ...] ---[ end trace ... ]---
<4>[ 3520.420344] Call Trace:
<4>[ 3520.425213] [<ffffffff81648a17>] panic+0x91/0x1a4
<4>[ 3520.430181] [<ffffffff8166271a>] oops_end+0xea/0xf0
<4>[ 3520.435150] [<ffffffff810178b8>] die+0x58/0x90
<4>[ 3520.440085] [<ffffffff81662262>] do_general_protection+0x162/0x170
<4>[ 3520.445113] [<ffffffff81661c85>] general_protection+0x25/0x30
<4>[ 3520.450148] [<ffffffffa010f748>] ? ovs_tnl_send+0x588/0xca0 [openvswitch]
<4>[ 3520.455282] [<ffffffffa010f6eb>] ? ovs_tnl_send+0x52b/0xca0 [openvswitch]
<4>[ 3520.460375] [<ffffffff8153b090>] ? skb_clone+0x40/0xb0
<4>[ 3520.465453] [<ffffffffa01108fe>] ovs_vport_send+0x1e/0x50 [openvswitch]
<4>[ 3520.470577] [<ffffffffa010726d>] do_execute_actions+0xfd/0x790 [openvswitch]
<4>[ 3520.475751] [<ffffffff811652bb>] ? __kmalloc+0x15b/0x190
<4>[ 3520.480922] [<ffffffffa010b869>] ? ovs_flow_actions_alloc+0x49/0x90 [openvswitch]
<4>[ 3520.491354] [<ffffffffa0107968>] ovs_execute_actions+0x68/0x110 [openvswitch]
<4>[ 3520.502259] [<ffffffffa0108018>] ovs_packet_cmd_execute+0x208/0x240 [openvswitch]
<4>[ 3520.513779] [<ffffffff8132b560>] ? nla_parse+0x90/0xe0
<4>[ 3520.519756] [<ffffffff81571ee5>] genl_rcv_msg+0x1d5/0x250
<4>[ 3520.525737] [<ffffffff81571d10>] ? genl_rcv+0x40/0x40
<4>[ 3520.531571] [<ffffffff815717a9>] netlink_rcv_skb+0xa9/0xd0
<4>[ 3520.537269] [<ffffffff81571cf5>] genl_rcv+0x25/0x40
<4>[ 3520.542855] [<ffffffff81571070>] netlink_unicast+0x2b0/0x300
<4>[ 3520.548312] [<ffffffff8153cbe7>] ? memcpy_fromiovec+0x67/0xb0
<4>[ 3520.553686] [<ffffffff8157139e>] netlink_sendmsg+0x2de/0x390
<4>[ 3520.559061] [<ffffffff8152ec1e>] sock_sendmsg+0x10e/0x130
<4>[ 3520.564371] [<ffffffff8118d530>] ? __pollwait+0xf0/0xf0
<4>[ 3520.569528] [<ffffffff8111b70e>] ? filemap_fault+0xee/0x3e0
<4>[ 3520.574600] [<ffffffff8117067b>] ? mem_cgroup_update_page_stat+0x2b/0x110
<4>[ 3520.579781] [<ffffffff81118efa>] ? unlock_page+0x2a/0x40
<4>[ 3520.584906] [<ffffffff8113c169>] ? __do_fault+0x439/0x550
<4>[ 3520.590036] [<ffffffff8153cee6>] ? verify_iovec+0x56/0xd0
<4>[ 3520.595192] [<ffffffff815307f6>] ___sys_sendmsg+0x396/0x3b0
<4>[ 3520.600295] [<ffffffff811409b9>] ? handle_mm_fault+0x269/0x370
<4>[ 3520.605407] [<ffffffff81665178>] ? do_page_fault+0x278/0x540
<4>[ 3520.610456] [<ffffffff811beabb>] ? ep_poll+0xdb/0x380
<4>[ 3520.615579] [<ffffffff815327a9>] __sys_sendmsg+0x49/0x90
<4>[ 3520.620681] [<ffffffff81532809>] sys_sendmsg+0x19/0x20
<4>[ 3520.625684] [<ffffffff81669b82>] system_call_fastpath+0x16/0x1b

Revision history for this message
Sean Merrow (smerrow) wrote :

I've also hit this crash. The only difference is how I got there. I did a brand-new g.3 installation using COI on the same hardware it had previously been running fine on, also with g.3. After I created my networks, I started adding internal interfaces to my router. The first interface went fine. When I added the second interface, the controller crashed and would not recover without a power cycle. After the power cycle I get to the login prompt, but the node immediately crashes again, just as Tim described.

Yesterday, I had a colleague reach out to me with the same issue. Adding an internal interface causes the controller to crash.

Revision history for this message
Bryn Pounds (bpounds) wrote :

I may have run into the same thing. I'm running the G.3 installer, and the control node keeps crashing.

Initial install: I was following the Grizzly G.3 multi-node install instructions. I created the public network and private net10, then created the router and added interfaces; the node crashed, and it immediately crashed again on reboot.

Cleaned the node and reinstalled. This time the system allowed the creation of the networks and routers. (I had some "issues" seeing the compute nodes, so I rebooted them to make nova-compute available.) When spinning up the first instance, the control server crashed again, and immediately crashed after every subsequent reboot.

I wasn't able to capture the entire log yet (I'm remote from the system), but here is a capture of the final console screen, in case it's helpful...

Revision history for this message
Bryn Pounds (bpounds) wrote :
  • logs (9.0 MiB, application/x-tar)
Revision history for this message
Mark T. Voelker (mvoelker) wrote :

So, just to keep everyone in the loop, a few things we're seeing emerge from reports coming here and on mailing lists/Jabber/etc:

1.) Nobody had seen this on Grizzly releases (including g.3) until approximately this week (first report and duplication ~12/9).

2.) During the time between the g.3 release and the first reports of this coming in, numerous deployments have been done without seeing this.

3.) During the time between the g.3 release and the first reports of this coming in, the g.3 packages have not changed at all.

4.) The triggers aren't always the same. For example, Tim's original report indicated that he had to set up two qrouters and add a gateway to each before he saw the crash. In another report, setting up one qrouter and adding a private interface to it was enough to cause a kernel panic.

5.) The stack trace isn't always quite the same either, but it always fingers OVS.

6.) Recovery is also not consistent. In some cases folks are seeing the kernel crash almost immediately on bootup (presumably as Quantum restores state). In other incidents, a reboot or two seems to clear things up.

Our best guess at this point is that some upstream Ubuntu package update released in the past week or so may be what's causing the problem, but that's just a hypothesis for the time being until we can pin things down a little more.

Some things that you could try in the meantime if you're still stuck:

1.) Since this is a kernel crash, changing the kernel might help. If you're installing a fresh COI cloud, you can just change this line:

https://github.com/CiscoSystems/grizzly-manifests/blob/g.3/manifests/site.pp.example#L934

Uncomment it and specify the kernel as "linux-image-3.8.0-34-generic" (the latest linux-image-generic-lts-saucy kernel) or "linux-image-3.5.0-44-generic" (the latest linux-image-generic-lts-raring kernel).
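For example, the change would look roughly like this in site.pp (a sketch based on the default commented-out line; substitute the raring kernel if you prefer it):

# before (default, commented out):
#$load_kernel_pkg = 'linux-image-3.2.0-57-generic'
# after:
$load_kernel_pkg = 'linux-image-3.8.0-34-generic'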

If you're trying to get a current install back on its feet and can keep the box alive long enough to install packages, just run "apt-get update" followed by "apt-get install linux-image-3.8.0-34-generic" and reboot.
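In other words, roughly (the headers package is only needed if you use DKMS modules such as openvswitch-datapath-dkms):

apt-get update
apt-get install linux-image-3.8.0-34-generic linux-headers-3.8.0-34-generic
reboot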

If you're trying to get a current install back on its feet and your machine keeps crashing on bootup, you have a couple of options. You can download linux-image-3.8.0-34-generic and linux-headers-3.8.0-34-generic to a USB stick, boot the server with a rescue disk (or boot into single-user mode by adding "single" to the kernel parameters from the GRUB prompt), then mount the USB stick and install the packages with "dpkg -i insert_package_name_here"; the steps are sketched below. Alternatively, you may be able to boot into single-user mode or from a rescue disk, disable Quantum from running at bootup (disable the puppet agent too, or it'll start Quantum up again the first time it does a catalog run), and then use apt-get to upgrade the kernel.
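For the USB stick route, the recovery steps are roughly as follows (the device name and mount point are examples; adjust for your system):

mount /dev/sdb1 /mnt
dpkg -i /mnt/linux-image-3.8.0-34-generic_*.deb /mnt/linux-headers-3.8.0-34-generic_*.deb
# optionally keep Quantum and the puppet agent from starting at boot
# (Upstart job names may vary by install):
echo manual > /etc/init/quantum-plugin-openvswitch-agent.override
echo manual > /etc/init/puppet.override
reboot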

2.) Along similar lines, you may wish to experiment with slightly older openvswitch-switch packages.

Revision history for this message
Mark T. Voelker (mvoelker) wrote :

I'll also note that we're seeing this on a variety of hardware (UCS C-Series and B-Series, Intel NICs and Cisco Palo NICs, etc.).

Revision history for this message
Mark T. Voelker (mvoelker) wrote :

One further shot-in-the-dark that's even easier to try:

https://github.com/CiscoSystems/grizzly-manifests/blob/g.3/manifests/site.pp.example#L942

We use elevator=deadline by default on the kernel command line to avoid a known issue with iSCSI-backed Cinder volumes. This should have no impact whatsoever on OVS... but at the end of the day it's the kernel, and sometimes subsystems cross paths in strange ways. If you're currently using the deadline elevator, try removing "elevator=deadline" from the kernel command line by editing it from GRUB. As I say: I'm very doubtful this will help, but it only takes seconds to try, so it might make for an interesting data point.
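If you want to try it, the edit only affects a single boot:

# at the GRUB menu, highlight the boot entry and press 'e',
# delete "elevator=deadline" from the line beginning with "linux",
# then press Ctrl-X (or F10) to boot the modified entry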

Revision history for this message
Mark T. Voelker (mvoelker) wrote :

Hi folks,

Another update: it's looking increasingly likely that this crash may be related to an upstream kernel update that Ubuntu published on or around December 2. Has anyone been able to reproduce the crash on any of these kernels?

* linux-image-3.2.0-56-generic (released around 2013-11-07)
* linux-image-3.2.0-55-generic (released around 2013-10-21)
* Any quantal/raring/saucy backport kernel (3.5.x/3.8.x/3.11.x)

If you're looking for a workaround, downgrading to 3.2.0-56 or 3.2.0-55 (or upgrading to a backport such as 3.11.0-14 or 3.8.0-34 or 3.5.0-44) is probably your best bet right now.
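For example, to drop a box back to 3.2.0-55 by hand (a sketch; assumes the older package is still available to apt):

apt-get update
apt-get install linux-image-3.2.0-55-generic linux-headers-3.2.0-55-generic
reboot   # then pick the 3.2.0-55 entry from the GRUB menu if it isn't the default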

Revision history for this message
Sean Merrow (smerrow) wrote :

Thank you very much for the input and suggested workaround, Mark.

I was able to successfully build out my cloud, then create a router with an external gateway and several internal interfaces, after changing the following line in my site.pp from:

#$load_kernel_pkg = 'linux-image-3.2.0-57-generic'

...to ...

$load_kernel_pkg = 'linux-image-3.2.0-55-generic'

So it does appear that this problem is a kernel issue that began in 3.2.0-56 or 3.2.0-57.

Thanks,
Sean

Revision history for this message
Mark T. Voelker (mvoelker) wrote :

Ok, I've now done some smoke testing on several of the quantal/raring/saucy backport kernels as well as 3.2.0-56 and 3.2.0-55 and have been unable to reproduce the crash on any of them. Only 3.2.0-57 seems to trigger this crash (and it's very easy to reproduce). This seems very likely to be a kernel bug specifically with 3.2.0-57, so we'll have to take this upstream and have the Ubuntu folks look into it. In the meantime, using any of the other kernels mentioned in this thread should be a valid workaround.

Revision history for this message
Chris Ricker (chris-ricker) wrote :

For an additional data point: I cannot reproduce with the problematic kernel (3.2.0-57) when using havana and OVS 1.10. So this is possibly specific to the combination of OVS 1.4 and kernel 3.2.0-57.

Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

I encountered this today with kernel 3.2.0-58, openvswitch 1.4.0-1ubuntu1.5 and openstack 1:2013.1.4-0ubuntu1~cloud0. This was on a newly built compute node. The kernel panic happened as soon as quantum-plugin-openvswitch-agent had to do anything useful.

I have other working compute nodes with kernel 3.2.0-54, openvswitch 1.4.0-1ubuntu1.5 and openstack 1:2013.1.3-0ubuntu1~cloud0.

As an additional data point, I'm using Quantum with GRE tunneling enabled and am using openvswitch-datapath-dkms.

Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

Note that after replacing the kernel with 3.2.0-54, everything seems to be working just fine. This rules out the openstack 1:2013.1.4-0ubuntu1~cloud0 packages as the cause.
