OVS causes kernel oops (SMP, unable to handle kernel paging request) on one compute node in a Fuel 7.0 HA Neutron+GRE deployment.

Bug #1505907 reported by Rich
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Ivan Suzdal
Milestone: 8.0

Bug Description

This deployment is our second round of testing and is being deployed as part of our production network, using hardware we already had on site.

Specific steps for the attempted deployment using Fuel 7.0:
Create a new cluster selecting neutron with gre tunneling.
Deploy three controllers, two ceph storage nodes, four compute nodes (see hardware of nodes for specific failure).
Options changed from default include:
Select Neutron L2 population and Neutron DVR. Select HTTPS and TLS.
Verify networks & deploy cluster.

Expected result:
Deployment succeeds on all nodes.

Actual result:
Deployment fails due to timeout.
Deployment hangs at "(/Stage[main]/Main/L23network::L2::Bridge[br-floating]/L2_bridge[br-floating]) Starting to evaluate the resource" on the supermicro compute node (see below).

Workaround:
Remove the supermicro node and redeploy.

Impact:
Deployment cannot be completed with the desired hardware.

Description of environment:
Fuel 7, Kilo on Ubuntu 14.04, HA, Neutron+GRE

Fuel version:

{"build_id": "301", "build_number": "301", "release_versions": {"2015.1.0-7.0": {"VERSION": {"build_id": "301", "build_number": "301", "api": "1.0", "fuel-library_sha": "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd", "nailgun_sha": "4162b0c15adb425b37608c787944d1983f543aa8", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "50e90af6e3d560e9085ff71d2950cfbcca91af67", "production": "docker", "python-fuelclient_sha": "486bde57cda1badb68f915f66c61b544108606f3", "astute_sha": "6c5b73f93e24cc781c809db9159927655ced5012", "fuel-ostf_sha": "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c", "release": "7.0", "fuelmain_sha": "a65d453215edb0284a2e4761be7a156bb5627677"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd", "nailgun_sha": "4162b0c15adb425b37608c787944d1983f543aa8", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "50e90af6e3d560e9085ff71d2950cfbcca91af67", "production": "docker", "python-fuelclient_sha": "486bde57cda1badb68f915f66c61b544108606f3", "astute_sha": "6c5b73f93e24cc781c809db9159927655ced5012", "fuel-ostf_sha": "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c", "release": "7.0", "fuelmain_sha": "a65d453215edb0284a2e4761be7a156bb5627677"}

All nodes, with the exception of one compute node, are PowerEdge T410s with an Intel E5620 processor, 32 GB DDR3-1333 ECC RAM, a dual-port Broadcom NetExtreme 3, a dual-port Intel 82576, and varying HDD configurations. Deployment to all of these nodes is successful when the Supermicro node is removed (see below).

Three of the above are compute nodes; the fourth compute node is a Supermicro H8DG6/H8DGi with an AMD Opteron 6344, 64 GB DDR3-1333 ECC RAM, a dual-port Intel 82576, a dual-port Intel I350, and 6x 146 GB 15k SAS drives in RAID 10. (Hardware dumps included.)

On this Supermicro compute node, OVS processes begin to hang, starting with
(/Stage[main]/Main/L23network::L2::Bridge[br-fw-admin]/L2_bridge[br-fw-admin]) Starting to evaluate the resource
For each bridge, the new OVS process hangs, until the kernel oopses at
(/Stage[main]/Main/L23network::L2::Bridge[br-floating]/L2_bridge[br-floating]) Starting to evaluate the resource
where the deployment hangs.

Here is the dmesg output:

[ 358.721109] gre: GRE over IPv4 demultiplexor driver
[ 358.721314] openvswitch: module verification failed: signature and/or required key missing - tainting kernel
[ 358.721969] openvswitch: Open vSwitch switching datapath 2.3.1, built Oct 13 2015 20:19:34
[ 362.877270] bonding: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
[ 363.028297] Bridge firewalling registered
[ 363.179052] 8021q: 802.1Q VLAN Support v1.8
[ 363.179075] 8021q: adding VLAN 0 to HW filter on device eth2
[ 364.374112] device eth2 entered promiscuous mode
[ 364.377476] br-fw-admin: port 1(eth2) entered forwarding state
[ 364.377512] br-fw-admin: port 1(eth2) entered forwarding state
[ 379.429473] br-fw-admin: port 1(eth2) entered forwarding state
[ 397.631393] IPv6: ADDRCONF(NETDEV_UP): eth3: link is not ready
[ 397.631402] 8021q: adding VLAN 0 to HW filter on device eth3
[ 397.634577] device eth3.101 entered promiscuous mode
[ 397.637758] device eth3 entered promiscuous mode
[ 397.637955] IPv6: ADDRCONF(NETDEV_UP): eth3.101: link is not ready
[ 399.834616] igb: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 399.834902] IPv6: ADDRCONF(NETDEV_CHANGE): eth3: link becomes ready
[ 399.835237] IPv6: ADDRCONF(NETDEV_CHANGE): eth3.101: link becomes ready
[ 399.835312] br-mgmt: port 1(eth3.101) entered forwarding state
[ 399.835338] br-mgmt: port 1(eth3.101) entered forwarding state
[ 410.142175] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 410.142182] 8021q: adding VLAN 0 to HW filter on device eth1
[ 410.146786] device eth1.103 entered promiscuous mode
[ 410.148362] device eth1 entered promiscuous mode
[ 410.148541] IPv6: ADDRCONF(NETDEV_UP): eth1.103: link is not ready
[ 413.485542] igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 413.485723] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 413.486055] IPv6: ADDRCONF(NETDEV_CHANGE): eth1.103: link becomes ready
[ 413.486128] br-storage: port 1(eth1.103) entered forwarding state
[ 413.486145] br-storage: port 1(eth1.103) entered forwarding state
[ 414.893500] br-mgmt: port 1(eth3.101) entered forwarding state
[ 422.624921] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 422.624930] 8021q: adding VLAN 0 to HW filter on device eth0
[ 422.628780] device eth0.104 entered promiscuous mode
[ 422.630994] device eth0 entered promiscuous mode
[ 422.631183] IPv6: ADDRCONF(NETDEV_UP): eth0.104: link is not ready
[ 422.878019] device ovs-system entered promiscuous mode
[ 422.878166] BUG: unable to handle kernel paging request at 0000000000001e08
[ 422.880136] IP: [<ffffffff81158fae>] __alloc_pages_nodemask+0x8e/0xb80
[ 422.881775] PGD 7fb708067 PUD 7fa6d4067 PMD 0
[ 422.882814] Oops: 0000 [#1] SMP
[ 422.883771] Modules linked in: 8021q garp mrp bridge stp llc bonding openvswitch(OX) gre vxlan ip_tunnel iptable_filter ip_tables x_tables kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper joydev cryptd amd64_edac_mod shpchp edac_core edac_mce_amd k10temp fam15h_power i2c_piix4 serio_raw mac_hid nf_conntrack_proto_gre nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack nls_utf8 isofs xfs libcrc32c raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 hid_generic igb raid0 i2c_algo_bit pata_acpi usbhid multipath dca linear mpt2sas ahci usb_storage pata_atiixp psmouse raid_class ptp hid libahci scsi_transport_sas pps_core
[ 422.904913] CPU: 6 PID: 25833 Comm: ovs-vswitchd Tainted: G OX 3.13.0-65-generic #105-Ubuntu
[ 422.907350] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.5a 08/01/2015
[ 422.909653] task: ffff8807fa2ce000 ti: ffff8807f85a6000 task.ti: ffff8807f85a6000
[ 422.911712] RIP: 0010:[<ffffffff81158fae>] [<ffffffff81158fae>] __alloc_pages_nodemask+0x8e/0xb80
[ 422.914120] RSP: 0018:ffff8807f85a76d0 EFLAGS: 00010246
[ 422.915287] RAX: 0000000000001e00 RBX: 00000000002012d0 RCX: 0000000000000000
[ 422.916980] RDX: 0000000000001e00 RSI: 0000000000000000 RDI: 00000000002012d0
[ 422.918874] RBP: ffff8807f85a77f8 R08: 0000000040000000 R09: ffffea001fdf1de0
[ 422.920900] R10: ffffffffa0351100 R11: ffff880807400f90 R12: 0000000000000080
[ 422.922763] R13: 00000000002012d0 R14: 0000000000000000 R15: 0000000000000000
[ 422.924675] FS: 00007f1de84cf980(0000) GS:ffff880807980000(0000) knlGS:0000000000000000
[ 422.926799] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 422.928325] CR2: 0000000000001e08 CR3: 00000007fa27b000 CR4: 00000000000407e0
[ 422.930370] Stack:
[ 422.930899] ffff8807f8427000 ffff8807f7cb8300 ffffffff81ce4260 ffff8807fa267800
[ 423.045957] ffff8807fa267a58 ffff8807f85a7708 ffffffff816c8f5a ffff8807f85a78c0
[ 423.161270] ffffffff81638235 0000000000000000 ffff8807fa2ce000 ffff8807fa2ce000
[ 423.276231] Call Trace:
[ 423.389802] [<ffffffff816c8f5a>] ? inet6_fill_link_af+0x1a/0x30
[ 423.502871] [<ffffffff81638235>] ? rtnl_fill_ifinfo+0x945/0xc50
[ 423.616877] [<ffffffff811a20f4>] ? deactivate_slab+0x3b4/0x440
[ 423.730458] [<ffffffff81197f83>] ? alloc_pages_current+0xa3/0x160
[ 423.844103] [<ffffffff811a080d>] new_slab+0x9d/0x320
[ 423.957402] [<ffffffff8172033b>] __slab_alloc+0x2a8/0x459
[ 424.071890] [<ffffffffa0351100>] ? ovs_flow_alloc+0x60/0x100 [openvswitch]
[ 424.187761] [<ffffffff8136dc7e>] ? memzero_explicit+0xe/0x10
[ 424.304359] [<ffffffff8147cc20>] ? extract_entropy+0xc0/0x140
[ 424.422442] [<ffffffff811a4e5c>] kmem_cache_alloc_node+0x8c/0x200
[ 424.540677] [<ffffffff811a357b>] ? kmem_cache_alloc+0x18b/0x1e0
[ 424.659841] [<ffffffffa0351100>] ovs_flow_alloc+0x60/0x100 [openvswitch]
[ 424.779445] [<ffffffffa034986c>] ovs_flow_cmd_new+0x5c/0x390 [openvswitch]
[ 424.898701] [<ffffffffa03497cd>] ? ovs_flow_cmd_del+0x19d/0x1e0 [openvswitch]
[ 425.015557] [<ffffffff81654fd3>] ? nlmsg_notify+0x93/0xb0
[ 425.131605] [<ffffffff8138d873>] ? __nla_reserve+0x43/0x50
[ 425.245315] [<ffffffff8131569d>] ? apparmor_capable+0x1d/0x130
[ 425.358257] [<ffffffff8138d6b6>] ? nla_parse+0xb6/0x120
[ 425.468903] [<ffffffff81656c2d>] genl_family_rcv_msg+0x18d/0x370
[ 425.579213] [<ffffffff81656e10>] ? genl_family_rcv_msg+0x370/0x370
[ 425.688153] [<ffffffff81656ea1>] genl_rcv_msg+0x91/0xd0
[ 425.796983] [<ffffffff81654f29>] netlink_rcv_skb+0xa9/0xc0
[ 425.910393] [<ffffffff81655428>] genl_rcv+0x28/0x40
[ 426.026483] [<ffffffff81654615>] netlink_unicast+0xd5/0x1b0
[ 426.134650] [<ffffffff816549fe>] netlink_sendmsg+0x30e/0x680
[ 426.238983] [<ffffffff816518c4>] ? netlink_rcv_wake+0x44/0x60
[ 426.342257] [<ffffffff81652932>] ? netlink_recvmsg+0x1a2/0x3a0
[ 426.445303] [<ffffffff8160e9db>] sock_sendmsg+0x8b/0xc0
[ 426.546880] [<ffffffff8160ede9>] ___sys_sendmsg+0x389/0x3a0
[ 426.646218] [<ffffffff81653443>] ? netlink_table_ungrab+0x33/0x40
[ 426.742477] [<ffffffff811a3796>] ? kmem_cache_alloc_trace+0x1c6/0x1f0
[ 426.836976] [<ffffffff8131618b>] ? apparmor_file_alloc_security+0x5b/0x180
[ 426.930216] [<ffffffff8160ef21>] ? SYSC_sendto+0x121/0x1c0
[ 427.019472] [<ffffffff811dbfe7>] ? __alloc_fd+0xa7/0x130
[ 427.106684] [<ffffffff8160fbd2>] __sys_sendmsg+0x42/0x80
[ 427.192348] [<ffffffff8160fc22>] SyS_sendmsg+0x12/0x20
[ 427.277234] [<ffffffff81734b5d>] system_call_fastpath+0x1a/0x1f
[ 427.361734] Code: c1 e8 13 41 83 e7 02 83 e0 01 41 09 c7 23 1d ba da bb 00 48 c7 45 b8 00 00 00 00 f6 c3 10 41 89 dd 0f 85 5e 02 00 00 48 8b 45 98 <48> 83 78 08 00 0f 84 a6 01 00 00 66 66 66 66 90 0f b6 4d a0 b8
[ 427.540019] RIP [<ffffffff81158fae>] __alloc_pages_nodemask+0x8e/0xb80
[ 427.629146] RSP <ffff8807f85a76d0>
[ 427.715253] CR2: 0000000000001e08
[ 427.798145] ---[ end trace d91033ced793f0ec ]---
[ 427.878825] igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 427.958666] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 428.035881] IPv6: ADDRCONF(NETDEV_CHANGE): eth0.104: link becomes ready
[ 428.111518] br-ex: port 1(eth0.104) entered forwarding state
[ 428.185472] br-ex: port 1(eth0.104) entered forwarding state
[ 428.528555] br-storage: port 1(eth1.103) entered forwarding state
[ 443.223230] br-ex: port 1(eth0.104) entered forwarding state

I have also attached the full dmesg, dmidecode, lspci, and lshw output, and everything in /var/log, in the following format: supermicro-logs.tar.gz contains the output of the above commands as <cmd>.log and everything from /var/log under the log directory. I realize now that I duplicated dmesg, and this is probably an unorthodox method; I am new to bug reporting of this magnitude, so please forgive me.

Also, I have tested the memory and all disks on this server, just to be on the safe side. All passed.

Any feedback provided will be greatly appreciated and please let me know if I can gather any more information.

There seems to now be an issue with the remote connection where I was transferring the logs from. I will attach the logs in the morning.

Alexander Kislitsky (akislitsky) wrote :

@Rich, could you provide a diagnostic snapshot, please?

tags: added: customer-found
Changed in fuel:
status: New → Incomplete
Rich (rmhayes462) wrote :

The diagnostic snapshot is still generating. I'll attach it when it's complete. For now, here is the dump of the logs and the output of the hardware commands mentioned above from the Supermicro (problem) node.

Rich (rmhayes462) wrote :

@Alexander, here is the diagnostic snapshot. The problem node is named cti05-bs02. Please let me know if you need anything else or have any questions. The upload has failed several times, so here is a Google Drive link to the snapshot.
https://drive.google.com/file/d/0B0qNjZm_G-GRZ3BxMTJ2VnJsOXM/view?usp=sharing

Changed in fuel:
milestone: none → 8.0
status: Incomplete → Confirmed
importance: Undecided → High
Matthew Mosesohn (raytrac3r) wrote :

Rich, it's possible it's a bug related to some IPv6 code in OVS. Can you retry on the same hardware, but use sysctl to set net.ipv6.conf.all.disable_ipv6 to 1?

Rich (rmhayes462) wrote :

Yes, I will try this shortly. I assume you mean to set this manually after Ubuntu is installed, or is there a method for setting it in Fuel pre-deployment?

Rich (rmhayes462) wrote :

Disabling IPv6 resulted in the exact same problem.

Steps to attempt workaround:
Begin deployment on problem node.
Immediately after Ubuntu is installed, run the following commands:
echo "net.ipv6.conf.all.disable_ipv6 = 1" >> /etc/sysctl.conf
sysctl -p /etc/sysctl.conf

Let the deployment continue.

Result:
No IPv6 addresses are assigned to interfaces, VLANs, or bridges.
Deployment hangs in the same location with the same unhandled kernel paging request.

Pavel Boldin (pboldin) wrote :

Well, this seems pretty interesting:

Oct 13 20:15:27 cti05 kernel: [ 0.000000] Early memory node ranges
Oct 13 20:15:27 cti05 kernel: [ 0.000000] node 1: [mem 0x00001000-0x00098fff]
Oct 13 20:15:27 cti05 kernel: [ 0.000000] node 1: [mem 0x00100000-0xd7e8ffff]
Oct 13 20:15:27 cti05 kernel: [ 0.000000] node 1: [mem 0x100000000-0x827ffffff]
Oct 13 20:15:27 cti05 kernel: [ 0.000000] node 3: [mem 0x828000000-0x1027ffffff]

From the kernel trace it looks like the kernel tries to allocate memory on NUMA node 2 and crashes dereferencing the zonelist, because that node's data was never allocated. This is why `__alloc_pages_nodemask` fails in the following code:

    if (unlikely(!zonelist->_zonerefs->zone))
        return NULL;

(gdb) p &((struct pglist_data*)0)->node_zonelists->_zonerefs
$2 = (struct zoneref (*)[257]) 0x1e08 <irq_stack_union+7688>
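
A quick way to confirm such a sparse layout from userspace, using standard sysfs entries and numactl (the "1,3" noted here is what the memory-node ranges above imply, not output captured from this host):

cat /sys/devices/system/node/possible    # nodes the firmware declared, e.g. 0-3
cat /sys/devices/system/node/online      # nodes that actually have memory; here 1,3
ls -d /sys/devices/system/node/node*     # one directory per instantiated node
numactl --hardware                       # same data, plus per-node memory and CPU lists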

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: nobody → MOS Linux (mos-linux)
Rich (rmhayes462) wrote :

I suspected NUMA might be causing a problem, without diving deeper than what I've reported. I am fairly certain I tried to deploy with NUMA both enabled and disabled. I am able to double-check this now and will report back.

Pavel Boldin (pboldin) wrote :

Rich, can you please provide us with access to the machine?

Or, at least, are you able to install a default Ubuntu (via Fuel) and execute a sample program there?

Changed in fuel:
assignee: MOS Linux (mos-linux) → Ivan Suzdal (isuzdal)
Rich (rmhayes462) wrote :

I can install just the OS and run the program. At this time I can't provide direct access. I will be in the Fuel IRC channel for a few more minutes before lunch; after that I'll be back in about an hour and a half.

Pavel Boldin (pboldin) wrote :

Please execute the following command at the faulting host. This should Oops the machine:

# numactl --membind=2 dd if=/dev/zero of=/dev/null count=1 bs=128M

If this passes, then try other numbers (--membind=1 then --membind=3 and finally --membind=4).

(Install `numactl` if it is missing)
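
For convenience, the same probe can be looped over the candidate nodes (a small wrapper around Pavel's command, not a separate test; nonexistent nodes should only produce a numactl warning, while a node the kernel half-knows about is the one expected to oops the box):

for node in 2 1 3 4; do
    echo "== membind=$node =="
    numactl --membind=$node dd if=/dev/zero of=/dev/null count=1 bs=128M
done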

Rich (rmhayes462) wrote :

I am pushing out Ubuntu now. The department is locking up for lunch. I will be back in roughly an hour to run the commands.

Rich (rmhayes462) wrote :

@Pavel
The deployment from the snapshot had NUMA disabled. I am fairly certain that a previous attempt had NUMA enabled; however, I will try another deployment shortly.

I executed the above commands, which ran successfully (did not cause an Oops) for membind=1 and 3.
membind=2 produces: "Warning: node argument 2 is out of range".

I see this as expected behavior for this NUMA configuration, but I assume OVS is not expecting it.

numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 1 3
nodebind: 1 3
membind: 1 3

I will be in the fuel IRC channel while I attempt another deployment.

Rich (rmhayes462) wrote :

As expected the deployment failed with the same Oops.
I did generate a new snapshot in case there were changes elsewhere now that NUMA is enabled; here is the link.
https://drive.google.com/file/d/0B0qNjZm_G-GRaTlHbGNzTXpoaHM/view?usp=sharing

This is the parent process for the ovs-vsctl processes that error:
/usr/bin/ruby /usr/bin/puppet apply /etc/puppet/modules/osnailyfacter/modular/netconfig/netconfig.pp --modulepath=/etc/puppet/modules --logdest syslog --trace --no-report --debug --evaltrace --logdest /var/log/puppet.log

Is it possible to add numactl --membind=1,3 to the beginning of this command, either by modifying the file which generates it during deployment, or after a deployment has timed out and before redeploying, so that the change targets this node only?
The child processes should then inherit the same binding. However, I do not know whether this would alter the way ovs-vsctl behaves.
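
For illustration, the kind of wrapper being asked about would look roughly like this (untested, and with the Fuel-specific logging flags omitted; note that the oops above happens in ovs-vswitchd's own context, so binding the Puppet process and its ovs-vsctl children may not change which NUMA node the kernel datapath tries to use):

numactl --membind=1,3 --cpunodebind=1,3 \
    /usr/bin/puppet apply /etc/puppet/modules/osnailyfacter/modular/netconfig/netconfig.pp \
    --modulepath=/etc/puppet/modules --debug --evaltrace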

Andrey Korolyov (xdeller) wrote :

Possible duplicate of #1503655 - the same wild scheduler races on everything.

Matthew Mosesohn (raytrac3r) wrote :

Thanks Andrey for finding it. It looks like there is a kernel update that fixes this: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1503655

Pavel Boldin (pboldin) wrote :

@Andrey,

It is not a duplicate.

This bug is a consequence of running [1] on a sparse NUMA layout, combined with the fact that the kernel never checks whether the zonelist is there in the first place.

[1] https://github.com/openvswitch/ovs/commit/9ac56358dec1a5aa7f4275a42971f55fad1f7f35 (via Alexei Sheplyakov)

Rich (rmhayes462) wrote :

Kernel 3.19.0-31.35 results in no change. For some reason I haven't been able to locate 3.13.0-66.108 via apt. I may have to install it from source, but that will have to wait until later.
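
For reference, a sketch of the usual way to pull a 3.19 kernel onto trusty, assuming the standard Ubuntu LTS-enablement package names and reachable mirrors:

apt-get update
apt-get install -y linux-generic-lts-vivid    # HWE stack, should pull a 3.19.x kernel
reboot
uname -r                                      # confirm the running kernel afterwards
# 3.13.0-66.108, once published to trusty-updates, should appear as
# linux-image-3.13.0-66-generic / linux-image-extra-3.13.0-66-generic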

Steps to attempt workaround:
Start from a previous successful deployment that excludes the problem node.
Begin deployment of the additional compute node.
Pause the Puppet process once the OpenStack installation begins.
Upgrade to kernel 3.19.0-31.35.
Verify the kernel version after reboot.
The deployment should now be in an error state due to the hung Puppet process.
Deploy again.

Expected result:
Deployment succeeds, or at least proceeds further, based on the fix reported in #1503655 for 3.19.0-31.35.

Actual result:
Deployment has the exact same failure as found on the default kernel.

Rich (rmhayes462) wrote :

@Pavel
Apologies for the extraneous information. I started testing before your comment and did not see it until after my previous post.

Andrey Korolyov (xdeller) wrote :

So... not all sockets are populated there? It is quite interesting that vswitchd is able to trigger this so early, given that NUMA autoplacement is enabled.

Rich (rmhayes462) wrote :

All sockets are populated. However, as mentioned above, OVS is trying to reference node 2, but only NUMA nodes 1 and 3 are in use. I do not know why it is 1 and 3, but all cores are bound to these two nodes.

Andrey Korolyov (xdeller) wrote :

Quite interesting, because x86 does not allow sparse NUMA at all :). Maybe some sockets are depopulated via BIOS settings, but then one shouldn't see all of the CPUs in such a configuration either.

Please share the output of (the first one for the sake of the NUMA scheduler):
grep NUMA_BALANCING_DEFAULT /boot/config-$(uname -r)
numactl --hardware
numastat

Rich (rmhayes462) wrote :

I will check these as soon as possible, but it will most likely be later today. I am not on site at the moment.

I did have the hardware configuration checked, and it was discovered that the memory was not installed correctly (only half of the channels were populated). Not having worked directly with NUMA before, I am not sure whether this will alter the CPU bindings, but I will report back ASAP.

Andrey Korolyov (xdeller) wrote :

Yep, memory placement could shift the NUMA population, although without NUMA node 0, where most PCI devices are bound, I genuinely wonder how the node was able to start at all. Most likely node 0 was actually fed with some memory DIMMs, but the firmware formed a broken SRAT afterwards... Anyway, aligning the DIMMs according to the HMM should fix the situation. And of course other people (not me) may like to see the SRAT layout from the 'broken' config.

Rich (rmhayes462) wrote :

@Andrey

Thank you for the information.

The SRAT should be present in the dmesg logs from the second snapshot provided here (if anyone is interested).

Once that is complete I will provide the output from the NUMA commands again, and a new snapshot even if this allows OVS to work correctly, for comparison purposes.

Rich (rmhayes462) wrote :

Deployment was successful with the proper memory placement.
We now have NUMA nodes 0-3 with balanced pinning.

However, we have an undetected DIMM that I didn't notice before deployment, so I have removed the node while that is being investigated. Once the node is deployed again, I will upload a snapshot for anyone who may need the comparison.

Thank you all for your time and expertise helping us resolve this issue.

I am assuming there is the potential for someone to have a legitimate hardware setup which results in a sparse NUMA configuration, in which case OVS would hit the same problem seen here. So I will let someone who knows better decide whether to change the bug status.

Andrey Korolyov (xdeller) wrote :

Could you please try http://www.gossamer-threads.com/lists/linux/kernel/2270789 ? It is supposed to fix your problem with certain nodes being disabled.

Dmitry Pyzhov (dpyzhov)
tags: added: area-linux
Rich (rmhayes462) wrote :

Sorry for the delay. Here is the snapshot from the successful deployment if anyone is interested in it.
https://drive.google.com/open?id=0B0qNjZm_G-GRcDZpakh5bFBJaDA

@Andrey
I am not able to test this patch currently. However, we will be redeploying our environment in a week or so to accommodate some hardware and network changes. I may be able to test the patch at that time, depending on the timetable.

Could you tell me, is this the best method for testing?
Deploy the cluster with proper RAM placement (the current successful deployment).
Rebuild OVS 2.3.1 from source with the code changes and replace the deployed OVS.
Move the RAM so we get a non-optimal NUMA configuration.
See if we can boot; if so, test OVS functionality.

Or is there a way to patch OVS pre-deployment and have it pushed out with the patch in place?

Thanks.

Andrey Korolyov (xdeller) wrote :

I'd suggest rebuilding the package with the proposed patch: download https://github.com/openvswitch/ovs/archive/branch-2.3.zip, run apt-get install build-essential and apt-get build-dep openvswitch, unzip the archive, cd into the openvswitch directory, apply the patch, run boot.sh, and issue dpkg-buildpackage -b -rfakeroot -j. In short, if nothing is especially wrong with the sysctl settings for the tests, you'll get a complete package set in the upper-level directory. These packages can be pushed to a local repo according to the deployment manual, or installed by hand; then you could break the topology again. I did not expect that the nodes would be marked offline instead of being re-enumerated, so the observed behavior would probably be fixed by the mentioned patch. I don't know whether the -src is provided by Fuel and could therefore be retrieved via apt-get source.
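
Spelled out as commands, that suggestion would look roughly like this (a sketch assuming the branch-2.3 tree still ships its debian/ packaging and that the proposed patch applies cleanly; the patch path is a placeholder):

apt-get install -y build-essential fakeroot unzip wget
apt-get build-dep -y openvswitch
wget https://github.com/openvswitch/ovs/archive/branch-2.3.zip
unzip branch-2.3.zip && cd ovs-branch-2.3
patch -p1 < /path/to/proposed.patch    # the fix referenced above
./boot.sh                              # regenerate configure for a tree that lacks it
dpkg-buildpackage -b -rfakeroot -j4    # resulting .debs land in the parent directory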

Igor Marnat (imarnat) wrote :

Rich,
any updates on this bug? Does it work for you, or can we help more?
I'd suggest closing this bug unless we hear from Rich in a couple of weeks.

Sam Stoelinga (sammiestoel) wrote :

FYI, I was hit by the same bug today on a Fuel 8.0 nightly build. We had only a single NUMA node as shown by numactl. Only one of the nodes had this issue. We solved it by pulling the server and re-ordering the memory into different slots.

Rich (rmhayes462) wrote :

The bug is avoided with proper RAM configuration. Unfortunately I am not able to test the patch linked a couple of posts back. However, in most cases the bug should be avoided with proper hardware configuration. I suppose there could be a case where a channel is only partially populated, causing a sparse NUMA configuration where this bug could still be a problem; I have not tested for this. For us this issue is resolved with proper hardware configuration, and it sounds like that is true for Sam as well.

Dmitry Teselkin (teselkin-d) wrote :

This is not a bug, as it seems to be caused by incorrect hardware configuration (RAM module placement); see comments #26 and #31. Thanks to everyone who helped resolve this issue.

Changed in fuel:
status: Confirmed → Invalid