[library] l3-agent does not create namespace after full cluster hard reset

Bug #1340989 reported by Vladimir Kuklin
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Vladimir Kuklin

Bug Description

{"build_id": "2014-07-10_02-01-14", "ostf_sha": "f6f7cee46a85ca3e758f629c2df8b370e9de494a", "build_number": "308", "auth_required": false, "api": "1.0", "nailgun_sha": "745cf21a85a238a62d339e517258eb475fb0603e", "production": "docker", "fuelmain_sha": "11552f9b70f60ae4be4a4f6a68fa0291298c4e00", "astute_sha": "c0ffd4fa1b1ea16931f174a7f4efeac701ec23e6", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "c7c47fb692846dd1fb2e50661c6e95e0545b09ab"} + https://review.openstack.org/#/c/106061/ + https://review.openstack.org/#/c/106363/ + https://review.openstack.org/#/q/status:open+project:stackforge/fuel-library+branch:master+topic:bug1339080,n,z

Make a hard reset of the whole cluster.

RabbitMQ and Galera clusters reassemble.

Nova works just fine.

But the l3-agent/dhcp namespaces are not created, and there can also be some glitches with the OVS agent. If I restart the broken agent, everything works.

There may be several reasons:
1) the oslo.messaging bug https://bugs.launchpad.net/nova/+bug/856764
2) wrong message rescheduling for the l3-agent
3) neutron services possibly need to start after rabbitmq?
4) environment performance problems, since the load average right after start on my 1-vCPU VMs is >20
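
For reference, the check-and-restart workaround mentioned above would look roughly like this on a controller (the Pacemaker resource names are taken from later comments in this report and may differ per environment):

# verify the l3/dhcp namespaces exist on the controllers
ip netns | grep -E 'qrouter|qdhcp'
# restart the OCF-managed neutron agents via Pacemaker
crm resource restart p_neutron-l3-agent
crm resource restart p_neutron-dhcp-agent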

Tags: ha
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Aleksandr Didenko (adidenko)
Dmitry Ilyin (idv1985)
summary: - l3-agent does not create namespace after full cluster hard reset
+ [library] l3-agent does not create namespace after full cluster hard
+ reset
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tried to reproduce it, but in my case the Galera cluster got messed up after the reset, and the Galera problems caused failures in neutron-server. So I guess there is no point in trying to reproduce it without https://review.openstack.org/#/c/106516/ . I'm going to try with the new Galera OCF script now.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Checked on CentOS with https://review.openstack.org/#/c/106516/ - looks good. After a hard reset of all nodes the namespaces are in place.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

There is an odd situation on Ubuntu envs after a hard reset - the load grows huge on all controllers due to software interrupts:

node-1 (controller):

top - 12:47:39 up 15 min, 2 users, load average: 47.71, 35.17, 19.44
Tasks: 168 total, 11 running, 153 sleeping, 0 stopped, 4 zombie
Cpu(s): 2.2%us, 1.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 95.5%si, 0.4%st

node-5 (controller):

top - 12:50:11 up 17 min, 1 user, load average: 13.91, 24.13, 18.71
Tasks: 158 total, 3 running, 155 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.2%us, 1.8%sy, 0.0%ni, 4.4%id, 0.0%wa, 0.0%hi, 90.9%si, 0.7%st

So on my envs the cluster is not even able to detect its own state due to the overload. Checking further.
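
A minimal way to confirm that the load really comes from software interrupts (a sketch, assuming the sysstat package is installed) is to watch the per-CPU softirq share:

# %soft column shows time spent servicing software interrupts, sampled 3 times at 1s intervals
mpstat -P ALL 1 3
# raw per-CPU softirq counters; NET_RX/NET_TX grow quickly under heavy network load
cat /proc/softirqs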

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Some additional research on newer ISO:
{
    "api": "1.0",
    "astute_sha": "fd9b8e3b6f59b2727b1b037054f10e0dd7bd37f1",
    "auth_required": false,
    "build_id": "2014-07-21_10-32-30",
    "build_number": "340",
    "feature_groups": [
        "mirantis"
    ],
    "fuellib_sha": "1ec799bc6c8b08b8c9c6243c426507cb7a46459b",
    "fuelmain_sha": "539a5bf7493a5d14690a34bb18c3ad1c75b4f37f",
    "nailgun_sha": "bdd0bdec2b45eea843d559b7648bd5dca4873c66",
    "ostf_sha": "9863db951a6e159f4fa6e6861c8331e1af069cf8",
    "production": "docker",
    "release": "5.1"
}

After a hard reset of all nodes it took ~30 minutes to bring the cluster back into one piece (crm stopped reporting offline nodes, load went back to normal, etc). During those 30 minutes there was high load due to "software interrupts", which was even causing packet loss according to ping. It finally ended up with the following errors in crm:

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started (unmanaged) FAILED [ node-3 node-2 ]
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-3 (unmanaged) FAILED

netns-es were OK:

root@node-1:~# ip netns
qrouter-7fa8e3cf-5e67-48eb-852a-cb0a57006f79
haproxy

root@node-2:~# ip netns
qdhcp-16b365ed-b181-49c9-a8a2-0e69bae788ff
haproxy

After "crm resource cleanup" for those problem vip resources, env got back to operating state. It was able to successfully pass OSTF, except "RabbitMQ availability" test. After "crm resource restart master_p_rabbitmq-server" rabbit got back to normal state as well.

Attaching snapshot, additional info with pacemaker logs and rabbit statuses to follow.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Additional info with pacemaker logs and rabbit statuses

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Huge "software interrupts" load is generated by "ovs-vswitchd" process and extremely high network activity (GRE multicasts) on the following interfaces:

NET | br-mgmt ---- | pcki 126817 | pcko 309282 | | si 44 Mbps | so 108 Mbps | coll 0 | mlti 0 | erri 0 | | erro 0 | drpi 24073 | drpo 0
NET | eth2 ---- | pcki 127009 | pcko 285277 | | si 44 Mbps | so 100 Mbps | coll 0 | mlti 0 | erri 0 | | erro 0 | drpi 0 | drpo 0

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tcpdump output example on eth2:

14:13:18.323037 IP 10.109.2.4 > 10.109.2.7: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323038 IP 10.109.2.4 > 10.109.2.6: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323039 IP 10.109.2.4 > 10.109.2.3: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323039 IP 10.109.2.4 > 10.109.2.5: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323041 IP 10.109.2.4 > 10.109.2.6: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323042 IP 10.109.2.4 > 10.109.2.5: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28

Please note the frequency (timestamps).

As soon as this traffic disappears, the load (including software interrupts) goes back to normal and the bandwidth drops from 100+ Mbps to 70-300 Kbps on eth2 and br-mgmt.
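
A capture like the one above can be reproduced with plain tcpdump on the GRE-carrying interface; GRE is IP protocol 47, so a filter along these lines (interface name as in the comment above) would show the encapsulated multicast traffic:

# print GRE-encapsulated packets on eth2; stop after 100 packets
tcpdump -n -i eth2 -c 100 'ip proto 47'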

Changed in fuel:
status: New → Confirmed
Changed in fuel:
importance: High → Critical
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

It looks like the problem with the ICMP6 flood, and thus the software interrupt overload, is isolated to my libvirt env. I was not able to reproduce it on a hardware env or on other libvirt envs. So lowering the priority back to "high".

Changed in fuel:
importance: Critical → High
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tested on HW and on a libvirt env other than mine - looks good: IP namespaces are always there, nova services and neutron agents are alive, etc.

Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Artem Panchenko (apanchenko-8)
tags: added: ha
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Couldn't reproduce this issue on bare metal:

api: '1.0'
astute_sha: b16efcec6b4af1fb8669055c053fbabe188afa67
auth_required: false
build_id: 2014-07-30_02-30-36
build_number: '373'
feature_groups:
- mirantis
fuellib_sha: 8729e696e0653920bf937329e45a9c23a8f20a1f
fuelmain_sha: 11ef72a20409ba34535ec9e6e093a2e1695161de
nailgun_sha: 8cf375f7687d7d0797e7f085a909df8087fc82a6
ostf_sha: 9c0454b2197756051fc9cee3cfd856cf2a4f0875
production: docker
release: '5.1'

Ubuntu + HA + NeutronGre

IP namespaces after deployment:

 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-8
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-2

[root@fuel-lab-cz5551 tmp]# fuel nodes | awk '/ready.*controller/ {print $1}' | xargs -n1 -i ssh node-{} ip netns
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
qrouter-6bd120f5-6292-46f5-a31a-3f0ecafb284e
haproxy
Warning: Permanently added 'node-8' (RSA) to the list of known hosts.
qdhcp-974ffa2a-29da-420e-b485-9372ca134f25
haproxy
Warning: Permanently added 'node-11' (RSA) to the list of known hosts.
haproxy

and after cluster restart (all nodes were simultaneously reset via IPMI):

 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-2
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-11

[root@fuel-lab-cz5551 tmp]# fuel nodes | awk '/ready.*controller/ {print $1}' | xargs -n1 -i ssh node-{} ip netns
Warning: Permanently added 'node-8' (RSA) to the list of known hosts.
haproxy
Warning: Permanently added 'node-11' (RSA) to the list of known hosts.
qrouter-6bd120f5-6292-46f5-a31a-3f0ecafb284e
haproxy
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
qdhcp-974ffa2a-29da-420e-b485-9372ca134f25
haproxy

Changed in fuel:
status: In Progress → Incomplete
assignee: Artem Panchenko (apanchenko-8) → Vladimir Kuklin (vkuklin)
Changed in fuel:
status: Incomplete → Invalid