[library] l3-agent does not create namespace after full cluster hard reset

Bug #1340989 reported by Vladimir Kuklin
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Vladimir Kuklin

Bug Description

{"build_id": "2014-07-10_02-01-14", "ostf_sha": "f6f7cee46a85ca3e758f629c2df8b370e9de494a", "build_number": "308", "auth_required": false, "api": "1.0", "nailgun_sha": "745cf21a85a238a62d339e517258eb475fb0603e", "production": "docker", "fuelmain_sha": "11552f9b70f60ae4be4a4f6a68fa0291298c4e00", "astute_sha": "c0ffd4fa1b1ea16931f174a7f4efeac701ec23e6", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "c7c47fb692846dd1fb2e50661c6e95e0545b09ab"} + https://review.openstack.org/#/c/106061/ + https://review.openstack.org/#/c/106363/ + https://review.openstack.org/#/q/status:open+project:stackforge/fuel-library+branch:master+topic:bug1339080,n,z

Make a hard reset of the whole cluster.

RabbitMQ and Galera clusters reassemble.

Nova works just fine.

But the l3-agent/dhcp namespaces are not created, and there can also be some glitches with the OVS agent. If I restart the broken agent, everything works.

There may be several reasons:
1) the oslo.messaging bug https://bugs.launchpad.net/nova/+bug/856764
2) wrong message rescheduling for the l3-agent
3) neutron services possibly need to start after rabbitmq?
4) environment performance problems, since the load average right after start on my 1-vCPU VMs is >20
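
For reference, the check-and-restart workaround mentioned above would look roughly like this on a controller (the Pacemaker resource names are taken from later comments in this report and may differ per environment):

# verify the l3/dhcp namespaces exist on the controllers
ip netns | grep -E 'qrouter|qdhcp'
# restart the OCF-managed neutron agents via Pacemaker
crm resource restart p_neutron-l3-agent
crm resource restart p_neutron-dhcp-agent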

Tags: ha
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Aleksandr Didenko (adidenko)
Dmitry Ilyin (idv1985)
summary: - l3-agent does not create namespace after full cluster hard reset
+ [library] l3-agent does not create namespace after full cluster hard
+ reset
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tried to reproduce it, but in my case the Galera cluster got messed up after the reset, and the Galera problems caused failures in neutron-server. So I guess there is no point in trying to reproduce it without https://review.openstack.org/#/c/106516/ . I'm going to try with the new Galera OCF script now.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Checked on CentOS with https://review.openstack.org/#/c/106516/ - looks good. After a hard reset of all nodes the namespaces are in place.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

There is an odd situation on Ubuntu envs after a hard reset - the load grows huge on all controllers due to software interrupts:

node-1 (controller):

top - 12:47:39 up 15 min, 2 users, load average: 47.71, 35.17, 19.44
Tasks: 168 total, 11 running, 153 sleeping, 0 stopped, 4 zombie
Cpu(s): 2.2%us, 1.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 95.5%si, 0.4%st

node-5 (controller):

top - 12:50:11 up 17 min, 1 user, load average: 13.91, 24.13, 18.71
Tasks: 158 total, 3 running, 155 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.2%us, 1.8%sy, 0.0%ni, 4.4%id, 0.0%wa, 0.0%hi, 90.9%si, 0.7%st

So on my envs the cluster is not even able to detect its own state due to the overload. Checking further.
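
A minimal way to confirm that the load really comes from software interrupts (a sketch, assuming the sysstat package is installed) is to watch the per-CPU softirq share:

# %soft column shows time spent servicing software interrupts, sampled 3 times at 1s intervals
mpstat -P ALL 1 3
# raw per-CPU softirq counters; NET_RX/NET_TX grow quickly under heavy network load
cat /proc/softirqs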

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Some additional research on newer ISO:
{
    "api": "1.0",
    "astute_sha": "fd9b8e3b6f59b2727b1b037054f10e0dd7bd37f1",
    "auth_required": false,
    "build_id": "2014-07-21_10-32-30",
    "build_number": "340",
    "feature_groups": [
        "mirantis"
    ],
    "fuellib_sha": "1ec799bc6c8b08b8c9c6243c426507cb7a46459b",
    "fuelmain_sha": "539a5bf7493a5d14690a34bb18c3ad1c75b4f37f",
    "nailgun_sha": "bdd0bdec2b45eea843d559b7648bd5dca4873c66",
    "ostf_sha": "9863db951a6e159f4fa6e6861c8331e1af069cf8",
    "production": "docker",
    "release": "5.1"
}

After a hard reset of all nodes it took ~30 minutes to bring the cluster back into one piece (crm stopped reporting offline nodes, load went back to normal, etc). During those 30 minutes there was high load due to "software interrupts", which was even causing packet loss according to ping. It finally ended up with the following errors in crm:

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started (unmanaged) FAILED [ node-3 node-2 ]
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-3 (unmanaged) FAILED

netns-es were OK:

root@node-1:~# ip netns
qrouter-7fa8e3cf-5e67-48eb-852a-cb0a57006f79
haproxy

root@node-2:~# ip netns
qdhcp-16b365ed-b181-49c9-a8a2-0e69bae788ff
haproxy

After "crm resource cleanup" for those problem vip resources, env got back to operating state. It was able to successfully pass OSTF, except "RabbitMQ availability" test. After "crm resource restart master_p_rabbitmq-server" rabbit got back to normal state as well.

Attaching snapshot, additional info with pacemaker logs and rabbit statuses to follow.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Additional info with pacemaker logs and rabbit statuses

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Huge "software interrupts" load is generated by "ovs-vswitchd" process and extremely high network activity (GRE multicasts) on the following interfaces:

NET | br-mgmt ---- | pcki 126817 | pcko 309282 | | si 44 Mbps | so 108 Mbps | coll 0 | mlti 0 | erri 0 | | erro 0 | drpi 24073 | drpo 0
NET | eth2 ---- | pcki 127009 | pcko 285277 | | si 44 Mbps | so 100 Mbps | coll 0 | mlti 0 | erri 0 | | erro 0 | drpi 0 | drpo 0

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tcpdump output example on eth2:

14:13:18.323037 IP 10.109.2.4 > 10.109.2.7: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323038 IP 10.109.2.4 > 10.109.2.6: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323039 IP 10.109.2.4 > 10.109.2.3: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323039 IP 10.109.2.4 > 10.109.2.5: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323041 IP 10.109.2.4 > 10.109.2.6: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28
14:13:18.323042 IP 10.109.2.4 > 10.109.2.5: GREv0, key=0x0, length 98: IP6 fe80::ec12:64ff:fe76:3d82 > ff02::16: HBH ICMP6, multicast listener report v2, 1 group record(s), length 28

Please note the frequency (timestamps).

As soon as this traffic disappears, the load (including software interrupts) goes back to normal and the bandwidth drops from 100+ Mbps to 70-300 Kbps on eth2 and br-mgmt.
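
A capture like the one above can be reproduced with plain tcpdump on the GRE-carrying interface; GRE is IP protocol 47, so a filter along these lines (interface name as in the comment above) would show the encapsulated multicast traffic:

# print GRE-encapsulated packets on eth2; stop after 100 packets
tcpdump -n -i eth2 -c 100 'ip proto 47'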

Changed in fuel:
status: New → Confirmed
Changed in fuel:
importance: High → Critical
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

It looks like the problem with the ICMP6 flood, and thus the software interrupt overload, is isolated to my libvirt env. I was not able to reproduce it on a hardware env or on other libvirt envs. So lowering the priority back to "high".

Changed in fuel:
importance: Critical → High
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tested on HW and on a libvirt env other than mine - looks good: IP namespaces are always there, nova services and neutron agents are alive, etc.

Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Artem Panchenko (apanchenko-8)
tags: added: ha
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Couldn't reproduce this issue on bare metal:

api: '1.0'
astute_sha: b16efcec6b4af1fb8669055c053fbabe188afa67
auth_required: false
build_id: 2014-07-30_02-30-36
build_number: '373'
feature_groups:
- mirantis
fuellib_sha: 8729e696e0653920bf937329e45a9c23a8f20a1f
fuelmain_sha: 11ef72a20409ba34535ec9e6e093a2e1695161de
nailgun_sha: 8cf375f7687d7d0797e7f085a909df8087fc82a6
ostf_sha: 9c0454b2197756051fc9cee3cfd856cf2a4f0875
production: docker
release: '5.1'

Ubuntu + HA + NeutronGre

IP namespaces after deployment:

 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-8
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-2

[root@fuel-lab-cz5551 tmp]# fuel nodes | awk '/ready.*controller/ {print $1}' | xargs -n1 -i ssh node-{} ip netns
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
qrouter-6bd120f5-6292-46f5-a31a-3f0ecafb284e
haproxy
Warning: Permanently added 'node-8' (RSA) to the list of known hosts.
qdhcp-974ffa2a-29da-420e-b485-9372ca134f25
haproxy
Warning: Permanently added 'node-11' (RSA) to the list of known hosts.
haproxy

and after cluster restart (all nodes were simultaneously reset via IPMI):

 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-2
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-11

[root@fuel-lab-cz5551 tmp]# fuel nodes | awk '/ready.*controller/ {print $1}' | xargs -n1 -i ssh node-{} ip netns
Warning: Permanently added 'node-8' (RSA) to the list of known hosts.
haproxy
Warning: Permanently added 'node-11' (RSA) to the list of known hosts.
qrouter-6bd120f5-6292-46f5-a31a-3f0ecafb284e
haproxy
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
qdhcp-974ffa2a-29da-420e-b485-9372ca134f25
haproxy

Changed in fuel:
status: In Progress → Incomplete
assignee: Artem Panchenko (apanchenko-8) → Vladimir Kuklin (vkuklin)
Changed in fuel:
status: Incomplete → Invalid