haproxy issue: backends are in DOWN state after node restarts

Bug #1618891 reported by ElenaRossokhina
This bug affects 1 person
Affects              Status      Importance  Assigned to      Milestone
Fuel for OpenStack   Incomplete  High        ElenaRossokhina
Mitaka               Invalid     High        ElenaRossokhina

Bug Description

Detailed bug description:
On the face of it, the issue looks like a duplicate of #1608561, because a node reset leads to a wrong status of backends on different nodes: http://paste.openstack.org/show/565168/
A brief discussion with @isuzdal suggests that the root cause may differ.
Found on CI: https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/44/testReport/(root)/ha_rabbitmq_stability_check/ha_rabbitmq_stability_check/

Steps to reproduce:
1. Deploy an environment with at least 3 controllers (or revert an existing snapshot)
2. Wait for mysql cluster to become active
3. Run ostf tests before destructive actions
4. Get rabbit master node
5. Move the master rabbit resource to a slave with pcs
6. Delete the pcs constraint for the rabbit resource (see the pcs sketch after this list)
7. Assert HA services ready
8. Get new rabbit master node
9. Destroy it
10. Assert HA services ready
11. Run sanity and smoke OSTF sets
12. Power on destroyed node
13. Assert HA services ready (fail)
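
For steps 5 and 6, the test likely drives pcs along these lines (a sketch, not the exact test code: master_p_rabbitmq-server is an assumption based on the usual Fuel pcs name for the RabbitMQ master/slave resource, node-2 stands for whichever slave was chosen, and <constraint-id> is a placeholder):

### run on any controller
# pcs resource move master_p_rabbitmq-server node-2
### look up the id of the location constraint the move created
# pcs constraint list --full | grep rabbitmq
### remove it, substituting the id found above
# pcs constraint remove <constraint-id>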

Expected results:
HA suite passed
Actual result:
One of the HA tests failed: "Check state of haproxy backends on controllers" (failure)
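
The failing OSTF check effectively compares haproxy's own view of its backends across the controllers. The same state can be read by hand from the stats socket (a sketch; it assumes the socket path /var/lib/haproxy/stats used on Fuel controllers, and that the server status is CSV field 18 of "show stat"):

### print proxy name, server name and status for every non-UP entry
# echo "show stat" | socat stdio /var/lib/haproxy/stats | cut -d, -f1,2,18 | grep -v UP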

Tags: area-library
tags: added: area-library
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
milestone: none → 10.0
Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Kyrylo Galanov (kgalanov) wrote :

Hi,

It looks like a haproxy bug: it keeps the backends marked DOWN even though the service responds. After a haproxy restart, all nodes go UP as they are supposed to.

# haproxy-status.sh | grep DOWN | grep keysto
keystone-1 node-4 Status: DOWN/L4TOUT Sessions: 0 Rate: 0
keystone-2 node-4 Status: DOWN/L4TOUT Sessions: 0 Rate: 0

# cat /etc/haproxy/conf.d/020-keystone-1.cfg | grep node-4
  server node-4 10.109.6.3:5000 check inter 10s fastinter 2s downinter 2s rise 30 fall 3
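### note: with "downinter 2s" and "rise 30", a backend that went DOWN is probed
### every 2s and needs 30 consecutive successes, i.e. about 60s, to come back UP;
### DOWN/L4TOUT above means even the TCP connect of the check keeps timing out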

# curl http://10.109.6.3:5000
{"versions": {"values": [{"status": "stable", "updated": "2016-04-04T00:00:00Z", "media-types": [{"base": "application/json", "type": "application/vnd.openstack.identity-v3+json"}], "id": "v3.6", "links": [{"href": "http://10.109.6.3:5000/v3/", "rel": "self"}]}, {"status": "stable", "updated": "2014-04-17T00:00:00Z", "media-types": [{"base": "application/json", "type": "application/vnd.openstack.identity-v2.0+json"}], "id": "v2.0", "links": [{"href": "http://10.109.6.3:5000/v2.0/", "rel": "self"}, {"href": "http://docs.openstack.org/", "type": "text/html", "rel": "describedby"}]}]}}

# killall haproxy

### Wait until pcs restarts haproxy

# haproxy-status.sh | grep node-4 | grep keystone
keystone-1 node-4 Status: UP/L7OK Sessions: 0 Rate: 0
keystone-2 node-4 Status: UP/L7OK Sessions: 0 Rate: 0

# dpkg -l | grep haproxy
ii haproxy 1.6.3-1~u14.04+mos2 amd64 fast and reliable load balancing reverse proxy

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Linux team, could you take a look?

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → MOS Linux (mos-linux)
status: Invalid → Confirmed
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

As far as I can remember, it's not a haproxy bug. The root cause is in how conntrackd works inside a network namespace.
If you could provide access to the reverted failed env, it might help to clarify what exactly is going wrong, and where. I tried to reproduce it locally but couldn't.
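
One way to test the conntrackd hypothesis on a broken node (a sketch, assuming conntrack-tools are installed and that haproxy runs inside the "haproxy" network namespace, as on Fuel controllers; 10.109.6.3 is the keystone backend from the transcript above):

### list tracked connections toward the backend that shows DOWN/L4TOUT
# ip netns exec haproxy conntrack -L -d 10.109.6.3
### if stale entries are black-holing the checks, flushing the table should let
### the next check succeed without restarting haproxy
# ip netns exec haproxy conntrack -F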

Changed in fuel:
assignee: MOS Linux (mos-linux) → Fuel Sustaining (fuel-sustaining-team)
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Ivan Suzdal (isuzdal)
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Unfortunately, I don't have access to the production slaves.

Changed in fuel:
assignee: Ivan Suzdal (isuzdal) → ElenaRossokhina (esolomina)
status: Confirmed → Incomplete
Revision history for this message
ElenaRossokhina (esolomina) wrote :

The env is reverted and ready for investigation.

Revision history for this message
ElenaRossokhina (esolomina) wrote :

The issue was reproduced again on CI: https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/101/testReport/(root)/ha_rabbitmq_stability_check/ha_rabbitmq_stability_check/

It is not a MySQL issue; MySQL started OK after the node reboot. Also, the logs look strange: haproxy.log on the restarted node ends earlier than on the other nodes (even though the log shows that the services recovered).

haproxy-status.sh - http://paste.openstack.org/show/586506/
full logs - https://drive.google.com/open?id=0B2ag_Bf-ShtTWm5ueWRFZUw3WWM

Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Passing the bug back to the author, because we still don't know how this bug can be reproduced.
Elena, please save the failed env and ping me or any mos-linux team member if the bug occurs again.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Moved to "Invalid" because I didn't find the env with issue.
