haproxy issue: backends are in DOWN state after node restarts

Bug #1618891 reported by ElenaRossokhina
This bug affects 1 person
Affects              Status      Importance  Assigned to      Milestone
Fuel for OpenStack   Incomplete  High        ElenaRossokhina
Mitaka               Invalid     High        ElenaRossokhina

Bug Description

Detailed bug description:
On the face of it, the issue looks like a duplicate of #1608561, because a node reset leads to a wrong status of backends on different nodes: http://paste.openstack.org/show/565168/
A brief discussion with @isuzdal suggests that the root cause may differ.
Found on CI: https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/44/testReport/(root)/ha_rabbitmq_stability_check/ha_rabbitmq_stability_check/

Steps to reproduce:
1. Deploy an environment with at least 3 controllers (or revert an existing snapshot)
2. Wait for mysql cluster to become active
3. Run ostf tests before destructive actions
4. Get rabbit master node
5. Move the master rabbit resource to a slave with pcs
6. Delete the pcs constraint for the rabbit resource (see the pcs sketch after this list)
7. Assert HA services ready
8. Get new rabbit master node
9. Destroy it
10. Assert HA services ready
11. Run sanity and smoke OSTF sets
12. Power on destroyed node
13. Assert HA services ready (fail)
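
For steps 5 and 6, the test likely drives pcs along these lines (a sketch, not the exact test code: master_p_rabbitmq-server is an assumption based on the usual Fuel pcs name for the RabbitMQ master/slave resource, node-2 stands for whichever slave was chosen, and <constraint-id> is a placeholder):

### run on any controller
# pcs resource move master_p_rabbitmq-server node-2
### look up the id of the location constraint the move created
# pcs constraint list --full | grep rabbitmq
### remove it, substituting the id found above
# pcs constraint remove <constraint-id>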

Expected results:
HA suite passed
Actual result:
One of the HA tests failed: "Check state of haproxy backends on controllers" (failure)
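
The failing OSTF check effectively compares haproxy's own view of its backends across the controllers. The same state can be read by hand from the stats socket (a sketch; it assumes the socket path /var/lib/haproxy/stats used on Fuel controllers, and that the server status is CSV field 18 of "show stat"):

### print proxy name, server name and status for every non-UP entry
# echo "show stat" | socat stdio /var/lib/haproxy/stats | cut -d, -f1,2,18 | grep -v UP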

Tags: area-library
tags: added: area-library
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
milestone: none → 10.0
Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Kyrylo Galanov (kgalanov) wrote :

Hi,

It looks like a haproxy bug: it keeps the backends marked DOWN even though the service responds. After a haproxy restart, all nodes go UP as they are supposed to.

# haproxy-status.sh | grep DOWN | grep keysto
keystone-1 node-4 Status: DOWN/L4TOUT Sessions: 0 Rate: 0
keystone-2 node-4 Status: DOWN/L4TOUT Sessions: 0 Rate: 0

# cat /etc/haproxy/conf.d/020-keystone-1.cfg | grep node-4
  server node-4 10.109.6.3:5000 check inter 10s fastinter 2s downinter 2s rise 30 fall 3
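### note: with "downinter 2s" and "rise 30", a backend that went DOWN is probed
### every 2s and needs 30 consecutive successes, i.e. about 60s, to come back UP;
### DOWN/L4TOUT above means even the TCP connect of the check keeps timing out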

# curl http://10.109.6.3:5000
{"versions": {"values": [{"status": "stable", "updated": "2016-04-04T00:00:00Z", "media-types": [{"base": "application/json", "type": "application/vnd.openstack.identity-v3+json"}], "id": "v3.6", "links": [{"href": "http://10.109.6.3:5000/v3/", "rel": "self"}]}, {"status": "stable", "updated": "2014-04-17T00:00:00Z", "media-types": [{"base": "application/json", "type": "application/vnd.openstack.identity-v2.0+json"}], "id": "v2.0", "links": [{"href": "http://10.109.6.3:5000/v2.0/", "rel": "self"}, {"href": "http://docs.openstack.org/", "type": "text/html", "rel": "describedby"}]}]}}

# killall haproxy

### Wait until pcs restarts haproxy

# haproxy-status.sh | grep node-4 | grep keystone
keystone-1 node-4 Status: UP/L7OK Sessions: 0 Rate: 0
keystone-2 node-4 Status: UP/L7OK Sessions: 0 Rate: 0

# dpkg -l | grep haproxy
ii haproxy 1.6.3-1~u14.04+mos2 amd64 fast and reliable load balancing reverse proxy

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Linux team, could you take a look?

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → MOS Linux (mos-linux)
status: Invalid → Confirmed
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

As far as I can remember, it's not a haproxy bug. The root cause is in how conntrackd works inside a network namespace.
If you could provide access to the reverted failed env, it might help to clarify what exactly is going wrong, and where. I tried to reproduce it locally but couldn't.
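
One way to test the conntrackd hypothesis on a broken node (a sketch, assuming conntrack-tools are installed and that haproxy runs inside the "haproxy" network namespace, as on Fuel controllers; 10.109.6.3 is the keystone backend from the transcript above):

### list tracked connections toward the backend that shows DOWN/L4TOUT
# ip netns exec haproxy conntrack -L -d 10.109.6.3
### if stale entries are black-holing the checks, flushing the table should let
### the next check succeed without restarting haproxy
# ip netns exec haproxy conntrack -F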

Changed in fuel:
assignee: MOS Linux (mos-linux) → Fuel Sustaining (fuel-sustaining-team)
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Ivan Suzdal (isuzdal)
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Unfortunately, I don't have access to the production slaves.

Changed in fuel:
assignee: Ivan Suzdal (isuzdal) → ElenaRossokhina (esolomina)
status: Confirmed → Incomplete
Revision history for this message
ElenaRossokhina (esolomina) wrote :

The env is reverted and ready for investigation.

Revision history for this message
ElenaRossokhina (esolomina) wrote :

The issue was reproduced again on CI: https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/101/testReport/(root)/ha_rabbitmq_stability_check/ha_rabbitmq_stability_check/

It is not a MySQL issue; MySQL started OK after the node reboot. Also, the logs look strange: haproxy.log on the restarted node ends earlier than on the other nodes (even though the log shows that the services recovered).

haproxy-status.sh - http://paste.openstack.org/show/586506/
full logs - https://drive.google.com/open?id=0B2ag_Bf-ShtTWm5ueWRFZUw3WWM

Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Passing the bug back to the author, because we still don't know how this bug can be reproduced.
Elena, please save the failed env and ping me or any mos-linux team member if the bug occurs again.

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Moved to "Invalid" because I didn't find the env with issue.
