haproxy containers stopped on two controllers on fresh deployment with OpenDaylight

Bug #1764514 reported by Sai Sindhur Malleni
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Tim Rozet

Bug Description

Description of problem:
On a fresh deployment of Queens with ODL Oxygen, we are seeing the haproxy-bundle containers killed on two of the controllers (the controllers without the VIP).

Version-Release number of selected component (if applicable):
OSP 13

How reproducible:
100%

Steps to Reproduce:
1. Deploy OpenStack Queens with ODL
2. Use pcs status to view resource status

Actual results:
haproxy containers killed on controller-1 and controller-2

Expected results:
haproxy containers should be started on all controllers

Additional info:

[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Mon Apr 9 23:44:07 2018
Last change: Mon Apr 9 23:44:05 2018 by hacluster via crmd on overcloud-controller-2

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@overcloud-controller-0 galera-bundle-1@overcloud-controller-1 galera-bundle-2@overcloud-controller-2 rabbitmq-bundle-0@overcloud-controller-0 rabbitmq-bundle-1@overcloud-controller-1 rabbitmq-bundle-2@overcloud-controller-2 redis-bundle-0@overcloud-controller-0 redis-bundle-1@overcloud-controller-1 redis-bundle-2@overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
 Docker container set: galera-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Master overcloud-controller-1
   galera-bundle-2 (ocf::heartbeat:galera): Master overcloud-controller-2
 Docker container set: redis-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Master overcloud-controller-0
   redis-bundle-1 (ocf::heartbeat:redis): Slave overcloud-controller-1
   redis-bundle-2 (ocf::heartbeat:redis): Slave overcloud-controller-2
 ip-192.168.24.54 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.21.0.100 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.16.0.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.16.0.14 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.18.0.18 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.19.0.13 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 Docker container set: haproxy-bundle [docker-registry.engineering.redhat.com/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started overcloud-controller-0
   haproxy-bundle-docker-1 (ocf::heartbeat:docker): Stopped
   haproxy-bundle-docker-2 (ocf::heartbeat:docker): Stopped
 Docker container: openstack-cinder-volume [docker-registry.engineering.redhat.com/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started overcloud-controller-1

==============================================================================

In /var/log/messages
Apr 9 19:49:09 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886021]: ERROR: Newly created docker container exited after start
Apr 9 19:49:09 overcloud-controller-1 lrmd[20848]: notice: haproxy-bundle-docker-2_start_0:886021:stderr [ ocf-exit-reason:waiting on monitor_cmd to pass after start ]
Apr 9 19:49:09 overcloud-controller-1 lrmd[20848]: notice: haproxy-bundle-docker-2_start_0:886021:stderr [ ocf-exit-reason:Newly created docker container exited after start ]
Apr 9 19:49:09 overcloud-controller-1 crmd[20851]: notice: Result of start operation for haproxy-bundle-docker-2 on overcloud-controller-1: 1 (unknown error)
Apr 9 19:49:09 overcloud-controller-1 crmd[20851]: notice: overcloud-controller-1-haproxy-bundle-docker-2_start_0:159 [ ocf-exit-reason:waiting on monitor_cmd to pass after start\nocf-exit-reason:Newly created docker container exited after start\n ]
Apr 9 19:49:10 overcloud-controller-1 dockerd-current: time="2018-04-09T23:49:10.004764059Z" level=error msg="Handler for POST /v1.26/containers/haproxy-bundle-docker-2/stop?t=10 returned error: Container haproxy-bundle-docker-2 is already stopped"
Apr 9 19:49:10 overcloud-controller-1 dockerd-current: time="2018-04-09T23:49:10.005303162Z" level=error msg="Handler for POST /v1.26/containers/haproxy-bundle-docker-2/stop returned error: Container haproxy-bundle-docker-2 is already stopped"
Apr 9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: INFO: haproxy-bundle-docker-2
Apr 9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: NOTICE: Cleaning up inactive container, haproxy-bundle-docker-2.
Apr 9 19:49:10 overcloud-controller-1 docker(haproxy-bundle-docker-2)[886633]: INFO: haproxy-bundle-docker-2
Apr 9 19:49:10 overcloud-controller-1 crmd[20851]: notice: Result of stop operation for haproxy-bundle-docker-2 on overcloud-controller-1: 0 (ok)

Revision history for this message
Sai Sindhur Malleni (smalleni) wrote :

Comment from Raoul:
On the affected machines the problem is clearly specific to the haproxy container, and it can be reproduced by starting the container by hand:

[ALERT] 099/133036 (10) : Starting proxy opendaylight_ws: cannot bind socket [172.16.0.15:8185]
[ALERT] 099/133036 (10) : Starting proxy opendaylight_ws: cannot bind socket [192.168.24.59:8185]

This would suggest that the ports haproxy wants to use are already occupied by something else, but what we actually see on the controller is:

[root@overcloud-controller-1 heat-admin]# netstat -nlp|grep 8185
tcp 0 0 172.16.0.20:8185 0.0.0.0:* LISTEN 496289/java

So the opendaylight service (running in the container) is correctly listening on the machine's local IP, 172.16.0.20, and nothing else is using the port.
One notable detail is that controller-1 does not have any VIP on it, and the problem does not occur on controller-0, where the VIP lives.

Commenting out the opendaylight_ws section in /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg makes haproxy start, but it remains to be understood why it cannot bind those addresses.
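
For reference, the checks and the temporary workaround described above can be sketched as shell commands on the affected controller (a sketch only, not a verified transcript; the VIP address, container name, and config path are taken from the excerpts above):

# 1. Confirm only the local OpenDaylight java process is listening on the websocket port.
netstat -nlp | grep 8185

# 2. Confirm the internal VIP from the haproxy alert is not configured on this node.
ip addr | grep 172.16.0.15

# 3. If the exited container still exists, check why it stopped (the alerts go to stderr).
docker logs haproxy-bundle-docker-2 2>&1 | grep -i alert

# 4. After commenting out the opendaylight_ws section in
#    /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg,
#    clear the failed state so pacemaker retries the start.
pcs resource cleanup haproxy-bundle
pcs status | grep -A3 haproxy-bundle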

Revision history for this message
Sai Sindhur Malleni (smalleni) wrote :

Comment from Tim Rozet:
The reason haproxy is not starting on the non-VIP controller nodes is that its opendaylight_ws binding is not set to transparent. Since those nodes do not hold the VIP, haproxy cannot bind to the VIP addresses and therefore fails to start. Most other services use transparent mode, which allows haproxy to start even when the referenced bind address is not configured locally. So the behavior is expected here; the question is whether it is the correct behavior.

For both the Zaqar WebSocket and ODL WebSocket services we are not using transparent binding, with a note from Juan indicating this was done intentionally (a sketch of the haproxy-level effect follows the snippet):

  if $zaqar_ws {
    ::tripleo::haproxy::endpoint { 'zaqar_ws':
      public_virtual_ip => $public_virtual_ip,
      internal_ip => hiera('zaqar_ws_vip', $controller_virtual_ip),
      service_port => $ports[zaqar_ws_port],
      ip_addresses => hiera('zaqar_ws_node_ips', $controller_hosts_real),
      server_names => hiera('zaqar_ws_node_names', $controller_hosts_names_real),
      mode => 'http',
      haproxy_listen_bind_param => [], # We don't use a transparent proxy here
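
At the haproxy.cfg level the difference comes down to the bind parameter on the generated opendaylight_ws section. A minimal sketch using the addresses from the alerts above (this is not the actual rendered config, and the backend server line is illustrative only):

listen opendaylight_ws
  # Without "transparent", these binds fail on controllers that do not hold the VIP:
  #   bind 172.16.0.15:8185
  #   bind 192.168.24.59:8185
  # With "transparent", haproxy is allowed to bind an address that is not configured
  # locally, so the container can start on every controller:
  bind 172.16.0.15:8185 transparent
  bind 192.168.24.59:8185 transparent
  server overcloud-controller-1 172.16.0.20:8185 check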

Revision history for this message
Sai Sindhur Malleni (smalleni) wrote :

Changed the HAProxy configuration for opendaylight_ws to include transparent binding on the controllers and restarted haproxy-bundle. Ran the VM boot-and-ping scenario (booting 50 VMs); the VMs go ACTIVE as expected and are pingable.
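
The manual change amounts to the following on each controller (a sketch of the steps described above, not an exact transcript; the config path is the one quoted earlier):

# Add "transparent" to the opendaylight_ws bind lines in the puppet-generated config.
vi /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg

# Restart the pacemaker-managed haproxy bundle and confirm all three replicas start.
pcs resource restart haproxy-bundle
pcs status | grep haproxy-bundle-docker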

--------------------------------------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------------------+
| Response Times (sec) |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_router | 1.62 | 1.899 | 3.281 | 3.89 | 4.327 | 2.259 | 100.0% | 50 |
| neutron.create_network | 0.246 | 0.462 | 0.612 | 0.689 | 0.774 | 0.444 | 100.0% | 50 |
| neutron.create_subnet | 0.582 | 0.849 | 1.014 | 1.038 | 1.421 | 0.854 | 100.0% | 50 |
| neutron.add_interface_router | 2.029 | 2.42 | 2.859 | 2.966 | 3.152 | 2.453 | 100.0% | 50 |
| nova.boot_server | 38.003 | 77.645 | 90.082 | 91.992 | 92.988 | 75.522 | 100.0% | 50 |
| vm.attach_floating_ip | 3.779 | 5.075 | 5.671 | 5.786 | 6.557 | 5.051 | 100.0% | 50 |
| -> neutron.create_floating_ip | 1.374 | 1.711 | 2.074 | 2.132 | 2.156 | 1.752 | 100.0% | 50 |
| -> nova.associate_floating_ip | 2.029 | 3.24 | 3.978 | 4.175 | 4.655 | 3.298 | 100.0% | 50 |
| vm.wait_for_ping | 0.019 | 0.023 | 0.028 | 0.029 | 121.23 | 4.851 | 96.0% | 50 |
| total | 47.474 | 88.715 | 101.95 | 103.039 | 215.239 | 91.436 | 96.0% | 50 |
| -> duration | 46.474 | 87.715 | 100.95 | 102.039 | 214.239 | 90.436 | 96.0% | 50 |
| -> idle_duration | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 96.0% | 50 |
+--------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+

The haproxy-bundle container is now started on all 3 controllers:

[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Mon Apr 16 20:10:46 2018
Last change: Mon Apr 16 19:49:34 2018 by hacluster via crmd on overcloud-controller-1

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 ov...


summary: - haproxy containers stopped on two controllers on fresh deployment
+ haproxy containers stopped on two controllers on fresh deployment with
+ OpenDaylight
Tim Rozet (trozet)
Changed in tripleo:
assignee: nobody → Tim Rozet (trozet)
importance: Undecided → High
status: New → In Progress
milestone: none → rocky-1
tags: added: queens-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/561712

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.openstack.org/561712
Reason: TO NOT RE-CHECK OR RE-APPROVE - CLEARING THE GATE NOW TO FIX A BLOCKER

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/561712
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=70bedeef99d5b2852307158330f78c0d9159532f
Submitter: Zuul
Branch: master

commit 70bedeef99d5b2852307158330f78c0d9159532f
Author: Tim Rozet <email address hidden>
Date: Mon Apr 16 16:50:04 2018 -0400

    Fixes binding type for OpenDaylight Websocket

    For OpenDaylight Websocket connections we were not using transparent
    binding type with HA Proxy. This means that HA Proxy was not able to
    start on nodes that did not have the VIP because it was unable to bind
    to that IP on more than one node. However, transparent binding works OK
    with OpenDaylight Websocket and should be fine to enable so that HA
    Proxy is able to start on every controller.

    Closes-Bug: 1764514

    Change-Id: I89e6115795ece6735e816ab71b5b552b17f7b943
    Signed-off-by: Tim Rozet <email address hidden>
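
After a deployment that includes this change, the result can be spot-checked on any controller (a sketch; the config path comes from the comments above):

# The rendered opendaylight_ws binds should now carry the transparent parameter,
# and all three haproxy bundle replicas should be reported as Started.
grep -A3 opendaylight_ws /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg
pcs status | grep -c 'haproxy-bundle-docker.*Started'   # expected: 3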

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/562581

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.openstack.org/562581
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=e11f051b7109596b052ff594d09fe2a399bbe58e
Submitter: Zuul
Branch: stable/queens

commit e11f051b7109596b052ff594d09fe2a399bbe58e
Author: Tim Rozet <email address hidden>
Date: Mon Apr 16 16:50:04 2018 -0400

    Fixes binding type for OpenDaylight Websocket

    For OpenDaylight Websocket connections we were not using transparent
    binding type with HA Proxy. This means that HA Proxy was not able to
    start on nodes that did not have the VIP because it was unable to bind
    to that IP on more than one node. However, transparent binding works OK
    with OpenDaylight Websocket and should be fine to enable so that HA
    Proxy is able to start on every controller.

    Closes-Bug: 1764514

    Change-Id: I89e6115795ece6735e816ab71b5b552b17f7b943
    Signed-off-by: Tim Rozet <email address hidden>
    (cherry picked from commit 70bedeef99d5b2852307158330f78c0d9159532f)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.3.2

This issue was fixed in the openstack/puppet-tripleo 8.3.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.1.0

This issue was fixed in the openstack/puppet-tripleo 9.1.0 release.
