[jammy/yoga] openstack-loadbalancer fails to start haproxy

Bug #1997103 reported by Bas de Bruijne

Affects: OpenStack Loadbalancer Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

In testrun https://solutions.qa.canonical.com/v2/testruns/8b382b6f-73e8-455c-be26-d7b8d9b02143, openstack-loadbalancer blocks with:

```
openstack-loadbalancer/1 active idle 1/lxd/9 10.246.167.21 Unit is ready
  filebeat/38 active idle 10.246.167.21 Filebeat ready.
  hacluster-openstack-loadbalancer/2 active idle 10.246.167.21 Unit is ready and clustered
  landscape-client/38 maintenance idle 10.246.167.21 Need computer-title and juju-info to proceed
  logrotated/31 active idle 10.246.167.21 Unit is ready.
  nrpe/47 active idle 10.246.167.21 icmp,5666/tcp Ready
  prometheus-grok-exporter/39 active idle 10.246.167.21 9144/tcp Unit is ready
  public-policy-routing/18 active idle 10.246.167.21 Unit is ready
  telegraf/38 active idle 10.246.167.21 9103/tcp Monitoring openstack-loadbalancer/1 (source version/commit d208a64)
openstack-loadbalancer/2* blocked idle 2/lxd/10 10.246.165.129 Services not running that should be: haproxy
  filebeat/15 active idle 10.246.165.129 Filebeat ready.
  hacluster-openstack-loadbalancer/0* active idle 10.246.165.129 Unit is ready and clustered
  landscape-client/14 maintenance idle 10.246.165.129 Need computer-title and juju-info to proceed
  logrotated/9 active idle 10.246.165.129 Unit is ready.
  nrpe/21 active idle 10.246.165.129 icmp,5666/tcp Ready
  prometheus-grok-exporter/15 active idle 10.246.165.129 9144/tcp Unit is ready
  public-policy-routing/3 active idle 10.246.165.129 Unit is ready
  telegraf/14 active idle 10.246.165.129 9103/tcp Monitoring openstack-loadbalancer/2 (source version/commit d208a64)
```

In the logs, we see a timeout trying to stop haproxy:
```
var/log/pacemaker/pacemaker.log-Nov 17 17:50:41.900 juju-173067-2-lxd-10 pacemaker-based [39820] (cib_perform_op) info: + /cib/status/node_state[@id='1000']/lrm[@id='1000']/lrm_resources/lrm_resource[@id='res_openstack_loadbalancer_haproxy']/lrm_rsc_op[@id='res_openstack_loadbalancer_haproxy_last_0']: @operation_key=res_openstack_loadbalancer_haproxy_stop_0, @operation=stop, @crm-debug-origin=do_update_resource, @transition-key=2:9:0:fd391420-ca57-4da8-9021-1a51c847e3aa, @transition-magic=-1:193;2:9:0:fd391420-ca57-4da8-9021-1a51c847e3aa, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1668707441, @exec-time=0
var/log/pacemaker/pacemaker.log-Nov 17 17:50:41.900 juju-173067-2-lxd-10 pacemaker-based [39820] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=juju-173067-2-lxd-10/crmd/74, version=0.12.22)
var/log/pacemaker/pacemaker.log-Nov 17 17:50:46.900 juju-173067-2-lxd-10 pacemaker-based [39820] (cib_process_ping) info: Reporting our current digest to juju-173067-2-lxd-10: d4e7bf75ba1d472cf2644a30720ce797 for 0.12.22 (0x55f2216e0360 0)
var/log/pacemaker/pacemaker.log-Nov 17 17:50:57.888 juju-173067-2-lxd-10 pacemaker-controld [39825] (throttle_check_thresholds) notice: High CPU load detected: 20.889999
var/log/pacemaker/pacemaker.log-Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-execd [39822] (async_action_complete) warning: res_openstack_loadbalancer_haproxy_stop_0[43103] timed out after 20000ms
var/log/pacemaker/pacemaker.log-Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-execd [39822] (log_finished) info: res_openstack_loadbalancer_haproxy stop (call 21, PID 43103) could not be executed: Timed Out (Process did not exit within specified timeout) (execution time 20.001s)
var/log/pacemaker/pacemaker.log:Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-controld [39825] (log_executor_event) error: Result of stop operation for res_openstack_loadbalancer_haproxy on juju-173067-2-lxd-10: Timed Out after 20s (Process did not exit within specified timeout) | CIB update 75, graph action confirmed; call=21 key=res_openstack_loadbalancer_haproxy_stop_0
```

I can't locate more hints about what is happening.

Configs and crashdumps are here:
https://oil-jenkins.canonical.com/artifacts/8b382b6f-73e8-455c-be26-d7b8d9b02143/index.html

Alex Kavanagh (ajkavanagh) wrote:

This little snippet looks suspicious:

var/log/pacemaker/pacemaker.log-Nov 17 17:50:57.888 juju-173067-2-lxd-10 pacemaker-controld [39825] (throttle_check_thresholds) notice: High CPU load detected: 20.889999
var/log/pacemaker/pacemaker.log-Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-execd [39822] (async_action_complete) warning: res_openstack_loadbalancer_haproxy_stop_0[43103] timed out after 20000ms
var/log/pacemaker/pacemaker.log-Nov 17 17:51:01.880 juju-173067-2-lxd-10 pacemaker-execd [39822] (log_finished) info: res_openstack_loadbalancer_haproxy stop (call 21, PID 43103) could not be executed: Timed Out (Process did not exit within specified timeout) (execution time 20.001s)

i.e. 2/lxd/10 had a load average of about 20, and the pacemaker stop command timed out at 20 seconds.
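
For reference, a rough sketch of how the configured timeout could be checked on the affected unit (a sketch only; it assumes crmsh is present, which the hacluster charm normally installs, and uses the resource name from the log above):

```
# Sketch: inspect the pacemaker resource definition and its operation timeouts
# (assumes crmsh on the unit; resource name taken from the pacemaker log above)
juju ssh openstack-loadbalancer/2 -- sudo crm configure show res_openstack_loadbalancer_haproxy

# Pacemaker's default operation timeout is 20s, which would match the 20000ms
# in the log if no explicit stop timeout is set on the resource.
```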

The machine.lock for machine 2 at roughly the same time shows:

2022-11-17 17:50:45 unit-ceph-osd-2: ceph-osd/2 uniter (run update-status hook), waited 35s, held 4s
2022-11-17 17:50:56 unit-nrpe-0: nrpe/0 uniter (run start hook), waited 37s, held 12s
2022-11-17 17:51:05 unit-nrpe-4: nrpe/4 uniter (run start hook), waited 37s, held 9s
2022-11-17 17:51:05 unit-nova-compute-2: nova-compute/2 uniter (run relation-changed (35; unit: nova-compute/3) hook), waited 24s, held 0s
2022-11-17 17:51:07 unit-prometheus-libvirt-exporter-0: prometheus-libvirt-exporter/0 uniter (run update-status hook), waited 23s, held 2s
2022-11-17 17:51:18 unit-nrpe-0: nrpe/0 uniter (run relation-joined (401; unit: prometheus-grok-exporter/0) hook), waited 11s, held 10s

i.e. hooks are waiting 30+ seconds for a chance to run, which means another hook is still running at that time.

2/lxd/10's machine.lock at the time shows that the loadbalancer unit held the lock for 41 seconds, which is consistent with the timeout:

2022-11-17 17:50:02 unit-hacluster-openstack-loadbalancer-0: hacluster-openstack-loadbalancer/0 uniter (run relation-changed (267; unit: openstack-loadbalancer/2) hook), waited 7s, held 41s
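
As an aside, a quick way to pull the longest lock holds out of a machine's lock log (a sketch only; the path below is the standard juju location and may differ in the crashdump layout):

```
# Sketch: sort machine lock entries by hold time (last field, e.g. "held 41s");
# adjust the path to wherever the crashdump stores machine.lock
awk '{ t=$NF; gsub(/s$/, "", t); print t, $0 }' /var/log/juju/machine-lock.log | sort -rn | head
```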

Machine 2's ps aux listed around 750 processes, and this is during deploy.

Machine 2 also has 11 LXD containers (so potentially 11 juju hooks running simultaneously).

Machine 2 also has:

    hardware: arch=amd64 cores=4 mem=65536M tags=foundation-nodes,rack1,regular,silo6
      availability-zone=zone3

I'm going to hazard a guess that this was a resource contention issue, and that the pacemaker command simply timed out because it couldn't get sufficient resources.

A couple of things should probably be done here (rough sketches of both are below):

1. Extending the timeouts on the pacemaker commands (though that in itself causes problems).
2. Collecting load average/CPU/IO stats for the machine so that load during these kinds of errors can be analysed.
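
Roughly, and only as a sketch (the op_defaults value is illustrative, the hacluster charm may manage these settings itself, and sar needs the sysstat package):

```
# (1) Raise the cluster-wide default operation timeout (illustrative value only;
#     per the caveat above, longer timeouts bring their own problems)
sudo crm configure op_defaults timeout=60s

# (2) Sample load average, CPU and I/O every 5 seconds for a minute during
#     deploy, so the numbers are available when a timeout like this happens
sar -q -u -b 5 12 > /tmp/machine-2-load.txt
```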

It's only a hunch that this is load-related; there are some odd messages in the various logs, particularly around when a command is initiated versus when the timeout occurs, but I think we need to rule out timeouts due to resource constraints first.

Changed in charm-openstack-loadbalancer:
status: New → Incomplete
Bas de Bruijne (basdbruijne) wrote:

Resource contention could be the issue, but that doesn't explain why it happens on jammy and not on yoga (so far at least).

All the occurrences are listed here:
https://solutions.qa.canonical.com/bugs/bugs/bug/1997103

I will look into capturing the suggested logs. I'm leaving the bug report on 'New' for now so that it doesn't expire.

Changed in charm-openstack-loadbalancer:
status: Incomplete → New