Designate HA may result in Resource not running res_designate_haproxy

Bug #1839021 reported by David Ames
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Charm Helpers
Triaged
Medium
Unassigned
OpenStack Designate Charm
Invalid
High
David Ames
OpenStack HA Cluster Charm
Confirmed
Undecided
Unassigned
OpenStack Keystone Charm
Invalid
Undecided
Unassigned
charm-interface-hacluster
Fix Released
High
Unassigned

Bug Description

Designate HA may result in hacluster reporting "Resource: res_designate_haproxy not running". There seems to be a race: haproxy is in fact running, but CRM is unaware of this.

Seen particularly when running the full designate_ha openstack-mojo-spec on xenial with releases up to and including queens.

Investigate timing, and compare for similarities with LP Bug #1837401.

Workaround:
crm resource cleanup res_designate_haproxy

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Confirmed. I've seen this with keystone and swift as well. I'll dig into this further today.

Changed in charm-hacluster:
status: New → Confirmed
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I should add that I've seen it during series upgrades. I thought it might be due to the reboot, but the fact that it's also seen in the designate_ha mojo spec probably means it's a race rather than something caused by the reboot.

The resource does seem to be 'broken' (in my series upgrade test) in crm, and I got it working by manually cleaning up the crm resource using crm commands.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Adding to my previous comment re: the keystone and haproxy error, the relevant configuration is:

keystone config:

public_endpoint = http://10.5.100.2:5000
admin_endpoint = http://10.5.100.2:35357

haproxy.cfg:

frontend tcp-in_public-port
    bind *:5000
    bind :::5000
    acl net_10.5.0.92 dst 10.5.0.92/255.255.0.0
    use_backend public-port_10.5.0.92 if net_10.5.0.92
    default_backend public-port_10.5.0.92

i.e. they are both configured on the same port, and keystone doesn't realise that it should be using haproxy. I'll dig into the code that makes the decision about whether haproxy is present or not.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

So my previous comment was a red herring (public_endpoint and admin_endpoint don't actually affect where the keystone wsgi listens).

The config is correct:

apache2/sites-enabled/wsgi-openstack-api.conf:

# Configuration file maintained by Juju. Local changes may be overwritten.

Listen 35347
Listen 4990
<VirtualHost *:35347>
... etc

However, keystone processes are actually listening on the standard ports (5000 and 35357):

# lsof | grep 5000
keystone- 2126 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)
keystone- 3128 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)
keystone- 3129 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)

# lsof | grep 35357

keystone- 2126 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)
keystone- 3126 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)
keystone- 3126 keystone 15u IPv4 150493 0t0 TCP juju-6afeea-mojo-16.project.serverstack:35357->juju-6afeea-mojo-25.project.serverstack:51114 (ESTABLISHED)
keystone- 3127 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)
keystone- 3127 keystone 10u IPv4 147044 0t0 TCP juju-6afeea-mojo-16.project.serverstack:35357->juju-6afeea-mojo-26.project.serverstack:33835 (ESTABLISHED)
keystone- 3127 keystone 13u IPv4 155157 0t0 TCP juju-6afeea-mojo-16.project.serverstack:35357->juju-6afeea-mojo-24.project.serverstack:56627 (ESTABLISHED)
keystone- 3127 keystone 16u IPv4 155159 0t0 TCP juju-6afeea-mojo-16.project.serverstack:35357->juju-6afeea-mojo-25.project.serverstack:52961 (ESTABLISHED)
keystone- 3128 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)
keystone- 3129 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)

However, it is also listening on the other ports (4990 and 35347).

Thus, one hypothesis is that the former are apache processes that weren't shut down and are left over after a restart.
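One way to sift through lsof output like the above is to tally which PIDs hold each listening port. A minimal sketch (the sample lines are copied from this report; in a live session you would pipe `lsof -nP -iTCP -sTCP:LISTEN` into the awk instead):

```shell
# Sample lsof lines from the affected unit (subset, for illustration)
lsof_listeners='keystone- 2126 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)
keystone- 3128 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)
keystone- 2126 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)'

# For each LISTEN line, extract the port from the "*:PORT" field and
# collect the PIDs ($2) that hold it; sort for a stable ordering.
summary=$(printf '%s\n' "$lsof_listeners" |
  awk '/\(LISTEN\)/ { split($(NF-1), a, ":"); ports[a[2]] = ports[a[2]] " " $2 }
       END { for (p in ports) print "port " p ":" ports[p] }' |
  sort)
printf '%s\n' "$summary"
```

Duplicate PIDs across ports (here 2126 on both 5000 and 35357) point at one process tree holding both keystone ports, which is what blocks haproxy from binding.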

However, it seems that the keystone.service is starting some services on 5000 and 35357:

root@juju-6afeea-mojo-16:/etc# lsof | egrep keystone.*5000
keystone- 11047 keystone 10u IPv4 160995 0t0 TCP *:5000 (LISTEN)
keystone- 11062 keystone 10u IPv4 160995 0t0 TCP *:5000 (LISTEN)
keystone- 11063 keystone 10u IPv4 160995 0t0 TCP *:5000 (LISTEN)
root@juju-6afeea-mojo-16:/etc# lsof | egrep keystone.*4990
root@juju-6afeea-mojo-16:/etc# lsof | egrep keystone.*35357
keystone- 11047 keystone 9u IPv4 160994 0t0 TCP *:35357 (LISTEN)
keystone- 11060 keystone 9u IPv4 160994 0t0 TCP *:35357 (LISTEN)
keystone- 11061 keystone 9u IPv4 ...


Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

So the mitaka keystone/hacluster pairing bug is:

1. on the series upgrade, the keystone service is either still enabled or gets re-enabled.
2. However, on mitaka, keystone is configured to be accessed via apache2.
3. Thus the keystone.service shouldn't be enabled or running.

The outcome is:

1. the keystone.service runs and binds to 5000/35357
2. the "post-series-upgrade" hook fails because it can't start haproxy.
3. it can't start haproxy because keystone.service has bound to 5000/35357
4. the hacluster charm reports that res_ks_haproxy isn't running and blocks (this part of the bug).

Solution:

Ensure keystone.service doesn't get enabled (or at least disable it) during the series upgrade.
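The logic of that solution can be sketched as follows. This is a hypothetical illustration, not the charm's actual code: per the commit below, on xenial and later keystone is served via apache2/mod_wsgi, so the standalone keystone.service must stay disabled or it binds 5000/35357 before haproxy can.

```shell
# Hypothetical helper: only on trusty does the standalone
# keystone.service legitimately run; from xenial onward the API
# is served via apache2/mod_wsgi.
keystone_service_should_run() {
    [ "$1" = "trusty" ]
}

series="xenial"   # e.g. detected from /etc/lsb-release after the upgrade
if keystone_service_should_run "$series"; then
    echo "leave keystone.service enabled"
else
    echo "disable keystone.service"
    # Real action (needs the unit present):
    # systemctl disable --now keystone
fi
```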

Changed in charm-keystone:
status: New → Confirmed
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
David Ames (thedac) wrote :
Changed in charm-designate:
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 19.10
David Ames (thedac)
Changed in charm-interface-hacluster:
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Ames (thedac)
Changed in charm-designate:
status: Triaged → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-interface-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/674872

Changed in charm-interface-hacluster:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-keystone (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675127

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-keystone (master)

Reviewed: https://review.opendev.org/675127
Committed: https://git.openstack.org/cgit/openstack/charm-keystone/commit/?id=21d212cb2739a3ff1b08d8d96916dfd638f06ffe
Submitter: Zuul
Branch: master

commit 21d212cb2739a3ff1b08d8d96916dfd638f06ffe
Author: Alex Kavanagh <email address hidden>
Date: Wed Aug 7 15:09:13 2019 +0100

    Ensure that keystone service is paused if needed on series upgrade

    During series upgrade, the keystone packages get re-installed as the
    underlying Linux has been upgraded and new package sets are updated and
    then pulled in. For trusty->xenial this means that keystone.service
    gets enabled which then breaks haproxy. On install, on xenial+, the
    keystone.service is disabled in the install hook. This just replicates
    this in the series-upgrade hook.

    Change-Id: Ic5ed6cf354d5545b9e554e205a048955a381e0f5
    Closed-Bug: #1839021

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

On a subsequent trusty->xenial series upgrade, the keystone hacluster unit on one node may show a blocked state with "Resource: res_ks_haproxy not running"

However, if the associated keystone unit is not errored or blocked, it's likely the crm retries have been exceeded. By running "sudo crm resource refresh" the status can be cleared.

Changed in charm-keystone:
status: Confirmed → Invalid
assignee: Alex Kavanagh (ajkavanagh) → nobody
David Ames (thedac)
Changed in charm-helpers:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-interface-hacluster (master)

Reviewed: https://review.opendev.org/674872
Committed: https://git.openstack.org/cgit/openstack/charm-interface-hacluster/commit/?id=fe9d009520be5082f82ff99bce8d460a02bd7c93
Submitter: Zuul
Branch: master

commit fe9d009520be5082f82ff99bce8d460a02bd7c93
Author: David Ames <email address hidden>
Date: Tue Aug 6 09:45:24 2019 -0700

    Never give up on resources

    Configure pacemaker to never give up on resources and
    to recheck 5 seconds after a failure. This is achieved
    using migration-threshold and failure-timeout options *1.

    *1 https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_failure_response.html

    Change-Id: I4044810daa83f9bd7a59430b5c52c009149fac6e
    Partial-Bug: #1839021
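For reference, options of that kind are expressed as pacemaker meta attributes on the resource. A hedged sketch only; the resource name and values here are illustrative, and the actual change is in the review linked above:

```
# Illustrative crmsh configuration, not the merged change itself:
# never migrate the resource away on failure, and expire the
# failure count after 5 seconds so pacemaker retries it.
crm configure primitive res_ks_haproxy lsb:haproxy \
    meta migration-threshold=INFINITY failure-timeout=5s \
    op monitor interval=5s
```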

Revision history for this message
David Ames (thedac) wrote :
Changed in charm-interface-hacluster:
status: In Progress → Fix Released
assignee: David Ames (thedac) → nobody