Designate HA may result in Resource not running res_designate_haproxy

Bug #1839021 reported by David Ames
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Charm Helpers
Triaged
Medium
Unassigned
OpenStack Designate Charm
Invalid
High
David Ames
OpenStack HA Cluster Charm
Confirmed
Undecided
Unassigned
OpenStack Keystone Charm
Invalid
Undecided
Unassigned
charm-interface-hacluster
Fix Released
High
Unassigned

Bug Description

Designate HA may result in hacluster reporting "Resource: res_designate_haproxy not running". There seems to be a race: haproxy is in fact running, but CRM is unaware of this.

Seen particularly when running the full designate_ha openstack-mojo-spec on xenial with releases up to and including queens.

Investigate timing, and compare for similarities with LP Bug #1837401.

Workaround:
crm resource cleanup res_designate_haproxy

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Confirmed. I've seen this with keystone and swift as well. I'll dig into this further today.

Changed in charm-hacluster:
status: New → Confirmed
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I should add that I've seen it during series upgrades. I thought it might be due to the reboot, but the fact that it's also seen in the designate_ha mojo spec probably means it's a race rather than something caused by the reboot.

The resource does seem to be 'broken' (in my series upgrade test) in crm, and I got it working by manually cleaning up the crm resource using crm commands.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Adding to my previous comment re: the keystone and haproxy error, the relevant configuration is:

keystone config:

public_endpoint = http://10.5.100.2:5000
admin_endpoint = http://10.5.100.2:35357

haproxy.cfg:

frontend tcp-in_public-port
    bind *:5000
    bind :::5000
    acl net_10.5.0.92 dst 10.5.0.92/255.255.0.0
    use_backend public-port_10.5.0.92 if net_10.5.0.92
    default_backend public-port_10.5.0.92

i.e. they are both configured on the same port, and keystone doesn't realise that it should be using haproxy. I'll dig into the code that makes the decision about whether haproxy is present or not.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

So my previous comment was a red herring (public_endpoint and admin_endpoint don't actually affect where the keystone wsgi listens).

The config is correct:

apache2/sites-enabled/wsgi-openstack-api.conf:

# Configuration file maintained by Juju. Local changes may be overwritten.

Listen 35347
Listen 4990
<VirtualHost *:35347>
... etc

However, keystone processes are actually listening on the standard ports (5000 and 35357):

# lsof | grep 5000
keystone- 2126 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)
keystone- 3128 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)
keystone- 3129 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)

# lsof | grep 35357

keystone- 2126 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)
keystone- 3126 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)
keystone- 3126 keystone 15u IPv4 150493 0t0 TCP juju-6afeea-mojo-16.project.serverstack:35357->juju-6afeea-mojo-25.project.serverstack:51114 (ESTABLISHED)
keystone- 3127 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)
keystone- 3127 keystone 10u IPv4 147044 0t0 TCP juju-6afeea-mojo-16.project.serverstack:35357->juju-6afeea-mojo-26.project.serverstack:33835 (ESTABLISHED)
keystone- 3127 keystone 13u IPv4 155157 0t0 TCP juju-6afeea-mojo-16.project.serverstack:35357->juju-6afeea-mojo-24.project.serverstack:56627 (ESTABLISHED)
keystone- 3127 keystone 16u IPv4 155159 0t0 TCP juju-6afeea-mojo-16.project.serverstack:35357->juju-6afeea-mojo-25.project.serverstack:52961 (ESTABLISHED)
keystone- 3128 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)
keystone- 3129 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)

However, it is also listening on the other ports (4990 and 35347).

Thus, one hypothesis is that the former are apache processes that weren't shut down and are left over after a restart.
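One way to sift through lsof output like the above is to tally which PIDs hold each listening port. A minimal sketch (the sample lines are copied from this report; in a live session you would pipe `lsof -nP -iTCP -sTCP:LISTEN` into the awk instead):

```shell
# Sample lsof lines from the affected unit (subset, for illustration)
lsof_listeners='keystone- 2126 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)
keystone- 3128 keystone 10u IPv4 23414 0t0 TCP *:5000 (LISTEN)
keystone- 2126 keystone 9u IPv4 23413 0t0 TCP *:35357 (LISTEN)'

# For each LISTEN line, extract the port from the "*:PORT" field and
# collect the PIDs ($2) that hold it; sort for a stable ordering.
summary=$(printf '%s\n' "$lsof_listeners" |
  awk '/\(LISTEN\)/ { split($(NF-1), a, ":"); ports[a[2]] = ports[a[2]] " " $2 }
       END { for (p in ports) print "port " p ":" ports[p] }' |
  sort)
printf '%s\n' "$summary"
```

Duplicate PIDs across ports (here 2126 on both 5000 and 35357) point at one process tree holding both keystone ports, which is what blocks haproxy from binding.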

However, it seems that the keystone.service is starting some services on 5000 and 35357:

root@juju-6afeea-mojo-16:/etc# lsof | egrep keystone.*5000
keystone- 11047 keystone 10u IPv4 160995 0t0 TCP *:5000 (LISTEN)
keystone- 11062 keystone 10u IPv4 160995 0t0 TCP *:5000 (LISTEN)
keystone- 11063 keystone 10u IPv4 160995 0t0 TCP *:5000 (LISTEN)
root@juju-6afeea-mojo-16:/etc# lsof | egrep keystone.*4990
root@juju-6afeea-mojo-16:/etc# lsof | egrep keystone.*35357
keystone- 11047 keystone 9u IPv4 160994 0t0 TCP *:35357 (LISTEN)
keystone- 11060 keystone 9u IPv4 160994 0t0 TCP *:35357 (LISTEN)
keystone- 11061 keystone 9u IPv4 ...


Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

So the mitaka keystone/hacluster pairing bug is:

1. on the series upgrade, the keystone service is either still enabled or gets re-enabled.
2. However, on mitaka, keystone is configured to be accessed via apache2.
3. Thus the keystone.service shouldn't be enabled or running.

The outcome is:

1. the keystone.service runs and binds to 5000/35357
2. the "post-series-upgrade" hook fails because it can't start haproxy.
3. it can't start haproxy because keystone.service has bound to 5000/35357
4. the hacluster charm reports that res_ks_haproxy isn't running and blocks (this part of the bug).

Solution:

Ensure keystone.service doesn't get enabled (or at least disable it) during the series upgrade.
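The logic of that solution can be sketched as follows. This is a hypothetical illustration, not the charm's actual code: per the commit below, on xenial and later keystone is served via apache2/mod_wsgi, so the standalone keystone.service must stay disabled or it binds 5000/35357 before haproxy can.

```shell
# Hypothetical helper: only on trusty does the standalone
# keystone.service legitimately run; from xenial onward the API
# is served via apache2/mod_wsgi.
keystone_service_should_run() {
    [ "$1" = "trusty" ]
}

series="xenial"   # e.g. detected from /etc/lsb-release after the upgrade
if keystone_service_should_run "$series"; then
    echo "leave keystone.service enabled"
else
    echo "disable keystone.service"
    # Real action (needs the unit present):
    # systemctl disable --now keystone
fi
```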

Changed in charm-keystone:
status: New → Confirmed
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
David Ames (thedac) wrote :
Changed in charm-designate:
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 19.10
David Ames (thedac)
Changed in charm-interface-hacluster:
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Ames (thedac)
Changed in charm-designate:
status: Triaged → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-interface-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/674872

Changed in charm-interface-hacluster:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-keystone (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675127

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-keystone (master)

Reviewed: https://review.opendev.org/675127
Committed: https://git.openstack.org/cgit/openstack/charm-keystone/commit/?id=21d212cb2739a3ff1b08d8d96916dfd638f06ffe
Submitter: Zuul
Branch: master

commit 21d212cb2739a3ff1b08d8d96916dfd638f06ffe
Author: Alex Kavanagh <email address hidden>
Date: Wed Aug 7 15:09:13 2019 +0100

    Ensure that keystone service is paused if needed on series upgrade

    During series upgrade, the keystone packages get re-installed as the
    underlying Linux has been upgraded and new package sets are updated and
    then pulled in. For trusty->xenial this means that keystone.service
    gets enabled which then breaks haproxy. On install, on xenial+, the
    keystone.service is disabled in the install hook. This just replicates
    this in the series-upgrade hook.

    Change-Id: Ic5ed6cf354d5545b9e554e205a048955a381e0f5
    Closed-Bug: #1839021

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

On a subsequent trusty->xenial series upgrade, the keystone hacluster unit on one node may show a blocked state with "Resource: res_ks_haproxy not running"

However, if the associated keystone unit is not errored or blocked, it's likely the crm retries have been exceeded. By running "sudo crm resource refresh" the status can be cleared.

Changed in charm-keystone:
status: Confirmed → Invalid
assignee: Alex Kavanagh (ajkavanagh) → nobody
David Ames (thedac)
Changed in charm-helpers:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-interface-hacluster (master)

Reviewed: https://review.opendev.org/674872
Committed: https://git.openstack.org/cgit/openstack/charm-interface-hacluster/commit/?id=fe9d009520be5082f82ff99bce8d460a02bd7c93
Submitter: Zuul
Branch: master

commit fe9d009520be5082f82ff99bce8d460a02bd7c93
Author: David Ames <email address hidden>
Date: Tue Aug 6 09:45:24 2019 -0700

    Never give up on resources

    Configure pacemaker to never give up on resources and
    to recheck 5 seconds after a failure. This is achieved
    using migration-threshold and failure-timeout options *1.

    *1 https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_failure_response.html

    Change-Id: I4044810daa83f9bd7a59430b5c52c009149fac6e
    Partial-Bug: #1839021
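For reference, options of that kind are expressed as pacemaker meta attributes on the resource. A hedged sketch only; the resource name and values here are illustrative, and the actual change is in the review linked above:

```
# Illustrative crmsh configuration, not the merged change itself:
# never migrate the resource away on failure, and expire the
# failure count after 5 seconds so pacemaker retries it.
crm configure primitive res_ks_haproxy lsb:haproxy \
    meta migration-threshold=INFINITY failure-timeout=5s \
    op monitor interval=5s
```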

Revision history for this message
David Ames (thedac) wrote :
Changed in charm-interface-hacluster:
status: In Progress → Fix Released
assignee: David Ames (thedac) → nobody