tripleo

centos8 standalone-upgrade ussuri fails Error: cluster is not configured

Bug #1887159 reported by Marios Andreou on 2020-07-10

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	Critical	Marios Andreou	tripleo victoria-3 "tripleo victoria"

Bug Description

centos8 standalone-upgrade ussuri fails at [1] during step4 of the upgrade tasks with trace like:

        2020-07-08 15:02:05 | TASK [Start pacemaker cluster] *************************************************
        2020-07-08 15:02:05 | Wednesday 08 July 2020 15:02:05 +0000 (0:00:00.208) 0:05:41.812 ********
        2020-07-08 15:02:08 | fatal: [standalone]: FAILED! => changed=false
        2020-07-08 15:02:08 | msg: |-
        2020-07-08 15:02:08 | Command execution failed.
        2020-07-08 15:02:08 | Command: `pcs cluster start`
        2020-07-08 15:02:08 | Error: Error: cluster is not currently configured on this node

It looks like pacemaker is indeed missing (?); at least, I can't see any of the expected 'pacemaker'/'pcsd' logs at [2]. Note that this job is 'new': it is being re-added with [3] and tested in [4].

I *suspect* it may have to do with the "switch to HA by default" at [5] which is on stable/ussuri but not on stable/train.

[1] https://d72da4f3bf40b5c15d18-39524de8c5a1fb89d206195b6f692473.ssl.cf1.rackcdn.com/739457/6/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/6cb2420/logs/undercloud/home/zuul/standalone_upgrade.log
[2] https://d72da4f3bf40b5c15d18-39524de8c5a1fb89d206195b6f692473.ssl.cf1.rackcdn.com/739457/6/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/6cb2420/logs/undercloud/var/log/
[3] https://review.opendev.org/#/c/738844/
[4] https://review.opendev.org/#/c/739457/
[5] https://review.opendev.org/#/c/359060/

Marios Andreou (marios-b) wrote on 2020-07-10:

14:30 < bandini> marios: just to get the full context, what does that job do. Deploy a standalone and does a minor update? or an upgrade from train to ussuri?
14:30 < marios> bandini: deploys train and upgrade to ussuri
14:30 < bandini> if the latter then I don't think we have a simple way to fix this, since we changed the default
14:32 < marios> bandini: it's not the 'full upgrade workflow' but yeah for the intents of what you're saying it indeed takes train things and deploys them then fetches the ussuri things and tries to run the upgrade tasks
14:32 < marios> bandini: ie. including tht and all the things

Marios Andreou (marios-b) wrote on 2020-07-23:

Did some digging here today. The problem is that there is no pacemaker (on standalone) for train, but for ussuri pacemaker is there by default with [1]. This means that during the upgrade the tasks at [2] are executed and we get the error in the description above.

Really we need an exception for the tasks at [2], i.e. when coming from an environment without pacemaker. Once the upgrade tasks are completed, the deploy steps will be run, so in theory pacemaker will eventually be deployed fine. Looking at the rest of the upgrade tasks (steps 4 and 5) after this error, I can't see anything else directly invoking the cluster [3][4].

I am currently testing that with an explicit skip in my test review at [5]. I'm not sure it will work, but even if it does we need to work out what the condition for skipping it will be, i.e. coming from a no-pacemaker to a pacemaker environment for the train->ussuri upgrade.
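
A minimal sketch of the kind of guard discussed in this comment, written in Python rather than as an upgrade task. It assumes that a node counts as "configured" when a corosync configuration file exists; the path `/etc/corosync/corosync.conf`, the function names, and the injectable `run` callable are illustrative assumptions, not the condition used in tripleo-heat-templates.

```python
import os
import subprocess

# Assumption: the usual corosync config location; not taken from this bug.
DEFAULT_COROSYNC_CONF = "/etc/corosync/corosync.conf"


def cluster_is_configured(conf_path=DEFAULT_COROSYNC_CONF):
    """Return True if a cluster configuration file exists on this node."""
    return os.path.isfile(conf_path)


def start_cluster_if_configured(conf_path=DEFAULT_COROSYNC_CONF,
                                run=subprocess.run):
    """Run `pcs cluster start` only when a cluster is already configured.

    Returns True if the command was run, False if it was skipped.
    """
    if not cluster_is_configured(conf_path):
        # Coming from a deployment without pacemaker (e.g. train standalone):
        # skip the start and leave cluster creation to the later deploy steps.
        return False
    run(["pcs", "cluster", "start"], check=True)
    return True
```

The `run` parameter exists only so the decision logic can be exercised without `pcs` installed; whether a file check is the right skip condition for the train->ussuri case is exactly the open question above.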

[1] https://review.opendev.org/#/c/359060/77/overcloud-resource-registry-puppet.j2.yaml@165
[2] https://opendev.org/openstack/tripleo-heat-templates/src/commit/45c959a5ea9b71515fccf2d1e3763c2e166e6b3d/deployment/pacemaker/pacemaker-baremetal-puppet.yaml#L311
[3] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f6b/739457/9/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/f6b75f7/logs/undercloud/home/zuul/standalone-ansible-uryyxixd/Standalone/upgrade_tasks_step4.yaml
[4] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f6b/739457/9/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/f6b75f7/logs/undercloud/home/zuul/standalone-ansible-uryyxixd/Standalone/upgrade_tasks_step5.yaml
[5] https://review.opendev.org/#/c/739457/11/deployment/pacemaker/pacemaker-baremetal-puppet.yaml

Marios Andreou (marios-b) wrote on 2020-07-24:

Somehow my change at [1] is being ignored, and I am trying to understand why. I suspect it is because in this job we have the releases.sh [2], which is used to set up repos before deploy and then again before upgrade, and as a result the depends-on is being ignored/skipped.

Perhaps build-test-packages is done before deployment, but not before upgrade, so we need to rerun at that point? Still trying to understand.

[1] https://review.opendev.org/#/c/739457/11/deployment/pacemaker/pacemaker-baremetal-puppet.yaml
[2] https://4c9bea7f28088149dfcc-1634c0891e365f9cbd205121e56f3649.ssl.cf1.rackcdn.com/739457/11/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/36584f0/logs/quickstart_files/releases.sh

Marios Andreou (marios-b) wrote on 2020-07-24:

Per comment #3, I forgot to add a pointer to the logs where the change is ignored; see https://4c9bea7f28088149dfcc-1634c0891e365f9cbd205121e56f3649.ssl.cf1.rackcdn.com/739457/11/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/36584f0/logs/undercloud/home/zuul/standalone_upgrade.log

2020-07-23 16:27:16 | 2020-07-23 16:27:16.618295 | bc764e10-0185-5407-9c61-000000000818 | TIMING | Update all packages | 0:07:31.616 | 8.31s
2020-07-23 16:27:16 | 2020-07-23 16:27:16.789910 | fbf4b653-1a51-44be-8ad8-4bf105c81da3 | INCLUDED | /home/zuul/standalone-ansible-3kfzjh81/Standalone/upgrade_tasks_step4.yaml | standalone
2020-07-23 16:27:16 | 2020-07-23 16:27:16.888732 | bc764e10-0185-5407-9c61-000000000a3c | TASK | Start pacemaker cluster
2020-07-23 16:27:16 | 2020-07-23 16:27:16.890256 | bc764e10-0185-5407-9c61-000000000027 | TIMING | include_tasks | 0:07:31.888 | 0.27s
2020-07-23 16:27:19 | 2020-07-23 16:27:19.428557 | bc764e10-0185-5407-9c61-000000000a3c | FATAL | Start pacemaker cluster | standalone | error={"changed": false, "msg": "Command execution failed.\nCommand: `pcs cluster start`\nError: Error: cluster is not currently configured on this node\n"}
2020-07-23 16:27:19 |
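
For picking failures like this out of a long standalone_upgrade.log, a small parser can help. The following is a sketch only: it assumes the pipe-separated layout seen in the excerpt above (two timestamps, an id, a record type, the task name, the host, then `error=<json>`), and the function name is hypothetical.

```python
import json


def parse_fatal_line(line):
    """Return (task, host, error_dict) for a FATAL record, else None.

    Assumes the layout shown in the log excerpt:
    ts | ts | id | FATAL | task | host | error={...json...}
    """
    # Limit the split so any " | " inside the JSON payload stays intact.
    parts = line.strip().split(" | ", 6)
    if len(parts) != 7 or parts[3] != "FATAL":
        return None
    task, host, raw = parts[4], parts[5], parts[6]
    if not raw.startswith("error="):
        return None
    return task, host, json.loads(raw[len("error="):])
```

Applied to the FATAL line above, this yields the task name `Start pacemaker cluster`, the host `standalone`, and a dict whose `msg` field holds the `pcs cluster start` error text.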

Marios Andreou (marios-b) wrote on 2020-07-27:

As discussed today with weshay & DF, I am trying to set the HA env for the train deployment and then upgrade to ussuri. Posted https://review.opendev.org/#/c/742418/3/zuul.d/standalone-jobs.yaml@137 and the test is running in https://review.opendev.org/#/c/739457; let's see.

Marios Andreou (marios-b) wrote on 2020-07-28:

Quick update: still inconclusive, still waiting. The first run from comment #5 failed on the 'docker throttling issue' at [1]:

Failure: id 9df66ab74dd3f6a7b95f4ed4f16538ac3ac28b02, status 401, reason Unauthorized text {"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":[{"Type":"repository","Class":"","Name":"tripleotraincentos8/centos-binary-cinder-volume","Action":"pull"}]}]}
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: http://mirror.bhs1.ovh.opendev.org:8082/v2/tripleotraincentos8/centos-binary-cinder-api/blobs/sha256:039ecb888e3713490da27b3bdcaf081b050ac7cb5023dec080a52c60e1265781

Then the second run today failed [2] because train has docker as the cli at [3]. I updated [4] to include the podman env file last and rechecked the test at [5]; let's see.

[1] https://67dfaceef38c0863cda6-10caedded388001c6bbc38619ca4b324.ssl.cf1.rackcdn.com/739457/12/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/59243d8/logs/undercloud/home/zuul/standalone_deploy.log
[2] https://6afc0ec476a3d9ae35ae-597ff148d0ea9164d11e7cb764cf9b04.ssl.cf2.rackcdn.com/739457/12/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/c1f0b8a/logs/undercloud/home/zuul/standalone_deploy.log
[3] https://opendev.org/openstack/tripleo-heat-templates/src/commit/36fdfc53758cec8a09b92c7931f964d3fc053d64/environments/docker-ha.yaml#L27
[4] https://review.opendev.org/#/c/742418/4/zuul.d/standalone-jobs.yaml
[5] https://review.opendev.org/#/c/739457
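
Why the order of the environment files matters here can be sketched as follows. This assumes Heat-style merging, where a later environment file overrides an earlier one for the same parameter; the parameter name `ContainerCli` and the two trimmed-down dicts are illustrative stand-ins, not the full contents of the real files.

```python
def merge_parameter_defaults(env_files):
    """Merge parameter_defaults from environments in order; later ones win."""
    merged = {}
    for env in env_files:
        merged.update(env.get("parameter_defaults", {}))
    return merged


# Hypothetical, trimmed-down stand-ins for the two environment files.
docker_ha = {"parameter_defaults": {"ContainerCli": "docker"}}
podman = {"parameter_defaults": {"ContainerCli": "podman"}}

# HA env last: the docker setting overrides podman.
print(merge_parameter_defaults([podman, docker_ha])["ContainerCli"])  # docker
# podman env last: podman overrides the docker setting from the HA env.
print(merge_parameter_defaults([docker_ha, podman])["ContainerCli"])  # podman
```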

Emilien Macchi (emilienm) on 2020-07-28

Changed in tripleo:
milestone:	victoria-1 → victoria-3

Marios Andreou (marios-b) wrote on 2020-07-29:

So the test at [1] is still failing; however, I think we can consider closing out this particular bug, as we may have a new bug now.

At [2] we can see the upgrade tasks playbook completes successfully and then the deploy tasks are re-executed after the upgrade, which is where the new failure is happening:

2020-07-28 21:36:30 | 2020-07-28 21:36:30.062 258515 INFO tripleoclient.utils.utils [-] Ansible execution success. playbook: upgrade_steps_playbook.yaml
2020-07-28 21:36:30 | 2020-07-28 21:36:30.062 258515 INFO tripleoclient.utils.utils [-] Running Ansible playbook: deploy_steps_playbook.yaml, Working directory: /home/zuul/standalone-ansible-sqhq3tvy, Playbook directory: /home/zuul/standalone-ansible-sqhq3tvy

...

        2020-07-28 21:57:16 | [ERROR]: Container(s) which finished with wrong return code:
        2020-07-28 21:57:16 | ['haproxy_restart_bundle']
        2020-07-28 21:57:16 | 2020-07-28 21:57:16.141685 | bc764e10-2a1a-a56a-2191-000000002cbf | FATAL | Check containers status | standalone | error={"changed": false, "msg": "Failed container(s): ['haproxy_restart_bundle'], check logs in /var/log/containers/stdouts/"}

Looking at the haproxy stdouts [3], it actually can't find haproxy-bundle:

        2020-07-28T21:57:09.678381400+00:00 stdout F Tue Jul 28 21:57:09 UTC 2020: Restarting haproxy-bundle globally
        2020-07-28T21:57:10.215468255+00:00 stderr F Error: Error performing operation: No such device or address
        2020-07-28T21:57:10.215468255+00:00 stderr F haproxy-bundle is not running anywhere and so cannot be restarted

I think we will file a new bug for this though... this should move to fix-released?

[1] https://review.opendev.org/#/c/739457
[2] https://76bd6a632dcf869ef49b-9d73aaaa1727b5f44155748d9566e05a.ssl.cf2.rackcdn.com/739457/13/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/a6a63ab/logs/undercloud/home/zuul/standalone_upgrade.log
[3]

Marios Andreou (marios-b) wrote on 2020-07-29:

hit return too quickly on comment #7 above, adding missing link at [3]

[3] https://76bd6a632dcf869ef49b-9d73aaaa1727b5f44155748d9566e05a.ssl.cf2.rackcdn.com/739457/13/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/a6a63ab/logs/undercloud/var/log/containers/stdouts/haproxy_restart_bundle.log

Marios Andreou (marios-b) wrote on 2020-08-03:

new issue per comment #7 at

bugs.launchpad.net/tripleo/+bug/1889395 centos8 standalone-upgrade ussuri job Failed container(s): ['haproxy_restart_bundle

Marios Andreou (marios-b) wrote on 2020-08-03:

https://bugs.launchpad.net/tripleo/+bug/1889395

Marios Andreou (marios-b) on 2020-08-05

Changed in tripleo:
status:	Triaged → Fix Released
status:	Fix Released → Triaged
status:	Triaged → In Progress

Marios Andreou (marios-b) wrote on 2020-09-10:

Moving this bug to fix-released; it was fixed by https://review.opendev.org/#/c/742418/
Even though this bug is addressed, the job is *still* not green in the test [1], and we are now chasing yet another new bug for that [2].

[1] https://review.opendev.org/#/c/739457/
[2] https://bugs.launchpad.net/tripleo/+bug/1895138

Changed in tripleo:
status:	In Progress → Fix Released
