Swact request is rejected due to a (phantom) unapplied config

Bug #1848355 reported by Tee Ngo
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Bin Qian

Bug Description

Brief Description
-----------------
A previous swact that occurs while a runtime manifest is being applied can lead to rejection of subsequent swact request.

Severity
--------
Critical

Steps to Reproduce
------------------
Perform a series of host-swact

Expected Behavior
------------------
All swacts should succeed

Actual Behavior
----------------
Third swact was rejected with the following reason:
controller-1 target Config 2bbdc151-53b9-427f-92fc-8c4e74df24bc not yet applied. Apply target Config via Lock/Unlock prior to Swact

and yet there was no alarm indicating controller-1 config being out of date.

sysinv.log:
===========
At 02:02:24, the config (8c4e74df24bc) was sent to active controller (controller-1), but the applied id had not be changed to previous config id (dc754340488c). At 02:02:27, controller-1 cleared the system config alarm against the previous config id (dc754340488c) again.

2019-10-16 02:02:24.390 560330 INFO sysinv.conductor.manager [-] Setting config target of host 'controller-1' to '2bbdc151-53b9-427f-92fc-8c4e74df24bc'.
2019-10-16 02:02:24.401 560330 WARNING sysinv.conductor.manager [-] controller-1: iconfig out of date: target 2bbdc151-53b9-427f-92fc-8c4e74df24bc, applied 495555d0-f3b6-473b-adab-dc754340488c
2019-10-16 02:02:24.401 560330 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-1 config applied: 495555d0-f3b6-473b-adab-dc754340488c vs. target: 2bbdc151-53b9-427f-92fc-8c4e74df24bc.
2019-10-16 02:02:24.458 560330 INFO sysinv.conductor.manager [-] _config_update_hosts config_uuid=2bbdc151-53b9-427f-92fc-8c4e74df24bc
2019-10-16 02:02:24.458 560330 INFO sysinv.conductor.manager [-] applying runtime manifest config_uuid=2bbdc151-53b9-427f-92fc-8c4e74df24bc, classes: ['openstack::keystone::endpoint::runtime', 'platform::firewall::runtime', 'platform::sysinv::runtime']
2019-10-16 02:02:24.469 560330 INFO sysinv.puppet.puppet [-] Updating hiera for host: controller-1 with config_uuid: 2bbdc151-53b9-427f-92fc-8c4e74df24bc
2019-10-16 02:02:27.377 560330 INFO sysinv.helm.manifest_base [-] Delete manifest file /opt/platform/armada/19.10/platform-integ-apps/1.0-8/platform-integ-apps-manifest-del.yaml generated
2019-10-16 02:02:27.378 560330 INFO sysinv.conductor.manager [-] No override change after configuration action, skipping re-apply of platform-integ-apps
2019-10-16 02:02:27.380 560330 INFO sysinv.agent.rpcapi [-] config_apply_runtime_manifest: fanout_cast: sending config 2bbdc151-53b9-427f-92fc-8c4e74df24bc {'classes': ['openstack::keystone::endpoint::runtime', 'platform::firewall::runtime', 'platform::sysinv::runtime'], 'force': False, 'personalities': ['controller'], 'host_uuids': [u'b0b201ef-e92f-4b1d-bdcc-5cfbfe092f89']} to agent
2019-10-16 02:02:27.384 20451 INFO sysinv.agent.manager [req-2bf8b457-d051-46da-b6db-a38ce87bdf99 admin None] config_apply_runtime_manifest: 2bbdc151-53b9-427f-92fc-8c4e74df24bc {u'classes': [u'openstack::keystone::endpoint::runtime', u'platform::firewall::runtime', u'platform::sysinv::runtime'], u'force': False, u'personalities': [u'controller'], u'host_uuids': [u'b0b201ef-e92f-4b1d-bdcc-5cfbfe092f89']} controller
2019-10-16 02:02:27.384 20451 INFO sysinv.agent.manager [req-2bf8b457-d051-46da-b6db-a38ce87bdf99 admin None] controller-active
2019-10-16 02:02:27.384 20451 INFO sysinv.agent.manager [req-2bf8b457-d051-46da-b6db-a38ce87bdf99 admin None] _apply_runtime_manifest with hieradata_path = '/opt/platform/puppet/19.10/hieradata'
2019-10-16 02:02:27.514 560330 INFO sysinv.conductor.manager [-] drbd-overview: pgsql-40.0, platform-9.8, extension-0.96875, patch-vault-0, etcd-4.8, dockerdistribution-16.0
2019-10-16 02:02:27.515 560330 INFO sysinv.conductor.manager [-] lvdisplay: pgsql-40.0, platform-10.0, extension-1.0, patch-vault-0, etcd-5.0, dockerdistribution-16.0
2019-10-16 02:02:27.526 560330 INFO sysinv.conductor.manager [-] SYS_I Clear system config alarm: controller-1 target config 495555d0-f3b6-473b-adab-dc754340488c
2019-10-16 02:02:27.585 560330 INFO sysinv.conductor.manager [-] SYS_I Clear system config alarm: controller-0 target config 2ff321c2-de17-4d97-97a8-b7bca29a868d

The keystone::endpoint, platform::firewall and platform::sysinv runtime manifests were applied around 02:02:28, and the manifests did not complete as a swact occured around 2019-10-16T02:02:51 (from controller-1 to controller-0 ) that interfered with the ongoing manifest apply.

puppet.log
==========
2019-10-16T02:02:28.416 Debug: 2019-10-16 02:02:28 +0000 Runtime environment: puppet_version=4.8.2, ruby_version=2.0.0, run_mode=user, default_encoding=UTF-8
....
....
....
2019-10-16T02:03:15.187 Unable to establish connection to http://[fd00:204::2]:5000/v3/auth/tokens: HTTPConnectionPool(host='fd00:204::2', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff01626d6d0>: Failed to establish a new connection: [Errno 111] Connection refused',))'. Retrying for 169 more seconds
2019-10-16T02:03:18.958 Unable to establish connection to http://[fd00:204::2]:5000/v3/auth/tokens: HTTPConnectionPool(host='fd00:204::2', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5fccf206d0>: Failed to establish a new connection: [Errno 111] Connection refused',))'. Retrying for 166 more seconds
2019-10-16T02:03:22.730 Unable to establish connection to http://[fd00:204::2]:5000/v3/auth/tokens: HTTPConnectionPool(host='fd00:204::2', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fadbf16b6d0>: Failed to establish a new connection: [Errno 111] Connection refused',))'. Retrying for 162 more seconds
2019-10-16T02:03:26.464 Unable to establish connection to http://[fd00:204::2]:5000/v3/auth/tokens: HTTPConnectionPool(host='fd00:204::2', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f83da8fb6d0>: Failed to establish a new connection: [Errno 113] No route to host',))'. Retrying for 158 more seconds
2019-10-16T02:03:30.219 Unable to establish connection to http://[fd00:204::2]:5000/v3/auth/tokens: HTTPConnectionPool(host='fd00:204::2', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f702628c6d0>: Failed to establish a new connection: [Errno 111] Connection refused',))'. Retrying for 154 more seconds
2019-10-16T02:03:33.979 Unable to establish connection to http://[fd00:204::2]:5000/v3/auth/tokens: HTTPConnectionPool(host='fd00:204::2', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6fa6e3e6d0>: Failed to establish a new connection: [Errno 111] Connection refused',))'. Retrying for 151 more seconds
2019-10-16T02:04:02.555 Error: 2019-10-16 02:04:02 +0000 Cannot create /opt/platform/sysinv/19.10; parent directory /opt/platform/sysinv does not exist
2019-10-16T02:04:02.669 Error: 2019-10-16 02:04:02 +0000 /Stage[main]/Platform::Sysinv::Runtime/File[/opt/platform/sysinv/19.10]/ensure: change from absent to directory failed: Cannot create /opt/platform/sysinv/19.10; parent directory /opt/platform/sysinv does not exist

Reproducibility
---------------
Seen once.

System Configuration
--------------------
IPv6 standard system

Branch/Pull Time/Commit
-----------------------
Oct. 9th, 2019 build

Last Pass
---------
Swact normally works

Timestamp/Logs
--------------

Test Activity
-------------
System Test

Revision history for this message
Bin Qian (bqian20) wrote :

Looks like the runtime manifest was start at 02:02:28 and done at 02:04:28, and the manifest was not successful.
Note that there is a swact occurred from controller-1 -> controller-0 before the manifest completed, at 02:03:09, the swact failed the manifest. For this instance, there were 2 potential breaking points, keystone not being available for a short period of time (from disabled on c-1 to enabled on c-2), and as well the /opt/platform/sysinv dir not available.

A potential solution is to block swact until runtime manifest is completed.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - swact failing

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.3.0 stx.ha
Changed in starlingx:
status: New → Triaged
assignee: nobody → Bin Qian (bqian20)
Ghada Khalil (gkhalil)
tags: added: stx.config
removed: stx.ha
Bin Qian (bqian20)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/691713

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/691714

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/692439

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/691713
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=5d9805a431b2d222c282f7b0426f18dd596e8d5d
Submitter: Zuul
Branch: master

commit 5d9805a431b2d222c282f7b0426f18dd596e8d5d
Author: Bin Qian <email address hidden>
Date: Mon Oct 28 12:27:42 2019 -0400

    Ensure controller-0 has right invprovision state

    Correct the invprovision state of controller-0 being "provisioned" before
    the node is actually provisioned.

    Partial-Bug: 1848355
    Change-Id: I42bb8556e623c6fbe6d6651ebb44cc5559975f27
    Signed-off-by: Bin Qian <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/691714
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=666008a5836124665258d1b06b4017ba3ae560c7
Submitter: Zuul
Branch: master

commit 666008a5836124665258d1b06b4017ba3ae560c7
Author: Bin Qian <email address hidden>
Date: Fri Oct 18 18:44:38 2019 -0400

    Copy sysinv.config to drbd drive only once

    This change is to ensure copying sysinv.config to drbd drive only once
    when the controller-0 is unlocked successfully the 1st time.

    The file was previously copied periodically from the active controller
    which causes occasionally fail to apply the runtime config if a swact
    happens to start. The result is that the controller is config-out-date.

    This fix also ensure all nodes going through the same invprovision
    cycle: none (installing)
           -> provisioning (installed)
           -> provisioned (unlocked successfully 1st time)
    Previously the controller-0 of a standard load starts with provisioned,
    even before it is unlocked the 1st time.

    Depends-on: https://review.opendev.org/691713
    Closes-Bug: 1848355
    Closes-Bug: 1849837

    Change-Id: Ifdfb985c8cde86f9c9074dc0774cb0dd805aacd1
    Signed-off-by: Bin Qian <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/692439
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=bc35c90653f4f63edf39f5447f200d9234b3151f
Submitter: Zuul
Branch: master

commit bc35c90653f4f63edf39f5447f200d9234b3151f
Author: Bin Qian <email address hidden>
Date: Wed Oct 30 08:17:32 2019 -0400

    Remove manifest to copy sysinv.conf to drbd storage

    sysinv-conductor will copy the file to drbd when it is ready. Removing
    puppet code doing the job which is no long needed.

    Change-Id: Ia9a78ec01e6b87d31dbdb4016fd2a7826734f59d
    Signed-off-by: Bin Qian <email address hidden>
    Depends-On: https://review.opendev.org/#/c/691714
    Partial-Bug: 1848355

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.