Ansible: config out of date after unlocking AIO-SX controller

Bug #1828271 reported by Allain Legacy
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| StarlingX | Fix Released | High | Ovidiu Poncea | |

Bug Description

Brief Description
-----------------
Following an unlock of an AIO-SX controller that was configured with Ansible, the node remains in a "config-out-of-date" condition. I am not 100% sure, but based on the logs it looks like the alarm condition is due to storage-related functionality; it is not clear whether it is OSD related or iSCSI related. See the logs below. On a previous test the logs looked similar, but OSD-related actions were logged immediately before them (i.e., the OSDs were added before the unlock but were actually configured by the system as a post-unlock step).

Severity
--------
Critical, this prevents the following manifests from running as part of the conductor manifest.

    'openstack::keystone::endpoint::runtime',
    'platform::firewall::runtime'

Aside from those two manifests, a pending Ansible update (in progress, not yet pushed) will be impacted by this issue: the sysinv.conf defaults will not be copied to the /opt shared mount, so no other nodes will be able to report their inventory because they will not be configured with the correct Rabbit URL. At a minimum, that will impact AIO-DX configurations, but it is not clear whether a similar condition can occur for Storage or Standard configurations.

Steps to Reproduce
------------------
Use Ansible to configure an AIO controller. Manually configure the node and then unlock it. Following the unlock/reboot observe that there is a config-out-of-date alarm.

Expected Behavior
------------------
After the unlock there should be no config-out-of-date alarm.

Actual Behavior
----------------
There is a config-out-of-date alarm raised.

Reproducibility
---------------
100%

System Configuration
--------------------
AIO-DX and AIO-SX at a minimum.

Branch/Pull Time/Commit
-----------------------
Private load rebased on May 6th with some Ansible and networking fixes.

Last Pass
---------
Unknown

Timestamp/Logs
--------------
2019-05-08 17:25:10.516 89124 INFO sysinv.conductor.manager [req-927e8042-3020-4857-9bb7-078c187688eb None None] platform_interfaces host_id=1 info_list=[]
2019-05-08 17:25:10.673 10680 INFO sysinv.agent.manager [-] iscsi initiator name = iqn.1994-05.com.redhat:2f4c6ea42251
2019-05-08 17:25:10.703 89124 INFO sysinv.conductor.manager [req-927e8042-3020-4857-9bb7-078c187688eb None None] Updating platform data for host: 1966219d-e334-4b3e-af38-f85107d1ff55 with: {u'config_applied': u'58979769-3808-438c-b5c9-814299d224de', u'first_report': True, u'availability': u'available', u'iscsi_initiator_name': u'iqn.1994-05.com.redhat:2f4c6ea42251'}
2019-05-08 17:25:11.036 89124 WARNING sysinv.conductor.manager [req-927e8042-3020-4857-9bb7-078c187688eb None None] controller-0: iconfig out of date: target d8979769-3808-438c-b5c9-814299d224de, applied 58979769-3808-438c-b5c9-814299d224de
2019-05-08 17:25:11.036 89124 WARNING sysinv.conductor.manager [req-927e8042-3020-4857-9bb7-078c187688eb None None] SYS_I Raise system config alarm: host controller-0 config applied: 58979769-3808-438c-b5c9-814299d224de vs. target: d8979769-3808-438c-b5c9-814299d224de.
2019-05-08 17:25:11.087 10680 INFO sysinv.agent.manager [-] Sysinv Agent platform update by host: {'config_applied': '58979769-3808-438c-b5c9-814299d224de', 'first_report': True, 'availability': 'available', 'iscsi_initiator_name': 'iqn.1994-05.com.redhat:2f4c6ea42251'}
2019-05-08 17:25:13.032 89124 INFO sysinv.conductor.manager [-] _controller_config_active_apply about to resize the filesystem
2019-05-08 17:25:13.033 89124 WARNING sysinv.conductor.manager [-] resizing filesystems
2019-05-08 17:25:13.525 89124 INFO sysinv.conductor.manager [-] drbd-overview: pgsql-20.0, cgcs-9.9, extension-0.96875, patch-vault-0, etcd-4.8, dockerdistribution-16.0
2019-05-08 17:25:13.526 89124 INFO sysinv.conductor.manager [-] lvdisplay: pgsql-20.0, cgcs-10.0, extension-1.0, patch-vault-0, etcd-5.0, dockerdistribution-16.0

Test Activity
-------------
Developer Testing

Ovidiu Poncea (ovidiuponcea) wrote :

It's a strange chain of events. We do reset the node to config-out-of-date when an OSD is configured, which is expected. It should be cleared after unlock, but instead we get this:

after unlock
------------
| config_applied | 62228cc1-e5da-4f2e-a3c3-c468e9a46fb5 |
| config_status | Config out-of-date |
| config_target | e2228cc1-e5da-4f2e-a3c3-c468e9a46fb5 |

If we look closely at the two values, we see that they are identical except for the first 'bit'. So we have config-out-of-date at a 1-bit difference :) This means that the config has applied correctly, but something else is either not correctly set or not correctly updated in the DB.

Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
Allain Legacy (alegacy) wrote :

That is the reboot required bit:

# configuration UUID reboot required flag (bit)
CONFIG_REBOOT_REQUIRED = (1 << 127)
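
As a quick check of the above, here is a minimal sketch using only Python's standard uuid module. The helper name differs_only_by_reboot_flag is made up for illustration and is not part of sysinv; only the constant and the two UUIDs come from this report. XOR-ing the two values as 128-bit integers should leave only bit 127 set:

```python
import uuid

# Flag value quoted in this report.
CONFIG_REBOOT_REQUIRED = 1 << 127


def differs_only_by_reboot_flag(target, applied):
    """Return True if the two UUID strings differ only in bit 127.

    Hypothetical helper for illustration; not a sysinv function.
    """
    diff = uuid.UUID(target).int ^ uuid.UUID(applied).int
    return diff == CONFIG_REBOOT_REQUIRED


# Values from the "after unlock" host-show output in this report.
target = "e2228cc1-e5da-4f2e-a3c3-c468e9a46fb5"
applied = "62228cc1-e5da-4f2e-a3c3-c468e9a46fb5"
result = differs_only_by_reboot_flag(target, applied)
print(result)
```

The same check can be run against the target/applied pair in the conductor log (d8979769-... vs. 58979769-...), which differs in the same leading hex digit.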

Ghada Khalil (gkhalil) wrote :

Marking as release gating; affects system install/commissioning for AIO-SX

tags: added: stx.2.0 stx.config
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Ovidiu Poncea (ovidiuponcea) wrote :

Yeah... it should be cleared after reboot, but it looks like it's not. I need to take a look at the underlying mechanism.

Changed in starlingx:
status: Triaged → In Progress
Ovidiu Poncea (ovidiuponcea) wrote :

The issue is related to limitations of config_target and config_applied, not to storage. A config uuid with the reboot flag is passed to puppet ONLY when the host is unlocked (which makes sense, as this is when we do the reboot). Runtime manifests don't pass the reboot flag to puppet (it is a runtime apply, so the reboot flag has to remain). So at unlock we set it correctly, but since the last operation in Ansible is to run a set of runtime manifests, we reset it one more time to a value without the reboot flag. The reboot flag is therefore no longer set, which is why after unlock we get the one-bit difference in:

after unlock
------------
| config_applied | 62228cc1-e5da-4f2e-a3c3-c468e9a46fb5 |
| config_status | Config out-of-date |
| config_target | e2228cc1-e5da-4f2e-a3c3-c468e9a46fb5 |
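
The ordering problem described above can be modelled roughly as follows. This is a toy sketch, not sysinv code: the ToyHost class, its methods and the unlock() helper are all hypothetical, the hieradata write is reduced to a single attribute, and "out of date" is simplified to a plain inequality between target and applied. Only the flag value is taken from this report.

```python
import uuid

CONFIG_REBOOT_REQUIRED = 1 << 127  # flag value quoted in this report


class ToyHost:
    """Hypothetical stand-in for a host record; not a sysinv class."""

    def __init__(self):
        self.config_target = None
        self.config_applied = None
        self.hieradata_uuid = None  # models platform::config::params::config_uuid

    def set_target(self, reboot_required):
        # New random uuid with bit 127 forced on or off.
        value = uuid.uuid4().int & ~CONFIG_REBOOT_REQUIRED
        if reboot_required:
            value |= CONFIG_REBOOT_REQUIRED
        self.config_target = uuid.UUID(int=value)

    def write_hieradata(self, runtime):
        value = self.config_target.int
        if runtime:
            # Runtime manifests pass the uuid without the reboot flag.
            value &= ~CONFIG_REBOOT_REQUIRED
        self.hieradata_uuid = uuid.UUID(int=value)

    def agent_report(self):
        # After the reboot, the agent reports whatever was last written.
        self.config_applied = self.hieradata_uuid

    def out_of_date(self):
        return self.config_target != self.config_applied


def unlock(host, resend_flagged_uuid):
    host.set_target(reboot_required=True)
    host.write_hieradata(runtime=False)  # unlock path keeps the flag
    host.write_hieradata(runtime=True)   # trailing runtime apply strips it
    if resend_flagged_uuid:
        # Last step before reboot: new flagged uuid, sent with the flag intact.
        host.set_target(reboot_required=True)
        host.write_hieradata(runtime=False)
    host.agent_report()
    return host.out_of_date()


buggy = unlock(ToyHost(), resend_flagged_uuid=False)
fixed = unlock(ToyHost(), resend_flagged_uuid=True)
print(buggy, fixed)
```

In this model the host only ends up in sync when the flagged uuid is the last value written before the reboot, which is the behaviour the eventual fix aims for.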

Ovidiu Poncea (ovidiuponcea) wrote :

A solution that would avoid situations like this is to move the reboot flag out of the config_uuid so that it won't get overwritten when runtime manifests are applied. However, such a modification has a high impact, as we make extensive use of runtime manifests throughout the code.

For now I'll focus on solving the issue at hand, as the solution proves to be confined to the latest code changes.
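
To illustrate the alternative mentioned above (and only that; it is not what the fix does), here is a minimal sketch in which the reboot requirement is a separate field rather than a bit inside the uuid. The ConfigState dataclass and apply_runtime_manifest function are hypothetical names, not existing sysinv code.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigState:
    """Hypothetical record: uuid and reboot requirement kept apart."""
    config_uuid: uuid.UUID
    reboot_required: bool = False


def apply_runtime_manifest(state, new_uuid):
    # A runtime apply replaces the uuid but cannot touch the reboot
    # requirement, because the flag no longer lives inside the uuid.
    return ConfigState(config_uuid=new_uuid,
                       reboot_required=state.reboot_required)


before = ConfigState(config_uuid=uuid.uuid4(), reboot_required=True)
after = apply_runtime_manifest(before, uuid.uuid4())
print(after.reboot_required)
```

With this layout a trailing runtime manifest apply changes the uuid while the pending-reboot state survives untouched.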

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/658391

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/658391
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=9720932899b69287871a419422880f04d618286f
Submitter: Zuul
Branch: master

commit 9720932899b69287871a419422880f04d618286f
Author: Ovidiu Poncea <email address hidden>
Date: Fri May 10 17:46:27 2019 +0300

    Fix missing reboot flag for config uuid on unlock

    Due to a limitation in config uuid functionality, on first unlock
    of controller-0, node remains in config-out-of-date as we lose
    the reboot flag.

    Example output after unlock:
    $ system host-show controller-0 | grep config
    | config_applied | 62228cc1-e5da-4f2e-a3c3-c468e9a46fb5 |
    | config_status | Config out-of-date |
    | config_target | e2228cc1-e5da-4f2e-a3c3-c468e9a46fb5 |

    The reboot flag is:
    CONFIG_REBOOT_REQUIRED = (1 << 127)

    We set config_target through sysinv and config_applied
    through puppet once manifests have applied. If the reboot
    flag in config_target is set but not in config_applied we are
    "Config-out-of-date".

    On host-unlock or runtime manifest apply we set config_uuid in
    hieradata to e.g.:
    platform::config::params::config_uuid: \
       62228cc1-e5da-4f2e-a3c3-c468e9a46fb5

    Then, after runtime manifest apply or after reboot, sysinv-agent
    takes this value and updates config_applied.

    A config uuid with the reboot flag is passed to puppet ONLY when
    host is unlocked (which makes sense as this is when we do the
    reboot). Runtime manifests don't pass the reboot flag to puppet
    (it is a runtime, reboot flag has to remain).
    So, in our case, at unlock it is correctly set but then sysinv
    does a runtime manifest apply and resets it to a value w/o
    the reboot flag. Therefore, the reboot flag is no longer set,
    that's why even after unlock we still have Config-out-of-date.

    To fix the issue we generate a new config_uuid with the reboot
    flag set and we properly send it to puppet as the last operation
    we attempt before reboot.

    Change-Id: I12865d45f4456de81d72689f799441531a444bea
    Closes-Bug: #1828271
    Closes-Bug: #1829004
    Closes-Bug: #1829260
    Signed-off-by: Ovidiu Poncea <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
summary: - config out of date after unlocking AIO-SX controller
+ Ansible: config out of date after unlocking AIO-SX controller