Failure with evaluate_app_reapply can interrupt runtime manifests apply

Bug #1891933 reported by Kristine Bujold
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Kristine Bujold

Bug Description

Brief Description
-----------------

A runtime manifest apply invokes the re-apply evaluation for each supported installed app. The evaluation regenerates the application overrides and compares them with the old ones. During overrides generation, the "helm install" command is used to generate the overrides for each chart.
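
The regenerate-and-compare step can be illustrated with a minimal sketch. This is not the sysinv implementation: the function names and the digest-based comparison are assumptions made for illustration, and the override values are invented.

```python
import hashlib
import json


def overrides_digest(overrides):
    """Return a stable digest for a dict of chart overrides (hypothetical helper)."""
    # Sorting keys makes the digest independent of dict ordering.
    canonical = json.dumps(overrides, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_reapply(old_overrides, new_overrides):
    """True when the regenerated overrides differ from the stored ones."""
    return overrides_digest(old_overrides) != overrides_digest(new_overrides)


# Illustrative values only; real overrides come from the generated chart values.
old = {"cert-manager": {"replicaCount": 1}}
same = {"cert-manager": {"replicaCount": 1}}
changed = {"cert-manager": {"replicaCount": 2}}

print(needs_reapply(old, same))     # False
print(needs_reapply(old, changed))  # True
```

When the comparison reports no difference, the re-apply is skipped, which matches the "No override change after configuration action, skipping re-apply" log message in this report.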

If the helm command fails, it can interrupt the runtime manifest apply, leaving the 250.001 alarm "Configuration is out-of-date" stuck.

This is only an issue when applications have overrides; in this case, the cert-manager application had an override.

sysinv 2020-07-30 00:01:48.553 94899 INFO sysinv.conductor.manager [-] Setting config target of host 'controller-0' to '5e71161d-9ec1-4d68-b2d7-4f04d936cd15'.
sysinv 2020-07-30 00:01:48.566 94899 WARNING sysinv.conductor.manager [-] controller-0: iconfig out of date: target 5e71161d-9ec1-4d68-b2d7-4f04d936cd15, applied 21d1dbbb-6260-4c92-b97d-29f816385b3c
sysinv 2020-07-30 00:01:48.566 94899 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-0 config applied: 21d1dbbb-6260-4c92-b97d-29f816385b3c vs. target: 5e71161d-9ec1-4d68-b2d7-4f04d936cd15.
sysinv 2020-07-30 00:01:48.578 94899 INFO sysinv.conductor.manager [-] _config_update_hosts config_uuid=5e71161d-9ec1-4d68-b2d7-4f04d936cd15
sysinv 2020-07-30 00:01:48.579 94899 INFO sysinv.conductor.manager [-] applying runtime manifest config_uuid=5e71161d-9ec1-4d68-b2d7-4f04d936cd15, classes: ['openstack::keystone::endpoint::runtime', 'platform::firewall::runtime']
sysinv 2020-07-30 00:01:48.588 94899 INFO sysinv.puppet.puppet [-] Updating hiera for host: controller-0 with config_uuid: 5e71161d-9ec1-4d68-b2d7-4f04d936cd15
sysinv 2020-07-30 00:01:50.616 94899 ERROR sysinv.puppet.kubernetes [-] Failed to get device id for pci device 0000:b6:00.0
sysinv 2020-07-30 00:01:50.879 94899 INFO sysinv.puppet.kubernetes [-] get_kubernetes_join_cmd join_cmd=kubeadm join [fd00:4888:0:1::1]:6443 --token 5jg99s.2ymqkn3ks7qpqfbt --discovery-token-ca-cert-hash sha256:5da52e9246e22e7ac542c376da684930fca5ca2b33f5199139b7f87c29b56572 --control-plane --certificate-key daba0f3d927234991b5d8b41bd22937e3d691804513ea58cd81bdb990070cf3a --apiserver-advertise-address fd00:4888:0:1::2 --cri-socket /var/run/containerd/containerd.sock
sysinv 2020-07-30 00:01:52.479 94899 INFO sysinv.conductor.manager [-] No override change after configuration action, skipping re-apply of platform-integ-apps
sysinv 2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task [-] Error during ConductorManager._conductor_audit: Command '['helm', 'install', '--dry-run', '--debug', '--values', '/var/run/sysinv_tmp/tmpz5ej3V', '--values', '/var/run/sysinv_tmp/tmpyz7iZo', '/var/run/sysinv_tmp/tmp9dMpVO']' returned non-zero exit status 1: CalledProcessError: Command '['helm', 'install', '--dry-run', '--debug', '--values', '/var/run/sysinv_tmp/tmpz5ej3V', '--values', '/var/run/sysinv_tmp/tmpyz7iZo', '/var/run/sysinv_tmp/tmp9dMpVO']' returned non-zero exit status 1
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task Traceback (most recent call last):
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/periodic_task.py", line 180, in run_periodic_tasks
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task task(self, context)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 5052, in _conductor_audit
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task self._controller_config_active_apply(context)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 4771, in _controller_config_active_apply
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task context, config_uuid, config_dict)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 8876, in _config_apply_runtime_manifest
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task self.evaluate_app_reapply(context, app_name)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 11277, in evaluate_app_reapply
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task armada_format=True, armada_chart_info=app.charts, combined=True)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/helm/helm.py", line 49, in _wrapper
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task return func(self, *args, **kwargs)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task return f(*args, **kwargs)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/helm/helm.py", line 680, in generate_helm_application_overrides
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task file_overrides=file_overrides)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/helm/helm.py", line 551, in merge_overrides
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task output = subprocess.check_output(cmd, env=env)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task raise CalledProcessError(retcode, cmd, output=output)
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task CalledProcessError: Command '['helm', 'install', '--dry-run', '--debug', '--values', '/var/run/sysinv_tmp/tmpz5ej3V', '--values', '/var/run/sysinv_tmp/tmpyz7iZo', '/var/run/sysinv_tmp/tmp9dMpVO']' returned non-zero exit status 1
2020-07-30 00:01:52.695 94899 ERROR sysinv.openstack.common.periodic_task

Severity
--------
Major

Steps to Reproduce
------------------

It is not certain how to reproduce this exact failure. It was seen in a DC lab. The helm install command failed because tiller was not yet running on the system when the sysinv conductor tried to apply the runtime manifests.

tiller-deploy-5c8dd9fb56-88w6g_kube-system_tiller-c1e2563a9146b9f525c431904761dedeea3c7c19a12c78507d0e6f1251711761.log
2020-07-30T00:01:50.994005897Z stderr F [main] 2020/07/30 00:01:50 Starting Tiller v2.13.1 (tls=false)

sysinv.log
sysinv 2020-07-30 00:01:42.995 94899 ERROR sysinv.openstack.common.rpc.amqp [-] Exception during message handling: CalledProcessError: Command '['helm', 'install', '--dry-run', '--debug', '--values', '/var/run/sysinv_tmp/tmpNTjKcb', '--values', '/var/run/sysinv_tmp/tmpcwivSo', '/var/run/sysinv_tmp/tmpP05WVg']' returned non-zero exit status 1

Expected Behavior
------------------
The 250.001 alarm "Configuration is out-of-date" is not raised.
No exceptions are logged in sysinv.log, such as:

Command '['helm', 'install', '--dry-run', '--debug', '--values', '/var/run/sysinv_tmp/tmpz5ej3V', '--values', '/var/run/sysinv_tmp/tmpyz7iZo', '/var/run/sysinv_tmp/tmp9dMpVO']' returned non-zero exit status 1

Actual Behavior
----------------
See above

Reproducibility
---------------
Intermittent

System Configuration
--------------------
DC system

Branch/Pull Time/Commit
-----------------------

Last Pass
---------

Timestamp/Logs
--------------

Test Activity
-------------
Testing

Workaround
----------

Changed in starlingx:
assignee: nobody → Kristine Bujold (kbujold)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/746609

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - issue is intermittent, but leads to the config not being applied properly.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.5.0 stx.config
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/746609
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=f9b83fef12daf44d594887ec4bdf518371c77287
Submitter: Zuul
Branch: master

commit f9b83fef12daf44d594887ec4bdf518371c77287
Author: Kristine Bujold <email address hidden>
Date: Mon Aug 17 15:24:22 2020 -0400

    Prevent manifest apply interruptions by helm

    Improved the evaluate_app_reapply() method so that if the helm
    command fails, the runtime manifest apply is not interrupted.
    Also added a log for the helm failure so the user has more
    information about why the command failed.

    Executed the code path change with these steps
    - create user overrides for cert-manager (can be any value)
    system helm-override-update cert-manager cert-manager cert-manager
      --values <override file>

    - cause a runtime manifest apply with
    system modify --https_enabled <True/False>

    Closes-Bug: 1891933
    Change-Id: I376df922b893c84fd1a06a80084d2780796811f0
    Signed-off-by: Kristine Bujold <email address hidden>
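
The approach described in the commit message (log the failure and carry on instead of letting the exception propagate) can be sketched roughly as follows. This is a minimal illustration and not the merged change: generate_overrides() and evaluate_apps() are hypothetical names, and the Python interpreter stands in for the real helm command so the sketch runs anywhere.

```python
import logging
import subprocess
import sys

LOG = logging.getLogger(__name__)


def generate_overrides(cmd):
    """Run an overrides-generation command; return its output, or None on failure."""
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError) as exc:
        # Log the command output so the reason for the failure is visible.
        output = getattr(exc, "output", None)
        LOG.error("Overrides generation failed: %s; output: %s", exc, output)
        return None


def evaluate_apps(app_cmds):
    """Evaluate each app; a failed command skips that app only."""
    evaluated = []
    for app_name, cmd in app_cmds.items():
        if generate_overrides(cmd) is None:
            LOG.warning("Skipping re-apply evaluation of %s", app_name)
            continue
        evaluated.append(app_name)
    return evaluated


# Stand-in commands: the interpreter exiting 1 or 0 replaces the real helm call.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
passing = [sys.executable, "-c", "print('ok')"]

print(evaluate_apps({"cert-manager": failing, "platform-integ-apps": passing}))
```

With the failure contained inside the per-app evaluation, the caller that applies the runtime manifests never sees the CalledProcessError, so the apply can complete.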

Changed in starlingx:
status: In Progress → Fix Released