system applications started applying after locking and modifying host

Bug #1879018 reported by Yang Liu
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Mihnea Saracin

Bug Description

Brief Description
-----------------
platform-integ-apps and stx-openstack are reapplied after locking and modifying the standby controller on a DX system

Severity
--------
Major

Steps to Reproduce
------------------
1. Install and configure AIO-DX system
2. Apply and configure stx-openstack
3. system host-lock controller-1 (Checked system applications after this, they are in applied states)
4. system host-cpu-modify -f platform -p0 4 -p1 2 controller-1, and check system applications

TC-name: N/A

Expected Behavior
------------------
4. system applications are either applied or uploaded (should not be applying when controller is locked)

Actual Behavior
----------------
4.a platform-integ-apps started applying and succeeded
4.b stx-openstack app started applying and failed after 30 minutes - it was waiting for the openstack ingress pod to be scheduled on the locked controller and become ready.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Two node system
Lab-name: R430-1-2

Branch/Pull Time/Commit
-----------------------
stx master as of "2020-05-14_20-00-00"

Last Pass
---------
Not sure.

Timestamp/Logs
--------------

# 2020-05-16T00:54:38.000 controller-0 -sh: info HISTORY: PID=1271634 UID=42425 system host-lock controller-1

# 2020-05-16T00:56:53.000 controller-0 -sh: info HISTORY: PID=1271634 UID=42425 system host-cpu-modify -f platform -p0 4 -p1 2 controller-1

# armada log for stx-openstack app apply failure when host is locked
2020-05-16 01:32:22.217 10311 INFO armada.handlers.lock [-] Releasing lock
2020-05-16 01:32:22.226 10311 ERROR armada.cli [-] Caught internal exception: armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['openstack-ingress']
2020-05-16 01:32:22.226 10311 ERROR armada.cli Traceback (most recent call last):
2020-05-16 01:32:22.226 10311 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2020-05-16 01:32:22.226 10311 ERROR armada.cli self.invoke()
2020-05-16 01:32:22.226 10311 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2020-05-16 01:32:22.226 10311 ERROR armada.cli resp = self.handle(documents, tiller)
2020-05-16 01:32:22.226 10311 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2020-05-16 01:32:22.226 10311 ERROR armada.cli return future.result()
2020-05-16 01:32:22.226 10311 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2020-05-16 01:32:22.226 10311 ERROR armada.cli return self.__get_result()
2020-05-16 01:32:22.226 10311 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2020-05-16 01:32:22.226 10311 ERROR armada.cli raise self._exception
2020-05-16 01:32:22.226 10311 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2020-05-16 01:32:22.226 10311 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2020-05-16 01:32:22.226 10311 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2020-05-16 01:32:22.226 10311 ERROR armada.cli return armada.sync()
2020-05-16 01:32:22.226 10311 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 252, in sync
2020-05-16 01:32:22.226 10311 ERROR armada.cli raise armada_exceptions.ChartDeployException(failures)
2020-05-16 01:32:22.226 10311 ERROR armada.cli armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['openstack-ingress']

Test Activity
-------------
Normal use

Revision history for this message
Yang Liu (yliu12) wrote :
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

The key log is in armada and indicates two of the ingress pods are not ready:

2020-05-16 01:32:21.357 10311 ERROR armada.handlers.wait [-] [chart=openstack-ingress]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-ingress)). These pods were not ready=['ingress-558499cd86-crbnt', 'ingress-error-pages-65d785749c-l67q2']
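
When triaging similar failures, the chart, namespace, and not-ready pod names can be pulled out of this armada line mechanically. The following is a minimal sketch, assuming the log line keeps the wording shown above; the function name parse_not_ready and the regular expression are illustrative and not part of armada.

```python
import ast
import re

# Pattern written against the armada.handlers.wait line quoted in this report.
# It is an assumption that other armada versions use the same wording.
NOT_READY_RE = re.compile(
    r"\[chart=(?P<chart>[^\]]+)\]: Timed out waiting for pods "
    r"\(namespace=(?P<namespace>[^,]+), labels=\((?P<labels>[^)]*)\)\)\. "
    r"These pods were not ready=(?P<pods>\[[^\]]*\])"
)


def parse_not_ready(line):
    """Return (chart, namespace, pods), or None if the line does not match."""
    match = NOT_READY_RE.search(line)
    if match is None:
        return None
    # The pod list is printed as a Python list literal, so literal_eval can read it.
    pods = ast.literal_eval(match.group("pods"))
    return match.group("chart"), match.group("namespace"), pods


SAMPLE = (
    "2020-05-16 01:32:21.357 10311 ERROR armada.handlers.wait [-] "
    "[chart=openstack-ingress]: Timed out waiting for pods "
    "(namespace=openstack, labels=(release_group=osh-openstack-ingress)). "
    "These pods were not ready=['ingress-558499cd86-crbnt', "
    "'ingress-error-pages-65d785749c-l67q2']"
)

if __name__ == "__main__":
    print(parse_not_ready(SAMPLE))
```

The pod names returned this way can then be checked against the locked controller to confirm where the scheduler was trying to place them.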

The sequence of events is:
1. controller-1 is locked
2. controller-1 is modified via system host-cpu-modify
3. The helm-managed apps are audited every 60 seconds to see whether something has changed that requires a re-apply. Both platform-integ-apps and stx-openstack need to be re-applied.
4. platform-integ-apps is re-applied successfully
5. stx-openstack re-apply is triggered but after 30 minutes fails with a timeout due to 2 pods not ready.
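
The audit step in the sequence above can be pictured as a comparison between the overrides an app was last applied with and the overrides generated now. This report does not show how the sysinv audit actually detects a change, so the sketch below is only one plausible model: the digest approach, the function names, and the sample override keys are all assumptions made for illustration.

```python
import hashlib
import json


def overrides_digest(overrides):
    """Stable digest of a nested overrides dict (key order does not matter)."""
    payload = json.dumps(overrides, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def apps_needing_reapply(applied_digests, current_overrides):
    """Return the names of apps whose current overrides differ from the applied ones."""
    return sorted(
        name
        for name, overrides in current_overrides.items()
        if applied_digests.get(name) != overrides_digest(overrides)
    )


if __name__ == "__main__":
    # Hypothetical override content; the real generated overrides are not in this report.
    before = {
        "platform-integ-apps": {"example_setting": 1},
        "stx-openstack": {"example_setting": 1},
    }
    applied = {name: overrides_digest(value) for name, value in before.items()}

    # Simulate a host change that alters the generated overrides of both apps.
    after = {
        "platform-integ-apps": {"example_setting": 2},
        "stx-openstack": {"example_setting": 2},
    }
    print(apps_needing_reapply(applied, after))
```

Under this model, any host modification that changes the generated overrides triggers a re-apply on the next audit cycle, whether or not the host is locked, which matches the behavior reported here.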

This is most likely due to the replica settings being incorrect for the 2 pods. The replica setting needs to take into account how many controllers are in unlocked state.

Revision history for this message
Frank Miller (sensfan22) wrote :

The helm chart plugin in sysinv/sysinv/helm/ingress.py currently is:
    common.HELM_NS_OPENSTACK: {
        'pod': {
            'replicas': {
                'ingress': self._num_controllers(),
                'error_page': self._num_controllers()
            },

But this should likely instead be:
    common.HELM_NS_OPENSTACK: {
        'pod': {
            'replicas': {
                'ingress': self._num_provisioned_controllers(),
                'error_page': self._num_provisioned_controllers()
            },
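
To illustrate the idea of deriving the replica count from the controllers that are actually unlocked, here is a minimal standalone sketch. It does not reproduce _num_provisioned_controllers(), whose implementation is not shown in this report; the Host tuple, the field names, and the floor of one replica are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical host record; real sysinv host objects carry many more fields.
Host = namedtuple("Host", ["hostname", "personality", "administrative"])


def num_unlocked_controllers(hosts):
    """Count controllers whose administrative state is 'unlocked'.

    Keeping at least one replica is an assumption of this sketch.
    """
    count = sum(
        1
        for host in hosts
        if host.personality == "controller" and host.administrative == "unlocked"
    )
    return max(count, 1)


def ingress_replica_overrides(hosts):
    """Build the 'pod.replicas' fragment with the same keys as the plugin snippet."""
    replicas = num_unlocked_controllers(hosts)
    return {"pod": {"replicas": {"ingress": replicas, "error_page": replicas}}}


if __name__ == "__main__":
    hosts = [
        Host("controller-0", "controller", "unlocked"),
        Host("controller-1", "controller", "locked"),
    ]
    print(ingress_replica_overrides(hosts))
```

With controller-1 locked, the sketch asks for a single ingress replica, so the apply would no longer wait for a pod that cannot be scheduled on the locked host.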

Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
tags: added: stx.apps stx.distro.openstack
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0
Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Mihnea,

Could you help to provide the fix according to the solution provided by Frank?

Thanks!
Zhipeng

Changed in starlingx:
status: Triaged → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)

Fix proposed to branch: master
Review: https://review.opendev.org/742679

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per 2020-07-30 release meeting, this will be included in a 4.0 mtce release.

Ghada Khalil (gkhalil)
tags: added: not-yet-in-r-stx40
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/742679
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=754a1d33de7e16b454052190a2496f1a1d59c707
Submitter: Zuul
Branch: master

commit 754a1d33de7e16b454052190a2496f1a1d59c707
Author: Mihnea Saracin <email address hidden>
Date: Thu Jul 23 16:39:05 2020 +0300

    Fix apply of stx-openstack when host is locked

    Currently, all of the stx-openstack services have the
    replica count set to the number of the controllers.
    If one of the controllers is locked their replicas
    number will still be 2 which is incorrect.
    We solve this by changing the number of replicas
    to be equal to the number of the active controllers.
    The rabbitmq service cannot use this approach because
    it is unable to work properly if its replicas number
    is decreased from 2 to 1. So a kubernetes toleration
    is used here to allow the second rabbitmq pod to be
    deployed on the locked controller.

    Change-Id: Ie979c7b5f2755ad673bd180e38b68e0d53c5f9b2
    Closes-Bug: 1879018
    Signed-off-by: Mihnea Saracin <email address hidden>
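
The commit message relies on a Kubernetes toleration to let the second rabbitmq pod land on the locked controller. The sketch below models the standard Kubernetes rule for matching a toleration against a taint, to show why a pod without the toleration stays unscheduled. The taint key, value, and effect used in the example are placeholders; this report does not state which taint StarlingX places on a locked host.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint.

    Follows the Kubernetes matching rule: effects must agree (an empty
    toleration effect matches any effect), keys must agree (an empty key
    with operator Exists matches any key), and with operator Equal the
    values must agree as well.
    """
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    if not key:
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations, node_taints):
    """A pod can be scheduled if every blocking taint is tolerated."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(
        any(tolerates(toleration, taint) for toleration in pod_tolerations)
        for taint in blocking
    )


if __name__ == "__main__":
    # Placeholder taint standing in for whatever a locked controller carries.
    locked_taint = {"key": "example.com/locked", "value": "true", "effect": "NoSchedule"}
    rabbitmq_toleration = {
        "key": "example.com/locked",
        "operator": "Equal",
        "value": "true",
        "effect": "NoSchedule",
    }
    print(schedulable([], [locked_taint]))
    print(schedulable([rabbitmq_toleration], [locked_taint]))
```

This is only a model of the scheduler's taint check; it says nothing about whether the pod becomes ready once placed on a locked host.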

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Frank Miller (sensfan22) wrote :

Re-opening this bug, as the proposed solution led to a mariadb recovery issue on an apply of stx-openstack on an AIO-DX system.

Changed in starlingx:
status: Fix Released → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)

Fix proposed to branch: master
Review: https://review.opendev.org/751358

Revision history for this message
Peng Peng (ppeng) wrote :

The issue seems to be reproduced with load 2020-10-26_20-00-07 on R430_3-4.

[2020-10-27 05:38:11,839] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-10-27 05:38:13,107] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------+----------+-----------------------------------+----------------------------------------+----------+-----------+
| application              | version  | manifest name                     | manifest file                          | status   | progress  |
+--------------------------+----------+-----------------------------------+----------------------------------------+----------+-----------+
| cert-manager             | 20.06-5  | cert-manager-manifest             | certmanager-manifest.yaml              | applied  | completed |
| nginx-ingress-controller | 20.06-0  | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied  | completed |
| oidc-auth-apps           | 20.06-28 | oidc-auth-manifest                | manifest.yaml                          | uploaded | completed |
| platform-integ-apps      | 20.06-11 | platform-integration-manifest     | manifest.yaml                          | applied  | completed |
+--------------------------+----------+-----------------------------------+----------------------------------------+----------+-----------+

[2020-10-27 05:38:40,859] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'

[2020-10-27 05:41:00,413] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2020-10-27 05:41:01,642] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------+----------+-----------------------------------+----------------------------------------+----------+-------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+----------+-----------------------------------+----------------------------------------+----------+-------------------------------+
| cert-manager | 20.06-5 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 20.06-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 20.06-28 | oidc-auth-manifest | manifest.yaml ...


Revision history for this message
Peng Peng (ppeng) wrote :

The issue was reproduced on
2020-11-18_20-00-07
R430_3-4

[2020-11-19 06:47:15,822] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'
[2020-11-19 06:47:17,626] 436 DEBUG MainThread ssh.expect :: Output:
Rejected: Can not unlock host controller-0 while an application is being applied, updated or recovered. Please try again later.

[2020-11-19 07:02:17,745] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'
[2020-11-19 07:02:19,193] 436 DEBUG MainThread ssh.expect :: Output:
Rejected: Can not unlock host controller-0 while an application is being applied, updated or recovered. Please try again later.

log:
https://files.starlingx.kube.cengn.ca/launchpad/1879018
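
Since the unlock is rejected while an application is still being applied, a test script may want to inspect the application-list output before retrying. The following is a minimal sketch that parses the ASCII table format shown earlier in this thread and reports apps that are not in a settled state. Treating only "applied" and "uploaded" as settled follows the Expected Behavior section of this report; the sample table is made up for illustration and reduced to two columns.

```python
# States this report's Expected Behavior section treats as acceptable.
SETTLED_STATES = {"applied", "uploaded"}


def parse_table(text):
    """Parse a '|'-delimited ASCII table into a list of dicts keyed by header."""
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip '+----+' borders and unrelated log lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def apps_not_settled(text):
    """Return the names of apps whose status is outside SETTLED_STATES."""
    return [
        row["application"]
        for row in parse_table(text)
        if row.get("status") not in SETTLED_STATES
    ]


# Illustrative sample, not captured output.
SAMPLE_TABLE = """
+---------------------+----------+
| application         | status   |
+---------------------+----------+
| cert-manager        | applied  |
| oidc-auth-apps      | uploaded |
| platform-integ-apps | applying |
+---------------------+----------+
"""

if __name__ == "__main__":
    print(apps_not_settled(SAMPLE_TABLE))
```

A caller could poll until this list is empty before issuing host-unlock, instead of retrying the unlock at fixed intervals as in the log above.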

Revision history for this message
Peng Peng (ppeng) wrote :

The issue was reproduced on
2020-11-23_20-00-10
R430_3-4

[2020-11-24 06:46:38,372] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'
[2020-11-24 06:46:39,827] 436 DEBUG MainThread ssh.expect :: Output:
Rejected: Can not unlock host controller-0 while an application is being applied, updated or recovered. Please try again later.
controller-1:~$

log uploaded:
https://files.starlingx.kube.cengn.ca/launchpad/1879018

Revision history for this message
Austin Sun (sunausti) wrote :

Do we have any update on this one?

Revision history for this message
Mihnea Saracin (msaracin) wrote :
Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (f/centos8)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/792235
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/1bf694661282a019bf79f253fc148baede65db64
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 963e63cd55d5be4f5ddfc148ae00b6a46e071295
Author: Thiago Brito <email address hidden>
Date: Fri May 14 15:36:07 2021 -0300

    Fix cpu_shared/dedicated_set config location

    Change I61514389b616db754b0d2f35deb0101f90dbdd02 removed the deprecated
    property vcpu_pin_set in favor of the newer cpu_shared_set and
    cpu_dedicated_set, but those new configs are placed under the [compute]
    section of nova.conf instead of [DEFAULT]. This is causing VMs to be
    scheduled on platform reserved cores. This commit will fix it.

    Closes-Bug: #1928683

    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: I541760619f4c79c66a2bf22715afdc873b8343ce

commit 58f4d9ffcaf47fe969267149135201aec01624a8
Author: Gustavo Santos <email address hidden>
Date: Mon Mar 8 14:56:55 2021 -0300

    Add k8s proxy-body-size to horizon overrides

    The current network.dashboard.ingress.annotations in horizon's
    values.yaml helm charts do not include the kubernetes property
    'proxy-body-size'. This makes the resulting nginx.conf file in ingress
    add the default rule 'max_body_size 1m' to the horizon servers,
    which limits all http requests' size inside horizon to 1MiB, making it
    impossible to upload images larger than that to glance using the
    horizon GUI, for example.

    This change adds said property to the horizon overrides, making
    horizon's servers in nginx.conf include a 'max_body_size' of 2500MiB,
    which makes uploading images up to that size possible again.

    Story: 2008692
    Task: 41996
    Change-Id: I91888ce238d5304c08eb1e97918989b8f93ee34f

commit b5c1f62088778287e4b50aeac1f17d166a7a177a
Author: Dan Voiculeasa <email address hidden>
Date: Wed Feb 3 16:00:47 2021 +0200

    Introduce metadata for app behavior control

    Keep existing behavior when evaluating app reapplies.

    Story: 2007960
    Task: 41755
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: Ie02743cdf056dda3feb66911c74f9dabe69d98dd

commit eab750b7ff03808002acf35deebdf762e687b332
Author: Martin, Chen <email address hidden>
Date: Sat May 30 09:11:05 2020 +0800

    Add override setting in openstack helm plugin for rook-ceph

    Deploy with rook-ceph, without "system storage-backend-add ceph"
    there is no object storage-ceph in database. As current openstack
    helm plugin fixed on object storage-ceph, in rook-ceph case
    use a fixed override setting

    Story: 2005527
    Task: 39914

    Depends-On: https://review.opendev.org/#/c/713084/

    Change-Id: Ied852d60e8b15d55865747e0b6f4b54f2392d6df
    Signed-off-by: Martin, Chen <email address hidden>

commit 852d8d61dbfc4f9f29afe8da10924731a58028ea
Author: Dan Voiculeasa <email address hidden>
Date: Mon Nov 16 12:41:55 2020 +0200

    Introduce lifecycle operator to openstack app

    A big chunk...


tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictionary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded