host-swact failed by target Config not yet applied

Bug #1826418 reported by Peng Peng
50
This bug affects 4 people
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
High
Bin Qian

Bug Description

Brief Description
-----------------
After lock/unlock controller host and host-swact, the second host-swact failed by target Config not yet applied

Severity
--------
Major

Steps to Reproduce
------------------
1. Lock controller host
2. Unlock controller host
3. Swact active controller
4. Swact active controller

....
TC-name:
test_lock_unlock_host.py::test_lock_unlock_host[controller]
mtc/test_swact.py::test_swact_controllers
mtc/test_swact.py::test_swact_controller_platform

Expected Behavior
------------------
swact success

Actual Behavior
----------------
swact failed

Reproducibility
---------------
Reproducible
4/4 in the row

System Configuration
--------------------
Multi-node system

Lab-name:
WCP_99-103

Branch/Pull Time/Commit
-----------------------
stx master as of 20190425T013000Z

Last Pass
---------
20190410T013000Z

Timestamp/Logs
--------------
[2019-04-25 08:40:58,332] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-1'

[2019-04-25 08:41:44,640] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-1'

[2019-04-25 08:54:51,262] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-04-25 09:14:29,763] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'
[2019-04-25 09:14:31,229] 387 DEBUG MainThread ssh.expect :: Output:
controller-0 target Config 26662607-e867-4be1-8df3-e1f34efca0b7 not yet applied. Apply target Config via Lock/Unlock prior to Swact
controller-1:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking as release gating as swacts need to work.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Bin Qian (bqian20)
tags: added: stx.2.0 stx.retestneeded
Revision history for this message
Bin Qian (bqian20) wrote :

From a similar instance, the information below was found.
5af3bc55-6caf-4b1b-bced-4c33a5f87b7c was the config_uuid that system complain not being applied on controller-0

[wrsroot@controller-1 ~(keystone_admin)]$ system host-swact controller-1
controller-0 target Config 5af3bc55-6caf-4b1b-bced-4c33a5f87b7c not yet applied. Apply target Config via Lock/Unlock prior to Swact

prior to the instance, active controller swact from controller-0 to controller-1, command issued at 2019-04-29T10:20:50
2019-04-29T10:20:50.000 controller-0 -sh: info HISTORY: PID=258916 UID=1875 system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0

sysinv conductor detected the mismatch of target config_uuid and applied config_uuid, it tried to apply the config:
2019-04-29 10:22:52.316 110549 WARNING sysinv.conductor.manager [-] controller-0: iconfig out of date: target 5af3bc55-6caf-4b1b-bced-4c33a5f87b7c, applied 6f0146c3-3e60-40be-bcf9-d72e9e9751ca
2019-04-29 10:22:52.316 110549 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-0 config applied: 6f0146c3-3e60-40be-bcf9-d72e9e9751ca vs. target: 5af3bc55-6caf-4b1b-bced-4c33a5f87b7c.
2019-04-29 10:22:52.340 110549 WARNING sysinv.conductor.manager [-] controller-1: iconfig out of date: target 5af3bc55-6caf-4b1b-bced-4c33a5f87b7c, applied 6f0146c3-3e60-40be-bcf9-d72e9e9751ca
2019-04-29 10:22:52.340 110549 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-1 config applied: 6f0146c3-3e60-40be-bcf9-d72e9e9751ca vs. target: 5af3bc55-6caf-4b1b-bced-4c33a5f87b7c.
2019-04-29 10:22:52.435 110549 INFO sysinv.conductor.manager [-] _config_update_hosts config_uuid=5af3bc55-6caf-4b1b-bced-4c33a5f87b7c

The target config (uuid 5af3bc55-6caf-4b1b-bced-4c33a5f87b7c) was applied to controller-1 successfully a few seconds after the config mismatched was detected.
2019-04-29 10:24:19.929 19382 INFO sysinv.agent.manager [req-fb0f7abe-917e-4285-a061-365669a34b89 admin None] Agent config applied 5af3bc55-6caf-4b1b-bced-4c33a5f87b7c
2019-04-29 10:24:19.964 110549 INFO sysinv.conductor.manager [req-fb0f7abe-917e-4285-a061-365669a34b89 admin None] SYS_I Clear system config alarm: controller-1

Not log was found to indicate attempt had been made to apply the same config to controller-0

An error msg was found but may not be related with the instance.
2019-04-29 10:22:54.370 110549 ERROR stevedore.extension [-] Could not load 'stx-openstack': No module named systemconfig.helm_plugins.stx_openstack

Revision history for this message
Frank Miller (sensfan22) wrote :

Bin can you check the sysinv-api.log to see what config was being changed that triggered the config out of date alarm. If the type of change is a runtime config change (John Kung says it looks like this should be the case) then the next step is to determine why the runtime change is not being done on controller-0.

Revision history for this message
Bin Qian (bqian20) wrote :

It is a runtime config change.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

sysinv.log conroller-1
WARNING wsme.api [-] Client-side error: controller-1 target Config 6b1586ea-a940-4af6-a069-15471814b735 not yet applied. Apply target Config via Lock/Unlock prior to Swact

2019-04-28 19:48:20.500 130213 WARNING wsme.api [-] Client-side error: controller-1 target Config 6b1586ea-a940-4af6-a069-15471814b735 not yet applied. Apply target Config via Lock/Unlock prior to Swact

[2019-04-28 19:48:19,308] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'
[2019-04-28 19:48:20,558] 387 DEBUG MainThread ssh.expect :: Output:
controller-1 target Config 6b1586ea-a940-4af6-a069-15471814b735 not yet applied. Apply target Config via Lock/Unlock prior to Swact

Lab: WP_3_7
Load: 20190427T013000Z
FAIL 20190428 19:48:46 test_migrate_vm_various_guest[ubuntu_16-3-4096-dedicated-volume]
FAIL 20190428 19:54:24 test_migrate_vm_various_guest[centos_6-3-4096-dedicated-volume]
FAIL 20190428 20:00:49 test_migrate_vm_various_guest[centos_7-1-1024-dedicated-volume]
FAIL 20190428 20:06:48 test_migrate_vm_various_guest[centos_7-5-4096-None-image]
FAIL 20190428 20:11:48 test_migrate_vm_various_guest[opensuse_11-3-1024-dedicated-volume]
FAIL 20190428 20:18:31 test_migrate_vm_various_guest[rhel_6-3-1024-dedicated-image]
FAIL 20190428 20:24:27 test_migrate_vm_various_guest[rhel_6-4-4096-None-volume]
FAIL 20190428 20:31:26 test_migrate_vm_various_guest[rhel_7-1-1024-dedicated-volume]

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/656847

Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.ha
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/656847
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=3189e4209401dfb5d62e9bc321253a8e33a30273
Submitter: Zuul
Branch: master

commit 3189e4209401dfb5d62e9bc321253a8e33a30273
Author: Bin Qian <email address hidden>
Date: Thu May 2 08:16:43 2019 -0400

    Do not change config_target if not reconfiguring

    During config active controller, the config_target on the
    standby controller is mistakenly changed as well.
    Should only reset config target for active controller.
    Also enabled more logging to help future troubleshooting.

    Closes-Bug: 1826418

    Change-Id: I1db29ff0c07098844039b1c83cb437cbcae30aee
    Signed-off-by: Bin Qian <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue was reproduced in build 20190503T013000Z

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The commit above was merged on May 6. Reproduction with an older build is expected. Retest needs to use a May 7 or later build.

Ghada Khalil (gkhalil)
tags: added: stx.config
removed: stx.ha
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Load: 20190508T013000Z

[2019-05-11 20:41:33,086] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'
[2019-05-11 20:41:34,653] 387 DEBUG MainThread ssh.expect :: Output:
controller-0 target Config bcbc38e8-f9a0-4958-bdfc-3b42cc5bfa41 not yet applied. Apply target Config via Lock/Unlock prior to Swact

[2019-05-11 21:47:55,449] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'
[2019-05-11 21:47:56,917] 387 DEBUG MainThread ssh.expect :: Output:
controller-0 target Config 9bfc1d03-59bb-4ea4-9769-ff2737bb20ef not yet applied. Apply target Config via Lock/Unlock prior to Swact
controller-1:~$

[2019-05-11 21:48:12,735] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'
[2019-05-11 21:48:14,135] 387 DEBUG MainThread ssh.expect :: Output:
controller-0 target Config 9bfc1d03-59bb-4ea4-9769-ff2737bb20ef not yet applied. Apply target Config via Lock/Unlock prior to Swact
controller-1:~$

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Bin Qian (bqian20) wrote :

There are more to fix, same style of bug in different areas.

Revision history for this message
mhg (marvinhg) wrote :

The issue was reproduced in test changing the password of wrsroot on a 2+2 system with load 20190508T013000Z.
The scenario might be a bit different:
 (note: there's an existing out-of-date alarm for controller-0)
- after changed of wrsroot on controller-0 (the active controller),
  - timestamp of the existing alarm on controller-0 was updated with clock time on the lab at that time
  - new "Configuration is out-of-date" alarms popped up for all other nodes
  - all other alarms were cleared automatically as expected (within about 5 minutes), except the one for controller-0

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Re-opening this bug as more code changes are required as per comments from Bin and Marvin.

Changed in starlingx:
status: Fix Released → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/660424

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/660424
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=c1301724aa5160b81c50c3ef5b78d70863ba24f9
Submitter: Zuul
Branch: master

commit c1301724aa5160b81c50c3ef5b78d70863ba24f9
Author: Bin Qian <email address hidden>
Date: Tue May 21 11:16:42 2019 -0400

    Do not change config_target if not reconfiguring (II)

        More changes to correct updating config_target not aligned with
    the hosts to be reconfigured.

    Change-Id: I636fba82bc76eb9f7cb30b24b39487d391f3d738
    Close-Bug: 1826418
    Signed-off-by: Bin Qian <email address hidden>

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Confirmed → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Not saw this issue in recent loads.

tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.