StarlingX

AIO-DX: Controller-1 reboot twice after initial host-unlock

Bug #1841670 reported by Angie Wang on 2019-08-27

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Medium	Tao Liu

Bug Description

Brief Description
-----------------
After initial unlocking controller-1, controller-1 reboot and manifests were applied successfully. However, a few seconds after worker manifest applied, another reboot occurred unexpectedly. After the second reboot completed, controller-1 was enabled and unlocked. The mtcClient log on controller-1 showed that the sysinv goenabled check didn't pass after the first reboot.

Severity
--------
Major

Steps to Reproduce
------------------
1. bootstrap and unlock controller-0
2. install controller-1
3. configure controller-1
4. host-unlock controller-1

Expected Behavior
------------------
controller-1 reboots once after unlock

Actual Behavior
----------------
controller-1 reboots twice after unlock

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-DX
Lab-name: IP_5_6, vbox

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2019-08-26 21:00:20 -0400"

Last Pass
---------
Don't know

Timestamp/Logs
--------------
2019-08-27T07:28:21.642 [112079.00194] controller-0 mtcAgent --- nodeBase.cpp ( 675) log_adminAction : Info : controller-1 Unlock Action

controller-1:
daemon.log
2019-08-27T07:30:48.293 controller-1 controller_config[12976]: info Applying puppet controller manifest...
...
2019-08-27T07:36:19.828 controller-1 worker_config[106371]: info Applying puppet worker manifest...
2019-08-27T07:36:19.829 controller-1 systemd[1]: info Started Service Management API Unit.
2019-08-27T07:36:19.859 controller-1 systemd[1]: info Starting Service Management Event Recorder Unit...
...
2019-08-27T07:36:35.388 controller-1 systemd[1]: info Stopped target RPC Port Mapper.
2019-08-27T07:36:35.409 controller-1 systemd[1]: info Stopping Name Service Cache Daemon...
2019-08-27T07:36:35.416 controller-1 systemd[1]: info Stopping Authorization Manager...
2019-08-27T07:36:35.421 controller-1 systemd[1]: info Stopped target rpc_pipefs.target.
...
Rebooting ...
...
2019-08-27T07:41:17.015 controller-1 systemd[1]: info Started Service Management Unit.
2019-08-27T07:41:17.023 controller-1 systemd[1]: info Started Service Management Shutdown Unit.
2019-08-27T07:41:17.038 controller-1 systemd[1]: info Starting Service Management API Unit...
2019-08-27T07:41:17.077 controller-1 sm-api[98463]: info Starting sm-api: OK
2019-08-27T07:41:17.078 controller-1 systemd[1]: info Started Service Management API Unit.
2019-08-27T07:41:17.084 controller-1 worker_config[98137]: info /etc/init.d/worker_config: Running puppet manifest apply
2019-08-27T07:41:17.098 controller-1 systemd[1]: info Starting Service Management Event Recorder Unit...
2019-08-27T07:41:17.103 controller-1 worker_config[98137]: info Applying puppet worker manifest...
...

mtcClient.log
2019-08-27T07:36:19.532 [106386.00028] controller-1 mtcClient --- mtcNodeComp.cpp (1690) run_goenabled_scripts : Info : GoEnabled Scripts : Main-Function Context
2019-08-27T07:36:19.532 [106386.00029] controller-1 mtcClient --- mtcNodeComp.cpp (1408) _launch_all_scripts : Info : Sorted Test File List: 7
2019-08-27T07:36:19.532 [106386.00030] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/config_goenabled_check.sh start
2019-08-27T07:36:19.532 [106386.00031] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/inventory_goenabled_check.sh start
2019-08-27T07:36:19.532 [106386.00032] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/mlx4_core_goenabled.sh start
2019-08-27T07:36:19.532 [106386.00033] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/patch_check_goenabled.sh start
2019-08-27T07:36:19.532 [106386.00034] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/sysinv_goenabled_check.sh start
2019-08-27T07:36:19.532 [106386.00035] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/virt-support-goenabled.sh start
2019-08-27T07:36:19.532 [106386.00036] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/worker-goenabled.sh start
2019-08-27T07:36:19.534 [106386.00037] controller-1 mtcClient --- mtcNodeComp.cpp (1179) daemon_service_run : Info : main goenable tests started
2019-08-27T07:36:19.536 [106386.00038] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/config_goenabled_check.sh (0.002 secs)
2019-08-27T07:36:19.536 [106386.00039] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/mlx4_core_goenabled.sh (0.002 secs)
2019-08-27T07:36:19.536 [106386.00040] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/patch_check_goenabled.sh (0.002 secs)
2019-08-27T07:36:19.537 [106386.00041] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/worker-goenabled.sh (0.002 secs)
2019-08-27T07:36:20.014 [106386.00042] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/virt-support-goenabled.sh (0.480 secs)
2019-08-27T07:36:20.483 [106386.00043] controller-1 mtcClient msg mtcCompMsg.cpp ( 196) mtc_service_command : Info : GoEnabled In-Progress
2019-08-27T07:36:20.483 [106386.00044] controller-1 mtcClient msg mtcCompMsg.cpp ( 410) mtc_service_command : Info : goEnabled main req reply send (Mgmnt)
2019-08-27T07:36:36.976 [106386.00045] controller-1 mtcClient sig daemon_signal.cpp ( 172) daemon_signal_hdlr : Info : Received SIGTERM
...

2019-08-27T07:41:16.799 [98218.00028] controller-1 mtcClient --- mtcNodeComp.cpp (1690) run_goenabled_scripts : Info : GoEnabled Scripts : Main-Function Context
2019-08-27T07:41:16.799 [98218.00029] controller-1 mtcClient --- mtcNodeComp.cpp (1408) _launch_all_scripts : Info : Sorted Test File List: 7
2019-08-27T07:41:16.799 [98218.00030] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/config_goenabled_check.sh start
2019-08-27T07:41:16.799 [98218.00031] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/inventory_goenabled_check.sh start
2019-08-27T07:41:16.799 [98218.00032] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/mlx4_core_goenabled.sh start
2019-08-27T07:41:16.799 [98218.00033] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/patch_check_goenabled.sh start
2019-08-27T07:41:16.799 [98218.00034] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/sysinv_goenabled_check.sh start
2019-08-27T07:41:16.799 [98218.00035] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/virt-support-goenabled.sh start
2019-08-27T07:41:16.799 [98218.00036] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/goenabled.d/worker-goenabled.sh start
2019-08-27T07:41:16.801 [98218.00037] controller-1 mtcClient --- mtcNodeComp.cpp (1179) daemon_service_run : Info : main goenable tests started
2019-08-27T07:41:16.802 [98218.00038] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/config_goenabled_check.sh (0.002 secs)
2019-08-27T07:41:16.803 [98218.00039] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/mlx4_core_goenabled.sh (0.002 secs)
2019-08-27T07:41:16.803 [98218.00040] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/patch_check_goenabled.sh (0.002 secs)
2019-08-27T07:41:16.803 [98218.00041] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/worker-goenabled.sh (0.002 secs)
2019-08-27T07:41:17.264 [98218.00042] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/virt-support-goenabled.sh (0.463 secs)
2019-08-27T07:41:47.750 [98218.00043] controller-1 mtcClient com nodeUtil.cpp (1202) get_node_health : Info : controller-1 is Healthy (0)
2019-08-27T07:42:02.846 [98218.00044] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/inventory_goenabled_check.sh (46.045 secs)
2019-08-27T07:42:02.846 [98218.00045] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/goenabled.d/sysinv_goenabled_check.sh (46.046 secs)
2019-08-27T07:42:02.897 [98218.00046] controller-1 mtcClient --- mtcNodeComp.cpp ( 888) _manage_goenabled_tests : Info : GoEnabled Testing Complete ; all tests passed

Test Activity
-------------
Other

Tags:

Revision history for this message

Angie Wang (angiewang) wrote on 2019-08-27:

IP_5_6 Edit (42.2 MiB, application/x-tar)

Revision history for this message

Angie Wang (angiewang) wrote on 2019-08-27:

IP_1_4 Edit (69.6 MiB, application/x-tar)

Also see this happened on compute nodes on a low-latency standard lab IP_1_4

Dariush Eslimi (deslimi) on 2019-08-28

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)
importance:	Undecided → Medium
status:	New → Triaged
tags:	added: stx.3.0 stx.metal

Revision history for this message

Eric MacDonald (rocksolidmtce) wrote on 2019-08-28:

I looked at this with Angie and concluded the following at that time.

System inventory agent is not ready when it should be.
Seems like a system inventory prime should investigate this first.

I can prime this system inventory investigation if needed but does not seem to be a maintenance issue.

Bin Qian (bqian20) on 2019-09-04

Changed in starlingx:
assignee:	Eric MacDonald (rocksolidmtce) → Bin Qian (bqian20)

Revision history for this message

Bin Qian (bqian20) wrote on 2019-09-04:

The reboot was caused by a mismatch of kernel arguments configured in the initial boot was detected during the worker manifest.
2019-08-27T07:36:25.556 ^[[mNotice: 2019-08-27 07:36:24 +0000 Scope(Class[Platform::Compute::Grub::Audit]): Kernel Boot Argument Mismatch^[[0m
2019-08-27T07:36:35.389 ^[[0;36mDebug: 2019-08-27 07:36:35 +0000 Class[Platform::Compute::Grub::Update]: The container Stage[main] will propagate my refresh event^[[0m
2019-08-27T07:36:35.390 ^[[0;36mDebug: 2019-08-27 07:36:35 +0000 Exec[reboot-recovery](provider=posix): Executing 'reboot'^[[0m
2019-08-27T07:36:35.392 ^[[0;36mDebug: 2019-08-27 07:36:35 +0000 Executing: 'reboot'^[[0m

The initial boot kernal args:
BOOT_IMAGE=/vmlinuz-3.10.0-957.12.2.el7.3.tis.x86_64 root=UUID=920d8268-2ea6-4f60-acb7-627eba080fb6 ro security_profile=standard module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto biosdevname=0 console=ttyS0,115200n8 iommu=pt usbcore.autosuspend=-1 hugepagesz=2M hugepages=0 default_hugepagesz=2M isolcpus=2,22,3,23 rcu_nocbs=2-19,22-39 kthread_cpus=0,20,1,21 irqaffinity=0,20,1,21 selinux=0 enforcing=0 nmi_watchdog=panic,1 softlockup_panic=1 intel_iommu=on user_namespace.enable=1 nopti nospectre_v2

kernel args for final boot
BOOT_IMAGE=/vmlinuz-3.10.0-957.12.2.el7.3.tis.x86_64 root=UUID=920d8268-2ea6-4f60-acb7-627eba080fb6 ro security_profile=standard module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto biosdevname=0 console=ttyS0,115200n8 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=panic,1 softlockup_panic=1 intel_iommu=on user_namespace.enable=1 hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=0 default_hugepagesz=2M irqaffinity=0-1,20-21 rcu_nocbs=2-19,22-39 isolcpus=2-3,22-23 kthread_cpus=0-1,20-21 nopti nospectre_v2

Apparently during the worker manifest, hugepagesz=1G hugepages=2 was configured. The changes for the hugepage required a reboot to apply.

As the cpus on the lab do support 1G huge pages the final config is correct.
The 1G hugepage arg updated only when vswitch type is configured, which happened in the controller manifest after initial unlock reboot. Therefor, a 2nd reboot is required to update kernel command line arg after initial unlock reboot.

This is not a bug, the behavior is by designed.

To avoid the 2nd reboot, the vswitch type needs to be configured after initial unlock before reboot starts.

The reboot was caused by a mismatch of kernel arguments configured in the initial boot was detected during the worker manifest. 
2019-08-27T07:36:25.556 ^[[mNotice: 2019-08-27 07:36:24 +0000 Scope(Class[Platform::Compute::Grub::Audit]): Kernel Boot Argument Mismatch^[[0m
2019-08-27T07:36:35.389 ^[[0;36mDebug: 2019-08-27 07:36:35 +0000 Class[Platform::Compute::Grub::Update]: The container Stage[main] will propagate my refresh event^[[0m
2019-08-27T07:36:35.390 ^[[0;36mDebug: 2019-08-27 07:36:35 +0000 Exec[reboot-recovery](provider=posix): Executing 'reboot'^[[0m
2019-08-27T07:36:35.392 ^[[0;36mDebug: 2019-08-27 07:36:35 +0000 Executing: 'reboot'^[[0m

kernel args for final boot 
BOOT_IMAGE=/vmlinuz-3.10.0-957.12.2.el7.3.tis.x86_64 root=UUID=920d8268-2ea6-4f60-acb7-627eba080fb6 ro security_profile=standard module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto biosdevname=0 console=ttyS0,115200n8 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=panic,1 softlockup_panic=1 intel_iommu=on user_namespace.enable=1 hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=0 default_hugepagesz=2M irqaffinity=0-1,20-21 rcu_nocbs=2-19,22-39 isolcpus=2-3,22-23 kthread_cpus=0-1,20-21 nopti nospectre_v2

Apparently during the worker manifest, hugepagesz=1G hugepages=2 was configured. The changes for the hugepage required a reboot to apply.

As the cpus on the lab do support 1G huge pages the final config is correct. 
The 1G hugepage arg updated only when vswitch type is configured, which happened in the controller manifest after initial unlock reboot. Therefor, a 2nd reboot is required to update kernel command line arg after initial unlock reboot.

This is not a bug, the behavior is by designed.

To avoid the 2nd reboot, the vswitch type needs to be configured after initial unlock before reboot starts.

Revision history for this message

Bin Qian (bqian20) wrote on 2019-09-04:

For low-latency compute nodes, similar reason caused the 2nd reboot, the isolcpus kernal arg changed during manifest being applied for the initial unlock.

Initial unlock reboot kernel args:
BOOT_IMAGE=/vmlinuz-3.10.0-957.12.2.rt56.929.el7.tis.4.x86_64 root=UUID=e066f9de-e7d5-4691-b340-859ae13a056f ro security_profile=standard module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto biosdevname=0 console=ttyS0,115200n8 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 softlockup_panic=0 intel_iommu=on user_namespace.enable=1 skew_tick=1 nopti nospectre_v2 hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=0 default_hugepagesz=2M irqaffinity=0,20 rcu_nocbs=1-19,21-39 nohz_full=1-19,21-39 isolcpus=1-2,21-22 kthread_cpus=0,20

final kernel args:
BOOT_IMAGE=/vmlinuz-3.10.0-957.12.2.rt56.929.el7.tis.4.x86_64 root=UUID=e066f9de-e7d5-4691-b340-859ae13a056f ro security_profile=standard module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto biosdevname=0 console=ttyS0,115200n8 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 softlockup_panic=0 intel_iommu=on user_namespace.enable=1 skew_tick=1 hugepagesz=1G hugepages=2 hugepagesz=2M hugepages=0 default_hugepagesz=2M irqaffinity=0,20 rcu_nocbs=1-19,21-39 nohz_full=1-19,21-39 isolcpus=1-19,21-39 kthread_cpus=0,20 nopti nospectre_v2

Revision history for this message

Bin Qian (bqian20) wrote on 2019-09-06:

Low-latency compute nodes double reboot is caused by a different reason (but same theory) than the controller-1 of aio-dx. This LP needs to split to 2 to capture each of the issues.

Revision history for this message

Angie Wang (angiewang) wrote on 2019-09-06:

According to Bin, the low-latency worker nodes reboot is tracked under https://bugs.launchpad.net/starlingx/+bug/1843064

Bin Qian (bqian20) on 2019-09-06

Changed in starlingx:
assignee:	Bin Qian (bqian20) → nobody

Revision history for this message

Tao Liu (tliu88) wrote on 2019-09-06:

The grub update will be re-organized for the single hugepages size support and I can address this issue at the same time.

Changed in starlingx:
assignee:	nobody → Tao Liu (tliu88)

Revision history for this message

Peng Peng (ppeng) wrote on 2019-09-16:

ALL_NODES_20190916.145515.tar Edit (56.9 MiB, application/x-tar)

Issue reproduced on WCP_90-91:
"2019-09-12_20-00-00"

Tao Liu (tliu88) on 2019-09-24

Changed in starlingx:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-24: Fix proposed to config (master)

#10

Fix proposed to branch: master
Review: https://review.opendev.org/684427

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-24: Related fix proposed to stx-puppet (master)

#11

Related fix proposed to branch: master
Review: https://review.opendev.org/684429

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-25: Fix merged to config (master)

#12

Reviewed: https://review.opendev.org/684427
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=3672ca9608b2467c1e0cbc990aea16725658cff5
Submitter: Zuul
Branch: master

commit 3672ca9608b2467c1e0cbc990aea16725658cff5
Author: Tao Liu <email address hidden>
Date: Tue Sep 24 16:44:49 2019 -0400

Fix AIO-DX Controller-1 double reboot after initial unlock

    After initial unlocking controller-1, controller-1 rebooted twice.
    The second reboot was triggered by the grub audit during the manifest
    apply, based on mismatch detection. The grub mismatch was due to the
    vswitch 1G memory has not been reflected in the kernel boot arguments
    after the initial host-unlock.

    This update renames the update_cpu_config method to update_grub_config
    and triggers the grub runtime update when the host memory is modified.
    This change is also required to ensure the kernel default huge page
    size is updated when the host memory is changed.

Closes-Bug: 1841670

Change-Id: I9deb0a443ed86d35b20d5f6779e44204a921dd0d
Signed-off-by: Tao Liu <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-25: Related fix merged to stx-puppet (master)

#13

Reviewed: https://review.opendev.org/684429
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=77e22266208dce8bc005f3e5858d742b6d3377e7
Submitter: Zuul
Branch: master

commit 77e22266208dce8bc005f3e5858d742b6d3377e7
Author: Tao Liu <email address hidden>
Date: Tue Sep 24 16:42:07 2019 -0400

Move the grub audit manifest to the post stage

    This update moves the grub audit to the worker manifest post stage.
    This would allow the worker manifest apply to complete prior to
    reboot the host, should there be a mismatch.

Related-Bug: 1841670

Change-Id: I7690769a38e088e91a009ed7aa59a937bab42987
Signed-off-by: Tao Liu <email address hidden>

Revision history for this message

Angie Wang (angiewang) wrote on 2019-10-02:

#14

Still see this issue on vbox AIO-DX.
According to Tao, this is due to the mismatch of irqaffinity setup in kickstart and sysinv puppet.
Kickstart sets it to irqaffinity=0,1 while sysinv puppet operator set it to irqaffinity=0-1 in the hiera data.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.