Worker node remains offline after host re-install - kernel error in fs/ext4

Bug #1912623 reported by Mihnea Saracin
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Mihnea Saracin
Milestone: (none listed)

Bug Description

Brief Description
-----------------
"sw-manager upgrade-strategy apply" failed while upgrading the compute nodes from 20.06 to 20.12; one or more compute nodes stayed offline.

Severity
--------
Major

Steps to Reproduce
------------------
host-upgrade and unlock controller-1
swact to make controller-1 active
host-upgrade and unlock controller-0
sw-manager upgrade-strategy create --storage-apply-type serial --alarm-restrictions relaxed --worker-apply-type parallel --max-parallel-worker-hosts 10
sw-manager upgrade-strategy apply

Expected Behavior
------------------
upgrade-strategy applies successfully. compute-0 and compute-1 are upgraded and are in available status.

Actual Behavior
----------------
upgrade-strategy failed and reported that the apply timed out. One or both of the compute nodes stayed offline.

Reproducibility
---------------
Happened 3/3 times
1st: compute-0 offline, compute-1 upgraded successfully
2nd: both compute-0 and compute-1 offline
3rd: both compute-0 and compute-1 offline.

System Configuration
--------------------
Standard lab with 2 workers

Branch/Pull Time/Commit
-----------------------
stx master build on "2020-11-27"

Last Pass
---------
Tried upgrades from 20.04 to 20.06 many times before on the same lab, so the lab boot configuration should be OK.

Timestamp/Logs
--------------
$ sw-manager upgrade-strategy show

Strategy Upgrade Strategy:
  strategy-uuid: f9f64730-743c-4f43-baf3-2e06528e0b65
  controller-apply-type: serial
  storage-apply-type: serial
  worker-apply-type: parallel
  max-parallel-worker-hosts: 10
  default-instance-action: migrate
  alarm-restrictions: relaxed
  current-phase: abort
  current-phase-completion: 100%
  state: abort-failed
  apply-result: timed-out
  apply-reason:
  abort-result: failed
  abort-reason: host unlock failed
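
A minimal sketch of how the strategy output above could be checked from a script. It assumes the output keeps the `key: value` layout shown here; the function names `parse_strategy` and `strategy_failed` are hypothetical helpers, not part of sw-manager, and the set of "failed" values is limited to the ones that appear in this report.

```python
def parse_strategy(text):
    """Parse 'key: value' lines from `sw-manager upgrade-strategy show` output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key:
            fields[key.strip()] = value.strip()
    return fields


def strategy_failed(fields):
    """True if the strategy ended in one of the failure values seen in this report."""
    # Assumption: only the values observed in this bug are treated as failures.
    return (
        fields.get("state", "").endswith("failed")
        or fields.get("apply-result") in ("failed", "timed-out")
    )


SAMPLE = """\
Strategy Upgrade Strategy:
  strategy-uuid: f9f64730-743c-4f43-baf3-2e06528e0b65
  current-phase: abort
  state: abort-failed
  apply-result: timed-out
  apply-reason:
  abort-result: failed
  abort-reason: host unlock failed
"""

fields = parse_strategy(SAMPLE)
print(fields["state"], "/", fields["apply-result"], "/", fields["abort-reason"])
print("failed:", strategy_failed(fields))
```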

[sysadmin@controller-1 ~(keystone_admin)]$ system host-list

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | locked | disabled | offline |
| 3 | compute-1 | worker | locked | disabled | offline |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[sysadmin@controller-1 ~(keystone_admin)]$ system host-upgrade-list

+----+--------------+-------------+-----------------+----------------+
| id | hostname | personality | running_release | target_release |
+----+--------------+-------------+-----------------+----------------+
| 1 | controller-0 | controller | 20.12 | 20.12 |
| 2 | compute-0 | worker | 20.06 | 20.12 |
| 3 | compute-1 | worker | 20.06 | 20.12 |
| 4 | controller-1 | controller | 20.12 | 20.12 |
+----+--------------+-------------+-----------------+----------------+
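
The two tables above can be cross-checked mechanically. The sketch below assumes the pipe-delimited layout printed by `system host-list` and `system host-upgrade-list` in this report; `parse_table`, `offline_hosts` and `pending_upgrade` are hypothetical helper names introduced for illustration.

```python
def parse_table(text):
    """Turn a '| a | b |' style table into a list of dicts keyed by the header row."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders, prompts and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def offline_hosts(host_rows):
    """Hostnames whose availability column reads 'offline'."""
    return [r["hostname"] for r in host_rows if r["availability"] == "offline"]


def pending_upgrade(upgrade_rows):
    """Hostnames still running a release other than their target release."""
    return [r["hostname"] for r in upgrade_rows
            if r["running_release"] != r["target_release"]]


HOST_LIST = """\
| id | hostname | personality | administrative | operational | availability |
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | locked | disabled | offline |
| 3 | compute-1 | worker | locked | disabled | offline |
| 4 | controller-1 | controller | unlocked | enabled | available |
"""

UPGRADE_LIST = """\
| id | hostname | personality | running_release | target_release |
| 1 | controller-0 | controller | 20.12 | 20.12 |
| 2 | compute-0 | worker | 20.06 | 20.12 |
| 3 | compute-1 | worker | 20.06 | 20.12 |
| 4 | controller-1 | controller | 20.12 | 20.12 |
"""

print("offline:", offline_hosts(parse_table(HOST_LIST)))
print("not yet upgraded:", pending_upgrade(parse_table(UPGRADE_LIST)))
```

In this report both checks point at the same two workers, which matches the "host unlock failed" abort reason: the hosts never came back after re-install.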

## compute-0 console log with the error

acpid: exiting
collectd[35230]: alarm notifier reading: 100.00 % usage - /boot
collectd[35230]: alarm notifier File System /boot debounce 'okay -> failure' (100.00) (1:1) False
collectd[35230]: Exiting normally.
collectd[35230]: collectd: Stopping 5 read threads.
collectd[35230]: platform cpu usage plugin Usage: 4.9% (avg per cpu); cpus: 2, Platform: 6.8% (Base: 4.4, k8s-system: 2.4), k8s-addon: 0.0
collectd[35230]: platform memory usage: Usage: 1.0%; Reserved: 8000.0 MiB, Platform: 81.7 MiB (Base: 81.7, k8s-system: 0.0), k8s-addon: 0.0
collectd[35230]: interface plugin http request exception ; [Errno 111] Connection refused
collectd[35230]: 4K memory usage: Anon: 0.2%, Anon: 92.2 MiB, cgroup-rss: 84.3 MiB, Avail: 60769.0 MiB, Total: 60861.3 MiB
collectd[35230]: 4K numa memory usage: node0, Anon: 0.31%, Anon: 92.2 MiB, Avail: 29523.6 MiB, Total: 29615.9 MiB
collectd[35230]: 4K numa memory usage: node1, Anon: 0.00%, Anon: 0.0 MiB, Avail: 31825.1 MiB, Total: 31825.1 MiB
collectd[35230]: fmSocket.cpp(140): Socket Error: Failed to write to fd:(5), len:(4), rc:(-1), error:(Broken pipe)
collectd[35230]: ptp plugin 'set_fault' exception ; 100.119:host=compute-0.ptp=no-lock:major ; Failed to execute set_fault.
collectd[3530]: ptp plugin compute-0 not locked to remote Grand Master ()
collectd[35230]: collectd: Stopping 5 write threads.
[30599.665976] ------------[ cut here ]------------
[30599.671120] kernel BUG at fs/ext4/super.c:4882!
[30599.676174] invalid opcode: 0000 [#1] PREEMPT SMP
...
(skipped)
...
[ 310.159201] systemd-udevd[87]: worker [89] terminated by signal 9 (Killed)
[ 310.166893] systemd-udevd[87]: worker [89] failed while handling '/devices/pci0000:00/0000:00:03.0/0000:05:00.0'
[ 310.178553] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 310.188398] systemd-shutdown[1]: Hardware watchdog 'Software Watchdog', version 0
[ 310.196872] systemd-shutdown[1]: Unmounting file systems.
[ 310.203126] systemd-shutdown[225]: Remounting '/' read-only in with options 'size=56508k,nr_inodes=14127'.
[ 310.214005] systemd-shutdown[1]: All filesystems unmounted.
[ 310.220235] systemd-shutdown[1]: Deactivating swaps.
[ 310.225810] systemd-shutdown[1]: All swaps deactivated.
[ 310.231657] systemd-shutdown[1]: Detaching loop devices.
[ 310.237671] systemd-shutdown[1]: All loop devices detached.
[ 310.243895] systemd-shutdown[1]: Detaching DM devices.
[ 310.249682] systemd-shutdown[1]: All DM devices detached.
[ 310.256116] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 310.263586] systemd-shutdown[1]: Rebooting.
[ 361.802388] random: crng init done

## compute-1 console log with the error

acpid: exiting
collectd[34347]: interface plugin http request exception ; [Errno 111] Connection refused
[30404.228183] ------------[ cut here ]------------
[30404.233336] kernel BUG at fs/ext4/super.c:4882!
[30404.238389] invalid opcode: 0000 [#1] PREEMPT SMP
[30404.243767] Modules linked in: ip6t_REJECT nf_reject_ipv6 ipt_rpfilter ip6t_rpfilter xt_multiport xt_set iptable_raw iptable_mangle ip6table_raw ip_set_hash_ip ip_set_hash_net ip_set xt_nat xt_statistic ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nbd ip6table_mangle ip6t_MASQUERADE nf_nat_masquerade_ipv6 xt_comment xt_mark ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack cls_u32 sch_sfq sch_htb xfs libcrc32c binfmt_misc br_netfilter bridge overlay(T) nfsv3 nfs fscache virtio_net net_failover failover nfsd auth_rpcgss nfs_acl lockd grace 8021q garp mrp stp llc ip6table_filter ip6_tables iptable_filter sunrpc iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp kvm_intel kvm dm_modirqbypass crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd joydev lpc_ich mei_me mei i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter tpm_crb(O) ip_tables ext4 mbcache jbd2 vfio xprtrdma(O) svcrdma(O) rpcrdma(O) nvmet_rdma(O) nvme_rdma(O) ib_srp(O) ib_isert(O) ib_iser(O) rdma_rxe(O) mlx5_ib(O) sd_mod crc_t10dif crct10dif_generic mlx4_ib(O) mlx4_en(O) mlx4_core(O) rdma_ucm(O) rdma_cm(O) iw_cm(O) ib_ucm(O) ib_uverbs(O) ib_cm(O) ib_core(O) ixgbevf(O) crct10dif_pclmul crct10dif_common crc32c_intel mlx5_core(O) igb ixgbe(O) ahci libahci i2c_algo_bit mlxfw(O) devlink mlx_compat(O) dca tpm_tis(O) tpm_tis_core(O) tpm(O) iavf(O) i40e(O) e1000e(O)
[30404.397744] CPU: 0 PID: 424497 Comm: systemd-shutdow Kdump: loaded Tainted: G O ------------ T 3.10.0-1127.el7.2.tis.x86_64 #1
[30404.411619] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0028.121720182203 12/17/2018
[30404.423459] task: ffff984fb4e517a0 ti: ffff985d0758000 task.ti: ffff9850d0758000
[30404.431809] RIP: 0010:[<ffffffffc108b2ed>] [<ffffffffc108b2ed>] ext4_mark_recovery_complete.isra.189+0x8d/0x90 [ext4]
[30404.443760] RSP: 0018:ffff9850d075bd78 EFLAGS: 00010286
[30404.449686] RAX: ffff9850b0385400 RBX: ffff9850d0607000 RCX: 0000000000000000
[30404.457645] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff9850d00e4000
[30404.465607] RBP: ffff9850d075bd88 R08: 0000000000000000 R09: ffff9850d075bdd4
[30404.473568] R10: ffff984fe51d8019 R11: 0000000000000000 R12: ffff9850d00e4000
[30404.481530] R13: ffff9850d00e4800 R14: ffff9850b0385400 R15: 0000000000000000
[30404.489490] FS: 00007fd96ec78840(0000) GS:ffff9850df400000(0000) knlGS:0000000000000000
[30404.498516] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30404.504926] CR2: 00007fd96e13e600 CR3: 000000084f3c6000 CR4: 00000000001607f0
[30404.512886] Call Trace:
[30404.515620] [<ffffffffc108e2e3>] ext4_remount+0x3f3/0x730 [ext4]
[30404.522424] [<ffffffff8418a40f>] ? filemap_fdatawat_range+0x1f/0x30
...
(skipped)
...
[ 491.159683] INFO: task kworker/u2:6:173 blocked for more than 120 seconds.
[ 491.212906] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 491.273587] kworker/u2:6 D ffff88002fd6af40 0 173 2 0x00000000
[ 491.328444] Workqueue: events_unbound async_run_entry_fn
[ 491.368344] Call Trace:
[ 491.388483] [<ffffffff817f7d99>] schedule+0x29/0x70
[ 491.426986] [<ffffffff810b16d5>] async_synchronize_cookie_domain+0x85/0x150
[ 491.481542] [<ffffffff810aad20>] ? wake_up_atomic_t+0x30/0x30
[ 491.526768] [<ffffffff810b17f5>] async_synchronize_cookie+0x15/0x20
[ 491.576017] [<ffffffff8155cd16>] async_port_probe+0x36/0x60
[ 491.619859] [<ffffffff810b14bf>] async_run_entry_fn+0x3f/0x130
[ 491.665785] [<ffffffff810a23a6>] process_one_work+0x176/0x4a0
[ 491.710971] [<ffffffff810a30f6>] worker_thread+0x126/0x3b0
[ 491.754095] [<ffffffff817f8053>] ? preempt_schedule+0x43/0x60
[ 491.799203] [<ffffffff810a2fd0>] ? manage_workers.isra.28+0x2a0/0x2a0
[ 491.849723] [<ffffffff810a9c41>] kthread+0xd1/0xe0
[ 491.887542] [<ffffffff810a9b70>] ? kthread_create_on_node+0x140/0x140
[ 491.938099] [<ffffffff817fbe5d>] ret_from_fork_nospec_begin+0x7/0x21
[ 491.987986] [<ffffffff810a9b70>] ? kthread_create_on_node+0x140/0x140
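
Both console logs show the same `kernel BUG at fs/ext4/super.c:4882!` record during shutdown. A small sketch for pulling that signature out of captured console logs and grouping hosts by it follows; it assumes the `[timestamp] kernel BUG at <file>:<line>!` format seen above, and `find_kernel_bugs` and `shared_signatures` are hypothetical helper names. Matching runs over the whole text rather than line by line, so records that got fused onto one line during capture are still found.

```python
import re

# Matches e.g. "[30404.233336] kernel BUG at fs/ext4/super.c:4882!"
BUG_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+kernel BUG at (\S+?):(\d+)!")


def find_kernel_bugs(console_log):
    """Return (timestamp, source_file, line) for every 'kernel BUG at' record."""
    return [(float(ts), path, int(line))
            for ts, path, line in BUG_RE.findall(console_log)]


def shared_signatures(logs_by_host):
    """Map each (source_file, line) signature to the sorted hosts that hit it."""
    hits = {}
    for host, log in logs_by_host.items():
        for _, path, line in find_kernel_bugs(log):
            hits.setdefault((path, line), set()).add(host)
    return {sig: sorted(hosts) for sig, hosts in hits.items()}


LOGS = {
    "compute-0": """\
[30599.665976] ------------[ cut here ]------------
[30599.671120] kernel BUG at fs/ext4/super.c:4882!
[30599.676174] invalid opcode: 0000 [#1] PREEMPT SMP
""",
    "compute-1": """\
[30404.228183] ------------[ cut here ]------------
[30404.233336] kernel BUG at fs/ext4/super.c:4882!
[30404.238389] invalid opcode: 0000 [#1] PREEMPT SMP
""",
}

for (path, line), hosts in shared_signatures(LOGS).items():
    print(f"{path}:{line} hit on {', '.join(hosts)}")
```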

Test Activity
-------------
Normal use

Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
Ghada Khalil (gkhalil)
tags: added: stx.5.0 stx.storage
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Ghada Khalil (gkhalil)
summary: - Upgrades orchestration failed due to compute offline after host re-
- install - kernel error in fs/ext4
+ Worker node offline after host re-install - kernel error in fs/ext4
summary: - Worker node offline after host re-install - kernel error in fs/ext4
+ Worker node remains offline after host re-install - kernel error in
+ fs/ext4
Mihnea Saracin (msaracin) wrote :
Changed in starlingx:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/metal/+/792250

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/792250
Committed: https://opendev.org/starlingx/metal/commit/6c2905e665ceeebfa7717c9cbccc1c277d10966b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <email address hidden>
Date: Thu May 13 15:57:43 2021 +0000

    Revert "Align partitions created by kickstarters"

    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.

    Reason for revert: Review should have been abandoned rather than merged.

    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

commit c7c341b198e79bb98f443c7c07f671c6387075af
Author: Don Penney <email address hidden>
Date: Fri May 7 08:56:06 2021 -0400

    Add /pxeboot/grubx64.efi symlink for UEFI pxeboot

    UEFI pxeboot with shim.efi looks for the grubx64.efi in the tftpboot
    root directory. This update creates a symlink to the
    /pxeboot/EFI/grubx64.efi file in /pxeboot.

    Change-Id: Iabf8ec89d0af6e6b1a62e20159ecdfa16729444e
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <email address hidden>

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <email address hidden>
Date: Wed May 5 19:05:55 2021 -0400

    Fix enabling heartbeat of self from the peer controller

    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.

    This update reverts a small code change that was
    introduced by the following update.

    https://review.opendev.org/c/starlingx/metal/+/788495

    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.

    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 28 09:39:19 2021 -0400

    Improved maintenance handling of spontaneous active controller reboot

    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The cause of the
    second reboot was due to its reported uptime in the first mtcAlive
    message, following the reboot, as greater than 10 minutes.

    Maintenance has a long standing graceful recovery threshold of
    10 minutes. Meaning that if a host loses heartbeat and enters
    Graceful Recovery, if the uptime value extracted from the first
    mtcAlive message following the recovery of that host exceeds 10
    minutes, then maintenance interprets that the host did not reboot.
    If a host goes absent for longer than this threshold then for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.

    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold an...

tags: added: in-f-centos8