Worker node remains offline after host re-install - kernel error in fs/ext4
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Mihnea Saracin |
Bug Description
Brief Description
-----------------
"sw-manager upgrade-strategy apply" failed while upgrading compute nodes from 20.06 to 20.12; one or more compute nodes stayed offline.
Severity
--------
Major
Steps to Reproduce
------------------
host-upgrade and unlock controller-1
swact to make controller-1 active
host-upgrade and unlock controller-0
sw-manager upgrade-strategy create --storage-
sw-manager upgrade-strategy apply
Expected Behavior
------------------
The upgrade strategy applies successfully. compute-0 and compute-1 are upgraded and reach the available state.
Actual Behavior
----------------
The upgrade strategy failed and reported that the apply timed out. One or both compute nodes stayed offline.
Reproducibility
---------------
Happened 3/3 times
1st: compute-0 offline, compute-1 upgraded successfully
2nd: both compute-0 and compute-1 offline
3rd: both compute-0 and compute-1 offline.
System Configuration
-------
Standard lab with 2 workers
Branch/Pull Time/Commit
-------
stx master build on "2020-11-27"
Last Pass
---------
Tried upgrades from 20.04 to 20.06 many times before on the same lab, so the lab boot configuration should be OK.
Timestamp/Logs
--------------
$ sw-manager upgrade-strategy show
Strategy Upgrade Strategy:
strategy-uuid: f9f64730-
controller-
storage-
worker-
max-parallel-
default-
alarm-
current-phase: abort
current-
state: abort-failed
apply-result: timed-out
apply-reason:
abort-result: failed
abort-reason: host unlock failed
[sysadmin@
+----+-
| id | hostname | personality | administrative | operational | availability |
+----+-
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | locked | disabled | offline |
| 3 | compute-1 | worker | locked | disabled | offline |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+-
[sysadmin@
+----+-
| id | hostname | personality | running_release | target_release |
+----+-
| 1 | controller-0 | controller | 20.12 | 20.12 |
| 2 | compute-0 | worker | 20.06 | 20.12 |
| 3 | compute-1 | worker | 20.06 | 20.12 |
| 4 | controller-1 | controller | 20.12 | 20.12 |
+----+-
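The host-upgrade table above can be checked mechanically for hosts that never reached the target release. The following is a minimal sketch, assuming the table text has already been captured as a string; `parse_table` and `stuck_hosts` are hypothetical helper names for illustration and are not part of the StarlingX CLI.

```python
def parse_table(text):
    """Parse a pipe-delimited CLI table into a list of dicts keyed by header."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines such as "+----+-" and shell prompts
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first pipe-delimited line is the header row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stuck_hosts(rows):
    """Return hostnames whose running release differs from the target release."""
    return [row["hostname"] for row in rows
            if row["running_release"] != row["target_release"]]


# Rows copied from the table in this report.
SAMPLE = """\
| id | hostname | personality | running_release | target_release |
| 1 | controller-0 | controller | 20.12 | 20.12 |
| 2 | compute-0 | worker | 20.06 | 20.12 |
| 3 | compute-1 | worker | 20.06 | 20.12 |
| 4 | controller-1 | controller | 20.12 | 20.12 |
"""

if __name__ == "__main__":
    print(stuck_hosts(parse_table(SAMPLE)))
```

For the sample rows this lists compute-0 and compute-1, the two workers still running 20.06.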
## compute-0 console log with the error
acpid: exiting
collectd[35230]: alarm notifier reading: 100.00 % usage - /boot
collectd[35230]: alarm notifier File System /boot debounce 'okay -> failure' (100.00) (1:1) False
collectd[35230]: Exiting normally.
collectd[35230]: collectd: Stopping 5 read threads.
collectd[35230]: platform cpu usage plugin Usage: 4.9% (avg per cpu); cpus: 2, Platform: 6.8% (Base: 4.4, k8s-system: 2.4), k8s-addon: 0.0
collectd[35230]: platform memory usage: Usage: 1.0%; Reserved: 8000.0 MiB, Platform: 81.7 MiB (Base: 81.7, k8s-system: 0.0), k8s-addon: 0.0
collectd[35230]: interface plugin http request exception ; [Errno 111] Connection refused
collectd[35230]: 4K memory usage: Anon: 0.2%, Anon: 92.2 MiB, cgroup-rss: 84.3 MiB, Avail: 60769.0 MiB, Total: 60861.3 MiB
collectd[35230]: 4K numa memory usage: node0, Anon: 0.31%, Anon: 92.2 MiB, Avail: 29523.6 MiB, Total: 29615.9 MiB
collectd[35230]: 4K numa memory usage: node1, Anon: 0.00%, Anon: 0.0 MiB, Avail: 31825.1 MiB, Total: 31825.1 MiB
collectd[35230]: fmSocket.cpp(140): Socket Error: Failed to write to fd:(5), len:(4), rc:(-1), error:(Broken pipe)
collectd[35230]: ptp plugin 'set_fault' exception ; 100.119:
collectd[3530]: ptp plugin compute-0 not locked to remote Grand Master ()
collectd[35230]: collectd: Stopping 5 write threads.
[30599.665976] ------------[ cut here ]------------
[30599.671120] kernel BUG at fs/ext4/
[30599.676174] invalid opcode: 0000 [#1] PREEMPT SMP
...
(skipped)
...
[ 310.159201] systemd-udevd[87]: worker [89] terminated by signal 9 (Killed)
[ 310.166893] systemd-udevd[87]: worker [89] failed while handling '/devices/
[ 310.178553] systemd-
[ 310.196872] systemd-
[ 310.203126] systemd-
[ 310.214005] systemd-
[ 310.220235] systemd-
[ 310.225810] systemd-
[ 310.231657] systemd-
[ 310.237671] systemd-
[ 310.243895] systemd-
[ 310.249682] systemd-
[ 310.256116] systemd-
[ 310.263586] systemd-
## compute-1 console log with the error
acpid: exiting
collectd[34347]: interface plugin http request exception ; [Errno 111] Connection refused
[30404.228183] ------------[ cut here ]------------
[30404.233336] kernel BUG at fs/ext4/
[30404.238389] invalid opcode: 0000 [#1] PREEMPT SMP
[30404.243767] Modules linked in: ip6t_REJECT nf_reject_ipv6 ipt_rpfilter ip6t_rpfilter xt_multiport xt_set iptable_raw iptable_mangle ip6table_raw ip_set_hash_ip ip_set_hash_net ip_set xt_nat xt_statistic ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nbd ip6table_mangle ip6t_MASQUERADE nf_nat_
[30404.397744] CPU: 0 PID: 424497 Comm: systemd-shutdow Kdump: loaded Tainted: G O ------------ T 3.10.0-
[30404.411619] Hardware name: Intel Corporation S2600WT2R/
[30404.423459] task: ffff984fb4e517a0 ti: ffff985d0758000 task.ti: ffff9850d0758000
[30404.431809] RIP: 0010:[<
[30404.443760] RSP: 0018:ffff9850d0
[30404.449686] RAX: ffff9850b0385400 RBX: ffff9850d0607000 RCX: 0000000000000000
[30404.457645] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff9850d00e4000
[30404.465607] RBP: ffff9850d075bd88 R08: 0000000000000000 R09: ffff9850d075bdd4
[30404.473568] R10: ffff984fe51d8019 R11: 0000000000000000 R12: ffff9850d00e4000
[30404.481530] R13: ffff9850d00e4800 R14: ffff9850b0385400 R15: 0000000000000000
[30404.489490] FS: 00007fd96ec7884
[30404.498516] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30404.504926] CR2: 00007fd96e13e600 CR3: 000000084f3c6000 CR4: 00000000001607f0
[30404.512886] Call Trace:
[30404.515620] [<ffffffffc108e
[30404.522424] [<ffffffff8418a
...
(skipped)
...
[ 491.159683] INFO: task kworker/u2:6:173 blocked for more than 120 seconds.
[ 491.212906] "echo 0 > /proc/sys/
[ 491.273587] kworker/u2:6 D ffff88002fd6af40 0 173 2 0x00000000
[ 491.328444] Workqueue: events_unbound async_run_entry_fn
[ 491.368344] Call Trace:
[ 491.388483] [<ffffffff817f7
[ 491.426986] [<ffffffff810b1
[ 491.481542] [<ffffffff810aa
[ 491.526768] [<ffffffff810b1
[ 491.576017] [<ffffffff8155c
[ 491.619859] [<ffffffff810b1
[ 491.710971] [<ffffffff810a3
[ 491.754095] [<ffffffff817f8
[ 491.799203] [<ffffffff810a2
[ 491.849723] [<ffffffff810a9
[ 491.887542] [<ffffffff810a9
[ 491.938099] [<ffffffff817fb
[ 491.987986] [<ffffffff810a9
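When triaging console logs like the two above, the kernel BUG line and the hung-task warning are the markers to look for. The following is a rough sketch, assuming the console output is available as a list of lines; `scan_console_log` is a hypothetical helper, and the two patterns are taken only from the messages quoted in this report.

```python
import re

# Patterns based on the console messages quoted in this report (assumption:
# other kernel versions may word these messages differently).
KERNEL_BUG = re.compile(r"kernel BUG at (\S+)")
HUNG_TASK = re.compile(r"INFO: task (\S+) blocked for more than (\d+) seconds")


def scan_console_log(lines):
    """Return (line_number, kind, detail) tuples for each marker found."""
    findings = []
    for number, line in enumerate(lines, start=1):
        match = KERNEL_BUG.search(line)
        if match:
            findings.append((number, "kernel-bug", match.group(1)))
            continue
        match = HUNG_TASK.search(line)
        if match:
            findings.append((number, "hung-task", match.group(1)))
    return findings


# Lines copied from the compute-1 console log above.
SAMPLE_LOG = [
    "[30404.228183] ------------[ cut here ]------------",
    "[30404.233336] kernel BUG at fs/ext4/",
    "[30404.238389] invalid opcode: 0000 [#1] PREEMPT SMP",
    "[ 491.159683] INFO: task kworker/u2:6:173 blocked for more than 120 seconds.",
]

if __name__ == "__main__":
    for finding in scan_console_log(SAMPLE_LOG):
        print(finding)
```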
Test Activity
-------------
Normal use
Changed in starlingx: | |
assignee: | nobody → Mihnea Saracin (msaracin) |
tags: | added: stx.5.0 stx.storage |
Changed in starlingx: | |
importance: | Undecided → Medium |
status: | New → Triaged |
summary: |
- Upgrades orchestration failed due to compute offline after host re-install - kernel error in fs/ext4
+ Worker node offline after host re-install - kernel error in fs/ext4
summary: |
- Worker node offline after host re-install - kernel error in fs/ext4
+ Worker node remains offline after host re-install - kernel error in fs/ext4
Fixed by: https://review.opendev.org/c/starlingx/metal/+/777751