swtpm is missing from nova-compute and killing host

Bug #1961531 reported by Boris Lukashev
This bug affects 2 people
Affects: kolla-ansible
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We just ran an in-release upgrade (xena -> newer xena) and started to see terrible things after the nova-compute containers were replaced. The new containers lack swtpm and the tss user, and when nova has swtpm enabled this causes the container to crash. Installing swtpm from a PPA doesn't help, as OpenStack looks for a specific UID/GID which the packages do not create. This kills nova-compute, which is bad enough by itself.
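
For reference, the relevant knobs live in nova.conf's [libvirt] section; the values below are my reading of the OpenStack docs (the user/group defaults are what make the containers need a tss account), so treat this as a sketch rather than gospel:
```
[libvirt]
# the option that triggers the crash when swtpm is absent from the image
swtpm_enabled = true
# user/group nova and libvirt expect to exist wherever swtpm runs
swtpm_user = tss
swtpm_group = tss
```
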
Something in the teardown process of the current nova-compute image from the quay.io repos is very badly broken. We are seeing this stack trace on every node that got the new container, on both 5.10 and 5.15 kernels:
```
Feb 20 06:48:45 redacted-hostname kernel: watchdog: BUG: soft lockup - CPU#12 stuck for 26s! [runc:[2:INIT]:52543]
...
Feb 20 06:48:45 redacted-hostname kernel: CPU: 12 PID: 52543 Comm: runc:[2:INIT] Tainted: G T 5.15.24-svn #1
Feb 20 06:48:45 redacted-hostname kernel: Hardware name: REDACTED HW
Feb 20 06:48:45 redacted-hostname kernel: RIP: 0010:[<ffffffff8115c071>] queued_spin_lock_slowpath+0x1a1/0x200
Feb 20 06:48:45 redacted-hostname kernel: Code: c1 e8 12 83 e0 03 41 ff c8 48 c1 e0 04 4d 63 c0 48 05 c0 e2 01 00 4a 03 04 c5 40 f2 10 83 48 89 30 8b 46 08 85 c0 75 09 f3 90 <8b> 46 08 85 c0 74 f7 4c 8b 06 4d 85 c0 74 95 41 0f 18 08 eb 8f bf
Feb 20 06:48:45 redacted-hostname kernel: RSP: 0018:ffffc9000ed8bd30 EFLAGS: 00000246
Feb 20 06:48:45 redacted-hostname kernel: RAX: 0000000000000000 RBX: a0187854ae07311e RCX: a0187854ad07744c
Feb 20 06:48:45 redacted-hostname kernel: RDX: ffffffff83488644 RSI: ffff88903e61e2c0 RDI: 0000000000340000
Feb 20 06:48:45 redacted-hostname kernel: RBP: ffff888284a52b80 R08: 000000000000000d R09: 0000000000340000
Feb 20 06:48:45 redacted-hostname kernel: R10: 0000000000000080 R11: 0000000000000020 R12: 5fe787ab2f4f3941
Feb 20 06:48:45 redacted-hostname kernel: R13: a0187854ae072528 R14: ffffc9000ed8bda8 R15: ffff888284a52b80
Feb 20 06:48:45 redacted-hostname kernel: RBX(RAP): mntput_no_expire+0x9f/0x4c0
Feb 20 06:48:45 redacted-hostname kernel: RCX(RAP): _raw_spin_lock+0x4d/0x70
Feb 20 06:48:45 redacted-hostname kernel: RBP: mnt_cache+0x0/0x148
Feb 20 06:48:45 redacted-hostname kernel: R13(RAP): do_umount+0x549/0xaa0
Feb 20 06:48:45 redacted-hostname kernel: R14: copy_process+0x59d/0x3400
Feb 20 06:48:45 redacted-hostname kernel: R15: mnt_cache+0x0/0x148
Feb 20 06:48:45 redacted-hostname kernel: FS: 00007f32bb27e740(0000) GS:ffff88903e600000(0000) knlGS:0000000000000000
Feb 20 06:48:45 redacted-hostname kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 06:48:45 redacted-hostname kernel: CR2: 00005623a0846960 CR3: 0000000006840000 CR4: 00000000001606f0 shadow CR4: 00000000001606f0
Feb 20 06:48:45 redacted-hostname kernel: Stack:
Feb 20 06:48:45 redacted-hostname kernel: ffffffff82484d0d a0187854ae07362f ffffffff8148085f 0000000000000000
Feb 20 06:48:45 redacted-hostname kernel: ffffffff81480fb0 ffff88843bf5c740 ffff88843bf5c740 ffffc9000ed8bd68
Feb 20 06:48:45 redacted-hostname kernel: ffffc9000ed8bd68 ffff8881808adc40 ffff888284a52b80 a0187854ae072528
Feb 20 06:48:45 redacted-hostname kernel: Call Trace:
Feb 20 06:48:45 redacted-hostname kernel: <TASK>
Feb 20 06:48:45 redacted-hostname kernel: [<ffffffff82484d0d>] _raw_spin_lock+0x4d/0x70
Feb 20 06:48:45 redacted-hostname kernel: [<ffffffff8148085f>] mntput_no_expire+0x9f/0x4c0
Feb 20 06:48:45 redacted-hostname kernel: [<ffffffff81480fb0>] ? namespace_unlock+0x1a0/0x220
Feb 20 06:48:45 redacted-hostname kernel: [<ffffffff81480f6e>] namespace_unlock+0x15e/0x220
Feb 20 06:48:45 redacted-hostname kernel: [<ffffffff81481c69>] do_umount+0x549/0xaa0
Feb 20 06:48:45 redacted-hostname kernel: [<ffffffff81485e47>] path_umount+0xb7/0x1a0
Feb 20 06:48:45 redacted-hostname kernel: [<ffffffff81485fd2>] __x64_sys_umount+0xa2/0xd0
Feb 20 06:48:45 redacted-hostname kernel: [<ffffffff8245551c>] do_syscall_64+0x6c/0xf0
Feb 20 06:48:45 redacted-hostname kernel: [<ffffffff81001293>] entry_SYSCALL_64_after_hwframe+0x75/0x138
Feb 20 06:48:45 redacted-hostname kernel: RIP: 0033:[<000055ac81351ffb>] 0x55ac81351ffb
Feb 20 06:48:45 redacted-hostname kernel: Code: fa ff eb bf e8 86 ad fa ff e9 61 ff ff ff cc e8 7b 7c fa ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
Feb 20 06:48:45 redacted-hostname kernel: RSP: 002b:000000c000146c28 EFLAGS: 00000212 ORIG_RAX: 00000000000000a6
Feb 20 06:48:45 redacted-hostname kernel: RAX: ffffffffffffffda RBX: 000000c000028000 RCX: 000055ac81351ffb
Feb 20 06:48:45 redacted-hostname kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000c0000b4114
Feb 20 06:48:45 redacted-hostname kernel: RBP: 000000c000146c80 R08: 0000000000000001 R09: 0000000000000001
Feb 20 06:48:45 redacted-hostname kernel: R10: 0000000000000006 R11: 0000000000000212 R12: 0000000000000012
Feb 20 06:48:45 redacted-hostname kernel: R13: 0000000000000011 R14: 0000000000000200 R15: 0000000000000055
Feb 20 06:48:45 redacted-hostname kernel: </TASK>
Feb 20 06:48:53 redacted-hostname kernel: ------------[ cut here ]------------
Feb 20 06:48:53 redacted-hostname kernel: NETDEV WATCHDOG: ens1 (mlx4_core): transmit queue 22 timed out

```
This subsequently causes stalls on every fork or exec (not sure which), delaying SSH auth and breaking the upgrade process itself, because the nodes that got the new images are not crashed but are not running properly either. It is also delaying Ceph I/O, which is quite scary (thankfully we're at under 1/3 impact), and, as shown under that trace, it messes up network queues.

Commenting out the swtpm option in nova.conf prevents this crash from occurring; otherwise we see it within a few minutes of boot.

Requesting inclusion of swtpm in the nova-compute containers, with the user and group Nova requires, as the fix for the overarching issue, along with tests to ensure that enabling the config option does not cause failure. We also need to handle persistence of TPM data across container reloads and migrations (if not already handled by nova).

Currently running a lockdep kernel to try to hunt down the cause of that crash and figure out whether the problem starts somewhere in our kernel machinations, here, or at Linus' door.

Revision history for this message
Boris Lukashev (rageltman) wrote :

Turns out that when the libvirt container is down, the compute container also crashes, and dmesg gets very similar stack traces with mntput_no_expire at the top.

Revision history for this message
Boris Lukashev (rageltman) wrote :

The lockdep trace ended up too deep to discern (even journald didn't have the top of it, so it likely started during init), and we had to get the node back to a happy state for ops. That said, this seems to be a pretty serious bug: it produces ever longer delays in the ability to connect to a host to reboot it (the only way I've found so far to fix it), and after a reboot, if the compute container crashes like this again, the system starts to degrade pretty quickly. Since we resolved the blockers to nova-compute starting up correctly (disabled swtpm, got the libvirt containers up first), things have been running swimmingly. However, we now have to sleep with one eye open for alerts about containers going down or throwing errors, as the next thing after that is very unpleasant.

Revision history for this message
Boris Lukashev (rageltman) wrote (last edit ):

As of today (2a52fad2d0f92b1a3b93c on stable/xena), this problem still persists. If a user adds the following to their nova-compute.conf (per the OpenStack docs):
```
swtpm_enabled = true
```
it will blow up nova, because the compute and libvirt containers do not have the swtpm binary/configs, and they will crash incessantly. If they also have mounts inside them (say NFS), they will remount that NFS path without unmounting it first, until they overflow or hit the max mounts limit.
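
A rough way to watch for that mount leak while the container is crash-looping (assuming NFS as in the example above; adjust the pattern for whatever filesystem type is leaking in your deployment):
```
# count NFS entries in the host mount table every few seconds;
# a steadily climbing number means the remount leak is in progress
watch -n5 "grep -c ' nfs' /proc/mounts"
```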

Since I filed this issue, it has become more pressing, as newer operating systems that demand a TPM to work correctly are proliferating in the market.

While there is no CVE assigned to this concern, it does create a full-fledged denial of service and, due to the kernel stalls from the repeated remounts, may result in loss of or damage to data/systems.

Revision history for this message
Noel Ashford (nashford77) wrote :

This is impacting me as well - why has this not been fixed, given how simple it is to correct the Quay.io packages?

Revision history for this message
Noel Ashford (nashford77) wrote :

How can I submit the fix myself to quay.io? Please advise - it is this simple...

Add swtpm, trousers, and swtpm-tools plus the tss user. I would fix this if I knew how to push to quay.io / get it into master / backports.
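
For anyone else stuck on this, roughly what that looks like as a kolla template override (a sketch only: the block name, package list, and the assumption that the trousers package creates the tss user on Debian-family bases are from memory and may need adjusting for your base image):
```
{% extends parent_template %}

{% block nova_compute_footer %}
# add the swtpm stack; on Debian-family bases the trousers package
# should create the tss user, otherwise add a useradd step here
RUN apt-get update \
 && apt-get -y install swtpm swtpm-tools trousers \
 && apt-get clean
{% endblock %}
```
Something like `kolla-build --template-override <that file> nova-compute` should then produce an image with swtpm baked in.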

Revision history for this message
Noel Ashford (nashford77) wrote :

I fixed this myself by updating the image and doing a new commit with the necessary changes listed above; I then removed nova-compute, let it rebuild with my updated commit/image, and bam! Up like a charm. The root issue is that Quay.io has a bad nova-compute image.

Revision history for this message
Boris Lukashev (rageltman) wrote :

@nashford77: hand-editing images makes for a "fun time" when upgrading, unless you're maintaining your own repository and CI building off tags as they come. I had a very bad day with the "crashing and auto-restarting nova containers making INTMAX mounts" problem, having forgotten I'd done this on a cloud previously.

If the maintainers could fix this upstream, it would be grand as far as stability goes for smaller clouds which aren't running their own build stack.
