`lxc` commands returning `Error: Failed to begin transaction: context deadline exceeded`

Bug #2067633 reported by Skia
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| linux (Ubuntu) | New | Undecided | Unassigned | |
| lxd (Ubuntu) | New | Undecided | Unassigned | |

Bug Description

Since around 2024-05-24, we've been experiencing an error in the autopkgtest.ubuntu.com infrastructure.
The symptom is all `lxc` command invocations returning `Error: Failed to begin transaction: context deadline exceeded`, whether from the machine that usually runs the commands through an LXD remote (like `lxc list lxd-armhf-10.123.123.123:` to list containers on the given remote), or directly on the machine where the containers run (`lxc list` on the machine `10.123.123.123`).

This is happening quite randomly, sometimes every few days, sometimes twice a day on the same machine. At one point, all 16 workers were in that same situation at the same time (around 2024-05-26, a Sunday evening, when nobody was taking care of them).

Those workers are all `arm64` Jammy machines, and the containers running on them are all `armhf`, covering all the supported Ubuntu releases.
LXD version: 5.21.1 LTS, installed via snap
Kernel version: 5.15.0-107.117

Here are logs from around the issue on three machines:
https://pastebin.ubuntu.com/p/ZMCbY2gHmX/
https://pastebin.ubuntu.com/p/kVBp7RQb2n/
https://pastebin.ubuntu.com/p/HyGsgdXkqb/

As we can see, the pattern is always the same, and has also been observed on other problematic machines:
* First, `kernel: physZlw57F: renamed from eth0` and the following network-related lines.
  These lines are common during normal operation, but they also always appear before the kernel call trace. That might still just be a coincidence.
* Then, `kernel: Unable to handle kernel paging request at virtual address` with the call trace.
* Finally, LXD starts to have issues with the `Failed to begin transaction: context deadline exceeded`.
  Sometimes these lines only start to appear half an hour after the kernel issue, but we've never seen them before it.
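The three-stage pattern above can be checked for by grepping a worker's journal. A minimal self-contained sketch, using the marker strings from this report as sample data (`journal.txt` is a placeholder; on a real worker you would capture `journalctl -b` output instead):

```shell
#!/bin/sh
# Sample journal lines taken from the logs in this report; on a real
# worker, replace this with: journalctl -b > journal.txt
cat > journal.txt <<'EOF'
May 24 10:00:00 lxd-armhf-bos03-04 kernel: physZlw57F: renamed from eth0
Jun 11 21:49:08 lxd-armhf-bos03-04 kernel: Unable to handle kernel paging request at virtual address ffff37575eb49000
Jun 11 21:49:30 lxd-armhf-bos03-04 lxd.daemon[929146]: time="2024-06-11T21:49:30Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
EOF

# Count occurrences of each stage of the pattern, in order:
grep -c 'renamed from eth0' journal.txt
grep -c 'Unable to handle kernel paging request' journal.txt
grep -c 'Failed to begin transaction: context deadline exceeded' journal.txt
```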

One workaround we're experimenting with right now is running the HWE kernel (version 6.5.0-35.35~22.04.1); so far, the four machines running it haven't hit the issue in two days, but it's still too early to draw conclusions.
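For reference, moving a Jammy machine from the GA kernel series (5.15.x) to the HWE series is done with the standard Ubuntu 22.04 metapackage. A sketch, assuming a stock 22.04 install (the destructive steps are shown commented out):

```shell
#!/bin/sh
# Print the running kernel release; on the affected workers this shows
# the GA series (5.15.0-107.117), on the workaround machines 6.5.x.
uname -r

# To switch to the HWE kernel, install the standard Ubuntu 22.04 HWE
# metapackage and reboot into the new kernel (requires root):
#   sudo apt install linux-generic-hwe-22.04
#   sudo reboot
```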

Revision history for this message
Skia (hyask) wrote :
tags: added: cuqa-manual-testing
Revision history for this message
Skia (hyask) wrote :

Uploading the logs as attachments for a longer lifespan.

Revision history for this message
Skia (hyask) wrote :

This was hit again as soon as the load rose with a big influx of jobs. Same situation, with the 5.15.0-107.117 kernel, and the same pattern in the logs.

Revision history for this message
Simon Déziel (sdeziel) wrote :

@hyask, in the main issue description you mentioned the HWE kernel was helping. Is this still the case?

Revision history for this message
Skia (hyask) wrote :

That seems to be the case, yes. Since the issue only arises now and then, I can't be entirely sure, but so far I've never seen it on the HWE kernel.

I've just encountered the following:
```
Jun 11 20:22:12 lxd-armhf-bos03-04 lxd.daemon[929146]: time="2024-06-11T20:22:12Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
...
Jun 11 21:49:08 lxd-armhf-bos03-04 kernel: Unable to handle kernel paging request at virtual address ffff37575eb49000
...
Jun 11 21:49:30 lxd-armhf-bos03-04 lxd.daemon[929146]: time="2024-06-11T21:49:30Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
Jun 11 21:49:35 lxd-armhf-bos03-04 lxd.daemon[929146]: time="2024-06-11T21:49:35Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
```
It's the first time I've seen the `context deadline exceeded` before the kernel issue, but there was only one error, and it might be a legitimate one, since the machine was probably under high load at the time, and there have also been issues with the underlying arm64 hypervisor these days.
