DEP8 consistent failure to launch lxd container in ppc64el

Bug #2012432 reported by Andreas Hasenack
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
lxd
Unknown
Unknown
samba (Ubuntu)
Fix Released
Medium
Andreas Hasenack

Bug Description

There is a consistent failure[1] that happens during "lxc launch" in the samba domain join DEP8 tests, but only on ppc64el. It passes in the other architectures.

I don't think it's related to ppc64el specifically, just to that environment (speed, load, maybe network setup).

This is the typical failure log:
## Setting up member server to join a domain using method realmd_sssd
## Got test dependencies: realmd krb5-user smbclient
## Launching lunar container
Error: Failed instance creation: write unix @->/var/snap/lxd/common/lxd/unix.socket: i/o timeout
## Something failed, gathering logs

That happens exactly at "lxc launch", which unfortunately is run with "-q":

    echo "## Launching ${release} container"
    lxc launch "ubuntu-daily:${release}" "${container_name}" -q

It might be related to downloading the lxd image for lunar in that environment.

1. https://autopkgtest.ubuntu.com/results/autopkgtest-lunar/lunar/ppc64el/s/samba/20230321_184559_192f0@/log.gz

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

I'm wondering if this happens when using images:ubuntu/lunar instead of ubuntu-daily:lunar.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Got more interesting logs on a PPA-driven DEP8 test[1] on ppc64el:

2023-03-21T22:01:06.811639+00:00 autopkgtest systemd[1]: Started snap.lxd.lxc.10225d32-2081-448b-8000-a3d05afb19d9.scope.
2023-03-21T22:02:31.412488+00:00 autopkgtest lxd.daemon[3633]: time="2023-03-21T22:02:30Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
2023-03-21T22:02:31.497189+00:00 autopkgtest lxd.daemon[3633]: time="2023-03-21T22:02:30Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
2023-03-21T22:02:32.073046+00:00 autopkgtest lxd.daemon[3633]: time="2023-03-21T22:01:56Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1

These logs match the timestamp of when "lxc launch" failed.

1. https://autopkgtest.ubuntu.com/results/autopkgtest-lunar-ahasenack-samba-dep8-troubleshooting/lunar/ppc64el/s/samba/20230321_220308_9b5df@/log.gz

Revision history for this message
Andreas Hasenack (ahasenack) wrote (last edit ):

LXD debug logs show the problematic timeout of 30s that is being exceeded:

Mar 22 13:16:45 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:16:45Z" level=debug msg="Instance operation lock created" action=create instance=member-server project=default reusable=false
Mar 22 13:16:45 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:16:45Z" level=info msg="Creating instance" ephemeral=false instance=member-server instanceType=container project=default
Mar 22 13:16:45 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:16:45Z" level=debug msg="Adding device" device=eth0 instance=member-server instanceType=container project=default type=nic
Mar 22 13:16:45 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:16:45Z" level=debug msg="Adding device" device=root instance=member-server instanceType=container project=default type=disk
Mar 22 13:16:45 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:16:45Z" level=info msg="Created instance" ephemeral=false instance=member-server instanceType=container project=default
Mar 22 13:16:45 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:16:45Z" level=debug msg="CreateInstanceFromImage started" instance=member-server project=default
Mar 22 13:16:45 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:16:45Z" level=debug msg="Running filler function" dev= driver=dir path=/var/snap/lxd/common/lxd/storage-pools/default/containers/member-server pool=default
Mar 22 13:16:45 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:16:45Z" level=info msg="Image unpack started" imageFile=/var/snap/lxd/common/lxd/images/6e24b9324ad0e87a7ff8387d4d7bbea6478342056b1c083565650b6e7f002e89 volName=member-server
Mar 22 13:16:46 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:16:46Z" level=debug msg="Updated metadata for operation" class=task description="Creating instance" operation=371f9ac1-7113-47f9-895b-2cdce855ab3c project=default
Mar 22 13:17:34 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:17:34Z" level=debug msg="Event listener server handler stopped" listener=5b1087ea-c641-4cda-8f5b-087e4e8f06bb local=/var/snap/lxd/common/lxd/unix.socket remote=@
Mar 22 13:17:35 autopkgtest lxd.daemon[4292]: time="2023-03-22T13:17:34Z" level=debug msg="Instance operation lock finished" action=create err="Instance \"create\" operation timed out after 30s" instance=member-server project=default reusable=false

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

If storage on these VMs is network-based, the slow network on the ppc64el VM could explain the slow i/o and longer time to unpack the container image.

ppc64el[1]: Retrieving image: rootfs: 100% (514.59kB/s)
amd64[2]: Retrieving image: rootfs: 100% (4.87MB/s)

1. https://autopkgtest.ubuntu.com/results/autopkgtest-lunar-ahasenack-samba-dep8-troubleshooting/lunar/ppc64el/s/samba/20230322_131759_d1066@/log.gz
2. https://autopkgtest.ubuntu.com/results/autopkgtest-lunar-ahasenack-samba-dep8-troubleshooting/lunar/amd64/s/samba/20230322_133523_34033@/log.gz

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

To workaround this problem for now, a skiptest hint was merged[1], so this particular upload should migrate.

1. https://code.launchpad.net/~ubuntu-release/britney/+git/hints-ubuntu/+merge/439411

Changed in samba (Ubuntu):
status: In Progress → Triaged
importance: High → Medium
Revision history for this message
Paride Legovini (paride) wrote (last edit ):

While discussing the problem Andreas spotted this unsquashfs OOM in the test logs:

2023-03-24T12:22:41.258642+00:00 autopkgtest kernel: [ 313.097173] Out of memory: Killed process 4323 (unsquashfs) total-vm:504128kB, anon-rss:288768kB, file-rss:2048kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:0

So apparently we have to issues:

1. The squashfs OOM, with the following failure:

## Launching lunar container
Error: Failed instance creation: websocket: close 1006 (abnormal closure): unexpected EOF

2. The LXD timeout the upstream LXD issue is about.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

We moved the samba/ppc64el tests to big_packages, and the first attempt after that passed. Will keep an eye on it.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The move to big_packages fixed it for now. Closing the bug.

Changed in samba (Ubuntu):
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.