snapd.seeded.service timeout

Bug #1897952 reported by Joshua Powers
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
snapd
High
Unassigned

Bug Description

Overview
===
We are currently testing AWS Outpost, which allows users to put an AWS rack in their own datacenter. On a number of AWS instances launched, about 10-20%, with Bionic we have seen the snapd.seeded.service fail due "error: cannot communicate with server: timeout exceeded while waiting for response".

Expected Result
===
No service failures, system not in degraded state

Actual Result
===
snapd.seeded.service in a failed state
error: cannot communicate with server: timeout exceeded while waiting for response

Steps to Reproduce
===
1. Launch a number of r5 or m5 instances with Bionic
2. `systemctl status snapd.seeded.service`

We have seen this using Bionic images dated:

20200908 using snapd 2.45.1+18.04.2
20200923 using snapd 2.46.1+18.04

Logs
===
journalctl snapd: https://pastebin.ubuntu.com/p/HJf3DF6y8H/
journalctl -b: http://paste.ubuntu.com/p/fwQPBdN7gN/
dmesg: http://paste.ubuntu.com/p/Fp2XtvHpmJ/
snap changes: https://pastebin.ubuntu.com/p/PMP4SMhSkD/

System Access
===
I have added the SSH keys of the snapd team to this system:

ssh ubuntu@54.202.120.142

Revision history for this message
Cody Shepherd (codyshepherd) wrote :

This also occurred in the latest Xenial release image in AWS, serial 20200916.

Joshua Powers (powersj)
description: updated
Revision history for this message
Joshua Powers (powersj) wrote :

After some discussions, I decided to launch another 12 instances to see if the issues had been resolved. Out of 12 instances, 4 failed to communicate during the snapd.seeded.service with the error:

Sep 30 20:47:31 ip-10-0-1-134 systemd[1]: Starting Wait until snapd is fully seeded...
Sep 30 20:48:22 ip-10-0-1-134 snap[1028]: error: cannot communicate with server: timeout exceeded while waiting for response
Sep 30 20:48:22 ip-10-0-1-134 systemd[1]: snapd.seeded.service: Main process exited, code=exited, status=1/FAILURE
Sep 30 20:48:22 ip-10-0-1-134 systemd[1]: snapd.seeded.service: Failed with result 'exit-code'.
Sep 30 20:48:22 ip-10-0-1-134 systemd[1]: Failed to start Wait until snapd is fully seeded.

I have kept one of these systems up and added the SSH keys of the snapd team:

ssh ubuntu@34.215.224.87

ssh-import-id lp:pedronis lp:mvo lp:zyga lp:mvo lp:sergio-j-cazzolato lp:morrisong lp:maciek-borzecki lp:anonymouse67

Revision history for this message
Joshua Powers (powersj) wrote :

After further discussion in snapstore, we enabled the following debug variables in /etc/enviornment:

SNAPD_DEBUG_HTTP=7
SNAPD_DEBUG=1

After restarting snapd, I installed a snap, vault, and noted that the install was occuring against cloudfront. Here is the full log:

https://paste.ubuntu.com/p/wm7vpHN8cd/

Revision history for this message
Maciej Borzecki (maciek-borzecki) wrote :

The client timeout in 2.46.1 is set to 2 minutes in which snapd should respond.

Looking at the later log, there is nothing that would suggest that `snap wait system seed.loaded` was run. Can you collect the log when the problem happens during boot?

Changed in snapd:
status: New → Incomplete
Revision history for this message
Joshua Powers (powersj) wrote :

Hi,

When you say logs are you after the snapd journal output? In which case does the log in the description not show the issue? https://pastebin.ubuntu.com/p/HJf3DF6y8H/

Changed in snapd:
status: Incomplete → New
Revision history for this message
Joshua Powers (powersj) wrote :

We will lose access to these systems on Monday. I would really appreciate if it is possible for you to connect and look at the logs on the system itself:

ssh ubuntu@34.215.224.87
ssh ubuntu@54.202.120.142

Revision history for this message
Paweł Stołowski (stolowski) wrote :

It seems we lost the opportunity to connect to these systems, anyway I think the attached logs are good enough:

Sep 30 17:57:04 ip-10-0-1-79 snapd[920]: daemon.go:343: started snapd/2.46.1+18.04 (series 16; classic) ubuntu/18.04 (amd64) linux/5.4.0-1025-aws.
Sep 30 17:57:04 ip-10-0-1-79 snapd[920]: daemon.go:436: adjusting startup timeout by 30s (pessimistic estimate of 30s plus 5s per snap)
Sep 30 17:57:06 ip-10-0-1-79 snapd[920]: helpers.go:105: error trying to compare the snap system key: system-key missing on disk
Sep 30 17:57:06 ip-10-0-1-79 systemd[1]: Started Snap Daemon.
Sep 30 17:57:06 ip-10-0-1-79 systemd[1]: Starting Wait until snapd is fully seeded...
Sep 30 17:57:13 ip-10-0-1-79 systemd-timesyncd[522]: Synchronized to time server 91.189.94.4:123 (ntp.ubuntu.com).
Sep 30 17:57:56 ip-10-0-1-79 snap[1214]: error: cannot communicate with server: timeout exceeded while waiting for response
Sep 30 17:57:56 ip-10-0-1-79 systemd[1]: snapd.seeded.service: Main process exited, code=exited, status=1/FAILURE

what I gather from this is snapd started on boot at 17:57:04 (started seeding), at 17:57:06 the snap.service.seeded job was started but failed after less than 1 minute at 17:57:56, despite using timeout of 2 minutes as Maciej said.

Now, I wonder if problem isn't related to time synchronization taking place in the middle of all this:
Sep 30 17:57:13 ip-10-0-1-79 systemd-timesyncd[522]: Synchronized to time server 91.189.94.4:123 (ntp.ubuntu.com).

This is during snap wait execution. In snap wait client code we set deadline time to "time.Now().Add(timeout))", so perhaps time gets skewed and we don't wait long enough.

Changed in snapd:
status: New → Triaged
importance: Undecided → High
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers