MAAS stops working and deployment fails after `Loading ephemeral` step
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Fix Released | High | Adam Collard |
simplestreams | Fix Released | Undecided | Adam Collard |
simplestreams (Ubuntu) | Fix Released | Undecided | Paride Legovini |
Focal | Fix Released | Undecided | Paride Legovini |
Groovy | Fix Released | Undecided | Paride Legovini |
Hirsute | Fix Released | Undecided | Paride Legovini |
Bug Description
[Impact]
The bug is that simplestreams can get stuck waiting forever for an HTTP response that never comes, e.g. because of networking issues. This can potentially affect any package that depends on simplestreams, but it was specifically reported against MAAS, where it causes server deployments to time out.
[Test Plan]
Install an iptables rule to block SSL handshaking w/ the MAAS simplestreams repo:
-------
$ sudo iptables -A INPUT -p tcp -s 91.189.88.136 -m string --string maas.io --algo bm -j DROP
-------
Run the reproducer described below, and verify that it hangs indefinitely (I recommend waiting 60s):
-------
$ cat repro.py
#!/usr/bin/env python3
from simplestreams.
url = "https:/
r = RequestsUrlRead
-------
With the fix applied, verify that it times out after ~10s.
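The difference between the hanging and the fixed behaviour can also be illustrated without touching iptables. The following is a minimal sketch using only the Python standard library, not the simplestreams code itself: `start_stalled_server` and `fetch_with_timeout` are hypothetical helpers, and a local socket that accepts connections but never answers stands in for the blocked images server.

```python
import socket
import threading
import time
import urllib.error
import urllib.request


def start_stalled_server():
    """Listen on a free local port, accept connections, and never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    held = []  # keep accepted sockets open so the client never sees EOF

    def accept_forever():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return srv, srv.getsockname()[1]


def fetch_with_timeout(url, timeout):
    """Return (outcome, elapsed_seconds) for a GET bounded by `timeout`."""
    # Ignore proxy settings from the environment so the request goes direct.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as resp:
            resp.read()
        outcome = "ok"
    except socket.timeout:
        outcome = "timed out"
    except urllib.error.URLError as exc:
        timed_out = isinstance(exc.reason, socket.timeout)
        outcome = "timed out" if timed_out else "failed"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    server, port = start_stalled_server()
    print(fetch_with_timeout(f"http://127.0.0.1:{port}/", timeout=1.0))
    server.close()
```

Without a `timeout` argument the same request would block for as long as the server keeps the connection open, which is the behaviour this bug describes.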
[Regression Potential]
Scenarios where it takes more than 10s to initiate a connection are unlikely, but possible. Code that does not properly handle a timeout exception in these situations may begin to fail.
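For callers concerned about this regression, one option is to treat a timeout as a retryable condition rather than a fatal one. The sketch below is only an illustration under assumptions: `fetch_with_retries` is a hypothetical helper, and the builtin `TimeoutError` stands in for whatever exception the HTTP layer raises (with the `requests` library that would be `requests.exceptions.Timeout`).

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=0.0):
    """Call `fetch()` up to `attempts` times, retrying only on TimeoutError."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TimeoutError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # pause before the next attempt
    # Every attempt timed out: surface one clear error to the caller.
    raise RuntimeError(f"gave up after {attempts} timed-out attempts") from last_error
```

Any other exception propagates immediately, so only slow connections are retried.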
[Original Description]
= How to determine you are seeing this problem =
Does your MAAS server seem to get "hung up", where deployments suddenly start failing w/ lots of connection timeouts to the MAAS server?
Get a list of pids of your regiond processes:
$ ps -ef | grep regiond
Run strace on each one to see if one is stuck in a connect() or recv() call:
$ sudo strace -p $pid
recv(...
(normally you should see a lot of epoll_ctl() calls go by if not hung)
If one is hung, use lsof to see what it is connected to:
$ sudo lsof -i -a -p $pid
If you see an open connection to your images server, then this may be your problem. Running sudo kill -9 on the hung pid will cause it to respawn and recover.
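The first step above (collecting the regiond pids) can be scripted when there are many processes to check. This is a rough sketch that assumes a Linux host with `/proc` mounted; `find_pids` is a hypothetical helper, not part of MAAS.

```python
import os


def find_pids(pattern, proc_root="/proc"):
    """Return sorted PIDs whose command line contains `pattern`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is not readable
        # Arguments in cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        if pattern in cmdline:
            pids.append(int(entry))
    return sorted(pids)
```

Each pid returned by `find_pids("regiond")` can then be passed to `strace -p` and `lsof -i -a -p` as described above.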
Related branches
- dann frazier (community): Approve
- Canonical Server: Pending requested
- git-ubuntu developers: Pending requested
- Diff: 75 lines (+55/-0), 3 files modified:
debian/changelog (+8/-0)
debian/patches/0001-Add-10s-timeout-to-out-going-requests.patch (+46/-0)
debian/patches/series (+1/-0)
- Paride Legovini: Approve
- Server Team CI bot: Approve (continuous-integration)
- Diff: 32 lines (+5/-2), 1 file modified:
simplestreams/contentsource.py (+5/-2)
Changed in maas:
status: Invalid → New
Changed in maas:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Lee Trager (ltrager)
Changed in simplestreams:
status: New → In Progress
assignee: nobody → Adam Collard (adam-collard)
Changed in simplestreams:
status: In Progress → Fix Committed
Changed in simplestreams (Ubuntu):
status: New → Triaged
description: updated
description: updated
Changed in simplestreams (Ubuntu):
assignee: nobody → Paride Legovini (paride)
Changed in simplestreams (Ubuntu Focal):
status: New → Triaged
Changed in simplestreams (Ubuntu):
status: Triaged → In Progress
Changed in simplestreams (Ubuntu Focal):
assignee: nobody → Paride Legovini (paride)
description: updated
Changed in simplestreams (Ubuntu Focal):
status: Triaged → In Progress
Changed in simplestreams (Ubuntu Groovy):
assignee: nobody → Paride Legovini (paride)
Changed in simplestreams (Ubuntu Hirsute):
assignee: nobody → Paride Legovini (paride)
Changed in simplestreams (Ubuntu Groovy):
status: New → In Progress
Changed in simplestreams (Ubuntu Hirsute):
status: New → In Progress
Changed in simplestreams (Ubuntu):
status: Fix Released → Confirmed
status: Confirmed → Fix Released
Changed in simplestreams (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in simplestreams (Ubuntu Groovy):
status: In Progress → Fix Committed
Changed in simplestreams (Ubuntu Hirsute):
status: In Progress → Fix Committed
description: updated
Changed in maas:
assignee: Lee Trager (ltrager) → nobody
Changed in maas:
status: In Progress → Fix Committed
milestone: none → 3.4.0
Changed in maas:
milestone: 3.4.0 → 3.4.0-rc2
Changed in maas:
assignee: nobody → Adam Collard (adam-collard)
Changed in maas:
status: Fix Committed → Fix Released
attaching rackd.log and regiond.log