Units get stuck in "Waiting for peers" with a large number of units

Bug #1908520 reported by Aurelien Lourot
This bug affects 5 people
Affects: charm-magpie
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Hi, I managed to reproduce this issue on top of AWS with Juju 2.9.17.

I can see that, when it is stuck, the leader unit keeps the "running" status instead of "idle" while some of the units are stuck on "Waiting for peers": https://pastebin.canonical.com/p/vNqtnnDmDj/

I can see in the logs (https://pastebin.canonical.com/p/pyYQg8kwwp/) that the leader has been stuck in the loop below for almost 30 minutes:

    def hostcheck(self, nodes, iperf_duration):
        # Wait for other nodes to start their servers...
        for node in nodes:
            msg = "checking iperf on {}".format(node[1])
            hookenv.log(msg)
            cmd = "iperf -t {} -c {} -P{}".format(iperf_duration, node[1],
                                                  min(8, self.num_cpus()))
            os.system(cmd)

Where the leader iterates node-by-node and tries to reach out to each of the units.
Indeed, the leader had an iperf process stuck:
ubuntu@ip-10-200-4-158:~$ sudo ps aux | grep iperf
root 27614 0.0 0.0 2616 608 ? S 14:39 0:00 sh -c iperf -t 1 -c 10.200.3.53 -P2
root 27615 0.0 0.0 237348 2016 ? Sl 14:39 0:00 iperf -t 1 -c 10.200.3.53 -P2
ubuntu 33235 0.0 0.0 8172 672 pts/0 S+ 15:30 0:00 grep --color=auto iperf

I believe the root cause of this issue is that we are using "os.system(cmd)" in the hostcheck method instead of subprocess; os.system does not allow us to configure any timeouts.

I manually replaced the os.system(cmd) call above with:

subprocess.check_output(cmd.split(), timeout=20)

After doing that and manually stopping the hook process, the leader could run iperf for a defined amount of time and then time out, allowing it to iterate over all the nodes. Eventually, I got all units in active/idle: https://pastebin.canonical.com/p/n55PT5ZzkQ/

I recommend we move away from "os.system" to subprocess and clearly define timeouts for each of the commands magpie needs to call.
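A minimal sketch of such a wrapper; the helper name and the 20-second default are illustrative, not taken from the charm:

```python
import subprocess


def run_with_timeout(cmd, timeout=20):
    """Run a command, returning its output, or None if it times out.

    Unlike os.system(), subprocess.check_output() accepts a timeout and
    kills the child process when it expires, so a single unreachable
    peer can no longer wedge the whole hostcheck loop.
    """
    try:
        return subprocess.check_output(cmd.split(), timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
```

With something like this in place, the hostcheck loop can log a timeout for an unreachable node and move on to the next one instead of blocking indefinitely.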

One can reproduce this issue on AWS by simply deploying magpie as follows:

juju deploy cs:~openstack-charmers/magpie --constraints="instance-type=t2.medium root-disk=40G spaces=test-internal" -n40 --bind="test-internal"

Changed in charm-magpie:
status: New → Confirmed
Revision history for this message
Sérgio Manso (sergiomanso) wrote :

I'm also facing this bug (using Focal, Juju 2.8.11) when deploying magpie with 20+ machines (either baremetal or lxd).
One of the workarounds I found was to force a re-election of the leader (e.g. by stopping the unit) so that the waiting units could resume their operations with the new leader.

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

This issue is still impacting deployments on a regular basis and causes delays in the HAP phase of most projects with several nodes. Is there any update on fixing this issue?

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

As pointed out, this issue has been impacting us on a regular basis. I am adding the field-medium SLA to get it resolved.

Revision history for this message
Nobuto Murata (nobuto) wrote :

We are doing some work to use charm actions as a replacement for the peer-relation-based iperf testing in juju status. If I'm not mistaken, using a peer relation to run this type of test with a certain duration, like 30 seconds, is not scalable by design. So setting check_iperf=False and running iperf in an action should make the whole testing process shorter.

- https://review.opendev.org/c/openstack/charm-magpie/+/867756
"Change run-iperf action total-run-time to seconds"
and
"unconditionally install the iperf apt package in the charm, because it should always be available for the case where the user wants to run the run-iperf action"

- https://review.opendev.org/c/openstack/charm-magpie/+/879333
For
"run-iperf action does not allow to precise a single concurrency mode"
https://bugs.launchpad.net/charm-magpie/+bug/2015173
and
"Bandwidth output from run-iperf action is not calculated properly"
https://bugs.launchpad.net/charm-magpie/+bug/2015174
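Assuming the option and action names from the reviews above (check_iperf and run-iperf; the total-run-time value here is illustrative), the action-based flow on Juju 2.x might look like:

```shell
# Disable the peer-relation iperf run so units settle quickly...
juju config magpie check_iperf=false

# ...and trigger iperf on demand, bounded by a fixed run time.
juju run-action magpie/0 run-iperf total-run-time=30 --wait
```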
