Units get stuck in "Waiting for peers" with a large number of units

Bug #1908520 reported by Aurelien Lourot
This bug affects 5 people
Affects: charm-magpie
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Hi, I managed to reproduce this issue on top of AWS with Juju 2.9.17.

I can see that, when it is stuck, the leader unit keeps the "running" status instead of "idle" while some of the units are stuck on "Waiting for peers": https://pastebin.canonical.com/p/vNqtnnDmDj/

I can see in the logs (https://pastebin.canonical.com/p/pyYQg8kwwp/) that the leader has been stuck in the loop below for almost 30 minutes:

    def hostcheck(self, nodes, iperf_duration):
        # Wait for other nodes to start their servers...
        for node in nodes:
            msg = "checking iperf on {}".format(node[1])
            hookenv.log(msg)
            cmd = "iperf -t {} -c {} -P{}".format(iperf_duration, node[1],
                                                  min(8, self.num_cpus()))
            os.system(cmd)

Where the leader iterates node-by-node and tries to reach out to each of the units.
Indeed, the leader had an iperf process stuck:
ubuntu@ip-10-200-4-158:~$ sudo ps aux | grep iperf
root 27614 0.0 0.0 2616 608 ? S 14:39 0:00 sh -c iperf -t 1 -c 10.200.3.53 -P2
root 27615 0.0 0.0 237348 2016 ? Sl 14:39 0:00 iperf -t 1 -c 10.200.3.53 -P2
ubuntu 33235 0.0 0.0 8172 672 pts/0 S+ 15:30 0:00 grep --color=auto iperf

I believe the root cause of this issue is that we are using "os.system(cmd)" in the hostcheck method instead of subprocess; os.system does not allow us to configure any timeouts.

I manually replaced the os.system(cmd) call above with:

subprocess.check_output(cmd.split(), timeout=20)

After doing that and manually stopping the hook process, the leader could run iperf for a defined amount of time and then time out, allowing it to iterate over all the nodes. Eventually, I got all units in active/idle: https://pastebin.canonical.com/p/n55PT5ZzkQ/

I recommend we move away from "os.system" to subprocess and clearly define timeouts for each of the commands magpie needs to call.
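A minimal sketch of such a wrapper; the helper name and the 20-second default are illustrative, not taken from the charm:

```python
import subprocess


def run_with_timeout(cmd, timeout=20):
    """Run a command, returning its output, or None if it times out.

    Unlike os.system(), subprocess.check_output() accepts a timeout and
    kills the child process when it expires, so a single unreachable
    peer can no longer wedge the whole hostcheck loop.
    """
    try:
        return subprocess.check_output(cmd.split(), timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
```

With something like this in place, the hostcheck loop can log a timeout for an unreachable node and move on to the next one instead of blocking indefinitely.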

One can reproduce this issue on AWS by simply deploying magpie as follows:

juju deploy cs:~openstack-charmers/magpie --constraints="instance-type=t2.medium root-disk=40G spaces=test-internal" -n40 --bind="test-internal"

Changed in charm-magpie:
status: New → Confirmed
Revision history for this message
Sérgio Manso (sergiomanso) wrote :

I'm also facing this bug (using Focal, Juju 2.8.11) when deploying magpie with 20+ machines (either baremetal or lxd).
One of the workarounds I found was to force a re-election of the leader (e.g. by stopping the unit) so that the waiting units could resume their operations with the new leader.

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

This issue is still impacting deployments on a regular basis and causes delays in the HAP phase of most projects with several nodes. Is there any update on fixing this issue?

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

As pointed out, this issue has been impacting us on a regular basis. I am adding the field-medium SLA to get it resolved.

Revision history for this message
Nobuto Murata (nobuto) wrote :

We are doing some work to use charm actions as a replacement for the peer-relation-based iperf testing in juju status. If I'm not mistaken, using a peer relation to run this type of test with a certain duration, like 30 seconds, is not scalable by design. So setting check_iperf=False and running iperf in an action should make the whole testing process shorter.

- https://review.opendev.org/c/openstack/charm-magpie/+/867756
"Change run-iperf action total-run-time to seconds"
and
"unconditionally install the iperf apt package in the charm, because it should always be available for the case where the user wants to run the run-iperf action"

- https://review.opendev.org/c/openstack/charm-magpie/+/879333
For
"run-iperf action does not allow to precise a single concurrency mode"
https://bugs.launchpad.net/charm-magpie/+bug/2015173
and
"Bandwidth output from run-iperf action is not calculated properly"
https://bugs.launchpad.net/charm-magpie/+bug/2015174
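Assuming the option and action names from the reviews above (check_iperf and run-iperf; the total-run-time value here is illustrative), the action-based flow on Juju 2.x might look like:

```shell
# Disable the peer-relation iperf run so units settle quickly...
juju config magpie check_iperf=false

# ...and trigger iperf on demand, bounded by a fixed run time.
juju run-action magpie/0 run-iperf total-run-time=30 --wait
```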
