Comment 1 for bug 1908520

Pedro Guimarães (pguimaraes) wrote:

Hi, I managed to reproduce this issue on top of AWS with Juju 2.9.17.

I can see that, when the deployment is stuck, the leader unit keeps the "running" status instead of "idle", while some of the units are stuck on "Waiting for peers": https://pastebin.canonical.com/p/vNqtnnDmDj/

I can see the following in the logs: https://pastebin.canonical.com/p/pyYQg8kwwp/
This means the leader has been stuck in the loop below for almost 30 minutes:

    def hostcheck(self, nodes, iperf_duration):
        # Wait for other nodes to start their servers...
        for node in nodes:
            msg = "checking iperf on {}".format(node[1])
            hookenv.log(msg)
            cmd = "iperf -t {} -c {} -P{}".format(iperf_duration, node[1],
                                                  min(8, self.num_cpus()))
            os.system(cmd)

Here the leader iterates node by node and tries to reach out to each of the units.
Indeed, the leader had a stuck iperf process:
ubuntu@ip-10-200-4-158:~$ sudo ps aux | grep iperf
root 27614 0.0 0.0 2616 608 ? S 14:39 0:00 sh -c iperf -t 1 -c 10.200.3.53 -P2
root 27615 0.0 0.0 237348 2016 ? Sl 14:39 0:00 iperf -t 1 -c 10.200.3.53 -P2
ubuntu 33235 0.0 0.0 8172 672 pts/0 S+ 15:30 0:00 grep --color=auto iperf

I believe the root cause of this issue is that the hostcheck method uses os.system(cmd) instead of subprocess; os.system does not allow us to configure any timeout.
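
To make the difference concrete, here is a minimal, self-contained sketch (not the charm's code; it uses "sleep 60" as a stand-in for an iperf client hanging on an unreachable peer):

    import subprocess

    # os.system(cmd) blocks until the child exits and offers no timeout,
    # so a hung iperf client hangs the hook with it. With subprocess we
    # can bound the call and recover:
    try:
        # "sleep 60" stands in for an iperf run that never returns.
        subprocess.run(["sleep", "60"], timeout=5, check=True)
    except subprocess.TimeoutExpired:
        print("command timed out after 5s, continuing with the next node")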

I manually replaced the os.system(cmd) call above with:

subprocess.check_output(cmd.split(), timeout=20)

After doing that and manually stopping the hook process, the leader could run iperf for a bounded amount of time and then time out, which allowed it to iterate over all the nodes. Eventually, all units reached active/idle: https://pastebin.canonical.com/p/n55PT5ZzkQ/
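
For reference, a sketch of how the patched loop could look. It goes one step further than the one-line replacement above by also catching the timeout so the loop keeps iterating; the 20-second value is just what I used in this test, and it assumes the charm's existing hookenv import plus an import of subprocess:

    def hostcheck(self, nodes, iperf_duration):
        # Wait for other nodes to start their servers...
        for node in nodes:
            msg = "checking iperf on {}".format(node[1])
            hookenv.log(msg)
            cmd = "iperf -t {} -c {} -P{}".format(iperf_duration, node[1],
                                                  min(8, self.num_cpus()))
            try:
                # Bound the client run so an unreachable peer cannot hang
                # the hook indefinitely.
                subprocess.check_output(cmd.split(), timeout=20)
            except subprocess.TimeoutExpired:
                hookenv.log("iperf to {} timed out, moving on".format(node[1]))
            except subprocess.CalledProcessError as err:
                hookenv.log("iperf to {} failed: {}".format(node[1], err))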

I recommend we move away from "os.system" to subprocess and clearly define timeouts for each of the commands magpie needs to call.

One can reproduce this issue on AWS by simply deploying magpie as follows:

juju deploy cs:~openstack-charmers/magpie --constraints="instance-type=t2.medium root-disk=40G spaces=test-internal" -n40 --bind="test-internal"