simultaneously started lucid containers pause while starting after the first seven
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxc (Ubuntu) | Fix Released | Low | Unassigned |
Bug Description
We are gathering more data, but it feels like we have enough to start a bug report.
On Precise, on a 16 core/32 thread EC2 machine (instance type cc2.8xlarge), when starting approximately eight or more Lucid containers simultaneously: after the first seven or so connect to dnsmasq within just a few seconds, the next group waits just under three minutes before they start to connect.
This affects our ability to ssh into the containers and do work.
We will provide short steps to duplicate this. Please note also the extra data we include at the end of this report, showing experiments we have run to rule out certain causes and scenarios.
To duplicate (again, I am doing this on a specific EC2 machine with a lot of cores, but I believe you can see something similar to this even on a 4 core/8 hyperthread machine):
sudo apt-get install moreutils
sudo lxc-create -t ubuntu -n lucid -- -r lucid -a i386 -b ubuntu
parallel -j 16 bash -c "lxc-start-
Now look at the most recent dnsmasq entries in /var/log/syslog. You will see that about seven containers get a DHCPACK from dnsmasq within about 20 seconds; then there is a pause of about three minutes; then more containers report to dnsmasq. Here's an excerpt (the approximately three minute jump happens between lines 5 and 6, and I include all lines up to the next ACK):
...
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.161 00:16:3e:38:9f:e2 lptests-
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 10.0.3.37 00:16:3e:9c:ae:16
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.37 00:16:3e:9c:ae:16 lptests-
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 10.0.3.62 00:16:3e:69:83:2b
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.62 00:16:3e:69:83:2b lptests-
Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(
Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.203 00:16:3e:0d:7b:65
Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(
Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.68 00:16:3e:32:5b:99
Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(
Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.35 00:16:3e:7c:21:38
Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(
Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.208 00:16:3e:72:0c:64
Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(
Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.82 00:16:3e:ed:b3:a0
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.247 00:16:3e:e3:8f:1a
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.225 00:16:3e:5c:99:20
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.82 00:16:3e:ed:b3:a0
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 10.0.3.203 00:16:3e:0d:7b:65
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.203 00:16:3e:0d:7b:65 lptests-
...
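Pauses like the one above can be found without reading the log by eye, by diffing consecutive DHCPACK timestamps. A sketch, assuming the syslog timestamp format shown in the excerpt (and that the run does not cross midnight):

```shell
# Report gaps longer than 60 seconds between consecutive DHCPACK lines.
# Field $3 is the HH:MM:SS part of the syslog timestamp shown above.
gap_report() {
    grep 'DHCPACK' "$1" | awk '{
        split($3, t, ":")
        secs = t[1]*3600 + t[2]*60 + t[3]
        if (NR > 1 && secs - prev > 60)
            printf "gap of %d seconds before %s %s %s\n", secs - prev, $1, $2, $3
        prev = secs
    }'
}
# Example: gap_report /var/log/syslog
```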
What I expect to see is all of the containers reporting approximately together, without a three minute break, so that all containers are ready for work much sooner.
I looked in the syslogs of the started containers. The first one to connect to dnsmasq *after* the three minute pause also showed a three minute pause in its own syslog. The following is an excerpt from that container's syslog (notice the jump from 20:40 to 20:43). The "udevtrigger post-stop process" lines and the dhclient startup lines that follow, down through the "plymouth-splash" line, are not found in the syslogs of the two fast containers that I looked at.
Jun 18 20:40:10 lptests-
Jun 18 20:40:12 lptests-
Jun 18 20:40:12 lptests-
Jun 18 20:40:13 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:02 lptests-
Jun 18 20:43:05 lptests-
Jun 18 20:43:05 lptests-
Here's a bit more data to close with.
Our experience indicates that the waiting grows as more containers are started: simultaneously starting 32 containers will occasionally (about once in 15 attempts) leave one container that still has not connected to dnsmasq after ten minutes. We have not tried to duplicate this, because we hope that tackling the smaller issue described here will also address the larger one.
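When watching for stragglers like that, counting DHCP leases as they appear is simpler than tailing syslog. A sketch; the lease-file path is an assumption about where Ubuntu's lxc-net dnsmasq instance writes leases, so adjust as needed:

```shell
# Count leases handed out on lxcbr0 so far; prints 0 if the file is absent.
# /var/lib/misc/dnsmasq.lxcbr0.leases is an assumed path for the lxc dnsmasq.
count_leases() {
    if [ -f "$1" ]; then wc -l < "$1"; else echo 0; fi
}
# Example: poll every 5 seconds until all 16 containers have a lease:
#   while [ "$(count_leases /var/lib/misc/dnsmasq.lxcbr0.leases)" -lt 16 ]; do
#       sleep 5
#   done
```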
I have tried with a precise container. *Precise containers do not exhibit this bug* in the experiment I performed, at least up to 16 containers. I plan to duplicate this at least a couple more times to verify, but I'm comfortable saying initially that I believe this is specific to Lucid containers. I'll report back.
I have not yet had enough disk space on this machine to duplicate this without lxc-start-
Finally, according to the stats we gathered while doing this, as far as we could tell, the following are not causes of contention, even when starting 24 simultaneous containers on a 16 core hyperthreaded machine:
* CPU: utilization never rose above about 80%, and the load average still had headroom
* RAM: utilization never reached even half of available memory
* disk I/O: ioutil% never rose above 1 or 2%
* dnsmasq itself: syslog shows no restarts, so it does not appear to be the bottleneck
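The first three measurements above can be sampled with a small script run in a loop while the containers start. This sketch reads /proc directly so it needs only awk; for ioutil%, `iostat -dx` (from the sysstat package) is the usual tool:

```shell
# One sampling pass of the resources listed above; run it in a loop (or under
# `watch -n 5`) while the containers are starting.
awk '{printf "load average: %s %s %s\n", $1, $2, $3}' /proc/loadavg
awk '/^MemTotal|^MemFree/ {printf "%s %d MiB\n", $1, $2/1024}' /proc/meminfo
# ioutil% per device (requires sysstat): iostat -dx
```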
We plan to do a bit more analysis, as mentioned above, but we think this data might be good enough to ask for your input at this point. Thank you.
description: updated
Changed in lxc (Ubuntu):
status: New → Confirmed
Thanks, Gary, I'll try to reproduce this.