simultaneously started lucid containers pause while starting after the first seven

Bug #1014916 reported by Gary Poster on 2012-06-19
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
lxc (Ubuntu)
Low
Unassigned

Bug Description

We are gathering more data, but it feels like we have enough to start a bug report.

On Precise, on a 16 core/32 thread EC2 machine (instance type cc2.8xlarge), when starting approximately eight or more Lucid containers simultaneously: after the first seven or so connect to dnsmasq within just a few seconds, the next group waits just under three minutes before they start to connect.

This affects our ability to ssh into the containers and do work.

We will provide short steps to duplicate this. Please note also the extra data we include at the end of this report, showing experiments we have run to rule out certain causes and scenarios.

To duplicate (again, I am doing this on a specific EC2 machine with a lot of cores, but I believe you can see something similar to this even on a 4 core/8 hyperthread machine):

sudo apt-get install moreutils
sudo lxc-create -t ubuntu -n lucid -- -r lucid -a i386 -b ubuntu
parallel -j 16 bash -c "lxc-start-ephemeral -d -o lucid" -- 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Now look at /var/log/syslog at the most recent dnsmasq entries. You will see that about seven containers get DHCPACK from dnsmasq within about 20 seconds; then you will see a pause of about three minutes; then you will see more containers report to dnsmasq. Here's an excerpt (the approximate three minute jump happens between lines 5 and 6, and then I include all lines up to the next ACK):

...
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.161 00:16:3e:38:9f:e2 lptests-temp-5uKcEua
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 10.0.3.37 00:16:3e:9c:ae:16
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.37 00:16:3e:9c:ae:16 lptests-temp-1HxiwbU
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 10.0.3.62 00:16:3e:69:83:2b
Jun 18 20:40:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.62 00:16:3e:69:83:2b lptests-temp-6o30rvH
Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 10.0.3.123 00:16:3e:0d:7b:65
Jun 18 20:43:05 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.203 00:16:3e:0d:7b:65
Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 10.0.3.123 00:16:3e:32:5b:99
Jun 18 20:43:08 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.68 00:16:3e:32:5b:99
Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 10.0.3.123 00:16:3e:7c:21:38
Jun 18 20:43:11 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.35 00:16:3e:7c:21:38
Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 10.0.3.123 00:16:3e:72:0c:64
Jun 18 20:43:14 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.208 00:16:3e:72:0c:64
Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 10.0.3.123 00:16:3e:ed:b3:a0
Jun 18 20:43:17 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.82 00:16:3e:ed:b3:a0
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 10.0.3.123 00:16:3e:e3:8f:1a
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.247 00:16:3e:e3:8f:1a
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 10.0.3.123 00:16:3e:5c:99:20
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.225 00:16:3e:5c:99:20
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPDISCOVER(lxcbr0) 10.0.3.123 00:16:3e:ed:b3:a0
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPOFFER(lxcbr0) 10.0.3.82 00:16:3e:ed:b3:a0
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPREQUEST(lxcbr0) 10.0.3.203 00:16:3e:0d:7b:65
Jun 18 20:43:20 ip-10-58-49-120 dnsmasq-dhcp[7546]: DHCPACK(lxcbr0) 10.0.3.203 00:16:3e:0d:7b:65 lptests-temp-C5po26y
...

What I expect to see is all of the containers reporting approximately together, without a three minute break, so that all containers are ready for work much sooner.

I looked in the syslogs of the started containers. The first one to connect to dnsmasq *after* the three minute pause also showed a three minute pause in its own syslog. The following is an excerpt from the syslog of that machine (notice the jump between 20:40 to 20:43). The "udevtrigger post-stop process" lines and following dhclient startup lines, down through the "plymouth-splash" line, are not found in the two fast machine syslogs that I looked at.

Jun 18 20:40:10 lptests-temp-C5po26y ntpd[542]: frequency initialized 0.000 PPM from /var/lib/ntp/ntp.drift
Jun 18 20:40:12 lptests-temp-C5po26y kernel: 6n[8su4vP >25vs3e6e3 rr[2zIr
Jun 18 20:40:12 lptests-temp-C5po26y kernel: <7[9]zPrt26t6pl3 w
Jun 18 20:40:13 lptests-temp-C5po26y kernel: 03rntt8hot<7[91h:6st42nre72.]Oorr67eq6 44vqvs
Jun 18 20:43:02 lptests-temp-C5po26y init: udevtrigger post-stop process (65) terminated with status 1
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Internet Systems Consortium DHCP Client V3.1.3
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Copyright 2004-2009 Internet Systems Consortium.
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: All rights reserved.
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: For info, please visit https://www.isc.org/software/dhcp/
Jun 18 20:43:02 lptests-temp-C5po26y dhclient:
Jun 18 20:43:02 lptests-temp-C5po26y init: plymouth-splash main process (639) terminated with status 1
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Listening on LPF/eth0/00:16:3e:0d:7b:65
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Sending on LPF/eth0/00:16:3e:0d:7b:65
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: Sending on Socket/fallback
Jun 18 20:43:02 lptests-temp-C5po26y dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 5
Jun 18 20:43:05 lptests-temp-C5po26y dhclient: DHCPOFFER of 10.0.3.203 from 10.0.3.1
Jun 18 20:43:05 lptests-temp-C5po26y dhclient: DHCPREQUEST of 10.0.3.203 on eth0 to 255.255.255.255 port 67

Here's a bit more data to close with.

We have experience indicating that more waiting occurs when you add more containers to be run, because simultaneously starting 32 containers will occasionally (about once in 15 attempts) result in one container that still has not connected to dnsmasq after ten minutes. We have not tried to duplicate this, because we hope that tackling the smaller issue described here will also address the larger one.

I have tried with a precise container. *Precise containers do not exhibit this bug* in the experiment I performed, at least up to 16 containers. I plan to duplicate this at least a couple more times to verify, but I'm comfortable saying initially that I believe this is specific to Lucid containers. I'll report back.

I have not yet had enough disk space on this machine to duplicate this without lxc-start-ephemeral. I intend to do so. Because the Precise container did not exhibit the problem for me even with ephemeral containers, I suspected that the unusual ephemeral characteristics, such as the use of overlayfs, might not be involved.

Finally, according to the stats we gathered while doing this, as far as we could tell, the following are not causes of contention, even when starting 24 simultaneous containers on a 16 core hyperthreaded machine:

* cpu (never got higher than about 80% utilization, load average still had some headroom)
* RAM (never got even half utilization of RAM)
* ioutil% never got above 1 or 2%
* dnsmasq itself does not seem to be an issue, looking at syslog, since we do not see restarts there

We plan to do a bit more analysis, as mentioned above, but we think this data might be good enough to ask for your input at this point. Thank you.

Gary Poster (gary) on 2012-06-19
description: updated
Serge Hallyn (serge-hallyn) wrote :

Thanks, Gary, I'll try to reproduce this.

Changed in lxc (Ubuntu):
importance: Undecided → Medium
Serge Hallyn (serge-hallyn) wrote :

@Gary,

I wasn't able to reproduce this on a m1.xlarge.

Regarding needing disk space, note that /dev/vdb1 should have a large amount of disk. You can unmount /mnt, pvcreate /dev/xvdb; vgcreate lxc /dev/xvdb; then use -B lvm as an additional lxc-create flag to create an lvm-backed container. SNapshotted lvm containers can be created wit lxc-clone -s -o p1 -n p2.

Serge Hallyn (serge-hallyn) wrote :

I wasn't able to reproduce this on a cc2.8xlarge either.

Changed in lxc (Ubuntu):
status: New → Confirmed
Serge Hallyn (serge-hallyn) wrote :

Ok I've been able to reproduce this. I've tried with lvm snapshot containers - those did NOT do this. I don't understand why, unless it's simply the timing.

I used the following script:

cat notit.sh
#!/bin/bash

maclist=[]
i=0
for c in /var/lib/lxc/*/config; do
  mac=`grep lxc.network.hwaddr $c | awk -F= '{ print $2 }'`
  maclist[$i]="$mac"
  i=$((i+1))
done

for j in `seq 0 $i`; do
 echo "got macaddr ${maclist[j]}"
done

for j in `seq 0 $i`; do
  tail -100 /var/log/syslog | grep -q ${maclist[j]}
  if [ $? -ne 0 ]; then
        echo "${maclist[j]} was not in syslog"
  fi
done

to find a mac address of a container which didn't get an address, (grepping for the mac addr in /var/lib/lxc/*/config), and attached to such a console. ps -ef showed:

UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:38 ? 00:00:00 /sbin/init
root 42 1 0 16:38 ? 00:00:00 upstart-udev-bridge --daemon
root 44 1 0 16:38 ? 00:00:00 /usr/sbin/sshd -D
syslog 47 1 0 16:38 ? 00:00:00 rsyslogd -c4
root 52 1 0 16:38 ? 00:00:00 udevd --daemon
root 54 1 0 16:38 ? 00:00:00 /sbin/udevadm monitor -e
root 58 1 0 16:38 ? 00:00:00 udevadm settle
root 81 1 0 16:38 ? 00:00:00 /sbin/getty -8 38400 tty4
root 84 1 0 16:38 ? 00:00:00 /sbin/getty -8 38400 tty2
root 85 1 0 16:38 ? 00:00:00 /sbin/getty -8 38400 tty3
root 90 1 0 16:38 ? 00:00:00 cron
root 117 1 0 16:38 ? 00:00:00 /bin/login --
root 118 1 0 16:38 ? 00:00:00 /sbin/getty -8 38400 /dev/console
ubuntu 133 117 0 16:39 tty1 00:00:00 -bash
ubuntu 145 133 0 16:40 tty1 00:00:00 ps -ef

and ifconfig -a showed:

eth0 Link encap:Ethernet HWaddr 00:16:3e:2d:02:b9
          inet6 addr: fe80::216:3eff:fe2d:2b9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:118 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:14800 (14.8 KB) TX bytes:550 (550.0 B)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

Note that udevadm monitor and settle are still running, and dhclient3 is
not yet running.

This is basically what I'd assumed since precise containers don't do this:
precise containers don't do udevadm trigger.

A workaround therefore will be to remove /etc/init/udevtrigger.conf from
the container.

Question remains exactly what is causing this - is there a rate limit in
the kernel or in user-space udev which makes this pause happen?

Serge Hallyn (serge-hallyn) wrote :

Actually simply disabling udevtrigger.conf doesn't work, because in lucid eth0 won't then come up. You also need to change the start on for /etc/init/networking.conf to

start on (local-filesystems)

Serge Hallyn (serge-hallyn) wrote :

(lowering priority as there is a workaround)

Changed in lxc (Ubuntu):
importance: Medium → Low
Gary Poster (gary) wrote :

Serge, thank you! The workaround appears to work very well for us. The containers started quickly, and it should not only give us more reliable starts but also seems to have taken at least three minutes off our average run time, as you might expect.

Gary Poster (gary) wrote :

This was working very well for us, and the lxc-start-ephemeral TRIES=60 was working fine. Starting last week, we began seeing recurrence of this problem, even with the workarounds applied and TRIES hacked to 180. I intend to circle around with Serge and see if he has any other ideas.

Gary Poster (gary) wrote :

Note that the workaround helps significantly but is not always sufficient. If I do not apply it, I can reliably encounter the problem every single test run.

Another team is working on moving us to run on Precise in production, which would let us use Precise containers. Precise containers do not appear to trigger the problem. Because of this, I will advise that we wait on pursuing the problem.

Serge Hallyn (serge-hallyn) wrote :

The go API bindings have been tested with a large number of simultaneous create/start/destroys, so I believe the upstream API thread-safety work must have fixed this at least upstream.

Changed in lxc (Ubuntu):
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 1.0.0~alpha1-0ubuntu2

---------------
lxc (1.0.0~alpha1-0ubuntu2) saucy; urgency=low

  * Add allow-stderr to autopkgtst restrictions as the Ubuntu template
    uses policy-rc.d to disable some daemons and that causes a message to
    be printed on stderr when the service tries to start.
 -- Stephane Graber <email address hidden> Thu, 12 Sep 2013 13:57:17 -0400

Changed in lxc (Ubuntu):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers