2012-05-04 18:16:13 |
Gary Poster |
bug |
|
|
added bug |
2012-05-04 22:07:40 |
Serge Hallyn |
lxc (Ubuntu): status |
New |
Triaged |
|
2012-05-04 22:07:44 |
Serge Hallyn |
lxc (Ubuntu): importance |
Undecided |
High |
|
2012-05-04 22:08:32 |
Serge Hallyn |
nominated for series |
|
Ubuntu Precise |
|
2012-05-04 22:08:32 |
Serge Hallyn |
bug task added |
|
lxc (Ubuntu Precise) |
|
2012-05-04 22:08:32 |
Serge Hallyn |
nominated for series |
|
Ubuntu Quantal |
|
2012-05-04 22:08:32 |
Serge Hallyn |
bug task added |
|
lxc (Ubuntu Quantal) |
|
2012-05-14 11:27:02 |
Francesco Banconi |
attachment added |
|
bug-994752-lxc-ip.debdiff https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/994752/+attachment/3145695/+files/bug-994752-lxc-ip.debdiff |
|
2012-05-14 12:13:41 |
Ubuntu Foundations Team Bug Bot |
tags |
|
patch |
|
2012-05-14 12:13:52 |
Ubuntu Foundations Team Bug Bot |
bug |
|
|
added subscriber Ubuntu Sponsors Team |
2012-05-16 15:47:34 |
Serge Hallyn |
lxc (Ubuntu Precise): status |
New |
Confirmed |
|
2012-05-16 15:47:39 |
Serge Hallyn |
lxc (Ubuntu Precise): importance |
Undecided |
High |
|
2012-05-16 16:20:12 |
Launchpad Janitor |
lxc (Ubuntu Quantal): status |
Triaged |
Fix Released |
|
2012-05-16 16:44:25 |
Launchpad Janitor |
branch linked |
|
lp:ubuntu/lxc |
|
2012-05-17 04:26:39 |
Bryce Harrington |
description |
When lxc-start-ephemeral is given a command to run (-- do_something) it wants to use lxc-attach to run the command, but lxc-attach is not ready yet. Instead, it parses the dhcp leases to figure out the IP for the container, and then tries to use ssh to run the command.
Twice today in tests involving lxc-start-ephemeral, the dhcp leases were unavailable and lxc-start-ephemeral failed. The machine was under fairly heavy load and was virtualized (EC2).
I'd like to try and make this less fragile. As discussed on IRC, using lxcip (http://bazaar.launchpad.net/~launchpad/lpsetup/trunk/files/head:/lplxcip/) should make this more reliable. Perhaps increasing the timeout in that code might be useful as well. |
[Impact]
<fill me in with explanation of severity and frequency of bug on users and justification for backporting the fix to the stable release>
[Development Fix]
<fill me in with an explanation of how the bug has been addressed in the development branch, including the relevant version numbers of packages modified in order to implement the fix. >
[Stable Fix]
<fill me in by pointing out a minimal patch applicable to the stable version of the package.>
[Test Case]
<fill me in with detailed *instructions* on how to reproduce the bug. This will be used by people later on to verify the updated package fixes the problem.>
1.
2.
3.
Broken Behavior:
Fixed Behavior:
[Regression Potential]
<fill me in with a discussion of likelihood and potential severity of regressions and how users could get inadvertently affected.>
[Original Report]
When lxc-start-ephemeral is given a command to run (-- do_something) it wants to use lxc-attach to run the command, but lxc-attach is not ready yet. Instead, it parses the dhcp leases to figure out the IP for the container, and then tries to use ssh to run the command.
Twice today in tests involving lxc-start-ephemeral, the dhcp leases were unavailable and lxc-start-ephemeral failed. The machine was under fairly heavy load and was virtualized (EC2).
I'd like to try and make this less fragile. As discussed on IRC, using lxcip (http://bazaar.launchpad.net/~launchpad/lpsetup/trunk/files/head:/lplxcip/) should make this more reliable. Perhaps increasing the timeout in that code might be useful as well. |
|
2012-05-17 19:03:52 |
Bryce Harrington |
removed subscriber Ubuntu Sponsors Team |
|
|
|
2012-05-22 23:51:50 |
Kristian Øllegaard |
bug |
|
|
added subscriber Kristian Øllegaard |
2012-05-24 16:39:36 |
Francis J. Lacoste |
bug |
|
|
added subscriber Francis J. Lacoste |
2012-05-24 18:19:00 |
Stéphane Graber |
lxc (Ubuntu Precise): assignee |
|
Stéphane Graber (stgraber) |
|
2012-05-24 18:19:03 |
Stéphane Graber |
lxc (Ubuntu Precise): status |
Confirmed |
In Progress |
|
2012-05-24 21:01:34 |
Stéphane Graber |
lxc (Ubuntu Precise): status |
In Progress |
Fix Committed |
|
2012-05-26 00:24:11 |
Gary Poster |
description |
[Impact]
<fill me in with explanation of severity and frequency of bug on users and justification for backporting the fix to the stable release>
[Development Fix]
<fill me in with an explanation of how the bug has been addressed in the development branch, including the relevant version numbers of packages modified in order to implement the fix. >
[Stable Fix]
<fill me in by pointing out a minimal patch applicable to the stable version of the package.>
[Test Case]
<fill me in with detailed *instructions* on how to reproduce the bug. This will be used by people later on to verify the updated package fixes the problem.>
1.
2.
3.
Broken Behavior:
Fixed Behavior:
[Regression Potential]
<fill me in with a discussion of likelihood and potential severity of regressions and how users could get inadvertently affected.>
[Original Report]
When lxc-start-ephemeral is given a command to run (-- do_something) it wants to use lxc-attach to run the command, but lxc-attach is not ready yet. Instead, it parses the dhcp leases to figure out the IP for the container, and then tries to use ssh to run the command.
Twice today in tests involving lxc-start-ephemeral, the dhcp leases were unavailable and lxc-start-ephemeral failed. The machine was under fairly heavy load and was virtualized (EC2).
I'd like to try and make this less fragile. As discussed on IRC, using lxcip (http://bazaar.launchpad.net/~launchpad/lpsetup/trunk/files/head:/lplxcip/) should make this more reliable. Perhaps increasing the timeout in that code might be useful as well. |
[Impact]
This affects anyone using lxc-start-ephemeral as part of an automated process for which intermittent failures are a problem. This includes the people who developed the initial version of the script, the Launchpad developers. Our automated test suite will fail, stopping our landing tools, whenever this failure is triggered.
[Development Fix]
1. No longer look in the container's file system for a dhcp table to get the IP of the container; instead, look in the host's network information. The host's information is more reliable and available sooner. r101 of the quantal lxc package has this change.
2. Increase the timeout waiting for the container's network and sshd to be ready.
Note that the current increase, from 30 retries at 1 per second to 60 retries at 1 per second, is unfortunately insufficient for the people who filed the bug. Making the retry count configurable would be ideal; increasing it to 300 would be sufficient, based on our experience so far.
stgraber suggested using the `parallel -l maxload` construct to keep the starts from being overwhelmed by load. We believe this is insufficient for at least two reasons. First, the point of our effort is to do a lot of work in parallel, with one lxc container per core. The work takes more than half an hour, so waiting for load to decrease would defeat the purpose. Second, from watching top, CPU contention does not always seem to be the problem.
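For illustration only, a configurable retry count could look roughly like the sketch below. This is not the code in the lxc package; the function name `retry` and the environment variables `LXC_RETRIES` and `LXC_RETRY_DELAY` are hypothetical names chosen for this example.

```bash
#!/bin/bash
# Hypothetical sketch: run a command until it succeeds or the retry
# budget runs out. LXC_RETRIES and LXC_RETRY_DELAY are assumed names,
# not options of lxc-start-ephemeral.
retry() {
    local max="${LXC_RETRIES:-300}"      # number of attempts (default 300)
    local delay="${LXC_RETRY_DELAY:-1}"  # seconds to sleep after a failure
    local attempt=0
    while [ "$attempt" -lt "$max" ]; do
        # Run the command passed as arguments; stop at the first success.
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    # Every attempt failed.
    return 1
}
```

A caller would wrap whatever readiness check it uses, for example `LXC_RETRIES=300 retry some_readiness_check`, where `some_readiness_check` stands in for the real test that the container's network and sshd are up.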
[Stable Fix]
[stgraber will need to specify]
[Test Case]
1. Create an lxc container (which has sshd running and your home directory mounted, as is the default). For the sake of these instructions, we will call it "lxctest".
2. Run something like the command below. Replace "username" with your user name. You might need to do this more or fewer times; we've seen it most easily on a 32 core (16 core hyperthreaded) machine trying to run 32 concurrent calls. To make this less annoying, you could create a temporary passphraseless ssh key.
parallel -j 16 bash -c "lxc-start-ephemeral -u username -o lxctest -- 'cat /etc/hostname'" -- 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Do this a few times.
Broken Behavior: On at least one of the runs, at least one of these calls fails (emitting an error message rather than the hostname), either because the code could not get the IP address in time or because the container's sshd wasn't ready in time.
Fixed Behavior: You get all 16 hostnames.
[Regression Potential]
The increased timeout might cause some code to wait longer than before to discover that something is wrong. The improved IP lookup code should have no negative effect.
[Original Report]
When lxc-start-ephemeral is given a command to run (-- do_something) it wants to use lxc-attach to run the command, but lxc-attach is not ready yet. Instead, it parses the dhcp leases to figure out the IP for the container, and then tries to use ssh to run the command.
Twice today in tests involving lxc-start-ephemeral, the dhcp leases were unavailable and lxc-start-ephemeral failed. The machine was under fairly heavy load and was virtualized (EC2).
I'd like to try and make this less fragile. As discussed on IRC, using lxcip (http://bazaar.launchpad.net/~launchpad/lpsetup/trunk/files/head:/lplxcip/) should make this more reliable. Perhaps increasing the timeout in that code might be useful as well. |
|
2012-05-31 23:17:05 |
Clint Byrum |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2012-05-31 23:17:09 |
Clint Byrum |
bug |
|
|
added subscriber SRU Verification |
2012-05-31 23:17:11 |
Clint Byrum |
tags |
patch |
patch verification-needed |
|
2012-05-31 23:43:35 |
Launchpad Janitor |
branch linked |
|
lp:ubuntu/precise-proposed/lxc |
|
2012-06-01 14:59:47 |
Stéphane Graber |
tags |
patch verification-needed |
patch verification-done |
|
2012-06-11 15:30:32 |
Launchpad Janitor |
lxc (Ubuntu Precise): status |
Fix Committed |
Fix Released |
|