[network/multi_nic] test hangs on dhclient when run

Bug #926229 reported by Jeff Lane 
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Checkbox | Fix Released | High | Jeff Lane |
isc-dhcp (Ubuntu) | Invalid | Undecided | Unassigned |

Bug Description

Discovered this while testing a server during a test run of the checkbox offline process.

First, the config:

1U server with 2 eth ports. To install, I fully simulated the offline experience by disconnecting both NIC ports from my LAN, installing Precise Server alpha 2 from a USB stick, and then installing the checkbox-certification tarball.

After that was installed, I reconnected the NIC ports to my LAN.

Note that because they were not connected during installation, there is no config for them in /etc/network/interfaces.
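A quick way to confirm that state is to list which interfaces have an "iface" stanza in the file. This is only a minimal sketch: the function name configured_ifaces and its optional file argument are illustrative, not part of checkbox or ifupdown.

```shell
# Sketch: list interfaces that have an "iface" stanza in an interfaces(5) file.
# configured_ifaces and its optional file argument are illustrative names.
configured_ifaces() {
    file="${1:-/etc/network/interfaces}"
    # Commented-out lines never have "iface" as their first field, so they are
    # skipped; print the name from each "iface <name> <family> <method>" line.
    awk '$1 == "iface" { print $2 }' "$file" | sort -u
}
```

On a system installed with both ports unplugged, as described above, running configured_ifaces would be expected to print only lo.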

Next, I started checkbox-certification and allowed it to begin testing. Once it hit network/multi_nic it just stopped.

The process being run at the time was dhclient, the first step in the multi_nic test.

Unfortunately, I forgot to save any logs before reinstalling the system for other things...

Revision history for this message
Jeff Lane  (bladernr) wrote :

I have re-installed the server now to hit this test again. First, I installed the system exactly as mentioned above.

Next, after installing the tarball, but before starting checkbox, I checked that each NIC port was available by running:

sudo dhclient eth0
sudo dhclient eth1

I verified that each port had an IP address, then shut each port down and killed the dhclient processes that were running.

Then I started checkbox on this server.

Once again, when the test got to network/multi_nic_eth0, the dhclient step hung. I'm attaching a bit of syslog showing what's happening. For whatever reason, dhclient loops continuously; it never registers that eth0 acked the address from the DHCP server.

In the attached log bits, you will see the unending loop for dhclient on eth0, followed eventually by eth1. Once I had some log output captured, I manually killed the dhclient process for eth0. At that point, the test went on and successfully did its testing on eth0, successfully used dhclient on eth1, and also successfully tested that adapter.

I don't know WHY dhclient is hanging on eth0 on my system, and only when run through checkbox, but it is. When I am outside of checkbox, I can use dhclient manually without a problem to bring eth0 up.

Revision history for this message
Jeff Lane  (bladernr) wrote :

Gonna mark this confirmed and target 0.13.5 for now. I'd like to see this fixed for Precise, or at least guarantee it's on the radar for later.

Changed in checkbox:
status: New → Confirmed
milestone: none → 0.13.5
Changed in checkbox:
milestone: 0.13.5 → 0.13.6
Revision history for this message
Jeff Lane  (bladernr) wrote :

Setting this to High because it's now being seen in the wild as part of bug #977041, as reported by Juergen Chiu.

Changed in checkbox:
importance: Undecided → High
milestone: 0.13.6 → 0.13.x
Ara Pulido (ara)
Changed in checkbox:
milestone: 0.13.x → 0.13.7
Ara Pulido (ara)
Changed in checkbox:
milestone: 0.13.7 → 0.13.x
Revision history for this message
Jeff Lane  (bladernr) wrote :

Ok... so after talking to roadmr about this, the easiest way to get around this for checkbox's sake is to remove the dhclient call from the network/multi_nic_* jobs and insist on the user configuring all network devices prior to running checkbox. I think that's a reasonable workaround for this issue.

The root cause seems to be dhclient choking a bit... if it's run more than once on eth0, it goes into this loop, but only on eth0. For eth1, I can run dhclient to bring the device up, shut eth1 down, and use dhclient again, and each time it just spawns a new instance of dhclient for eth1.

This seems to be very easily reproducible on my server here by doing the following:

1: comment out any ethX configuration in /etc/network/interfaces
2: reboot to ensure a clean system
3: run this short script:

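for x in 1 2 3; do
    sudo dhclient eth0       # bring eth0 up via DHCP
    ifconfig eth0            # show the address eth0 received
    sleep 1
    sudo ifconfig eth0 down  # take eth0 down again
    ifconfig                 # list the interfaces that are still up
done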

What you should see is that on the first iteration, dhclient successfully brings eth0 up, the script displays the output of 'ifconfig eth0' and then shuts eth0 down, and the second run of ifconfig shows only the 'lo' interface active.

On the second iteration, however, the script will appear to hang. So move to a different console and 'tail -f /var/log/syslog', and you will now see the infinite loop of DHCPREQUEST, DHCPDISCOVER, and DHCPOFFER messages but never an ack.

Move back to the first console and ctrl-c to stop dhclient, which is now hung. The script will move on to the 'ifconfig eth0' line again, and you'll see that eth0 was actually activated; dhclient just never realized that.

Then on the third iteration, you'll have to ctrl-c again to kill dhclient one more time.
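To check for the loop without watching 'tail -f', the message types can be tallied from the log. This is a rough sketch, assuming only that the dhclient log lines contain the message names mentioned above plus DHCPACK; dhcp_tally is an illustrative name.

```shell
# Sketch: tally DHCP message types in syslog-style text read from stdin.
# dhcp_tally is an illustrative name; it only counts lines that mention
# each message type and does not parse anything else from the log.
dhcp_tally() {
    awk '
        /DHCPDISCOVER/ { n["DHCPDISCOVER"]++ }
        /DHCPOFFER/    { n["DHCPOFFER"]++ }
        /DHCPREQUEST/  { n["DHCPREQUEST"]++ }
        /DHCPACK/      { n["DHCPACK"]++ }
        END {
            split("DHCPDISCOVER DHCPOFFER DHCPREQUEST DHCPACK", order, " ")
            # "+ 0" forces a numeric 0 for message types that never appeared.
            for (i = 1; i <= 4; i++) print order[i], n[order[i]] + 0
        }'
}
```

For example, 'grep dhclient /var/log/syslog | dhcp_tally' run during a hung iteration should show the DISCOVER, OFFER and REQUEST counts climbing while DHCPACK does not keep pace, if the loop is the one described here.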

Now, after that is complete, do a 'ps axf |grep dhclient' and you should see only one instance:

$ ps axf |grep dhclient |grep -v grep
2811 ? Ss 0:00 dhclient eth0

NOW, run that script again, but this time use eth1 (this has to be done on a system with two Ethernet devices that are connected to a working LAN).

When this runs against eth1, you'll see that the script completes all three loops successfully without dhclient hanging at all. And after it's done, redo the ps command:

$ ps axf |grep dhclient |grep -v grep
2811 ? Ss 0:00 dhclient eth0
3151 ? Ss 0:00 dhclient eth1
3238 ? Ss 0:00 dhclient eth1
3326 ? Ss 0:00 dhclient eth1

So it appears that there is actually a problem with dhclient.
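The per-interface difference in the two ps listings can be summarized with a small helper. This is a sketch only: count_dhclient is an illustrative name, and it assumes the interface is the last field of each line, as in the output above.

```shell
# Sketch: count dhclient processes per interface from ps output on stdin.
# count_dhclient is an illustrative name. Assumes the interface name is the
# last field on each dhclient line, as in the listings in this report.
count_dhclient() {
    # The [d] bracket keeps this awk process (whose own command line contains
    # the pattern text) from being counted when fed live ps output.
    awk '/[d]hclient/ { n[$NF]++ } END { for (k in n) print k, n[k] }' | sort
}
```

Fed the second listing above (for example via 'ps axf | count_dhclient'), this would report one instance for eth0 and three for eth1.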

Revision history for this message
Jeff Lane  (bladernr) wrote :

NOTE: I've marked this Fix Committed for the Checkbox task only. The branch I've linked ONLY provides a workaround to this issue in the checkbox test case, it does NOT address the root cause, which appears to be wonkiness with dhclient itself.

Changed in checkbox:
status: Confirmed → Fix Committed
assignee: nobody → Jeff Lane (bladernr)
Daniel Manrique (roadmr)
Changed in checkbox:
milestone: 0.13.x → 0.13.7
Jeff Lane  (bladernr)
Changed in checkbox:
status: Fix Committed → Fix Released
Revision history for this message
Stéphane Graber (stgraber) wrote :

I'm unable to reproduce on either precise or quantal using:
for x in 1 2 3; do sudo dhclient eth0; ifconfig eth0; sleep 1; sudo ifconfig eth0 down; ifconfig; done

However, please note that doing the above is wrong as dhclient will fork in the background when starting and doesn't like being run multiple times.

I'd suggest using "dhclient -1" for such cases.
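As a rough illustration of that suggestion, the loop could be rewritten along these lines. Note the assumptions: only "-1" comes from the comment above; the "dhclient -r" call (release the lease and stop the running client) is an addition so that no client is left behind between rounds, and IFACE, ROUNDS, DRY_RUN, run and cycle_iface are illustrative names. DRY_RUN defaults to 1 so the sketch prints the commands instead of touching the network.

```shell
#!/bin/sh
# Sketch only: IFACE, ROUNDS, DRY_RUN, run and cycle_iface are illustrative
# names, not taken from checkbox or from this bug report.
IFACE="${IFACE:-eth0}"
ROUNDS="${ROUNDS:-3}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = really run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cycle_iface() {
    i=1
    while [ "$i" -le "$ROUNDS" ]; do
        # -1: try once to get a lease; dhclient exits non-zero on failure.
        run sudo dhclient -1 "$IFACE" || return 1
        run ifconfig "$IFACE"
        # -r: release the lease and stop the client that may still be running
        # (assumption: added here, not part of the original suggestion).
        run sudo dhclient -r "$IFACE"
        run sudo ifconfig "$IFACE" down
        i=$((i + 1))
    done
}
```

Calling cycle_iface prints the planned commands; 'DRY_RUN=0 cycle_iface' would execute them and needs sudo rights plus a cabled interface.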

Changed in isc-dhcp (Ubuntu):
status: New → Incomplete
Revision history for this message
Stéphane Graber (stgraber) wrote :

Closing this bug as it's gone over 6 months without more detailed reproducing steps being provided.

Changed in isc-dhcp (Ubuntu):
status: Incomplete → Invalid