udhcp is randomly failing on the Arndale in lava-ssh sessions

Bug #1239820 reported by Mike Holmes
Affects               Status         Importance  Assigned to  Milestone
LAVA Validation Lab   Fix Released   High        Dave Pigott
Linaro OpenEmbedded   Fix Released   High        Unassigned
linaro-networking     Fix Released   High        Unassigned

Bug Description

I ran the same lava-ssh test on the same Arndale in the regular lab over a couple of hours, with RT and non-RT kernels, and they share a common issue: udhcp randomly fails.

regular kernel
 - http://validation.linaro.org/scheduler/job/78934 Fail
 - http://validation.linaro.org/scheduler/job/78934/log_file#L_15_377

regular kernel
 - http://validation.linaro.org/scheduler/job/78933 Pass

rt kernel
 - http://validation.linaro.org/scheduler/job/78926 Fail
 - http://validation.linaro.org/scheduler/job/78926/log_file#L_29_372

rt kernel
 - http://validation.linaro.org/scheduler/job/78917 Pass

Mike Holmes (mike-holmes) wrote :

This also occurs daily for KVM which uses the network, for example http://validation.linaro.org/scheduler/job/78659/log_file#L_15_377

Mike Holmes (mike-holmes) wrote :

Ubuntu file systems appear to be ok in a limited number of trials.
http://validation.linaro.org/scheduler/job/79397/log_file

Mike Holmes (mike-holmes) wrote :

http://validation.linaro.org/scheduler/job/79424/log_file#L_31_367

This also fails in DHCP for 3.10.15 rt11, in addition to the kernels tested above.

Mike Holmes (mike-holmes) wrote :

I think we need something like Wireshark to determine why DHCP is failing for just about any OE image / kernel you care to try. How do we get access to the DHCP server in the lab?

Matthew Hart (matthew-hart) wrote :

As I've mentioned before, a recurring theme in the logs you have linked:

http://validation.linaro.org/scheduler/job/79440/log_file#L_15_435
http://validation.linaro.org/scheduler/job/79424/log_file#L_31_367
http://validation.linaro.org/scheduler/job/78934/log_file#L_15_367
http://validation.linaro.org/scheduler/job/78926/log_file#L_29_366

is that udhcpc is already well into its DHCP requests before the kernel has even noticed that the network link is up, so a considerable part of the total time udhcpc is willing to wait is lost without any requests making it out of the board.
The DHCP server then has only a short time to respond. On a smaller network that is probably fine, but with the large number of devices in ours and the sheer amount of DHCP traffic, it is likely taking just a tiny bit too long to respond, and the board has already given up.
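
To put rough numbers on that, here is a small sketch of the timing budget. It assumes a simplified client that sends a fixed number of discover packets at a fixed interval and gives up one interval after the last one; the function name and the values used below are illustrative only, not measured in the lab and not taken from the OE udhcpc configuration.

```python
import math


def response_window(tries, interval_s, link_up_delay_s):
    """Estimate how much of the DHCP client's retry budget survives a late link.

    Simplified model (an assumption, not udhcpc's exact behaviour):
    discovers go out at t = 0, interval, 2*interval, ... and the client
    gives up at t = tries * interval. Anything sent before the link is
    up never leaves the board.

    Returns (packets_on_wire, seconds_left_for_server).
    """
    give_up_at = tries * interval_s
    # Index of the first discover sent at or after link-up.
    first_ok = max(math.ceil(link_up_delay_s / interval_s), 0)
    if first_ok >= tries:
        # Every discover was sent into a dead link.
        return 0, 0
    return tries - first_ok, give_up_at - first_ok * interval_s


if __name__ == "__main__":
    for delay in (0, 5, 7):
        packets, window = response_window(tries=3, interval_s=3, link_up_delay_s=delay)
        print(f"link up after {delay}s: {packets} packet(s) on the wire, "
              f"{window}s for the server to answer")
```

With three packets sent three seconds apart, a link that comes up after five seconds leaves a single packet on the wire and a three-second window for the server. A link that comes up after seven seconds leaves nothing at all, which would look like a random DHCP failure from the outside.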

Mike Holmes (mike-holmes) wrote : Re: [Bug 1239820] Re: udhcp is randomly failing on the Arndale in lava-ssh sessions

I hear you Matt, but given the very widespread use of OE, why is this
not seen elsewhere by many other people, unless our setup is considerably
slower than normal?

>>
The DHCP server then only has a short amount of time to respond. On a
smaller network that is probably fine, but with the large number of devices
in ours and the sheer amount of DHCP traffic it's likely taking just a tiny
bit too long to respond and the board has already given up.
>>

If we can be sure the root cause is that our lab is slow, but "acceptably
slow" rather than in need of fixing, then modifying all the Linaro OE
images to wait longer makes sense, but it feels like a band-aid.

I think we need Fathi to comment on modifying all the Linaro OE images to
wait longer, because every Linaro OE image we tried failed the same way.

If Fathi is fine with the -n fix in the OE images, I am.

Last note, this usually manifests in the deployed image boot, not in the
master image. Is that LAVA-related in any way?

Mike

On 16 October 2013 18:35, Matthew Hart <email address hidden> wrote:

> As I've mentioned before, a recurring theme in the logs you have linked:
>
> http://validation.linaro.org/scheduler/job/79440/log_file#L_15_435
> http://validation.linaro.org/scheduler/job/79424/log_file#L_31_367
> http://validation.linaro.org/scheduler/job/78934/log_file#L_15_367
> http://validation.linaro.org/scheduler/job/78926/log_file#L_29_366
>
> is that udhcpc is already well into its DHCP requests before the kernel
> has even noticed that the network link is up, so a considerable amount of
> the total time udhcpc is willing to wait, is lost without any requests
> making it out of the board.
> The DHCP server then only has a short amount of time to respond. On a
> smaller network that is probably fine, but with the large number of devices
> in ours and the sheer amount of DHCP traffic it's likely taking just a tiny
> bit too long to respond and the board has already given up.

Dave Pigott (dpigott) wrote :

We will reconfirm this after deployment of dhcpd on Monday.

Changed in lava-lab:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Dave Pigott (dpigott)
Mike Holmes (mike-holmes) wrote :

The master images appear to always work, and I noticed this comment in the log:

"lava-master-network waiting 120 seconds for a network device."

http://validation.linaro.org/scheduler/job/80099/log_file#L_3_437
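
For comparison, waiting for the link before starting the DHCP client could look roughly like the sketch below. It assumes Linux sysfs (`/sys/class/net/<iface>/carrier`) and uses hypothetical helper names. It is not the actual lava-master-network logic, which is not shown in this report; the 120 second default only mirrors the log message above.

```python
import time
from pathlib import Path


def read_carrier(iface):
    """Return True if sysfs reports link carrier for iface (Linux only)."""
    try:
        return Path(f"/sys/class/net/{iface}/carrier").read_text().strip() == "1"
    except OSError:
        # Interface missing or administratively down: treat as no carrier.
        return False


def wait_for_carrier(iface, timeout_s=120.0, poll_s=0.5,
                     probe=read_carrier, sleep=time.sleep, clock=time.monotonic):
    """Poll until the link is up or timeout_s has elapsed.

    Returns True as soon as probe(iface) reports carrier, False on timeout.
    probe, sleep and clock are injectable so the loop can be tested
    without real hardware.
    """
    deadline = clock() + timeout_s
    while True:
        if probe(iface):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # "eth0" is an assumed interface name.
    if wait_for_carrier("eth0", timeout_s=10):
        print("link is up, safe to start the DHCP client")
    else:
        print("no carrier within the timeout")
```

Because the function returns as soon as carrier is detected, a generous timeout costs nothing on boards where the link comes up quickly.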

Kim Phillips (kim-phillips) wrote :

Based on the timestamps, it didn't take 120 seconds.

Meanwhile, I'm attaching a plan B - a patch to OE - if, come Monday, this hasn't been rectified via lab reconfiguration.

Fathi Boudra (fboudra)
Changed in linaro-oe:
status: New → Triaged
importance: Undecided → High
milestone: none → 13.10
Mike Holmes (mike-holmes) wrote :

The workaround suggested above was applied (https://staging.review.linaro.org/#/c/357/) and appears to fix the issue.

Changed in linaro-networking:
importance: Undecided → High
status: New → Fix Committed
status: Fix Committed → In Progress
Mike Holmes (mike-holmes) wrote :

The lab switch configuration was modified, along with the timeout in OpenEmbedded for udhcp retries.

Changed in linaro-networking:
status: In Progress → Fix Released
Changed in lava-lab:
status: Confirmed → Fix Released
Changed in linaro-oe:
status: Triaged → Fix Released