Timeout being hit when using remote and testruns involve lengthy tests that make machine go silent

Bug #1873053 reported by Jeff Lane 
This bug affects 1 person
Affects: Next Generation Checkbox (CLI)
Status: Invalid
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

UPDATE: this happens any time there's a lengthy test run that makes a machine go silent for too long. I have some pretty large machines that could go silent for a very long time due to network, memory or CPU testing. I am now seeing this affect some machines during cpu stress testing as well as during network tests.

Unfortunately, this is still blocking me from being able to do full regression testing.

Original description:

I think I'm hitting a timeout with checkbox-remote in the polling it does when a target disappears. The scenario is that remote is talking to the target via NIC-1. During testing of NIC-2, NIC-3 and NIC-4, the system may disappear from the network for several hours (the network testing is 1 hour per port minimum), so it's possible that NIC-1 will disappear for at least 3 hours in this scenario.

So what appears to be happening is that the test is initiated by checkbox-remote, the network test fires, and because NIC-1 is gone for longer than $TIMEOUT, checkbox just stops polling and says that the connection is now lost, killing the session.

We need to be able to either alter this timeout, or disable it entirely, something like:

checkbox-cli master --polling-timeout=18000 IP_ADDRESS LAUNCHER

so that the master will wait a full 5 hours before declaring the session dead. Depending on the network config of the SUT, this is an entirely common possibility, and in some cases, it's possible that the network test could cause the system to disappear for 12 hours or longer (given a large enough number of multi-port NICs installed).
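
The behaviour requested above can be sketched as a polling loop with a configurable deadline. This is only an illustration, not checkbox-ng code: `wait_for_target`, `is_reachable` and the parameter names are hypothetical, and `--polling-timeout` is the option proposed in this report, not an existing flag.

```python
import time
from typing import Callable, Optional


def wait_for_target(
    is_reachable: Callable[[], bool],
    polling_timeout: Optional[float],
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the target answers or polling_timeout seconds elapse.

    Hypothetical helper. polling_timeout=None disables the timeout
    entirely (the other option asked for in the report).
    Returns True if the target came back, False if the session
    should be declared dead.
    """
    start = clock()
    while True:
        if is_reachable():
            return True
        if polling_timeout is not None and clock() - start >= polling_timeout:
            return False
        sleep(interval)


# 18000 seconds is the 5-hour window from the example command above.
FIVE_HOURS = 5 * 60 * 60
```

With a 3-port network test at 1 hour per port, a call such as `wait_for_target(ping_sut, FIVE_HOURS)` (where `ping_sut` is whatever reachability check the master uses, also an assumption here) would keep the session alive through the 3 hours of silence instead of giving up early.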

Jeff Lane  (bladernr)
summary: Timeout being hit when using remote and target machine disappears for
- too long
+ network testing
Jeff Lane  (bladernr) wrote : Re: Timeout being hit when using remote and target machine disappears for network testing

Soooo... any chance this can be fixed soon?

Changed in checkbox-ng:
importance: Undecided → Medium
Changed in checkbox-ng:
milestone: none → 1.11.0
Changed in checkbox-ng:
status: New → Fix Released
Changed in checkbox-ng:
milestone: 1.11.0 → 1.12.0
status: Fix Released → New
Jeff Lane  (bladernr) wrote :

Is there an MR that shows the fix for this?

Sylvain Pineau (sylvain-pineau) wrote :

Not yet. It was accidentally set to Fix Released because I'd assigned it to the 1.11 milestone, but I used that milestone to release a hotfix for checkbox remote. Be patient, it's scheduled for 1.12!

Jeff Lane  (bladernr) wrote : Re: [Bug 1873053] Re: Timeout being hit when using remote and target machine disappears for network testing

ahhh ok. No worries, I just got excited for a moment :D

Thanks for letting me know.

On Wed, Sep 30, 2020 at 12:55 PM Sylvain Pineau
<email address hidden> wrote:
>
> Not yet, It was accidentally set as Fix released because I've assigned
> it to the 1.11 milestone. But I used it to release a hotfix for checkbox
> remote. Be patient, it's scheduled for 1.12!


Jeff Lane  (bladernr)
summary: - Timeout being hit when using remote and target machine disappears for
- network testing
+ Timeout being hit when using remote and testruns involve lengthy tests
+ that make machine go silent
description: updated
tags: added: blocks-hwcert-server
Jeff Lane  (bladernr)
Changed in checkbox-ng:
importance: Medium → High
Jeff Lane  (bladernr) wrote :

This has been blocking me for 8 months, so I'm moving it to High. It directly impacts my ability to run tests via checkbox remote.

stress-ng: info: [1274262] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info: [1274262] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info: [1274262] stress-ng-stream: Using CPU cache size of 4096K

ERROR: Output timeout reached! (3600s)

So I do want to ask when will 1.12 appear?

I even thought I'd try taking a stab at this myself, just to find that timeout value... but grepping through pretty much every checkbox and plainbox source tree I have cloned locally results in nothing. The error message I see doesn't exist:

bladernr@galactica:~/development$ grep -r "Output timeout reached" checkbox*
bladernr@galactica:~/development$ grep -r "Output timeout reached" plainbox*
bladernr@galactica:~/development$

Likewise, grepping for "3600", the timeout in the error message, turns up nothing relevant, though it does at least return a lot of irrelevant hits:
https://paste.ubuntu.com/p/GFqZtftScx/

Finally, I tried grepping for that string through every single code tree I have locally, most of which are completely unrelated (kernels, other projects, etc.):
bladernr@galactica:~/development$ grep -r "Output timeout reached" *
bladernr@galactica:~/development$

So I don't have whatever source code that error is coming from, and cannot find anywhere in checkbox-ng or any provider I use that would set up a 3600s timeout for checkbox remote.

Changed in checkbox-ng:
milestone: 1.12.0 → 1.13.0
Changed in checkbox-ng:
milestone: 1.13.0 → none
Lukas Waymann (meribold) wrote :

> ERROR: Output timeout reached! (3600s)

That's Testflinger!

Lukas Waymann (meribold) wrote :

You need to add `output_timeout` to your Testflinger agent config file, or increase it if it is already there. If you also specify `output_timeout` in your YAML file for `testflinger submit`, you need to increase it there, too. Testflinger takes the minimum of those two values, with a default of 900 if neither is specified.
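
The rule described above can be written out as a small function. This is one reading of the comment, not code taken from Testflinger: `effective_output_timeout` and its parameter names are hypothetical, and it assumes that an unset agent value falls back to the 900-second default, which then also caps whatever the job requests.

```python
from typing import Optional

DEFAULT_OUTPUT_TIMEOUT = 900  # seconds; the default quoted in the comment above


def effective_output_timeout(
    agent_value: Optional[int], job_value: Optional[int]
) -> int:
    """Assumed rule: the agent setting (or the default) caps the job's value."""
    agent_limit = DEFAULT_OUTPUT_TIMEOUT if agent_value is None else agent_value
    requested = agent_limit if job_value is None else job_value
    return min(requested, agent_limit)
```

Under that assumption, raising the value only in the job YAML is not enough: `effective_output_timeout(None, 18000)` still gives 900, while setting 18000 in both places gives 18000.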

Lukas Waymann (meribold)
Changed in checkbox-ng:
status: New → Incomplete
Jonathan Cave (jocave) wrote :

Based on the comments above, it sounds like this problem was encountered when running Checkbox from a Jenkins job using Testflinger, i.e. we apply timeouts within Jenkins to prevent jobs from blocking access to devices.

As a result, I don't think this is a bug in Checkbox itself, so I will mark it Invalid. If you have evidence to the contrary, please re-open.
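
To illustrate why a healthy but silent test trips an output timeout like the one in the error above, here is a minimal sketch of a silence watchdog. It is not Testflinger or Jenkins code; `first_output_timeout` and its arguments are hypothetical, and times are seconds counted from the start of the job.

```python
from typing import Iterable, Optional


def first_output_timeout(
    output_times: Iterable[float], end_time: float, output_timeout: float
) -> Optional[float]:
    """Return when a silence watchdog would fire, or None if it never does.

    output_times: moments at which the job printed something.
    end_time: moment at which the job would finish on its own.
    """
    last_output = 0.0
    for t in sorted(output_times):
        # A gap longer than the limit means the watchdog fired first.
        if t - last_output > output_timeout:
            return last_output + output_timeout
        last_output = t
    if end_time - last_output > output_timeout:
        return last_output + output_timeout
    return None
```

For a stress test that prints its banner at 10 s and nothing more until 7200 s, `first_output_timeout([10, 7200], 7300, 3600)` reports that a 3600 s limit fires at 3610 s, while a limit of 18000 s lets the same run finish.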

Changed in checkbox-ng:
status: Incomplete → Invalid