checkbox-remote is SLOW when running tests

Bug #1855683 reported by Jeff Lane 
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Next Generation Checkbox (CLI)
Fix Released
High
Sylvain Pineau

Bug Description

Discussed briefly while in the Taipei sprint but when runnign tests via checkbox-remote Ive noticed that the tests are insanely slow.

For example, this is running the test-firmware launcher in the server provider manually, locally on a machine:

ubuntu@jellyplum:~$ time test-firmware
=========================[ Resume Incomplete Session ]==========================
There are 0 incomplete sessions that might be resumed
Preparing...
<SNIP>
real 2m33.180s
user 1m48.855s
sys 0m37.771s

And this is the same test run via checkbox-remote locally:

==================================[ Results ]=================================== (H) Help
file:///home/ubuntu/.local/share/checkbox-ng/submission_2019-12-09T07.47.08.062908.html
file:///home/ubuntu/.local/share/checkbox-ng/submission_2019-12-09T07.47.08.062908.junit.xml
file:///home/ubuntu/.local/share/checkbox-ng/submission_2019-12-09T07.47.08.062908.tar.xz
file:///home/ubuntu/.local/share/checkbox-ng/submission_2019-12-09T07.47.08.062908.xlsx
Do you want to submit 'upload to certification' report?
  y => yes
  n => no
[yn]: n

real 24m50.807s
user 0m23.067s
sys 0m4.435s

this looked initially like it was due to the sosreport job, since that generates a large tarball that is then attached to the submission via the attachment plugin, which with remote, transmits this across whatever remote is using to transfer data from the slave to the master. So I retried this by removing the sosreport jobs.

That seems to have made a huge difference, getting times back down to locally non-remote run attempts:
real 2m52.436s
user 0m15.049s
sys 0m3.158s

So now one last verification, I run checkbox remote over the internet without the sosreport job, and the same test plan that just took 3 minutes is still running after over an hour.

Related branches

Revision history for this message
Jeff Lane  (bladernr) wrote :

After a lot of discussion in Taipei, joc thinks this could be related to just bottlenecks with RPyC and the amount of data being passed back and forth during testing with Checkbox Remote and testflinger connections over the internet.

Also, after talking with Andy, I bumped this to High for our sake because this is going to seriously affect regression testing for us and that's a huge focal point for 2020. We'll be expected quite soon to be testing servers remotely in earnest and this definitely blocks that if a test run takes orders of magnitude longer when run via Checkbox Remote.

Changed in checkbox-ng:
importance: Undecided → High
Revision history for this message
Jeff Lane  (bladernr) wrote :

Additionally, while testing a full server certification run while in TPE, we found that when running remote, this bottleneck increases test time by a significant amount, and for a test run that could take 30 hours or longer when run locally, we're taking about a test time of days due to this issue.

For example, this run took 30.6 hours to run locally:

https://certification.canonical.com/hardware/201912-27576/submission/155985/

Given what I've seen with remote, it would easily have taken 300 hours (assuming an order of magnitude given the 2m vs 24m observed times before) if we tried running this via checkbox-remote.

Revision history for this message
Jeff Lane  (bladernr) wrote :

So just out of curiosity, any idea of a timeline or even hopefulness of getting this worked out?

Changed in checkbox-ng:
assignee: nobody → Sylvain Pineau (sylvain-pineau)
status: New → In Progress
Changed in checkbox-ng:
milestone: none → 1.8.0
Changed in checkbox-ng:
status: In Progress → Fix Committed
Revision history for this message
Jeff Lane  (bladernr) wrote :

Seems much better. This is my testflinger-sample launcher run locally:

real 9m7.680s
user 2m18.126s
sys 0m20.739s

And this is run via remote:
real 29m8.282s
user 0m3.765s
sys 0m0.652s

Still a lot longer, but I think that's a lot better than before.

Now I'll do a full test run and see how long that takes.

Revision history for this message
Sylvain Pineau (sylvain-pineau) wrote :

I think the extra time is still due to the sosreport transport back to master. Could you check again w/o this attachment.

"So now one last verification, I run checkbox remote over the internet without the sosreport job, and the same test plan that just took 3 minutes is still running after over an hour."

I'd like to confirm it's not lasting 1h again.

Revision history for this message
Jeff Lane  (bladernr) wrote :

These are the times with the sosreport job removed:
This is the attached launcher run locally:
real 8m38.884s
user 2m22.455s
sys 0m19.952s

And this is the attached launcher run via remote (from a container on the testflinger agent host):
real 26m39.128s
user 0m3.601s
sys 0m0.612s

And this is the same launcher run locally WITH the sosreport attach job added back in:
real 9m2.055s
user 2m27.579s
sys 0m20.608s

And this is via remote with the sosreport attach job added back in:
real 33m3.120s
user 0m4.352s
sys 0m0.848s

Revision history for this message
Sylvain Pineau (sylvain-pineau) wrote :

I'm wondering if the extra report defined in the launcher could be the problem.
Jeff are you able to say which part costs the most? test execution or report generation?

Revision history for this message
Jeff Lane  (bladernr) wrote : Re: [Bug 1855683] Re: checkbox-remote is SLOW when running tests
Download full text (3.5 KiB)

From watching it it looks like report generation. IT seems to get by
pretty fast through the test run, then takes a long while at the end
when it's creating the report files (the very end when it's generating
files and dumping links to them on the console)

On Wed, Apr 8, 2020 at 3:51 AM Sylvain Pineau
<email address hidden> wrote:
>
> I'm wondering if the extra report defined in the launcher could be the problem.
> Jeff are you able to say which part costs the most? test execution or report generation?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1855683
>
> Title:
> checkbox-remote is SLOW when running tests
>
> Status in Next Generation Checkbox (CLI):
> Fix Committed
>
> Bug description:
> Discussed briefly while in the Taipei sprint but when runnign tests
> via checkbox-remote Ive noticed that the tests are insanely slow.
>
> For example, this is running the test-firmware launcher in the server
> provider manually, locally on a machine:
>
> ubuntu@jellyplum:~$ time test-firmware
> =========================[ Resume Incomplete Session ]==========================
> There are 0 incomplete sessions that might be resumed
> Preparing...
> <SNIP>
> real 2m33.180s
> user 1m48.855s
> sys 0m37.771s
>
> And this is the same test run via checkbox-remote locally:
>
>
> ==================================[ Results ]=================================== (H) Help
> file:///home/ubuntu/.local/share/checkbox-ng/submission_2019-12-09T07.47.08.062908.html
> file:///home/ubuntu/.local/share/checkbox-ng/submission_2019-12-09T07.47.08.062908.junit.xml
> file:///home/ubuntu/.local/share/checkbox-ng/submission_2019-12-09T07.47.08.062908.tar.xz
> file:///home/ubuntu/.local/share/checkbox-ng/submission_2019-12-09T07.47.08.062908.xlsx
> Do you want to submit 'upload to certification' report?
> y => yes
> n => no
> [yn]: n
>
> real 24m50.807s
> user 0m23.067s
> sys 0m4.435s
>
> this looked initially like it was due to the sosreport job, since that
> generates a large tarball that is then attached to the submission via
> the attachment plugin, which with remote, transmits this across
> whatever remote is using to transfer data from the slave to the
> master. So I retried this by removing the sosreport jobs.
>
> That seems to have made a huge difference, getting times back down to locally non-remote run attempts:
> real 2m52.436s
> user 0m15.049s
> sys 0m3.158s
>
> So now one last verification, I run checkbox remote over the internet
> without the sosreport job, and the same test plan that just took 3
> minutes is still running after over an hour.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/checkbox-ng/+bug/1855683/+subscriptions
>
> Launchpad-Notification-Type: bug
> Launchpad-Bug: product=checkbox-ng; milestone=1.8.0; status=Fix Committed; importance=High; <email address hidden>;
> Launchpad-Bug-Information-Type: Public
> Launchpad-Bug-Private: no
> Launchpad-Bug-Security-Vulnerability: no
> Launchpad-Bug-C...

Read more...

Revision history for this message
Jeff Lane  (bladernr) wrote :

OK locally ran the "full" run:
real 1126m57.675s
user 2873m57.566s
sys 1388m34.364s

I'll run it remote now to see if the difference remains orders of magnitude.

Changed in checkbox-ng:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.