autopkgtests sometimes timeout

Bug #1885186 reported by dann frazier
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
edk2 (Ubuntu)
Fix Released
Undecided
Unassigned
Eoan
Won't Fix
Undecided
Unassigned
Focal
Fix Released
Undecided
Mustafa Kemal Gilor
Groovy
Fix Released
Undecided
Unassigned

Bug Description

[ Impact ]

edk2 autopkgtests fail intermittently, causing builds to fail.

The autopkgtests run QEMU to boot to an EFI prompt, and then power off. The run for each of the supported archs (ARM, AARCH64 & X64). The failure symptom is that 1 or more of these tests times out after not seeing a prompt for 30s.

The root cause of timeouts is unknown and could be related to the resource status in a busy/stressed execution environment.

The SRU includes a patch that aims to eliminate the timeout-related flakiness of the autopkgtests by increasing the timeout value from the 30s to 60s.

[ Test Plan ]

Considering the fact that the problem is observed in automated builds and is intermittent in nature, the best way to verify the change would be to watch future Focal build failures caused by test timeouts. Test timeout-related failures in Focal builds should decrease after this patch is landed.

[ Where problems could occur ]

Given that the changes are in the test code, there is no possibility of regression for end users.

[Original bug description]

edk2 autopkgtests fail intermittently - but the frequency has increased quite a bit recently. The autopkgtests run QEMU to boot to an EFI prompt, and then power off. The run for each of the supported archs (ARM, AARCH64 & X64). The failure symptom is that 1 or more of these tests times out after not seeing a prompt for 30s.

Revision history for this message
dann frazier (dannf) wrote :

It's not clear to me what the pattern is here, so I wrote a script to generate a csv file of the failures, including how long each test took and which ones timed out.

Revision history for this message
dann frazier (dannf) wrote :

And here's the resulting csv from that script.

Revision history for this message
dann frazier (dannf) wrote :

The failures don't appear to follow an Ubuntu release, a host architecture, or the guest architecture being tested. So did something change externally? Looking at the data:

Prior to June 2020, we only saw a 2.7% failure rate (3/112).
Starting in June, that's grown to 25% (8/32). This is across all releases, so it's unlikely due to a change in edk2 itself (eoan/focal releases haven't changed recently).

Perhaps something changed with our test systems. Just looking at armhf tests, time to run a test seems to have gone up, on average, since then - +6s/+3s/+6s for AARCH64/ARM/X86 guests, respectively. What's also interesting is that, again just looking at armhf, 100% of the time (7/7) we had a timeout, there was another test very close to timing out (>= 30s). When no timeouts occurred, only 13% (6/46) had any test close to timing out.

What that says to me is that this is unlikely to be an actual hang. Rather, sometimes we get a test host w/ less available resources[*], and the odds of that have gone up recently for some reason. I'll therefore plan to bump the timeout to 60s to see if that resolves the issue.

[*] Hard to say what resource(s) could be limited here - one theory is that QEMU is waiting on available entropy, but it could just be CPU.

dann frazier (dannf)
description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for spawning such a quality analysis out of my trivial IRC ping!

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package edk2 - 2020.05-2ubuntu1

---------------
edk2 (2020.05-2ubuntu1) groovy; urgency=medium

  * Increase autopkgtest timeout from 30s to 60s. LP: #1885186.

 -- dann frazier <email address hidden> Thu, 25 Jun 2020 16:17:15 -0600

Changed in edk2 (Ubuntu Groovy):
status: New → Fix Released
Revision history for this message
dann frazier (dannf) wrote :

The initial autopkgtest results after bumping the timeout to 60s passed, but each test ran within a time length that may have passed even with the previous 30s timeout (longest case I see was 32s). Before (low-impact) SRU'ing this back, ideally we'd be able to point to a test that clearly blew passed the 30s timeout, but still eventually completed. On the other hand, waiting for such proof to appear naturally in groovy before SRU'ing back could result in more unnecessary failures. While I could be swayed either way, I think I'll wait a bit for a demonstrable case in groovy.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Have we tried adding edk2 source package, to "big_packages" configuration of autopkgtest-cloud to spin up larger VM for edk2 tests to run? I.e. a multi-core one. As the default VM is single core I think.

Revision history for this message
dann frazier (dannf) wrote :

This build log shows tests running for ~40s:

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-groovy/groovy/armhf/e/edk2/20200807_201106_d36d2@/log.gz

This provides the evidence I was looking for in Comment #6 that the tests can take longer than 30s and still pass. I therefore think it worth including this fix in a future SRU.

Revision history for this message
Brian Murray (brian-murray) wrote :

The Eoan Ermine has reached end of life, so this bug will not be fixed for that release

Changed in edk2 (Ubuntu Eoan):
status: New → Won't Fix
Changed in edk2 (Ubuntu Focal):
status: New → In Progress
assignee: nobody → Mustafa Kemal Gilor (mustafakemalgilor)
Revision history for this message
Mustafa Kemal Gilor (mustafakemalgilor) wrote :

-- SRU paperwork for Focal --

[ Impact ]

edk2 autopkgtests fail intermittently, causing builds to fail.

The autopkgtests run QEMU to boot to an EFI prompt, and then power off. The run for each of the supported archs (ARM, AARCH64 & X64). The failure symptom is that 1 or more of these tests times out after not seeing a prompt for 30s.

The root cause of timeouts is unknown and could be related to the resource status in a busy/stressed execution environment.

The SRU includes a patch that aims to eliminate the timeout-related flakiness of the autopkgtests by increasing the timeout value from the 30s to 60s.

[ Test Plan ]

Considering the fact that the problem is observed in automated builds and is intermittent in nature, the best way to verify the change would be to watch future Focal build failures caused by test timeouts. Test timeout-related failures in Focal builds should decrease after this patch is landed.

[ Where problems could occur ]

Given that the changes are in the test code, there is no possibility of regression for end users.

description: updated
Revision history for this message
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello dann, or anyone else affected,

Accepted edk2 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/edk2/0~20191122.bd85bf54-2ubuntu3.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in edk2 (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Mustafa Kemal Gilor (mustafakemalgilor) wrote :

Verification OK on Focal, I've run autopkgtests 5 times from scratch, no timeouts.

tags: added: verification-done-focal
removed: verification-needed verification-needed-focal
Revision history for this message
Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for edk2 has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package edk2 - 0~20191122.bd85bf54-2ubuntu3.4

---------------
edk2 (0~20191122.bd85bf54-2ubuntu3.4) focal; urgency=medium

  [ dann frazier ]
  * Provide 4MB OVMF images: The existing 2MB images no longer
    have sufficient variable space for the current Secure Boot
    Forbidden Signature Database. (LP: #1885662)
    - Convert targets for pre-enrolled variable template images
      into pattern rules. This will be useful for adding additional
      pre-enrolled variable templates.
    - Update fw descriptors to reference 4M images instead of their
      2M counterparts. This will migrate tools that use the descriptor
      interface (like libvirt) over to the 4M images when creating new
      VMs. Existing 2M VMs will require manual migration.
  * Increase autopkgtest timeout from 30s to 60s. (LP: #1885186)

  [ Mustafa Kemal Gilor ]
  * Added autopkg tests for 4MB OVMF images. (LP: #1885662)

 -- Mustafa Kemal GILOR <email address hidden> Tue, 08 Nov 2022 11:40:07 +0300

Changed in edk2 (Ubuntu Focal):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.