memcg_test_3 from controllers in LTP timeout with Bionic/Focal/Hirsute (arm64)

Bug #1836694 reported by Po-Hsu Lin
This bug affects 2 people
Affects               Status        Importance   Assigned to   Milestone
ubuntu-kernel-tests   In Progress   Undecided    Po-Hsu Lin
linux (Ubuntu)        Invalid       Undecided    Unassigned
  Bionic              Invalid       Undecided    Unassigned
  Focal               Invalid       Undecided    Unassigned
  Hirsute             Won't Fix     Undecided    Unassigned
  Impish              Won't Fix     Undecided    Unassigned

Bug Description

This failure can be seen in the LTP test suite on a Moonshot ARM64 node with B-4.15. When the test is run manually, it sometimes passes and sometimes does not.

<<<test_start>>>
tag=memcg_test_3 stime=1563249678
cmdline="memcg_test_3"
contacts=""
analysis=exit
<<<test_output>>>
incrementing stop
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
Test timeouted, sending SIGKILL!
tst_test.c:1140: INFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1141: BROK: Test killed! (timeout?)

Summary:
passed 0
failed 0
skipped 0
warnings 0
tst_tmpdir.c:330: WARN: tst_rmdir: rmobj(/tmp/ltp-nJ05WiJDR1/06EzUc) failed: unlink(/tmp/ltp-nJ05WiJDR1/06EzUc/memcg/cgroup.clone_children) failed; errno=1: EPERM
<<<execution_status>>>

When it fails, the cleanup step that removes the temporary files fails as well, and most of the cgroup_fj_* tests then fail:
  * cgroup_fj_function_memory
  * cgroup_fj_stress_memory_10_3_each
  * cgroup_fj_stress_memory_10_3_none
  * cgroup_fj_stress_memory_10_3_one
  * cgroup_fj_stress_memory_1_200_each
  * cgroup_fj_stress_memory_1_200_none
  * cgroup_fj_stress_memory_1_200_one
  * cgroup_fj_stress_memory_200_1_each
  * cgroup_fj_stress_memory_200_1_none
  * cgroup_fj_stress_memory_200_1_one
  * cgroup_fj_stress_memory_2_2_each
  * cgroup_fj_stress_memory_2_2_none
  * cgroup_fj_stress_memory_2_2_one
  * cgroup_fj_stress_memory_2_9_each
  * cgroup_fj_stress_memory_2_9_none
  * cgroup_fj_stress_memory_2_9_one
  * cgroup_fj_stress_memory_3_3_each
  * cgroup_fj_stress_memory_3_3_none
  * cgroup_fj_stress_memory_3_3_one
  * cgroup_fj_stress_memory_4_4_each
  * cgroup_fj_stress_memory_4_4_none
  * cgroup_fj_stress_memory_4_4_one

Steps to run this:
  git clone --depth=1 https://github.com/linux-test-project/ltp.git
  cd ltp; make autotools; ./configure; make; sudo make install
  echo "memcg_test_3 memcg_test_3" > /tmp/jobs
  sudo /opt/ltp/runltp -f /tmp/jobs

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-54-generic 4.15.0-54.58
ProcVersionSignature: Ubuntu 4.15.0-54.58-generic 4.15.18
Uname: Linux 4.15.0-54-generic aarch64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jul 16 03:46 seq
 crw-rw---- 1 root audio 116, 33 Jul 16 03:46 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.6
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Tue Jul 16 04:16:14 2019
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
PciMultimedia:

ProcFB:

ProcKernelCmdLine: console=ttyS0,9600n8r ro
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-54-generic N/A
 linux-backports-modules-4.15.0-54-generic N/A
 linux-firmware 1.173.8
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Po-Hsu Lin (cypressyew)
tags: added: 4.15 sru-20190701 ubuntu-ltp
Po-Hsu Lin (cypressyew) wrote : Re: memcg_test_3 from controllers in LTP failed on Moonshot ARM64 with Bionic

This issue can be found on B-gcp as well.

Setting LTP_TIMEOUT_MUL to 3 makes it pass.

On GCP it took:
real 8m50.670s
user 1m54.287s
sys 4m39.889s

so LTP_TIMEOUT_MUL=3 is quite enough.

The next step is to see whether this fix works for Moonshot ARM64.
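
As a rough check of the GCP numbers, here is a minimal sketch. It assumes the effective limit is simply the 5-minute per-run timeout from the test output multiplied by LTP_TIMEOUT_MUL. The helper names to_seconds and fits_timeout are made up for illustration and are not part of LTP.

```shell
# Hypothetical helpers, not part of LTP.

# Convert a `time`-style duration such as "8m50.670s" to whole seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%d\n", $1 * 60 + $2 }'
}

# Succeed if the measured run time fits within base timeout * multiplier.
# Usage: fits_timeout <duration> <base_seconds> <multiplier>
fits_timeout() {
    [ "$(to_seconds "$1")" -le $(( $2 * $3 )) ]
}

# 8m50.670s is 530 s, and 300 s * 3 = 900 s.
if fits_timeout 8m50.670s 300 3; then
    echo "fits within 15 minutes"
fi
```

For an actual run, the variable can be set on the command line, for example `sudo LTP_TIMEOUT_MUL=3 /opt/ltp/runltp -f /tmp/jobs`. Whether sudo passes the variable through depends on the local sudoers policy.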

Po-Hsu Lin (cypressyew) wrote :

On Moonshot ARM64 the test result is unstable: sometimes it passes within 2 minutes, sometimes it takes 8, and some runs even time out with a 30-minute threshold.

I think it's better to fix this only for virtual environments for now.

Po-Hsu Lin (cypressyew) wrote :

This has also timed out on AWS a1.medium (arm64) with the B-4.15 AWS kernel.
It looks like this instance is not a VM.

Sean Feole (sfeole)
tags: added: sru-20200106
Kelsey Steele (kelsey-steele) wrote :

spotted on Focal aws : 5.4.0-1026.26 : amd64 t2.small

tags: added: 4.5 aws sru-20200921
tags: added: sru-20200921azure
removed: sru-20200921
Po-Hsu Lin (cypressyew)
tags: added: azure sru-20200921
removed: sru-20200921azure
tags: added: sru-20210104
tags: added: hwe-5.8 sru-20210222
tags: added: sru-20210412
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Krzysztof Kozlowski (krzk) wrote :

Could be similar to lp:1899465

Kleber Sacilotto de Souza (kleber-souza) wrote :

Found with hirsute/linux 5.11.0-35.37 on node rizzo (amd64).

tags: added: 5.11 hirsute
Kleber Sacilotto de Souza (kleber-souza) wrote :

Found also with focal/linux 5.4.0-85.95 -lowlatency on node fili (amd64).

tags: added: 5.4 focal
summary: - memcg_test_3 from controllers in LTP failed on Moonshot ARM64 with
- Bionic
+ memcg_test_3 from controllers in LTP failed with Bionic/Focal/Hirsute
Changed in linux (Ubuntu Focal):
status: New → Confirmed
Changed in linux (Ubuntu Hirsute):
status: New → Confirmed
Po-Hsu Lin (cypressyew) wrote : Re: memcg_test_3 from controllers in LTP failed with Bionic/Focal/Hirsute

With the controllers tests moved out of ubuntu_ltp, the solution in comment #3 was removed by the following commit: https://kernel.ubuntu.com/git/ubuntu/autotest-client-tests.git/commit/?id=90a399748c05b9dbb13a7520ae0329022d375d58

We need to re-evaluate whether to add this to ubuntu_ltp_controllers.

Po-Hsu Lin (cypressyew) wrote :

This failure can be found on F-aws 5.4.0-1058.61 a1.medium

With this multiplier the test can pass on this VM, otherwise it won't.

Po-Hsu Lin (cypressyew) wrote :

This failure can be found on the ARM64 bare-metal node appleton-kernel with H-5.11. We need to see how long this test takes there.

Po-Hsu Lin (cypressyew) wrote :

Without this multiplier, this can be spotted on F-aws 5.4.0-1058.61 t2.small as well

Po-Hsu Lin (cypressyew) wrote :

Without this multiplier, this can be spotted on H-gcp-5.11 / F-gcp-5.4 with g1.small

Testing with uncommitted changes on obruchev / garlog.

Po-Hsu Lin (cypressyew) wrote :

It takes about 1989 seconds (~33 min) to run on a g1.small instance.

Po-Hsu Lin (cypressyew) wrote :

More test results. It takes about:

* 50 minutes to run on F-gcp-5.11 g1.small
* 38 minutes to run on B-gcp-4.15 g1.small
* 40 minutes to run on H-gcp-5.11 g1.small
* 36 minutes to run on X-gcp-4.15 g1.small

Also, it will take about 7m30s on I-azure-5.13 Standard_B1ms

So I think a 60 min timeout (LTP_TIMEOUT_MUL=12) should be OK.
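
The choice of 12 can be sanity-checked with a ceiling division. This is only a sketch: it assumes the 5-minute (300 s) default timeout seen in the test output, and min_multiplier is a hypothetical helper, not an LTP tool.

```shell
# Hypothetical helper: smallest integer multiplier whose timeout
# (base * multiplier) is at least the observed run time.
# Usage: min_multiplier <runtime_seconds> <base_seconds>
min_multiplier() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Slowest run reported on g1.small: 50 minutes = 3000 s.
min_multiplier 3000 300
```

Under these assumptions the slowest reported run needs a multiplier of at least 10, so 12 (60 minutes) leaves about 10 minutes of headroom.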

Changed in ubuntu-kernel-tests:
status: New → In Progress
assignee: nobody → Po-Hsu Lin (cypressyew)
summary: - memcg_test_3 from controllers in LTP failed with Bionic/Focal/Hirsute
+ memcg_test_3 from controllers in LTP timeout with Bionic/Focal/Hirsute
Krzysztof Kozlowski (krzk) wrote (last edit ): Re: memcg_test_3 from controllers in LTP timeout with Bionic/Focal/Hirsute

Also on: 2021.09.27/bionic/linux-azure-4.15/4.15.0-1125.138 (Standard_B1ms, Standard_D48_v3)

tags: added: hinted
Po-Hsu Lin (cypressyew) wrote :

Test from AWS:

* 8 minutes to run on H-aws-5.11 a1.medium
* 8 minutes to run on F-aws-5.4 t2.small
* 8 minutes to run on F-aws-5.11 a1.medium
* 8 minutes to run on F-aws-5.11 t2.small

Po-Hsu Lin (cypressyew) wrote :
Krzysztof Kozlowski (krzk) wrote :

Found on:
2021.09.27/bionic/linux-azure-fips/4.15.0-2037.41
2021.09.27/bionic/linux-azure-4.15/4.15.0-1125.138

Po-Hsu Lin (cypressyew) wrote :

Hi Krzysztof,
Thanks for reporting this.
I retested the failing instances from your comment #21 (Standard_B1ms, Standard_D48_v3) with the updated code. Standard_B1ms passes without any error, but Standard_D48_v3 fails with bug 1946201 instead.

Po-Hsu Lin (cypressyew) wrote :

Note that the fix in comment #20 only targets VM instances.

The ARM64 bare-metal node appleton-kernel is still failing.

In a manual run with a 1h timeout, the test still timed out on appleton-kernel with Focal 5.11.0-38-generic-64k. (This test was not executed with the generic ARM64 kernel on this node.)

This needs to be investigated.

Po-Hsu Lin (cypressyew)
Changed in ubuntu-kernel-tests:
assignee: Po-Hsu Lin (cypressyew) → nobody
status: In Progress → Confirmed
Po-Hsu Lin (cypressyew) wrote :

Spotted on 5.4.0-1061.64~18.04.1

summary: memcg_test_3 from controllers in LTP timeout with Bionic/Focal/Hirsute
+ (arm64)
tags: added: sru-20211129
Brian Murray (brian-murray) wrote :

The Hirsute Hippo has reached End of Life, so this bug will not be fixed for that release.

Changed in linux (Ubuntu Hirsute):
status: Confirmed → Won't Fix
Brian Murray (brian-murray) wrote :

Ubuntu 21.10 (Impish Indri) has reached end of life, so this bug will not be fixed for that specific release.

Changed in linux (Ubuntu Impish):
status: Confirmed → Won't Fix
Po-Hsu Lin (cypressyew) wrote :

This memcg_test_3 failure on ARM64 instances no longer causes the cgroup_fj_* tests to fail; it just times out with the 30-second threshold.

We will need to test it again manually.

Po-Hsu Lin (cypressyew) wrote (last edit ):

A test on the B-aws arm64 instance a1.medium shows that it takes about 500 seconds (~8 minutes) to finish, which is above our current 5-minute threshold.

As a1.medium is not a VM, it does not benefit from the LTP_TIMEOUT_MUL setting we have.
I will modify this part to see how it goes.

Po-Hsu Lin (cypressyew) wrote :

Setting the timeout multiplier unconditionally [1] solves this issue on AWS instances, but testing shows that the ARM64 bare-metal node appleton-kernel still hits this timeout with the jammy lowlatency kernel, even with a 1 hr timeout.

We will need some manual test results on this node before adjusting this.

[1] https://git.launchpad.net/~canonical-kernel-team/+git/autotest-client-tests/commit/?id=a4905b2f93d933fb31314ea7aadbc42d1e7403e8

Po-Hsu Lin (cypressyew) wrote :

One interesting finding on ARM64 node appleton-kernel with Jammy 5.15.0-46-generic

If you run the memcg_test_3 directly, it's fairly quick to finish:

However, if you run the cgroup test first (following the order in the controllers suite), it times out even with a 2hr threshold.

controllers suite test sequence:
#DESCRIPTION:Resource Management testing
cgroup cgroup_regression_test.sh
memcg_regression memcg_regression_test.sh
memcg_test_3 memcg_test_3
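
To reproduce this ordering outside the full suite, the job file from the bug description can be extended to the three entries listed above. This is a minimal sketch; write_controllers_jobs is a hypothetical helper and /tmp/jobs is the path used in the original steps.

```shell
# Hypothetical helper: write a runltp job file that runs the cgroup
# regression tests before memcg_test_3, in the order listed above.
# Usage: write_controllers_jobs <jobfile>
write_controllers_jobs() {
    cat > "$1" <<'EOF'
cgroup cgroup_regression_test.sh
memcg_regression memcg_regression_test.sh
memcg_test_3 memcg_test_3
EOF
}

# Example (not executed here):
#   write_controllers_jobs /tmp/jobs
#   sudo /opt/ltp/runltp -f /tmp/jobs
```

Running the same job file with only the last line should show whether the slowdown depends on the preceding cgroup tests.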

Luke Nowakowski-Krijger (lukenow) wrote :

So this timeout is reproduced only on ARM64, and only when it's run in the cgroup -> memcg test order? I'll also look into this issue.

Po-Hsu Lin (cypressyew)
tags: added: ubuntu-ltp-controllers
removed: ubuntu-ltp
Po-Hsu Lin (cypressyew) wrote :

Note that with the LTP test suite update, this test no longer causes the cgroup_fj_* tests to fail.

Now it only fails on the GCP g1-small instance. There the test takes just over an hour (1h2m), which is slightly above our current threshold (120x, actual timeout is 60m).

I think we can bump it to 150x (75m).
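
The multipliers here map to minutes as sketched below. The 30-second base is an assumption inferred from this thread (120x giving 60m, and the 30-second threshold mentioned earlier); timeout_minutes is a hypothetical helper.

```shell
# Hypothetical helper: effective timeout in whole minutes.
# Usage: timeout_minutes <base_seconds> <multiplier>
timeout_minutes() {
    echo $(( $1 * $2 / 60 ))
}

# Compare the current and proposed multipliers with a 30 s base.
for mul in 120 150; do
    echo "${mul}x -> $(timeout_minutes 30 "$mul")m"
done
```

With these assumptions the 1h2m (62-minute) run exceeds the 60-minute limit but fits within 75 minutes.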

Po-Hsu Lin (cypressyew)
Changed in ubuntu-kernel-tests:
status: Confirmed → In Progress
assignee: nobody → Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu Bionic):
status: Confirmed → Invalid
Changed in linux (Ubuntu Focal):
status: Confirmed → Invalid
Changed in linux (Ubuntu):
status: Confirmed → Invalid