memory/info fails on systems whose RAM is lower than 8G

Bug #1941854 reported by Andy Chi
This bug affects 1 person

Affects                    Status      Importance  Assigned to  Milestone
Checkbox Provider - Base   Expired     Undecided   Andy Chi
OEM Priority Project       Incomplete  Undecided   Andy Chi

Bug Description

[I/O log]
Results:
 /proc/meminfo reports: 7.16GiB
 lshw reports: 8GiB

FAIL: Meminfo reports 905527296 less than lshw, a difference of 10.54%. Only a variance of 10% in reported memory is allowed.

[Reproduce Steps]
1. sudo checkbox-cli run com.canonical.certification::memory/info

Why is the tolerance 10%? I have another machine with 16G of RAM that shows only 14.6G in meminfo, yet that test case passes because of the bigger denominator.
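
For reference, the FAIL line above is a plain percentage-difference check. A minimal sketch that reproduces the 10.54% figure (illustrative only, not the actual memory_compare.py code):

```
# Reproduce the 10.54% from the FAIL line above (illustrative only).
lshw_bytes = 8 * 1024**3        # 8 GiB as reported by lshw
delta_bytes = 905527296         # shortfall quoted in the FAIL line
percent = delta_bytes / lshw_bytes * 100
print(round(percent, 2))        # 10.54 -> over the 10% tolerance
```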


Revision history for this message
Andy Chi (andch) wrote :
tags: added: oem-priority originate-from-1927709 stella
Revision history for this message
Jonathan Cave (jocave) wrote :

Thresholds are set here: https://git.launchpad.net/plainbox-provider-checkbox/tree/bin/memory_compare.py#n85

According to the git history they were established in 2014 and have served their purpose since then. Let me throw the question back to you, what do you think are reasonable thresholds?

Changed in oem-priority:
status: New → Incomplete
Changed in plainbox-provider-checkbox:
status: New → Incomplete
Changed in oem-priority:
assignee: nobody → Andy Chi (andch)
Changed in plainbox-provider-checkbox:
assignee: nobody → Andy Chi (andch)
Revision history for this message
Jeff Lane  (bladernr) wrote :

So thinking back on this, here are a few comments:

1: This test has existed for a long, long time. It was (and is) intended to check to see that the amount of memory the kernel sees is reasonably close to what is physically installed on the system (per lshw). Unfortunately, "reasonably close" is difficult to define, and difficult to check for.

2: 10% variance was, at least then, reasonable to account for physical memory reallocated for things like embedded graphics that the kernel never sees. Perhaps newer embedded GPUs are using more shared memory on occasion.

3: Using a percentage was the best way at the time to accomplish this because the amount of shared RAM varies from system to system, GPU to GPU. A hard limit like 256MB for example may be perfectly valid for 50% of systems, but then the other 50% may use 384MB or 512MB (those are arbitrary numbers just for example, they do not reflect actual amounts of shared RAM).

I sometimes think about this test and wonder if there is a better way to do this, because the problem with percentages (and this also bugs me with the ethernet testing too) is, as you've observed, that the larger the total, the bigger the absolute amount that percentage represents (10% of 1GB is a lot smaller than 10% of 10GB).

As a thought, at least for this, is there a way to probe how much RAM is being consumed outside the OS by the graphics or other system overhead? That could be a good improvement if you can probe that and then subtract the amount of system shared RAM from what lshw says is installed before comparing it to what the kernel has addressed.

Anyway, just some thoughts. This is more an issue on client systems than servers as my stuff generally has very little shared ram so this test never fails.
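
A minimal sketch of that probe-and-subtract idea (a hypothetical shape, not checkbox code; one concrete source for the reserved size appears later in this thread):

```
# Sketch: remove firmware/GPU-reserved RAM from the installed total
# before comparing against what the kernel can address.
def within_tolerance(lshw_bytes, meminfo_bytes, reserved_bytes, tolerance=0.10):
    adjusted = lshw_bytes - reserved_bytes
    return abs(adjusted - meminfo_bytes) / adjusted <= tolerance
```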

Revision history for this message
Andy Chi (andch) wrote :

Hi @jeff,
I observed that on some HP laptops with an AMD CPU & GPU, the BIOS can set the video memory size. The default setting is `auto`, which uses 512 MB. If 256 MB is selected manually, memory/info passes.

Revision history for this message
jeremyszu (os369510) wrote (last edit ):

@Andy,

In this case,

please refer to something like:

$ lspci -nnv -d ::0x0302
01:00.0 3D controller [0302]: NVIDIA Corporation GP108M [GeForce MX150] [10de:1d10] (rev a1)
 Subsystem: Lenovo ThinkPad T480 [17aa:225e]
 Flags: bus master, fast devsel, latency 0, IRQ 169, IOMMU group 13
 Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
 Memory at 80000000 (64-bit, prefetchable) [size=256M] # <----- here
 Memory at 90000000 (64-bit, prefetchable) [size=32M]
 I/O ports at d000 [size=128]
 Capabilities: <access denied>
 Kernel driver in use: nvidia
 Kernel modules: nouveau, nvidia_drm, nvidia

could you please help confirm that the memory size here is the same as what you saw in the BIOS?

When implementing the solution, please consider multi-GPU cases.

To filter out the iGPU, please refer to the ACPI spec and the following:

```
jeremysu@arch [ /home/jeremysu ]
$ cat /sys/bus/pci/devices/0000\:00\:02.0/firmware_node/adr
0x00020000
jeremysu@arch [ /home/jeremysu ]
$ cat /sys/bus/pci/devices/0000\:01\:00.0/firmware_node/adr
0x00000000
```

For the GPU class, please consider all display classes (e.g. 0x0300, 0x0302, etc.).
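
A minimal sketch of that enumeration (assumes the sysfs layout shown above; not existing checkbox code):

```
import glob, os

# Sketch: find all display-class PCI devices (0x0300 VGA, 0x0302 3D, ...)
# and print their ACPI _ADR so iGPU and dGPU can be told apart.
for dev in sorted(glob.glob('/sys/bus/pci/devices/*')):
    with open(os.path.join(dev, 'class')) as f:
        pci_class = int(f.read().strip(), 16)
    if (pci_class >> 16) != 0x03:        # keep only display controllers
        continue
    adr_path = os.path.join(dev, 'firmware_node', 'adr')
    if os.path.exists(adr_path):
        with open(adr_path) as f:
            print(dev, f.read().strip())
```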

Revision history for this message
Andy Chi (andch) wrote :

@jeremy,

$ lspci -nnv -d ::0x0300
04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:1638] (rev d3) (prog-if 00 [VGA controller])
        DeviceName: Onboard IGD
        Subsystem: Hewlett-Packard Company Device [103c:8895]
        Flags: bus master, fast devsel, latency 0, IRQ 51
        Memory at 260000000 (64-bit, prefetchable) [size=256M]
        Memory at 270000000 (64-bit, prefetchable) [size=2M]
        I/O ports at 1000 [size=256]
        Memory at fb300000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

It shows 256M in lspci instead of the 512 MB in the kernel log.

[kernel log]
[ 0.870069] [drm] amdgpu: 512M of VRAM memory ready
[ 0.870072] [drm] amdgpu: 3072M of GTT memory ready.

Revision history for this message
jeremyszu (os369510) wrote :

It seems this memory region does not account for the BIOS-reserved memory, and the kernel log values are reported by amdgpu (probably obtained from firmware).

I think we need to list all the FW-reserved memory first.

Matias Piipari (mz2)
tags: added: cbox-52
Revision history for this message
Bin Li (binli) wrote :

/proc/meminfo reports: 6.61GiB
lshw reports: 8GiB

FAIL: Meminfo reports 1493368832 less than lshw, a difference of 17.39%. Only a variance of 10% in reported memory is allowed.

tags: added: originate-from-1964451 sutton
Revision history for this message
Bin Li (binli) wrote :

To keep it simple, could we just update the 6G to 8G? Thanks!

Revision history for this message
Jeff Lane  (bladernr) wrote :

Before you just change the criteria to make the test pass, can you answer the question of WHY so much memory is being shunted elsewhere?

In this case, you have a machine with 8GB of RAM, and nearly 20% of that RAM is unavailable to the OS because it's being consumed somewhere else. I'm not saying that lowering the limit just to get the test to pass is the wrong answer here, only that by loosening the failure threshold, you're likely to hide other cases where this shouldn't be happening.

In general, when I review certs, I expect that some things will fail in some cases, and in those cases I will ask questions, and either accept that or reject it based on the answers to those questions. IMO, in the case of a test that has existed for 8 years and has done its job all that time, loosening it because one machine isn't working as the test expects seems a bit premature?

Revision history for this message
Maciej Kisielewski (kissiel) wrote :

If the system reserves so much memory, this should be well documented and justified. But IMHO this should not warrant changing the thresholds for _all_ systems. If there is justification for that special system, create a custom job for it, or make the threshold customizable via configs, with the default being what has been used for years.
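
A minimal sketch of the configurable-threshold idea (the variable name is hypothetical, not an existing checkbox setting):

```
import os

# Sketch: keep the long-standing 10% default, but let a per-project
# config override it (MEMORY_INFO_TOLERANCE_PCT is a hypothetical name).
tolerance = float(os.environ.get('MEMORY_INFO_TOLERANCE_PCT', '10'))
```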

Bin Li (binli)
tags: added: originate-from-1954987
tags: added: originate-from-1958473
tags: added: originate-from-1958337
Revision history for this message
Bin Li (binli) wrote (last edit ):

I reviewed all the related bugs in the sutton project; all of the affected configs are AMD platforms. I also found failures when the RAM is bigger than 8G, so changing the tolerance to 20 for 8G systems could not fix all the issues.

On M75n I found the difference is 28.34%, because it uses 2G of shared memory for VRAM. And I could not change the value from the BIOS.

Results:
        /proc/meminfo reports: 5.73GiB
        lshw reports: 8GiB

FAIL: Meminfo reports 2434531328 less than lshw, a difference of 28.34%. Only a variance of 10% in reported memory is allowed.

[ 0.746168] amdgpu 0000:04:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[ 0.746181] [drm] Detected VRAM RAM=2048M, BAR=2048M
[ 0.746246] [drm] amdgpu: 2048M of VRAM memory ready

Revision history for this message
Bin Li (binli) wrote :

On drift3-amd, the memory is 32G. In the BIOS there is a "UMA frame buffer size" option; by default it is set to 'Auto', and there are also 1GB, 2GB, 4GB and 8GB options. When I set 1G or 2G this test case passes; 'Auto' means 4G according to dmesg. This issue looks unrelated to the Prefetchable value in lspci, which stays at 256M whatever the VRAM value is.

In this case 4G is used as the default for shared memory, which sounds reasonable; how can we avoid the failure of the memory/info test case?

Results:
        /proc/meminfo reports: 27.25GiB
        lshw reports: 32GiB

FAIL: Meminfo reports 5104971776 less than lshw, a difference of 14.86%. Only a variance of 10% in reported memory is allowed.

Mar 16 04:49:05 Drift3-AMD-6 kernel: [drm] amdgpu: 4096M of VRAM memory ready
Mar 16 04:49:05 Drift3-AMD-6 kernel: [drm] amdgpu: 4096M of GTT memory ready.

$ sudo lspci -nv | grep Prefetchable
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: 0000000830000000-00000008301fffff [size=2M]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: 0000000860000000-00000008701fffff [size=258M]

Revision history for this message
Bin Li (binli) wrote :

On golem-amd, the memory is 8G. In the BIOS there is a "UMA frame buffer size" option; by default it is set to 'Auto', and there are also 1GB, 2GB and 4GB options. 'Auto' means 1G according to dmesg.

Is it possible to compare lshw against the sum of VRAM and /proc/meminfo? (A sketch of this idea follows this comment.)

[ 1.057123] [drm] Detected VRAM RAM=1024M, BAR=1024M
[ 1.057123] [drm] RAM width 64bits DDR4
[ 1.057152] [drm] amdgpu: 1024M of VRAM memory ready
[ 1.057153] [drm] amdgpu: 3072M of GTT memory ready.

Results:
        /proc/meminfo reports: 6.61GiB
        lshw reports: 8GiB

FAIL: Meminfo reports 1493368832 less than lshw, a difference of 17.39%. Only a variance of 10% in reported memory is allowed.
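
A minimal sketch of that sum idea, pulling the VRAM size from the kernel log as in the dmesg lines above (illustrative only; reading dmesg may require root):

```
import re, subprocess

# Sketch: add the amdgpu VRAM size from the kernel log to MemTotal
# before comparing against lshw.
log = subprocess.run(['dmesg'], capture_output=True, text=True).stdout
m = re.search(r'amdgpu: (\d+)M of VRAM memory ready', log)
vram_bytes = int(m.group(1)) * 1024**2 if m else 0

with open('/proc/meminfo') as f:
    memtotal_kib = int(f.readline().split()[1])   # first line is MemTotal
total_seen_bytes = memtotal_kib * 1024 + vram_bytes   # compare this to lshw
```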

jeremyszu (os369510)
tags: added: originate-from-1938006
tags: added: originate-from-1953698
tags: added: originate-from-1958516
tags: added: originate-from-1962148
Revision history for this message
Bin Li (binli) wrote :

From 'glxinfo -B' we can get the 'Video memory' value; if we add this value to /proc/meminfo, then all the platforms on my side would pass. Thanks! (A parsing sketch follows the output below.)

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD RENOIR (DRM 3.42.0, 5.14.0-1027-oem, LLVM 12.0.0) (0x15e7)
    Version: 21.2.6
    Accelerated: yes
    Video memory: 1024MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
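
A sketch of extracting that value (note the next comment questions its accuracy):

```
import re, subprocess

# Sketch: parse the 'Video memory' line from `glxinfo -B`.
out = subprocess.run(['glxinfo', '-B'], capture_output=True, text=True).stdout
m = re.search(r'Video memory:\s*(\d+)MB', out)
video_mb = int(m.group(1)) if m else 0
```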

Revision history for this message
jeremyszu (os369510) wrote :

The value from comment #15 is GLX_RENDERER_UNIFIED_MEMORY_ARCHITECTURE_MESA, which is not exactly correct on my I+N system.

If possible, we had better get the reserved memory from kernel space.
So I am wondering why amdgpu doesn't show the reserved memory in lspci.

Revision history for this message
Bin Li (binli) wrote :

@kaihengfeng,

 Here is the full lspci. Thanks!

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

For amdgpu, there's a sysfs attribute 'mem_info_vram_total' that shows the carved-out RAM size, so please consider that in the checkbox logic.

Using BAR size as VRAM size is only accurate for discrete AMD GFX. AMD APU has its own way to decide VRAM size.

Revision history for this message
jeremyszu (os369510) wrote :

I have no way to know whether an AMDGPU device is an APU, unless amdgpu_device->flag exports it.

The "mem_info_vram_total" attribute seems to work for both AMD iGPUs and dGPUs.
Let's consider counting them by GPU vendor.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

So it's better to just use "mem_info_vram_total" - it will work regardless of integrated or discrete.
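
A minimal sketch of that (assumes the amdgpu sysfs attribute discussed above, which reports bytes):

```
import glob

# Sketch: sum mem_info_vram_total across amdgpu cards; the attribute
# works for both APUs and discrete GPUs.
vram_bytes = 0
for path in glob.glob('/sys/class/drm/card?/device/mem_info_vram_total'):
    with open(path) as f:
        vram_bytes += int(f.read().strip())
print(vram_bytes)
```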

Revision history for this message
jeremyszu (os369510) wrote :
Yao Wei (medicalwei)
tags: added: originate-from-1974175 somerville
Bin Li (binli)
tags: added: originate-from-1976476
Bin Li (binli)
tags: added: originate-from-1990217
Revision history for this message
Maksim Beliaev (beliaev-maksim) wrote :

The bug was migrated to GitHub: https://github.com/canonical/checkbox/issues/191.
The bug is no longer monitored here.

Changed in plainbox-provider-checkbox:
status: Incomplete → Expired
Yujin.Wu (eugene2021)
tags: added: thinkbook-15-g5 thinkbook-15-g5-3