memory/info fails on systems whose RAM is lower than 8G

Bug #1941854 reported by Andy Chi
This bug affects 1 person

Affects                    Status      Importance  Assigned to  Milestone
Checkbox Provider - Base   Expired     Undecided   Andy Chi
OEM Priority Project       Incomplete  Undecided   Andy Chi

Bug Description

[I/O log]
Results:
 /proc/meminfo reports: 7.16GiB
 lshw reports: 8GiB

FAIL: Meminfo reports 905527296 less than lshw, a difference of 10.54%. Only a variance of 10% in reported memory is allowed.

[Reproduce Steps]
1. sudo checkbox-cli run com.canonical.certification::memory/info

Why is the tolerance 10%? I have another machine with 16G of RAM that shows only 14.6G in meminfo, yet that test case passes because of the bigger denominator.
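
For reference, the FAIL line above is a plain percentage-difference check. A minimal sketch that reproduces the 10.54% figure (illustrative only, not the actual memory_compare.py code):

```
# Reproduce the 10.54% from the FAIL line above (illustrative only).
lshw_bytes = 8 * 1024**3        # 8 GiB as reported by lshw
delta_bytes = 905527296         # shortfall quoted in the FAIL line
percent = delta_bytes / lshw_bytes * 100
print(round(percent, 2))        # 10.54 -> over the 10% tolerance
```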


Revision history for this message
Andy Chi (andch) wrote :
tags: added: oem-priority originate-from-1927709 stella
Revision history for this message
Jonathan Cave (jocave) wrote :

Thresholds are set here: https://git.launchpad.net/plainbox-provider-checkbox/tree/bin/memory_compare.py#n85

According to the git history they were established in 2014 and have served their purpose since then. Let me throw the question back to you, what do you think are reasonable thresholds?

Changed in oem-priority:
status: New → Incomplete
Changed in plainbox-provider-checkbox:
status: New → Incomplete
Changed in oem-priority:
assignee: nobody → Andy Chi (andch)
Changed in plainbox-provider-checkbox:
assignee: nobody → Andy Chi (andch)
Revision history for this message
Jeff Lane  (bladernr) wrote :

So thinking back on this, here are a few comments:

1: This test has existed for a long, long time. It was (and is) intended to check to see that the amount of memory the kernel sees is reasonably close to what is physically installed on the system (per lshw). Unfortunately, "reasonably close" is difficult to define, and difficult to check for.

2: 10% variance was, at least then, reasonable to account for physical memory reallocated for things like embedded graphics that the kernel never sees. Perhaps newer embedded GPUs are using more shared memory on occasion.

3: Using a percentage was the best way at the time to accomplish this because the amount of shared RAM varies from system to system, GPU to GPU. A hard limit like 256MB for example may be perfectly valid for 50% of systems, but then the other 50% may use 384MB or 512MB (those are arbitrary numbers just for example, they do not reflect actual amounts of shared RAM).

I sometimes think about this test and wonder if there is a better way to do this, because the problem with percentages (and this also bugs me with the ethernet testing too) is, as you've observed, that the larger the total, the bigger the absolute amount that percentage represents (10% of 1GB is a lot smaller than 10% of 10GB).

As a thought, at least for this, is there a way to probe how much RAM is being consumed outside the OS by the graphics or other system overhead? That could be a good improvement if you can probe that and then subtract the amount of system shared RAM from what lshw says is installed before comparing it to what the kernel has addressed.

Anyway, just some thoughts. This is more an issue on client systems than servers as my stuff generally has very little shared ram so this test never fails.
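
A minimal sketch of that probe-and-subtract idea (a hypothetical shape, not checkbox code; one concrete source for the reserved size appears later in this thread):

```
# Sketch: remove firmware/GPU-reserved RAM from the installed total
# before comparing against what the kernel can address.
def within_tolerance(lshw_bytes, meminfo_bytes, reserved_bytes, tolerance=0.10):
    adjusted = lshw_bytes - reserved_bytes
    return abs(adjusted - meminfo_bytes) / adjusted <= tolerance
```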

Revision history for this message
Andy Chi (andch) wrote :

Hi @jeff,
I observed that on some HP laptops with an AMD CPU & GPU, the BIOS can set the video memory size. The default setting is `auto`, which uses 512 MB. If 256 MB is selected manually, memory/info passes.

Revision history for this message
jeremyszu (os369510) wrote (last edit ):

@Andy,

In this case,

please refer to something like:

$ lspci -nnv -d ::0x0302
01:00.0 3D controller [0302]: NVIDIA Corporation GP108M [GeForce MX150] [10de:1d10] (rev a1)
 Subsystem: Lenovo ThinkPad T480 [17aa:225e]
 Flags: bus master, fast devsel, latency 0, IRQ 169, IOMMU group 13
 Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
 Memory at 80000000 (64-bit, prefetchable) [size=256M] # <----- here
 Memory at 90000000 (64-bit, prefetchable) [size=32M]
 I/O ports at d000 [size=128]
 Capabilities: <access denied>
 Kernel driver in use: nvidia
 Kernel modules: nouveau, nvidia_drm, nvidia

could you please help confirm that the memory size here is the same as what you saw in the BIOS?

When implementing the solution, please consider multi-GPU cases.

To filter out the iGPU, please refer to the ACPI spec and the following:

```
jeremysu@arch [ /home/jeremysu ]
$ cat /sys/bus/pci/devices/0000\:00\:02.0/firmware_node/adr
0x00020000
jeremysu@arch [ /home/jeremysu ]
$ cat /sys/bus/pci/devices/0000\:01\:00.0/firmware_node/adr
0x00000000
```

For the GPU class, please consider all display classes (e.g. 0x0300, 0x0302, etc.).
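
A minimal sketch of that enumeration (assumes the sysfs layout shown above; not existing checkbox code):

```
import glob, os

# Sketch: find all display-class PCI devices (0x0300 VGA, 0x0302 3D, ...)
# and print their ACPI _ADR so iGPU and dGPU can be told apart.
for dev in sorted(glob.glob('/sys/bus/pci/devices/*')):
    with open(os.path.join(dev, 'class')) as f:
        pci_class = int(f.read().strip(), 16)
    if (pci_class >> 16) != 0x03:        # keep only display controllers
        continue
    adr_path = os.path.join(dev, 'firmware_node', 'adr')
    if os.path.exists(adr_path):
        with open(adr_path) as f:
            print(dev, f.read().strip())
```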

Revision history for this message
Andy Chi (andch) wrote :

@jeremy,

$ lspci -nnv -d ::0x0300
04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:1638] (rev d3) (prog-if 00 [VGA controller])
        DeviceName: Onboard IGD
        Subsystem: Hewlett-Packard Company Device [103c:8895]
        Flags: bus master, fast devsel, latency 0, IRQ 51
        Memory at 260000000 (64-bit, prefetchable) [size=256M]
        Memory at 270000000 (64-bit, prefetchable) [size=2M]
        I/O ports at 1000 [size=256]
        Memory at fb300000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

It shows 256M in lspci instead of the 512 MB in the kernel log.

[kernel log]
[ 0.870069] [drm] amdgpu: 512M of VRAM memory ready
[ 0.870072] [drm] amdgpu: 3072M of GTT memory ready.

Revision history for this message
jeremyszu (os369510) wrote :

It seems this memory region does not account for the BIOS-reserved memory, and the kernel log values are reported by amdgpu (probably obtained from firmware).

I think we need to list all the FW-reserved memory first.

Matias Piipari (mz2)
tags: added: cbox-52
Revision history for this message
Bin Li (binli) wrote :

/proc/meminfo reports: 6.61GiB
lshw reports: 8GiB

FAIL: Meminfo reports 1493368832 less than lshw, a difference of 17.39%. Only a variance of 10% in reported memory is allowed.

tags: added: originate-from-1964451 sutton
Revision history for this message
Bin Li (binli) wrote :

To keep it simple, could we just update the 6G to 8G? Thanks!

Revision history for this message
Jeff Lane  (bladernr) wrote :

Before you just change the criteria to make the test pass, can you answer the question of WHY so much memory is being shunted elsewhere?

In this case, you have a machine with 8GB of RAM, and nearly 20% of that RAM is unavailable to the OS because it's being consumed somewhere else. I'm not saying that lowering the limit just to get the test to pass is the wrong answer here, only that by loosening the failure threshold, you're likely to hide other cases where this shouldn't be happening.

In general, when I review certs, I expect that some things will fail in some cases, and in those cases I will ask questions, and either accept that or reject it based on the answers to those questions. IMO, in the case of a test that has existed for 8 years and has done its job all that time, loosening it because one machine isn't working as the test expects seems a bit premature?

Revision history for this message
Maciej Kisielewski (kissiel) wrote :

If the system reserves so much memory, this should be well documented and justified. But IMHO this should not warrant changing the thresholds for _all_ systems. If there is justification for that special system, create a custom job for it, or make the threshold customizable via configs, with the default being what has been used for years.
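
A minimal sketch of the configurable-threshold idea (the variable name is hypothetical, not an existing checkbox setting):

```
import os

# Sketch: keep the long-standing 10% default, but let a per-project
# config override it (MEMORY_INFO_TOLERANCE_PCT is a hypothetical name).
tolerance = float(os.environ.get('MEMORY_INFO_TOLERANCE_PCT', '10'))
```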

Bin Li (binli)
tags: added: originate-from-1954987
tags: added: originate-from-1958473
tags: added: originate-from-1958337
Revision history for this message
Bin Li (binli) wrote (last edit ):

I reviewed all the related bugs in the sutton project; all of the affected configs are AMD platforms. I also found failures when the RAM is bigger than 8G, so changing the tolerance to 20 for 8G systems could not fix all the issues.

On M75n I found the difference is 28.34%, because it uses 2G of shared memory for VRAM. And I could not change the value from the BIOS.

Results:
        /proc/meminfo reports: 5.73GiB
        lshw reports: 8GiB

FAIL: Meminfo reports 2434531328 less than lshw, a difference of 28.34%. Only a variance of 10% in reported memory is allowed.

[ 0.746168] amdgpu 0000:04:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[ 0.746181] [drm] Detected VRAM RAM=2048M, BAR=2048M
[ 0.746246] [drm] amdgpu: 2048M of VRAM memory ready

Revision history for this message
Bin Li (binli) wrote :

On drift3-amd, the memory is 32G. In the BIOS there is a "UMA frame buffer size" option; by default it is set to 'Auto', and there are also 1GB, 2GB, 4GB and 8GB options. When I set 1G or 2G this test case passes; 'Auto' means 4G according to dmesg. This issue looks unrelated to the Prefetchable value in lspci, which stays at 256M whatever the VRAM value is.

In this case 4G is used as the default for shared memory, which sounds reasonable; how can we avoid the failure of the memory/info test case?

Results:
        /proc/meminfo reports: 27.25GiB
        lshw reports: 32GiB

FAIL: Meminfo reports 5104971776 less than lshw, a difference of 14.86%. Only a variance of 10% in reported memory is allowed.

Mar 16 04:49:05 Drift3-AMD-6 kernel: [drm] amdgpu: 4096M of VRAM memory ready
Mar 16 04:49:05 Drift3-AMD-6 kernel: [drm] amdgpu: 4096M of GTT memory ready.

$ sudo lspci -nv | grep Prefetchable
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: 0000000830000000-00000008301fffff [size=2M]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: [disabled]
        Prefetchable memory behind bridge: 0000000860000000-00000008701fffff [size=258M]

Revision history for this message
Bin Li (binli) wrote :

On golem-amd, the memory is 8G. In the BIOS there is a "UMA frame buffer size" option; by default it is set to 'Auto', and there are also 1GB, 2GB and 4GB options. 'Auto' means 1G according to dmesg.

Is it possible to compare lshw against the sum of VRAM and /proc/meminfo? (A sketch of this idea follows this comment.)

[ 1.057123] [drm] Detected VRAM RAM=1024M, BAR=1024M
[ 1.057123] [drm] RAM width 64bits DDR4
[ 1.057152] [drm] amdgpu: 1024M of VRAM memory ready
[ 1.057153] [drm] amdgpu: 3072M of GTT memory ready.

Results:
        /proc/meminfo reports: 6.61GiB
        lshw reports: 8GiB

FAIL: Meminfo reports 1493368832 less than lshw, a difference of 17.39%. Only a variance of 10% in reported memory is allowed.
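
A minimal sketch of that sum idea, pulling the VRAM size from the kernel log as in the dmesg lines above (illustrative only; reading dmesg may require root):

```
import re, subprocess

# Sketch: add the amdgpu VRAM size from the kernel log to MemTotal
# before comparing against lshw.
log = subprocess.run(['dmesg'], capture_output=True, text=True).stdout
m = re.search(r'amdgpu: (\d+)M of VRAM memory ready', log)
vram_bytes = int(m.group(1)) * 1024**2 if m else 0

with open('/proc/meminfo') as f:
    memtotal_kib = int(f.readline().split()[1])   # first line is MemTotal
total_seen_bytes = memtotal_kib * 1024 + vram_bytes   # compare this to lshw
```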

jeremyszu (os369510)
tags: added: originate-from-1938006
tags: added: originate-from-1953698
tags: added: originate-from-1958516
tags: added: originate-from-1962148
Revision history for this message
Bin Li (binli) wrote :

From 'glxinfo -B' we can get the 'Video memory' value; if we add this value to /proc/meminfo, then all the platforms on my side would pass. Thanks! (A parsing sketch follows the output below.)

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD RENOIR (DRM 3.42.0, 5.14.0-1027-oem, LLVM 12.0.0) (0x15e7)
    Version: 21.2.6
    Accelerated: yes
    Video memory: 1024MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
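
A sketch of extracting that value (note the next comment questions its accuracy):

```
import re, subprocess

# Sketch: parse the 'Video memory' line from `glxinfo -B`.
out = subprocess.run(['glxinfo', '-B'], capture_output=True, text=True).stdout
m = re.search(r'Video memory:\s*(\d+)MB', out)
video_mb = int(m.group(1)) if m else 0
```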

Revision history for this message
jeremyszu (os369510) wrote :

The value from comment #15 is GLX_RENDERER_UNIFIED_MEMORY_ARCHITECTURE_MESA, which is not exactly correct on my I+N system.

If possible, we had better get the reserved memory from kernel space.
So I am wondering why amdgpu doesn't show the reserved memory in lspci.

Revision history for this message
Bin Li (binli) wrote :

@kaihengfeng,

 Here is the full lspci. Thanks!

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

For amdgpu, there's a sysfs attribute 'mem_info_vram_total' that shows the carved-out RAM size, so please consider that in the checkbox logic.

Using BAR size as VRAM size is only accurate for discrete AMD GFX. AMD APU has its own way to decide VRAM size.

Revision history for this message
jeremyszu (os369510) wrote :

I have no way to know whether an AMDGPU device is an APU, unless amdgpu_device->flag exports it.

The "mem_info_vram_total" attribute seems to work for both AMD iGPUs and dGPUs.
Let's consider counting them by GPU vendor.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

So it's better to just use "mem_info_vram_total" - it will work regardless of integrated or discrete.
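
A minimal sketch of that (assumes the amdgpu sysfs attribute discussed above, which reports bytes):

```
import glob

# Sketch: sum mem_info_vram_total across amdgpu cards; the attribute
# works for both APUs and discrete GPUs.
vram_bytes = 0
for path in glob.glob('/sys/class/drm/card?/device/mem_info_vram_total'):
    with open(path) as f:
        vram_bytes += int(f.read().strip())
print(vram_bytes)
```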

Revision history for this message
jeremyszu (os369510) wrote :
Yao Wei (medicalwei)
tags: added: originate-from-1974175 somerville
Bin Li (binli)
tags: added: originate-from-1976476
Bin Li (binli)
tags: added: originate-from-1990217
Revision history for this message
Maksim Beliaev (beliaev-maksim) wrote :

The bug was migrated to GitHub: https://github.com/canonical/checkbox/issues/191.
The bug is no longer monitored here.

Changed in plainbox-provider-checkbox:
status: Incomplete → Expired
Yujin.Wu (eugene2021)
tags: added: thinkbook-15-g5 thinkbook-15-g5-3