"mcelog --client" produces no output after performing PFA test on Ubuntu 22.04 and SR850v2

Bug #1972149 reported by conie chang
This bug affects 1 person
Affects          Status   Importance  Assigned to  Milestone
mcelog (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

"mcelog --client" prints nothing after running the PFA test. It should produce output after the Intel RAS PFA test has been performed.

I ran the PFA test on Ubuntu 22.04 on a Lenovo SR850v2 system (Cedar Island platform),
but the "mcelog --client" command produced no output afterwards.

I then ran the same test on another OS, RHEL 8.6, using the mcelog package provided by Red Hat. There, "mcelog --client" does show output after the test.

I am not sure whether I missed a step. My steps are below; please advise what I need to do. Thank you.

Test Steps:
1. Prepare the test environment
  1-1 Enable the MCE feature in the system UEFI settings.
  1-2 Download the Lenovo OneCLI package and configure the UEFI environment for the RAS PFA test in Ubuntu 22.04.
  https://download.lenovo.com/servers/mig/2021/12/22/55090/lnvgy_utl_lxce_onecli01u-3.4.0_rhel_x86-64.tgz
 #./OneCli config set SystemOobCustom.AdvancedDebugControl enabled --override --log 5
 #./OneCli config set Memory.WHEAErrorInjectionSupport enabled --override --log 5
 #./OneCli config set Memory.SWErrorInjectionSupport enabled --override --log 5
 #./OneCli config set Memory.McaBankErrorInjectionSupport enabled --override --log 5
 #./OneCli config set Memory.PMEMErrorInjection enabled --override --log 5
 #./OneCli config set Memory.DirectoryModeEnable disabled --override --log 5
 #./OneCli config set Memory.CorrectableErrorThreshold 2 --override --log 5
 #./OneCli config set Memory.PatrolScrubInterval 24 --override --log 5
 #./OneCli config set AdvancedRAS.EVDFXFeatures enabled --override --log 5
 #./OneCli config set AdvancedRAS.LockChipset disabled --override --log 5
 #./OneCli config set Memory.PollCEevent enabled --override --log 5
 #./OneCli config set SystemOobCustom.PFATest enabled --override --log 5

 #ipmitool raw 0x3A 0xC4 0x03 0x00 0x1A 0x01 0x93 0x2F 0x61 0x63 0x2F 0x69 0x62 0x6D 0x63 0x2F 0x75 0x65 0x66 0x69 0x2F 0x44 0x63 0x69 0x45 0x6E 0x11 0x01
 #reboot
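
The twelve "config set" calls in step 1-2 all share one shape, so they can be driven from a list. The following is only a sketch: ONECLI and DRY_RUN are hypothetical helper variables (not OneCli options), and the flags are copied unchanged from the commands above. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: apply the UEFI settings from step 1-2 in a loop.
# ONECLI and DRY_RUN are hypothetical helper variables, not OneCli options.
ONECLI="${ONECLI:-./OneCli}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (default)

apply_setting() {
    # $1 = setting name, $2 = value; flags copied from the steps above
    if [ "$DRY_RUN" = "1" ]; then
        echo "$ONECLI config set $1 $2 --override --log 5"
    else
        # </dev/null keeps the tool from consuming the settings list on stdin
        "$ONECLI" config set "$1" "$2" --override --log 5 </dev/null
    fi
}

apply_all() {
    # Stops at the first setting that fails to apply
    while read -r name value; do
        [ -n "$name" ] || continue
        apply_setting "$name" "$value" || return 1
    done <<'EOF'
SystemOobCustom.AdvancedDebugControl enabled
Memory.WHEAErrorInjectionSupport enabled
Memory.SWErrorInjectionSupport enabled
Memory.McaBankErrorInjectionSupport enabled
Memory.PMEMErrorInjection enabled
Memory.DirectoryModeEnable disabled
Memory.CorrectableErrorThreshold 2
Memory.PatrolScrubInterval 24
AdvancedRAS.EVDFXFeatures enabled
AdvancedRAS.LockChipset disabled
Memory.PollCEevent enabled
SystemOobCustom.PFATest enabled
EOF
}

apply_all
```

Running it with DRY_RUN=0 would execute the commands for real. The ipmitool call and the reboot are left out on purpose and still have to be done by hand.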

2. Download the ras-tool from GitHub and compile it
root@test:~/Desktop/ras-tool-master.tar/ras-tool-master# ./Init.sh
mount: /sys/kernel/debug: none already mounted on /run/credentials/systemd-sysusers.service.
root@test:~/Desktop/ras-tool-master.tar/ras-tool-master# ./mca-recover
flags for page 20526f: uptodate mmap anon swapbacked
vtop(7ffaa98d1000) = 20526f000
Hit any key to access: ^Z
[1]+ Stopped ./mca-recover
root@test:~/Desktop/ras-tool-master.tar/ras-tool-master# ./injection_error.sh 0x8 0x20526f000 0xfffffffffffff000 10
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
0x00000020 Memory Uncorrectable fatal
Injecting Correctable Memory Error
Injecting 10 errors at address 0x20526f000.
System performance will be affected while errors are being injected.
inject times: 1
inject times: 2
inject times: 3
inject times: 4
inject times: 5
inject times: 6
inject times: 7
inject times: 8
inject times: 9
inject times: 10
Injection Complete
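
In step 2 the physical address printed by mca-recover on its "vtop(...) = ..." line is copied by hand into the injection_error.sh call. A minimal sketch of pulling that address out of the text is below. It assumes the output format shown in the transcript above; vtop_phys_addr is a hypothetical helper name and not part of ras-tool.

```shell
#!/bin/sh
# Sketch: extract the physical address from mca-recover output.
# Assumes a line of the form "vtop(<virtual>) = <physical>" as in the transcript.
vtop_phys_addr() {
    # Reads text on stdin; prints the first physical address found, prefixed with 0x
    sed -n 's/^vtop([0-9a-fA-F]*) = \([0-9a-fA-F]*\)$/0x\1/p' | head -n 1
}

# Demo with the two lines captured in step 2
printf '%s\n' \
    'flags for page 20526f: uptodate mmap anon swapbacked' \
    'vtop(7ffaa98d1000) = 20526f000' | vtop_phys_addr
```

The printed value is what step 2 passes as the second argument, as in "./injection_error.sh 0x8 0x20526f000 0xfffffffffffff000 10".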

3. Check the syslog and the output of "mcelog --client"
root@test:~/Desktop/ras-tool-master.tar/ras-tool-master# dmesg|grep -i hardware
[ 3.351156] Booting paravirtualized kernel on bare hardware
[ 579.630701] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 162
[ 579.630707] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 579.630709] {1}[Hardware Error]: event severity: corrected
[ 579.630711] {1}[Hardware Error]: Error 0, type: corrected
[ 579.630713] {1}[Hardware Error]: section_type: memory error
[ 579.630714] {1}[Hardware Error]: error_status: 0x0000000000000400
[ 579.630716] {1}[Hardware Error]: physical_address: 0x000000020526f000
[ 579.630718] {1}[Hardware Error]: node: 0 card: 7 module: 0 rank: 0 bank: 13 device: 1 row: 3625 column: 968
[ 579.630719] {1}[Hardware Error]: error_type: 2, single-bit ECC
[ 579.630722] {1}[Hardware Error]: DIMM location: CPU 1 DIMM 5
[ 579.643486] mce: [Hardware Error]: Machine check events logged
root@test:~/Desktop/ras-tool-master.tar/ras-tool-master# tail /var/log/mes
tail: cannot open '/var/log/mes' for reading: No such file or directory
root@test:~/Desktop/ras-tool-master.tar/ras-tool-master# tail /var/log/syslog
Apr 8 23:37:49 test kernel: [ 579.630709] {1}[Hardware Error]: event severity: corrected
Apr 8 23:37:49 test kernel: [ 579.630711] {1}[Hardware Error]: Error 0, type: corrected
Apr 8 23:37:49 test kernel: [ 579.630713] {1}[Hardware Error]: section_type: memory error
Apr 8 23:37:49 test kernel: [ 579.630714] {1}[Hardware Error]: error_status: 0x0000000000000400
Apr 8 23:37:49 test kernel: [ 579.630716] {1}[Hardware Error]: physical_address: 0x000000020526f000
Apr 8 23:37:49 test kernel: [ 579.630718] {1}[Hardware Error]: node: 0 card: 7 module: 0 rank: 0 bank: 13 device: 1 row: 3625 column: 968
Apr 8 23:37:49 test kernel: [ 579.630719] {1}[Hardware Error]: error_type: 2, single-bit ECC
Apr 8 23:37:49 test kernel: [ 579.630722] {1}[Hardware Error]: DIMM location: CPU 1 DIMM 5
Apr 8 23:37:49 test kernel: [ 579.643486] mce: [Hardware Error]: Machine check events logged
Apr 8 23:37:50 test systemd-timesyncd[2372]: Timed out waiting for reply from [2001:67c:1560:8003::c7]:123 (ntp.ubuntu.com).
root@test:~/Desktop/ras-tool-master.tar/ras-tool-master# mcelog --client
root@test:~/Desktop/ras-tool-master.tar/ras-tool-master#
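
The kernel log check in step 3 can be condensed into two small filters. This is a sketch that assumes the APEI record format visible in the dmesg output above; count_corrected and list_phys_addrs are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: summarize corrected-error records from kernel log text on stdin.
count_corrected() {
    # grep -c prints 0 and returns non-zero when nothing matches, hence "|| true"
    grep -c 'event severity: corrected' || true
}

list_phys_addrs() {
    # Prints each distinct physical_address value once
    sed -n 's/.*physical_address: \(0x[0-9a-fA-F]*\).*/\1/p' | sort -u
}

# Demo with two lines taken from the dmesg output in step 3
sample='[ 579.630709] {1}[Hardware Error]: event severity: corrected
[ 579.630716] {1}[Hardware Error]: physical_address: 0x000000020526f000'
printf '%s\n' "$sample" | count_corrected
printf '%s\n' "$sample" | list_phys_addrs
```

On the test machine the same filters could be fed from the live log, for example "dmesg | count_corrected", and the address compared with the one used for the injection.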

conie chang (conie) wrote :
information type: Public → Public Security
information type: Public Security → Private Security
Seth Arnold (seth-arnold) wrote :

Hello Conie, can you please explain why you marked this bug "Private Security"?

mcelog is not an Ubuntu package in 22.04. I understand it has been replaced by rasdaemon and edac-utils.

I not only don't see the security relevance, as far as I can tell this isn't about software shipped with Ubuntu.

Thanks

Changed in mcelog (Ubuntu):
status: New → Incomplete
conie chang (conie)
information type: Private Security → Public Security
information type: Public Security → Public
conie chang (conie) wrote :

Hi Seth,

I set the wrong information type and have corrected it to Public. Sorry for the mistake.

I switched to rasdaemon and edac-utils and ran the test again. With them, the MCE error log is captured after the PFA test. The test result is below. I think we can close this bug. Thank you.
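
Because the error-reporting tool differs between the two setups discussed in this report (mcelog in the original test, rasdaemon here), a test script may want to detect which one is installed before querying for errors. A minimal sketch, where pick_ras_query is a hypothetical helper name and the two query commands are the ones used in this report:

```shell
#!/bin/sh
# Sketch: choose the error query command based on which tool is on PATH.
pick_ras_query() {
    if command -v ras-mc-ctl >/dev/null 2>&1; then
        echo "ras-mc-ctl --errors"      # rasdaemon tooling
    elif command -v mcelog >/dev/null 2>&1; then
        echo "mcelog --client"          # mcelog
    else
        echo "none"                     # neither tool found
    fi
}

echo "error query command: $(pick_ras_query)"
```

The check only looks at PATH. It does not verify that the matching daemon is actually running.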

Steps:
root@conie:/home/conie# sudo rasdaemon --enable
rasdaemon: ras:mc_event event enabled
rasdaemon: ras:aer_event event enabled
rasdaemon: mce:mce_record event enabled
rasdaemon: ras:extlog_mem_event event enabled
root@conie:/home/conie# sudo systemctl enable rasdaemon
Synchronizing state of rasdaemon.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable rasdaemon
root@conie:/home/conie#
root@conie:/home/conie# sudo systemctl start rasdaemon
root@conie:/home/conie# sudo systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-05-10 08:28:19 UTC; 2min 32s ago
   Main PID: 1692 (rasdaemon)
      Tasks: 1 (limit: 308967)
     Memory: 13.6M
        CPU: 171ms
     CGroup: /system.slice/rasdaemon.service
             └─1692 /usr/sbin/rasdaemon -f -r

May 10 08:28:18 conie rasdaemon[1692]: rasdaemon: Enabled event ras:extlog_mem_event
May 10 08:28:18 conie rasdaemon[1692]: Enabled event mce:mce_record
May 10 08:28:18 conie rasdaemon[1692]: ras:extlog_mem_event event enabled
May 10 08:28:18 conie rasdaemon[1692]: Enabled event ras:extlog_mem_event
May 10 08:28:18 conie rasdaemon[1692]: rasdaemon: Listening to events for cpus 0 to 71
May 10 08:28:19 conie systemd[1]: Started RAS daemon to log the RAS events.
May 10 08:28:19 conie rasdaemon[1692]: rasdaemon: Recording mc_event events
May 10 08:28:19 conie rasdaemon[1692]: rasdaemon: Recording aer_event events
May 10 08:28:19 conie rasdaemon[1692]: rasdaemon: Recording extlog_event events
May 10 08:28:19 conie rasdaemon[1692]: rasdaemon: Recording mce_record events
root@conie:/home/conie# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.
root@conie:/home/conie# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

root@conie:/home/conie# cd ras-tool-master
root@conie:/home/conie/ras-tool-master# ./Init.sh
./Init.sh: line 1: mcelog: command not found
mount: /sys/kernel/debug: none already mounted on /run/credentials/systemd-sysusers.service.
root@conie:/home/conie/ras-tool-master# chmod -R 777 injection_error.sh
root@conie:/home/conie/ras-tool-master# ./mca-recover
flags for page 15e98f: uptodate mmap anon swapbacked
vtop(7f9749724000) = 15e98f000
Hit any key to access: ^Z
[1]+ Stopped ./mca-recover
root@conie:/home/conie/ras-tool-master# ./injection_error.sh 0x8 0x15e98f000 0xfffffffffffff000 10
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
0x00000020 Memory Uncorrectable fatal
Injecting Correctable Memory Error
Injecting 10 errors at address 0x15e98f000.
System performance will be affected while errors are being injected.
inject times: 1
in...

Jeff Lane (bladernr)
Changed in mcelog (Ubuntu):
status: Incomplete → Invalid