Comment 18 for bug 1741978

Jeffrey Hugo (jhugo-o) wrote :

Verification of the package in -proposed failed. Tested on: Linux ubuntu 4.14.14 #11 SMP Mon Feb 26 17:06:57 EST 2018 aarch64 aarch64 aarch64 GNU/Linux

From our tester:

I did the testing you asked for, and it appears that Canonical is not enabling ARM event support when building rasdaemon. Without that support enabled, I can’t verify that it works.

root@ubuntu:/home/ubuntu# dpkg -l | grep rasdaemon
ii rasdaemon 0.5.6-2ubuntu1 arm64 utility to receive RAS error tracings
root@ubuntu:/home/ubuntu# rasdaemon -f
overriding event (830) ras:mc_event with new print handler
rasdaemon: ras:mc_event event enabled
rasdaemon: Enabled event ras:mc_event
overriding event (827) ras:aer_event with new print handler
rasdaemon: ras:aer_event event enabled
rasdaemon: Enabled event ras:aer_event
rasdaemon: Can't parse /proc/cpuinfo: missing [vendor_id] [cpu family] [model] [cpu MHz] [flags]
rasdaemon: Can't register mce handler
rasdaemon: Can't get ras:extlog_mem_event traces. Perhaps this feature is not supported on your system.
rasdaemon: Can't get traces from ras:aer_event
rasdaemon: Listening to events for cpus 0 to 45

It appears I have the correct rasdaemon executable, as the installed package version is 0.5.6-2ubuntu1. When rasdaemon starts, mc_event and aer_event are enabled properly, but arm_event is missing. I’ve also verified that mc_event and aer_event reporting works when errors are triggered, whereas the ARM processor error produces no arm_event report (see below).

They will need to build the rasdaemon executable with the configure option “--enable-arm”, similar to how they must already be enabling AER support with “--enable-aer”.
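
The startup check described above can be scripted. This is a minimal sketch, not part of rasdaemon: the helper name missing_events is hypothetical, and it assumes the "Enabled event ras:<name>" line format seen in the startup output above.

```shell
# Hypothetical helper: print every expected event that the given rasdaemon
# startup output does NOT report as enabled.
# Usage: missing_events "<startup output>" event1 event2 ...
missing_events() {
  log="$1"
  shift
  for ev in "$@"; do
    # Match lines such as "rasdaemon: Enabled event ras:mc_event"
    if ! printf '%s\n' "$log" | grep -q "Enabled event ras:${ev}\$"; then
      echo "$ev"
    fi
  done
}
```

On a live system, something like `missing_events "$(timeout 5 rasdaemon -f 2>&1)" mc_event aer_event arm_event` may work, assuming coreutils timeout is available to stop the foreground daemon. Given the startup output above, it would be expected to print arm_event for this build.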

------

Feb 27 10:12:43 ubuntu kernel: [72120.868869] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Feb 27 10:12:43 ubuntu kernel: [72120.876193] {4}[Hardware Error]: event severity: recoverable
Feb 27 10:12:43 ubuntu kernel: [72120.881818] {4}[Hardware Error]: precise tstamp: 2018-02-27 18:12:41
Feb 27 10:12:43 ubuntu kernel: [72120.888258] {4}[Hardware Error]: Error 0, type: recoverable
Feb 27 10:12:43 ubuntu kernel: [72120.893897] {4}[Hardware Error]: section_type: memory error
Feb 27 10:12:43 ubuntu kernel: [72120.899612] {4}[Hardware Error]: error_status: 0x00000000000c0400
Feb 27 10:12:43 ubuntu kernel: [72120.905874] {4}[Hardware Error]: physical_address: 0x0000000010097570
Feb 27 10:12:43 ubuntu kernel: [72120.912458] {4}[Hardware Error]: physical_address_mask: 0x00000fffffffffff
Feb 27 10:12:43 ubuntu kernel: [72120.919505] {4}[Hardware Error]: node: 0 card: 5 module: 0 rank: 0 bank: 0 device: 0 row: 342 column: 1006
Feb 27 10:12:43 ubuntu kernel: [72120.929397] {4}[Hardware Error]: error_type: 3, multi-bit ECC
Feb 27 10:12:43 ubuntu kernel: [72120.935330] EDAC MC0: 1 UE Multi-bit ECC on unknown label (node:0 card:5 module:0 rank:0 bank:0 row:342 col:1006 page:0x10097 offset:0x570 grain:-4096 - status(0x00000000000c0400): Storage error in DRAM memory)
Feb 27 10:12:43 ubuntu rasdaemon[24088]: overriding event (830) ras:mc_event with new print handler
Feb 27 10:12:43 ubuntu rasdaemon[24088]: overriding event (827) ras:aer_event with new print handler
Feb 27 10:12:43 ubuntu rasdaemon[24088]: Calling ras_mc_event_opendb()
Feb 27 10:12:43 ubuntu rasdaemon[24088]: cpu 00:rasdaemon: mc_event store: 0x26207ed8
Feb 27 10:12:43 ubuntu rasdaemon[24088]: rasdaemon: register inserted at db

Feb 27 10:14:09 ubuntu kernel: [72207.296768] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
Feb 27 10:14:09 ubuntu kernel: [72207.296771] {5}[Hardware Error]: event severity: info
Feb 27 10:14:09 ubuntu kernel: [72207.296777] {5}[Hardware Error]: precise tstamp: 2018-02-27 18:13:51
Feb 27 10:14:09 ubuntu kernel: [72207.296780] {5}[Hardware Error]: Error 0, type: info
Feb 27 10:14:09 ubuntu kernel: [72207.296783] {5}[Hardware Error]: section_type: ARM processor error
Feb 27 10:14:09 ubuntu kernel: [72207.296786] {5}[Hardware Error]: MIDR: 0x00000000510f8000
Feb 27 10:14:09 ubuntu kernel: [72207.296789] {5}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000000000000
Feb 27 10:14:09 ubuntu kernel: [72207.296791] {5}[Hardware Error]: error affinity level: 2
Feb 27 10:14:09 ubuntu kernel: [72207.296793] {5}[Hardware Error]: running state: 0x1
Feb 27 10:14:09 ubuntu kernel: [72207.296796] {5}[Hardware Error]: Power State Coordination Interface state: 0
Feb 27 10:14:09 ubuntu kernel: [72207.296799] {5}[Hardware Error]: Error info structure 0:
Feb 27 10:14:09 ubuntu kernel: [72207.296801] {5}[Hardware Error]: num errors: 1
Feb 27 10:14:09 ubuntu kernel: [72207.296803] {5}[Hardware Error]: first error captured
Feb 27 10:14:09 ubuntu kernel: [72207.296805] {5}[Hardware Error]: last error captured
Feb 27 10:14:09 ubuntu kernel: [72207.296808] {5}[Hardware Error]: error_type: 0, cache error
Feb 27 10:14:09 ubuntu kernel: [72207.296811] {5}[Hardware Error]: error_info: 0x0000000000c2007f
Feb 27 10:14:09 ubuntu kernel: [72207.296814] {5}[Hardware Error]: transaction type: Generic
Feb 27 10:14:09 ubuntu kernel: [72207.296816] {5}[Hardware Error]: operation type: Generic error (type cannot be determined)
Feb 27 10:14:09 ubuntu kernel: [72207.296818] {5}[Hardware Error]: cache level: 3
Feb 27 10:14:09 ubuntu kernel: [72207.296820] {5}[Hardware Error]: processor context not corrupted
Feb 27 10:14:09 ubuntu kernel: [72207.296822] {5}[Hardware Error]: the error has not been corrected
Feb 27 10:14:09 ubuntu kernel: [72207.296824] {5}[Hardware Error]: PC is imprecise

Feb 27 10:24:05 ubuntu kernel: [72803.528432] pciehp 0003:00:00.0:pcie004: Slot(4): Link Down
Feb 27 10:24:06 ubuntu kernel: [72803.881709] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Feb 27 10:24:06 ubuntu kernel: [72803.889042] {6}[Hardware Error]: event severity: recoverable
Feb 27 10:24:06 ubuntu kernel: [72803.894658] {6}[Hardware Error]: precise tstamp: 2018-02-27 18:24:04
Feb 27 10:24:06 ubuntu kernel: [72803.901094] {6}[Hardware Error]: Error 0, type: recoverable
Feb 27 10:24:06 ubuntu kernel: [72803.906722] {6}[Hardware Error]: section_type: PCIe error
Feb 27 10:24:06 ubuntu kernel: [72803.912288] {6}[Hardware Error]: port_type: 4, root port
Feb 27 10:24:06 ubuntu kernel: [72803.917745] {6}[Hardware Error]: version: 3.0
Feb 27 10:24:06 ubuntu kernel: [72803.922271] {6}[Hardware Error]: command: 0x0407, status: 0x0010
Feb 27 10:24:06 ubuntu kernel: [72803.928423] {6}[Hardware Error]: device_id: 0003:00:00.0
Feb 27 10:24:06 ubuntu kernel: [72803.933902] {6}[Hardware Error]: slot: 4
Feb 27 10:24:06 ubuntu kernel: [72803.937971] {6}[Hardware Error]: secondary_bus: 0x01
Feb 27 10:24:06 ubuntu kernel: [72803.943104] {6}[Hardware Error]: vendor_id: 0x17cb, device_id: 0x0401
Feb 27 10:24:06 ubuntu kernel: [72803.949701] {6}[Hardware Error]: class_code: 000406
Feb 27 10:24:06 ubuntu kernel: [72803.954725] {6}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
Feb 27 10:24:06 ubuntu kernel: [72803.962523] pcieport 0003:00:00.0: aer_status: 0x00004000, aer_mask: 0x00500000
Feb 27 10:24:06 ubuntu kernel: [72803.969760] pcieport 0003:00:00.0: Completion Timeout
Feb 27 10:24:06 ubuntu kernel: [72803.974778] pcieport 0003:00:00.0: aer_layer=Transaction Layer, aer_agent=Requester ID
Feb 27 10:24:06 ubuntu kernel: [72803.982688] pcieport 0003:00:00.0: aer_uncor_severity: 0x00462030
Feb 27 10:24:06 ubuntu kernel: [72803.988779] pcieport 0003:00:00.0: broadcast error_detected message
Feb 27 10:24:06 ubuntu rasdaemon[24088]: cpu 00:rasdaemon: aer_event store: 0x261e6e08
Feb 27 10:24:06 ubuntu rasdaemon[24088]: rasdaemon: register inserted at db
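
To compare what the kernel reported against what rasdaemon stored, the log above can be reduced to one line per event. This is a rough sketch only: summarize_ras_log is a hypothetical helper, and it assumes the "section_type:" and "<name>_event store:" line formats shown in the log above.

```shell
# Hypothetical helper: read syslog text on stdin and print one line for each
# kernel-reported error section and each event that rasdaemon stored.
summarize_ras_log() {
  awk '
    # Kernel side, e.g. "{5}[Hardware Error]: section_type: ARM processor error"
    /\[Hardware Error\]:.*section_type:/ {
      sub(/.*section_type: */, "")
      print "kernel: " $0
    }
    # rasdaemon side, e.g. "cpu 00:rasdaemon: mc_event store: 0x26207ed8"
    /rasdaemon: .*_event store:/ {
      sub(/.*rasdaemon: */, "")
      sub(/ store:.*/, "")
      print "stored: " $0
    }
  '
}
```

Fed the log above, each kernel section that rasdaemon handled should be followed by a matching "stored:" line. The ARM processor error has no such line, which is the failure being reported.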