Comment 7 for bug 1979276

yatin (yatinkarel) wrote :

Further troubleshooting showed the following:

The issue occurs only with certain CPU models and node providers:
- Node providers where the issue was seen: rax-ord, rax-dfw, rax-iad, iweb-mtl01
- CPU models where the issue was seen:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Intel Xeon E312xx (Sandy Bridge)

This does not mean that only the above CPU models are impacted; there may be others that are not present in upstream CI. Some examples of CPU models where it works:
- Intel Core Processor (Haswell, no TSX)
- Intel Xeon Processor (Cascadelake)
- AMD EPYC-Rome Processor

This is why the issue was not seen in RDO infra or downstream: the nodes there do not have the CPU models mentioned above.

On an affected node, even ovn-nbctl --version crashes with an illegal-instruction signal (SIGILL).

The core dump looks like this:
# coredumpctl info
           PID: 640886 (ovn-nbctl)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Thu 2022-06-23 08:48:35 UTC (3s ago)
  Command Line: ovn-nbctl --version
    Executable: /usr/bin/ovn-nbctl
 Control Group: /machine.slice/libpod-449776acdb3089ad2f92d49b850b234089c5ec549b6f5d0fcfb5414b5f19717a.scope/container
          Unit: libpod-449776acdb3089ad2f92d49b850b234089c5ec549b6f5d0fcfb5414b5f19717a.scope
         Slice: machine.slice
       Boot ID: 4f2c55fc25f34c84a6160468479ece43
    Machine ID: c26d255f89064955aa655cf12e74d969
      Hostname: standalone.localdomain
       Storage: /var/lib/systemd/coredump/core.ovn-nbctl.0.4f2c55fc25f34c84a6160468479ece43.640886.1655974115000000.zst (present)
     Disk Size: 160.0K
       Message: Process 640886 (ovn-nbctl) of user 0 dumped core.

                Module /usr/bin/ovn-nbctl with build-id 2798d30ce0833d6e0fcabb6d8a0a98cba4da707d
                Module linux-vdso.so.1 with build-id 826a46efc5a1c4a55cc6fdceeb06554eda66067e
                Module libnghttp2.so.14 with build-id 7eadbd56a0e5bcd3d8a6b39b9bab2327e380283a
                Module libpython3.9.so.1.0 with build-id bb4578c381c6d22045835e803bf846e2b5a28502
                Module libevent-2.1.so.7 with build-id af406c254338ff6ceff47360cba92cdcf233cf14
                Module libprotobuf-c.so.1 with build-id 46661ae5d66cbaa2aa82b1b765472bdfa4712a24
                Module ld-linux-x86-64.so.2 with build-id 1d95aae3e4174446d3b885ad234d4f7e573e71db
                Module libz.so.1 with build-id 25486226566596e403da5485fb0ec85deed6b9fa
                Module libc.so.6 with build-id 14830f7e71953d5f0dac317543ac1e3fcdd874f5
                Module libunbound.so.8 with build-id def32d1bb7a7d99c59bf62e00c628af0246afa91
                Module libm.so.6 with build-id 3eb525d2e163793ef2e888d5bb46e104d11a3201
                Module libcap-ng.so.0 with build-id fdca0a301667e15db99d726152b57feeb35e4dbe
                Module libcrypto.so.3 with build-id 12bfb8486a63c1daa0d3b1d901401cd152c09f8e
                Module libssl.so.3 with build-id 4f82a7edeeafe3698ccc5442d011a8cd5aaf4e9d
                Stack trace of thread 96216:
                #0 0x000055d209c3dba8 n/a (/usr/bin/ovn-nbctl + 0x16ba8)
                ELF object binary architecture: AMD x86-64
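
The two details in this dump that matter for follow-up are the signal and the offset of frame #0 inside the binary, since that offset can be looked up in a disassembly of /usr/bin/ovn-nbctl to see which instruction faulted. A minimal sketch for pulling them out of saved `coredumpctl info` text follows; the function name is hypothetical and the regular expressions only assume the layout shown above.

```python
import re

def first_frame(coredump_info: str):
    """Return (signal_name, module_path, offset) from `coredumpctl info` text.

    Returns None if the signal line or frame #0 cannot be found.
    Assumes the output layout shown in this comment.
    """
    sig = re.search(r"Signal:\s*\d+\s*\((\w+)\)", coredump_info)
    frame = re.search(
        r"#0\s+0x[0-9a-f]+\s+\S+\s+\((\S+)\s*\+\s*(0x[0-9a-f]+)\)",
        coredump_info,
    )
    if not sig or not frame:
        return None
    return sig.group(1), frame.group(1), int(frame.group(2), 16)
```

For the dump above this would yield the ILL signal, /usr/bin/ovn-nbctl, and offset 0x16ba8.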

/proc/cpuinfo on the affected systems looks like this:
===========================
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping : 7
microcode : 0x71a
cpu MHz : 2593.881
cache size : 20480 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp xsaveopt md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5187.52
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
===========================
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping : 4
microcode : 0x42e
cpu MHz : 2599.955
cache size : 20480 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase smep erms xsaveopt md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5200.03
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
================================================
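
A SIGILL generally means the CPU hit an instruction it does not implement, so comparing the flags lines above against newer instruction-set extensions is a reasonable next step. The sketch below is only a triage aid: this comment does not identify the offending instruction, and the candidate list (roughly the features the x86-64-v3 level adds, as named in /proc/cpuinfo) is an assumption, not a finding. The function names are hypothetical.

```python
def cpu_flags(cpuinfo: str) -> set:
    """Return the flags of the first processor entry in /proc/cpuinfo text."""
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()

# Assumption: candidate extensions to check, not a confirmed root cause.
# "abm" is how /proc/cpuinfo reports LZCNT support.
CANDIDATE_FLAGS = {"avx2", "bmi1", "bmi2", "fma", "movbe", "abm", "f16c"}

def missing_flags(cpuinfo: str, wanted=CANDIDATE_FLAGS) -> list:
    """Return the wanted flags that the CPU does not advertise, sorted."""
    return sorted(set(wanted) - cpu_flags(cpuinfo))

if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(missing_flags(f.read()))
```

Running it on an affected node and on a working node (for example a Haswell or Cascadelake one) and diffing the results would narrow down which extension the binary may depend on.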

I will file a bz against OVN.