ovn-dbs-bundle fails to start because ovn-ctl crashes with a core dump

Bug #1979276 reported by Takashi Kajinami
Affects: tripleo
Status: Invalid
Importance: Critical
Assigned to: Unassigned

Bug Description

Description
===========

The puppet-glance-tripleo-standalone job started failing consistently.

Example:
https://zuul.opendev.org/t/openstack/build/4757380fddac4d59a02f778887727c0e

Looking at the deployment log, it seems the ovn-dbs-bundle resource fails to start, and Pacemaker does not start the VIP resource because of a location constraint.

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_475/846784/8/check/puppet-glance-tripleo-standalone/4757380/logs/undercloud/var/log/extra/pcs.txt

```
Full List of Resources:
  * ip-192.168.24.3 (ocf:heartbeat:IPaddr2): Stopped
  * Container bundle: haproxy-bundle [127.0.0.1:5001/tripleomastercentos9/openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf:heartbeat:podman): Started standalone
  * Container bundle: galera-bundle [127.0.0.1:5001/tripleomastercentos9/openstack-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf:heartbeat:galera): Promoted standalone
  * Container bundle: rabbitmq-bundle [127.0.0.1:5001/tripleomastercentos9/openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf:heartbeat:rabbitmq-cluster): Started standalone
  * Container bundle: ovn-dbs-bundle [127.0.0.1:5001/tripleomastercentos9/openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf:ovn:ovndb-servers): Unpromoted standalone

Failed Resource Actions:
  * ovndb_servers promote on ovn-dbs-bundle-0 could not be executed (Timed Out: Resource agent did not complete within 2m) at Tue Jun 21 06:41:09 2022 after 2m1ms
```
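
For reference, the failed promote action and the constraints that keep the VIP stopped can be inspected directly on the node with standard pcs commands (a minimal sketch; resource names are taken from the output above):

```
# Cluster state, including failed resource actions
pcs status --full

# List configured constraints (the bug notes the VIP is held back by a location constraint)
pcs constraint

# Clear the failed promote and let Pacemaker retry
pcs resource cleanup ovn-dbs-bundle
```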

Looking at the journal log, it seems the ovn-nbctl command crashes with a core dump.

```
Jun 21 06:41:08 standalone.localdomain kernel: traps: ovn-nbctl[212704] trap invalid opcode ip:55d658f09ba8 sp:7ffcdc0e3140 error:0 in ovn-nbctl[55d658f05000+5c000]
Jun 21 06:41:08 standalone.localdomain systemd[1]: Started Process Core Dump (PID 212705/UID 0).
Jun 21 06:41:08 standalone.localdomain systemd-coredump[212707]: Process 212704 (ovn-nbctl) of user 0 dumped core.

                                                                 Module /usr/bin/ovn-nbctl with build-id 2798d30ce0833d6e0fcabb6d8a0a98cba4da707d
                                                                 Module linux-vdso.so.1 with build-id 932e8861e1b4a3fa34f93ff803210fc441bcd188
                                                                 Module libnghttp2.so.14 with build-id 7eadbd56a0e5bcd3d8a6b39b9bab2327e380283a
                                                                 Module libpython3.9.so.1.0 with build-id bbe909b82db5ae1835b0022275d690951734a378
                                                                 Module libevent-2.1.so.7 with build-id af406c254338ff6ceff47360cba92cdcf233cf14
                                                                 Module libprotobuf-c.so.1 with build-id 46661ae5d66cbaa2aa82b1b765472bdfa4712a24
                                                                 Module ld-linux-x86-64.so.2 with build-id 1d95aae3e4174446d3b885ad234d4f7e573e71db
                                                                 Module libz.so.1 with build-id 25486226566596e403da5485fb0ec85deed6b9fa
                                                                 Module libc.so.6 with build-id 14830f7e71953d5f0dac317543ac1e3fcdd874f5
                                                                 Module libunbound.so.8 with build-id def32d1bb7a7d99c59bf62e00c628af0246afa91
                                                                 Module libm.so.6 with build-id 3eb525d2e163793ef2e888d5bb46e104d11a3201
                                                                 Module libcap-ng.so.0 with build-id fdca0a301667e15db99d726152b57feeb35e4dbe
                                                                 Module libcrypto.so.3 with build-id ea50b2486363fd2ce58686de4fe12956a9fa4622
                                                                 Module libssl.so.3 with build-id 6a3692862938d5df4111a2474b84f3ee9124f941
                                                                 Stack trace of thread 4928:
                                                                 #0 0x000055d658f09ba8 n/a (/usr/bin/ovn-nbctl + 0x16ba8)
                                                                 ELF object binary architecture: AMD x86-64
```
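
If the dump is still present on the node, a symbolized backtrace can be pulled from it with coredumpctl and gdb (a sketch; ovn debuginfo packages would be needed for useful symbols):

```
# List recorded ovn-nbctl core dumps and show details of the latest one
coredumpctl list ovn-nbctl
coredumpctl info ovn-nbctl

# Open the latest dump in gdb and print the backtrace
coredumpctl debug ovn-nbctl
(gdb) bt
```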

Steps to reproduce
==================
* Deploy standalone with ml2+ovn enabled

Expected result
===============
* Deployment should succeed without any error

Actual result
=============
* Deployment fails because vip is not started

Environment
===========
* The problem is observed only in master so far

Logs & Configs
==============
See https://zuul.opendev.org/t/openstack/build/4757380fddac4d59a02f778887727c0e

Revision history for this message
Takashi Kajinami (kajinamit) wrote :

We are facing a similar problem in puppet jobs, and the only difference from the last successful run was that a few neutron packages were updated.

--- a/rpm-qa-fail.txt
+++ b/rpm-qa-pass.txt
@@ -643,10 +643,10 @@ openssl-pkcs11-0.4.11-7.el9.x86_64
 openstack-glance-25.0.0-0.20220616170836.53f322f.el9.noarch
 openstack-keystone-21.1.0-0.20220603170558.daa8e74.el9.noarch
 openstack-network-scripts-10.11.1-1.el9s.x86_64
-openstack-neutron-20.1.0-0.20220620175903.ae2d4c1.el9.noarch
-openstack-neutron-common-20.1.0-0.20220620175903.ae2d4c1.el9.noarch
-openstack-neutron-ml2-20.1.0-0.20220620175903.ae2d4c1.el9.noarch
-openstack-neutron-ovn-metadata-agent-20.1.0-0.20220620175903.ae2d4c1.el9.noarch
+openstack-neutron-20.1.0-0.20220617192907.92b70ef.el9.noarch
+openstack-neutron-common-20.1.0-0.20220617192907.92b70ef.el9.noarch
+openstack-neutron-ml2-20.1.0-0.20220617192907.92b70ef.el9.noarch
+openstack-neutron-ovn-metadata-agent-20.1.0-0.20220617192907.92b70ef.el9.noarch
 openstack-nova-api-25.1.0-0.20220620152950.ebe0883.el9.noarch
 openstack-nova-common-25.1.0-0.20220620152950.ebe0883.el9.noarch
 openstack-nova-compute-25.1.0-0.20220620152950.ebe0883.el9.noarch
@@ -908,7 +908,7 @@ python3-munch-2.5.0-4.el9s.noarch
 python3-netaddr-0.8.0-5.el9.noarch
 python3-netifaces-0.10.6-15.el9.x86_64
 python3-networkx-2.6.2-2.el9.noarch
-python3-neutron-20.1.0-0.20220620175903.ae2d4c1.el9.noarch
+python3-neutron-20.1.0-0.20220617192907.92b70ef.el9.noarch
 python3-neutronclient-7.8.0-0.20220215101221.6ca3341.el9.noarch
 python3-neutron-lib-2.21.0-0.20220614074327.d2b395f.el9.noarch
 python3-neutron-tests-tempest-1.9.1-0.20220616171513.6d09715.el9.noarch
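
For reference, a package diff like the one above can be produced by capturing the installed package list on the failing and the last passing node and diffing the two files (a sketch; the file names match the diff headers above):

```
# On each node, dump a sorted list of installed packages
rpm -qa | sort > rpm-qa-fail.txt    # on the failing node
rpm -qa | sort > rpm-qa-pass.txt    # on the last passing node

# Compare the two lists
diff -u rpm-qa-fail.txt rpm-qa-pass.txt
```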

description: updated
Revision history for this message
yatin (yatinkarel) wrote :

Just noticed it in devstack[1] too, and I suspect the new openvswitch and ovn releases[2][3] from yesterday could be causing this, considering that the issue is random and the timing of the failures lines up.

[1] https://zuul.opendev.org/t/openstack/build/f2745fa4127d4bf196362e3453a45ccb/log/controller/logs/services.txt#1693-1694
[2] https://review.rdoproject.org/r/c/nfvinfo/+/43620
[3] https://review.rdoproject.org/r/c/nfvinfo/+/42683

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

Does it only happen in the puppet-glance-tripleo-standalone job? How often? We didn't find it in the check jobs. Note that we can revert those releases if needed, but it'd be good to involve someone from the OVS/OVN teams.

Revision history for this message
yatin (yatinkarel) wrote :
Arx Cruz (arxcruz)
Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
tags: added: alert ci promotion-blocker
Revision history for this message
Takashi Kajinami (kajinamit) wrote :

It seems I was looking at the wrong thing; I actually see that ovn/ovs was updated when the issue started.

The first failed run: https://zuul.opendev.org/t/openstack/build/245e7dfc2a584535a25070614fa43378

The last successful run: https://zuul.opendev.org/t/openstack/build/2d2f2d6ddade48f2892a926630669f48

--- a/home/tkajinam/Downloads/rpm-qa-fail.txt
+++ b/home/tkajinam/Downloads/rpm-qa-pass.txt
@@ -610,7 +610,7 @@ nettle-3.7.3-2.el9.x86_64
 net-tools-2.0-0.62.20160912git.el9.x86_64
 NetworkManager-1.39.6-1.el9.x86_64
 NetworkManager-libnm-1.39.6-1.el9.x86_64
-network-scripts-openvswitch2.15-2.15.0-99.el9s.x86_64
+network-scripts-openvswitch2.15-2.15.0-81.el9s.x86_64
 nfs-utils-2.5.4-10.el9.x86_64
 nmap-7.91-10.el9.x86_64
 nmap-ncat-7.91-10.el9.x86_64
@@ -643,10 +643,10 @@ openssl-pkcs11-0.4.11-7.el9.x86_64
 openstack-glance-25.0.0-0.20220616170836.53f322f.el9.noarch
 openstack-keystone-21.1.0-0.20220603170558.daa8e74.el9.noarch
 openstack-network-scripts-10.11.1-1.el9s.x86_64
-openstack-neutron-20.1.0-0.20220620175903.ae2d4c1.el9.noarch
-openstack-neutron-common-20.1.0-0.20220620175903.ae2d4c1.el9.noarch
-openstack-neutron-ml2-20.1.0-0.20220620175903.ae2d4c1.el9.noarch
-openstack-neutron-ovn-metadata-agent-20.1.0-0.20220620175903.ae2d4c1.el9.noarch
+openstack-neutron-20.1.0-0.20220617192907.92b70ef.el9.noarch
+openstack-neutron-common-20.1.0-0.20220617192907.92b70ef.el9.noarch
+openstack-neutron-ml2-20.1.0-0.20220617192907.92b70ef.el9.noarch
+openstack-neutron-ovn-metadata-agent-20.1.0-0.20220617192907.92b70ef.el9.noarch
 openstack-nova-api-25.1.0-0.20220620152950.ebe0883.el9.noarch
 openstack-nova-common-25.1.0-0.20220620152950.ebe0883.el9.noarch
 openstack-nova-compute-25.1.0-0.20220620152950.ebe0883.el9.noarch
@@ -657,7 +657,7 @@ openstack-placement-api-7.1.0-0.20220617102903.f052a35.el9.noarch
 openstack-placement-common-7.1.0-0.20220617102903.f052a35.el9.noarch
 openstack-selinux-0.8.32-0.20220615144412.d53c3f0.el9.noarch
 openstack-tempest-31.0.1-0.20220618012213.7559bb6.el9.noarch
-openvswitch2.15-2.15.0-99.el9s.x86_64
+openvswitch2.15-2.15.0-81.el9s.x86_64
 openvswitch-selinux-extra-policy-1.0-31.el9s.noarch
 opus-1.3.1-10.el9.x86_64
 orc-0.4.31-6.el9.x86_64
@@ -665,9 +665,9 @@ osinfo-db-20220516-1.el9.noarch
 osinfo-db-tools-1.9.0-3.el9.x86_64
 os-prober-1.77-9.el9.x86_64
 ostree-libs-2022.3-2.el9.x86_64
-ovn-2021-21.12.0-46.el9s.x86_64
-ovn-2021-central-21.12.0-46.el9s.x86_64
-ovn-2021-host-21.12.0-46.el9s.x86_64
+ovn-2021-21.12.0-11.el9s.x86_64
+ovn-2021-central-21.12.0-11.el9s.x86_64
+ovn-2021-host-21.12.0-11.el9s.x86_64
 p11-kit-0.24.1-2.el9.x86_64
 p11-kit-server-0.24.1-2.el9.x86_64
 p11-kit-trust-0.24.1-2.el9.x86_64
@@ -852,7 +852,6 @@ python3-futurist-2.4.1-0.20220509165344.159d752.el9.noarch
 python3-glance-25.0.0-0.20220616170836.53f322f.el9.noarch
 python3-glanceclient-4.0.0-0.20220523182016.be8f394.el9.noarch
 python3-glance-store-4.0.0-0.20220523182411.aeee48b.el9.noarch
-python3-glance-tests-tempest-0.3.1-0.20220615170855.87df2f4.el9.noarch
 python3-gobject-3.40.1-5.el9.x86_64
 python3-gobject-base-3.40.1-5.el9.x86_64
 python3-gpg-1.15.1-6.el9.x86_64
@@ -908,7 +907,7 @@ python3-munch-2.5.0-4.el9s.noarch
 python3-netadd...


Revision history for this message
yatin (yatinkarel) wrote :

Both the ovs and ovn updates have been reverted[1][2]; we need to wait for these to be removed from the facebook and infra mirrors. The root cause is not clear yet, as gate jobs passed on the original update patches and the issue is only seen in upstream infra, not in RDO infra[3][4].
Maybe something is wrong with the latest OVN version on the upstream CentOS 9 Stream nodes (I noticed an older kernel on those, 5.14.0-96 vs 5.14.0-105)? There are some other package differences, but those don't look related.

I tried to reproduce this locally but couldn't.

[1] https://review.rdoproject.org/r/c/nfvinfo/+/43674
[2] https://review.rdoproject.org/r/c/nfvinfo/+/43673
[3] https://review.rdoproject.org/r/c/testproject/+/43551
[4] https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-integration-main
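
To compare an upstream node with an RDO one, the installed OVN/OVS builds and the running kernel can be checked directly on the node (a minimal sketch using standard tooling):

```
# Running kernel (5.14.0-96 vs 5.14.0-105 was noted above)
uname -r

# Installed OVN and Open vSwitch packages
rpm -qa | grep -E '^(ovn|openvswitch)' | sort
```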

Revision history for this message
yatin (yatinkarel) wrote :

Further troubleshooting resulted in the following:

The issue happens on certain CPU models and node providers:
- The issue is seen with node providers: rax-ord, rax-dfw, rax-iad, iweb-mtl01
- The issue is seen with the CPU models below:
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Intel Xeon E312xx (Sandy Bridge)

This doesn't mean that only the above CPU models are impacted; there may be others that are not present in upstream CI. Some examples of CPU models where the job succeeds:
- Intel Core Processor (Haswell, no TSX)
- Intel Xeon Processor (Cascadelake)
- AMD EPYC-Rome Processor

This is why the issue was not seen in RDO infra and downstream: the nodes there don't have the above-mentioned CPU models.
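
On a given test node, the CPU model and the instruction-set extensions it advertises can be checked as below (a minimal sketch). An "invalid opcode" trap (SIGILL) is consistent with a binary built for a newer CPU baseline than Sandy Bridge-era hardware supports, though the exact instruction is not confirmed here:

```
# CPU model of the node
grep -m1 'model name' /proc/cpuinfo

# Instruction-set extensions advertised by the CPU (e.g. avx, avx2, bmi2)
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -Ex 'avx|avx2|bmi2|sse4_2'
```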

On an affected node, even ovn-nbctl --version crashes, and the core dump is seen as below:
# coredumpctl info
           PID: 640886 (ovn-nbctl)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Thu 2022-06-23 08:48:35 UTC (3s ago)
  Command Line: ovn-nbctl --version
    Executable: /usr/bin/ovn-nbctl
 Control Group: /machine.slice/libpod-449776acdb3089ad2f92d49b850b234089c5ec549b6f5d0fcfb5414b5f19717a.scope/container
          Unit: libpod-449776acdb3089ad2f92d49b850b234089c5ec549b6f5d0fcfb5414b5f19717a.scope
         Slice: machine.slice
       Boot ID: 4f2c55fc25f34c84a6160468479ece43
    Machine ID: c26d255f89064955aa655cf12e74d969
      Hostname: standalone.localdomain
       Storage: /var/lib/systemd/coredump/core.ovn-nbctl.0.4f2c55fc25f34c84a6160468479ece43.640886.1655974115000000.zst (present)
     Disk Size: 160.0K
       Message: Process 640886 (ovn-nbctl) of user 0 dumped core.

                Module /usr/bin/ovn-nbctl with build-id 2798d30ce0833d6e0fcabb6d8a0a98cba4da707d
                Module linux-vdso.so.1 with build-id 826a46efc5a1c4a55cc6fdceeb06554eda66067e
                Module libnghttp2.so.14 with build-id 7eadbd56a0e5bcd3d8a6b39b9bab2327e380283a
                Module libpython3.9.so.1.0 with build-id bb4578c381c6d22045835e803bf846e2b5a28502
                Module libevent-2.1.so.7 with build-id af406c254338ff6ceff47360cba92cdcf233cf14
                Module libprotobuf-c.so.1 with build-id 46661ae5d66cbaa2aa82b1b765472bdfa4712a24
                Module ld-linux-x86-64.so.2 with build-id 1d95aae3e4174446d3b885ad234d4f7e573e71db
                Module libz.so.1 with build-id 25486226566596e403da5485fb0ec85deed6b9fa
                Module libc.so.6 with build-id 14830f7e71953d5f0dac317543ac1e3fcdd874f5
                Module libunbound.so.8 with build-id def32d1bb7a7d99c59bf62e00c628af0246afa91
                Module libm.so.6 with build-id 3eb525d2e163793ef2e888d5bb46e104d11a3201
                Module libcap-ng.so.0 with build-id fdca0a301667e15db99d726152b57feeb35e4dbe
                Module libcrypto.so.3 with build-id 12bfb8486a63c1daa0d3b1d901401cd152c09f8e
                Module libssl.so.3 with build-id 4f82a7edeeafe3698ccc5442d011a8cd5aaf4e9d
                Stack trace of thread 96216:
                #0 0x000055d209c3dba8 n/a (/usr/bin/ovn-nbctl + 0x16ba8)
                ELF object binary architecture: AMD x86-64

Also, /proc/cpuinfo looks like below on an affected system:
====...

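With the dump available, the exact instruction that raised SIGILL can be inspected in gdb and compared against the CPU capabilities discussed above (a sketch):

```
# Open the most recent ovn-nbctl dump in gdb
coredumpctl debug ovn-nbctl

# Inside gdb: show the faulting instruction at the crash address
(gdb) x/i $pc
```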

Revision history for this message
yatin (yatinkarel) wrote :
yatin (yatinkarel)
Changed in tripleo:
milestone: none → yoga-1
Revision history for this message
Alan Pevec (apevec) wrote :

Closing the old promotion-blocker; the OVN rhbz was resolved.

Changed in tripleo:
status: Triaged → Invalid