tripleo_ovn_northd_healthcheck.service loaded failed failed ovn_northd healthcheck

Bug #1823882 reported by wes hayutin
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Cédric Jeanneret

Bug Description

2019-04-05 03:46:38.261202 | primary |
2019-04-05 03:46:38.261346 | primary | TASK [validate-services : Print out any failed Systemd services for tripleo_*] ***
2019-04-05 03:46:38.292466 | primary | Friday 05 April 2019 03:46:38 +0000 (0:00:00.973) 0:46:30.454 **********
2019-04-05 03:46:38.603218 | primary | ok: [undercloud] => {
2019-04-05 03:46:38.604274 | primary | "systemd_state.stdout_lines": [
2019-04-05 03:46:38.604597 | primary | "tripleo_ovn_northd_healthcheck.service loaded failed failed ovn_northd healthcheck"
2019-04-05 03:46:38.604687 | primary | ]
2019-04-05 03:46:38.604724 | primary | }
2019-04-05 03:46:38.659140 | primary |

http://logs.openstack.org/76/649476/5/check/tripleo-ci-centos-7-standalone-upgrade/33cf41f/job-output.txt.gz#_2019-04-05_03_46_38_261346

http://logs.openstack.org/76/649476/5/check/tripleo-ci-centos-7-standalone-upgrade/33cf41f/logs/undercloud/var/log/journal.txt.gz#_Apr_05_03_32_11

Apr 05 03:32:11 standalone.localdomain podman[71254]: There is no ovn-northd process connected to socket sockets running in the container
Apr 05 03:32:11 standalone.localdomain podman[71254]: exit status 1
Apr 05 03:32:11 standalone.localdomain systemd[1]: tripleo_ovn_northd_healthcheck.service: main process exited, code=exited, status=1/FAILURE
Apr 05 03:32:11 standalone.localdomain systemd[1]: Failed to start ovn_northd healthcheck.
Apr 05 03:32:11 standalone.localdomain systemd[1]: Unit tripleo_ovn_northd_healthcheck.service entered failed state.
Apr 05 03:32:11 standalone.localdomain systemd[1]: tripleo_ovn_northd_healthcheck.service failed.

Kamil Sambor (ksambor)
Changed in tripleo:
assignee: nobody → Kamil Sambor (ksambor)
Cédric Jeanneret (cjeanner) wrote :

Wondering if https://review.openstack.org/#/c/651517/ would solve this issue.
The healthchecks changed recently and now use "pgrep" for better output. Apparently procps-ng, the package that provides pgrep, isn't installed everywhere.

The patch should add it to the ovn_northd container.
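
Editor's sketch: a healthcheck that depends on an external tool can fail in a confusing way when that tool is missing from the image. The guard below is a minimal illustration, not the actual TripleO healthcheck; the function names `have_cmd` and `check_prereqs` are hypothetical, and only the standard `command -v` shell builtin is assumed.

```shell
#!/bin/sh
# Return 0 if the named command is available on PATH, 1 otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical guard for the top of a healthcheck script: report the
# first missing tool on stderr and return non-zero.
check_prereqs() {
    for tool in "$@"; do
        if ! have_cmd "$tool"; then
            echo "healthcheck prerequisite missing: $tool" >&2
            return 1
        fi
    done
}
```

A script could then start with `check_prereqs pgrep || exit 1`, so that a missing procps-ng shows up as an explicit message rather than as a failed process lookup.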

Kamil Sambor (ksambor) wrote :

This issue is solved by this patch: https://review.openstack.org/#/c/651213/

Marios Andreou (marios-b) wrote :

o/ folks.

I am looking at the job in Emilien's fix at https://review.openstack.org/#/c/651789/ (for bug 1824301).

I can't see what version of kolla is there from yum.log [1]. Looking at the delorean repo [2] to see what is available [3], I see

    openstack-kolla-8.0.0-0.20190408093855.77e1745.el7.noarch.rpm

and checking the two available 8.0.0 tags at [4][5], I don't see any "procps-ng" from Tengu's fix at [6].
So Tengu... wondering if we need to revisit cherry-picking that to rocky too?

[1] http://logs.openstack.org/89/651789/1/check/tripleo-ci-centos-7-standalone-upgrade/bfb0318/logs/undercloud/var/log/yum.log.txt.gz
[2] http://logs.openstack.org/89/651789/1/check/tripleo-ci-centos-7-standalone-upgrade/bfb0318/logs/undercloud/etc/yum.repos.d/delorean.repo.txt.gz
[3] http://mirror.gra1.ovh.openstack.org:8080/rdo/centos7-stein/c5/b2/c5b283cab4999921135b3815cd4e051b43999bce_5b53d5ba/
[4] https://github.com/openstack/kolla/blob/8.0.0.0b1/docker/base/Dockerfile.j2
[5] https://github.com/openstack/kolla/blob/8.0.0.0rc1/docker/base/Dockerfile.j2
[6] https://review.openstack.org/#/c/651537/

Marios Andreou (marios-b) wrote :

Update after IRC with Tengu just now: we will disable it for now for the standalone-upgrade job, so we need https://review.openstack.org/#/c/651728/ and https://review.openstack.org/#/c/651725/

Changed in tripleo:
milestone: stein-rc1 → train-1
Cédric Jeanneret (cjeanner) wrote :

Soooo. I've spent some more time on this issue.

There are a couple of things:

- the script itself is wrong: there's a typo in it (a missing "$" in the "for" loop)
- apparently something is preventing lsof from seeing the unix sockets used by the ovn-northd service

While the first issue is easy to correct, the second one is a bit trickier.
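
Editor's sketch of the first point: the healthcheck script itself is not quoted in this report, so the variable name `sockets` below is an assumption (the two socket paths are taken from the ovn-northd command line shown later in this comment). The repeated word in the journal message above ("connected to socket sockets") is at least consistent with a loop that iterates over the literal word instead of the variable's value.

```shell
#!/bin/sh
# Hypothetical reconstruction; the variable name is an assumption.
sockets="/run/openvswitch/ovnnb_db.sock /run/openvswitch/ovnsb_db.sock"

# Buggy form: without "$", the loop runs once over the literal word "sockets".
buggy_loop() {
    for s in sockets; do
        echo "checking $s"
    done
}

# Fixed form: $sockets is expanded (unquoted on purpose) into one word per path.
fixed_loop() {
    for s in $sockets; do
        echo "checking $s"
    done
}

buggy_loop
fixed_loop
```

The buggy loop never looks at the real socket paths, so any check inside it is run against a file literally named "sockets".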

We can see the following lsof output:
()[root@undercloud /]# lsof -c ovn-northd -Ua
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ovn-north 8 root 4u unix 0xffff9ac671751c00 0t0 753132 socket
ovn-north 8 root 8u unix 0xffff9ac671752000 0t0 753134 /var/run/openvswitch/ovn-northd.8.ctl
ovn-north 8 root 9u unix 0xffff9ac671750400 0t0 753135 socket
ovn-north 8 root 10u unix 0xffff9ac671756000 0t0 753136 socket

This shows some "socket" entries, but without pointing to the actual files.

I tried looking directly in /proc, and here's what I found:
()[root@undercloud /]# ls /proc/8/fd -l
total 0
lrwx------. 1 root root 64 Apr 16 09:18 0 -> /dev/null
l-wx------. 1 root root 64 Apr 16 09:18 1 -> pipe:[759993]
lrwx------. 1 root root 64 Apr 16 09:18 10 -> socket:[753136]
l-wx------. 1 root root 64 Apr 16 09:18 2 -> pipe:[759994]
l-wx------. 1 root root 64 Apr 16 09:18 3 -> /var/log/openvswitch/ovn-northd.log
lrwx------. 1 root root 64 Apr 16 09:18 4 -> socket:[753132]
lrwx------. 1 root root 64 Apr 16 09:18 5 -> /run/openvswitch/ovn-northd.pid
lr-x------. 1 root root 64 Apr 16 09:18 6 -> pipe:[753133]
l-wx------. 1 root root 64 Apr 16 09:18 7 -> pipe:[753133]
lrwx------. 1 root root 64 Apr 16 09:18 8 -> socket:[753134]
lrwx------. 1 root root 64 Apr 16 09:18 9 -> socket:[753135]

All the pipe and socket entries are shown in red, which usually means "broken symlink", although I'm not sure it has the same meaning for this kind of entry.

Is there a way to ensure that ovn-northd is actually working as expected? The running command points to the sockets, but lsof doesn't show that:
/usr/bin/ovn-northd -vconsole:emer -vsyslog:err -vfile:info --ovnnb-db=unix:/run/openvswitch/ovnnb_db.sock --ovnsb-db=unix:/run/openvswitch/ovnsb_db.sock [...]

Cheers,

C.

Cédric Jeanneret (cjeanner) wrote :

Some more comments:

- using ss, we can find some useful information about the raw sockets. For instance, given that the "ovn-northd" PID is 8:
()[root@undercloud /]# ss -pex | grep 'pid=8'
u_str ESTAB 0 0 * 753135 * 760167 users:(("ovn-northd",pid=8,fd=9)) <->
u_str ESTAB 0 0 * 753136 * 761859 users:(("ovn-northd",pid=8,fd=10)) <->

We get the socket inodes (753135 and 753136) as well as the file descriptor IDs (9 and 10). These can be matched against the contents of /proc/8/fd.

The main issue we are facing now is: there isn't any easy way to relate the inodes to the files inside the container. The sockets are apparently created from another container and shared via the host, and this creates some mixed-up FS voodoo that prevents us from easily matching inodes to actual files...
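
Editor's sketch: the fd-to-inode mapping can also be read straight from /proc without lsof or ss. The helper names `socket_inode` and `list_socket_fds` are hypothetical; the code only assumes `readlink` and the `socket:[inode]` link format visible in the /proc/8/fd listing above.

```shell
#!/bin/sh
# Extract the inode from a /proc/<pid>/fd link target such as "socket:[753136]".
# Prints nothing and returns 1 for targets that are not sockets.
socket_inode() {
    case "$1" in
        socket:\[*\])
            inode=${1#socket:\[}
            inode=${inode%\]}
            echo "$inode"
            ;;
        *)
            return 1
            ;;
    esac
}

# Print one "fd inode" pair for every socket the given PID holds open.
list_socket_fds() {
    for fd in /proc/"$1"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        inode=$(socket_inode "$target") || continue
        echo "${fd##*/} $inode"
    done
}
```

Running `list_socket_fds 8` inside the container should list the same fd and inode pairs that ss reports, but it still gives only inodes, not socket file paths.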

We might want to rethink the way we're testing ovn-northd container imho.

Anyone from Neutron? :)

Changed in tripleo:
assignee: Kamil Sambor (ksambor) → Cédric Jeanneret (cjeanner)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/652980

Changed in tripleo:
status: Triaged → In Progress
wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: In Progress → Fix Released
Cédric Jeanneret (cjeanner) wrote :

@wes: the real fix is being processed right now :). The healthcheck was faulty.

Changed in tripleo:
status: Fix Released → In Progress
tags: removed: promotion-blocker
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/652980
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=baa9fb8c4b2b993b4d5886dc014c92667727d369
Submitter: Zuul
Branch: master

commit baa9fb8c4b2b993b4d5886dc014c92667727d369
Author: Cédric Jeanneret <email address hidden>
Date: Tue Apr 16 13:34:03 2019 +0200

    Correct ovn-dbs health check

    We can't really check the socket presence in containers:
    - ovn doesn't really listen to them
    - playing with lsof and ss doesn't help, since there are some
      issues with inodes and overlays

    The new healthcheck ensures the service is properly running, and
    will fail if ovn-northd has an issue.

    Please note: the current STDERR has some output, this is due to
    some packaging issue being worked on right now.

    Change-Id: I645e18cf198a948479083df94b5d373ed92f2aae
    Closes-Bug: #1823882

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 10.7.0

This issue was fixed in the openstack/tripleo-common 10.7.0 release.
