Nova conductor container fails on healthcheck

Bug #1843555 reported by Sagi (Sergey) Shnaidman
This bug affects 4 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Emilien Macchi
Milestone: (none listed)

Bug Description

Periodic standalone jobs in master fail because of failing nova-conductor healthcheck:

In periodic-tripleo-ci-centos-7-scenario003-standalone-master: http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-scenario003-standalone-master/c6e8e80/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

TASK [Fail if nova-conductor healthcheck report failed status] *****************
2019-09-11 04:52:09 | fatal: [standalone]: FAILED! => {"changed": false, "msg": "nova-conductor isn't working (healthcheck failed)"}

In standalone upgrade job: http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-standalone-upgrade-master/1f38483/logs/undercloud/home/zuul/standalone_upgrade.log.txt.gz

2019-09-11 02:57:21 | 2019-09-11 02:57:21.970 76995 WARNING tripleoclient.v1.tripleo_upgrade.Upgrade [-] TASK [Fail if nova-conductor healthcheck report failed status] *****************
2019-09-11 02:57:22 | 2019-09-11 02:57:22.004 76995 WARNING tripleoclient.v1.tripleo_upgrade.Upgrade [-] fatal: [standalone]: FAILED! => {"changed": false, "msg": "nova-conductor isn't working (healthcheck failed)"}

Changed in tripleo:
importance: Undecided → Critical
Revision history for this message
Oliver Walsh (owalsh) wrote :

IIUC https://github.com/openstack/paunch/blob/b33aeea9728233aca852a3e132f23fca71ac42df/paunch/utils/systemd.py#L238 sets OnActiveSec=120, so the first healthcheck does not run until 2 minutes after the service starts. The validation tasks need to account for this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/681442

Changed in tripleo:
assignee: nobody → Oliver Walsh (owalsh)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.opendev.org/681525

Changed in tripleo:
assignee: Oliver Walsh (owalsh) → Emilien Macchi (emilienm)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/681695

tags: added: queens-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/681525
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=32832187434406d766296305e6a4ba7fc8ba9138
Submitter: Zuul
Branch: master

commit 32832187434406d766296305e6a4ba7fc8ba9138
Author: Emilien Macchi <email address hidden>
Date: Wed Sep 11 11:49:41 2019 -0400

    healthcheck_port: run ss -ntp with sudo

    The output of "ss" is different if you run it as root, or as the user
    which is used to execute the process we want to monitor.

    e.g. nova-conductor (check reported bug)

    ()[root@undercloud /]$ ss -ntp | grep -E ":($ports).*,pid=($pids),"
    (empty output)

    ()[root@undercloud /]$ sudo -u nova ss -ntp | grep -E ":($ports).*,pid=($pids),"
    ESTAB 192.168.24.1:56959 192.168.24.3:3306 users:(("nova-conductor",pid=25,fd=7))
    ESTAB 192.168.24.1:46860 192.168.24.3:3306 users:(("nova-conductor",pid=26,fd=7))
    ESTAB 192.168.24.1:55918 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=8))
    ESTAB 192.168.24.1:56786 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=9))
    ESTAB 192.168.24.1:55920 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=8))
    ESTAB 192.168.24.1:57238 192.168.24.3:3306 users:(("nova-conductor",pid=25,fd=10))
    ESTAB 192.168.24.1:56840 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=9))
    ESTAB 192.168.24.1:35115 192.168.24.3:3306 users:(("nova-conductor",pid=26,fd=10))

    (output was simplified for the commit message)

    So the idea of this patch is to introduce a new function,
    get_user_from_process() which will figure out what user runs the
    process, by using pgrep and ps.

    More information about how ps is invoked is documented in the code, but to
    make it safer we grep the pid AND cmd to get accurate information.

    Then later in healthcheck_port, use the new function to figure out which
    user is running the process, then run the "ss" with "sudo -u" to get the
    accurate output and know if the process is actually connected to the
    port that we want.

    Change-Id: I7be514832fc7af8dbcfbafe15b2425db8dcfe3c7
    Closes-Bug: #1843555
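
The grep filter quoted in the commit message above can be sketched in Python to show what it matches. This is only an illustration, not the tripleo-common implementation: `match_connections` is a hypothetical helper, and the sample input is two lines of the simplified `ss` output from the commit message.

```python
import re


def match_connections(ss_output, ports, pids):
    """Return ss lines where one of `pids` has a socket on one of `ports`.

    Mirrors the filter from the commit message:
        grep -E ":($ports).*,pid=($pids),"
    """
    port_alt = "|".join(str(p) for p in ports)
    pid_alt = "|".join(str(p) for p in pids)
    pattern = re.compile(r":(%s).*,pid=(%s)," % (port_alt, pid_alt))
    return [line for line in ss_output.splitlines() if pattern.search(line)]


SAMPLE = """\
ESTAB 192.168.24.1:56959 192.168.24.3:3306 users:(("nova-conductor",pid=25,fd=7))
ESTAB 192.168.24.1:55918 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=8))
"""

# An empty list corresponds to the "(empty output)" case, i.e. a failed check.
matches = match_connections(SAMPLE, ports=[3306], pids=[25, 26])
```

If the process column is missing from the `ss` output (as happened when it was run as root here), the `,pid=` part never matches and the result is empty, even though the connections exist.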

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/681752

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/681752
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=a27e7f04be81b2a9db86e7330bfd2e3597a6e8ee
Submitter: Zuul
Branch: master

commit a27e7f04be81b2a9db86e7330bfd2e3597a6e8ee
Author: Emilien Macchi <email address hidden>
Date: Thu Sep 12 09:42:20 2019 -0400

    healthcheck_port: run ss with both sudo & root as best effort

    Privileged containers are running under the system pid namespace which
    makes 'ss' output different from the container user used to run the
    process.

    e.g. nova-compute is run as nova on the overcloud, but the container is
    privileged, so the previous patch with sudo didn't help to fix the
    healthcheck. The 'ss' needs to be run as root, not as nova.

    In this patch we run the ss twice, once as root, once with sudo, run
    sort to make sure we get uniq output; then grep is as before.

    Change-Id: Ia2897a6be3e000a9594103502b716431baa615b1
    Co-Authored-By: Oliver Walsh <email address hidden>
    Related-Bug: #1843555
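
The "run ss twice, sort for unique output, then grep" approach described in this commit can be sketched as follows. This is a hedged illustration only: `merge_unique` is a hypothetical name, the two strings stand in for the output of `ss` run as root and via `sudo -u`, and no real command is executed.

```python
def merge_unique(root_output, user_output):
    """Combine two ss outputs and drop duplicate lines (like `sort -u`)."""
    lines = root_output.splitlines() + user_output.splitlines()
    return sorted(set(line for line in lines if line.strip()))


# Stand-ins for the two views; the first line is visible from both.
root_view = (
    'ESTAB 192.168.24.1:55918 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=8))\n'
)
user_view = (
    'ESTAB 192.168.24.1:55918 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=8))\n'
    'ESTAB 192.168.24.1:56959 192.168.24.3:3306 users:(("nova-conductor",pid=25,fd=7))\n'
)

merged = merge_unique(root_view, user_view)
```

Merging before filtering means the check passes if either view shows the expected connection, which is the "best effort" behaviour the commit title refers to.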

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/681953

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/stein)

Reviewed: https://review.opendev.org/681695
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=eeb91c4fc6b4c45f33cf8003c78e435e5ae6e1f7
Submitter: Zuul
Branch: stable/stein

commit eeb91c4fc6b4c45f33cf8003c78e435e5ae6e1f7
Author: Emilien Macchi <email address hidden>
Date: Wed Sep 11 11:49:41 2019 -0400

    healthcheck_port: run ss -ntp with sudo

    The output of "ss" is different if you run it as root, or as the user
    which is used to execute the process we want to monitor.

    e.g. nova-conductor (check reported bug)

    ()[root@undercloud /]$ ss -ntp | grep -E ":($ports).*,pid=($pids),"
    (empty output)

    ()[root@undercloud /]$ sudo -u nova ss -ntp | grep -E ":($ports).*,pid=($pids),"
    ESTAB 192.168.24.1:56959 192.168.24.3:3306 users:(("nova-conductor",pid=25,fd=7))
    ESTAB 192.168.24.1:46860 192.168.24.3:3306 users:(("nova-conductor",pid=26,fd=7))
    ESTAB 192.168.24.1:55918 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=8))
    ESTAB 192.168.24.1:56786 192.168.24.1:5672 users:(("nova-conductor",pid=26,fd=9))
    ESTAB 192.168.24.1:55920 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=8))
    ESTAB 192.168.24.1:57238 192.168.24.3:3306 users:(("nova-conductor",pid=25,fd=10))
    ESTAB 192.168.24.1:56840 192.168.24.1:5672 users:(("nova-conductor",pid=25,fd=9))
    ESTAB 192.168.24.1:35115 192.168.24.3:3306 users:(("nova-conductor",pid=26,fd=10))

    (output was simplified for the commit message)

    So the idea of this patch is to introduce a new function,
    get_user_from_process() which will figure out what user runs the
    process, by using pgrep and ps.

    More information about how ps is invoked is documented in the code, but to
    make it safer we grep the pid AND cmd to get accurate information.

    Then later in healthcheck_port, use the new function to figure out which
    user is running the process, then run the "ss" with "sudo -u" to get the
    accurate output and know if the process is actually connected to the
    port that we want.

    Change-Id: I7be514832fc7af8dbcfbafe15b2425db8dcfe3c7
    Closes-Bug: #1843555
    (cherry picked from commit 32832187434406d766296305e6a4ba7fc8ba9138)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/681989

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/stein)

Reviewed: https://review.opendev.org/681953
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=4754deabf31ea11ed79640d47072b70733ab4210
Submitter: Zuul
Branch: stable/stein

commit 4754deabf31ea11ed79640d47072b70733ab4210
Author: Emilien Macchi <email address hidden>
Date: Thu Sep 12 09:42:20 2019 -0400

    healthcheck_port: run ss with both sudo & root as best effort

    Privileged containers are running under the system pid namespace which
    makes 'ss' output different from the container user used to run the
    process.

    e.g. nova-compute is run as nova on the overcloud, but the container is
    privileged, so the previous patch with sudo didn't help to fix the
    healthcheck. The 'ss' needs to be run as root, not as nova.

    In this patch we run the ss twice, once as root, once with sudo, run
    sort to make sure we get uniq output; then grep is as before.

    Change-Id: Ia2897a6be3e000a9594103502b716431baa615b1
    Co-Authored-By: Oliver Walsh <email address hidden>
    Related-Bug: #1843555
    (cherry picked from commit a27e7f04be81b2a9db86e7330bfd2e3597a6e8ee)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/681442
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=c919f1b65b1b33291334464e8d9636db749f7a80
Submitter: Zuul
Branch: master

commit c919f1b65b1b33291334464e8d9636db749f7a80
Author: Oliver Walsh <email address hidden>
Date: Wed Sep 11 11:50:35 2019 +0100

    Wait for first healthcheck before running validation tasks

    The systemd healthcheck timer first triggers 120s after activation.
    The initial value for ExecMainStatus is 0, resulting in false positives if we
    check this too early.
    This change waits (up to 5 mins) for ExecMainPID to be set and the service to
    return to an inactive/failed state.

    Change-Id: Iad4ebb283a7a6559b6fffead4145cc9bbad45e4e
    Depends-On: Ia2897a6be3e000a9594103502b716431baa615b1
    Related-bug: #1843555
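
The false positive described in this commit can be illustrated with a small decision function. This is a sketch under assumptions, not the tripleo-heat-templates tasks: `parse_show` and `healthcheck_result` are hypothetical helpers, and the input is assumed to be KEY=VALUE text of the kind printed by `systemctl show` for the healthcheck service.

```python
def parse_show(output):
    """Parse KEY=VALUE lines (as printed by `systemctl show`) into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def healthcheck_result(props):
    """Return None if there is no result yet, else True (healthy) or False."""
    has_run = props.get("ExecMainPID", "0") != "0"
    finished = props.get("ActiveState") in ("inactive", "failed")
    if not (has_run and finished):
        # Too early: ExecMainStatus still holds its initial value of 0,
        # which would otherwise be read as a passing healthcheck.
        return None
    return props.get("ExecMainStatus") == "0"
```

A caller would poll until the result is no longer `None`, giving up after a timeout (5 minutes in the commit above).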

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/681989
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=0ca91f0d149c14982a2071e58e06f3ee71d5b3cd
Submitter: Zuul
Branch: master

commit 0ca91f0d149c14982a2071e58e06f3ee71d5b3cd
Author: Rabi Mishra <email address hidden>
Date: Fri Sep 13 13:26:16 2019 +0530

    healthcheck: List udp ports with ss

    Looks like octavia healthmanager uses UDP 5555 port.

    Also changes octavia-health-manager health check to use
    healthcheck_port rather than healthcheck_listen, as udp
    is a connectionless protocol.

    Change-Id: Id48d5efe17eeb1a524e280d8885e98cfe1a5577a
    Related-Bug: #1843555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/682230

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/stein)

Reviewed: https://review.opendev.org/682230
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=337dda3fb4def56916eb3757668bc821c7bf4e1f
Submitter: Zuul
Branch: stable/stein

commit 337dda3fb4def56916eb3757668bc821c7bf4e1f
Author: Rabi Mishra <email address hidden>
Date: Fri Sep 13 13:26:16 2019 +0530

    healthcheck: List udp ports with ss

    Looks like octavia healthmanager uses UDP 5555 port.

    Also changes octavia-health-manager health check to use
    healthcheck_port rather than healthcheck_listen, as udp
    is a connectionless protocol.

    Change-Id: Id48d5efe17eeb1a524e280d8885e98cfe1a5577a
    Related-Bug: #1843555
    (cherry picked from commit 0ca91f0d149c14982a2071e58e06f3ee71d5b3cd)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 11.2.0

This issue was fixed in the openstack/tripleo-common 11.2.0 release.

Revision history for this message
Michele Baldessari (michele) wrote :

I still see this: https://openstack.fortnebula.com:13808/v1/AUTH_e8fd161dc34c421a979a9e6421f823e9/zuul_opendev_logs_2f7/669847/24/check/tripleo-ci-centos-7-scenario001-standalone/2f7d525/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

2019-09-25 20:33:24 | TASK [Fail if nova-metadata healthcheck report failed status] ******************
2019-09-25 20:33:24 | Wednesday 25 September 2019 20:33:24 +0000 (0:00:00.810) 1:50:18.324 ***
2019-09-25 20:33:24 | fatal: [standalone]: FAILED! => {"changed": false, "msg": "nova-metadata isn't working (healthcheck failed)"}
2019-09-25 20:33:24 |
2019-09-25 20:33:24 | NO MORE HOSTS LEFT *************************************************************
2019-09-25 20:33:25 |
2019-09-25 20:33:25 | PLAY RECAP *********************************************************************
2019-09-25 20:33:25 | standalone : ok=275 changed=167 unreachable=0 failed=1 skipped=231 rescued=0 ignored=1
2019-09-25 20:33:25 | undercloud : ok=128 changed=43 unreachable=0 failed=0 skipped=223 rescued=0 ignored=16

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-tripleoclient (master)

Fix proposed to branch: master
Review: https://review.opendev.org/685063

Changed in tripleo:
status: Fix Released → In Progress
Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Cédric Jeanneret (cjeanner)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note that in the other tripleo-ci-centos-7-scenario004-standalone CI job logs (running docker) there is no such issue with a delayed nova_api container start.

tags: added: alert
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Added the alert tag, as this also quite often times out patches in the gate.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

(maybe it is worth opening a dedicated bug for nova_api?)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/685691

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/685698

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/685691
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1761fc81c252e3dd565fe4f27e13f2c26426c806
Submitter: Zuul
Branch: master

commit 1761fc81c252e3dd565fe4f27e13f2c26426c806
Author: Oliver Walsh <email address hidden>
Date: Mon Sep 30 12:41:21 2019 +0100

    Temporarily disable nova inflight healthchecks to unblock the gate

    Change-Id: I8b687dcf7b36730a282e2091566a15a7ddc6fd23
    Related-bug: #1843555

Changed in tripleo:
assignee: Cédric Jeanneret (cjeanner) → Emilien Macchi (emilienm)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to python-tripleoclient (master)

Reviewed: https://review.opendev.org/685063
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=3b9041bb45627abacf6daf820ca01550769839ee
Submitter: Zuul
Branch: master

commit 3b9041bb45627abacf6daf820ca01550769839ee
Author: Emilien Macchi <email address hidden>
Date: Thu Sep 26 08:45:26 2019 -0400

    (actually) disable the inflight validations by default

    - undercloud_config: set no_validations to True by default, since we want
      to disable them by default now.

      Introduce --inflight-validations in order to get those particular
      validations independently.

    - tripleo_deploy (undercloud + standalone): add missing extra_args when
      running ansible.

    Co-Authored-By: Cédric Jeanneret <email address hidden>
    Closes-Bug: #1843555
    Change-Id: I95b8d7abc632b190ac6731393bd490bfa3aedcca

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Cédric Jeanneret (Tengu) (<email address hidden>) on branch: master
Review: https://review.opendev.org/685698
Reason: since this option will soon be deprecated, let's not rely on it.
Thanks Oliver for the digging and the heads-up!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/python-tripleoclient 12.3.0

This issue was fixed in the openstack/python-tripleoclient 12.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 10.8.2

This issue was fixed in the openstack/tripleo-common 10.8.2 release.
