Random nova_migration_target healthcheck error

Bug #1863635 reported by Cédric Jeanneret
This bug affects 2 people
Affects: tripleo
Status: Incomplete
Importance: Medium
Assigned to: Cédric Jeanneret

Bug Description

Hello there,

From time to time, I can see the nova_migration_target container healthcheck failing for no obvious reason.

Apparently, roughly one check out of two fails:

Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: Starting nova_migration_target healthcheck...
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ : 10
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ : curl-healthcheck
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ : '\n%{http_code}' '%{remote_ip}:%{remote_port}' '%{time_total}' 'seconds\n'
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ : /dev/null
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + ps -e
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + grep nova-compute
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + process=sshd
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + args=2022
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + test_func=healthcheck_listen
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + err_msg='There is no sshd process listening on port(s) 2022 in the container'
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + healthcheck_listen sshd 2022
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + process=sshd
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + shift 1
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + args=2022
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + ports=2022
Feb 17 16:12:31 overcloud-0-novacompute-0 podman[63200]: 2020-02-17 16:12:31.558264431 +0000 UTC m=+0.253246818 container exec 509760be1073f5283dbe574451d667db8d7b7578483da3935273c3f132e6f642 (image=undercloud.ctlplane:8787/tripleomaster/centos-binary-nova-compute:current-tripleo, name=nova_migration_target)
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ pgrep -d '|' -f sshd
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + pids=9
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + ss -lnp
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + grep -qE ':(2022).*,pid=(9),'
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: There is no sshd process listening on port(s) 2022 in the container
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + echo There is no sshd process listening on 'port(s)' 2022 in the container
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + exit 1
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: Error: non zero exit code: 1: OCI runtime error
Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: tripleo_nova_migration_target_healthcheck.service: main process exited, code=exited, status=1/FAILURE
Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: Failed to start nova_migration_target healthcheck.
Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: Unit tripleo_nova_migration_target_healthcheck.service entered failed state.
Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: tripleo_nova_migration_target_healthcheck.service failed.
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: Starting nova_migration_target healthcheck...
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ : 10
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ : curl-healthcheck
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ : '\n%{http_code}' '%{remote_ip}:%{remote_port}' '%{time_total}' 'seconds\n'
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ : /dev/null
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + ps -e
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + grep nova-compute
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + process=sshd
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + args=2022
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + test_func=healthcheck_listen
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + err_msg='There is no sshd process listening on port(s) 2022 in the container'
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + healthcheck_listen sshd 2022
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + process=sshd
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + shift 1
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + args=2022
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + ports=2022
Feb 17 16:14:31 overcloud-0-novacompute-0 podman[63583]: 2020-02-17 16:14:31.51596223 +0000 UTC m=+0.202973980 container exec 509760be1073f5283dbe574451d667db8d7b7578483da3935273c3f132e6f642 (image=undercloud.ctlplane:8787/tripleomaster/centos-binary-nova-compute:current-tripleo, name=nova_migration_target)
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ pgrep -d '|' -f sshd
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + pids=9
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + ss -lnp
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + grep -qE ':(2022).*,pid=(9),'
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + echo There is no sshd process listening on 'port(s)' 2022 in the container
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: There is no sshd process listening on port(s) 2022 in the container
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + exit 1
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: Error: non zero exit code: 1: OCI runtime error
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: tripleo_nova_migration_target_healthcheck.service: main process exited, code=exited, status=1/FAILURE
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: Failed to start nova_migration_target healthcheck.
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: Unit tripleo_nova_migration_target_healthcheck.service entered failed state.
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: tripleo_nova_migration_target_healthcheck.service failed.

My environment is currently running with a soon-to-be-merged patch[1] aimed at squashing unreliable healthcheck behaviour.

Apparently, that patch uncovered an issue, probably due to the way we check the ports in use. So far I haven't seen this weird behaviour in any other container using the same check, but that might just be luck or timing.

Marking this bug as Incomplete for now since I need to track down some more evidence - I'll update it ASAP.

[1] https://review.opendev.org/707799
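For reference, the failing step in the trace above boils down to building a pid alternation with pgrep and grepping for it in `ss -lnp` output. A minimal sketch of that matching logic (the function shape is reconstructed from the trace, not copied from the actual TripleO script, and the sample `ss` line is illustrative, not real output):

```shell
#!/bin/sh
# Sketch of the port check seen in the trace: pgrep produces a '|'-separated
# pid list, then the check greps `ss -lnp` for ':<port>...,pid=<pid>,'.
# The ss output is mocked with a sample line so the behaviour is reproducible.

check_line() {
    # $1: port alternation, $2: pid alternation, $3: one ss -lnp line
    echo "$3" | grep -qE ":($1).*,pid=($2),"
}

sample='LISTEN 0 128 *:2022 *:* users:(("sshd",pid=9,fd=3))'

check_line 2022 9 "$sample" && echo "match"      # pid 9 is listening on 2022
check_line 2022 12 "$sample" || echo "no match"  # pid not on the port -> failure path
```

This shows why the check is sensitive to the pid list pgrep returns: if pgrep ever yields a pid that `ss` does not report on port 2022 (and no pid that it does), the grep fails and the healthcheck exits 1 even though sshd may be listening.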

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Need to dig a bit more into this in order to fetch logs and compare containers, timing and things like that.

Changed in tripleo:
status: Triaged → Incomplete
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Revision history for this message
Marios Andreou (marios-b) wrote :

This is an automated action. Bug status has been set to 'Incomplete' and target milestone has been removed due to inactivity. If you disagree please re-set these values and reach out to us on freenode #tripleo

Changed in tripleo:
milestone: wallaby-3 → none