Hello there,
From time to time, I see the nova_migration_target container healthcheck failing for no obvious reason.
Roughly one check out of two fails:
Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: Starting nova_migration_target healthcheck...
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ : 10
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ : curl-healthcheck
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ : '\n%{http_code}' '%{remote_ip}:%{remote_port}' '%{time_total}' 'seconds\n'
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ : /dev/null
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + ps -e
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + grep nova-compute
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + process=sshd
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + args=2022
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + test_func=healthcheck_listen
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + err_msg='There is no sshd process listening on port(s) 2022 in the container'
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + healthcheck_listen sshd 2022
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + process=sshd
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + shift 1
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + args=2022
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + ports=2022
Feb 17 16:12:31 overcloud-0-novacompute-0 podman[63200]: 2020-02-17 16:12:31.558264431 +0000 UTC m=+0.253246818 container exec 509760be1073f5283dbe574451d667db8d7b7578483da3935273c3f132e6f642 (image=undercloud.ctlplane:8787/tripleomaster/centos-binary-nova-compute:current-tripleo, name=nova_migration_target)
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: ++ pgrep -d '|' -f sshd
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + pids=9
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + ss -lnp
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + grep -qE ':(2022).*,pid=(9),'
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: There is no sshd process listening on port(s) 2022 in the container
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + echo There is no sshd process listening on 'port(s)' 2022 in the container
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: + exit 1
Feb 17 16:12:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63200]: Error: non zero exit code: 1: OCI runtime error
Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: tripleo_nova_migration_target_healthcheck.service: main process exited, code=exited, status=1/FAILURE
Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: Failed to start nova_migration_target healthcheck.
Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: Unit tripleo_nova_migration_target_healthcheck.service entered failed state.
Feb 17 16:12:31 overcloud-0-novacompute-0 systemd[1]: tripleo_nova_migration_target_healthcheck.service failed.
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: Starting nova_migration_target healthcheck...
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ : 10
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ : curl-healthcheck
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ : '\n%{http_code}' '%{remote_ip}:%{remote_port}' '%{time_total}' 'seconds\n'
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ : /dev/null
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + ps -e
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + grep nova-compute
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + process=sshd
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + args=2022
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + test_func=healthcheck_listen
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + err_msg='There is no sshd process listening on port(s) 2022 in the container'
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + healthcheck_listen sshd 2022
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + process=sshd
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + shift 1
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + args=2022
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + ports=2022
Feb 17 16:14:31 overcloud-0-novacompute-0 podman[63583]: 2020-02-17 16:14:31.51596223 +0000 UTC m=+0.202973980 container exec 509760be1073f5283dbe574451d667db8d7b7578483da3935273c3f132e6f642 (image=undercloud.ctlplane:8787/tripleomaster/centos-binary-nova-compute:current-tripleo, name=nova_migration_target)
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: ++ pgrep -d '|' -f sshd
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + pids=9
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + ss -lnp
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + grep -qE ':(2022).*,pid=(9),'
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + echo There is no sshd process listening on 'port(s)' 2022 in the container
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: There is no sshd process listening on port(s) 2022 in the container
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: + exit 1
Feb 17 16:14:31 overcloud-0-novacompute-0 healthcheck_nova_migration_target[63583]: Error: non zero exit code: 1: OCI runtime error
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: tripleo_nova_migration_target_healthcheck.service: main process exited, code=exited, status=1/FAILURE
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: Failed to start nova_migration_target healthcheck.
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: Unit tripleo_nova_migration_target_healthcheck.service entered failed state.
Feb 17 16:14:31 overcloud-0-novacompute-0 systemd[1]: tripleo_nova_migration_target_healthcheck.service failed.
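For reference, here is what the failing check boils down to, as a minimal sketch reconstructed from the trace above (the actual helper shipped in the healthcheck scripts may differ slightly):

healthcheck_listen() {
    process=$1
    shift 1
    args="$@"
    ports=${args// /|}                  # e.g. "2022 2023" -> "2022|2023"
    pids=$(pgrep -d '|' -f "$process")  # e.g. "9|42" if several match
    # Succeeds only if ss shows one of those pids listening on one of the ports.
    ss -lnp | grep -qE ":($ports).*,pid=($pids),"
}

if ! healthcheck_listen sshd 2022; then
    echo "There is no sshd process listening on port(s) 2022 in the container"
    exit 1
fi

In other words, the whole check hinges on pgrep and ss agreeing, at the exact moment the check runs, on the pid of the process listening on the port.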
My environment is currently running a soon-to-be-merged patch[1] aimed at squashing unreliable behaviour.
Apparently, that patch uncovered an issue, probably in the way we check which ports a process is listening on. So far I haven't seen this behaviour in any other container using the same check, but that might just be luck or timing.
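If the port check is indeed racy, comparing its two halves inside the container at failure time should expose the mismatch. A quick, purely hypothetical diagnostic using the same commands the healthcheck runs:

# Run inside the nova_migration_target container, e.g. via podman exec.
pids=$(pgrep -d '|' -f sshd)
echo "pgrep -f sshd matched pid(s): ${pids:-none}"
# What ss actually reports as listening on the migration port:
ss -lnp | grep ':2022' || echo "nothing listening on :2022"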
Marking this as Incomplete for now, since I need to track down more evidence - I'll update it ASAP.
[1] https://review.opendev.org/707799
I need to dig a bit more into this in order to fetch logs and compare containers, timing, and the like.
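Gathering that should mostly be a matter of standard systemd tooling on the compute node, for instance (assuming the usual tripleo_<name>_healthcheck unit/timer naming seen in the journal above):

# All runs of this healthcheck, with timestamps:
journalctl -u tripleo_nova_migration_target_healthcheck.service
# Scheduling of every healthcheck timer, to compare containers and timing:
systemctl list-timers 'tripleo_*_healthcheck.timer'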