Nonstop restarting ovn_metadata_haproxy container

Bug #1861694 reported by Sagi (Sergey) Shnaidman
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

It seems the ovn_metadata_haproxy-ovnmeta container is failing and restarting. This breaks the "docker stats" command, which freezes and causes log collection to fail.

https://logserver.rdoproject.org/84/704384/4/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/884042e/logs/overcloud-novacompute-0/var/log/extra/docker/containers/ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87/docker_info.log.txt.gz

+ docker exec ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87 top -bwn1
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

+ docker exec ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87 bash -c '$(command -v dnf || command -v yum) list installed'
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped
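For illustration only (not from the report; the container name is copied from the log path above), one quick way to confirm the restart loop on the compute node would be something like:

name=ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87
# Status, exit code and restart count of the suspect sidecar container.
docker inspect -f 'status={{.State.Status}} exit={{.State.ExitCode}} restarts={{.RestartCount}}' "$name"
# Recent start/die events for the same container (streams until Ctrl-C).
docker events --filter "container=$name" --since 30m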

wes hayutin (weshayutin)
tags: added: promotion-blocker
Revision history for this message
Marios Andreou (marios-b) wrote :

openstack/ansible-role-collect-logs master
"Fix logs collection with new docker" https://review.opendev.org/#/c/705446/

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

No idea about this, but since it's a sidecar container, could it be a race condition between the command execution and the container exiting?

The message says "cannot exec a container that has run and stopped"

Whenever the instances on that compute node have been shut down, the ovn_metadata_haproxy sidecar container will be stopped, so it could happen that by the time TripleO executes the command in this bug, the container is no longer running because the VMs went away. Is that a possibility?
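Purely as an illustrative sketch of that race (container name taken from the logs above; a check-then-exec is still not atomic, it only narrows the window and tolerates the failure):

name=ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87
# Only exec if the sidecar is still running, and do not abort log
# collection if it stops between the check and the exec.
if [ "$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)" = "true" ]; then
    docker exec "$name" top -bwn1 || echo "WARN: $name stopped while/just before exec ran"
fi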

Revision history for this message
yatin (yatinkarel) wrote :

Also see https://logserver.rdoproject.org/84/704384/4/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/884042e/logs/overcloud-novacompute-0/var/log/containers/neutron/kill-script.log

and from journalctl https://logserver.rdoproject.org/84/704384/4/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/884042e/logs/overcloud-novacompute-0/var/log/journal.txt.gz, pasting below the entries for an ovn-haproxy container that failed in the kill-script above.

Sending kill signal 15 to container ca89f8d4361a
Container ca89f8d4361a failed to exit within 10 seconds of signal 15 - using the force
Sending kill signal 9 to container ca89f8d4361a6b2d53062ed36bcfc73e0df6e134f988715341e42f981686b713
container kill failed because of 'container not found' or 'no such process': Cannot kill container ca89f8d4361a: rpc error: code = 2 desc = no such process"
Container ca89f8d4361a failed to exit within 10 seconds of kill - trying direct SIGKILL
Handler for DELETE /v1.26/containers/ca89f8d4361a returned error: You cannot remove a running container ca89f8d4361a. Stop the container before attempting removal or use -f

Also, the new docker (1.13.1-108) contains the following changes: https://git.centos.org/rpms/docker/c/b9462ce5493d0b79ecbf22f54a5a65cc40e18bb3?branch=c7-extras. Some of them look related, but I couldn't figure out what exactly is causing the issue.
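For readability, the escalation in the journal excerpt above roughly corresponds to this sequence (a hand-written sketch, not the actual neutron kill-script source):

cid=ca89f8d4361a                        # short container id from the log
docker kill --signal 15 "$cid"          # "Sending kill signal 15 to container ..."
timeout 10 docker wait "$cid" \
  || docker kill --signal 9 "$cid"      # "failed to exit within 10 seconds of signal 15 - using the force"
timeout 10 docker wait "$cid" \
  || kill -9 "$(docker inspect -f '{{.State.Pid}}' "$cid")"   # "trying direct SIGKILL"
docker rm "$cid"                        # the DELETE that then fails with "cannot remove a running container"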

Revision history for this message
Brent Eagles (beagles) wrote :

Also interesting is that around the same time there is this:

Feb 03 13:23:01 overcloud-novacompute-0 sync[33860]: + docker stop ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.454435942Z" level=debug msg="Calling GET /_ping"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.454820582Z" level=debug msg="Unable to determine container for /"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.455276280Z" level=debug msg="{Action=_ping, LoginUID=4294967295, PID=33877}"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.456129466Z" level=debug msg="Calling POST /v1.26/containers/ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87/stop"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.456156694Z" level=debug msg="Unable to determine container for ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.456222227Z" level=debug msg="{Action=stop, LoginUID=4294967295, PID=33877}"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.456335045Z" level=error msg="Handler for POST /v1.26/containers/ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87/stop returned error: No such container: ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87"

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

That may also be related to the upgraded docker 1.13, see https://bugzilla.redhat.com/show_bug.cgi?id=1793455

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

(it seems the docker upgrade from 1.13.1-104 to 1.13.1-108 has brought some regressions)

Revision history for this message
Marios Andreou (marios-b) wrote :

This bug is also discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1793455 for anyone that is interested.

At least *upstream*, today we are not seeing this any more. Looking at https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001 there are 2/3 errors, but they are not related to this bug.

Does anyone have a pointer to logs showing this still happening in upstream CI? Or do you know if patches were merged to fix this upstream?

I am tempted to move this to Fix Released; as per my comment above, I cannot find a log of this bug today. It has been more than 10 days since the last update here, leading me to believe it is no longer happening. Am I wrong?

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

We don't see it upstream because we removed the command that triggered the bug conditions: https://review.opendev.org/#/c/705446/
But that only unblocked the jobs; it didn't solve the underlying issue.

Revision history for this message
Marios Andreou (marios-b) wrote :

Thanks sshnaidm & ykarel. As per IRC and comment #8, the patch https://review.opendev.org/#/c/705446/ is a workaround. Apparently the network team is working on this; we'll update once there is something new.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Revision history for this message
wes hayutin (weshayutin) wrote :

I have not seen this in the C8 Ussuri jobs.
Please reopen if someone spots the issue again.

Changed in tripleo:
status: Triaged → Fix Released