Nonstop restarting ovn_metadata_haproxy container

Bug #1861694 reported by Sagi (Sergey) Shnaidman
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

It seems the ovn_metadata_haproxy-ovnmeta container is failing and restarting. This breaks the "docker stats" command, which freezes and causes log collection to fail.

https://logserver.rdoproject.org/84/704384/4/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/884042e/logs/overcloud-novacompute-0/var/log/extra/docker/containers/ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87/docker_info.log.txt.gz

+ docker exec ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87 top -bwn1
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

+ docker exec ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87 bash -c '$(command -v dnf || command -v yum) list installed'
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped
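For illustration only (not from the report; the container name is copied from the log path above), one quick way to confirm the restart loop on the compute node would be something like:

name=ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87
# Status, exit code and restart count of the suspect sidecar container.
docker inspect -f 'status={{.State.Status}} exit={{.State.ExitCode}} restarts={{.RestartCount}}' "$name"
# Recent start/die events for the same container (streams until Ctrl-C).
docker events --filter "container=$name" --since 30m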

wes hayutin (weshayutin)
tags: added: promotion-blocker
Revision history for this message
Marios Andreou (marios-b) wrote :

openstack/ansible-role-collect-logs master
"Fix logs collection with new docker" https://review.opendev.org/#/c/705446/

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

No idea about this, but since it's a sidecar container, could it be a race condition between the command execution and the container exiting?

The message says "cannot exec a container that has run and stopped"

Whenever the instances on that compute node have been shut down, the ovn_metadata_haproxy sidecar container will be stopped, so it could happen that by the time TripleO executes the command in this bug, the container is no longer running because the VMs went away. Is that a possibility?
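Purely as an illustrative sketch of that race (container name taken from the logs above; a check-then-exec is still not atomic, it only narrows the window and tolerates the failure):

name=ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87
# Only exec if the sidecar is still running, and do not abort log
# collection if it stops between the check and the exec.
if [ "$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)" = "true" ]; then
    docker exec "$name" top -bwn1 || echo "WARN: $name stopped while/just before exec ran"
fi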

Revision history for this message
yatin (yatinkarel) wrote :

Also see https://logserver.rdoproject.org/84/704384/4/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/884042e/logs/overcloud-novacompute-0/var/log/containers/neutron/kill-script.log

and from journalctl https://logserver.rdoproject.org/84/704384/4/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/884042e/logs/overcloud-novacompute-0/var/log/journal.txt.gz, pasting below the entries for an ovn-haproxy container that failed in the kill-script above.

Sending kill signal 15 to container ca89f8d4361a
Container ca89f8d4361a failed to exit within 10 seconds of signal 15 - using the force
Sending kill signal 9 to container ca89f8d4361a6b2d53062ed36bcfc73e0df6e134f988715341e42f981686b713
container kill failed because of 'container not found' or 'no such process': Cannot kill container ca89f8d4361a: rpc error: code = 2 desc = no such process"
Container ca89f8d4361a failed to exit within 10 seconds of kill - trying direct SIGKILL
Handler for DELETE /v1.26/containers/ca89f8d4361a returned error: You cannot remove a running container ca89f8d4361a. Stop the container before attempting removal or use -f

Also, the new docker (1.13.1-108) contains the following changes: https://git.centos.org/rpms/docker/c/b9462ce5493d0b79ecbf22f54a5a65cc40e18bb3?branch=c7-extras. Some of them look related, but I couldn't figure out what exactly is causing the issue.
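For readability, the escalation in the journal excerpt above roughly corresponds to this sequence (a hand-written sketch, not the actual neutron kill-script source):

cid=ca89f8d4361a                        # short container id from the log
docker kill --signal 15 "$cid"          # "Sending kill signal 15 to container ..."
timeout 10 docker wait "$cid" \
  || docker kill --signal 9 "$cid"      # "failed to exit within 10 seconds of signal 15 - using the force"
timeout 10 docker wait "$cid" \
  || kill -9 "$(docker inspect -f '{{.State.Pid}}' "$cid")"   # "trying direct SIGKILL"
docker rm "$cid"                        # the DELETE that then fails with "cannot remove a running container"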

Revision history for this message
Brent Eagles (beagles) wrote :

Also interesting is that around the same time there is this:

Feb 03 13:23:01 overcloud-novacompute-0 sync[33860]: + docker stop ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.454435942Z" level=debug msg="Calling GET /_ping"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.454820582Z" level=debug msg="Unable to determine container for /"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.455276280Z" level=debug msg="{Action=_ping, LoginUID=4294967295, PID=33877}"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.456129466Z" level=debug msg="Calling POST /v1.26/containers/ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87/stop"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.456156694Z" level=debug msg="Unable to determine container for ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.456222227Z" level=debug msg="{Action=stop, LoginUID=4294967295, PID=33877}"
Feb 03 13:23:01 overcloud-novacompute-0 dockerd-current[14548]: time="2020-02-03T13:23:01.456335045Z" level=error msg="Handler for POST /v1.26/containers/ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87/stop returned error: No such container: ovn_metadata_haproxy-ovnmeta-f8628e8b-9d86-4793-befb-6166e3fbeb87"

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

That may also be related to the upgraded docker 1.13, see https://bugzilla.redhat.com/show_bug.cgi?id=1793455

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

(it seems the docker upgrade from 1.13.1-104 to 1.13.1-108 has brought some regressions)

Revision history for this message
Marios Andreou (marios-b) wrote :

This bug is also discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1793455 for anyone that is interested.

At least *upstream*, today we are not seeing this any more. Looking at https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001 there are 2/3 errors, but they are not related to this bug.

Does anyone have a pointer to logs showing this still happening in upstream CI? Or do you know if patches were merged to fix this upstream?

I am tempted to move this to Fix Released; as per my comment above, I cannot find a log of this bug today. It has been more than 10 days since the last update here, leading me to believe it is no longer happening. Am I wrong?

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

We don't see it upstream because we removed the command that triggered the bug conditions: https://review.opendev.org/#/c/705446/
But that only unblocked the jobs; it didn't solve the underlying issue.

Revision history for this message
Marios Andreou (marios-b) wrote :

Thanks sshnaidm & ykarel. As per IRC and comment #8, the patch https://review.opendev.org/#/c/705446/ is a workaround. Apparently the network team is working on this; we'll update once there is something new.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Revision history for this message
wes hayutin (weshayutin) wrote :

I have not seen this in the C8 Ussuri jobs.
Please reopen if someone spots the issue again.

Changed in tripleo:
status: Triaged → Fix Released