Promoter: running more than one promotion at once is causing overload to the docker dm subsystem

Bug #1765084 reported by Gabriele Cerami
This bug affects 1 person
Affects   Status      Importance   Assigned to       Milestone
tripleo   Won't Fix   High         Gabriele Cerami

Bug Description

On the promoter server, when more than one promotion process is running, the logs show errors such as read timeouts while removing images and missing layers while searching for images. For example:

failed: [localhost] (item=[u'etcd', u'f106094e961c5ab430687d673063baee379f6bbd_310b64d1']) => {"changed": false, "item": ["etcd", "f106094e961c5ab430687d673063baee379f6bbd_310b64d1"], "msg": "Error removing image docker.io/tripleomaster/centos-binary-etcd:f106094e961c5ab430687d673063baee379f6bbd_310b64d1 - UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)"}

failed: [localhost] (item=[u'd742e07492138edcec140300ceec9ca0b991ba35_3ef7f2f8', u'mistral-api']) => {"changed": false, "item": ["d742e07492138edcec140300ceec9ca0b991ba35_3ef7f2f8", "mistral-api"], "msg": "Error searching for image docker.io/tripleoqueens/centos-binary-mistral-api - 500 Server Error: Internal Server Error (\"{\"message\":\"layer does not exist\"}\")"}

we then see

[4147491.688016] XFS (dm-1): Ending clean mount
[4147491.808620] XFS (dm-1): Unmounting Filesystem
[4147623.826142] device-mapper: thin: Deletion of thin device 38037 failed.
[4147623.837952] device-mapper: ioctl: remove_all left 1 open device(s)
[4148168.722891] device-mapper: thin: Deletion of thin device 38448 failed.

in dmesg, and

Apr 18 13:44:13 promoter-server.rdocloud systemd-udevd[2696]: inotify_add_watch(7, /dev/dm-1, 10) failed: No such file or directory
Apr 18 13:44:14 promoter-server.rdocloud systemd-udevd[2696]: inotify_add_watch(7, /dev/dm-1, 10) failed: No such file or directory

in the journal

We have had many promotions in the past two days, and sometimes two releases promote at the same time, each pushing and pulling all of its containers. We suspect this concurrent activity is causing I/O overload on the server.
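
To see how often the thin-device deletion failures occur, the dmesg output can be scanned for the message shown above. This is a minimal sketch, assuming the kernel message keeps the exact wording quoted in this report; the function name is hypothetical.

```python
import re

# Pattern taken from the dmesg lines quoted in this report.
THIN_FAILURE = re.compile(
    r"device-mapper: thin: Deletion of thin device (\d+) failed"
)


def failed_thin_devices(dmesg_lines):
    """Return the thin device ids whose deletion failed, in log order."""
    ids = []
    for line in dmesg_lines:
        match = THIN_FAILURE.search(line)
        if match:
            ids.append(int(match.group(1)))
    return ids
```

Feeding it the lines captured while two promotions overlap, and again while only one runs, would show whether the failures correlate with concurrency.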

Changed in tripleo:
assignee: nobody → Gabriele Cerami (gcerami)
Gabriele Cerami (gcerami) wrote :

Uploaded a patch that allows only a single promotion to run at a time:
https://review.rdoproject.org/r/13429
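
One common way to serialize promotions is to take an exclusive file lock before a promotion starts. The following is a minimal sketch under that assumption and is not the content of the patch linked above; the lock path and the `promote` callable are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def promotion_lock(lock_path):
    """Hold an exclusive, non-blocking lock for the duration of one promotion."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes a second promoter fail fast instead of queueing up.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def run_single_promotion(promote, lock_path="/tmp/promoter.lock"):
    """Run promote() only if no other promotion holds the lock.

    Returns True if the promotion ran, False if another one was in progress.
    """
    try:
        with promotion_lock(lock_path):
            promote()
            return True
    except BlockingIOError:
        return False
```

Failing fast rather than blocking means a skipped promotion simply waits for the next scheduled run, which avoids a backlog of waiting processes.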

Alan Pevec (apevec) wrote :

Is the overlay2 storage driver not in use?
TripleO switched to overlay2 in Pike: https://review.openstack.org/451916
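
To confirm which storage driver the Docker daemon on a host is using, `docker info --format '{{.Driver}}'` prints just the driver name. Below is a small sketch wrapping that call; the function names and the injectable `run` parameter are illustrative assumptions.

```python
import subprocess


def docker_storage_driver(run=subprocess.check_output):
    """Return the storage driver reported by the local Docker daemon."""
    out = run(["docker", "info", "--format", "{{.Driver}}"])
    return out.decode().strip()


def uses_overlay2(run=subprocess.check_output):
    """True if the daemon reports overlay2 as its storage driver."""
    return docker_storage_driver(run) == "overlay2"
```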

Gabriele Cerami (gcerami) wrote :

Nice, we can try changing the driver. The promoter server has a dedicated partition for the container images, so it's easy to switch to other drivers at will.
What worries me is that this may also be a race condition on the layers.
If different releases use the same layers, deleting an image during cleanup may delete a layer that another image still uses, because the two concurrent operations cannot establish that the layer is in use.
We'll experiment with the storage drivers at this point and see where it leads.
I'll change the commit message in the change and mark it as partial.
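
The shared-layer concern can be illustrated with a toy model: a layer may only be deleted when no remaining image references it, and the check and the removal must happen under the same lock. This is only an illustration of the suspected race, not how Docker's storage drivers are implemented; all names are hypothetical.

```python
import threading


class LayerStore:
    """Toy model of images sharing layers; not Docker's actual implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._images = {}  # image name -> set of layer ids

    def add_image(self, name, layers):
        with self._lock:
            self._images[name] = set(layers)

    def remove_image(self, name):
        """Remove an image and return the layers that became unreferenced."""
        with self._lock:
            layers = self._images.pop(name, set())
            still_used = set()
            for other in self._images.values():
                still_used |= other
            # Only layers that no other image uses are safe to delete.
            return layers - still_used

    def layers(self):
        """Return every layer still referenced by at least one image."""
        with self._lock:
            result = set()
            for layer_set in self._images.values():
                result |= layer_set
            return result
```

Without the lock, two cleanups running concurrently could each compute the in-use set from a stale view and delete a layer the other still needs, which would match the "layer does not exist" error above.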

Changed in tripleo:
milestone: rocky-1 → rocky-2
Matt Young (halcyondude)
Changed in tripleo:
status: Triaged → Won't Fix