systemd wrappers (sidecars) locking doesn't really work

Bug #1874470 reported by Brent Eagles
This bug affects 1 person
Affects: tripleo
Status: Incomplete
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

IIUC, the wrapper that populates the processes file and the sync executed by the related systemd service on the host are supposed to share a file lock to prevent races on the processes file, because the wrapper appends to the file and the sync truncates it after it runs. However, the lock used in the wrapper is under /var/lock in the container, which is not shared with the host, so the sync script never waits for the wrapper to finish. Moving the lock file to a path on a shared mount in the container seems to solve that particular race.
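A minimal Python sketch of the locking protocol described above, assuming both sides serialize with flock(2) on a single lock file. The function names, file paths, and the one-entry-per-line format are hypothetical and are not taken from the actual wrapper or sync scripts. What it illustrates: flock only excludes processes that lock the same file, so a lock taken in the container's private /var/lock never blocks anything running on the host.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive flock(2) on lock_path for the duration of the block.

    flock only serializes processes that lock the *same* file, so lock_path
    must resolve to the same inode for the container-side wrapper and the
    host-side sync (e.g. a directory bind-mounted into the container).
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the other side unlocks
        yield
    finally:
        os.close(fd)  # closing the only descriptor releases the flock


def append_entry(processes_path, lock_path, entry):
    """Wrapper side (hypothetical): record one pending sidecar request."""
    with exclusive_lock(lock_path):
        with open(processes_path, "a") as f:
            f.write(entry + "\n")


def drain_entries(processes_path, lock_path):
    """Sync side (hypothetical): return all pending entries, then truncate.

    Reading and truncating under the same lock the wrapper takes means an
    entry is either returned here or left in the file for the next call;
    it cannot be truncated away unread.
    """
    with exclusive_lock(lock_path):
        try:
            with open(processes_path, "r+") as f:
                entries = [line for line in f.read().splitlines() if line.strip()]
                f.seek(0)
                f.truncate()
        except FileNotFoundError:
            entries = []
    return entries
```

In the scenario from the report, the wrapper's lock lived under the container's own /var/lock, so the two sides were effectively locking two different files and the exclusion never happened.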

... in addition ...

It also appears that the triggering of the systemd service that runs the sync command is racy: if an entry is added to the processes file after the shared lock is released but before the sync process has completed, the sync doesn't run again.
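This second race concerns the trigger rather than the lock. As a hedged illustration (hypothetical names, not the tripleo code), a sync pass that re-checks the processes file after each batch picks up entries that arrive while it is still running, instead of relying on systemd to start another pass. Here `drain` stands for any callable that returns and clears the pending entries under the shared lock. This narrows the window but does not close it: an entry appended after the final, empty drain and before the service exits can still be missed if the trigger is swallowed, which is one reason a long-running sync loop is an alternative to a oneshot service.

```python
from typing import Callable, List


def sync_until_empty(drain: Callable[[], List[str]],
                     handle: Callable[[str], None],
                     max_rounds: int = 100) -> int:
    """Process pending entries, re-checking the queue after each batch.

    A single drain-then-exit pass misses anything appended after its drain
    if the trigger that would start another pass is lost while this pass is
    still running. Looping until a drain comes back empty shrinks that gap
    to the time between the last (empty) drain and process exit.
    """
    handled = 0
    for _ in range(max_rounds):  # bound the loop in case producers never stop
        batch = drain()
        if not batch:
            break
        for entry in batch:
            handle(entry)  # e.g. create the sidecar container for this entry
            handled += 1
    return handled
```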

This was reproduced by restarting the Neutron DHCP agent container with three subnets configured. The first problem resulted in only one sidecar being created; the second would occasionally cause one or more sidecar containers to be missed by the sync. The processes file would still contain the remaining entries, and restarting the dhcp_dnsmasq service on the host would cause the remaining sidecars to be created.

Brent Eagles (beagles)
Changed in tripleo:
status: New → Triaged
milestone: none → ussuri-rc3
importance: Undecided → Critical
Brent Eagles (beagles)
description: updated
tags: added: train-backport-potential
Bogdan Dobrelya (bogdando) wrote :

> It appears that if the processes file has an entry added after the shared lock is released, but the sync process isn't completed, sync doesn't happen again.

That particular part of the issue really sounds like a race in systemd watchers?

Bogdan Dobrelya (bogdando) wrote :
Bogdan Dobrelya (bogdando) wrote :

Sorry, this one is probably a better link: https://github.com/systemd/systemd/issues/5770

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/722816

Changed in tripleo:
assignee: nobody → Brent Eagles (beagles)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/723373

Changed in tripleo:
assignee: Brent Eagles (beagles) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

An alternative implementation that does not introduce a sync daemon to replace the oneshot sync service: https://review.opendev.org/#/c/723373/

Bogdan Dobrelya (bogdando) wrote :

Note: these patches should depend on the fix for the "the lock used in the wrapper is under /var/lock in the container which is not shared with the host so the sync script never waits for the wrapper to be done" part, which is not ready yet.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/723522

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/724259

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/723522
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1517df0fc30b7b10263aa96fe48978d7bf17a0fe
Submitter: Zuul
Branch: master

commit 1517df0fc30b7b10263aa96fe48978d7bf17a0fe
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 27 15:11:21 2020 +0200

    Add shared volume for side-car wrapper locks

    The lock used in the wrapper is under /var/lock in the container which
    is not shared with the host so the sync script never waits for the
    wrapper to be done. Moving the lock file to a path on a shared mount in
    the container seems to solve that particular race.

    Partial-bug: #1874470

    Change-Id: Iaa3a19bc47241e6eb686d65c1a198ec69505398e
    Signed-off-by: Bogdan Dobrelya <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/724259
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=90a05a5f8a57928f3d429925468749c482eaf1b6
Submitter: Zuul
Branch: master

commit 90a05a5f8a57928f3d429925468749c482eaf1b6
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Apr 29 11:05:12 2020 +0200

    Use shared volume for side-car wrapper locks

    The lock used in the wrapper is under /var/lock in the container which
    is not shared with the host so the sync script never waits for the
    wrapper to be done. Moving the lock file to a path on a shared mount in
    the container seems to solve that particular race.

    Change-Id: I660b7189a9e1c3197f2cdcc77af62584691dde16
    Partial-bug: #1874470
    Depends-On: https://review.opendev.org/723522
    Signed-off-by: Bogdan Dobrelya <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ansible (master)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/723373

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Brent Eagles (<email address hidden>) on branch: master
Review: https://review.opendev.org/722816

Brent Eagles (beagles)
Changed in tripleo:
importance: Critical → High
yatin (yatinkarel) wrote :

After a reboot, /var/lock/containers gets deleted, and the OVN metadata container didn't start until the directory was created manually.

/var/lock is a symlink to ../run/lock (i.e. /run/lock, on the tmpfs mounted at /run), so it gets cleaned up on reboot.

lrwxrwxrwx. 1 root root 11 Jan 13 21:49 /var/lock -> ../run/lock
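A sketch of the idempotent "re-create on every start" pattern this failure points to. `ensure_lock_dir` is a hypothetical helper, and the 0755 mode is an assumption. Since /var/lock resolves into /run, which is a tmpfs on systemd-based hosts, a directory created there once at deploy time has to be re-created after each boot. The related change on this report was abandoned in favor of a revert, so this only illustrates the pattern, not the project's fix.

```python
import os


def ensure_lock_dir(lock_dir: str, mode: int = 0o755) -> str:
    """Re-create the shared lock directory if it is missing (hypothetical helper).

    Safe to call on every container start: exist_ok=True makes it a no-op
    when the directory already exists, and missing parents are created too.
    """
    os.makedirs(lock_dir, mode=mode, exist_ok=True)
    return lock_dir
```

On a systemd host the conventional equivalent is a tmpfiles.d(5) `d` entry for the directory, which systemd-tmpfiles applies at boot.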

Changed in tripleo:
status: In Progress → Triaged
assignee: Bogdan Dobrelya (bogdando) → nobody
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/728360

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by yatin (<email address hidden>) on branch: master
Review: https://review.opendev.org/728360
Reason: In favor of the revert https://review.opendev.org/#/c/728891/

wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Changed in tripleo:
milestone: wallaby-3 → wallaby-rc1
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Marios Andreou (marios-b) wrote :

This is an automated action. Bug status has been set to 'Incomplete' and target milestone has been removed due to inactivity. If you disagree please re-set these values and reach out to us on freenode #tripleo

Changed in tripleo:
milestone: xena-1 → none
status: Triaged → Incomplete