Swift race condition on AIO

Bug #1274358 reported by Mark T. Voelker
Affects          Status        Importance  Assigned to   Milestone
Cisco Openstack  Fix Released  Medium      Chris Ricker

Bug Description

Several people have reported seeing the errors below on h.1. The failures are intermittent, which suggests a race condition that manifests only sometimes, and they appear to have become more frequent with the newer StackForge updates pulled in after h.1. When the condition occurs, it can generally be fixed by restarting the required services manually or by performing a second puppet run:

    service swift-container-replicator restart; service swift-container-sync restart; service swift-account-replicator restart

Error: Could not start Service[swift-container-replicator]: Execution of '/sbin/start swift-container-replicator' returned 1:

Error: /Stage[main]/Swift::Storage::Container/Swift::Storage::Generic[container]/Service[swift-container-replicator]/ensure: change from stopped to running failed: Could not start Service[swift-container-replicator]: Execution of '/sbin/start swift-container-replicator' returned 1:

Error: Could not start Service[swift-container-sync]: Execution of '/sbin/start swift-container-sync' returned 1:

Error: /Stage[main]/Swift::Storage::Container/Service[swift-container-sync]/ensure: change from stopped to running failed: Could not start Service[swift-container-sync]: Execution of '/sbin/start swift-container-sync' returned 1:

Error: Could not start Service[swift-account-replicator]: Execution of '/sbin/start swift-account-replicator' returned 1:

Error: /Stage[main]/Swift::Storage::Account/Swift::Storage::Generic[account]/Service[swift-account-replicator]/ensure: change from stopped to running failed: Could not start Service[swift-account-replicator]: Execution of '/sbin/start swift-account-replicator' returned 1:

The root cause appears to be a race condition in which the services may be started before the ring sync is complete, as evidenced by messages like the following in the upstart logs for the failed services:

IOError: [Errno 2] No such file or directory: '/etc/swift/account.ring.gz'
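Until the ordering is fixed, the manual workaround can be made a little safer by first waiting for the ring files to appear. The sketch below assumes the rings are delivered to /etc/swift (the path in the log above) and that account.ring.gz and container.ring.gz are the files these three services need; both function names are hypothetical and are not part of the puppet modules.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the Cisco OpenStack puppet modules.

# Assumed set of rings needed by the failing account/container services;
# add object.ring.gz if object services are affected as well.
RING_FILES="account.ring.gz container.ring.gz"

# wait_for_rings DIR TIMEOUT
# Polls DIR once per second until every ring file exists.
# Returns 0 when all are present, 1 after TIMEOUT seconds.
wait_for_rings() {
    local dir="$1" timeout="$2" waited=0 missing f
    while :; do
        missing=""
        for f in $RING_FILES; do
            [ -f "$dir/$f" ] || missing="$missing $f"
        done
        [ -z "$missing" ] && return 0
        if [ "$waited" -ge "$timeout" ]; then
            echo "still missing after ${timeout}s:$missing" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# restart_swift_services: the manual workaround from the bug description.
restart_swift_services() {
    local svc
    for svc in swift-container-replicator swift-container-sync swift-account-replicator; do
        service "$svc" restart
    done
}
```

For example, `wait_for_rings /etc/swift 300 && restart_swift_services` waits up to five minutes for the rings before restarting. A more robust approach would be to order the service resources after the ring sync in the puppet manifests; this sketch only automates the manual workaround.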

Changed in openstack-cisco:
assignee: Chip (cbaesema) → Chris Ricker (chris-ricker)
importance: High → Medium
Mark T. Voelker (mvoelker) wrote :

Looks like we won't have time for this in h.2, but there is a simple workaround (either do a second puppet run or restart the three services manually) and therefore this isn't a showstopper.

Changed in openstack-cisco:
milestone: h.2 → i.0
Shweta P (shweta-ap05) wrote :

I see this error in the full HA (swift) setup as well. I do not see the ring.gz files on the swift storage nodes.

Changed in openstack-cisco:
status: In Progress → New
status: New → Confirmed
Chris Ricker (chris-ricker) wrote :
Changed in openstack-cisco:
status: Confirmed → In Progress
Changed in openstack-cisco:
status: In Progress → Fix Committed
Changed in openstack-cisco:
status: Fix Committed → Fix Released