Problem with disk overflowing (help my swift cluster is full)

Bug #1359160 reported by Anton
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi,

If the container server and the object server use the same drives, Swift has a problem when there is no free disk space left.

When I try to delete an object, Swift first updates the container info, but that operation is not possible, since the disk is full...

To get back to normal operation, you need to delete a partition, or an object inside a partition, in order to free up space.

Sorry for my English.

Aug 20 06:35:12 str-05 container-server: ERROR __call__ error with DELETE /sdv1/141877/AUTH_system/5288e8796dd05/5288fbb1c9007/5288fbb1856c6 :
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/swift/container/server.py", line 495, in __call__
    res = method(req)
  File "/usr/local/lib/python2.7/dist-packages/swift/common/utils.py", line 2132, in wrapped
    return func(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/swift/common/utils.py", line 836, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/swift/container/server.py", line 211, in DELETE
    broker.delete_object(obj, req.headers.get('x-timestamp'))
  File "/usr/local/lib/python2.7/dist-packages/swift/container/backend.py", line 194, in delete_object
    self.put_object(name, timestamp, 0, 'application/deleted', 'noetag', 1)
  File "/usr/local/lib/python2.7/dist-packages/swift/container/backend.py", line 234, in put_object
    fp.flush()
IOError: [Errno 28] No space left on device (txn: txe267800f722a488db2b8e-0053f32812)

Revision history for this message
clayg (clay-gerrard) wrote :

> When I try to delete an object, Swift first updates the container info, but that operation is not possible, since the disk is full...

All three devices trying to write the tombstone were full? But the entry still got removed from the container listing? (!)

That doesn't quite match up with my reading of the current code:

            disk_file.delete(req_timestamp)
            self.container_update(
                'DELETE', account, container, obj, request,
                HeaderKeyDict({'x-timestamp': req_timestamp.internal}),
                device, policy_idx)
        return response_class(request=request)
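
For a bit of context (this is my reading of the behavior, not a quote of the real DiskFile code): an object DELETE in Swift is itself a small write - the object server first drops an essentially empty "<timestamp>.ts" tombstone file into the object's hash directory, and only then asks the container servers to update their listings. A toy sketch of why even a delete can hit ENOSPC (paths and names below are made up for illustration):

    import os

    def write_tombstone(hash_dir, timestamp):
        # Hypothetical stand-in for what DiskFile.delete() amounts to:
        # create an empty "<timestamp>.ts" marker in the object's hash dir.
        if not os.path.isdir(hash_dir):
            os.makedirs(hash_dir)
        ts_path = os.path.join(hash_dir, '%s.ts' % timestamp)
        # Even a zero-byte file costs an inode and a directory update, so
        # this can still fail with IOError/OSError ENOSPC on a full device.
        with open(ts_path, 'wb') as fp:
            fp.flush()
        return ts_path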

I'm not sure if this bug report is trying to highlight specific code paths that we may want to target to improve for the "cluster full" case - or if it's just opening an issue to track that "having a full swift cluster is no fun".

We'll probably need something that looks more like a plan than a bug if we want to actually try and move the needle towards making that a better operational experience.

I can confirm that when a good number of devices in the cluster get full, the Swift API will return errors, and replication as it's currently written can have a hard time straightening things out without some manual intervention - adding capacity definitely helps; trying to delete data quickly through the Swift API rarely does :\

Some things I've seen:

 * kill the rsync modules on the overflowing nodes so they stop trying (ineffectively) to eat new bytes.
 * stop swift-object-replicator on the nodes that still have capacity, so they focus on eating bytes instead of barfing them out and tying up other object-servers in REPLICATE requests - the full nodes will do better if they can focus on syncing out the partitions they're trying to push.
 * go ahead and flip on handoffs_first and drop handoff_delete to 2 (or 1!) on the object-replicators of the most-full node(s) so they can get some breathing room ASAP (a rough config sketch follows below).
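
For reference, a rough sketch of what that last tweak looks like in the [object-replicator] section of object-server.conf on the full node(s) - treat the values as examples, not a recommendation:

    [object-replicator]
    # push handoff partitions off this node before anything else
    handoffs_first = True
    # stop waiting for every replica before removing the local handoff copy;
    # 2 (or even 1) frees space sooner, at the cost of some durability margin
    # while the cluster is out of balance
    handoff_delete = 2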

Others may have additional war-stories/ops-disaster-porn experience that might help to share even if it's just therapeutic - but so far it seems like "don't let your cluster fill up" is a lesson you really only learn once - so I'm not sure when this will get direct, focused attention to improve.

Changed in swift:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Object Storage (swift) because there has been no activity for 60 days.]

Changed in swift:
status: Incomplete → Expired
Revision history for this message
clayg (clay-gerrard) wrote :

Ops war-stories at the Liberty Design Summit suggested an enhancement for the replicator - lp bug #1457262

summary: - Problem with disk overflowing
+ Problem with disk overflowing (help my swift is full)
summary: - Problem with disk overflowing (help my swift is full)
+ Problem with disk overflowing (help my swift cluster is full)
Revision history for this message
Bulat Gaifullin (bulat.gaifullin) wrote :

I hit a similar problem, but with reads, when there is no storage space left on the nodes.
I think the problem is the following: when a node responds with a 507 'Insufficient Storage' error, the proxy marks it as errored. Because there is no space on any node, all of the nodes end up in the errored state, and when the proxy then tries to make a GET request there is no 'alive' node left to serve it. I believe the proxy should keep write errors and read errors separate.
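
To illustrate the suggestion (a toy sketch only, not the proxy's actual error-limiting code - the names and numbers are made up): track read and write failures per node separately, so a pile of 507s on writes doesn't also make the proxy skip the node for GETs.

    import time
    from collections import defaultdict

    class NodeErrorLimiter(object):
        """Toy per-node error limiter that keeps read and write errors apart."""

        def __init__(self, limit=10, interval=60):
            self.limit = limit          # errors tolerated per window
            self.interval = interval    # window length in seconds
            self.errors = defaultdict(list)   # (node, kind) -> error timestamps

        def record(self, node, kind):
            # kind is 'read' (GET/HEAD) or 'write' (PUT/POST/DELETE)
            self.errors[(node, kind)].append(time.time())

        def limited(self, node, kind):
            cutoff = time.time() - self.interval
            recent = [t for t in self.errors[(node, kind)] if t > cutoff]
            self.errors[(node, kind)] = recent
            return len(recent) >= self.limit

    # A node drowning in 507s on writes would report limited(node, 'write')
    # as True while still being eligible for reads.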
