Problem with disk overflowing (help my swift cluster is full)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Expired | Undecided | Unassigned |
Bug Description
Hi,
if the container server and the object server use the same drives, Swift has a problem when there is no free disk space.
When I try to delete an object, Swift first updates the container info, but that operation is not possible since the disk is full...
To restore normal operation you need to delete a partition, or an object on a partition, in order to free up space.
Sorry for my English
Aug 20 06:35:12 str-05 container-server: ERROR __call__ error with DELETE /sdv1/141877/
Changed in swift:
status: New → Incomplete
> When I try to delete object, first swift updates container info, but it operation not possible, since disk is full...
All three devices trying to write the tombstone were full? But the entry still got removed from the container listing? (!)
That doesn't quite match up with my reading of the current code:
return response_
I'm not sure if this bug report is trying to highlight specific code paths that we may want to target to improve for the "cluster full" case, or just opening an issue to track the fact that "having a full swift cluster is no fun".
We'll probably need something that looks more like a plan than a bug if we want to actually try and move the needle towards making that a better operational experience.
I can confirm that when a good number of devices in the cluster get full, the swift API will return errors, and replication as it's written can have a hard time straightening things out without some manual intervention. Adding capacity definitely helps; trying to delete data quickly through the swift API rarely does :\
Some things I've seen:
* kill the rsync modules on the overflowing nodes so they stop trying (ineffectively) to eat new bytes.
* stop swift-object-replicator on nodes that have capacity so they can focus on eating bytes instead of barfing them out and tying up other object-servers in REPLICATE requests when they'd do better to let them focus on syncing up the parts they're trying to push out.
* go ahead and flip on handoffs_first and drop handoff_delete to 2 (or 1!) on your object-replicators on the most full node(s) so they can get some breathing room ASAP.
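For the last step, the relevant knobs live in the `[object-replicator]` section of the object server config (commonly `/etc/swift/object-server.conf`). A sketch of what that might look like — the values here are illustrative for a drain-the-full-node scenario, not general recommendations:

```ini
[object-replicator]
# Process partitions on handoff devices first, so misplaced data gets
# pushed off the full node before anything else.
handoffs_first = true
# Allow a handoff partition to be deleted after syncing to only 2
# replicas (or 1, if you're desperate) instead of all of them.
handoff_delete = 2
```

Remember to revert both settings once the node has breathing room again; they trade durability guarantees for drain speed.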
Others may have additional war-stories / ops-disaster-porn experience that might help to share, even if it's just therapeutic - but so far it seems like "don't let your cluster fill up" is a lesson you really only learn once - so I'm not sure when it'll get direct, focused attention to improve.