Problem with disk overflowing (help my swift cluster is full)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Expired | Undecided | Unassigned |
Bug Description
Hi,
if the container server and the object server use the same drives, Swift has a problem when there is no free disk space.
When I try to delete an object, Swift first updates the container info, but that operation is not possible since the disk is full...
To restore normal operation you need to delete a partition, or an object on a partition, in order to free up space.
Sorry for my English
Aug 20 06:35:12 str-05 container-server: ERROR __call__ error with DELETE /sdv1/141877/
Changed in swift:
status: New → Incomplete
> When I try to delete object, first swift updates container info, but it operation not possible, since disk is full...
All three devices trying to write the tombstone were full? But the entry still got removed from the container listing? (!)
That doesn't quite match up with my reading of the current code:
return response_
I'm not sure if this bug report is trying to highlight specific code paths that we may want to target to improve for the "cluster full" case, or just opening an issue to track the fact that "having a full swift cluster is no fun".
We'll probably need something that looks more like a plan than a bug if we want to actually try and move the needle towards making that a better operational experience.
I can confirm that when a good number of devices in the cluster get full, the swift API will return errors, and replication as it's written can have a hard time straightening things out without some manual intervention. Adding capacity definitely helps; trying to delete data quickly through the swift API rarely does :\
Some things I've seen:
* kill the rsync modules on the overflowing nodes so they stop trying (ineffectively) to eat new bytes.
* stop swift-object-replicator on nodes that have capacity so they can focus on eating bytes instead of barfing them out and tying up other object-servers in REPLICATE requests when they'd do better to let them focus on syncing up the parts they're trying to push out.
* go ahead and flip on handoffs_first and drop handoff_delete to 2 (or 1!) on your object-replicators on the most full node(s) so they can get some breathing room ASAP.
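For the last step, the relevant knobs live in the `[object-replicator]` section of the object server config (commonly `/etc/swift/object-server.conf`). A sketch of what that might look like — the values here are illustrative for a drain-the-full-node scenario, not general recommendations:

```ini
[object-replicator]
# Process partitions on handoff devices first, so misplaced data gets
# pushed off the full node before anything else.
handoffs_first = true
# Allow a handoff partition to be deleted after syncing to only 2
# replicas (or 1, if you're desperate) instead of all of them.
handoff_delete = 2
```

Remember to revert both settings once the node has breathing room again; they trade durability guarantees for drain speed.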
Others may have additional war-stories / ops-disaster-porn experience that might help to share, even if it's just therapeutic - but so far it seems like "don't let your cluster fill up" is a lesson you really only learn once - so I'm not sure when it'll get direct, focused attention to improve.