CPU usage of object-replicator spikes more than 100%

Bug #1038129 reported by Kota Tsuyuzaki
This bug affects 5 people
Affects: OpenStack Object Storage (swift)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

CPU usage of object-replicator spikes to more than 100% when
object-replicator calls replicate() to ensure consistency of objects.
Object-replicator generates many REPLICATE requests at the same time
in replicate(), and this burst of requests causes the spike.
The number of generated requests depends on the number of partitions: object-replicator
generates one request per partition on a machine.
In our case, CPU usage reaches 100% when more than 1000 partitions exist
on a physical machine.
If we configure settings so that each disk has approximately 100 partitions,
we can use at most 10 disks per machine before hitting this spike.
We think the request generation should be limited to smooth out this spike and
allow more disks per machine.
In the simplest case, swift limits the rate of request generation
with a config parameter, like the object-auditor does.
In the stricter case, swift should adjust the limit depending on the machine
resources (e.g. CPU, I/O, network resources).

(swift version is 1.5.0)
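
A minimal sketch of the simplest case, limiting REPLICATE generation with a config parameter the way the object-auditor limits itself with files_per_second (the replicate_requests_per_second option and the limiter below are hypothetical, not existing swift code):

    import time

    class RateLimiter(object):
        # sleep as needed so wait() is called at most max_rate times per second
        def __init__(self, max_rate):
            self.max_rate = max_rate
            self.next_allowed = time.time()

        def wait(self):
            now = time.time()
            if now < self.next_allowed:
                time.sleep(self.next_allowed - now)
                now = self.next_allowed
            self.next_allowed = now + 1.0 / self.max_rate

    def send_replicate_request(partition):
        # placeholder for the real REPLICATE HTTP request to the object-server
        print('REPLICATE partition %s' % partition)

    # hypothetical option, analogous to files_per_second in the object-auditor config
    replicate_requests_per_second = 20
    limiter = RateLimiter(replicate_requests_per_second)

    def replicate(partitions):
        for partition in partitions:
            limiter.wait()                    # throttle request generation
            send_replicate_request(partition)

    replicate(range(100))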

Revision history for this message
Xingchao Yu (yuxcer) wrote :

I think this problem is caused by the current mechanism of the object-replicator, and besides that, we should not set such small partitions.
Maybe we need to change the object-replicator mechanism.

Revision history for this message
Kota Tsuyuzaki (tsuyuzaki-kota) wrote :

I think the mechanism of object-replicator should be changed too.
How should we reduce the REPLICATE requests?
We can see two ways out of this:

1. Check several partitions in a single request.
2. Limit the number of requests per second.

I think the 1st way is better.
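
A rough sketch of the 1st way, assuming a hypothetical batched form of REPLICATE that covers several partitions per request (the real REPLICATE verb handles one partition per request, so this is only an illustration):

    def batched(iterable, size):
        # yield lists of up to `size` items from `iterable`
        batch = []
        for item in iterable:
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch

    def send_replicate_request(partitions):
        # placeholder for a hypothetical multi-partition REPLICATE request
        print('REPLICATE partitions %s' % ','.join(str(p) for p in partitions))

    def replicate_in_batches(partitions, partitions_per_request=10):
        # one request covers several partitions, cutting the request count ~10x
        for batch in batched(partitions, partitions_per_request):
            send_replicate_request(batch)

    replicate_in_batches(range(1000))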

Revision history for this message
Constantine Peresypkin (constantine-q) wrote :

Are you sure that the number of REPLICATE requests is a problem?
The code operates in a strictly serial fashion: default concurrency for replicator = 1
The REPLICATE request itself is very straightforward as well: load pickle from file, dump it as a string into response
I have a hunch that this could be connected with the pickle de-serialization slowdown in Python 2.7 vs 2.6.
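
A quick way to check that hunch is to time de-serialization of a dict shaped roughly like a suffix-hashes pickle under Python 2.6 and 2.7 and compare (the data below is synthetic, not a real hashes.pkl):

    try:
        import cPickle as pickle   # Python 2
    except ImportError:
        import pickle              # Python 3
    import timeit

    # 4096 suffixes mapped to md5 hex digests, roughly the shape of a hashes.pkl
    suffix_hashes = dict(('%03x' % i, 'd41d8cd98f00b204e9800998ecf8427e')
                         for i in range(4096))
    blob = pickle.dumps(suffix_hashes, 2)

    elapsed = timeit.timeit(lambda: pickle.loads(blob), number=1000)
    print('1000 pickle.loads() calls took %.3f seconds' % elapsed)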

Revision history for this message
Alex Yang (alexyang) wrote :

The current mechanism of the object-replicator is to loop over all partitions and sync per suffix directory.
This mechanism can cause high CPU usage, high I/O, many requests to the object-server, and a long time to reach eventual consistency.
I suggest that we implement a new replicator based on logs and messages, i.e. an event-driven framework.
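
A very rough sketch of the event-driven idea (the hook, queue, and worker below are hypothetical, not existing swift code): writes or write failures enqueue a replication event, and a worker drains the queue instead of the replicator walking every partition on every pass.

    import threading
    try:
        import queue           # Python 3
    except ImportError:
        import Queue as queue  # Python 2

    replication_queue = queue.Queue()

    def on_object_written(partition, suffix):
        # hypothetical hook on the write/error path
        replication_queue.put((partition, suffix))

    def sync_suffix(partition, suffix):
        # placeholder for the actual rsync / REPLICATE work
        print('sync partition %s suffix %s' % (partition, suffix))

    def replication_worker():
        while True:
            partition, suffix = replication_queue.get()
            try:
                sync_suffix(partition, suffix)
            finally:
                replication_queue.task_done()

    worker = threading.Thread(target=replication_worker)
    worker.daemon = True
    worker.start()

    on_object_written(1234, 'abc')
    replication_queue.join()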

Revision history for this message
Mike Barton (redbo) wrote :

I'm all for adding more hinted handoff (i.e. queueing fixes when an error happens), but I don't think we can ever get rid of anti-entropy recovery. Better indexing of disk contents could bring it closer to optimal, though.

Revision history for this message
Kota Tsuyuzaki (tsuyuzaki-kota) wrote :

I verified that this spike behavior is not caused by the pickle loading slowdown (in Python 2.7).
Even when I ran the object-replicator without pickle loading
(I modified the object-replicator so that it keeps partition information
in memory), I still saw this spike.

Though I think the event-driven approach is better for preventing this behavior,
I fear the anti-entropy issue and the complexity of the implementation.

If the object-replicator fails an object replication within an event,
when should the object-replicator retry the replication?
(I feel the cost of error queuing and immediate retrying is too high.)

clayg (clay-gerrard)
Changed in swift:
importance: Undecided → Wishlist
status: New → Confirmed
Revision history for this message
John Dickinson (notmyname) wrote :

Last activity was nearly 3 years ago. Needs to be reverified.

Changed in swift:
importance: Wishlist → Undecided
Revision history for this message
Thiago da Silva (thiagodasilva) wrote :

@Kota, this bug is almost 5 years old; do you think it is still valid? Should we close it?
