copy-storage-all/inc does not easily converge with load going on

Bug #1689499 reported by Christian Ehrhardt 
Affects: QEMU
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi,
for now this is more a report for discussion than a "bug", but I wanted to check whether there is anything I might be overlooking.

I regularly test the qemu versions we have in Ubuntu, which currently are 2.0, 2.5, 2.6.1 and 2.8 plus a bunch of patches, and every now and then I also test upstream for all sorts of verification.

I recently realized that migrations using the --copy-storage-[all/inc] options seem to have become worse at converging. There is no single hard commit to be found; it just seems more likely to occur the newer the qemu version is. I assume that is partially due to guest performance optimizations that keep the guest busier.
To a user it appears as a hanging, locked-up migration.

But let me outline what actually happens:
- Setup without shared storage
- Migration using --copy-storage-all / --copy-storage-inc (a rough invocation sketch follows after the workload list below)
- Works fine with idle guests
- If the guest is busy, the migration takes close to forever (a 1 vCPU guest kept busy by one CPU-, one memory- and one disk-hogging process)
- Statistically it seems to trigger more often on newer qemu versions (might be a red herring)

The background workloads are the most trivial of burners:
- cpu: md5sum /dev/urandom
- memory: stress-ng -m 1 --vm-keep --vm-bytes 256M
- disk: while /bin/true; do dd if=/dev/urandom of=/var/tmp/mjb.1 bs=4M count=100; done
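
For reference, the migration itself is driven through libvirt roughly as sketched below; the guest name and destination URI are placeholders I added, not values from the report:

  # start the burners inside the guest, then on the source host:
  virsh migrate --live --copy-storage-all GUEST qemu+ssh://DEST/system
  # or, with a pre-existing copy of the disk image on the destination, copy only the changed blocks:
  virsh migrate --live --copy-storage-inc GUEST qemu+ssh://DEST/system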

We are talking about ~1-2 minutes on qemu 2.5 (4 tries x 3 architectures) and 2-10+ hours on >=qemu 2.6.1.

I'd say it is likely not a bug, but rather something to discuss, as I can easily avoid the hang via either of the following (a rough command-line sketch follows below the list):
- timeouts (--timeout, ...) to abort the migration or suspend the guest so it can converge
- --auto-converge (I had only one try, but it seemed to help by slowing down the load generators)
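
A minimal sketch of combining those options (guest name and URI are again placeholders; --timeout N suspends the guest after N seconds so the migration can finish):

  virsh migrate --live --copy-storage-all --auto-converge --timeout 120 \
      GUEST qemu+ssh://DEST/system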

So you might say "that is all as it should be, and users can use the additional options to mitigate", and I'm fine with that. In that case this bug still serves as a "searchable" document of sorts for others hitting the same case. But if anything comes to your mind that needs better handling around this case, let's start discussing it more deeply.

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

Interesting.
That's quite a big difference, so if you could bisect it down it would be interesting to figure out where the change occurred.

What happens if you just make it a 'disk' workload without the memory stress?

What network interface (1G/10G etc) are you migrating over and what bandwidth limit have you got set?

Dave
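
(For reference, assuming libvirt's standard tooling: the per-guest migration bandwidth limit being asked about can be inspected and changed with virsh; GUEST is a placeholder.)

  virsh migrate-getspeed GUEST          # current limit in MiB/s
  virsh migrate-setspeed GUEST 1000     # cap migration bandwidth at 1000 MiB/s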

Revision history for this message
Christian Ehrhardt  (paelzer) wrote : Re: [Bug 1689499] Re: copy-storage-all/inc does not easily converge with load going on

On Tue, May 9, 2017 at 12:41 PM, Dr. David Alan Gilbert <email address hidden> wrote:

> Interesting.
> That's quite a big difference, so if you could bisect it down it would be
> interesting to figure out where the change occurred.
>

Hi David,
if it turns out to stay reproducible enough I can certainly try sometime next week.

> What happens if you just make it a 'disk' workload without the memory
> stress?
>

Will do so along with my checks of whether it triggers reliably enough for a bisect.

> What network interface (1G/10G etc) are you migrating over and what
> bandwidth limit have you got set?
>

No explicit bandwidth limit is set, and the connection itself is only virtual (migrating between two libvirt/qemu stacks running in LXD containers), so other than networking overhead this is usually really fast.
I quickly measured with iperf on a few of the test hosts and the speed was around 30-120 GBit/s, which should qualify as "fast enough".
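
(The check was roughly along these lines, assuming plain iperf with default options; DEST is a placeholder for the peer container:)

  iperf -s          # on the destination container
  iperf -c DEST     # on the source container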

I'll get back to you once I find the time to verify reproducibility and hopefully do a bisect.
I beg your pardon, as this might need a few days (to free up my tasks as well as a system capable of doing so).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, I found an hour to set up a test environment.
I already had the bisect script written by the time the systems were ready, but unfortunately it is not reproducible enough with my build from git as of today's master - out of 8 tries it showed a smaller slowdown once, and never the hours of waiting that I saw when testing on a wider scope.
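
(For the record, the bisect would have looked roughly like the sketch below; the endpoints are my assumption based on the versions above, i.e. that 2.5 behaves well and the 2.6 release is already affected:)

  cd qemu                 # an existing checkout of qemu.git
  git bisect start
  git bisect good v2.5.0
  git bisect bad v2.6.0
  # build, run the loaded --copy-storage-all migration test, then mark each step:
  git bisect good         # or: git bisect bad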

If my regular testing triggers these issues again I'll try to sort out what the difference is between that environment and my bisect environment (I set the latter up with the scripts that run the former), but until then I'm kind of out of options for now :-/.

Changed in qemu:
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I'm "incompleting" myself until I was able to provide more :-/

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

OK, it'll be interesting if you get something repeatable, but sometimes these types of bugs go and hide in a dark corner until someone trips over them in a more critical situation.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for QEMU because there has been no activity for 60 days.]

Changed in qemu:
status: Incomplete → Expired