--copy-storage-all failing with qemu 2.10
Affects | Status | Importance | Assigned to
---|---|---|---
QEMU | Fix Released | Undecided | Unassigned
qemu (Ubuntu) | Fix Released | High | Christian Ehrhardt
Bug Description
We fixed an issue around disk locking already in regard to qemu-nbd [1], but there still seem to be issues.
$ virsh migrate --live --copy-storage-all kvmguest-
error: internal error: qemu unexpectedly closed the monitor: 2017-08-
2017-08-
Source libvirt log for the guest:
2017-08-18 12:09:08.251+0000: initiating migration
2017-08-
2017-08-
Target libvirt log for the guest:
2017-08-
2017-08-18 12:09:09.010+0000: shutting down, reason=crashed
Given the timing it seems that the actual copy now works (it is busy for ~10 seconds in my environment, which matches the copy).
We also no longer see the old errors we saw before; instead the failure now happens afterwards, on the actual take-over.
Dmesg shows no related denials (apparmor is often in the mix).
Need to check the libvirt logs of source [2] and target [3] in detail.
[1]: https:/
[2]: http://
[3]: http://
Christian Ehrhardt (paelzer) wrote : | #1 |
Changed in libvirt (Ubuntu): | |
status: | New → Confirmed |
importance: | Undecided → High |
assignee: | nobody → ChristianEhrhardt (paelzer) |
Christian Ehrhardt (paelzer) wrote : | #2 |
--copy-storage-inc is also affected.
It seems the whole set of copy-storage migrations is affected.
Christian Ehrhardt (paelzer) wrote : | #3 |
Note: With the planned upgrade we run migration checks from the old version onto the new one - that revealed that the migration works if the source is still at the old level (even with the 2.10-based qemu on the target).
Christian Ehrhardt (paelzer) wrote : | #4 |
I reached out to the people involved in the initial fixes related to image locking and qemu-nbd, but this might after all be something completely different.
Yet until we know better it seems wiser to reach out to more people.
=> http://
Changed in qemu (Ubuntu): | |
assignee: | nobody → ChristianEhrhardt (paelzer) |
status: | New → Confirmed |
importance: | Undecided → High |
Christian Ehrhardt (paelzer) wrote : | #5 |
The source log is virsh, I need to ensure we also have a source libvirtd log ...
Christian Ehrhardt (paelzer) wrote : | #6 |
Since this is pretty reproducible here on the setup:
- Two systems (actually two lxd containers on one system)
- Both running Artful with qemu 2.10-rc3 and libvirt 3.6
- Storage path is not shared but set up identically via a manual pre-copy
- Migration with storage copy is failing, no other options set; example:
$ virsh migrate --live --copy-storage-all kvmguest-
- Same setup works on the qemu versions in Xenial (2.5), Yakkety (2.6), and Zesty (2.8)
- In fact it seems even a migration from a Zesty qemu (2.8) to the new (2.10) works
Christian Ehrhardt (paelzer) wrote : | #7 |
- virsh-source.log (329.2 KiB, text/plain)
To simplify downloading the logs, I'm attaching a full set here:
- virsh
- source libvirtd
- target libvirtd
Christian Ehrhardt (paelzer) wrote : | #8 |
Christian Ehrhardt (paelzer) wrote : | #9 |
Christian Ehrhardt (paelzer) wrote : | #10 |
I've seen something in the logs which I want to eliminate from the list of possibilities:
"warning: host doesn't support requested feature: CPUID.80000001H
We have always carried a patch - one I questioned - to enable the svm capability for guests in general; it worked all the time, but I'd have preferred it to be an explicit user opt-in.
I remember seeing the warning in the past, which made me dismiss it at first, but maybe the target capability check is now more strict.
I'll drop this change for a test build and run all of that again to be sure.
I doubt that is the reason, but let verifying this particular lead be my task - please be open with other suggestions.
Christian Ehrhardt (paelzer) wrote : | #11 |
Currently I plan to test with the svm/vmx changes disabled as well as a cross test on ppc64 and s390x which might complete the picture.
Dr. David Alan Gilbert (dgilbert-h) wrote : | #12 |
The 'host doesn't support requested feature' is probably a red herring in this case.
The fact that it's failing with an I/O error but nothing else suggests either:
a) something is closing the socket between the two qemus, or
b) the I/O error comes from storage/NBD.
Assuming it fails on precopy, I'd look at the qemu_loadvm_state side.
You could also add some debug/tracing in qemu_loadvm_state to see at what point it fails.
Dave
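A sketch of the kind of tracing meant here, assuming QEMU 2.10's section-dispatch loop in qemu_loadvm_state_main() (migration/savevm.c); the fprintf is the addition and the loop is abbreviated, so treat it as an illustration rather than the exact upstream source:
/* Inside qemu_loadvm_state_main(): log each incoming section and the
 * current stream position, so the record on which loading fails is
 * visible on stderr. */
while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
    fprintf(stderr, "loadvm: section type 0x%02x at pos %" PRId64 "\n",
            section_type, qemu_ftell(f));
    switch (section_type) {
    case QEMU_VM_SECTION_START:
    case QEMU_VM_SECTION_FULL:
        ret = qemu_loadvm_section_start_full(f, mis);
        break;
    /* ... remaining cases unchanged ... */
    }
}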
Christian Ehrhardt (paelzer) wrote : | #13 |
Hi David,
confirming the red herring on the cpu feature - I had a build without it running over the weekend, so this was easy to test - and the migration still fails.
I have about 7 seconds from kicking off the migration until the sync seems to pass its first phase, and then qemu exits (at least that is what libvirt thinks):
"closed without SHUTDOWN event; assuming the domain crashed"
Christian Ehrhardt (paelzer) wrote : | #14 |
Since the qemu "lives" for that time, I can try to debug what happens.
Using strace to sniff out where things could be, I see right before the end:
0.000203 recvmsg(27, {msg_name=NULL, msg_namelen=0, msg_iov=
0.000049 futex(0xca65dacf4, FUTEX_CMP_
0.000038 getpid() = 29750 <0.000023>
0.000011 tgkill(29750, 29760, SIGUSR1) = 0 <0.000030>
0.000012 futex(0xca4785a80, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000048>
0.000010 futex(0xca47b46e4, FUTEX_WAIT_PRIVATE, 19, NULL) = 0 <0.002215>
0.000032 sendmsg(21, {msg_name=NULL, msg_namelen=0, msg_iov=
0.000074 write(2, "2017-08-
0.000055 close(27) = 0 <0.000090>
Now 29750 is the main process/tgid and 29760 is the third thread started on the migration.
It is the one that does the vcpu ioctls, so I assume it is the one representing the vcpu.
Well, gdb should be more useful, so let's look with that.
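As an aside, a standalone sketch (an illustration, not QEMU source) of the pattern behind that tgkill(..., SIGUSR1): QEMU kicks a vcpu thread with a signal - SIG_IPI, mapped to SIGUSR1 - to force it out of a blocking call such as the KVM_RUN ioctl:
/* kick.c - illustration only: a signal knocks a thread out of a blocking
 * call, the way QEMU's SIG_IPI (SIGUSR1) kicks a vcpu out of KVM_RUN. */
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void sig_ipi_handler(int sig)
{
    (void)sig;  /* only purpose: interrupt the blocking call with EINTR */
}

static void *vcpu_like_thread(void *arg)
{
    (void)arg;
    pause();    /* stands in for the blocking KVM_RUN ioctl */
    printf("vcpu thread kicked out of its blocking call\n");
    return NULL;
}

int main(void)
{
    struct sigaction sa = { 0 };
    pthread_t t;

    sa.sa_handler = sig_ipi_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    pthread_create(&t, NULL, vcpu_like_thread, NULL);
    sleep(1);                  /* let the thread block first */
    pthread_kill(t, SIGUSR1);  /* the "kick" seen as tgkill in strace */
    pthread_join(t, NULL);
    return 0;
}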
Christian Ehrhardt (paelzer) wrote : | #15 |
As expected by David when I trace on process_
It appears as "Thread 4 "CPU 0/KVM" received signal SIGUSR1" and similar which is just the break down of the guest.
Diving "into" qemu_loadvm_state reveals that it gets as far as "cpu_synchronize_all_post_init".
In qemu_loadvm_state none of the initial checks fail.
Then the "ret = vmstate_load_state(...)" step passes as well.
It reproducibly gets to "cpu_synchronize_all_post_init" and fails there.
I can catch the incoming qemu easily with:
$ while ! pid=$(pidof qemu-system-
# Then in gdb, break on "cpu_synchronize_all_post_init"
# And when I step over it, the next thing I see is the "beginning of the end" for the process
Thread 4 "CPU 0/KVM" received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x7f418136e700 (LWP 3887)]
__lll_lock_wait () at ../sysdeps/
The guest only has one vcpu, so CPU_FOREACH(cpu) is not much of a loop.
Looking down that path I tracked it along:
cpu_synchronize_all_post_init -> cpu_synchronize_post_init -> kvm_cpu_synchronize_post_init
Here it queues the function "do_kvm_cpu_synchronize_post_init" on the vcpu.
That is done via queue_work_on_cpu, which kicks the vcpu thread.
That seems to trigger the first SIGUSR1.
Following that I hit the breakpoint that I set at "do_kvm_cpu_synchronize_post_init".
The actual function only sets "cpu->vcpu_dirty = true;" and works.
On the way out from there, there is a "qemu_kvm_
Going on I see another "do_run_on_cpu" with "vapic_
I set a breakpoint on that as well but took a full CPUstate before going on:
p *cpu
$4 = {parent_obj = {parent_obj = {class = 0x5ffe7170c0, free = 0x7f62328f15a0 <g_free>, properties = 0x5ffe736e40, ref = 1,
parent = 0x5ffe726160}, id = 0x0, realized = true, pending_
lh_first = 0x0}, child_bus = {lh_first = 0x0}, num_child_bus = 0, instance_id_alias = -1, alias_required_
nr_threads = 1, thread = 0x5ffe803cd0, thread_id = 8498, running = false, has_waiter = false, halt_cond = 0x5ffe803cf0, thread_kicked = true,
created = true, stop = false, stopped = true, unplug = false, crash_occurred = false, exit_request = false, interrupt_request = 4,
singlestep_
__saved_mask = {__val = {0 <repeats 16 times>}}}}, work_mutex = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
__kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
initialized = true}, queued_work_first = 0x5fffefc990, queued_work_last = 0x5fffefc990, cpu_ases = 0x5ffe803c10, num_ases = 1,
...
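For context, a minimal self-contained sketch of the run_on_cpu()/queue_work_on_cpu() pattern traced above (QEMU's real implementation lives in cpus-common.c and is considerably more involved; everything below is illustrative): the caller queues a work item for the vcpu thread, kicks it, and blocks until the item has run.
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef void (*work_func)(void *data);

struct work_item {
    work_func func;
    void *data;
    bool done;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static struct work_item *pending;

/* vcpu-thread side: wait for work, run it, signal completion */
static void *cpu_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!pending) {
        pthread_cond_wait(&cond, &lock); /* woken when work is queued */
    }
    pending->func(pending->data);
    pending->done = true;
    pending = NULL;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* caller side: the rough equivalent of run_on_cpu() */
static void run_on_cpu(struct work_item *wi)
{
    pthread_mutex_lock(&lock);
    pending = wi;
    pthread_cond_broadcast(&cond); /* the "kick" */
    while (!wi->done) {
        pthread_cond_wait(&cond, &lock); /* block until the vcpu ran it */
    }
    pthread_mutex_unlock(&lock);
}

static void mark_dirty(void *data)
{
    *(bool *)data = true; /* like setting cpu->vcpu_dirty in the real code */
}

int main(void)
{
    pthread_t t;
    bool vcpu_dirty = false;
    struct work_item wi = { mark_dirty, &vcpu_dirty, false };

    pthread_create(&t, NULL, cpu_thread, NULL);
    run_on_cpu(&wi);
    pthread_join(t, NULL);
    printf("vcpu_dirty=%d\n", vcpu_dirty);
    return 0;
}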
Christian Ehrhardt (paelzer) wrote : | #16 |
After this I was trying to start closer to the issue, so I put a break on "process_
Once that was hit I added "qemu_kvm_
Of course, when I try that, the other functions do not trigger.
Maybe it is partially influenced by the debugging itself and/or the timing changes it causes.
I'll check what else I can find with slightly different debugging; so much as an update for now.
Dr. David Alan Gilbert (dgilbert-h) wrote : | #17 |
oh yeh you want to tell gdb to ignore SIGUSR1, something like:
handle SIGUSR1 nostop noprint pass
Christian Ehrhardt (paelzer) wrote : | #18 |
Sure, but initially I wanted to see what was going on overall, so I let it pop up.
Started another debugging session today.
First I confirmed with
(gdb) catch syscall exit exit_group
That this is the "normal" exit path along the error message we knew:
migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE, MIGRATION_STATUS_FAILED);
error_report("load of migration failed: %s", strerror(-ret));
qemu_fclose(mis->from_src_file);
exit(EXIT_FAILURE);
I found that the retval of qemu_loadvm_state is already -5 (-EIO).
Everything else afterwards is cleanup.
Inside qemu_loadvm_state the first 2/3 pass, and then that ret=-5 comes from "ret = qemu_file_get_error(f)".
Christian Ehrhardt (paelzer) wrote : | #19 |
Via a watchpoint I found that the error is set by qemu_fill_buffer.
b qemu_loadvm_state
handle SIGUSR1 nostop noprint pass
c
# on the break check and watch the status
(gdb) p f
$1 = (QEMUFile *) 0xb9babb3c00
(gdb) p *f
$2 = {ops = 0xb9b89880a0 <channel_
buf_size = 0, buf = '\000' <repeats 32767 times>, may_free = {0}, iov = {{iov_base = 0x0, iov_len = 0} <repeats 64 times>}, iovcnt = 0,
last_error = 0}
# ok still no err, set watchpoint
(gdb) p &(f->last_error)
$4 = (int *) 0xb9babbc044
(gdb) watch *(int *) 0xb9babbc044
Hardware watchpoint 2: *(int *) 0xb9babbc044
# This catches the following
Thread 1 "qemu-system-x86" hit Hardware watchpoint 2: *(int *) 0xb9babbc044
Old value = 0
New value = -5
0x000000b9b82bd0ec in qemu_file_set_error (ret=-5, f=0xb9babb3c00) at ./migration/
warning: Source file is more recent than executable.
125 f->last_error = ret;
(gdb) bt
#0 0x000000b9b82bd0ec in qemu_file_set_error (ret=-5, f=0xb9babb3c00) at ./migration/
#1 qemu_fill_buffer (f=0xb9babb3c00) at ./migration/
#2 0x000000b9b82bdbb1 in qemu_peek_byte (f=0xb9babb3c00, offset=0) at ./migration/
#3 0x000000b9b82bdc1b in qemu_get_byte (f=f@entry=
#4 0x000000b9b82b5853 in qemu_loadvm_
#5 0x000000b9b82b864f in qemu_loadvm_state (f=f@entry=
#6 0x000000b9b82af5c3 in process_
#7 0x000000b9b83e42a6 in coroutine_
#8 0x00007fbf3702fac0 in ?? () from /lib/x86_
#9 0x00007fffe3f9f800 in ?? ()
#10 0x0000000000000000 in ?? ()
Christian Ehrhardt (paelzer) wrote : | #20 |
So this is failing I/O that iterates over a channel.
I was tracking down the len, pending, and pos values used.
I found that this is not completely broken (like no access or a general I/O error):
it starts at pos 0 and iterates with varying offsets, working for quite some time.
Example:
[...]
Thread 1 "qemu-system-x86" hit Breakpoint 2, qemu_fill_buffer (f=f@entry=
295 if (len > 0) {
$11183 = 28728
$11184 = 4040
$11185 = {ops = 0xd3b3d740a0 <channel_
buf_index = 0, buf_size = 4040,
buf = "\v\327\
may_free = {0}, iov = {{iov_base = 0x0, iov_len = 0} <repeats 64 times>}, iovcnt = 0, last_error = 0}
[...]
Well, you could see the whole file read passing by, buffer by buffer.
Yet this isn't particularly fast, so track the one that has len==0:
(gdb) b ./migration/
And I got it as:
(gdb) p *f
$11195 = {ops = 0xd3b3d740a0 <channel_
buf_index = 0, buf_size = 0, buf = '\000' <repeats 5504 times>..., may_free = {0}, iov = {{iov_base = 0x0, iov_len = 0} <repeats 64 times>},
iovcnt = 0, last_error = 0}
Here pending == 0, so buf_size = 0 as well; pos has by now incremented to 319638837.
Checking in detail I found that I had pending=0 and buf_size=0 as well as non-aligned pos entries before, and they worked.
So I excluded buf_size=0 by itself as the trigger.
Maybe it just iterates pos out of the range that is working?
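For reference, the relevant logic, simplified from QEMU 2.10's migration/qemu-file.c, qemu_fill_buffer() (an excerpt, not the full function); the detail that matters here is that a plain EOF (len == 0) on the stream is recorded as -EIO, i.e. exactly the ret=-5 / "Input/output error" seen above:
/* simplified excerpt - error handling of the buffer refill */
len = f->ops->get_buffer(f->opaque, f->buf + pending, f->pos,
                         IO_BUF_SIZE - pending);
if (len > 0) {
    f->buf_size += len;
    f->pos += len;
} else if (len == 0) {
    qemu_file_set_error(f, -EIO);   /* EOF on the stream -> -5 */
} else if (len != -EAGAIN) {
    qemu_file_set_error(f, len);
}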
Christian Ehrhardt (paelzer) wrote : | #21 |
(gdb) handle SIGUSR1 nostop noprint pass
(gdb) b migration/
(gdb) command
p f->pos
c
end
That showed the pos is ever-increasing and fails at an offset it never read before; yet the absolute number was different each run.
$1 = 0
$2 = 8948
$3 = 41423
[...]
$11359 = 326387440
$11360 = 326420208 => This was the one failing this time
This was a different f->pos than last time, so I wondered if this would change every time.
With a less interactive gdb config I got in three tries:
1. 313153311
2. 313313376
3. 313571856
So a different f->pos failed each time - different, but rather close.
I wondered if the reason I got a higher offset when tracing in more detail (printing all offsets) could be that something is still being copied/synced and only slowly becomes available.
I stepped through rather slowly and got to 322429260 this time.
So continuing more slowly over the qemu_fill_buffer iteration makes it fail "later"?
Finally, it is surely interesting which channel that actually is - likely the migration socket?
And yes, ioc->name in qio_channel_read is:
$8 = 0x56ab78e5c0 "migration-
Christian Ehrhardt (paelzer) wrote : | #22 |
So TL;DR summary for now:
- error triggers in qio_channel_read
- file is migration-
- reads work a while, but then fail at high f->pos offsets (slightly different ones each time)
- slower execution seems to lead to slightly higher offsets that are failing
- only happens on --copy-storage-* migrations (libvirt/virsh argument)
I don't really know atm where to look deeper - is there a good side channel that I could use to look at what is going on on the migration-
Dr. David Alan Gilbert (dgilbert-h) wrote : | #23 |
OK, so that looks like a real case of the migration stream failing and getting an I/O error; so the question is why:
a) Is the source qemu dying first and closing the socket?
b) Is libvirt closing the socket for some reason?
Dr. David Alan Gilbert (dgilbert-h) wrote : | #24 |
also, you might want to chase it a bit further down, I think we've got:
qemu-
io/
it would be good to know what the readv/recvmsg is actually returning in the case where it's failing.
Dave
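For reference, the adapter in the middle of that chain, simplified from QEMU 2.10's migration/qemu-file-channel.c (treat this as a sketch of channel_get_buffer(), not a verbatim copy): the QEMUFile "get_buffer" op loops qio_channel_read() and waits whenever the channel would block.
static ssize_t channel_get_buffer(void *opaque, uint8_t *buf,
                                  int64_t pos, size_t size)
{
    QIOChannel *ioc = QIO_CHANNEL(opaque);
    ssize_t ret;

    do {
        ret = qio_channel_read(ioc, (char *)buf, size, NULL);
        if (ret < 0) {
            if (ret == QIO_CHANNEL_ERR_BLOCK) {
                /* wait until the socket is readable, then retry */
                if (qemu_in_coroutine()) {
                    qio_channel_yield(ioc, G_IO_IN);
                } else {
                    qio_channel_wait(ioc, G_IO_IN);
                }
            } else {
                return -EIO; /* real error: surfaces as the ret=-5 above */
            }
        }
    } while (ret == QIO_CHANNEL_ERR_BLOCK);

    return ret;
}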
Christian Ehrhardt (paelzer) wrote : | #25 |
I'll track down the actual read and add debugging on the source at the same time (that should be the best way to track the migration socket on both sides).
This might be slightly tricky since I don't know exactly which offset to expect, but I can surely start above 310*10^6 it seems.
I'll report back once I know more; thanks for your guidance, David.
Dr. David Alan Gilbert (dgilbert-h) wrote : | #26 |
Hmm, I just tried to reproduce this and hit (on the source):
main_channel_
qemu-system-x86_64: /root/qemu/
2017-08-22 10:50:04.888+0000: shutting down, reason=crashed
Dr. David Alan Gilbert (dgilbert-h) wrote : | #27 |
OK, 3rd try and I've hit the same behaviour as Christian.
Christian Ehrhardt (paelzer) wrote : | #28 |
Stack from qemu_fill_buffer down to qio_channel_socket_readv:
#0 qio_channel_
at ./io/channel-
#1 0x0000001486ec97e2 in qio_channel_read (ioc=ioc@
buf=
#2 0x0000001486e005ec in channel_get_buffer (opaque=<optimized out>,
buf=
#3 0x0000001486dff095 in qemu_fill_buffer (f=f@entry=
I checked that recvmsg(sioc->fd, &msg, sflags) does in fact read from the socket.
E.g. with this fd being 27:
tcp ESTAB 1405050 0 ::ffff:
I need to break on the fail of that recvmsg in qio_channel_socket_readv.
# the following does not work due to optimization; the ret value is only around later
b io/channel-
But catching it "inside" the if works
b io/channel-
Take the following with a grain of salt; this is very threaded and noisy to debug.
Once I hit it, recvmsg returned "-1"; that was at f->pos = 311641887.
But at the same time I could confirm (via ss) that the socket itself was still open on source and target of the migration.
-1 here means EAGAIN, which is mapped to QIO_CHANNEL_ERR_BLOCK.
That seems to arrive in nbd_rwv (nbd/common.c:44).
And led to "qio_channel_yield".
There are a few coroutine switches in between, so I hope I'm not losing anything.
But that first ret<0 actually worked; it seems the yield and retry got it working.
I got back to qemu_fill_buffer iterating further after this.
This hit ret<0 in qio_channel_
This time on returning the QIO_CHANNEL_
That was interesting as it is different than before.
After this it seemed to become a death spiral - recvmsg returned -1 every time (still at the same offset).
It passed back through nbd_rwv, which called qio_channel_yield multiple times.
Then it continued, and later on 321998304 was the last offset I saw.
It no longer passed the breakpoint at b io/channel-
Hmm, I might have lost myself in the coroutine switches - but it is odd at least.
Trying to redo less interactive and with a bit more prep ...
Maybe the results are more reliable then ...
Getting back with more later ...
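As a self-contained illustration of the pattern observed here (a sketch, not QEMU code): a non-blocking read returning -1 with EAGAIN is not a real error; QEMU's qio_channel layer maps it to QIO_CHANNEL_ERR_BLOCK and the coroutine yields until the fd is readable, then retries. Below, poll() stands in for the coroutine yield:
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Read from a non-blocking fd, retrying on EAGAIN the way nbd_rwv()
 * retries after qio_channel_yield(ioc, G_IO_IN). */
ssize_t read_retry_on_eagain(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t ret = read(fd, buf, len);
        if (ret >= 0) {
            return ret;  /* >0: data, 0: real EOF (peer closed the socket) */
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);   /* the "yield": wait until readable */
            continue;
        }
        return -1;               /* genuine I/O error */
    }
}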
Christian Ehrhardt (paelzer) wrote : | #29 |
Only now read comment #27 - thanks David for reproducing alongside me; it is somewhat relieving that you seem to see the same.
Dr. David Alan Gilbert (dgilbert-h) wrote : | #30 |
(4th try) a breakpoint on qemu_file_set_error caught it being called from here:
(gdb) list
1155         if (inactivate_disks) {
1156             /* Inactivate before sending QEMU_VM_EOF so that the
1157              * bdrv_invalidate_cache_all() on the other end won't fail. */
1158             ret = bdrv_inactivate_all();
1159             if (ret) {
1160                 qemu_file_set_error(f, ret);
1161                 return ret;
1162             }
1163         }
Christian Ehrhardt (paelzer) wrote : | #31 |
For me qemu_file_set_error was always called from qemu_fill_buffer; interesting that it seems different for you.
I'll rerun a few times to make sure it really always comes from qemu_fill_buffer for me.
Dr. David Alan Gilbert (dgilbert-h) wrote : | #32 |
The difference with the qemu_file_set_error is that I'm looking on the source - what's happening is that the source errors out and therefore closes the socket, so the error you're seeing on the destination is real - the socket just EOF'd!
Dr. David Alan Gilbert (dgilbert-h) wrote : | #33 |
repeated the assert from #26:
Program received signal SIGABRT, Aborted.
0x00007f02163005f7 in __GI_raise (sig=sig@entry=6) at ../nptl/
56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) where
#0 0x00007f02163005f7 in __GI_raise (sig=sig@entry=6) at ../nptl/
#1 0x00007f0216301ce8 in __GI_abort () at abort.c:90
#2 0x00007f02162f9566 in __assert_fail_base (fmt=0x7f0216449288 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=
at assert.c:92
#3 0x00007f02162f9612 in __GI___assert_fail (assertion=
#4 0x0000560ac0036a08 in qio_channel_yield (ioc=ioc@
at /root/qemu/
#5 0x0000560ac001930e in nbd_rwv (ioc=0x560ac239
#6 0x0000560ac0007e24 in nbd_co_send_request (bs=bs@
#7 0x0000560ac0008244 in nbd_client_
#8 0x0000560ac00030e1 in bdrv_driver_pwritev (bs=bs@
#9 0x0000560ac0004480 in bdrv_aligned_
#10 0x0000560ac0005258 in bdrv_co_pwritev (child=
#11 0x0000560abffbf564 in raw_co_pwritev (bs=0x560ac22807f0, offset=3414163456, bytes=<optimized out>, qiov=0x560ac242
#12 0x0000560ac00030e1 in bdrv_driver_pwritev (bs=bs@
#13 0x0000560ac0004480 in bdrv_aligned_
#14 0x0000560ac0005258 in bdrv_co_pwritev (child=
Christian Ehrhardt (paelzer) wrote : | #34 |
In 5/5 tries this was in qemu_fill_buffer for my case.
But that is on the receiving side, and what you found is closer to the root cause on the source of the migration.
I checked qemu_file_set_error on the source and can confirm your finding that there it is called from bdrv_inactivate_all:
#0 qemu_file_set_error (f=f@entry=
#1 0x0000006b727140cb in qemu_savevm_
inactivate_
#2 0x0000006b7270c84b in migration_
s=0x6b74ef53b0) at ./migration/
#3 migration_thread (opaque=
#4 0x00007f61a740e74a in start_thread (arg=0x7f61467f
#5 0x00007f61a714acaf in clone () at ../sysdeps/
Also, as I outlined - what seems ages ago - in comment #6: if the source is a qemu 2.8 the migration works for me, which matches the assumption that the root cause is on the source.
Dr. David Alan Gilbert (dgilbert-h) wrote : | #35 |
OK, Stefan posted a patch for that assert (see 'nbd-client: avoid spurious qio_channel_yield() re-entry'), so now I'm running with the following patch and I'm seeing bdrv_inactivate return a -1 for
drive-virtio-disk0
Christian: Could you see what your source says with this patch?
diff --git a/block.c b/block.c
index 3615a68..f9bd689 100644
--- a/block.c
+++ b/block.c
@@ -4078,9 +4078,11 @@ static int bdrv_inactivate_recurse(BlockDriverState *bs,
     BdrvChild *child, *parent;
     int ret;
 
+    fprintf(stderr, "%s: entry for %s\n", __func__, bdrv_get_device_or_node_name(bs));
     if (!setting_flag && bs->drv->bdrv_inactivate) {
         ret = bs->drv->bdrv_inactivate(bs);
         if (ret < 0) {
+            fprintf(stderr, "%s: exit 1(%d) for %s\n", __func__, ret, bdrv_get_device_or_node_name(bs));
             return ret;
         }
     }
@@ -4094,6 +4096,7 @@ static int bdrv_inactivate_recurse(BlockDriverState *bs,
         if (parent->role->inactivate) {
             ret = parent->role->inactivate(parent);
             if (ret < 0) {
+                fprintf(stderr, "%s: exit 2(%d) for %s\n", __func__, ret, bdrv_get_device_or_node_name(bs));
                 return ret;
             }
         }
@@ -4109,6 +4112,7 @@ static int bdrv_inactivate_recurse(BlockDriverState *bs,
     QLIST_FOREACH(child, &bs->children, next) {
         ret = bdrv_inactivate_recurse(child->bs, setting_flag);
         if (ret < 0) {
+            fprintf(stderr, "%s: exit 3(%d) for %s\n", __func__, ret, bdrv_get_device_or_node_name(bs));
             return ret;
         }
     }
@@ -4117,6 +4121,7 @@ static int bdrv_inactivate_recurse(BlockDriverState *bs,
      * driver */
     bdrv_release_persistent_dirty_bitmaps(bs);
 
+    fprintf(stderr, "%s: exit end good for %s\n", __func__, bdrv_get_device_or_node_name(bs));
     return 0;
 }
Christian Ehrhardt (paelzer) wrote : | #36 |
Building with the attached debug patch ...
Christian Ehrhardt (paelzer) wrote : | #37 |
I didn't add Stefan's patch yet.
Note: the mentioned patch is at: http://
With your debug patch applied I get:
2017-08-22 17:57:04.486+0000: initiating migration
[~26 lines of "bdrv_inactivate_recurse: entry/exit ..." output from the debug patch; device names truncated in this report]
I'm currently building one with Stefans patch applied as well over (my) night, but let me know if there is more that makes sense to try.
Christian Ehrhardt (paelzer) wrote : | #38 |
With the patch from Stefan and your debug patch applied on source and target, I'd say I still run into the same issue.
IDs are slightly off, but they are different on every try anyway.
Still looks the same for me:
[~26 lines of "bdrv_inactivate_recurse: entry/exit ..." output from the debug patch; device names truncated in this report]
Dr. David Alan Gilbert (dgilbert-h) wrote : | #39 |
OK, yeh that's the same symptom I saw - it's that final failure that causes bdrv_inactivate_all to return a failure and fail the source migration.
Stefan Hajnoczi (stefanha) wrote : | #40 |
Please see Fam's patch series "[PATCH for-2.10 0/4] block: Fix non-shared storage migration" that fixes this issue.
Dr. David Alan Gilbert (dgilbert-h) wrote : | #41 |
yes, seems to fix it for me.
Thanks Christian for filing this; we probably wouldn't have spotted it before the release without it
(which the test Stefan has just added will hopefully cure!).
Christian Ehrhardt (paelzer) wrote : | #42 |
Hi Stefan,
I was part of the report around the series in "[PATCH for-2.10 0/4] block: Fix non-shared storage migration", but this is happening on rc3, which already contains it.
AFAIK Fam's series is:
dd7fdaad iotests: Add non-shared storage migration case 192 (Fam)
5f7772c4 block-backend: Defer shared_perm tightening migration completion (Fam)
3dff24f2 nbd: Fix order of bdrv_set_perm and bdrv_invalidate_cache
80adf54e stubs: Add vm state change handler stubs (Fam)
All of these got into v2.10.0-rc3, which these tests are based on already.
IMHO this is not complete for qemu 2.10 and is a regression since 2.9 (well, since 2.8, as I haven't tested 2.9 personally).
Christian Ehrhardt (paelzer) wrote : | #43 |
Ok, clarified with Stefanha:
it has exactly the same title as a series from 18th August which was related to a similar issue.
It is about an hour old now on qemu-devel; quoting:
"This fixes the issue reported as https:/
Fam Zheng (3):
block-backend: Refactor inactivate check
block-backend: Allow more "can inactivate" cases
mirror: Mark target BB as "force allow inactivate"
Stefan Hajnoczi (1):
block: Update open_flags after ->inactivate() callback"
I'll prep a build with that and test as well.
Eric Blake (eblake) wrote : Re: [Qemu-devel] [Bug 1711602] Re: --copy-storage-all failing with qemu 2.10 | #44 |
On 08/23/2017 09:55 AM, ChristianEhrhardt wrote:
> Ok, clarified with Stefanha
> It has exactly the same title as a series of 18th August which was related to a similar issue.
> It is about an hour old now on qemu-devel, quoting
>
> "This fixes the issue reported as
> https:/
>
> Fam Zheng (3):
> block-backend: Refactor inactivate check
> block-backend: Allow more "can inactivate" cases
> mirror: Mark target BB as "force allow inactivate"
>
> Stefan Hajnoczi (1):
> block: Update open_flags after ->inactivate() callback"
>
>
> I'll prep a build with that and test as well
Here's what is brewing for my pull request, although if you can
successfully test things, I'm happy to add a Tested-by: tag before
actually sending the pull request:
git fetch git://repo.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Christian Ehrhardt (paelzer) wrote : | #45 |
Hmm, it gets further but still cannot complete this kind of migration:
$ virsh migrate --live --copy-storage-all kvmguest-
Source:
2017-08-23 16:49:23.022+0000: initiating migration
Unexpected error in bdrv_check_perm() at /build/
2017-08-
2017-08-23 16:49:23.762+0000: shutting down, reason=crashed
Target:
2017-08-
2017-08-
2017-08-
2017-08-
2017-08-
2017-08-23 16:49:23.797+0000: shutting down, reason=crashed
I was too eager to get this close to real, so I don't have David's fprintfs applied anymore - I'll build with those and then run it in the debugger, but until then what I can see is that the behavior changed slightly (for the worse).
It now crashes the guest on the source as well when aborting the migration.
I need to debug to confirm, but it seems it still aborts the migration:
-> qemu-system-x86_64: load of migration failed: Input/output error
But then can't fall back to the source and crashes at
-> qemu-system-x86_64: Block node is read-only
Christian Ehrhardt (paelzer) wrote : | #46 |
That was rc3 +:
- nbd-client-
- the four patches mentioned in comment #43
I could also rebase onto master + patches, or onto rc4 if there is one soon.
For now I'm building with David's debug statements applied again to check whether we still abort around that assert.
Dr. David Alan Gilbert (dgilbert-h) wrote : | #47 |
I need to recheck with that combo - I'd seen that error, but only when I'd commented out 'if (!blk->dev && !blk_name(blk)[0]) {' while debugging earlier.
Dr. David Alan Gilbert (dgilbert-h) wrote : | #48 |
Looks good here, just retested:
here's the top of my git:
f89f59fad5119f8
cf26039a2b50f07
8ccc527d84ec9a5
34c3f17c99a43f2
952ad9fd9dd43e9
1f2967338764341
Dr. David Alan Gilbert (dgilbert-h) wrote : | #49 |
just tested current head - 1eed33994e28d4a
Christian Ehrhardt (paelzer) wrote : | #50 |
Yeah, it seems to be slightly different from the former assert.
2017-08-23 18:41:54.556+0000: initiating migration
[~25 lines of "bdrv_inactivate_recurse: entry/exit ..." output from the debug patch; device names truncated in this report]
Unexpected error in bdrv_check_perm() at /build/
2017-08-
Which is:
1553 /*
1554  * Check whether permissions on this node can be changed in a way that
1555  * @cumulative_perms and @cumulative_shared_perms are the new cumulative
1556  * permissions of all its parents. This involves checking whether all necessary
1557  * permission changes to child nodes can be performed.
1558  *
1559  * A call to this function must always be followed by a call to bdrv_set_perm()
1560  * or bdrv_abort_perm_update().
1561  */
1562 static int bdrv_check_perm(BlockDriverState *bs, uint64_t cumulative_perms,
1563                            uint64_t cumulative_shared_perms,
1564                            GSList *ignore_children, Error **errp)
1565 {
1566     BlockDriver *drv = bs->drv;
1567     BdrvChild *c;
1568     int ret;
1569 
1570     /* Write permissions never work with read-only images */
1571     if ((cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) &&
1572         !bdrv_is_writable(bs))
1573     {
1574         error_setg(errp, "Block node is read-only");
1575         return -EPERM;
1576     }
Christian Ehrhardt (paelzer) wrote : | #51 |
Yes, with the whole series of [1] on top it finally works.
Saw it already being merged on master.
Expecting a late rc4 or an early release tag, and then we can wrap it all up.
Thanks everybody involved!
[1]: http://
Changed in qemu: | |
status: | New → Fix Committed |
no longer affects: | libvirt (Ubuntu) |
Changed in qemu (Ubuntu): | |
status: | Confirmed → In Progress |
Launchpad Janitor (janitor) wrote : | #52 |
This bug was fixed in the package qemu - 1:2.10~
---------------
qemu (1:2.10~
* Merge with Upstream 2.10-rc4; This fixes a migration issue (LP: #1711602);
Remaining changes:
- qemu-kvm to systemd unit
- d/qemu-kvm-init: script for QEMU KVM preparation modules, ksm,
hugepages and architecture specifics
- d/qemu-kvm.service: systemd unit to call qemu-kvm-init
- d/qemu-
- d/qemu-
- d/qemu-
- d/rules: install /etc/default/
- Enable nesting by default
- set nested=1 module option on intel. (is default on amd)
- re-load kvm_intel.ko if it was loaded without nested=1
- d/p/ubuntu/
in qemu64 cpu type.
- d/p/ubuntu/
in qemu64 on amd
- libvirt/qemu user/group support
- qemu-system-
trigger.
- qemu-system-
- Distribution specific machine type
- d/p/ubuntu/
types to ease future live vm migration.
- d/qemu-
- improved dependencies
- Make qemu-system-common depend on qemu-block-extra
- Make qemu-utils depend on qemu-block-extra
- let qemu-utils recommend sharutils
- s390x support
- Create qemu-system-s390x package
- Include s390-ccw.img firmware
- Enable numa support for s390x
- ppc64[le] support
- d/qemu-
- Enable seccomp for ppc64el
- bump libseccomp-dev dependency, 2.3 is the minimum for ppc64
- arch aware kvm wrappers
- update VCS-git to match the Artful branch
- disable missing x32 architecture
- d/rules: or32 is now named or1k (since 4a09d0bb)
- d/qemu-
- d/qemu-
by qapi-schema.json which is already packaged (since 4d8bb958)
- d/p/02_
to Debian patch to match qemu 2.10)
- s390x package now builds correctly on all architectures (LP 1710695)
* Added changes:
- d/qemu-
since 8508eee7
- d/qemu-
- make nios2/hppa not installed explicitly until further stablized
- d/qemu-
qemu-ga-ref
- d/qemu-
along the qapi intro
- d/not-installed: ignore further generated (since 56e8bdd4) files in
dh_missing that are already provid...
Changed in qemu (Ubuntu): | |
status: | In Progress → Fix Released |
Changed in qemu: | |
status: | Fix Committed → Fix Released |
Is anybody but me testing this combo?
All else seems to work nicely; just this special (and only this) migration setup fails.