migration failed when running BurnInTest in guest
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
QEMU | Expired | Undecided | Unassigned | | |
Bug Description
Hi,
I found a live migration problem and have tracked down the cause, but I can't fix it myself. I really need help.
The failure is intermittent. It happens when a VM and its block device are live-migrated at the same time.
Steps:
1. Start a Windows VM and run BurnInTest (it keeps writing dirty data to the block device during migration).
2. Migrate the VM together with its block device.
3. A few minutes later, the dest VM is killed and the migration fails (intermittently).
Reason:
When migration starts, libvirt on the source host sends a command to qemu, and qemu runs the mirror_run coroutine to copy the block device data to the dest VM's block device. mirror_run runs in the qemu main thread. When this has finished (actually it keeps running, because the VM may still generate dirty data during the following steps), qemu starts migration_thread to migrate RAM and the other devices.
In dest vm, qemu will call "bdrv_invalidat
The primary reason is that mirror_run and migration run in two different threads. Although qemu stops the VM before writing the "QEMU_VM_EOF" byte, that still does not ensure that the mirror_run coroutine writes no more dirty data after the migration thread has sent the "QEMU_VM_EOF" byte.
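The ordering problem described above can be stated as an invariant: no mirror write may reach the destination after the "QEMU_VM_EOF" byte has been sent. The following is only a toy model of that invariant, not qemu code. The event names and the function `first_write_after_eof` are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical event kinds, named only for this sketch. */
enum mig_event {
    EV_MIRROR_WRITE, /* mirror_run pushes dirty data to the dest block device */
    EV_VM_STOPPED,   /* source VM is stopped */
    EV_EOF_SENT      /* migration thread has written "QEMU_VM_EOF" */
};

/*
 * Scan an event trace and return the index of the first mirror write
 * that happens after EOF was sent, or -1 if the trace respects the
 * invariant.
 */
static int first_write_after_eof(const enum mig_event *ev, size_t n)
{
    int eof_seen = 0;

    for (size_t i = 0; i < n; i++) {
        if (ev[i] == EV_EOF_SENT) {
            eof_seen = 1;
        } else if (ev[i] == EV_MIRROR_WRITE && eof_seen) {
            return (int)i;
        }
    }
    return -1;
}
```

A trace of write, stop, EOF, write violates the invariant at index 3. A trace of write, stop, write, EOF does not, because the last mirror write lands before EOF. Stopping the VM alone does not rule out the first trace, which is the point made above.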
Program received signal SIGSEGV, Segmentation fault.
0x00007f90d250db24 in get_cluster_table (bs=0x7f90d493f500, offset=1832189952, new_l2_
new_
573 l2_offset = s->l1_table[
(gdb) bt
#0 0x00007f90d250db24 in get_cluster_table (bs=0x7f90d493f500, offset=1832189952, new_l2_
new_
#1 0x00007f90d250e577 in handle_copied (bs=0x7f90d493f500, guest_offset=
bytes=
#2 0x00007f90d250ef45 in qcow2_alloc_
host_
#3 0x00007f90d250445f in qcow2_co_writev (bs=0x7f90d493f500, sector_num=3578496, remaining_
qiov=
#4 0x00007f90d24d4764 in bdrv_aligned_
qiov=
#5 0x00007f90d24d4d21 in bdrv_co_do_pwritev (bs=0x7f90d493f500, offset=1832189952, bytes=1044480, qiov=0x7f8fbd6f
flags=0) at block.c:3447
#6 0x00007f90d24d3115 in bdrv_rw_co_entry (opaque=
#7 0x00007f90d24d31e7 in bdrv_prwv_co (bs=0x7f90d493f500, offset=1832189952, qiov=0x7f8fbd6f
at block.c:2746
#8 0x00007f90d24d32eb in bdrv_rw_co (bs=0x7f90d493f500, sector_num=3578496,
buf=
#9 0x00007f90d24d3429 in bdrv_write (bs=0x7f90d493f500, sector_num=3578496,
buf=
#10 0x00007f90d24cc2b5 in nbd_trip (opaque=
#11 0x00007f90d24e86fb in coroutine_
#12 0x00007f90d0449310 in ?? () from /lib/x86_
#13 0x00007fff3fcfda10 in ?? ()
#14 0x0000000000000000 in ?? ()
(gdb) p bs
$1 = (BlockDriverState *) 0x7f90d493f500
(gdb) p *s
$3 = {cluster_bits = 16, cluster_size = 65536, cluster_sectors = 128, l2_bits = 13, l2_size = 8192, l1_size = 40,
l1_vm_state_index = 40, csize_shift = 54, csize_mask = 255, cluster_offset_mask = 18014398509481983,
l1_table_offset = 196608, l1_table = 0x0, l2_table_cache = 0x7f90d493eee0, refcount_
cluster_cache = 0x7f90d4a84350 "", cluster_data = 0x7f90ce4de010 "", cluster_
cluster_allocs = {lh_first = 0x0}, refcount_table = 0x7f90d4a94360, refcount_
refcount_
tqh_first = 0x0, tqh_last = 0x7f90d4942c60}}}, crypt_method = 0, crypt_method_header = 0, aes_encrypt_key = {
rd_key = {0 <repeats 60 times>}, rounds = 0}, aes_decrypt_key = {rd_key = {0 <repeats 60 times>}, rounds = 0},
snapshots_offset = 0, snapshots_size = 0, nb_snapshots = 0, snapshots = 0x0, flags = 10338, qcow_version = 3,
use_lazy_
overlap_check = 127, incompatible_
unknown_
tqh_first = 0x0, tqh_last = 0x7f90d4942ec8}, cache_discards = false}
(gdb) p s->l1_table
$4 = (uint64_t *) 0x0
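To check that the crash comes from the NULL l1_table and not from an out-of-range index, the numbers in the dump above can be plugged into the L1 index calculation. This is a sketch under assumptions: the struct is a cut-down, hypothetical stand-in for the qcow2 state, and the index formula (offset shifted right by l2_bits + cluster_bits) is how I understand qcow2 to map a guest offset to an L1 entry. The guarded lookup is for illustration only and is not a proposed fix.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the qcow2 state printed above. */
struct toy_qcow2_state {
    int cluster_bits;
    int l2_bits;
    int l1_size;
    uint64_t *l1_table;
};

/* Each L1 entry covers 2^(l2_bits + cluster_bits) bytes of guest data. */
static uint64_t toy_l1_index(const struct toy_qcow2_state *s, uint64_t offset)
{
    return offset >> (s->l2_bits + s->cluster_bits);
}

/*
 * Look up the L1 entry for a guest offset.
 * Returns 0 and stores the entry on success, or -1 if the table is
 * missing or the index is out of range.
 */
static int toy_l1_lookup(const struct toy_qcow2_state *s, uint64_t offset,
                         uint64_t *entry)
{
    uint64_t idx = toy_l1_index(s, offset);

    if (s->l1_table == NULL || s->l1_size <= 0 ||
        idx >= (uint64_t)s->l1_size) {
        return -1;
    }
    *entry = s->l1_table[idx];
    return 0;
}
```

With cluster_bits = 16, l2_bits = 13 and offset = 1832189952 (sector_num 3578496 times 512), the index is 3, well below l1_size = 40. So no L1 growth is needed, and the unguarded read of s->l1_table at that index dereferences the NULL pointer shown in the dump.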
Changed in qemu: | |
status: | Incomplete → Triaged |
Actually, I hit this issue in an earlier qemu version, where it shows up more plainly.
The BDRVQcowState *s was memset to zero in qcow2_invalidate_cache().
So there is a dead loop in qcow2_grow_l1_table:
while (min_size > new_l1_size) {
    new_l1_size = (new_l1_size * 3 + 1) / 2;
}
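Why the loop quoted above never ends once the state has been zeroed comes down to integer arithmetic: starting from 0, (0 * 3 + 1) / 2 is 0 again, so new_l1_size never grows. The sketch below reproduces only that arithmetic. The function names and the iteration cap are mine, added so the example terminates.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of the growth rule quoted above (integer division). */
static int64_t grow_step(int64_t n)
{
    return (n * 3 + 1) / 2;
}

/*
 * Run the growth loop with an iteration cap.
 * Returns the number of iterations needed to reach min_size, or -1 if
 * the cap is hit, which is how this sketch reports the dead loop.
 */
static int grow_iterations(int64_t start, int64_t min_size, int max_iter,
                           int64_t *result)
{
    int64_t n = start;
    int i = 0;

    while (min_size > n) {
        if (i == max_iter) {
            return -1;
        }
        n = grow_step(n);
        i++;
    }
    if (result != NULL) {
        *result = n;
    }
    return i;
}
```

Starting from 1 with min_size 4, the sequence is 1, 2, 3, 5 and the loop stops after three steps. Starting from 0 it hits the cap for any positive min_size. One possible guard, given here only as an assumption, is to bump a zero starting size to 1 before entering the loop.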
The dead-loop stack is as follows:
#0 qcow2_grow_l1_table () at block/qcow2-cluster.c:59
#1 0x00007f8ef635dafe in get_cluster_table () at block/qcow2-cluster.c:542
#2 0x00007f8ef635e981 in handle_copied () at block/qcow2-cluster.c:1252
#4 0x00007f8ef6363257 in qcow2_co_writev () at block/qcow2.c:828
#5 0x00007f8ef63500d5 in bdrv_co_do_writev () at block.c:2623
#6 0x00007f8ef63508d4 in bdrv_rw_co_entry (opaque=<optimized out>) at block.c:2133
#7 0x00007f8ef6350a3a in bdrv_rwv_co ()
#8 0x00007f8ef6350c06 in bdrv_rw_co () at block.c:2215
#9 0x00007f8ef6457df2 in nbd_trip (opaque=0x7f8efb3195a0) at nbd.c:1139
#10 0x00007f8ef6382a1b in coroutine_trampoline (i0=<optimized out>, i1=<optimized
frame 2
p *s
$4 = {cluster_bits = 0, cluster_size = 0, cluster_sectors = 0, l2_bits = 0, l2_size = 0, l1_size = 0, l1_vm_state_index = 0,
csize_shift = 0, csize_mask = 0, cluster_offset_mask = 0, l1_table_offset = 0, l1_table = 0x0, l2_table_cache = 0x0,
refcount_block_cache = 0x0, cluster_cache = 0x0, cluster_data = 0x0, cluster_cache_offset = 18446744073709551615,
cluster_allocs = {lh_first = 0x0}, refcount_table = 0x0, refcount_table_offset = 0, refcount_table_size = 0,
free_cluster_index = 0, free_byte_offset = 0, lock = {locked = true, queue = {entries = {tqh_first = 0x0, tqh_last = 0x0},
ctx = 0x0}}, crypt_method = 0, crypt_method_header = 0, aes_encrypt_key = {rd_key = {0 <repeats 60 times>},
rounds = 0}, aes_decrypt_key = {rd_key = {0 <repeats 60 times>}, rounds = 0}, snapshots_offset = 0, snapshots_size = 0,
nb_snapshots = 0, snapshots = 0x0, flags = 0, qcow_version = 0, use_lazy_refcounts = false, incompatible_features = 0,
compatible_features = 0, autoclear_features = 0, unknown_header_fields_size = 0, unknown_header_fields = 0x0,
unknown_header_ext = {lh_first = 0x0}}