Thanks James, but now I'm unsure where to go from here, as the issue wasn't reproducible in the many tries at different scales that James and I did.
@Sean/Lee
Since you wondered whether it might be due to the Ubuntu delta on top of 4.2, there are two things we could compare Ubuntu's qemu to:
1. qemu 4.2 as released by upstream
2. qemu 4.2 as built in CentOS (they have some delta as well)
Not sure I can provide you #2 easily, and #1 will always need a bit of delta to build & integrate correctly. All of that would still be doable, but in general: if I provided you PPA builds, could you try those in your environment to !reliably! trigger the issue, allowing us to play bisect-ping-pong? The word "reliable" is important here, as we'd need to sort builds/patches by a reliable yes/no on each step.
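To put a number on "reliable": if the issue only triggers in some attempts, a build can only be called good after enough clean runs. A rough sketch in Python follows. The per-attempt trigger probability is an assumption you would have to estimate from your own environment, and the function name is made up for illustration.

```python
import math


def attempts_needed(p_trigger: float, max_false_good: float) -> int:
    """Number of clean runs required before calling a build "good".

    p_trigger: assumed chance that one attempt hits the issue on a bad build.
    max_false_good: accepted chance of wrongly calling a bad build good.
    """
    if not (0.0 < p_trigger < 1.0 and 0.0 < max_false_good < 1.0):
        raise ValueError("both probabilities must be strictly between 0 and 1")
    # A bad build survives n independent attempts with probability (1 - p)^n;
    # solve (1 - p)^n <= max_false_good for the smallest integer n.
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_trigger))
```

For example, if roughly every second attempt triggers the issue, attempts_needed(0.5, 0.01) gives 7 clean runs per build; at one in ten attempts, attempts_needed(0.1, 0.05) already gives 29. This assumes attempts are independent, which may not hold in practice.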
@Sean/Lee
- which version exactly is "the centos 8 build of the same qemu"? I might take a look at comparing the patches applied. We are at 4.2.1 already; the latest I found there was 4.2.0-29.el8.3.x86_64.rpm, which already is the advanced-virt version (otherwise it is 2.12 based).
@Sean/Lee
All of the following suggested approaches depend on whether you can test this reliably with different qemu PPA builds:
- qemu 4.2 (as upstream) vs usual Ubuntu build -> find the offending patch in our delta
- test Ubuntu 20.10 which has qemu 5.0 and libvirt 6.6 -> if fixed there, find out by which change
- qemu 4.2 Ubuntu vs qemu 4.2 as in CentOS (but built for Ubuntu) -> if the latter works better, then let us find out by which (set of) patches.
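The bisect-ping-pong over a patch delta could look roughly like the Python sketch below. It assumes the patches apply in a fixed order, that the build with none of them is good and the build with all of them is bad, and that is_bad (a stand-in for "build a PPA with the first n patches and have it tested") gives a reliable answer. All names here are hypothetical.

```python
from typing import Callable, Sequence


def first_bad_count(patches: Sequence[str], is_bad: Callable[[int], bool]) -> int:
    """Return the smallest n for which a build with patches[:n] shows the issue.

    The offending patch is then patches[n - 1].
    """
    good, bad = 0, len(patches)  # known-good and known-bad patch counts
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # issue present: culprit is within the first mid patches
        else:
            good = mid  # issue absent: culprit comes after the first mid patches
    return bad


# Toy run with made-up patch names; "p5" is the pretend culprit.
patches = [f"p{i}" for i in range(8)]
tested = []


def fake_test(n: int) -> bool:
    tested.append(n)
    return "p5" in patches[:n]


culprit = patches[first_bad_count(patches, fake_test) - 1]
```

Each round halves the remaining range, so N patches need about log2(N) builds, but every one of those rounds needs a trustworthy yes/no from the test environment.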
I was checking the delta in the CentOS 8 advanced-virt qemu 4.2, as that was
reported to (maybe) work better. I compared which patches are applied there
but not on Ubuntu, and which of those might be related.
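For reference, the comparison itself is a simple set difference. A minimal Python sketch, assuming each side is available as "<upstream commit id> <subject>" lines; in real packaging the ids would first have to be extracted from the patch headers, which is not shown here. The ids in the example are placeholders, not real commits.

```python
def only_in_first(first: list[str], second: list[str]) -> list[str]:
    """Return lines of `first` whose commit id does not occur in `second`.

    Each line is expected to look like "<commit id> <subject>"; the original
    order of `first` is preserved and blank lines are ignored.
    """
    def commit_id(line: str) -> str:
        return line.split(maxsplit=1)[0]

    seen = {commit_id(line) for line in second if line.strip()}
    return [line for line in first if line.strip() and commit_id(line) not in seen]


# Placeholder data for illustration only.
centos = ["aaa111 block: fix A", "bbb222 backup: fix B", "ccc333 qcow2: fix C"]
ubuntu = ["bbb222 backup: fix B (as carried in Ubuntu)"]
centos_only = only_in_first(centos, ubuntu)
```

Matching on the commit id rather than the subject keeps a patch from being reported as missing just because its title was edited during the backport.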
Apart from several individual fixes for some issues, the biggest patch sets
are feature backports for enhanced LUKS/backup/snapshot handling,
multifd migration/cancel, block-mirror, HMAT changes, virtiofs, qemu-img zero
write, arm time handling and some related build time self tests.
Due to the nature of these changes, some of them affect the block handling by touching block/job/hotplug code. But they might only do so by accident; nothing is clearly meant to address the issue that
was reported. And even in the list below most seem unrelated - so, as Sean assumed, maybe it just isn't exercised enough on CentOS to be seen there?
eca0f3524a4eb57d03a56b0cbcef5527a0981ce4 backup: don't acquire aio_context in backup_clean
58226634c4b02af7b10862f7fbd3610a344bfb7f backup: Improve error for bdrv_getlength() failure
958a04bd32af18d9a207bcc78046e56a202aebc2 backup: Make sure that source and target size match
7b8e4857426f2e2de2441749996c6161b550bada block: Add flags to bdrv(_co)_truncate()
92b92799dc8662b6f71809100a4aabc1ae408ebb block: Add flags to BlockDriver.bdrv_co_truncate()
087ab8e775f48766068e65de1bc99d03b40d1670 block: always fill entire LUKS header space with zeros
8c6242b6f383e43fd11d2c50f8bcdd2bba1100fc block-backend: Add flags to blk_truncate()
564806c529d4e0acad209b1e5b864a8886092f1f block-backend: Reorder flush/pdiscard function definitions
0abf2581717a19d9749d5c2ff8acd0ac203452c2 block/backup-top: Don't acquire context while dropping top
1de6b45fb5c1489b450df7d1a4c692bba9678ce6 block: bdrv_reopen() with backing file in different AioContext
e1d7f8bb1ec0c6911dcea81641ce6139dbded02d block.c: adding bdrv_co_delete_file
69032253c33ae1774233c63cedf36d32242a85fc block/curl: HTTP header field names are case insensitive
7788a319399f17476ff1dd43164c869e320820a2 block/curl: HTTP header fields allow whitespace around values
91005a495e228ebd7e5e173cd18f952450eef82d blockdev: Acquire AioContext on dirty bitmap functions
471ded690e19689018535e3f48480507ed073e22 blockdev: fix coding style issues in drive_backup_prepare
3ea67e08832775a28d0bd2795f01bc77e7ea1512 blockdev: honor bdrv_try_set_aio_context() context requirements
c6996cf9a6c759c29919642be9a73ac64b38301b blockdev: Promote several bitmap functions to non-static
377410f6fb4f6b0d26d4a028c20766fae05de17e blockdev: Return bs to the proper context on snapshot abort
bb4e58c6137e80129b955789dd4b66c1504f20dc blockdev: Split off basic bitmap operations for qemu-img
5b7bfe515ecbd584b40ff6e41d2fd8b37c7d5139 blockdev: unify qmp_blockdev_backup and blockdev-backup transaction paths
2288ccfac96281c316db942d10e3f921c1373064 blockdev: unify qmp_drive_backup and drive-backup transaction paths
7f16476fab14fc32388e0ebae793f64673848efa block: Fix blk->in_flight during blk_wait_while_drained()
30dd65f307b647eef8156c4a33bd007823ef85cb block: Fix cross-AioContext blockdev-snapshot
eeea1faa099f82328f5831cf252f8ce0a59a9287 block: Fix leak in bdrv_create_file_fallback()
fd17146cd93d1704cd96d7c2757b325fc7aac6fd block: Generic file creation fallback
fbb92b6798894d3bf62fe3578d99fa62c720b242 block: Increase BB.in_flight for coroutine and sync interfaces
17e1e2be5f9e84e0298e28e70675655b43e225ea block: Introduce 'bdrv_reopen_commit_post' step
9bffae14df879255329473a7bd578643af2d4c9c block: introducing 'bdrv_co_delete_file' interface
c7a0f2be8f95b220cdadbba9a9236eaf115951dc block: Make bdrv_get_cumulative_perm() public
ef893b5c84f3199d777e33966dc28839f71b1a5c block: Make it easier to learn which BDS support bitmaps
78c81a3f108870d325b0a39d88711366afe6f703 block/nbd: Fix hang in .bdrv_close()
b92902dfeaafbceaf744ab7473f2d070284f6172 block: pass BlockDriver reference to the .bdrv_co_create
65eb7c85a3e62529e2bad782e94d5a7b11dd5a92 block/qcow2: Move bitmap reopen into bdrv_reopen_commit_post
d29d3d1f80b3947fb26e7139645c83de66d146a9 block: Relax restrictions for blockdev-snapshot
5a5e7f8cd86b7ced0732b1b6e28c82baa65b09c9 block: trickle down the fallback image creation function use to the block drivers
955c7d6687fefcd903900a1e597fcbc896c661cd block: truncate: Don't make backing file data visible
1bba30da24e1124ceeb0693c81382a0d77e20ca5 crypto.c: cleanup created file when block_crypto_co_create_opts_luks fails
87ca3b8fa615b278b33cabf9ed22b3f44b5214ba file-posix: Drop hdev_co_create_opts()
2f0c6e7a650de133eccd94e9bb6cf7b2070f07f1 file-posix: Support BDRV_REQ_ZERO_WRITE for truncate
89b6fc45614bb45dcd58f1590415afe5c2791abd hmp: Allow using qdev ID for qemu-io command
80f0900905b555f00d644894c786b6d66ac2e00e iscsi: Drop iscsi_co_create_opts()
0501e1aa1d32a6e02dd06a79bba97fbe9d557cb5 hw/pci/pcie: Forbid hot-plug if it's disabled on the slot
0dabc0f6544f2c0310546f6d6cf3b68979580a9c hw/pci/pcie: Move hot plug capability check to pre_plug callback
6a1e073378353eb6ac0565e0dc649b3db76ed5dc hw/pci/pcie: Replace PCI_DEVICE() casts with existing variable
b660a84bbb0eb1a76b505648d31d5e82594fb75e job: take each job's lock individually in job_txn_apply
530a0963184e57e71a5b538e9161f115df533e96 pcie_root_port: Add hotplug disabling option
c6bdc312f30d5c7326aa2fdca3e0f98c15eb541a qapi: Add '@allow-write-only-overlay' feature for 'blockdev-snapshot'
5d72c68b49769c927e90b78af6d90f6a384b26ac qcow2: Expose bitmaps' size during measure
eb8a0cf3ba26611f3981f8f45ac6a868975a68cc qcow2: Forward ZERO_WRITE flag for full preallocation
f01643fb8b47e8a70c04bbf45e0f12a9e5bc54de qcow2: Support BDRV_REQ_ZERO_WRITE for truncate
a555b8092abc6f1bbe4b64c516679cbd68fcfbd8 qemu-file: Don't do IO after shutdown
08558e33257ec796594bd411261028a93414a70c replication: assert we own context before job_cancel_sync
49b44549ace7890fffdf027fd3695218ee7f1121 virtio-blk: On restart, process queued requests in the proper context
7aa1c247b466870b0704d3ccdc3755e5e7394dca virtio-blk: Refactor the code that processes queued requests
d0435bc513e23a4961b6af20164d1c6c219eb4ea virtio: don't enable notifications during polling
Conversely, being on 4.2.1 already gives Ubuntu some block changes that might have caused this...
Waiting for Sean/Lee to comment on how testable/reliable that is on their end -> incomplete