Comment 53 for bug 1023755

Stefan Bader (smb) wrote :

From what I have seen so far in the dump I took with the devstack setup on a loop block device, there are many outstanding io requests for the snapshot-cow device (expected, since dd is busy). But digging down to loop0 (which is backing the vg), its worker thread ended up trying to balance dirty pages and went into io_schedule for that. So it looks a bit like the write path to ext4 needs some page(s) it cannot get, because the rest of memory is filled with incomplete requests from the snapshot-cow.
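If someone wants to check the same thing on their setup while the hang is happening, something along these lines should show the blocked tasks and the dirty-page pressure (standard sysrq and procfs interfaces, nothing specific to this bug; run as root):

    # dump all tasks stuck in uninterruptible sleep to the kernel log
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 100
    # dirty and writeback page counters at the time of the stall
    grep -E 'Dirty|Writeback' /proc/meminfo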

In parallel I ran the same test on the same machine after adding a usb disk drive with a stack-volumes vg on it. In that setup the test runs without problems (although a 30s timeout would be far too small; in reality deleting the snapshot or the volume takes about 70s). Even with this, the procedure seems to leak the snapshot-cow device (it still exists after deleting the snapshot and also after deleting the volume). I cannot say why this happens, but the leftover device can be removed without any issues using dmsetup commands.
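For anyone who hits the same leftover mapping, the cleanup I used was roughly this (the "-cow" name below is just an example of how the leaked device showed up for me; check the dmsetup ls output on your system):

    # list the remaining device-mapper targets and find the stale cow device
    dmsetup ls
    dmsetup info <volume-name>-cow
    # remove the leaked cow mapping
    dmsetup remove <volume-name>-cow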

Vish, with your test case, would it make a difference to use deadline as the io scheduler on Precise (as root: "echo deadline > /sys/block/xxx/queue/scheduler")? We know that cfq seems to behave badly in certain situations, and I wonder whether changing the io scheduler at least improves the situation.
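To be explicit about what I mean (sda is just a placeholder here; use whatever physical device backs the vg in your setup, since loop devices usually do not expose a scheduler):

    # show the available schedulers; the active one is in brackets
    cat /sys/block/sda/queue/scheduler
    # switch the backing device to deadline for the duration of the test
    echo deadline > /sys/block/sda/queue/scheduler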

I still need to check whether ext4 is prepared to handle low-memory situations like that (or maybe there is a fix for that in later kernels). The behaviour might also change depending on the speed of the block devices involved. So another piece to check would be to change the dd into doing sync io (once I find the right place to change it).
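To illustrate the kind of change I have in mind (this is not the actual code path, just the shape of the dd invocation; the real one lives in the volume-clearing code and I still have to find it):

    # current style: writes go through the page cache and pile up as dirty pages
    dd if=/dev/zero of=/dev/mapper/<volume> bs=1M count=1024
    # sync io variants: either bypass the page cache or force synchronous writes,
    # so dd is throttled by the device instead of filling memory
    dd if=/dev/zero of=/dev/mapper/<volume> bs=1M count=1024 oflag=direct
    dd if=/dev/zero of=/dev/mapper/<volume> bs=1M count=1024 oflag=dsync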