(I work with Mike Krieger.)
Our "quick fix" for this was to use HVM guests.
Unfortunately, the only way for us to reproduce this bug at the moment is to put a database box under production load, so we have to be judicious in our approach. It also takes several hours to bring a new guest up, get it in sync, and have it ready for the load, so it's not something that's easy to iterate on. We've tried running several different torture tests & I/O benchmarks with no success.
Our next step is to run a PV guest with a sched_clock_stable=0 kernel, and if that still reproduces the issue we'll try turning autogroups off.