Comment 3 for bug 1350766

Michael Steffens (michael-steffens-b) wrote :

That is really tough to guess. I don't know of any reason why a production environment would be less susceptible in principle.

During my tests almost all QCOW2 instantiations and launches of snapshots failed until I applied the fsync fix. Notable exceptions:

 * The cirros and ubuntu original images instantiated right after OpenStack setup, with no other load on the system at all.
 * Windows (due to its size! qemu-img can consume the head of the disk image without ever catching up, while the nova download is still far ahead, writing the tail).
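
To illustrate the idea behind the fsync fix, here is a minimal sketch. It is not the actual nova patch; the function name `write_image_durably` and its parameters are hypothetical. It assumes the download arrives as an iterable of byte chunks, and it only returns once the data has been flushed and fsynced, so a consumer such as qemu-img would be started afterwards rather than racing the writer.

```python
import os


def write_image_durably(chunks, dest_path):
    """Write all chunks to dest_path, then force them to stable storage.

    Hypothetical helper for illustration only. Returns the final file size.
    """
    with open(dest_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        # Push Python's userspace buffer to the OS ...
        f.flush()
        # ... and ask the OS to commit the file data to disk.
        os.fsync(f.fileno())
    return os.path.getsize(dest_path)
```

Only after this function has returned would the image be handed to the conversion step.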

The other failures varied, from no boot disk being found at all to errors partway through boot. So I wouldn't be too surprised if a certain fraction of the instances running in production had been clipped unnoticed, having lost only chunks that are read late or never.

How big that fraction is depends on too many boundary conditions.

What bothers me most with respect to security is the failure's stickiness. Once a base image is broken on a compute node, it takes careful intervention to keep it from being propagated into all subsequent instantiations, including those of other users and tenants.
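
One way to spot such a broken base image would be to compare it against a known-good checksum before reusing it. The following is only a sketch under assumptions: the helper names are hypothetical, and it assumes a SHA-256 digest of the intact image is available from somewhere trustworthy, which may not match what a given deployment actually records.

```python
import hashlib


def file_sha256(path, block_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at path, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def base_image_is_intact(path, expected_sha256):
    """Hypothetical check: True if the cached file matches the expected digest."""
    return file_sha256(path) == expected_sha256.lower()
```

A cached image failing this check would have to be discarded and fetched again rather than used for further instances.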