Add monitor for stale UUID_resize directories in ephemeral storage

Bug #1783592 reported by Drew Freiberger
This bug affects 2 people
Affects: OpenStack Nova Compute Charm
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned
Milestone: —

Bug Description

It has been found that, after manual resolution of instance migration failures, a $INSTANCE_UUID_resize directory and its disk/console.log are sometimes not cleaned up by the cloud underlay operators, and cannot be cleaned up by cloud overlay admins.

I'd like to propose an NRPE check that looks for "stale" *_resize directories in $EPHEMERAL_DIR (/var/lib/nova/instances, /srv/nova/instances, or wherever the charm configures ephemeral storage to live).

There are two scenarios to consider:
1. Successful migration/live-migration waiting on resize-confirm or resize-revert before removing the source hypervisor's $INSTANCE_UUID_resize directory.

This scenario is simple: typically you might expect a user or operator to confirm the resize within a "reasonable" timeframe. Let's call it 2-6 hours, unless they went home for the night and reviewing migrations becomes a morning task, so possibly ~20 hours until the resize-confirm is executed.

2. Unsuccessful migration where post-migrate routines fail on the destination host and set the instance state to ERROR.

In this scenario, the source host has the _resize directory, and the destination host has the instance's "live but dead" directory. An operator/admin can then run a "hard-reboot" on that instance to reset its status and get it running on the destination host after remediation of whatever caused the post-migration tasks to fail (typically auth token timeouts causing neutron post-migration calls to fail due to long migration times on large ephemeral disks or large active memory chunks).

In the case of scenario 1, I believe Nova tracks the storage used by the _resize directory in the hypervisor stats, but in scenario 2, the _resize directory becomes abandoned by nova, and nova will then try to place instances onto the source host without factoring in the space utilized by the _resize directory. It is because of this that we need a check for stale _resize directories.

I think checking _resize directories against warn and crit age thresholds would be sufficient. Sane defaults for the check might be something along the lines of "warn when mtime is older than 6 hours and crit when older than 24 hours".
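
For illustration, a minimal sketch of what such a plugin might look like. The script name, option names, and default paths below are assumptions for the sake of example, not anything the charm currently ships:

#!/usr/bin/env python3
# check_stale_resize_dirs.py -- illustrative sketch only; names, options and
# defaults are assumptions, not an existing charm-nova-compute plugin.
import argparse
import glob
import os
import sys
import time

NAGIOS = {0: 'OK', 1: 'WARNING', 2: 'CRITICAL', 3: 'UNKNOWN'}

def main():
    parser = argparse.ArgumentParser(
        description='Warn on stale *_resize directories in ephemeral storage')
    parser.add_argument('--path', default='/var/lib/nova/instances',
                        help='ephemeral instance directory to scan')
    parser.add_argument('--warn', type=float, default=6,
                        help='warn if a _resize dir is older than this (hours)')
    parser.add_argument('--crit', type=float, default=24,
                        help='crit if a _resize dir is older than this (hours)')
    args = parser.parse_args()

    now = time.time()
    stale = []   # (age_hours, dirname) tuples over the warn threshold
    status = 0
    for d in glob.glob(os.path.join(args.path, '*_resize')):
        age_hours = (now - os.path.getmtime(d)) / 3600.0
        if age_hours >= args.crit:
            status = max(status, 2)
            stale.append((age_hours, d))
        elif age_hours >= args.warn:
            status = max(status, 1)
            stale.append((age_hours, d))

    if stale:
        detail = ', '.join('%s (%.1fh)' % (d, age)
                           for age, d in sorted(stale, reverse=True))
        print('%s: %d stale _resize dir(s): %s' % (NAGIOS[status], len(stale), detail))
    else:
        print('OK: no stale _resize directories under %s' % args.path)
    sys.exit(status)

if __name__ == '__main__':
    main()

If something along these lines lands, it would presumably be registered the same way the charm wires up its other NRPE checks, with the ephemeral directory taken from the charm's configuration.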

Obviously, disk space alerts from the standard nrpe checks can help mitigate this issue, but identifying more quickly where the space is going would be very helpful in focusing operational troubleshooting efforts.

Drew Freiberger (afreiberger) wrote:

It appears that we may only be able to check this via the timestamp of the "disk" file within the instance directory.

Live instance:

/srv/nova/instances$ ls -altr df1b7c64-2d74-48b1-b5c5-45229b89a5ec/
total 1133869328
-rw-r--r-- 1 nova nova 75 May 9 11:11 disk.info
-rw-r--r-- 1 nova nova 2730 May 9 11:11 libvirt.xml
drwxr-xr-x 2 nova nova 89 May 9 11:11 .
-rw-rw---- 1 libvirt-qemu kvm 35861 May 9 11:11 console.log
drwxr-xr-x 10 nova root 4096 Jul 7 03:14 ..
-rw-r--r-- 1 libvirt-qemu kvm 1161091612672 Jul 25 15:23 disk

Orphaned resize directory:

/srv/nova/instances$ ls -altr df06d417-c5c6-42cb-9fd2-df22764d5b75_resize/
total 1497021748
-rw-r--r-- 1 nova nova 75 Jul 6 23:58 disk.info
-rw-r--r-- 1 nova nova 2714 Jul 7 03:14 libvirt.xml
drwxrwxr-x 2 nova nova 89 Jul 7 03:14 .
-rw-r--r-- 1 root root 1532950216704 Jul 7 03:14 disk
-rw------- 1 root root 32957 Jul 7 03:14 console.log
drwxr-xr-x 10 nova root 4096 Jul 7 03:14 ..
/srv/nova/instances$

Also, the instance is still defined in libvirt in a shut-off state.

Are there supposed to be checks for this in nova that should be exposed as alerts from log scraping?
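
If the directory mtime proves unreliable, the check sketched above could derive each *_resize directory's age from its "disk" file instead, falling back to the directory itself when "disk" is missing. A rough sketch of that variant (again, purely illustrative and not code that exists in the charm today):

# Illustrative only: age a *_resize directory by its "disk" file,
# falling back to the directory mtime if no "disk" file is present.
import os
import time

def resize_dir_age_hours(path, now=None):
    now = time.time() if now is None else now
    disk = os.path.join(path, 'disk')
    ref = disk if os.path.exists(disk) else path
    return (now - os.path.getmtime(ref)) / 3600.0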

Changed in charm-nova-compute:
importance: Undecided → Wishlist
status: New → Triaged