Add monitor for stale UUID_resize directories in ephemeral storage
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Nova Compute Charm | Triaged | Wishlist | Unassigned |
Bug Description
It has been found that, after manual resolution of instance migration failures, there are times when a stale ${UUID}_resize directory is left behind in ephemeral storage.
I'd like to propose an NRPE check that looks for "stale" *_resize directories in $EPHEMERAL_DIR (/var/lib/...).
There are two scenarios to consider:
1. Successful migration/resize awaiting confirmation.
This scenario is simple and typically you might expect a user or operator to have confirmed the resize within a "reasonable" timeframe. Let's call it 2-6 hours, unless they went home for the night and it becomes a morning task to review migrations, so possibly ~20 hours until the resize-confirm is executed.
2. Unsuccessful migration where post-migrate routines fail on the destination host and set the instance state to ERROR.
In this scenario, the source host has the _resize directory, and the destination host has the instance's "live but dead" directory. An operator/admin can then run a "hard-reboot" on that instance to reset its status and get it running on the destination host after remediating whatever caused the post-migration tasks to fail (typically auth token timeouts causing neutron post-migration calls to fail, due to long migration times on large ephemeral disks or large active memory chunks).
In the case of scenario 1, I believe Nova tracks the _resize directory's usage in the hypervisor stats, but in scenario 2 the _resize directory becomes abandoned by nova, and nova will now try to place instances onto the source host without factoring in the space utilized by the _resize directory. It is because of this that we need a check for stale _resize directories.
I think looking for _resize directories older than a certain age, with warn and crit thresholds, would be sufficient. A sane default for the check may be something along the lines of "warn at mtime > 6 hours and crit at mtime > 24 hours".
Obviously, disk space alerts from the standard nrpe checks can help mitigate this issue, but identifying where the space is coming from more quickly would be very helpful in pinpointing operational troubleshooting efforts.
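The proposed check could be sketched roughly as follows. This is a minimal illustration, not the charm's implementation; the instances path and the 6h/24h thresholds are the assumptions discussed above, and the exit codes follow the usual Nagios/NRPE convention (0=OK, 1=WARNING, 2=CRITICAL):

```shell
# Hypothetical NRPE check sketch: flag *_resize directories under the
# ephemeral/instances directory whose mtime exceeds warn/crit thresholds.
check_stale_resize() {
    dir="$1"
    warn_h="${2:-6}"    # warn threshold in hours (assumed default)
    crit_h="${3:-24}"   # crit threshold in hours (assumed default)

    # find top-level *_resize directories older than each threshold
    crit=$(find "$dir" -maxdepth 1 -type d -name '*_resize' \
            -mmin +$((crit_h * 60)) 2>/dev/null)
    warn=$(find "$dir" -maxdepth 1 -type d -name '*_resize' \
            -mmin +$((warn_h * 60)) 2>/dev/null)

    if [ -n "$crit" ]; then
        echo "CRITICAL: _resize dirs older than ${crit_h}h: $crit"
        return 2
    elif [ -n "$warn" ]; then
        echo "WARNING: _resize dirs older than ${warn_h}h: $warn"
        return 1
    fi
    echo "OK: no stale _resize directories in $dir"
    return 0
}
```

In practice this would be invoked from nrpe.cfg against the deployment's actual instances_path, e.g. `check_stale_resize /var/lib/nova/instances 6 24`.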
Changed in charm-nova-compute:
importance: Undecided → Wishlist
status: New → Triaged
It appears that the only thing we may be able to check is the timestamp of the "disk" file within the instance directory.
Live instance:
/srv/nova/instances$ ls -altr df1b7c64-2d74-48b1-b5c5-45229b89a5ec/
total 1133869328
-rw-r--r-- 1 nova nova 75 May 9 11:11 disk.info
-rw-r--r-- 1 nova nova 2730 May 9 11:11 libvirt.xml
drwxr-xr-x 2 nova nova 89 May 9 11:11 .
-rw-rw---- 1 libvirt-qemu kvm 35861 May 9 11:11 console.log
drwxr-xr-x 10 nova root 4096 Jul 7 03:14 ..
-rw-r--r-- 1 libvirt-qemu kvm 1161091612672 Jul 25 15:23 disk
Orphaned resize directory:
/srv/nova/instances$ ls -altr df06d417-c5c6-42cb-9fd2-df22764d5b75_resize/
total 1497021748
-rw-r--r-- 1 nova nova 75 Jul 6 23:58 disk.info
-rw-r--r-- 1 nova nova 2714 Jul 7 03:14 libvirt.xml
drwxrwxr-x 2 nova nova 89 Jul 7 03:14 .
-rw-r--r-- 1 root root 1532950216704 Jul 7 03:14 disk
-rw------- 1 root root 32957 Jul 7 03:14 console.log
drwxr-xr-x 10 nova root 4096 Jul 7 03:14 ..
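Given listings like the above, a check keying off the mtime of the "disk" file could be as simple as a find one-liner. This is a hypothetical sketch; the instances path and the 6-hour (360-minute) threshold are assumptions to be tuned per deployment:

```shell
# List *_resize directories whose "disk" file hasn't changed in 6+ hours.
# INSTANCES_DIR is an assumption; substitute the deployment's instances_path.
INSTANCES_DIR="${INSTANCES_DIR:-/srv/nova/instances}"
find "$INSTANCES_DIR" -maxdepth 2 -type f -path '*_resize/disk' \
    -mmin +360 2>/dev/null
```

Each line of output names a candidate stale resize; an empty result means nothing exceeded the threshold.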
Also, the instance is still defined in libvirt in a shut-off state.
Are there supposed to be checks for this in nova that should be exposed as alerts from log scraping?