Guest VM FS corruption after compute host reboot

Bug #1317056 reported by Cian O'Driscoll on 2014-05-07
This bug affects 1 person
Affects: tripleo
Importance: High
Assigned to: James Polley

Bug Description

Rebooting NovaCompute0 caused the guest VM to fail to become pingable (FS corruption).

nova list
+--------------------------------------+-------------------------------------+--------+------------+-------------+---------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------------------+--------+------------+-------------+---------------------+
| 04a7f53f-6f87-4357-9fa6-62ceee8993d6 | overcloud-NovaCompute0-vu5rg65v44nn | ACTIVE | - | Running | ctlplane=192.0.2.28 |
| 2267f56c-e103-4740-a1f9-31a3a974dc26 | overcloud-NovaCompute1-efuq7j4zztwc | ACTIVE | - | Running | ctlplane=192.0.2.30 |
| 6b527a34-4dbe-4d92-b42e-87635e806910 | overcloud-controller0-gzmxb2g2za2h | ACTIVE | - | Running | ctlplane=192.0.2.29 |
+--------------------------------------+-------------------------------------+--------+------------+-------------+---------------------+

nova reboot 04a7f53f-6f87-4357-9fa6-62ceee8993d6

From the console log of the guest vm hosted on NovaCompute0

ci-info: ++++++++++++++++++++++++++++Route info++++++++++++++++++++++++++++
ci-info: +-------+-------------+----------+-----------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-------------+----------+-----------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 10.0.0.1 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 10.0.0.0 | 0.0.0.0 | 255.0.0.0 | eth0 | U |
ci-info: +-------+-------------+----------+-----------+-----------+-------+
[ 143.350298] EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: comm rsyslogd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102
[ 143.393895] EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: comm rsyslogd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102
[ 143.408435] EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: comm rsyslogd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102
 * Starting AppArmor profiles
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
[ OK ]
 * Starting iSCSI initiator service iscsid [ OK ]
 * Setting up iSCSI targets
iscsiadm: No records found
[ OK ]
 * Mounting network filesystems [ OK ]
landscape-client is not configured, please run landscape-config.
Cloud-init v. 0.7.3 running 'modules:config' at Wed, 07 May 2014 10:14:00 +0000. Up 155.44 seconds.
 * Restoring resolver state... [ OK ]
grub-editenv: error: invalid environment block.
Cloud-init v. 0.7.3 running 'modules:final' at Wed, 07 May 2014 10:14:20 +0000. Up 176.30 seconds.
Cloud-init v. 0.7.3 finished at Wed, 07 May 2014 10:14:22 +0000. Datasource DataSourceEc2. Up 178.27 seconds
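
For anyone triaging a similar console log, the repeated EXT4-fs lines above can be pulled apart mechanically. The following is only a minimal sketch: the regular expression is written against the exact message shape seen in this log (it is an assumption that other kernels format the message the same way), and `parse_ext4_errors` is a hypothetical helper name, not part of any tool mentioned in this report.

```python
import re

# One of the error lines from the console log above, copied verbatim.
SAMPLE_LINE = (
    "[ 143.350298] EXT4-fs error (device vda1): ext4_find_dest_de:1648: "
    "inode #847: block 44229: comm rsyslogd: bad entry in directory: "
    "rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102"
)

# Pattern matched to the message layout seen in this log only (assumption).
EXT4_ERROR_RE = re.compile(
    r"EXT4-fs error \(device (?P<device>\w+)\): (?P<func>\w+):(?P<line>\d+): "
    r"inode #(?P<dir_inode>\d+): block (?P<block>\d+): comm (?P<comm>\S+): "
    r"(?P<msg>.*?) - offset=\d+\(\d+\), inode=(?P<entry_inode>\d+), "
    r"rec_len=(?P<rec_len>\d+), name_len=(?P<name_len>\d+)"
)

INT_FIELDS = ("line", "dir_inode", "block", "entry_inode", "rec_len", "name_len")


def parse_ext4_errors(log_text):
    """Return one dict per EXT4-fs directory-entry error found in log_text."""
    errors = []
    for match in EXT4_ERROR_RE.finditer(log_text):
        record = match.groupdict()
        for field in INT_FIELDS:
            record[field] = int(record[field])
        # The kernel complains when the record length is not a multiple of 4.
        record["rec_len_aligned"] = record["rec_len"] % 4 == 0
        errors.append(record)
    return errors


if __name__ == "__main__":
    for err in parse_ext4_errors(SAMPLE_LINE):
        print(err["device"], err["comm"], err["rec_len"], err["rec_len_aligned"])
```

For the values logged here, 24842 % 4 is 2, which matches the "rec_len % 4 != 0" complaint in the kernel message.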

Changed in tripleo:
importance: Undecided → Critical
Clint Byrum (clint-fewbar) wrote :

This seems odd. If a clean reboot is done, the OS should send libvirt a SIGTERM, which should then send a clean shutdown to all of the instances. We definitely need to investigate.

Ben Nemec (bnemec) on 2014-05-21
Changed in tripleo:
status: New → Triaged
Changed in tripleo:
assignee: nobody → Roman Podoliaka (rpodolyaka)
Roman Podoliaka (rpodolyaka) wrote :

I'm not sure what exactly changed in the last few days, but I'm no longer able to reproduce this, for better or for worse. Cian, can you?

Robert Collins (lifeless) wrote :

I presume this is caused by the VM having dirty pages and the host being hard powered off via IPMI (vs. a soft reboot from inside the instance).

Robert Collins (lifeless) wrote :

@Cian what sort of reboot was done - via ironic? 'sudo reboot'? something else?

Cian O'Driscoll (dricco) wrote :

It was a "sudo reboot"

Cian O'Driscoll (dricco) wrote :

Sorry, I actually ran "nova reboot 04a7f53f-6f87-4357-9fa6-62ceee8993d6",
so it would have been via Ironic.

Robert Collins (lifeless) wrote :

Ok, so this could be an example of the no-soft-off aspect of Ironic today. That said, guest FS corruption indicates a non-journalling guest FS, or unsafe cache mode for kvm.
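
One way to check the cache-mode half of that suspicion is to look at the `cache` attribute on each disk's `<driver>` element in the libvirt domain XML (for example as printed by `virsh dumpxml`). This is only a minimal sketch: the sample XML below is hypothetical and does not come from the affected deployment, and `disk_cache_modes` and `unsafe_disks` are illustrative helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML for illustration; not taken from this bug's environment.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <name>instance-00000001</name>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='unsafe'/>
      <source file='/var/lib/nova/instances/example/disk'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def disk_cache_modes(domain_xml):
    """Map each disk's target device name to its configured cache mode."""
    root = ET.fromstring(domain_xml)
    modes = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        dev = target.get("dev") if target is not None else "unknown"
        # With no cache attribute set, the hypervisor default applies.
        if driver is not None:
            modes[dev] = driver.get("cache", "default")
        else:
            modes[dev] = "default"
    return modes


def unsafe_disks(domain_xml):
    """Return the disks configured with cache='unsafe', sorted by device name."""
    modes = disk_cache_modes(domain_xml)
    return sorted(dev for dev, mode in modes.items() if mode == "unsafe")


if __name__ == "__main__":
    print(disk_cache_modes(SAMPLE_DOMAIN_XML))
    print(unsafe_disks(SAMPLE_DOMAIN_XML))
```

Only `unsafe` is flagged here because that mode ignores guest flush requests; whether the other modes are acceptable depends on the storage stack underneath.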

Changed in tripleo:
assignee: Roman Podoliaka (rpodolyaka) → nobody
James Polley (tchaypo) on 2014-08-27
Changed in tripleo:
assignee: nobody → James Polley (tchaypo)
James Polley (tchaypo) wrote :

Chatted with Roman; he can't reproduce this any more and hasn't been working on TripleO.

I've assigned this to myself so that it has an owner, but from what I'm reading, I get the impression that this has only been seen once or twice, is suspected to be caused by Ironic shutting instances down without warning, and is possibly exacerbated by non-journalling filesystems.

It sounds to me as though the paths forward are:

* work with Ironic on a more graceful shutdown (which I believe is in progress); and
* perhaps reconsider which filesystems are in use (which sounds to me as though it's probably something for the implementor to consider rather than TripleO).

Is there anything I'm missing that would mean there's some work for the TripleO team to do here?

Changed in tripleo:
status: Triaged → Invalid
status: Invalid → Won't Fix
importance: Critical → Medium
status: Won't Fix → Incomplete
importance: Medium → High
Derek Higgins (derekh) wrote :

@james Sounds good to me. If Ironic are working on a graceful shutdown, I think we can close this.

Ben Nemec (bnemec) wrote :

Closing per the previous comments (from three years ago!). If this is still a problem let's just open a new bug.

Changed in tripleo:
status: Incomplete → Fix Released