Guest VM FS corruption after compute host reboot

Bug #1317056 reported by Cian O'Driscoll
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| tripleo | Fix Released | High | James Polley | |

Bug Description

Rebooted NovaCompute0, after which the guest VM hosted on it failed to become pingable again (FS corruption).

nova list
+--------------------------------------+-------------------------------------+--------+------------+-------------+---------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------------------+--------+------------+-------------+---------------------+
| 04a7f53f-6f87-4357-9fa6-62ceee8993d6 | overcloud-NovaCompute0-vu5rg65v44nn | ACTIVE | - | Running | ctlplane=192.0.2.28 |
| 2267f56c-e103-4740-a1f9-31a3a974dc26 | overcloud-NovaCompute1-efuq7j4zztwc | ACTIVE | - | Running | ctlplane=192.0.2.30 |
| 6b527a34-4dbe-4d92-b42e-87635e806910 | overcloud-controller0-gzmxb2g2za2h | ACTIVE | - | Running | ctlplane=192.0.2.29 |
+--------------------------------------+-------------------------------------+--------+------------+-------------+---------------------+

nova reboot 04a7f53f-6f87-4357-9fa6-62ceee8993d6

From the console log of the guest vm hosted on NovaCompute0

ci-info: ++++++++++++++++++++++++++++Route info++++++++++++++++++++++++++++
ci-info: +-------+-------------+----------+-----------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-------------+----------+-----------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 10.0.0.1 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 10.0.0.0 | 0.0.0.0 | 255.0.0.0 | eth0 | U |
ci-info: +-------+-------------+----------+-----------+-----------+-------+
[ 143.350298] EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: comm rsyslogd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102
[ 143.393895] EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: comm rsyslogd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102
[ 143.408435] EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: comm rsyslogd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102
 * Starting AppArmor profiles
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
[ OK ]
 * Starting iSCSI initiator service iscsid [ OK ]
 * Setting up iSCSI targets
iscsiadm: No records found
[ OK ]
 * Mounting network filesystems [ OK ]
landscape-client is not configured, please run landscape-config.
Cloud-init v. 0.7.3 running 'modules:config' at Wed, 07 May 2014 10:14:00 +0000. Up 155.44 seconds.
 * Restoring resolver state... [ OK ]
grub-editenv: error: invalid environment block.
Cloud-init v. 0.7.3 running 'modules:final' at Wed, 07 May 2014 10:14:20 +0000. Up 176.30 seconds.
Cloud-init v. 0.7.3 finished at Wed, 07 May 2014 10:14:22 +0000. Datasource DataSourceEc2. Up 178.27 seconds
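
When triaging a console log like the one above, it can help to collapse the repeated EXT4-fs errors into one entry per device, inode, and block. The following is a minimal sketch, not part of the original report: the regular expression is modelled only on the three error lines shown here, and the name `summarize_ext4_errors` is hypothetical. Other kernel versions may format these messages differently.

```python
import re

# Pattern modelled on the EXT4-fs error lines in this report (an assumption;
# other kernels may word or order these fields differently).
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+):(?P<line>\d+): "
    r"inode #(?P<inode>\d+): block (?P<block>\d+): "
    r"comm (?P<comm>[^:\s]+): (?P<message>.*)"
)


def summarize_ext4_errors(console_log):
    """Count EXT4-fs errors per (device, inode, block) in a console log string."""
    counts = {}
    for line in console_log.splitlines():
        match = EXT4_ERROR.search(line)
        if match is None:
            continue  # not an EXT4-fs error line
        key = (match["device"], int(match["inode"]), int(match["block"]))
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    sample = (
        "[ 143.350298] EXT4-fs error (device vda1): ext4_find_dest_de:1648: "
        "inode #847: block 44229: comm rsyslogd: bad entry in directory: "
        "rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, "
        "name_len=102\n"
        "ci-info: | 0 | 0.0.0.0 | 10.0.0.1 | 0.0.0.0 | eth0 | UG |\n"
    )
    for (device, inode, block), count in summarize_ext4_errors(sample).items():
        print(f"{device}: inode #{inode}, block {block}: {count} error(s)")
```

Applied to the log in this report, all three errors would fall under the same key (vda1, inode 847, block 44229), which suggests a single damaged directory block rather than widespread damage.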

Changed in tripleo:
importance: Undecided → Critical
Clint Byrum (clint-fewbar) wrote :

This seems odd. If a clean reboot is done, the OS should send libvirt a SIGTERM, which should then send a clean shutdown to all of the instances. We definitely need to investigate.

Ben Nemec (bnemec)
Changed in tripleo:
status: New → Triaged
Changed in tripleo:
assignee: nobody → Roman Podoliaka (rpodolyaka)
Roman Podoliaka (rpodolyaka) wrote :

I'm not sure what exactly changed in the last few days, but I'm no longer able to reproduce this, for better or for worse. Cian, can you?

Robert Collins (lifeless) wrote :

I presume this is caused by the VM having dirty pages and the host being hard powered off via IPMI (vs. a soft reboot from inside the instance).

Robert Collins (lifeless) wrote :

@Cian what sort of reboot was done - via ironic? 'sudo reboot'? something else?

Cian O'Driscoll (dricco) wrote :

It was a "sudo reboot"

Cian O'Driscoll (dricco) wrote :

Sorry, I actually ran "nova reboot 04a7f53f-6f87-4357-9fa6-62ceee8993d6",
so it would have been via Ironic.

Robert Collins (lifeless) wrote :

Ok, so this could be an example of the no-soft-off aspect of Ironic today. That said, guest FS corruption indicates a non-journalling guest FS, or unsafe cache mode for kvm.
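
One way to check the cache-mode hypothesis is to inspect the guest's libvirt domain XML (for example, as printed by `virsh dumpxml`) for disks whose driver is set to `cache='unsafe'`. The following is only a sketch under stated assumptions: the function name `disks_with_unsafe_cache` and the example XML are hypothetical and were not taken from the affected deployment.

```python
import xml.etree.ElementTree as ET


def disks_with_unsafe_cache(domain_xml):
    """Return target device names of disks whose driver uses cache='unsafe'."""
    root = ET.fromstring(domain_xml)
    flagged = []
    for disk in root.findall("./devices/disk"):
        driver = disk.find("driver")
        target = disk.find("target")
        if driver is not None and driver.get("cache") == "unsafe":
            # Fall back to a placeholder if the disk has no <target> element.
            flagged.append(target.get("dev") if target is not None else "<unknown>")
    return flagged


# Hypothetical domain XML for illustration only.
EXAMPLE_XML = """
<domain type='kvm'>
  <name>example-guest</name>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='unsafe'/>
      <source file='/var/lib/example/disk0.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/var/lib/example/disk1.img'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""

if __name__ == "__main__":
    print(disks_with_unsafe_cache(EXAMPLE_XML))
```

The sketch flags only `unsafe`, since that is the mode named in the comment above. Whether other cache modes are acceptable for a given deployment is a separate judgment it does not attempt to make.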

Changed in tripleo:
assignee: Roman Podoliaka (rpodolyaka) → nobody
James Polley (tchaypo)
Changed in tripleo:
assignee: nobody → James Polley (tchaypo)
James Polley (tchaypo) wrote :

Chatted with Roman; he can't reproduce this any more and hasn't been working on TripleO.

I've assigned this to myself so that it has an owner, but from what I'm reading, my impression is that this has only been seen once or twice, is suspected to be caused by Ironic shutting instances down without warning, and is possibly exacerbated by non-journalling filesystems.

It sounds to me as though the paths forward are:

* work with Ironic on a more graceful shutdown (which I believe is in progress); and
* perhaps reconsider which filesystems are in use (which sounds to me as though it's probably something for the implementor to consider rather than TripleO).

Is there anything I'm missing that would mean there's some work for the TripleO team to do here?

Changed in tripleo:
status: Triaged → Invalid
status: Invalid → Won't Fix
importance: Critical → Medium
status: Won't Fix → Incomplete
importance: Medium → High
Derek Higgins (derekh) wrote :

@james sounds good to me. If Ironic is working on a graceful shutdown, I think we can close this.

Ben Nemec (bnemec) wrote :

Closing per the previous comments (from three years ago!). If this is still a problem let's just open a new bug.

Changed in tripleo:
status: Incomplete → Fix Released