tripleo

Guest VM FS corruption after compute host reboot

Bug #1317056 reported by Cian O'Driscoll on 2014-05-07

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	High	James Polley

Bug Description

Rebooted NovaCompute0, which caused the guest VM to fail to become pingable (FS corruption).

nova reboot 04a7f53f-6f87-4357-9fa6-62ceee8993d6

From the console log of the guest vm hosted on NovaCompute0

ci-info: ++++++++++++++++++++++++++++Route info++++++++++++++++++++++++++++
ci-info: +-------+-------------+----------+-----------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-------------+----------+-----------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 10.0.0.1 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 10.0.0.0 | 0.0.0.0 | 255.0.0.0 | eth0 | U |
ci-info: +-------+-------------+----------+-----------+-----------+-------+
[ 143.350298] EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: comm rsyslogd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102
[ 143.393895] EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: comm rsyslogd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102
[ 143.408435] EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: comm rsyslogd: bad entry in directory: rec_len % 4 != 0 - offset=0(0), inode=1936090721, rec_len=24842, name_len=102
* Starting AppArmor profiles
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
[ OK ]
* Starting iSCSI initiator service iscsid [ OK ]
* Setting up iSCSI targets
iscsiadm: No records found
[ OK ]
* Mounting network filesystems [ OK ]
landscape-client is not configured, please run landscape-config.
Cloud-init v. 0.7.3 running 'modules:config' at Wed, 07 May 2014 10:14:00 +0000. Up 155.44 seconds.
* Restoring resolver state... [ OK ]
grub-editenv: error: invalid environment block.
Cloud-init v. 0.7.3 running 'modules:final' at Wed, 07 May 2014 10:14:20 +0000. Up 176.30 seconds.
Cloud-init v. 0.7.3 finished at Wed, 07 May 2014 10:14:22 +0000. Datasource DataSourceEc2. Up 178.27 seconds
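
The repeated EXT4-fs errors above all appear to point at the same inode and block. For a longer console log, a short script can confirm that. This is only a minimal sketch: it assumes the console log is available as plain text, and the function name `summarize_ext4_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in the console log above, e.g.
# "EXT4-fs error (device vda1): ext4_find_dest_de:1648: inode #847: block 44229: ..."
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+):(?P<line>\d+): "
    r"inode #(?P<inode>\d+): block (?P<block>\d+)"
)


def summarize_ext4_errors(console_log: str) -> Counter:
    """Count EXT4-fs errors per (device, inode, block) triple."""
    counts = Counter()
    for log_line in console_log.splitlines():
        match = EXT4_ERROR.search(log_line)
        if match:
            key = (match["device"], int(match["inode"]), int(match["block"]))
            counts[key] += 1
    return counts
```

If every error maps to a single key, the damage is likely confined to one directory block rather than spread across the filesystem.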

Clint Byrum (clint-fewbar) on 2014-05-18

Changed in tripleo:
importance:	Undecided → Critical

Clint Byrum (clint-fewbar) wrote on 2014-05-18:

This seems odd. If a clean reboot is done, the OS should send libvirt a SIGTERM, which should then send a clean shutdown to all of the instances. We definitely need to investigate.

Ben Nemec (bnemec) on 2014-05-21

Changed in tripleo:
status:	New → Triaged

Roman Podoliaka (rpodolyaka) on 2014-06-03

Changed in tripleo:
assignee:	nobody → Roman Podoliaka (rpodolyaka)

Roman Podoliaka (rpodolyaka) wrote on 2014-06-10:

I'm not sure what exactly changed in the last few days, but I'm no longer able to reproduce this, for better or for worse. Cian, can you?

Robert Collins (lifeless) wrote on 2014-07-18:

I presume this is caused by the VM having dirty pages and the host being hard powered off via IPMI (vs. a soft reboot from inside the instance).

Robert Collins (lifeless) wrote on 2014-08-12:

@Cian what sort of reboot was done - via ironic? 'sudo reboot'? something else?

Cian O'Driscoll (dricco) wrote on 2014-08-12:

It was a "sudo reboot"

Cian O'Driscoll (dricco) wrote on 2014-08-12:

Sorry, I actually ran "nova reboot 04a7f53f-6f87-4357-9fa6-62ceee8993d6",
so it would have been via Ironic.

Robert Collins (lifeless) wrote on 2014-08-13:

Ok, so this could be an example of the no-soft-off aspect of Ironic today. That said, guest FS corruption indicates a non-journalling guest FS, or unsafe cache mode for kvm.
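
One way to check the cache mode possibility is to look at the `cache` attribute on each disk's `<driver>` element in the libvirt domain XML (for example, the output of `virsh dumpxml`). The following is a minimal sketch, not something taken from this deployment: the sample XML and the function name `disk_cache_modes` are hypothetical, and a missing attribute is reported as "default".

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical domain XML for illustration only; real input would come
# from something like `virsh dumpxml <domain>`.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='unsafe'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def disk_cache_modes(domain_xml: str) -> List[Tuple[str, str]]:
    """Return (target device, cache mode) for each disk in the domain XML."""
    root = ET.fromstring(domain_xml)
    modes = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        driver = disk.find("driver")
        dev = target.get("dev", "?") if target is not None else "?"
        # No cache attribute means the hypervisor default is in effect.
        cache = driver.get("cache", "default") if driver is not None else "default"
        modes.append((dev, cache))
    return modes


for dev, cache in disk_cache_modes(SAMPLE_DOMAIN_XML):
    note = " (flushes ignored, data at risk on host power loss)" if cache == "unsafe" else ""
    print(f"{dev}: cache={cache}{note}")
```

Only `unsafe` is flagged here, since that mode ignores guest flush requests. Whether other modes are acceptable depends on the host storage setup.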

Roman Podoliaka (rpodolyaka) on 2014-08-27

Changed in tripleo:
assignee:	Roman Podoliaka (rpodolyaka) → nobody

James Polley (tchaypo) on 2014-08-27

Changed in tripleo:
assignee:	nobody → James Polley (tchaypo)

James Polley (tchaypo) wrote on 2014-08-27:

Chatted with Roman; he can't reproduce this any more and hasn't been working on TripleO.

I've assigned this to myself so that it has an owner. From what I'm reading, though, my impression is that this has only been seen once or twice, is suspected to be caused by Ironic shutting instances down without warning, and is possibly exacerbated by non-journalling filesystems.

It sounds to me as though the paths forward are:

* work with Ironic on a more graceful shutdown (which I believe is in progress); and
* perhaps reconsider which filesystems are in use (which sounds to me as though it's probably something for the implementor to consider rather than TripleO)

Is there anything I'm missing that would mean that there's some work for the TripleO team to do here?

Changed in tripleo:
status:	Triaged → Invalid
status:	Invalid → Won't Fix
importance:	Critical → Medium
status:	Won't Fix → Incomplete
importance:	Medium → High

Derek Higgins (derekh) wrote on 2014-09-19:

@james sounds good to me. If Ironic is working on a graceful shutdown, I think we can close this.

Ben Nemec (bnemec) wrote on 2017-06-15:

Closing per the previous comments (from three years ago!). If this is still a problem let's just open a new bug.

Changed in tripleo:
status:	Incomplete → Fix Released
