All-in-one: host fails after reboot

Bug #1789420 reported by Ghada Khalil
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
After a reboot, the host failed to recover; the console log showed a number of file system failures.

Severity
--------
Major

Steps to Reproduce
------------------
- Use a simplex or duplex system.
- Run sudo reboot -f (or system host-lock/unlock) on a host.
- The host fails after the reboot.

Expected Behavior
------------------
Host reboots and recovers successfully

Actual Behavior
----------------
Host fails to recover after the reboot

Reproducibility
---------------
Fairly reproducible. Seen on 2 out of 4 systems.

System Configuration
--------------------
One-node and two-node systems

Branch/Pull Time/Commit
-----------------------
master as of 2018-08-27_20-18-00

Timestamp/Logs
--------------
[2018-08-27 07:22:04,833] 264 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

Ghada Khalil (gkhalil) wrote :

Adding investigation details by Bob Church:

The global_filter setting on controller-1 is incorrect: the only disk path it accepts is /dev/disk/by-path/pci-0000:00:17.0-ata-1.0. It also needs to include /dev/disk/by-path/pci-0000:00:17.0-ata-3.0, because /dev/sdc is the root disk for this UEFI install and contains all physical volumes for the cgts-volume group. Without that path in the global_filter, the disk is hidden from LVM, which produces the emergency condition described above.

controller-1:~# grep global_filter /etc/lvm/lvm.conf
        # Configuration option devices/global_filter.
        # Use global_filter to hide devices from these LVM system components.
        # global_filter are not opened by LVM.
    global_filter = [ "a|/dev/disk/by-path/pci-0000:00:17.0-ata-1.0|", "a|/dev/drbd4|", "r|.*|" ]
        # devices/global_filter.

controller-1:~# ls -Rl /dev/disk/by-path/
/dev/disk/by-path/:
total 0
lrwxrwxrwx 1 root root 9 Aug 27 18:33 pci-0000:00:17.0-ata-1.0 -> ../../sda
lrwxrwxrwx 1 root root 9 Aug 27 18:33 pci-0000:00:17.0-ata-2.0 -> ../../sdb
lrwxrwxrwx 1 root root 10 Aug 27 18:33 pci-0000:00:17.0-ata-2.0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 9 Aug 27 18:33 pci-0000:00:17.0-ata-3.0 -> ../../sdc
lrwxrwxrwx 1 root root 10 Aug 27 18:33 pci-0000:00:17.0-ata-3.0-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Aug 27 18:33 pci-0000:00:17.0-ata-3.0-part2 -> ../../sdc2
lrwxrwxrwx 1 root root 10 Aug 27 18:33 pci-0000:00:17.0-ata-3.0-part3 -> ../../sdc3
lrwxrwxrwx 1 root root 10 Aug 27 18:33 pci-0000:00:17.0-ata-3.0-part4 -> ../../sdc4
lrwxrwxrwx 1 root root 10 Aug 27 18:33 pci-0000:00:17.0-ata-3.0-part5 -> ../../sdc5
lrwxrwxrwx 1 root root 10 Aug 27 18:33 pci-0000:00:17.0-ata-3.0-part6 -> ../../sdc6
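
To illustrate why the root disk is hidden, here is a minimal Python sketch of first-match filter evaluation as described in the lvm.conf comments: patterns are tried in order, the first one that matches decides, and a path that matches nothing is accepted. It is a simplification, not LVM's implementation: it checks a single path name, whereas LVM considers every name a device has, and Python's re module only approximates LVM's regex dialect. The function names, the WITH_ROOT_DISK list, and the choice of part4 as the sample partition are assumptions made for illustration; the bug report does not say which partitions hold the physical volumes.

```python
import re


def parse_filter_entry(entry):
    """Split an LVM-style filter entry such as 'a|/dev/sda|' into (action, regex)."""
    if len(entry) < 3:
        raise ValueError(f"malformed filter entry: {entry!r}")
    action, delim = entry[0], entry[1]
    if action not in ("a", "r") or not entry.endswith(delim):
        raise ValueError(f"malformed filter entry: {entry!r}")
    # The pattern sits between the two delimiter characters and is unanchored.
    return action, re.compile(entry[2:-1])


def is_accepted(path, filter_entries):
    """Return True if the first matching pattern accepts the path.

    A path that matches no pattern at all is accepted.
    """
    for entry in filter_entries:
        action, pattern = parse_filter_entry(entry)
        if pattern.search(path):
            return action == "a"
    return True


# Filter as reported on controller-1.
REPORTED = [
    "a|/dev/disk/by-path/pci-0000:00:17.0-ata-1.0|",
    "a|/dev/drbd4|",
    "r|.*|",
]

# Hypothetical corrected filter: the reported entries plus the root disk.
WITH_ROOT_DISK = ["a|/dev/disk/by-path/pci-0000:00:17.0-ata-3.0|"] + REPORTED

# Sample partition on the root disk (part4 chosen arbitrarily).
ROOT_PART = "/dev/disk/by-path/pci-0000:00:17.0-ata-3.0-part4"

print(is_accepted(ROOT_PART, REPORTED))        # False: only 'r|.*|' matches
print(is_accepted(ROOT_PART, WITH_ROOT_DISK))  # True: the ata-3.0 entry matches first
```

Because the patterns are unanchored, a single entry for the disk path also covers its -partN names, which is why one added entry is enough in this sketch.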

After reinstalling the node and setting the root password, I was able to drop into the emergency shell and debug. After I manually set the global_filter prior to node unlock, the node booted successfully, applied manifests, and became available.

Lab setup files look ok here: http://stash.wrs.com/projects/CGCS/repos/titanium-tools/browse/lab/yow/cgcs-wolfpass-01_02?at=refs%2Fheads%2FTC_DEV_0003

Looking at the generated hiera data in the node I see the following:

[wrsroot@controller-0 hieradata(keystone_admin)]$ sudo grep -A4 lvm_global_filter 192.168.204.[34].yaml
192.168.204.3.yaml:openstack::nova::storage::lvm_global_filter: !!python/unicode '[ "a|/dev/disk/by-path/pci-0000:00:17.0-ata-1.0|",
192.168.204.3.yaml- "a|/dev/drbd4|", "r|.*|" ]'
192.168.204.3.yaml-openstack::nova::storage::lvm_update_filter: !!python/unicode '[ "a|/dev/disk/by-path/pci-0000:00:17.0-ata-1.0|",
192.168.204.3.yaml- "a|/dev/drbd4|", "r|.*|" ]'
192.168.204.3.yaml-openstack::nova::storage::removing_pvs: []
--
192.168.204.4.yaml:openstack::nova::storage::lvm_global_filter: !!python/unicode '[ "a|/dev/disk/by-path/pci-0000:00:17.0-ata-1.0|",
192.168.204.4.yaml- "a|/dev/drbd4|", "r|.*|" ]'
192.168.204.4.yaml-openstack::nova::storage::lvm_update_filter: !!python/unicode '[ "a|/dev/disk/by-path/pci-0000:00:17.0-ata-1.0|",
192.168.204.4.yaml- "a|/dev/drbd4|", "r|.*|" ]'
192.168.204.4.yaml-openstack::nova::storage::removing_pvs: []
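
The same check can be run against the generated hiera value directly. The sketch below is a hypothetical audit helper, not part of stx-config: it assumes the filter string is JSON-compatible (it is in the output above) and that every accept entry names a literal path rather than a wider regex. The function names and the REQUIRED list are illustrative.

```python
import json


def accepted_paths(filter_value):
    """Return the paths named in 'a<delim>...<delim>' entries of a filter string."""
    entries = json.loads(filter_value)
    return [e[2:-1] for e in entries if len(e) >= 3 and e[0] == "a"]


def missing_devices(filter_value, required):
    """List the required device paths that no accept entry names exactly."""
    accepted = accepted_paths(filter_value)
    return [dev for dev in required if dev not in accepted]


# Value of openstack::nova::storage::lvm_global_filter from the grep output,
# with the YAML line continuation joined into one string.
HIERA_VALUE = (
    '[ "a|/dev/disk/by-path/pci-0000:00:17.0-ata-1.0|", '
    '"a|/dev/drbd4|", "r|.*|" ]'
)

# Devices that must stay visible to LVM on this host (the root disk).
REQUIRED = ["/dev/disk/by-path/pci-0000:00:17.0-ata-3.0"]

print(missing_devices(HIERA_VALUE, REQUIRED))
# ['/dev/disk/by-path/pci-0000:00:17.0-ata-3.0']
```

A non-empty result means the generated filter would hide a required disk from LVM once the manifest is applied.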

The filter setting looks to be incorrect as it ...

Changed in starlingx:
importance: Undecided → High
Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
Ghada Khalil (gkhalil)
summary: - All-in-one Simplex/Duplex: host fails after reboot
+ All-in-one: host fails after reboot
tags: added: stx.2018.10 stx.config
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/597698

Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil) wrote :

Manually marking as Fix Released; not sure why this was not done automatically by gerrit when the commit was merged.

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.1.0
removed: stx.2018.10