blkid & lsblk commands can fail in the kickstarts

Bug #1888938 reported by Frank Miller
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Ovidiu Poncea
Milestone: (none listed)

Bug Description

Brief Description
-----------------
The kickstart code checks for the existence of the platform_backup partition, and if any of its checks does not match what it is expecting, the partition is created. There is no error checking on the various blkid/lsblk/parted commands used. On some occasions these commands fail to read data, likely due to a udev bug. This leads to the wrong steps being taken in the install (for example, the existing partition gets wiped and recreated, causing all previous data to be lost).

The kickstart code needs to handle the possibility of commands failing or not reading proper data.
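
A minimal sketch of the kind of checking described above, not the actual kickstart code (the kickstarts are shell scripts; Python is used here only for illustration). It assumes that a failed or empty read should be retried a few times and that the install should abort rather than continue on missing data. The helper name read_with_retry and its parameters are hypothetical.

```python
import subprocess
import time


def read_with_retry(cmd, attempts=3, delay=1.0):
    """Run cmd and return its stripped stdout.

    A non-zero exit status, an empty result, or a failure to start the
    command all count as a failed read and are retried. If every attempt
    fails, RuntimeError is raised so the caller can abort instead of
    acting on missing data.
    """
    last_error = "no attempts made"
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except OSError as exc:
            # The command could not be started at all.
            last_error = str(exc)
        else:
            output = result.stdout.strip()
            if result.returncode == 0 and output:
                return output
            last_error = f"rc={result.returncode}, stdout={output!r}"
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts: {last_error}")
```

A caller could wrap a probe such as `lsblk -no PARTLABEL <device>` in this helper and treat the RuntimeError as a reason to abort the install rather than recreate the partition. Which probes the kickstarts actually run is not shown in this report, and a caller would still need to tell "partition absent" apart from "read failed".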

Severity
--------
Major

Steps to Reproduce
------------------
At this point it is not clear what disk will trigger the issue.

Expected Behavior
------------------
Installation should be able to properly detect whether the platform_backup partition exists. If the partition does not exist, the installation should create it properly.

Actual Behavior
----------------
The kickstarts sometimes re-create the platform_backup partition when they shouldn't. The kickstarts sometimes create the platform_backup partition with incorrect attributes (size, type, or GUID).

Reproducibility
---------------
Seen once. Difficult to reproduce.

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
stx4.0 load

Last Pass
---------
n/a

Timestamp/Logs
--------------
n/a

Test Activity
-------------
testing

Workaround
----------
none

Revision history for this message
Frank Miller (sensfan22) wrote :

Marking stx.5.0 gating - the issue is hard to reproduce and rarely seen, but it will lead to an AIO-SX configuration being set up incorrectly.

Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
status: New → Triaged
importance: Undecided → Medium
tags: added: stx.5.0 stx.config stx.storage
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/743246

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/743246
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=9b5148e3b57ea48540b795be95abb035fa9a983d
Submitter: Zuul
Branch: master

commit 9b5148e3b57ea48540b795be95abb035fa9a983d
Author: Ovidiu Poncea <email address hidden>
Date: Mon Jul 27 15:04:38 2020 +0300

    Harden kickstarts as udev behavior can lead to random failures

    Whenever a dev node that is not in use is opened with open(O_RDWR)
    udev triggers a flush in devtmpfs that briefly remove & recreate all
    the nodes for partitions on that device. This leads to commands
    accessing dev nodes during the flush to fail. In our case blkid and
    lsblk failed.

    These failures are hard to reproduce, have devastating effect on
    the partitioning operations and are not solved by using 'udevadm settle'
    as some of the kernel events are asynchronous.

    So, mainly, this commit stops udev from messing up with /dev nodes by
    initializing file descriptors for all storage devices then opening
    locks on them with flock. Setting locks stops udev triggering kernel
    partition rescan.

    Locks are set at the start of the partitioning operation and
    released at the end.

    For more details and similar cases see:
     o https://github.com/systemd/systemd/commit/02ba8fb3357daf57f6120ac512fb464a4c623419
     o http://tracker.ceph.com/issues/14080
     o http://tracker.ceph.com/issues/15176

    This commit:
     o stops udev messing up with /dev nodes;
     o aborts install on critical failures;
     o adds retry for critical operations such as LVM cleanup or
       partition removal and creation.

    Closes-Bug: 1888938
    Change-Id: Iaaaaaae973ee36f2c4bfd42c327e8c6278d59303
    Signed-off-by: Ovidiu Poncea <email address hidden>
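
The locking approach described in the commit message can be sketched as follows. This is an illustration in Python under stated assumptions, not the code from the commit (which changes shell kickstarts): the helper name hold_device_locks is hypothetical, and the sketch relies on the commit message's statement that holding a flock on a device stops udev from triggering a partition rescan.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def hold_device_locks(paths):
    """Hold an exclusive flock on every path while the with-block runs.

    Descriptors are opened read-only and locked at the start; closing
    them at the end releases the locks.
    """
    fds = []
    try:
        for path in paths:
            fd = os.open(path, os.O_RDONLY)
            fds.append(fd)
            fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        yield fds
    finally:
        for fd in fds:
            os.close(fd)  # closing the descriptor drops its flock
```

Partitioning steps would then run inside `with hold_device_locks([...]):`, with the list holding the storage device nodes to protect (the exact device list used by the kickstarts is not given in this report).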

Changed in starlingx:
status: In Progress → Fix Released