DIB - cleanup upon failure is incomplete for overcloud-hardened-uefi-full

Bug #1959093 reported by Cédric Jeanneret
This bug affects 1 person
Affects: tripleo
Status: Triaged
Importance: Medium
Assigned to: Steve Baker
Milestone: (none)

Bug Description

Hello,

It seems the cleanup upon failure for the hardened-uefi-full OC image isn't working properly:
after the cleanup, I still see a dangling "vg" volume group and its associated physical volume (usually pointing to some loop device).

This dangling "vg" prevents any new run on the same host until we manually clean things up, since the name is fixed.

By the way, I think we shouldn't hardcode "vg", as it's a generic enough name to already exist on the node that is building the image. We should probably derive the name from the machine-id (take the first N characters) so that we get something more unique.
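
A rough sketch of that idea, assuming we reuse the host's /etc/machine-id and take its first 8 characters as a suffix (the "vg_" prefix and the length are illustrative, not current DIB behaviour):

# Hypothetical: derive a less collision-prone VG name from the machine-id
machine_id=$(cat /etc/machine-id)
vg_name="vg_${machine_id:0:8}"
echo "would use volume group name: ${vg_name}"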

The "incriminated" files are apparently:
diskimage_builder/block_device/level1/lvm.py
diskimage_builder/block_device/blockdevice.py

Apparently, that dangling volume group also leads to a second issue when the image build first fails: something is still mounted (in a sub-directory, I'd say) in the temporary workspace/dib_image.XXXXX, preventing that mountpoint from being cleaned. That in turn causes yet another issue, where the target is busy and we can't end up with an actually clean state.

So, basically, the issues are:
- generic name for the volume group: we should be smarter about it
- lack of actual cleanup, leading to multiple issues, among them a dangling "vg" still present on the system with its associated physical volume
- this lack of cleanup probably also explains why I end up with many, many "loop" devices in /dev/mapper

Cheers,

C.

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

As a side note, we seem to be missing a call to "dmsetup remove /dev/mapper/loopXpY" in order to get a clean state at the end.
If we don't do that, we keep dangling "partitions" in /dev/mapper, and /dev/loopX stays bound to /image0.raw - this may explain some of the issues. Note that this apparently happens only upon a build failure.
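
For reference, a manual cleanup along those lines could look like the following (the device names are only examples matching the listings further down in this report):

# Hypothetical manual cleanup after a failed build
for part in /dev/mapper/loop1p1 /dev/mapper/loop1p2 /dev/mapper/loop1p3; do
    dmsetup remove "$part"
done
losetup -d /dev/loop1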

Revision history for this message
Steve Baker (steve-stevebaker) wrote (last edit):

A unique volume group name would be nice, but would be an involved feature request at this point, since the growvols script assumes 'vg' by default.

I will look into removing the device mapper entries on dib cleanup though.

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

Also, using the machine-id in the group name would provide a unique namespace but not a discoverable one, because the machine-id changes at the beginning of the image build, and again on first boot.

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

OK, I'm going to need some exact steps to reproduce this. I've tried setting a breakpoint like "export break=after-finalise" and then entering "exit 1" when it pauses, but whatever before-/after-<phase> I try, the cleanup is complete: there are no residual loopback, LVM, or device mapper entries.
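
For clarity, the kind of reproduction attempt I mean looks roughly like this (the element list is only an example, not the full TripleO build):

# Pause the build at a phase breakpoint, then fail it by hand to check cleanup
export break=after-finalise    # any before-/after-<phase> value can be tried
disk-image-create -o test-image vm centos
# at the breakpoint shell, type:  exit 1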

Revision history for this message
Cédric Jeanneret (cjeanner) wrote (last edit):

Hello Steve,

So, it's still my containerized thing[1], trying to get the overcloud-hardened-uefi-full built.

The initial failure is weird, and described in this other LP:
https://bugs.launchpad.net/tripleo/+bug/1959568

You'll find the whole build log attached there - for the initial build, on a clean env.

Once the initial failure is hit, a dangling VG, PV and loop devices are left behind - you'll be able to see that in the attached build log.

Also, here are some of the LVM-related things that are still present on the node after the failure:

[root@gw-rh build-oc-images]# vgdisplay
  --- Volume group ---
  VG Name vg
  System ID
  Format lvm2
  Metadata Areas 1
  Metadata Sequence No 3
  VG Access read/write
  VG Status resizable
  MAX LV 0
  Cur LV 0
  Open LV 0
  Max PV 0
  Cur PV 1
  Act PV 1
  VG Size <5.59 GiB
  PE Size 4.00 MiB
  Total PE 1430
  Alloc PE / Size 0 / 0
  Free PE / Size 1430 / <5.59 GiB
  VG UUID TlSTf1-NvLo-ZF5N-czwc-j7Ce-gq81-ql8Gkq

  --- Volume group ---
  VG Name fedora_fedora
[...]

[root@gw-rh build-oc-images]# pvdisplay
  --- Physical volume ---
  PV Name /dev/mapper/loop1p3
  VG Name vg
  PV Size <5.59 GiB / not usable 2.00 MiB
  Allocatable yes
  PE Size 4.00 MiB
  Total PE 1430
  Free PE 1430
  Allocated PE 0
  PV UUID 321Zp0-lylz-K7l4-c41Z-UuHD-VI2y-ihIC8C

  --- Physical volume ---
  PV Name /dev/nvme0n1p3
  VG Name fedora_fedora
[...]

  "/dev/mapper/loop2p3" is a new physical volume of "<5.59 GiB"
  --- NEW Physical volume ---
  PV Name /dev/mapper/loop2p3
  VG Name
  PV Size <5.59 GiB
  Allocatable NO
  PE Size 0
  Total PE 0
  Free PE 0
  Allocated PE 0
  PV UUID szN2Rx-4rVM-rWHG-wrAd-K2hp-SZLt-e6U66f

As well as loop devices:
[root@gw-rh build-oc-images]# losetup --list
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 1 0 /image0.raw (deleted) 0 512
/dev/loop2 0 0 1 0 /image0.raw (deleted) 0 512
/dev/loop0 0 0 0 0 /tmp/tmp.ZgSDIyoRGD/ZwGPFf.raw (deleted) 0 512

And the device mapper content:
[root@gw-rh build-oc-images]# dmsetup ls
fedora_fedora-containers (253:2)
fedora_fedora-home (253:1)
fedora_fedora-root (253:0)
loop0p1 (253:3)
loop1p1 (253:4)
loop1p2 (253:5)
loop1p3 (253:6)
loop2p1 (253:7)
loop2p2 (253:8)
loop2p3 (253:9)
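
For the record, manually clearing this leftover state looks roughly like the following (names are taken from the listings above and may differ on another host):

# Hypothetical manual cleanup of the dangling VG, PVs, dm entries and loop devices
vgremove -f vg
pvremove /dev/mapper/loop1p3 /dev/mapper/loop2p3
for part in loop0p1 loop1p1 loop1p2 loop1p3 loop2p1 loop2p2 loop2p3; do
    dmsetup remove "/dev/mapper/$part"
done
losetup -d /dev/loop0 /dev/loop1 /dev/loop2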

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

This is almost certainly caused by building inside a container; I would advise against that until there is an upstream CI job which builds an LVM-based image inside a container.

I have all the image build projects installed into a Python venv called 'dib' and it works very well. Here is my script for building overcloud-hardened-uefi-full:

pushd /httpboot
source /home/steveb/dev/localstack/envs/dib/bin/activate

export DIB_RELEASE=9-stream
export DIB_DEBUG_TRACE=2

# from https://cloud.centos.org/centos/9-stream/x86_64/images/
export DIB_LOCAL_IMAGE=./CentOS-Stream-GenericCloud-9-20220120.0.x86_64.qcow2
export break=before-pre-finalise
# export break=before-install
branch=master
repo=current
repodir="/httpboot/repos-$branch-$repo"
mkdir $repodir || true
tripleo-repos --output-path $repodir -d centos9 -b $branch $repo ceph deps

export DIB_YUM_REPO_CONF="$repodir/*"

export ELEMENTS_PATH="/home/steveb/dev/localstack/envs/dib/share/tripleo-puppet-elements:/home/steveb/dev/localstack/envs/dib/share/tripleo-image-elements:/home/steveb/dev/localstack/envs/dib/share/ironic-python-agent-builder/dib"

rm -rf overcloud-hardened-uefi-full.*

openstack overcloud image build --verbose --debug --no-package-install \
    --image-name overcloud-hardened-uefi-full \
    --config-file /home/steveb/dev/localstack/envs/dib/share/tripleo-common/image-yaml/overcloud-hardened-images-uefi-python3.yaml \
    --config-file /home/steveb/dev/localstack/envs/dib/share/tripleo-common/image-yaml/overcloud-hardened-images-uefi-centos8.yaml
