Redeploys are broken with "Error loading unit file 'cinder-lvm-losetup'"

Bug #1873899 reported by Michele Baldessari
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Jesse Pretorius

Bug Description

I think we broke redeploys when https://review.opendev.org/#/c/720232 landed. Mine fails with:
TASK [cinder enable the LVM losetup service] ***********************************
Monday 20 April 2020 15:16:10 +0000 (0:00:01.704) 0:04:17.437 **********
fatal: [overcloud-controller-0]: FAILED! => changed=false
  msg: 'Error loading unit file ''cinder-lvm-losetup'': org.freedesktop.systemd1.BadUnitSetting "Unit cinder-lvm-losetup.service has a bad unit file setting."'
fatal: [overcloud-controller-1]: FAILED! => changed=false
  msg: 'Error loading unit file ''cinder-lvm-losetup'': org.freedesktop.systemd1.BadUnitSetting "Unit cinder-lvm-losetup.service has a bad unit file setting."'
fatal: [overcloud-controller-2]: FAILED! => changed=false
  msg: 'Error loading unit file ''cinder-lvm-losetup'': org.freedesktop.systemd1.BadUnitSetting "Unit cinder-lvm-losetup.service has a bad unit file setting."'

The cause seems to be that the following task:
- name: cinder identify the LVM loopback device
  command:
    losetup -j /var/lib/cinder/cinder-volumes -n -O NAME

Returns:
[root@overcloud-controller-0 ~]# losetup -j /var/lib/cinder/cinder-volumes -n -O NAME
/dev/loop1
/dev/loop0

We then register that output in the 'cinder_lvm_dev' fact and use it, embedded \n included, in the systemd unit file, which systemd then rejects as invalid:
"""
[root@overcloud-controller-0 ~]# more /etc/systemd/system/cinder-lvm-losetup.service
[Unit]
Description=Cinder LVM losetup
After=syslog.target

[Service]
Type=oneshot
ExecStart=/bin/bash -c '/sbin/losetup /dev/loop1
/dev/loop0 || /sbin/losetup /dev/loop1
/dev/loop0 /var/lib/cinder/cinder-volumes'
ExecStop=/bin/bash -c '/sbin/losetup -d /dev/loop1
/dev/loop0'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
"""

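The line splitting can be reproduced without systemd. The following is a minimal shell sketch that uses the two-line losetup output from this report as sample data. `render_execstart` and `count_lines` are hypothetical helpers that only mimic the substitution into the unit file; they are not the tripleo-heat-templates code.

```shell
#!/bin/sh
# Hypothetical helper: mimics substituting the registered fact into the
# ExecStart= line of the unit file shown in this report.
render_execstart() {
    dev=$1
    printf "ExecStart=/bin/bash -c '/sbin/losetup %s || /sbin/losetup %s /var/lib/cinder/cinder-volumes'\n" "$dev" "$dev"
}

# Hypothetical helper: count the lines of text read from stdin.
count_lines() {
    awk 'END { print NR }'
}

# Sample data copied from the losetup output in this report.
two_devices='/dev/loop1
/dev/loop0'

# A single device keeps the directive on one line; a value with an
# embedded newline spreads it over several lines.
echo "one device:  $(render_execstart /dev/loop1 | count_lines) line(s)"
echo "two devices: $(render_execstart "$two_devices" | count_lines) line(s)"
```

With the two-device value, the rendered directive breaks at the same points as the ExecStart line in the unit file above.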
Alan Bishop (alan-bishop) wrote :

I understand what's happening, and will submit a patch to fix this asap.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Alan Bishop (alan-bishop)
milestone: none → ussuri-rc1
Alan Bishop (alan-bishop) wrote :

A fix for this is already underway, see https://review.opendev.org/721317.

Changed in tripleo:
assignee: Alan Bishop (alan-bishop) → Jesse Pretorius (jesse-pretorius)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/721317
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=580aca40e538b2a57b1a9e67c29c9d8233564a98
Submitter: Zuul
Branch: master

commit 580aca40e538b2a57b1a9e67c29c9d8233564a98
Author: Jesse Pretorius (odyssey4me) <email address hidden>
Date: Mon Apr 20 18:27:02 2020 +0100

    Improve the cinder LVM loopback device setup

    The current implementation has the following issues:

    1. It uses two tasks to get the loopback device information,
       and another to set a fact which is only used in the same
       play, which is unnecessary. This can all be done in a
       single task without the fact setting.

    2. The setup of the loopback device is not idempotent. If any
       of the tasks fail, another device will be set up and the
       tasks will continue to fail because they cannot handle
       multiple return values.

    3. The LVM PV & VG are set up using a shell task. This could
       be done using an Ansible module instead, which also makes
       the task idempotent.

    4. The oneshot systemd unit starts very late in the boot cycle.
       It can be set to start much sooner, ensuring that the
       cinder-volume service starts successfully sooner.

    5. The oneshot systemd unit starts a bash subshell unnecessarily.

    This patch aims to resolve all these issues, improving the
    execution time and idempotency of these tasks.

    Closes-Bug: #1873899
    Change-Id: I9421cf54f498b3f99a7e5afa11425a0b2419b399
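Issue 2 in the commit message comes down to making the lookup-then-create step idempotent. The following is a rough shell sketch of that idea, not the code from the patch. `get_or_create_loop_device` and the `LOSETUP` override are hypothetical names, introduced here so the logic can be exercised with a stub instead of a root shell.

```shell
#!/bin/sh
# Hypothetical knob: point LOSETUP at a stub to exercise the logic
# without root privileges.
LOSETUP=${LOSETUP:-losetup}

# Print the loop device backing FILE, attaching a new one only when
# no device is attached yet.
get_or_create_loop_device() {
    file=$1
    # List devices already attached to the file, one name per line.
    existing=$("$LOSETUP" -j "$file" -n -O NAME)
    if [ -n "$existing" ]; then
        # Reuse what is there rather than attaching another device.
        printf '%s\n' "$existing"
    else
        # --show prints the name of the device that -f picked.
        "$LOSETUP" --show -f "$file"
    fi
}
```

Running the function twice against the same file attaches a device only the first time. If several devices are already attached to the file, this sketch still returns all of them, one per line.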

Changed in tripleo:
status: In Progress → Fix Released
Michele Baldessari (michele) wrote :

Thanks Alan and Jesse!

Michele Baldessari (michele) wrote :

I tested a deployment with https://review.opendev.org/#/c/721317/ applied, but it still fails with:
TASK [Get or create LVM loopback device] ***************************************
Tuesday 21 April 2020 10:21:31 +0000 (0:00:01.028) 0:04:09.488 *********
ok: [overcloud-controller-0]
ok: [overcloud-controller-1]
ok: [overcloud-controller-2]

TASK [Create LVM volume group] *************************************************
Tuesday 21 April 2020 10:21:32 +0000 (0:00:01.361) 0:04:10.849 *********
fatal: [overcloud-controller-1]: FAILED! => changed=false
  msg: Device /dev/loop1 /dev/loop0 not found.
fatal: [overcloud-controller-0]: FAILED! => changed=false
  msg: Device /dev/loop1 /dev/loop0 not found.
fatal: [overcloud-controller-2]: FAILED! => changed=false
  msg: Device /dev/loop1 /dev/loop0 not found.

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
overcloud-controller-0 : ok=72 changed=12 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0
overcloud-controller-1 : ok=69 changed=12 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0
overcloud-controller-2 : ok=69 changed=12 unreachable=0 failed=1 skipped=2 rescued=0 ignored=0
overcloud-novacompute-0 : ok=58 changed=11 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0

The initial deployment did not have https://review.opendev.org/#/c/721317/, as I only applied it before the redeploy (not sure how relevant that is). I am now deploying from scratch with https://review.opendev.org/#/c/721317/ included and will then redeploy. If that fails, I will reopen this bug.

Jesse Pretorius (jesse-pretorius) wrote :

@Michele Yeah, that makes sense because it's not expecting multiple loopback devices for the same file. I can push a patch now to handle that corner-case.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/722329

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/722329
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=d9152a601e1f18c89b70db7fc26608836fc3644e
Submitter: Zuul
Branch: master

commit d9152a601e1f18c89b70db7fc26608836fc3644e
Author: Jesse Pretorius (odyssey4me) <email address hidden>
Date: Thu Apr 23 15:18:48 2020 +0100

    [cinder-lvm] Resolve issue when there are multiple loop devices

    If a deployment fails, there may end up being multiple loopback
    devices pointing to the same image file. This currently breaks
    the systemd unit file because it cannot handle starting two
    devices properly.

    We only need one device, so just in case this situation arises,
    we now ensure that anything more than the first device is ignored
    when running 'Get or create LVM loopback device'.

    Change-Id: I012ee9c60bce4148da8ec6fe077ee20cfe34d555
    Related-bug: #1873899
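Ignoring everything past the first device can be as simple as keeping the first line of the losetup output. The following sketch feeds the sample output from this report to a hypothetical `first_loop_device` helper; the merged patch may do this differently.

```shell
#!/bin/sh
# Hypothetical helper: print only the first non-empty line read from stdin.
first_loop_device() {
    awk 'NF { print; exit }'
}

# Sample data: the two devices reported for the same backing file.
losetup_output='/dev/loop1
/dev/loop0'

# Keep a single device name so later consumers never see a newline.
dev=$(printf '%s\n' "$losetup_output" | first_loop_device)
echo "using $dev"
```

Because the result is always a single line, a value like this can be passed to a volume-group task or written into a unit file without splitting a directive.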

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/723375

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/723375
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=b938f7cd9e639b48c2169ca7d96222f827cd8bf2
Submitter: Zuul
Branch: stable/train

commit b938f7cd9e639b48c2169ca7d96222f827cd8bf2
Author: Jesse Pretorius <email address hidden>
Date: Wed Apr 15 03:03:06 2020 -0400

    [train-squash] Remove hardcoded reference to cinder LVM loopback device

    Backport note:
    In CentOS 7, losetup does not include the -n option. We therefore use
    a slightly different way to get the same output in this backport.

    This is a squash commit of the following patches:

    Rename loopback creates file

    We are hitting bug [1]. It looks like the Ansible task that creates
    the loop device did not trigger because the /dev/loop2 file already
    exists. Create the file under a different name, "/dev/loop2cinder",
    to work around the issue [1].

    [1] https://bugs.launchpad.net/tripleo/+bug/1872881

    Change-Id: I4e18bee7864b71afca387fbea5857c9530faa2fc
    Partial-Bug: #1872881
    (cherry picked from commit f85caaf411c4b8d9db39350862d6d52f734f559b)

    Remove hardcoded reference to cinder LVM loopback device

    Replace hardcoded reference to "/dev/loop2" by allowing losetup to
    dynamically choose the next available loopback device number.

    Eliminate hiera data that set cinder_lvm_loop_device_size. This is not
    needed because the associated file is created by a host prep task, and
    the puppet code will be removed by https://review.opendev.org/720139.

    Closes-Bug: #1872881
    Change-Id: Ia0302d1d55dcb8333d7db9713822075d32cd852a
    (cherry picked from commit 502947b4beed6bc7f3f7f58706b1934766561c0e)

    Improve the cinder LVM loopback device setup

    The current implementation has the following issues:

    1. It uses two tasks to get the loopback device information,
       and another to set a fact which is only used in the same
       play, which is unnecessary. This can all be done in a
       single task without the fact setting.

    2. The setup of the loopback device is not idempotent. If any
       of the tasks fail, another device will be set up and the
       tasks will continue to fail because they cannot handle
       multiple return values.

    3. The LVM PV & VG are set up using a shell task. This could
       be done using an Ansible module instead, which also makes
       the task idempotent.

    4. The oneshot systemd unit starts very late in the boot cycle.
       It can be set to start much sooner, ensuring that the
       cinder-volume service starts successfully sooner.

    5. The oneshot systemd unit starts a bash subshell unnecessarily.

    This patch aims to resolve all these issues, improving the
    execution time and idempotency of these tasks.

    Closes-Bug: #1873899
    Change-Id: I9421cf54f498b3f99a7e5afa11425a0b2419b399
    (cherry picked from commit 580aca40e538b2a57b1a9e67c29c9d8233564a98)

    [cinder-lvm] Resolve issue when there are multiple loop devices

    If a deployment fails, there may end up being multiple loopback
    devices pointing to the sam...


tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.
