[2.0-b6] Deploying a trusty (but not xenial) node frequently fails during storage setup of curtin

Bug #1588875 reported by Dimiter Naydenov
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
MAAS      Invalid        Undecided    Unassigned
curtin    Fix Released   Undecided    Unassigned

Bug Description

Steps to reproduce:
1. Edit a node storage setup to unmount and unformat all existing VGs, partitions, etc.
2. Create a single VG (vg0) on the only available device ('sda' in my case)
3. Create a couple of LVs (vg-root - 60GB - or half the available space, ext4, mounted at /; vg-ceph - 60GB - the other half, ext4, mounted at /srv/ceph-osd)
4. Make sure MAAS has the latest trusty images
5. Deploy the node with 'trusty' (expect success; no issues with xenial on every attempt)
6. Then release the node and try to deploy it again with trusty.

With the previous 2.0.0-rc1+bzr5059, the node at that point transitioned to 'Failed deployment', and the installation log in the UI showed this: http://paste.ubuntu.com/16945408/

With 2.0.0-beta6+bzr5060 it's actually worse: the deployment still fails but *does not* transition to 'Failed deployment', and the node stays stuck in 'Deploying'. The installation did fail, as is apparent from the last 2 lines of the node event log:
-8<-------
Node installation - 'curtin' failed: configuring disk: sda
Node installation - 'curtin' failed: configuring storage
-8<-------

It looks like the issue is with curtin trying and failing to reformat existing partitions, LVs / VGs ?

As suggested, I'm adding the output of 'maas <profile> machine get-curtin-config <system-id>':
(just to demonstrate I did 2 deployments with xenial first on the same node)
http://paste.ubuntu.com/16949299/ (first deployment - xenial; success)
http://paste.ubuntu.com/16949127/ (second deployment - xenial; success)
http://paste.ubuntu.com/16949338/ (third deployment - trusty - silently failed, stuck in Deploying)

Contents of /var/log/maas/* is attached.

# dpkg -l '*maas'* | cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-====================================-============-=================================================
ii maas 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS client and command-line interface
rc maas-cluster-controller 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS server common files
ii maas-dhcp 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS DHCP server
ii maas-dns 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS DNS server
ii maas-proxy 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS Caching Proxy
ii maas-rack-controller 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all Rack Controller for MAAS
ii maas-region-api 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all Region controller API service for MAAS
ii maas-region-controller 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all Region Controller for MAAS
rc maas-region-controller-min 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS Server minimum region controller
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS server provisioning libraries (Python 3)

Revision history for this message
Ryan Harper (raharper) wrote :

Can you include:

1. pvs, vgs, lvs output from the node prior to being deployed via maas
2. Can you attach the output of 'maas <session> node get-curtin-config <system-id>'? I didn't see it in the /var/log/maas output; I'm mostly interested in the curtin input and the log from the node.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

curtin versions on MAAS:

maashw@maas-hw:~$ dpkg -l '*curtin*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=================================================================-=====================================-=====================================-========================================================================================================================================
un curtin <none> <none> (no description available)
ii curtin-common 0.1.0~bzr387-0ubuntu1~16.04.1 all Library and tools for curtin installer
ii python3-curtin 0.1.0~bzr387-0ubuntu1~16.04.1 all Library and tools for curtin installer

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I managed to SSH into the node and capture the following logs:

/var/log/cloud-init-output.log, from a successful trusty deployment:
http://paste.ubuntu.com/16955049/

/var/log/cloud-init-output.log, from a failed trusty deployment:
http://paste.ubuntu.com/16955458/

/tmp/curtin.aptupdate: http://paste.ubuntu.com/16955494/; /tmp/install.log: http://paste.ubuntu.com/16955546/ (all from the same failed deployment)
(I couldn't find /var/log/curtin* as asked, but suspect the one in tmp is the one).

Revision history for this message
gimmic (gimmic) wrote :

Just want to say I am seeing the same issue, but it is not isolated to LVM based installs. Even flat / ext4 installs are exhibiting the same symptoms.

If I shut down the machine and allow it to netboot again, it seems to deploy successfully.

I suspect that for some reason the partitioning is exiting with an error code (but otherwise deploying successfully).

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1588875] Re: [2.0-b6] Deploying a trusty (but not xenial) node frequently fails during storage setup of curtin

Thanks,

Looks like we do need to simulate the LVM removal as that's triggering some
issues with looking for holders; and it's not surprising that we see the
issue in Trusty vs. Xenial as lvm/sysfs layers are likely different.


Revision history for this message
Ryan Harper (raharper) wrote :

@gimmic,

Actually, it's successfully wiping the original storage config, but expecting certain parts of the constructed devices to have a sysfs dir (holder, which points to overlay storage devices). It exits on the exception, but the commands (like pvremove) have already run, which cleared the disk. On subsequent boots, the disks look clean from an LVM perspective and we skip the lvm based wiping. Similar failures can happen for non-lvm cases, which is what I suspect you see.

In a separate bug, could you attach your curtin config which describes the storage layout that is failing to be removed when you set wipe: superblock? I'd like to capture that wipe failure in addition to this LVM one.


Changed in curtin:
status: New → Confirmed
Changed in maas:
status: New → Invalid
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Still present on curtin/xenial,xenial,xenial,xenial 0.1.0~bzr389-0ubuntu1~16.04.1 all

Revision history for this message
Wesley Wiedenmeier (wesley-wiedenmeier) wrote :

The error occurred because the lvm configuration that curtin was trying to shut down from the last installation had multiple logical volumes with data on the physical volume on /dev/sda1. Curtin was shutting down the lvm configuration as expected, but crashed because it shut down the entire volume group when handling the first logical volume, and so was unable to find data in /sys/block/dm-1/, which mapped to the second logical volume.

Curtin did have logic in place to handle this case, but it expected an IOError to be thrown. On systems where curtin runs under python3 everything worked correctly, which is why you saw no issues with xenial: python3 raises a FileNotFoundError, which is caught as an IOError. On python2, however, an OSError is raised instead, and this error was not trapped.
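
To illustrate the python2/python3 difference described above, here is a minimal sketch. It is not the actual curtin patch: list_holders is a hypothetical helper, and the sysfs-like layout is recreated in a temporary directory so the snippet runs anywhere. It assumes the lookup that failed was a directory listing under /sys/block/<dev>/holders and shows one way to trap a vanished directory on both interpreters.

```python
import errno
import os
import tempfile


def list_holders(sysfs_block_dir):
    """Return the names under <sysfs_block_dir>/holders.

    Hypothetical helper: returns [] when the directory has already
    disappeared (e.g. the device was torn down earlier).
    """
    holders_dir = os.path.join(sysfs_block_dir, "holders")
    try:
        return sorted(os.listdir(holders_dir))
    except (IOError, OSError) as e:
        # python2: os.listdir raises OSError, which "except IOError"
        # alone does not catch.
        # python3: IOError is an alias of OSError, and the
        # FileNotFoundError raised here is a subclass of it.
        if e.errno == errno.ENOENT:
            return []
        raise


# Stand-in for /sys/block: dm-0 exists, dm-1 is already gone.
fake_sys_block = tempfile.mkdtemp()
os.makedirs(os.path.join(fake_sys_block, "dm-0", "holders"))

print(list_holders(os.path.join(fake_sys_block, "dm-0")))
print(list_holders(os.path.join(fake_sys_block, "dm-1")))
```

Catching both exception types and checking errno keeps unrelated failures (for example permission errors) visible instead of silently swallowing them.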

There is a simple fix available in https://code.launchpad.net/~wesley-wiedenmeier/curtin/1588875 pending a future refactor of clear_holders.

A build of curtin with this fix included is available in my ppa:
https://launchpad.net/~wesley-wiedenmeier/+archive/ubuntu/test/+build/10030838

The ppa can be added with
apt-add-repository ppa:wesley-wiedenmeier/test

If you get a chance, please test this out and verify that it works in your environment, thanks.

Ryan Harper (raharper)
tags: added: curtin-clear-holders curtin-sru
Changed in curtin:
status: Confirmed → Fix Committed
Revision history for this message
Scott Moser (smoser) wrote : Fixed in Curtin 17.1

This bug is believed to be fixed in curtin 17.1. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released