curtin

[2.0-b6] Deploying a trusty (but not xenial) node frequently fails during storage setup of curtin

Bug #1588875 reported by Dimiter Naydenov on 2016-06-03

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	MAAS	Invalid	Undecided	Unassigned
	curtin	Fix Released	Undecided	Unassigned

Bug Description

Steps to reproduce:
1. Edit a node storage setup to unmount and unformat all existing VGs, partitions, etc.
2. Create a single VG (vg0) on the only available device ('sda' in my case)
3. Create a couple of LVs (vg-root - 60GB - or half the available space, ext4, mounted at /; vg-ceph - 60GB - the other half, ext4, mounted at /srv/ceph-osd)
4. Make sure MAAS has the latest trusty images
5. Deploy the node with 'trusty' (expect success; no issues with xenial on every attempt)
6. Then release the node and try to deploy it again with trusty.

With the previous 2.0.0-rc1+bzr5059, the node transitioned to 'Failed deployment' at this point, and the installation log in the UI shows this: http://paste.ubuntu.com/16945408/

With 2.0.0-beta6+bzr5060 it's actually worse: the deployment still fails but *does not* transition to 'Failed deployment', and the node stays stuck in 'Deploying'. The installation did fail, as is apparent from the last 2 lines of the node event log:
-8<-------
Node installation - 'curtin' failed: configuring disk: sda
Node installation - 'curtin' failed: configuring storage
-8<-------

It looks like the issue is with curtin trying and failing to reformat existing partitions, LVs and VGs?

As suggested, I'm adding the output of 'maas <profile> machine get-curtin-config <system-id>':
(just to demonstrate I did 2 deployments with xenial first on the same node)
http://paste.ubuntu.com/16949299/ (first deployment - xenial; success)
http://paste.ubuntu.com/16949127/ (second deployment - xenial; success)
http://paste.ubuntu.com/16949338/ (third deployment - trusty - silently failed, stuck in Deploying)

The contents of /var/log/maas/* are attached.

# dpkg -l '*maas'* | cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-====================================-============-=================================================
ii maas 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS client and command-line interface
rc maas-cluster-controller 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS server common files
ii maas-dhcp 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS DHCP server
ii maas-dns 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS DNS server
ii maas-proxy 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS Caching Proxy
ii maas-rack-controller 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all Rack Controller for MAAS
ii maas-region-api 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all Region controller API service for MAAS
ii maas-region-controller 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all Region Controller for MAAS
rc maas-region-controller-min 1.9.0+bzr4533-0ubuntu1~trusty1 all MAAS Server minimum region controller
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.0.0~beta6+bzr5060-0ubuntu1~xenial1 all MAAS server provisioning libraries (Python 3)

Related branches

lp:~wesley-wiedenmeier/curtin/fix-extended-clear-holders

Merged into lp:~curtin-dev/curtin/trunk at revision 420

Server Team CI bot: Approve (continuous-integration) on 2016-08-21

Wesley Wiedenmeier (community): Needs Resubmitting on 2016-06-15

curtin developers: Pending requested 2016-05-14

lp:~wesley-wiedenmeier/curtin/1588875

lp:~curtin-dev/curtin/trunk

Dimiter Naydenov (dimitern) wrote on 2016-06-03:

var-log-maas.tar.bz2 (3.8 MiB, application/x-tar)

Ryan Harper (raharper) wrote on 2016-06-03:

Can you include:

1. pvs, vgs, lvs output from the node prior to being deployed via maas
2. The output of maas <session> node get-curtin-config <system-id>. Can you attach it? I didn't see it in the var/log/maas output; I'm mostly interested in the curtin input and the log from the node.

Dimiter Naydenov (dimitern) wrote on 2016-06-03:

curtin versions on MAAS:

maashw@maas-hw:~$ dpkg -l '*curtin*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=================================================================-=====================================-=====================================-========================================================================================================================================
un curtin <none> <none> (no description available)
ii curtin-common 0.1.0~bzr387-0ubuntu1~16.04.1 all Library and tools for curtin installer
ii python3-curtin 0.1.0~bzr387-0ubuntu1~16.04.1 all Library and tools for curtin installer

Dimiter Naydenov (dimitern) wrote on 2016-06-03:

I managed to SSH into the node and capture the following logs:

/var/log/cloud-init-output.log, from a successful trusty deployment:
http://paste.ubuntu.com/16955049/

/var/log/cloud-init-output.log, from a failed trusty deployment:
http://paste.ubuntu.com/16955458/

/tmp/curtin.aptupdate: http://paste.ubuntu.com/16955494/; /tmp/install.log: http://paste.ubuntu.com/16955546/ (all from the same failed deployment)
(I couldn't find /var/log/curtin* as asked, but suspect the one in tmp is the one).

gimmic (gimmic) wrote on 2016-06-03:

Just want to say I am seeing the same issue, but it is not isolated to LVM based installs. Even flat / ext4 installs are exhibiting the same symptoms.

If I shut down the machine and allow it to netboot again, it seems to deploy successfully.

I suspect that for some reason the partitioning step is exiting with an error code (while otherwise deploying successfully).

Ryan Harper (raharper) wrote on 2016-06-03: Re: [Bug 1588875] Re: [2.0-b6] Deploying a trusty (but not xenial) node frequently fails during storage setup of curtin

Thanks,

Looks like we do need to simulate the LVM removal as that's triggering some
issues with looking for holders; and it's not surprising that we see the
issue in Trusty vs. Xenial as lvm/sysfs layers are likely different.

Ryan Harper (raharper) wrote on 2016-06-03:

@gimmic,

Actually, it's successfully wiping the original storage config, but expecting certain parts of the constructed devices to have a sysfs dir (holder, which points to overlay storage devices). It exits on the exception, but the commands (like pvremove) have already run, which cleared the disk. On subsequent boots, the disks look clean from an LVM perspective and we skip the LVM-based wiping. Similar failures can happen for non-LVM cases, which is what I suspect you see.

In a separate bug, could you attach your curtin config which describes the storage layout that is failing to be removed when you set wipe: superblock? I'd like to capture that wipe failure in addition to this LVM one.

Mike Pontillo (mpontillo) on 2016-06-03

Changed in curtin:
status:	New → Confirmed
Changed in maas:
status:	New → Invalid

Dimiter Naydenov (dimitern) wrote on 2016-06-07:

Still present on curtin/xenial,xenial,xenial,xenial 0.1.0~bzr389-0ubuntu1~16.04.1 all

Wesley Wiedenmeier (wesley-wiedenmeier) wrote on 2016-06-17:

The error occurred because the LVM configuration that curtin was trying to shut down from the previous installation had multiple logical volumes with data on the physical volume on /dev/sda1. Curtin was shutting down the LVM configuration as expected, but it crashed because it shut down the entire volume group while handling the first logical volume, and so was unable to find data in /sys/block/dm-1/, which mapped to the second logical volume.

Curtin did have logic in place to handle this case, but it expected an IOError to be raised. On systems where curtin runs under python3 everything worked correctly, which is why you saw no issues with xenial: python3 raises FileNotFoundError, which is caught as an IOError (in python3, IOError is an alias of OSError, and FileNotFoundError is a subclass of it). On python2, however, an OSError is raised instead, and this error was not trapped.
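
One way to make such a guard behave the same on both interpreters is to catch both exception classes. The snippet below is a minimal sketch of that idea and is not the change that was merged into curtin; list_holders is a hypothetical helper and the missing path merely stands in for a removed /sys/block/dm-1.

```python
import os
import sys
import tempfile


def list_holders(sys_block_path):
    """Return the entries under <sys_block_path>/holders, or [] if it is gone."""
    try:
        return sorted(os.listdir(os.path.join(sys_block_path, "holders")))
    except (IOError, OSError):
        # Python 2: os.listdir raises OSError, a class unrelated to IOError.
        # Python 3: IOError is an alias of OSError, and FileNotFoundError
        # is a subclass, so either name alone would be enough there.
        return []


# A path that does not exist, standing in for a removed device entry.
base = tempfile.mkdtemp()
gone = os.path.join(base, "dm-1")
holders = list_holders(gone)
os.rmdir(base)

# True on Python 3, False on Python 2.
same_class = IOError is OSError
print(holders, same_class, sys.version_info[0])
```

Catching the tuple keeps the behaviour identical regardless of which interpreter the installer runs under, at the cost of also swallowing other OS-level errors such as permission problems. A stricter variant could inspect errno and re-raise anything other than ENOENT.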

There is a simple fix available in https://code.launchpad.net/~wesley-wiedenmeier/curtin/1588875 pending a future refactor of clear_holders.

A build of curtin with this fix included is available in my ppa:
https://launchpad.net/~wesley-wiedenmeier/+archive/ubuntu/test/+build/10030838

The ppa can be added with
apt-add-repository ppa:wesley-wiedenmeier/test

If you get a chance, please test this out and verify that it works in your environment, thanks.

Ryan Harper (raharper) on 2016-06-22

tags:

added: curtin-clear-holders curtin-sru

Wesley Wiedenmeier (wesley-wiedenmeier) on 2016-08-29

Changed in curtin:
status:	Confirmed → Fix Committed

Scott Moser (smoser) wrote on 2017-12-15: Fixed in Curtin 17.1

This bug is believed to be fixed in curtin 17.1. If this is still a problem for you, please add a comment and set the state back to New.

Thank you.

Changed in curtin:
status:	Fix Committed → Fix Released
