curtin fails to create MD devices

Bug #1659509 reported by Paolo de Rosa
This bug affects 1 person
Affects            Status         Importance   Assigned to   Milestone
curtin             Fix Released   High         Unassigned
curtin (Ubuntu)    Fix Released   Medium       Unassigned
  Xenial           Fix Released   Medium       Unassigned
  Yakkety          Fix Released   Medium       Unassigned

Bug Description

==== Begin SRU Template ========
[Impact]
On some machines with existing MDADM RAID metadata on one or
more disks, curtin can fail to remove this existing metadata when
instructed to do so, and then fails to install on such machines.

Improvements in curtin's ability to release holders on a disk
and clear it mean that installation succeeds.

[Test Case]
Run curtin vmtest on tests/unittests/test_commands_block_meta.py
  ./tools/jenkins-runner tests/unittests/test_commands_block_meta.py

[Regression Potential]
Installation failures would be the most likely regression, and
then most likely with multipath, raid or both.

[Other Info]
==== End SRU Template ========

 * On some machines which have existing MDADM RAID metadata on one or
   more disks, curtin fails to remove this existing metadata when
   instructed to do so and fails to install on such machines.

On xenial we used curtin 0.1.0~bzr425-0ubuntu1~16.04.1,
and for the trusty deployment curtin 0.1.0~bzr399-0ubuntu1~14.04.1.

Please see the attached full installation log.

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :
Ryan Harper (raharper) wrote :

Can you attach the curtin config as well?

maas <session> machine get-curtin-config <system-id>

Changed in curtin:
importance: Undecided → High
status: New → Confirmed
Ryan Harper (raharper)
Changed in curtin:
status: Confirmed → Incomplete
Ryan Harper (raharper) wrote :

"MODEL": "LOGICAL VOLUME "

What sort of storage devices are attached?

Paolo de Rosa (paolo-de-rosa) wrote :
Paolo de Rosa (paolo-de-rosa) wrote :

There is a RAID controller (H240ar). Due to compatibility issues (disks in HBA mode do not work correctly), a logical volume of only one disk has been created in RAID 0.

Smart HBA H240ar in Slot 0 (Embedded) (RAID Mode) (sn: PDNLN0BRH141D6)
   Internal Drive Cage at Port 2I, Box 0, OK
   array A (Solid State SAS, Unused Space: 0 MB)
      logicaldrive 1 (186.3 GB, RAID 0, OK)
      physicaldrive 1I:3:1 (port 1I:box 3:bay 1, Solid State SAS, 200 GB, OK)
   array B (Solid State SAS, Unused Space: 0 MB)
      logicaldrive 2 (186.3 GB, RAID 0, OK)
      physicaldrive 1I:3:2 (port 1I:box 3:bay 2, Solid State SAS, 200 GB, OK)

This bug seems to affect only a few nodes, and it does not always happen.
Wiping the disks before the deployment obviously solves the issue.

Ryan Harper (raharper) wrote : Re: [Bug 1659509] Re: curtin fails to create MD devices

OK,

I'm working on a curtin with additional debugging on disk wiping and mdadm info which should help us understand why something is taking a hold of the block device underneath curtin. I'll push that updated curtin package to PPA and update this bug.

On Thu, Jan 26, 2017 at 7:25 PM, Paolo de Rosa <email address hidden> wrote:

> There is a raid controller (H240ar), due to compatibility issues (disks in HBA mode do not work correctly, a logical volume of only one disk has been created in raid0)
>
> Smart HBA H240ar in Slot 0 (Embedded) (RAID Mode) (sn: PDNLN0BRH141D6)
>    Internal Drive Cage at Port 2I, Box 0, OK
>    array A (Solid State SAS, Unused Space: 0 MB)
>       logicaldrive 1 (186.3 GB, RAID 0, OK)
>       physicaldrive 1I:3:1 (port 1I:box 3:bay 1, Solid State SAS, 200 GB, OK)
>    array B (Solid State SAS, Unused Space: 0 MB)
>       logicaldrive 2 (186.3 GB, RAID 0, OK)
>       physicaldrive 1I:3:2 (port 1I:box 3:bay 2, Solid State SAS, 200 GB, OK)
>
> This bug seems to be related only to few nodes and it does not happen always.
> Wiping the disks before the deployment obviously solve the issue.

Ryan Harper (raharper) wrote :

I think I understand the problem here.

In the above log, when the system booted, /dev/sdb had two partitions defined, but previously it had three partitions, one of which was part of an mdadm device.

When curtin starts, it examines the current disk configuration and attempts to wipe out metadata at the known partitions and offsets. We then proceed to partition the disks as requested. When the third partition on sdb is created and the program writing to the device (parted or sgdisk) exits, systemd-udev, which has a watch on device nodes created in /dev, triggers udev rules.

The 64-md-raid-assembly.rules rule triggers an assembly of a RAID array; if the partition is a member, it gets claimed by the kernel driver (/sys/class/block/sdb3/holders/ has a link to mdX). This prevents exclusive opens of the device, which in turn prevents further creation/use of /dev/sdb3.
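
The holder relationship described here can be inspected directly in sysfs. What follows is a minimal Python sketch, not curtin's actual clear_holders code: the names `list_holders` and `assert_clear` and the `sysfs_root` parameter are hypothetical, introduced only for illustration. It assumes nothing beyond the kernel exposing holders as entries under /sys/class/block/<dev>/holders/, as in the sdb3 case.

```python
import os


def list_holders(devname, sysfs_root="/sys/class/block"):
    """Return the sorted names of devices holding `devname`."""
    holders_dir = os.path.join(sysfs_root, devname, "holders")
    try:
        return sorted(os.listdir(holders_dir))
    except FileNotFoundError:
        # The device (or its holders directory) does not exist.
        return []


def assert_clear(devname, sysfs_root="/sys/class/block"):
    """Raise RuntimeError if any other block device still holds `devname`."""
    holders = list_holders(devname, sysfs_root)
    if holders:
        raise RuntimeError(
            "%s is held by: %s" % (devname, ", ".join(holders)))
```

Calling `assert_clear("sdb3")` right after partitioning would, under these assumptions, raise if udev had already assembled an md array on top of the new partition, which is the condition that blocked exclusive opens.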

Ryan Harper (raharper)
Changed in curtin:
status: Incomplete → Confirmed
Ryan Harper (raharper) wrote :

I've uploaded a curtin build to this ppa:

ppa:raharper/bugfixes

In there, you can get curtin - 0.1.0~bzr443-0ubuntu1 which contains additional debugging output as well as changes which should fix this bug.

Please test and report if this allows successful repeated installs.

Changed in curtin:
status: Confirmed → In Progress
Paolo de Rosa (paolo-de-rosa) wrote :

Hi Ryan,
please find attached what we got running curtin 0.1.0~bzr443-0ubuntu1 in this environment.

Paolo de Rosa (paolo-de-rosa) wrote :
Ryan Harper (raharper) wrote :

Looking at the install-log, it appears that the updated curtin was not used.

In particular, the updated curtin now prints the output of the initial mdadm assembly command like this:

start: cmd-install/stage-partitioning/builtin/cmd-block-meta: started: curtin command block-meta
start: cmd-install/stage-partitioning/builtin/cmd-block-meta: started: removing previous storage devices
Running command ['mdadm', '--assemble', '--scan', '-v'] with allowed return codes [0, 1, 2] (shell=False, capture=True)
mdadm assemble scan results:

mdadm: looking for devices for further assembly
mdadm: no recogniseable superblock on /dev/vdd1
mdadm: Cannot assemble mbr metadata on /dev/vdd
mdadm: Cannot assemble mbr metadata on /dev/vdc
mdadm: no recogniseable superblock on /dev/vdb
mdadm: no recogniseable superblock on /dev/vda1
mdadm: Cannot assemble mbr metadata on /dev/vda
mdadm: no recogniseable superblock on /dev/ram15
mdadm: no recogniseable superblock on /dev/ram14
mdadm: no recogniseable superblock on /dev/ram13
mdadm: no recogniseable superblock on /dev/ram12
mdadm: no recogniseable superblock on /dev/ram11
mdadm: no recogniseable superblock on /dev/ram10
mdadm: no recogniseable superblock on /dev/ram9
mdadm: no recogniseable superblock on /dev/ram8
mdadm: no recogniseable superblock on /dev/ram7
mdadm: no recogniseable superblock on /dev/ram6
mdadm: no recogniseable superblock on /dev/ram5
mdadm: no recogniseable superblock on /dev/ram4
mdadm: no recogniseable superblock on /dev/ram3
mdadm: no recogniseable superblock on /dev/ram2
mdadm: no recogniseable superblock on /dev/ram1
mdadm: no recogniseable superblock on /dev/ram0
mdadm: No arrays found in config file or automatically

Running command ['mdadm', '--detail', '--scan', '-v'] with allowed return codes [0, 1] (shell=False, capture=True)
mdadm detail scan after assemble:
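
The log lines above show each command being run without a shell, with output captured, and with a list of allowed return codes. A rough Python sketch of that pattern, using only the standard library, might look like the following. `run_allowed` is a hypothetical helper named here for illustration; it is not curtin's own implementation.

```python
import subprocess


def run_allowed(cmd, allowed_rcs=(0,)):
    """Run `cmd` without a shell, capture its output, and accept any
    return code listed in `allowed_rcs`; raise on any other code."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode not in allowed_rcs:
        raise subprocess.CalledProcessError(
            proc.returncode, cmd, output=proc.stdout, stderr=proc.stderr)
    return proc.stdout, proc.stderr
```

For example, `run_allowed(['mdadm', '--assemble', '--scan', '-v'], allowed_rcs=(0, 1, 2))` would mirror the first command in the log, tolerating the same return codes the log lists.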

Paolo de Rosa (paolo-de-rosa) wrote :

After restarting the MAAS region controller, it seems that the node is using the right curtin version. Please find attached the log of the successful installation. The bugfix is working.

Ryan Harper (raharper)
Changed in curtin:
status: In Progress → Fix Committed
Scott Moser (smoser)
Changed in curtin (Ubuntu):
status: New → Fix Released
Changed in curtin (Ubuntu Xenial):
status: New → Confirmed
Scott Moser (smoser)
description: updated
Scott Moser (smoser)
Changed in curtin (Ubuntu Yakkety):
status: New → Confirmed
Changed in curtin (Ubuntu Xenial):
importance: Undecided → Medium
Changed in curtin (Ubuntu Yakkety):
importance: Undecided → Medium
Changed in curtin (Ubuntu):
importance: Undecided → Medium
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Paolo, or anyone else affected,

Accepted curtin into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/curtin/0.1.0~bzr460-0ubuntu1~16.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in curtin (Ubuntu Yakkety):
status: Confirmed → Fix Committed
tags: added: verification-needed
Brian Murray (brian-murray) wrote :

Hello Paolo, or anyone else affected,

Accepted curtin into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/curtin/0.1.0~bzr460-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in curtin (Ubuntu Xenial):
status: Confirmed → Fix Committed
Scott Moser (smoser) wrote :

I've run curtin's vmtest testsuite with the test provided in the
description by doing the following.

# get curtin for its tests at revision 460
$ bzr branch lp:curtin curtin.dist
$ bzr branch -r 460 curtin.dist curtin-r460
$ cd curtin-r460
## need to get 'curtainer' and 'curtin-from-container' from trunk
$ cp ../trunk.dist/tools/{curtainer,curtin-from-container} tools/
$ ./tools/vmtest-system-setup
$ ./tools/curtainer --proposed xenial sm-curtin-x1
...
Unpacking curtin (0.1.0~bzr460-0ubuntu1~16.04.1) ...
Setting up curtin-common (0.1.0~bzr460-0ubuntu1~16.04.1) ...
Setting up python3-curtin (0.1.0~bzr460-0ubuntu1~16.04.1) ...
Setting up curtin (0.1.0~bzr460-0ubuntu1~16.04.1) ...

# bug 1656369 tests/vmtests/test_multipath.py
# bug 1659509 tests/unittests/test_commands_block_meta.py
# bug 1661337 tests/vmtests/test_apt_source.py
$ name=sm-curtin-x1
$ CURTIN_VMTEST_TOPDIR=$PWD/$name CURTIN_VMTEST_CURTIN_EXE="./tools/curtin-from-container $name curtin" \
   ./tools/jenkins-runner \
     tests/vmtests/test_multipath.py \
     tests/unittests/test_commands_block_meta.py \
     tests/vmtests/test_apt_source.py

I'm attaching a tarball of the output directory sm-curtin-x1

Note: due to packaging bug 1666986, the installation logs show:
   curtin: Installation started. (0.1.0)
Once that is fixed, we'll start seeing 0.1.0~bzr460-0ubuntu1~16.04.1.

tags: added: verification-done-xenial verification-needed-yakkety
removed: verification-needed
tags: added: verification-done-yakkety
removed: verification-needed-yakkety
Scott Moser (smoser) wrote :

I've run curtin's vmtest testsuite with the test provided in the
description by doing the following.

# get curtin for its tests at revision 460
$ bzr branch lp:curtin curtin.dist
$ bzr branch -r 460 curtin.dist curtin-r460
$ cd curtin-r460
## need to get 'curtainer' and 'curtin-from-container' from trunk
$ cp ../trunk.dist/tools/{curtainer,curtin-from-container} tools/
$ ./tools/vmtest-system-setup
$ ./tools/curtainer images:ubuntu/yakkety --proposed sm-curtin-y1
# note, used images: due to bug 1668710
$ ./tools/curtainer --proposed yakkety sm-curtin-y1
....
Setting up curtin (0.1.0~bzr460-0ubuntu1~16.10.1) ...

# bug 1656369 tests/vmtests/test_multipath.py
# bug 1659509 tests/unittests/test_commands_block_meta.py
# bug 1661337 tests/vmtests/test_apt_source.py
$ name=sm-curtin-y1
$ CURTIN_VMTEST_TOPDIR=$PWD/$name CURTIN_VMTEST_CURTIN_EXE="./tools/curtin-from-container $name curtin" \
   ./tools/jenkins-runner \
     tests/vmtests/test_multipath.py \
     tests/unittests/test_commands_block_meta.py \
     tests/vmtests/test_apt_source.py

Note: due to packaging bug 1666986, the installation logs show:
   curtin: Installation started. (0.1.0)
Once that is fixed, we'll start seeing 0.1.0~bzr460-0ubuntu1~16.10.1.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package curtin - 0.1.0~bzr460-0ubuntu1~16.10.1

---------------
curtin (0.1.0~bzr460-0ubuntu1~16.10.1) yakkety-proposed; urgency=medium

  * New upstream snapshot.
    - Install zipl in target on s390x arch. (LP: #1662346)
    - avoid UnicodeDecode error on passing non-utf8 into shlex
    - adjustments to version string handling, improved pack unit tests.
    - helpers/common: Add grub install debugging output
    - curtin: add version module and display in output and logs
    - content decoding in load_file, apply_net raise exception on errors
    - gpg: retry when recv'ing gpg keys fail (LP: #1661337)
    - Add clear_holders checks to disk and partition handlers (LP: #1659509)
    - net: add new lines after rendered static routes. (LP: #1649652)
    - multipath: don't run update-grub; setup_grub will handle this better.
      (LP: #1656369)
    - Test changes:
      - vmtest: Add tests for zesty and Trusty HWE-X kernels.
      - tests: fix tox tip-pycodestyle complaints
      - image-sync: add debugging output to help diagnose errors
      - vmtest: change get_curtin_version to use version subcommand.
      - Remove style checking during build and add latest style checks to tox
      - subp doc an unit test improvements.
      - flake8: remove unused variable.
      - vmtest: Add the ability to add extra config files to test execution.
      - vmtest: overhaul image sync
      - vmtest: skip apt-proxy test if not set
      - vmtest: add 'webserv' helper
      - vmtest: add CURTIN_VMTEST_CURTIN_EXE variable.

 -- Scott Moser <email address hidden> Thu, 16 Feb 2017 22:30:13 -0500

Changed in curtin (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for curtin has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package curtin - 0.1.0~bzr460-0ubuntu1~16.04.1

---------------
curtin (0.1.0~bzr460-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * New upstream snapshot.
    - Install zipl in target on s390x arch. (LP: #1662346)
    - avoid UnicodeDecode error on passing non-utf8 into shlex
    - adjustments to version string handling, improved pack unit tests.
    - helpers/common: Add grub install debugging output
    - curtin: add version module and display in output and logs
    - content decoding in load_file, apply_net raise exception on errors
    - gpg: retry when recv'ing gpg keys fail (LP: #1661337)
    - Add clear_holders checks to disk and partition handlers (LP: #1659509)
    - net: add new lines after rendered static routes. (LP: #1649652)
    - multipath: don't run update-grub; setup_grub will handle this better.
      (LP: #1656369)
    - Test changes:
      - vmtest: Add tests for zesty and Trusty HWE-X kernels.
      - tests: fix tox tip-pycodestyle complaints
      - image-sync: add debugging output to help diagnose errors
      - vmtest: change get_curtin_version to use version subcommand.
      - Remove style checking during build and add latest style checks to tox
      - subp doc an unit test improvements.
      - flake8: remove unused variable.
      - vmtest: Add the ability to add extra config files to test execution.
      - vmtest: overhaul image sync
      - vmtest: skip apt-proxy test if not set
      - vmtest: add 'webserv' helper
      - vmtest: add CURTIN_VMTEST_CURTIN_EXE variable.

 -- Scott Moser <email address hidden> Wed, 08 Feb 2017 19:40:38 -0500

Changed in curtin (Ubuntu Xenial):
status: Fix Committed → Fix Released
Scott Moser (smoser) wrote : Fixed in Curtin 17.1

This bug is believed to be fixed in curtin 17.1. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released