Curtin fails to install on a disk previously installed with RAID

Bug #1708052 reported by Kellen Renshaw
This bug affects 7 people
Affects     Status        Importance  Assigned to   Milestone
MAAS        Fix Released  High        Matt Rae
MAAS 2.2    Fix Released  High        Blake Rouse
curtin      Fix Released  Medium      Ryan Harper

Bug Description

Deployment fails with the following error after previously deploying a system with a RAID 1 array and then running "quick erase" during the release process.

curtin: Installation started. (0.1.0~bzr505-0ubuntu1~16.04.1)
third party drivers not installed or necessary.
An error occured handling 'vda-part1': PermissionError - [Errno 13] Permission denied: '/sys/class/block/md0/md/sync_action'
[Errno 13] Permission denied: '/sys/class/block/md0/md/sync_action'
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: An error occured handling 'vda-part1': PermissionError - [Errno 13] Permission denied: '/sys/class/block/md0/md/sync_action'
        [Errno 13] Permission denied: '/sys/class/block/md0/md/sync_action'

Stderr: ''

I have reproduced this on:
MAAS 2.2.0 (bzr6054-0ubuntu2~16.04.1)
MAAS 2.2.1 (6078-g2a6d96e-0ubuntu1~16.04.1)

Steps to reproduce:
1. Create a machine with 2 disks (I used virt-manager)
2. Commission the machine
3. Create RAID1 + ext4 root
4. Deploy - this deploy is successful
5. Release with quick erase - succeeds
6. Delete the storage configuration in MAAS
7. Commission again to restore the default storage configuration in MAAS
8. Deploy - fails with the above error

I investigated a machine that was failing to deploy, and discovered that recreating a partition on the first disk allowed mdadm to detect the old RAID array.

By default, the first partition starts 1MB after the beginning of the disk. MAAS builds the RAID out of (in my case) vda1 and vdb. Therefore the superblock for the array is somewhere after the 1MB that is erased by the "quick erase" function.
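The arithmetic above can be sketched quickly. This is a back-of-the-envelope check, not code from MAAS or curtin: it assumes mdadm's default metadata format 1.2, which places the superblock 4 KiB from the start of each member device (the offsets and variable names here are illustrative assumptions).

```python
# Where does the md superblock land, measured from the start of the disk?
# Assumes mdadm metadata 1.2 (superblock 4 KiB into each member device).
MiB = 1024 * 1024
KiB = 1024

quick_erase_bytes = 1 * MiB    # quick erase wipes the first 1 MiB of the disk
partition_start = 1 * MiB      # vda1 starts 1 MiB into the disk by default
superblock_offset = 4 * KiB    # metadata 1.2: 4 KiB into the member device

# Superblock positions relative to the start of the underlying disk:
vda1_superblock = partition_start + superblock_offset  # member is partition vda1
vdb_superblock = superblock_offset                     # member is whole disk vdb

print(vda1_superblock > quick_erase_bytes)  # True: vda1's superblock survives
print(vdb_superblock > quick_erase_bytes)   # False: vdb's superblock is erased
```

This would explain why recreating the partition on the first disk was enough for mdadm to find the old array again: only the superblock inside vda1 survives the 1 MiB wipe.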

As a workaround, zeroing out the first/last 4MB in a commissioning script appears to work.
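The workaround could look something like the sketch below. The `wipe_ends()` helper and the 4 MiB constant are illustrative, not part of MAAS; only run this against a device you intend to wipe.

```python
# Sketch of the commissioning-script workaround: zero the first and last
# 4 MiB of each disk so an md superblock near either end is destroyed.
import os

WIPE_BYTES = 4 * 1024 * 1024  # 4 MiB at each end

def wipe_ends(path, wipe_bytes=WIPE_BYTES):
    """Zero the first and last wipe_bytes of the device (or file) at path."""
    with open(path, "r+b") as dev:
        size = dev.seek(0, os.SEEK_END)  # seek to end to learn the size
        dev.seek(0)
        dev.write(b"\0" * min(wipe_bytes, size))
        if size > wipe_bytes:
            dev.seek(size - wipe_bytes)
            dev.write(b"\0" * wipe_bytes)
```

Zeroing both ends matters because mdadm's older metadata formats (0.90 and 1.0) store the superblock near the end of the device, not the start.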

Revision history for this message
Matt Rae (mattrae) wrote :

I was able to reproduce this bug following the steps provided, on MAAS 2.2.2.

After wiping the disk, I booted with a GParted ISO and verified that a RAID partition still existed.

I modified the maas_wipe.py code to overwrite the first 2MB of the disk rather than 1MB. After wiping with 2MB, GParted no longer sees any partitions, and deployment succeeds.

Attaching a patch with the change from 1MB to 2MB.
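A leftover superblock can also be checked for without booting GParted. This helper is illustrative and not part of maas_wipe.py; the magic number 0xa92b4efc and the 4 KiB offset are the md (metadata 1.x) constants, but the function name and usage are assumptions.

```python
# Rough check for a leftover md metadata-1.2 superblock: read 4 bytes at
# the expected offset and compare against the md magic number (stored
# little-endian on disk).
import struct

MD_MAGIC = 0xA92B4EFC   # md superblock magic
SB_OFFSET = 4 * 1024    # metadata 1.2: 4 KiB from the device start

def has_md_superblock(path, offset=SB_OFFSET):
    """Return True if an md 1.x superblock magic is found at offset."""
    with open(path, "rb") as dev:
        dev.seek(offset)
        raw = dev.read(4)
    return len(raw) == 4 and struct.unpack("<I", raw)[0] == MD_MAGIC
```

Running this against each member device before and after a wipe would show whether the wipe actually reached the superblock.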

Changed in maas:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Matt Rae (mattrae)
milestone: none → 2.3.0
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Blake Rouse (blake-rouse) wrote :

This still has a curtin issue where curtin should not fail to install because an existing RAID 1 is already present. The clear_holders code should handle this correctly and keep on moving.

summary: - Quick erase doesn't remove md superblock
+ Curtin fails to install on a disk previously installed with RAID
Revision history for this message
Ryan Harper (raharper) wrote :

Could we please get the curtin config and install log output?

 * maas 2.0 via cli
   maas <session> machine get-curtin-config <system-id>

 * Web UI: On the node details page in the installation output section at the bottom of the page

Changed in curtin:
status: New → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

To add to Blake's comment, the fix in MAAS is to make quick erase correctly erase it. But regardless of this, curtin should do the right thing.

Changed in maas:
milestone: 2.3.0 → 2.3.0alpha2
Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Ryan Harper (raharper) wrote :

I was able to reproduce this issue by enabling curtin's dirty-disk mode on an LVM over RAID configuration, which deploys the same configuration twice. While the configuration isn't identical to the one reported, the error produced (a permission failure on sync_action) is the same.
A fix for this is underway.

Changed in curtin:
assignee: nobody → Ryan Harper (raharper)
status: Incomplete → In Progress
Ryan Harper (raharper)
Changed in curtin:
status: In Progress → Fix Committed
Scott Moser (smoser)
Changed in curtin:
importance: Undecided → Medium
Revision history for this message
Scott Moser (smoser) wrote : Fixed in Curtin 17.1

This bug is believed to be fixed in curtin in 17.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Why use a greater wipe size instead of the correct call:

mdadm --zero-superblock /dev/sda

The metadata may not be only in the first/last blocks...

It is also probably a good idea to add something like (I haven't checked it yet):
pvremove -y -ff /dev/*
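The cleanup suggested above could be scripted along these lines. This is a sketch only: the device list and the `cleanup_commands()` helper are illustrative assumptions, and the commands are printed rather than executed so nothing is wiped by accident.

```python
# Build the mdadm/pvremove cleanup commands per device, as argument
# lists suitable for subprocess.run(). Destructive if actually run:
# only execute against disks that are meant to be wiped.
import subprocess

def cleanup_commands(devices):
    cmds = [["mdadm", "--zero-superblock", dev] for dev in devices]
    cmds += [["pvremove", "-y", "-ff", dev] for dev in devices]
    return cmds

for cmd in cleanup_commands(["/dev/vda1", "/dev/vdb"]):  # example devices
    print(" ".join(cmd))
    # subprocess.run(cmd, check=False)  # uncomment on a machine being wiped
```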

Revision history for this message
Darren (pneumatic) wrote :

This bug is still present in Server 18.04. A simple mdadm --zero-superblock resolves it in seconds. The dd zero command does not resolve the issue, as the superblock appears to be in a reserved section of the disk. This has been a nuisance for a decade now.

Revision history for this message
Vidmantas (vidmantasvgtu) wrote :

Having the same issue.
