Curtin fails to install on a disk previously installed with RAID

Bug #1708052 reported by Kellen Renshaw
This bug affects 7 people
Affects     Status        Importance  Assigned to   Milestone
MAAS        Fix Released  High        Matt Rae
MAAS 2.2    Fix Released  High        Blake Rouse
curtin      Fix Released  Medium      Ryan Harper

Bug Description

Deployment fails with the following error after previously deploying a system with a RAID 1 array and then running "quick erase" during the release process.

curtin: Installation started. (0.1.0~bzr505-0ubuntu1~16.04.1)
third party drivers not installed or necessary.
An error occured handling 'vda-part1': PermissionError - [Errno 13] Permission denied: '/sys/class/block/md0/md/sync_action'
[Errno 13] Permission denied: '/sys/class/block/md0/md/sync_action'
curtin: Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: An error occured handling 'vda-part1': PermissionError - [Errno 13] Permission denied: '/sys/class/block/md0/md/sync_action'
        [Errno 13] Permission denied: '/sys/class/block/md0/md/sync_action'

Stderr: ''

I have reproduced this on:
MAAS 2.2.0 (bzr6054-0ubuntu2~16.04.1)
MAAS 2.2.1 (6078-g2a6d96e-0ubuntu1~16.04.1)

Steps to reproduce:
1. Create a machine with 2 disks (I used virt-manager)
2. Commission the machine
3. Create RAID1 + ext4 root
4. Deploy - this deploy is successful
5. Release with quick erase - succeeds
6. Delete the storage configuration in MAAS
7. Commission again to restore the default storage configuration in MAAS
8. Deploy - fails with the above error

I investigated a machine that was failing to deploy, and discovered that recreating a partition on the first disk allowed mdadm to detect the old RAID array.

By default, the first partition starts 1MB after the beginning of the disk. MAAS builds the RAID out of (in my case) vda1 and vdb. Therefore the superblock for the array is somewhere after the 1MB that is erased by the "quick erase" function.
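The arithmetic above can be sketched quickly. This is a back-of-the-envelope check, not code from MAAS or curtin: it assumes mdadm's default metadata format 1.2, which places the superblock 4 KiB from the start of each member device (the offsets and variable names here are illustrative assumptions).

```python
# Where does the md superblock land, measured from the start of the disk?
# Assumes mdadm metadata 1.2 (superblock 4 KiB into each member device).
MiB = 1024 * 1024
KiB = 1024

quick_erase_bytes = 1 * MiB    # quick erase wipes the first 1 MiB of the disk
partition_start = 1 * MiB      # vda1 starts 1 MiB into the disk by default
superblock_offset = 4 * KiB    # metadata 1.2: 4 KiB into the member device

# Superblock positions relative to the start of the underlying disk:
vda1_superblock = partition_start + superblock_offset  # member is partition vda1
vdb_superblock = superblock_offset                     # member is whole disk vdb

print(vda1_superblock > quick_erase_bytes)  # True: vda1's superblock survives
print(vdb_superblock > quick_erase_bytes)   # False: vdb's superblock is erased
```

This would explain why recreating the partition on the first disk was enough for mdadm to find the old array again: only the superblock inside vda1 survives the 1 MiB wipe.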

As a workaround, zeroing out the first/last 4MB in a commissioning script appears to work.
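The workaround could look something like the sketch below. The `wipe_ends()` helper and the 4 MiB constant are illustrative, not part of MAAS; only run this against a device you intend to wipe.

```python
# Sketch of the commissioning-script workaround: zero the first and last
# 4 MiB of each disk so an md superblock near either end is destroyed.
import os

WIPE_BYTES = 4 * 1024 * 1024  # 4 MiB at each end

def wipe_ends(path, wipe_bytes=WIPE_BYTES):
    """Zero the first and last wipe_bytes of the device (or file) at path."""
    with open(path, "r+b") as dev:
        size = dev.seek(0, os.SEEK_END)  # seek to end to learn the size
        dev.seek(0)
        dev.write(b"\0" * min(wipe_bytes, size))
        if size > wipe_bytes:
            dev.seek(size - wipe_bytes)
            dev.write(b"\0" * wipe_bytes)
```

Zeroing both ends matters because mdadm's older metadata formats (0.90 and 1.0) store the superblock near the end of the device, not the start.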

Revision history for this message
Matt Rae (mattrae) wrote :

I was able to reproduce this bug following the steps provided, on MAAS 2.2.2.

After wiping the disk, I booted with a GParted ISO and verified that a RAID partition still existed.

I modified the maas_wipe.py code to overwrite the first 2MB of the disk rather than 1MB. After wiping with 2MB, GParted no longer sees any partitions, and deployment succeeds.

Attaching a patch with the change from 1MB to 2MB.
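A leftover superblock can also be checked for without booting GParted. This helper is illustrative and not part of maas_wipe.py; the magic number 0xa92b4efc and the 4 KiB offset are the md (metadata 1.x) constants, but the function name and usage are assumptions.

```python
# Rough check for a leftover md metadata-1.2 superblock: read 4 bytes at
# the expected offset and compare against the md magic number (stored
# little-endian on disk).
import struct

MD_MAGIC = 0xA92B4EFC   # md superblock magic
SB_OFFSET = 4 * 1024    # metadata 1.2: 4 KiB from the device start

def has_md_superblock(path, offset=SB_OFFSET):
    """Return True if an md 1.x superblock magic is found at offset."""
    with open(path, "rb") as dev:
        dev.seek(offset)
        raw = dev.read(4)
    return len(raw) == 4 and struct.unpack("<I", raw)[0] == MD_MAGIC
```

Running this against each member device before and after a wipe would show whether the wipe actually reached the superblock.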

Changed in maas:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Matt Rae (mattrae)
milestone: none → 2.3.0
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Blake Rouse (blake-rouse) wrote :

This still has a curtin issue where curtin should not fail to install because an existing RAID 1 is already present. The clear_holders code should handle this correctly and keep on moving.

summary: - Quick erase doesn't remove md superblock
+ Curtin fails to install on a disk previously installed with RAID
Revision history for this message
Ryan Harper (raharper) wrote :

Could we please get the curtin config and install log output?

 * maas 2.0 via cli
   maas <session> machine get-curtin-config <system-id>

 * Web UI: On the node details page in the installation output section at the bottom of the page

Changed in curtin:
status: New → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

To add to Blake's comment, the fix in MAAS is to make quick erase correctly erase it. But regardless of this, curtin should do the right thing.

Changed in maas:
milestone: 2.3.0 → 2.3.0alpha2
Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Ryan Harper (raharper) wrote :

I was able to reproduce this issue by enabling curtin's dirty-disk mode on an LVM over RAID configuration, which deploys the same configuration twice. While the configuration isn't identical to the one reported, the error produced (a permission failure on sync_action) is the same.
A fix for this is underway.

Changed in curtin:
assignee: nobody → Ryan Harper (raharper)
status: Incomplete → In Progress
Ryan Harper (raharper)
Changed in curtin:
status: In Progress → Fix Committed
Scott Moser (smoser)
Changed in curtin:
importance: Undecided → Medium
Revision history for this message
Scott Moser (smoser) wrote : Fixed in Curtin 17.1

This bug is believed to be fixed in curtin in 17.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Why use a greater wipe size instead of the correct call:

mdadm --zero-superblock /dev/sda

The metadata may not be only in the first/last blocks...

It is also probably a good idea to add something like (I haven't checked it yet):
pvremove -y -ff /dev/*
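The cleanup suggested above could be scripted along these lines. This is a sketch only: the device list and the `cleanup_commands()` helper are illustrative assumptions, and the commands are printed rather than executed so nothing is wiped by accident.

```python
# Build the mdadm/pvremove cleanup commands per device, as argument
# lists suitable for subprocess.run(). Destructive if actually run:
# only execute against disks that are meant to be wiped.
import subprocess

def cleanup_commands(devices):
    cmds = [["mdadm", "--zero-superblock", dev] for dev in devices]
    cmds += [["pvremove", "-y", "-ff", dev] for dev in devices]
    return cmds

for cmd in cleanup_commands(["/dev/vda1", "/dev/vdb"]):  # example devices
    print(" ".join(cmd))
    # subprocess.run(cmd, check=False)  # uncomment on a machine being wiped
```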

Revision history for this message
Darren (pneumatic) wrote :

This bug is still present in Server 18.04. A simple mdadm --zero-superblock resolves it in seconds. The dd zero command does not resolve the issue, as the superblock appears to be in a reserved section of the disk. This has been a nuisance for a decade now.

Revision history for this message
Vidmantas (vidmantasvgtu) wrote :

Having the same issue.
