Can't deploy CentOS with an XFS partition

Bug #1965587 reported by Derek DeMoss
This bug affects 3 people
Affects  Status   Importance  Assigned to  Milestone
MAAS     Triaged  Medium      Unassigned

Bug Description

Recently (a few weeks ago?) something changed, and we can no longer deploy CentOS 7 images if an XFS partition is defined in the storage tab, regardless of which partition it is.

Generally, our storage layout was:
sda-part1 536.9MB fat32 /boot/efi
sda-part2 959.7GB ext4 /
sdb-part1 960.2GB xfs /scratch

Upon attempting a deploy there are errors visible on the local console:

```
[  654.445670] scsi 15:0:0:0: Direct-Access AMI Virtual HDisk8 1.00 PQ: 0 ANSI: 8 CCS
[  654.448933] sd 15:0:0:0: Attached scsi generic sg2 type 0
[  654.454282] sd 15:0:0:0: [sdb] Attached SCSI removable disk
Authorization not available. Check if polkit service is running or see debug message for more information.
[  674.158800] XFS (sda3): Superblock has unknown read-only compatible features (0x4) enabled.
[  674.161433] XFS (sda3): Attempted to mount read-only compatible filesystem read-write.
[  674.164043] XFS (sda3): Filesystem can only be safely mounted read only.
[  674.166700] XFS (sda3): SB validate failed with error -22.
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to
try again to boot into default mode.
Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.

Press Enter to continue.
```

Attempting to boot into default mode fails, so I cannot view the journalctl output.

Tags: sts
Heitor (heitorpbittencourt) wrote :

Additionally, if we deploy a machine with an unformatted partition, format it to XFS, add it to fstab, and reboot, the machine does not boot anymore.

Alberto Donato (ack) wrote :

What version of MAAS are you using, and which Ubuntu version is being used for provisioning?

Changed in maas:
status: New → Incomplete
Heitor (heitorpbittencourt) wrote :

We use MAAS 2.9.2 and Ubuntu Focal for commissioning.

A possibly related issue we reported is https://bugs.launchpad.net/maas/+bug/1966343

Alberto Donato (ack) wrote :

Could you please provide the full installation output from the machine failing deployment?

Alberto Donato (ack) wrote :

So it seems that Bionic is required as the commissioning series in order to deploy CentOS 8.

MAAS should require it explicitly, but a workaround is to change it globally in the config.
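
A minimal sketch of that global workaround from the command line. It assumes a logged-in MAAS CLI profile named `admin` (hypothetical) and that the global setting is the `commissioning_distro_series` config key; `MAAS_CLI` is a hypothetical override, not a MAAS feature, so the call can be dry-run with `echo` first.

```shell
# Sketch: set the global commissioning series through the MAAS CLI.
# Assumptions: the CLI profile is already logged in, and MAAS_CLI is a
# hypothetical override used only to dry-run the command.
set_commissioning_series() {
    profile=$1
    series=$2
    "${MAAS_CLI:-maas}" "$profile" maas set-config \
        name=commissioning_distro_series value="$series"
}

# Dry run in a subshell: print the arguments instead of contacting MAAS.
(MAAS_CLI=echo; set_commissioning_series admin bionic)
# prints: admin maas set-config name=commissioning_distro_series value=bionic
```

Dropping the `MAAS_CLI=echo` override runs the real command; check the key name against your MAAS version before relying on it.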

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → High
Alberto Donato (ack)
Changed in maas:
milestone: none → next
Derek DeMoss (derek-omnivector) wrote :

Per my comment in https://bugs.launchpad.net/maas/+bug/1966343 :
Switching to Bionic for commissioning did succeed, but that doesn't explain why Focal suddenly stopped working.

Derek DeMoss (derek-omnivector) wrote :

This is great progress and should allow us to proceed with deployments tomorrow, but it does bring up an issue which must be addressed for the long term.

Bionic (18.04) only has one more year of 'General Support' left on the roadmap, so we still need a resolution as to why Focal can't be used, as I assume Bionic will be removed from MAAS's commissioning options at some point.

Jerzy Husakowski (jhusakowski) wrote :

The global commissioning image setting can be used as a workaround, but it is not convenient for the operator. MAAS itself should know how to deploy supported OS images of various families.

Changed in maas:
importance: High → Medium
milestone: next → 3.3.0
Jerzy Husakowski (jhusakowski) wrote :
Heitor (heitorpbittencourt) wrote :

Do you know why this regression happened? This was not an issue ~1 month ago; the Ubuntu images synced and suddenly everything stopped working.

I think a proper solution involves fixing the Ubuntu images and also addressing this bug: https://bugs.launchpad.net/maas/+bug/1888946

tags: added: sts
Mauricio Faria de Oliveira (mfo) wrote :

We have found a workaround on Focal, based on the root cause of the XFS issue.
More good news: this addresses the DM RAID case as well (duplicate issue).

We have not yet, however, found the source of the regression behind this, and
will be looking at that.

Mauricio Faria de Oliveira (mfo) wrote :

Root-cause: Focal's mkfs.xfs enables the reflink feature, not supported by CentOS
---

The default for 'mkfs.xfs -m reflink=0|1' changed from 0 in Bionic to 1 in Focal.
Per mkfs.xfs(8) manpages:

 - Bionic: [1]
 By default, mkfs.xfs __will not__ create reference count btrees
 and therefore __will not__ enable the reflink feature.

 - Focal: [2]
 By default, mkfs.xfs __will__ create reference count btrees
 and therefore __will__ enable the reflink feature.

 [1] http://manpages.ubuntu.com/manpages/bionic/en/man8/mkfs.xfs.8.html
 [2] http://manpages.ubuntu.com/manpages/focal/en/man8/mkfs.xfs.8.html

When the MAAS config includes an XFS partition, the curtin config runs 'mkfs.xfs'.
(The curtin config schema has 'storage/format/extra_options' for 'mkfs'; however,
the MAAS API doesn't expose it, just 'mount'-time options [3,4].)

 [ 68.680243] cloud-init[1652]: Running command ['mkfs.xfs', '-f', '-L', '', '-m', 'uuid=6841d6ce-0658-4c5c-b293-ce1c2f69f4d5', '/dev/vda3'] with allowed return codes [0] (capture=True)

 [3] https://github.com/canonical/curtin/blob/master/doc/topics/storage.rst#format-command
 [4] https://maas.io/docs/api
     POST /MAAS/api/2.0/nodes/{system_id}/blockdevices/{device_id}/partition/{id}?op=format

So the reflink feature bit (0x4) is set, XFS in CentOS 7 doesn't recognize it,
and the filesystem mount unit fails, dropping systemd into emergency/recovery mode.

 [ 0.000000] Linux version 3.10.0-1160.45.1.el7.x86_64 ...
 ...
 [ 3.704656] XFS (vda3): Superblock has unknown read-only compatible features (0x4) enabled.
 [ 3.705751] XFS (vda3): Attempted to mount read-only compatible filesystem read-write.
 [ 3.706749] XFS (vda3): Filesystem can only be safely mounted read only.
 [ 3.707921] XFS (vda3): SB validate failed with error -22.

From ubuntu-focal.git kernel source:

 #define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
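
To make the arithmetic concrete, here is a small sketch that tests a read-only compatible feature mask for the reflink bit. `has_reflink_bit` is a hypothetical helper written for illustration; it only does the bit test and does not read a superblock.

```shell
# Sketch: test whether an XFS ro_compat feature mask has the reflink bit set.
# XFS_SB_FEAT_RO_COMPAT_REFLINK is (1 << 2), which equals 0x4.
has_reflink_bit() {
    mask=$1   # e.g. 0x4, the value printed in the kernel message
    [ $(( $mask & (1 << 2) )) -ne 0 ]
}

if has_reflink_bit 0x4; then
    echo "0x4: reflink bit is set"
fi
```

The mask `0x4` from the dmesg output has exactly this bit set, which is why the 3.10 kernel in CentOS 7 reports an unknown feature.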

The solution is to disable the reflink feature at mkfs.xfs time, with 'mkfs.xfs -m reflink=0'.

In order to do this in an OS/release-dependent, rather than machine-dependent, way, we can use a
curtin_userdata file that is specific to CentOS 7.

Mauricio Faria de Oliveira (mfo) wrote :

Workaround: Create 'mkfs.xfs' wrapper to append '-m reflink=0' on centos70.
---

Add this line to the 'early_commands' section (create it, if needed)
in the file 'curtin_userdata_centos_amd64_generic_centos70',
on top of the original 'curtin_userdata[_centos]' file
(the location/file depends on MAAS install method: SNAP/DEB)

  00-xfs-reflink0: [ '/bin/sh', '-c', 'F=/usr/local/sbin/mkfs.xfs; /usr/bin/echo -e "#!/bin/sh
exec /usr/sbin/mkfs.xfs -m reflink=0 \"\$@\"">$F && chmod +x $F' ]

- SNAP:
copy /var/snap/maas/current/preseeds/curtin_userdata
to curtin_userdata_centos_amd64_generic_centos70
(early_commands section already exists)

 $ diff -u /var/snap/maas/current/preseeds/curtin_userdata /var/snap/maas/current/preseeds/curtin_userdata_centos_amd64_generic_centos70
 --- /var/snap/maas/current/preseeds/curtin_userdata 2021-02-09 11:57:40.868124866 +0000
 +++ /var/snap/maas/current/preseeds/curtin_userdata_centos_amd64_generic_centos70 2022-04-26 15:43:38.703471848 +0000
 @@ -23,6 +23,7 @@
  {{else}}
    driver_00: ["sh", "-c", "echo third party drivers not installed or necessary."]
  {{endif}}
 + 00-xfs-reflink0: [ '/bin/sh', '-c', 'F=/usr/local/sbin/mkfs.xfs; /usr/bin/echo -e "#!/bin/sh
exec /usr/sbin/mkfs.xfs -m reflink=0 \"\$@\"">$F && chmod +x $F' ]
  late_commands:
    maas: [wget, '--no-proxy', {{node_disable_pxe_url|escape.json}}, '--post-data', {{node_disable_pxe_data|escape.json}}, '-O', '/dev/null']
  {{if third_party_drivers and driver}}

- DEB:
copy /etc/maas/preseeds/curtin_userdata_centos
to curtin_userdata_centos_amd64_generic_centos70
(early_commands section has to be created)

 $ diff -u /etc/maas/preseeds/curtin_userdata_centos /etc/maas/preseeds/curtin_userdata_centos_amd64_generic_centos70
 --- /etc/maas/preseeds/curtin_userdata_centos 2022-04-26 15:13:49.147844143 +0000
 +++ /etc/maas/preseeds/curtin_userdata_centos_amd64_generic_centos70 2022-04-26 15:09:14.260539677 +0000
 @@ -7,3 +7,6 @@

  late_commands:
    maas: [wget, '--no-proxy', '{{node_disable_pxe_url}}', '--post-data', '{{node_disable_pxe_data}}', '-O', '/dev/null']
 +
 +early_commands:
 + 00-xfs-reflink0: [ '/bin/sh', '-c', 'F=/usr/local/sbin/mkfs.xfs; /usr/bin/echo -e "#!/bin/sh
exec /usr/sbin/mkfs.xfs -m reflink=0 \"\$@\"">$F && chmod +x $F' ]

...

With this workaround applied, MAAS 2.9.2 can deploy and boot CentOS 7
on a plain XFS partition or XFS on top of MD RAID0:

MAAS version: 2.9.2 (9165-g.c3e7848d1)

 [centos@z-rotomvm22 ~]$ uname -rv
 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021

 [centos@z-rotomvm22 ~]$ mount | grep -w xfs
 /dev/vda3 on /xfs type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

 [centos@z-rotomvm22 ~]$ dmesg | grep -iw xfs
 [ 3.177375] SGI XFS with ACLs, security attributes, no debug enabled
 [ 3.201503] XFS (vda3): Mounting V5 Filesystem
 [ 3.213364] XFS (vda3): Ending clean mount

and,

 [centos@z-rotomvm22 ~]$ uname -rv
 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021

 [centos@z-rotomvm22 ~]$ mount | grep -w xfs
 /dev/md0 on /xfs-raid0 type xfs (rw,relatime,seclabel,attr2,inode64,sunit=1024,swidth=3072,noquota)

 [centos@z-rotomvm22 ~]$ dmesg | grep -iw ...


Mauricio Faria de Oliveira (mfo) wrote :

Note:

There's a literal "\n" (newline escape sequence), not an actual line break, between '#!/bin/sh' and 'exec' in here.

00-xfs-reflink0: [ '/bin/sh', '-c', 'F=/usr/local/sbin/mkfs.xfs; /usr/bin/echo -e "#!/bin/sh\nexec /usr/sbin/mkfs.xfs -m reflink=0 \"\$@\"">$F && chmod +x $F' ]
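
For anyone who wants to see what that one-liner produces before touching a preseed, here is a minimal sketch that writes the same two-line wrapper and runs it against a stub. It uses `printf` instead of `/usr/bin/echo -e`, and every path is a temporary stand-in: `write_wrapper` and `demo_dir` are hypothetical names, and the stub only prints its arguments in place of the real /usr/sbin/mkfs.xfs.

```shell
# Sketch: generate the two-line wrapper with printf and try it on a stub.
# All paths are temporary stand-ins (assumptions), not the real
# /usr/local/sbin/mkfs.xfs and /usr/sbin/mkfs.xfs.
write_wrapper() {
    wrapper=$1   # where the wrapper script is written
    real=$2      # absolute path of the mkfs.xfs binary it should exec
    printf '#!/bin/sh\nexec %s -m reflink=0 "$@"\n' "$real" > "$wrapper" &&
        chmod +x "$wrapper"
}

demo_dir=$(mktemp -d)
mkdir "$demo_dir/real" "$demo_dir/local"

# Stub standing in for the real mkfs.xfs: it only prints its arguments.
printf '#!/bin/sh\necho "$@"\n' > "$demo_dir/real/mkfs.xfs"
chmod +x "$demo_dir/real/mkfs.xfs"

write_wrapper "$demo_dir/local/mkfs.xfs" "$demo_dir/real/mkfs.xfs"

# With the wrapper's directory first in PATH, a plain "mkfs.xfs" call
# resolves to the wrapper, which puts "-m reflink=0" before the other args.
wrapped_args=$(PATH="$demo_dir/local:$demo_dir/real:$PATH" mkfs.xfs -f /dev/vda3)
echo "$wrapped_args"   # prints: -m reflink=0 -f /dev/vda3
rm -rf "$demo_dir"
```

The workaround relies on the same mechanism: the wrapper lives in /usr/local/sbin, which is presumably searched before /usr/sbin when curtin calls 'mkfs.xfs'.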

Derek DeMoss (derek-omnivector) wrote :

@mfo That's awesome that you were able to narrow down the exact issue!

I'll touch base with our other engineer to see how he feels about applying the workaround to our environment.

Per comment #14, should we add it as a single line instead of having an actual newline between the shebang and the exec command?

Thank you!

Mauricio Faria de Oliveira (mfo) wrote :

Hey Derek,

Exactly, single line with '\n' between '#!/bin/sh' and 'exec'.

Derek DeMoss (derek-omnivector) wrote :

@mfo, I think I did it right, but even after restarting (`sudo snap restart maas.supervisor`) on both the region and rack units, it's not working.

I didn't create the file on the rack controller, since the source file didn't exist (except as a .sample)

Here's my curtin_userdata_centos_amd64_generic_centos70 from my region controller (attachment).
Seems correct?

Mauricio Faria de Oliveira (mfo) wrote :

Hey Derek,

It's just missing two spaces of indentation, so the line isn't inside the `early_commands` section; the rest looks good!
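
A rough way to catch this kind of indentation slip is sketched below. `reflink_entry_nested` is a hypothetical helper, not a MAAS tool; it only greps for the two lines and does not parse the YAML, so treat a pass as a hint rather than proof. The path is the SNAP location used earlier in this report.

```shell
# Sketch: check that the wrapper entry is indented under early_commands.
# This is a plain-text check (an assumption-laden shortcut), not a YAML parser.
reflink_entry_nested() {
    preseed=$1
    # "early_commands:" must start at column 0 ...
    grep -q '^early_commands:' "$preseed" || return 1
    # ... and the 00-xfs-reflink0 key must be indented, not at column 0.
    grep -Eq '^[[:space:]]+00-xfs-reflink0:' "$preseed"
}

f=/var/snap/maas/current/preseeds/curtin_userdata_centos_amd64_generic_centos70
if [ -f "$f" ]; then
    if reflink_entry_nested "$f"; then
        echo "00-xfs-reflink0 is indented under a section"
    else
        echo "00-xfs-reflink0 is missing or not indented"
    fi
fi
```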

Derek DeMoss (derek-omnivector) wrote :

@mfo, That was the trick! Verified working with our testing environment.
Thank you!
