curtin fails to configure fstab properly for iSCSI disks under LVM/MD

Bug #1683910 reported by Newell Jensen
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
curtin
High
Unassigned

Bug Description

Original subject: Can't SSH into deployed node that has an iSCSI disk.

MAAS 2.2 -- Deploying a node that has an iSCSI disk succeeds but you are not able to SSH into the deployed node. See attached files for this machine's curtin config and installation results for this machine.

Related branches

Revision history for this message
Newell Jensen (newell-jensen) wrote :
Revision history for this message
Newell Jensen (newell-jensen) wrote :
description: updated
Revision history for this message
Ryan Harper (raharper) wrote :

Do you have the boot log from the installed machine?

Changed in curtin:
status: New → Incomplete
Revision history for this message
Newell Jensen (newell-jensen) wrote :

Ryan,

Intel is still in the process of upgrading the software on their PODM and they have deleted all the machines we were using but once I get access again I will include the boot log.

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1683910] Re: Can't SSH into deployed node that has an iSCSI disk.

OK.

It would be useful to know if the system is pingable at the assigned
network IP
and ssh -vvv output would be useful to see if there's an ssh connection vs.
ssh not available on the port.

Some curtin debugging commands/tips here:

https://gist.github.com/smoser/2610e9b78b8d7b54319675d9e3986a1b

On Wed, Apr 19, 2017 at 10:10 AM, Newell Jensen <<email address hidden>
> wrote:

> Ryan,
>
> Intel is still in the process of upgrading the software on their PODM
> and they have deleted all the machines we were using but once I get
> access again I will include the boot log.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1683910
>
> Title:
> Can't SSH into deployed node that has an iSCSI disk.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1683910/+subscriptions
>

Revision history for this message
Newell Jensen (newell-jensen) wrote : Re: Can't SSH into deployed node that has an iSCSI disk.
Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1683910] Re: Can't SSH into deployed node that has an iSCSI disk.

Version 2.17.1245. Copyright (C) 2016 American Megatrends, Inc.
BIOS Date: 03/25/2016 09:12:23 Ver: F20A3A03
Press <DEL> or <F2> to enter setup.
Press <F11> for BBS POPUP menu.
Press <F12> if you want to boot from the network.

So it didn't boot?

On Wed, Apr 19, 2017 at 7:34 PM, Newell Jensen <email address hidden>
wrote:

> ** Attachment added: "boot log"
> https://bugs.launchpad.net/curtin/+bug/1683910/+
> attachment/4865245/+files/boot-log.txt
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1683910
>
> Title:
> Can't SSH into deployed node that has an iSCSI disk.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1683910/+subscriptions
>

Revision history for this message
Newell Jensen (newell-jensen) wrote : Re: Can't SSH into deployed node that has an iSCSI disk.

It definitely booted. It deployed, rebooted and came to a console. I then looked to see if sshd was running, which it wasn't.

Scroll through entire boot log to see this.

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1683910] Re: Can't SSH into deployed node that has an iSCSI disk.

Sorry I missed the rest of the file.

Here's the error:

  198.963624] cloud-init[3003]: curtin: Installation finished.
[ 201.014455] Buffer I/O error on dev dm-0, logical block 1081344,
lost sync page write
[ 201.030103] JBD2: Error -5 detected when updating journal
superblock for dm-0-8.
[ 201.044872] Aborting journal on device dm-0-8.
[ 201.053738] Buffer I/O error on dev dm-0, logical block 1081344,
lost sync page write
[ 201.069370] JBD2: Error -5 detected when updating journal
superblock for dm-0-8.
[ 201.084167] Buffer I/O error on dev dm-0, logical block 0, lost
sync page write
[ 201.098768] EXT4-fs error (device dm-0): ext4_put_super:813:
Couldn't clean up the journal
[ 201.115327] EXT4-fs (dm-0): Remounting filesystem read-only
[ 201.126446] EXT4-fs (dm-0): previous I/O error to superblock detected
[ 201.139308] Buffer I/O error on dev dm-0, logical block 0, lost
sync page write

curtin finishes install finishes but it appears that dm-0 which is
likely the root

volume (storage config creates a LVM device over the underlying iscsi connect)

We should try to stop the system from rebooting and see if we can inspect iscsi

status.

https://gist.github.com/smoser/2610e9b78b8d7b54319675d9e3986a1b

On Thu, Apr 20, 2017 at 8:24 AM, Newell Jensen <email address hidden>
wrote:

> It definitely booted. It deployed, rebooted and came to a console. I
> then looked to see if sshd was running, which it wasn't.
>
> Scroll through entire boot log to see this.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1683910
>
> Title:
> Can't SSH into deployed node that has an iSCSI disk.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1683910/+subscriptions
>

Revision history for this message
Ryan Harper (raharper) wrote : Re: Can't SSH into deployed node that has an iSCSI disk.
Download full text (3.4 KiB)

Looking at the system logs during an installation, I think what we're seeing is that when curtin completes an install, it logs out of the iscsi connection but hasn't flushed the io devices completely; in this case ,there's an lvm over the iscsi disk and the logout happens before everything is umounted.

Apr 20 20:23:13 civil-cattle cloud-init[3057]: Running command ['sync'] with allowed return codes [0] (shell=False, capture=False)
Apr 20 20:23:13 civil-cattle cloud-init[3057]: iscsiadm_logout: target=iqn.maas.io:priyank portal=10.1.0.100:3260
Apr 20 20:23:13 civil-cattle cloud-init[3057]: Running command ['iscsiadm', '--mode=node', '--targetname=iqn.maas.io:priyank', '--portal=10.1.0.100:3260', '--logout'] with allowed return codes [0] (shell=False, capture=True)
Apr 20 20:23:13 civil-cattle kernel: [ 195.813504] sd 11:0:0:1: [sdc] Synchronizing SCSI cache
Apr 20 20:23:14 civil-cattle cloud-init[3057]: Command returned stdout=Logging out of session [sid: 2, target: iqn.maas.io:priyank, portal: 10.1.0.100,3260]
Apr 20 20:23:14 civil-cattle cloud-init[3057]: Logout of [sid: 2, target: iqn.maas.io:priyank, portal: 10.1.0.100,3260] successful.
Apr 20 20:23:14 civil-cattle cloud-init[3057]: , stderr=
Apr 20 20:23:14 civil-cattle cloud-init[3057]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=False)
Apr 20 20:23:14 civil-cattle cloud-init[3057]: Running command ['lsblk', '--noheadings', '--bytes', '--pairs', '--output=ALIGNMENT,DISC-ALN,DISC-GRAN,DISC-MAX,DISC-ZERO,FSTYPE,GROUP,KNAME,LABEL,LOG-SEC,MAJ:MIN,MIN-IO,MODE,MODEL,MOUNTPOINT,NAME,OPT-IO,OWNER,PHY-SEC,RM,RO,ROTA,RQ-SIZE,SIZE,STATE,TYPE,UUID'] with allowed return codes [0] (shell=False, capture=True)
Apr 20 20:23:14 civil-cattle cloud-init[3057]: Running command ['umount', '/tmp/tmplwixwl87/target/mnt'] with allowed return codes [0] (shell=False, capture=False)
Apr 20 20:23:14 civil-cattle iscsid: Connection2:0 to [target: iqn.maas.io:priyank, portal: 10.1.0.100,3260] through [iface: default] is shutdown.
Apr 20 20:23:14 civil-cattle kernel: [ 196.375256] Buffer I/O error on dev dm-0, logical block 1081344, lost sync page write
Apr 20 20:23:14 civil-cattle kernel: [ 196.392268] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
Apr 20 20:23:14 civil-cattle kernel: [ 196.407032] Aborting journal on device dm-0-8.
Apr 20 20:23:14 civil-cattle kernel: [ 196.415902] Buffer I/O error on dev dm-0, logical block 1081344, lost sync page write
Apr 20 20:23:14 civil-cattle kernel: [ 196.431540] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
Apr 20 20:23:14 civil-cattle kernel: [ 196.446326] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
Apr 20 20:23:14 civil-cattle kernel: [ 196.460917] EXT4-fs error (device dm-0): ext4_put_super:813: Couldn't clean up the journal
Apr 20 20:23:14 civil-cattle kernel: [ 196.477463] EXT4-fs (dm-0): Remounting filesystem read-only
Apr 20 20:23:14 civil-cattle kernel: [ 196.488581] EXT4-fs (dm-0): previous I/O error to superblock detected
Apr 20 20:23:14 civil-cattle kernel: [ 196.501446] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
...

Read more...

Revision history for this message
Ryan Harper (raharper) wrote :

After working around the iscsi logout (which may not have been fatal) over serial console I can see early filesystem mount errors for the iscsi volume:

dev-disk-by\x2duuid-cbe775ec\x2d57bd\x2d4b5d\x2db5e1\x2d1ccddf80ae28.device: Job dev-disk-by\x2duuid-cbe775ec\x2d57bd\x2d4b5d\x2db5e1\x2d1ccddf80ae28.device/start timed out.
Apr 20 20:53:24.097043 ubuntu systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-cbe775ec\x2d57bd\x2d4b5d\x2db5e1\x2d1ccddf80ae28.device.
Apr 20 20:53:24.121055 ubuntu systemd[1]: Dependency failed for /mnt.
Apr 20 20:53:24.137007 ubuntu systemd[1]: Dependency failed for Local File Systems.
Apr 20 20:53:24.153056 ubuntu systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
Apr 20 20:53:24.153063 ubuntu systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
Apr 20 20:53:24.153370 ubuntu systemd[1]: mnt.mount: Job mnt.mount/start failed with result 'dependency'.
Apr 20 20:53:24.153381 ubuntu systemd[1]: dev-disk-by\x2duuid-cbe775ec\x2d57bd\x2d4b5d\x2db5e1\x2d1ccddf80ae28.device: Job dev-disk-by\x2duuid-cbe775ec\x2d57bd\x2d4b5d\x2db5e1\x2d1ccddf80ae28.device/start failed with result 'timeout'.

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1683910] Re: Can't SSH into deployed node that has an iSCSI disk.

Looks like curtin needs to inject '_netdev' to the mount options if the
underlying device is iscsi.

On Thu, Apr 20, 2017 at 4:19 PM, Ryan Harper <email address hidden>
wrote:

> After working around the iscsi logout (which may not have been fatal)
> over serial console I can see early filesystem mount errors for the
> iscsi volume:
>
> dev-disk-by\x2duuid-cbe775ec\x2d57bd\x2d4b5d\x2db5e1\x2d1ccddf80ae28.device:
> Job dev-disk-by\x2duuid-cbe775ec\x2d57bd\x2d4b5d\x2db5e1\x2d1ccddf80ae28.device/start
> timed out.
> Apr 20 20:53:24.097043 ubuntu systemd[1]: Timed out waiting for device
> dev-disk-by\x2duuid-cbe775ec\x2d57bd\x2d4b5d\x2db5e1\
> x2d1ccddf80ae28.device.
> Apr 20 20:53:24.121055 ubuntu systemd[1]: Dependency failed for /mnt.
> Apr 20 20:53:24.137007 ubuntu systemd[1]: Dependency failed for Local File
> Systems.
> Apr 20 20:53:24.153056 ubuntu systemd[1]: local-fs.target: Job
> local-fs.target/start failed with result 'dependency'.
> Apr 20 20:53:24.153063 ubuntu systemd[1]: local-fs.target: Triggering
> OnFailure= dependencies.
> Apr 20 20:53:24.153370 ubuntu systemd[1]: mnt.mount: Job mnt.mount/start
> failed with result 'dependency'.
> Apr 20 20:53:24.153381 ubuntu systemd[1]: dev-disk-by\x2duuid-cbe775ec\
> x2d57bd\x2d4b5d\x2db5e1\x2d1ccddf80ae28.device: Job
> dev-disk-by\x2duuid-cbe775ec\x2d57bd\x2d4b5d\x2db5e1\x2d1ccddf80ae28.device/start
> failed with result 'timeout'.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1683910
>
> Title:
> Can't SSH into deployed node that has an iSCSI disk.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1683910/+subscriptions
>

Revision history for this message
Ryan Harper (raharper) wrote : Re: Can't SSH into deployed node that has an iSCSI disk.

OK. I've debugged the situation. There are two issues:

1) at the end of a curtin install which uses an iscsi disk (which also is included in an LVM or other non-filesystem holder) we logout of the iscsi disk prior to unmounting it from the target. This may not ensure that all data is flushed despite the sync call; this could cause corruption in the filesystem on the iscsi volume. I believe we can defer disconnect until after we've unmounted everything from target

2) fstab entries for iscsi volumes are not marked as requiring network and systemd will then attempt to mount the early, fail and then system goes into recovery mode. The fix here is to add '_netdev' to the mount options.

Changed in curtin:
importance: Undecided → High
status: Incomplete → Confirmed
Revision history for this message
USD Importer (usd-importer-bot) wrote :

In theory, we do inject _netdev, but not if the kname lookup didn't detect iscsi. I'm going to try and dig around on the system to see why that might be.

Changed in curtin:
assignee: nobody → USD Importer (usd-importer-bot)
Revision history for this message
Nish Aravamudan (nacc) wrote :

That's what comes from owning a LP bot that you need to adjust ssh keys for :)

That last comment was me.

I see the node is up now, Ryan did you manually fix it up and then let it boot through?

I'd like to get back on it in the failed state so I can get a better idea of what's going wrong. Given the layering of iSCSI, we can probably actually remove the iscsiadm_logout() call from disconnect() [really, we could remove disconnect() altogether] because the iscsi service & drivers are supposed to ensure the data gets written out and devices disconnect cleanly (aiui).

Changed in curtin:
assignee: USD Importer (usd-importer-bot) → Nish Aravamudan (nacc)
Revision history for this message
Nish Aravamudan (nacc) wrote :

e.g., to delay tear down of iSCSI sessions until, in theory, all users above it have finished using them?

=== modified file 'curtin/commands/install.py'
--- curtin/commands/install.py 2017-04-20 15:57:21 +0000
+++ curtin/commands/install.py 2017-04-20 23:12:25 +0000
@@ -452,8 +452,6 @@
     finally:
         log_target_path = instcfg.get('save_install_log',
                                       '/root/curtin-install.log')
- # need to do some processing on iscsi disks to disconnect?
- iscsi.disconnect_target_disks(workingd.target)
         if log_target_path:
             copy_install_log(logfile, workingd.target, log_target_path)
         util.do_umount(workingd.target, recursive=True)

Revision history for this message
Ryan Harper (raharper) wrote :

I did manually fix it.

First, I modified the curtin preseed as described here:

https://gist.github.com/smoser/2610e9b78b8d7b54319675d9e3986a1b

And then redeployed the node; and then ssh'ed into the internal ip.
From there after the install completed (along with the error messages)
I modified /curtin/curtin/commands/install.py to remove the disconnect.

Then I rebooted the node.

After it came back up (quite a while), I needed IPMI console to get in
and found /etc/fstab only had two UUID= mounts, and the /mnt iscsi device did not have _netdev which then had systemd try to mount /mnt before network was up.

I manually modified /etc/fstab in target to add ,_netdev and rebooted.

That came up fine and ssh works; that's how the node is now.

Revision history for this message
Nish Aravamudan (nacc) wrote :

After some IRC discussion between Ryan and I we have concluded that curtin/commands/block_meta.py::mount_handler needs to be adjusted to understand how LVM/DM work and to use _netdev in fstab if any slave of the LVM/DM is on iscsi.

Pseudcode: http://paste.ubuntu.com/24423692/

But we'll move it further up after the mount has succeeded on volume_path, as it must be legitimate at that point, and look through the device name and the kname's slaves for any iscsi in the kname. If it's there, then _netdev will be used, otherwise defaults.

Changed in curtin:
status: Confirmed → In Progress
Nish Aravamudan (nacc)
summary: - Can't SSH into deployed node that has an iSCSI disk.
+ curtin fails to configure fstab properly for iSCSI disks under LVM/MD
description: updated
Revision history for this message
Newell Jensen (newell-jensen) wrote :

This was fix committed correct?

Revision history for this message
Ryan Harper (raharper) wrote :
Ryan Harper (raharper)
Changed in curtin:
assignee: Nish Aravamudan (nacc) → nobody
status: In Progress → Fix Committed
Revision history for this message
Scott Moser (smoser) wrote : Fixed in Curtin 17.1

This bug is believed to be fixed in curtin in 17.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers