cloud-init

NFS mounts in /etc/fstab and cloud-init may cause boot hang

Bug #1913354 reported by C de-Avillez on 2021-01-26

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
cloud-init	Expired	Medium	Unassigned

Bug Description

Azure, RHEL 7.8, 7.9 and OEL 7.8, 7.9.

On OEL 7.8 cloud-init is cloud-init-18.5-6.el7.x86_64

On both OEL and RHEL 7.* (certainly 7.8 and 7.9), if we have an NFS mount in /etc/fstab (unknown if this applies to NFSv4), then boot may not complete. The end result is a hang, and the system is inaccessible via SSH or serial console login.
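A minimal sketch for checking whether a host has fstab entries of the kind described here. It assumes the standard whitespace-separated fstab layout; the function names and the sample hosts and paths are hypothetical. The `may_need_statd` check is only a heuristic: an `nfs` entry with no version option is treated as possibly NFSv2/3, because the negotiated version cannot be read from /etc/fstab.

```python
def nfs_entries(fstab_text):
    """Return (device, mountpoint, fstype, options) for NFS lines in fstab text."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        device, mountpoint, fstype, options = fields[:4]
        if fstype in ("nfs", "nfs4"):
            entries.append((device, mountpoint, fstype, options.split(",")))
    return entries


def may_need_statd(fstype, options):
    """Heuristic: NFSv2/3 mounts without 'nolock' rely on rpc.statd for locking."""
    if fstype == "nfs4":
        return False
    for opt in options:
        if opt == "nolock":
            return False
        if opt.startswith(("vers=4", "nfsvers=4")):
            return False
    return True


# Hypothetical sample; on a real host read /etc/fstab instead.
SAMPLE_FSTAB = """\
# device                 mountpoint  type  options       dump pass
/dev/sda1                /           ext4  defaults      0 1
filer.example.com:/data  /mnt/data   nfs   vers=3,hard   0 0
filer.example.com:/home  /mnt/home   nfs4  defaults      0 0
"""

for device, mountpoint, fstype, options in nfs_entries(SAMPLE_FSTAB):
    print(mountpoint, fstype, "may need rpc.statd:", may_need_statd(fstype, options))
```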

All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service.

This happens because rpc.statd and rpc.statd-notify have the following dependencies declared:

# rpc-statd.service:
[Unit]
Description=NFS status monitor for NFSv2/3 locking.
DefaultDependencies=no
Conflicts=umount.target
Requires=nss-lookup.target rpcbind.socket
Wants=network-online.target # <---
After=network-online.target nss-lookup.target rpcbind.socket # <---

PartOf=nfs-utils.service

Wants=nfs-config.service
After=nfs-config.service

[Service]
Environment=RPC_STATD_NO_NOTIFY=1
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=forking
PIDFile=/var/run/rpc.statd.pid
ExecStart=/usr/sbin/rpc.statd $STATDARGS

# rpc-statd-notify.service:

[Unit]
Description=Notify NFS peers of a restart
DefaultDependencies=no
Wants=network-online.target # <---
After=local-fs.target network-online.target nss-lookup.target # <---

# Do not start up in HA environments
ConditionPathExists=!/var/lib/nfs/statd/sm.ha

# if we run an nfs server, it needs to be running before we
# tell clients that it has restarted.
After=nfs-server.service

PartOf=nfs-utils.service

Wants=nfs-config.service
After=nfs-config.service

[Service]
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=forking
ExecStart=-/usr/sbin/sm-notify $SMNOTIFYARGS

while cloud-init.service is:

[Unit]
Description=Initial cloud-init job (metadata service crawler)
Wants=cloud-init-local.service
Wants=sshd-keygen.service
Wants=sshd.service
After=cloud-init-local.service
After=NetworkManager.service network.service
Before=network-online.target # <---
Before=sshd-keygen.service
Before=sshd.service
Before=systemd-user-sessions.service
ConditionPathExists=!/etc/cloud/cloud-init.disabled
ConditionKernelCommandLine=!cloud-init=disabled

[Service]
Type=oneshot
ExecStart=/usr/bin/cloud-init init
RemainAfterExit=yes
TimeoutSec=0

# Output needs to appear in instance console output
StandardOutput=journal+console

[Install]
WantedBy=cloud-init.target

So cloud-init is to be started before network-online.target, while rpc-statd* are to be started after network-online.target.
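The ordering can be read mechanically from the unit text. The following is a rough sketch, assuming only the `Before=` and `After=` lines matter for this comparison. The unit excerpts are trimmed copies of the files quoted above, and the helper name `ordering` is hypothetical. Stripping everything after `#` is done only to drop the `# <---` markers used in this report; it is not how systemd itself parses unit files.

```python
def ordering(unit_text):
    """Collect Before=/After= targets from a unit file's text."""
    result = {"Before": [], "After": []}
    for line in unit_text.splitlines():
        # Drop the '# <---' annotations added in this report.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition("=")
        if sep and key in result:
            result[key].extend(value.split())
    return result


# Trimmed excerpts of the units quoted in this report.
RPC_STATD = """\
[Unit]
Wants=network-online.target # <---
After=network-online.target nss-lookup.target rpcbind.socket # <---
After=nfs-config.service
"""

CLOUD_INIT = """\
[Unit]
After=cloud-init-local.service
After=NetworkManager.service network.service
Before=network-online.target # <---
Before=sshd-keygen.service
"""

TARGET = "network-online.target"
print("cloud-init ordered before target:", TARGET in ordering(CLOUD_INIT)["Before"])
print("rpc-statd ordered after target:", TARGET in ordering(RPC_STATD)["After"])
```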

CX has demonstrated this to my satisfaction.

I see a few possible paths here:

1. CX has to change the (rpc-statd|rpc-statd-notify).service so that they now state:

Before=network-online.target
#Wants=network-online.target
#After=network-online.target

2. CX has to change cloud-init.service so that it now states:

Wants=network-online.target
After=network-online.target
#Before=network-online.target

3. CX removes the NFS mount from /etc/fstab, and adds it as a systemd .mount unit

CX opted for change #1 above, and now sees no boot issues.
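For comparison, option 3 could look roughly like the sketch below, which turns one fstab line into the text of a systemd .mount unit. This is an illustration under assumptions, not something CX tested: the `Wants=`/`After=` lines and the `WantedBy=` target are choices made here, the server and paths are hypothetical, and `mount_unit_name` is a simplified stand-in for `systemd-escape --path`, which should be used to confirm the real unit name.

```python
def mount_unit_name(where):
    """Simplified approximation of systemd path escaping for a mount point."""
    trimmed = where.strip("/")
    if not trimmed:
        return "-.mount"  # the root file system
    out = []
    for i, ch in enumerate(trimmed):
        if ch == "/":
            out.append("-")
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            out.append(ch)
        else:
            # Everything else (including '-') becomes \xXX per byte.
            out.extend("\\x%02x" % b for b in ch.encode("utf-8"))
    return "".join(out) + ".mount"


def mount_unit_from_fstab(line):
    """Return (unit file name, unit text) for a single fstab entry."""
    what, where, fstype, options = line.split()[:4]
    text = "\n".join([
        "[Unit]",
        "Description=NFS mount for " + where,
        # Assumption: order the mount after the network is online.
        "Wants=network-online.target",
        "After=network-online.target",
        "",
        "[Mount]",
        "What=" + what,
        "Where=" + where,
        "Type=" + fstype,
        "Options=" + options,
        "",
        "[Install]",
        # Assumption: any suitable target could be used here.
        "WantedBy=multi-user.target",
        "",
    ])
    return mount_unit_name(where), text


# Hypothetical fstab entry.
name, unit = mount_unit_from_fstab("filer.example.com:/data /mnt/data nfs vers=3,hard 0 0")
print("# /etc/systemd/system/" + name)
print(unit)
```

The unit file name has to match the mount point, which is why the name is derived from the `Where=` path rather than chosen freely.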

There is a Red Hat bug about this: https://bugzilla.redhat.com/show_bug.cgi?id=1858930, but it was closed WONTFIX because... support for RHEL7 ended :-(. I also searched Bugzilla and Launchpad for related bugs on RHEL(7|8), but did not find any.

Dan Watkins (oddbloke) wrote on 2021-01-29:

Thanks for using cloud-init and taking the time to file a bug report!

> All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service.

I don't see any evidence presented of a deadlock. The NFS units presumably should run after networking is available (that's what the N stands for, after all) and cloud-init.service is the first opportunity to run user's configuration, so having it run at a predictable point in boot before "most" things is desirable.

I don't doubt that you're hitting an issue, but we don't have enough information about it. Can you explain in a little more detail what the exact issue you're seeing is? If possible, please also include the output of `cloud-init collect-logs` from an affected instance, then move this back to New.

Thanks again!

Dan

Changed in cloud-init:
status:	New → Incomplete

C de-Avillez (hggdh2) wrote on 2021-01-29:

Hi Dan, thank you for looking into this.

The issue seems to be driven by cloud-init.service starting *before* the network is actually available (in the cloud-init.service definition, we have "Before=network-online.target"). But the rpc-statd* service definitions are set to start only *after* the network is fully available (Wants=network-online.target, After=network-online.target).

So... c-i starts, and drives the NFS mounts. But the dependent services (again, specifically the rpc-statd*.service) will not start until we reach the required target.
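The wait chain described here can be sketched as a small wait-for graph. Two edges come straight from the unit files. The third, cloud-init.service blocking until rpc-statd.service has started, is an assumption that models the NFS mount step as described in this comment; it is not something the unit files declare. The names `find_cycle` and `WAITS_ON` are hypothetical.

```python
def find_cycle(waits_on):
    """Return one cycle in a wait-for graph as a list of nodes, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Reached a node already on the current path: that is a cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in waits_on.get(node, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in list(waits_on):
        cycle = visit(start)
        if cycle:
            return cycle
    return None


WAITS_ON = {
    # rpc-statd.service: After=network-online.target
    "rpc-statd.service": ["network-online.target"],
    # cloud-init.service: Before=network-online.target
    "network-online.target": ["cloud-init.service"],
    # ASSUMPTION: the NFS mount driven by cloud-init blocks on rpc-statd starting.
    "cloud-init.service": ["rpc-statd.service"],
}

cycle = find_cycle(WAITS_ON)
print(" -> ".join(cycle) if cycle else "no cycle")
```

Without the assumed third edge the graph has no cycle, which matches the point that the unit ordering alone does not show a deadlock.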

I will check with CXs what we can post in a public bug (or I may move this bug to private) due to PII restrictions.

Rakesh Ginjupalli (linuxelf001) wrote on 2021-02-02:

Can the rpc-statd service be changed to run after network.target instead of network-online.target?

Chris Newcomer (cnewcomer) on 2021-02-11

Changed in cloud-init:
status:	Incomplete → New

C de-Avillez (hggdh2) wrote on 2021-02-15:

Moving bug to Private, in preparation for logs upload.

information type:	Public → Private

C de-Avillez (hggdh2) wrote on 2021-02-22:

Here are the cloud-init logs, provided by CX, and -- I think -- with PII excised. CX asks that this bug be kept private until the c-i logs are deleted.

C de-Avillez (hggdh2) wrote on 2021-02-23:

updated logs.

Richard Harding (rharding) wrote on 2021-03-02:

Spoke with Anh today, who has a couple of other ideas involving mount options and will reply.

C de-Avillez (hggdh2) wrote on 2021-04-20:

Had a chat with Anh. Will let both customer and my colleagues know the current status.

A summary:

* no solution from RH for this, since RHEL 7 is already EOL;
* NFSv4 does not have this problem, since it does not have the same dependencies on rpc-statd*;
* a patch is being discussed between us and c-i upstream;
* if (and when) the patch is committed, it will still take around 12 months for Red Hat to incorporate it into RHEL 8.

C de-Avillez (hggdh2) wrote on 2022-04-12:

Deleted attachment and moved the bug to PUBLIC. This has completely stalled since my last comment.

information type:	Private → Public

James Falcon (falcojr) wrote on 2022-04-18:

"a patch is being discussed between us and c-i upstream"

Do you happen to know the details of what was discussed? Unfortunately the upstream folks initially involved are no longer involved with the project.

Changed in cloud-init:
status:	New → Triaged
importance:	Undecided → Medium

James Falcon (falcojr) wrote on 2023-05-12:

Tracked in Github Issues as https://github.com/canonical/cloud-init/issues/3834

Changed in cloud-init:
status:	Triaged → Expired
