NFS mounts in /etc/fstab and cloud-init may cause boot hang

Bug #1913354 reported by C de-Avillez
This bug affects 1 person
Affects      Status    Importance   Assigned to   Milestone
cloud-init   Expired   Medium       Unassigned

Bug Description

Environment: Azure; RHEL 7.8 and 7.9, and OEL 7.8 and 7.9.

On OEL 7.8 cloud-init is cloud-init-18.5-6.el7.x86_64

On both OEL and RHEL 7.* (certainly 7.8 and 7.9), if there is an NFS mount in /etc/fstab (it is unknown whether this also applies to NFSv4), boot may not complete. The end result is a hang, and the system is inaccessible via SSH or serial console login.

All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service.

This happens because rpc.statd and rpc.statd-notify have the following dependencies declared:

# rpc-statd.service:
[Unit]
Description=NFS status monitor for NFSv2/3 locking.
DefaultDependencies=no
Conflicts=umount.target
Requires=nss-lookup.target rpcbind.socket
Wants=network-online.target # <---
After=network-online.target nss-lookup.target rpcbind.socket # <---

PartOf=nfs-utils.service

Wants=nfs-config.service
After=nfs-config.service

[Service]
Environment=RPC_STATD_NO_NOTIFY=1
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=forking
PIDFile=/var/run/rpc.statd.pid
ExecStart=/usr/sbin/rpc.statd $STATDARGS

# rpc-statd-notify.service:

[Unit]
Description=Notify NFS peers of a restart
DefaultDependencies=no
Wants=network-online.target # <---
After=local-fs.target network-online.target nss-lookup.target # <---

# Do not start up in HA environments
ConditionPathExists=!/var/lib/nfs/statd/sm.ha

# if we run an nfs server, it needs to be running before we
# tell clients that it has restarted.
After=nfs-server.service

PartOf=nfs-utils.service

Wants=nfs-config.service
After=nfs-config.service

[Service]
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=forking
ExecStart=-/usr/sbin/sm-notify $SMNOTIFYARGS

while cloud-init.service is:

[Unit]
Description=Initial cloud-init job (metadata service crawler)
Wants=cloud-init-local.service
Wants=sshd-keygen.service
Wants=sshd.service
After=cloud-init-local.service
After=NetworkManager.service network.service
Before=network-online.target # <---
Before=sshd-keygen.service
Before=sshd.service
Before=systemd-user-sessions.service
ConditionPathExists=!/etc/cloud/cloud-init.disabled
ConditionKernelCommandLine=!cloud-init=disabled

[Service]
Type=oneshot
ExecStart=/usr/bin/cloud-init init
RemainAfterExit=yes
TimeoutSec=0

# Output needs to appear in instance console output
StandardOutput=journal+console

[Install]
WantedBy=cloud-init.target

So cloud-init is to be started before network-online.target, while rpc-statd* are to be started after network-online.target.

CX has demonstrated this to my satisfaction.

I see a few possible paths here:

1. CX has to change the (rpc-statd|rpc-statd-notify).service so that they now state:

Before=network-online.target
#Wants=network-online.target
#After=network-online.target

2. CX has to change cloud-init.service so that it now states:

Wants=network-online.target
After=network-online.target
#Before=network-online.target

3. CX removes the NFS mount from /etc/fstab, and adds it as a systemd .mount unit
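
To illustrate path 3, here is a minimal sketch of what such a .mount unit might look like. The server name, export path, mount point and options are hypothetical placeholders, not taken from CX's system, and this path was not tested as part of this report. systemd requires the unit file name to be derived from the mount point, so a mount on /mnt/data has to live in mnt-data.mount.

```ini
# /etc/systemd/system/mnt-data.mount
# Hypothetical example: names and paths below are placeholders.
[Unit]
Description=NFS share on /mnt/data (illustrative)
# Explicitly order the mount after the network is fully up.
Wants=network-online.target
After=network-online.target

[Mount]
# Assumed server and export; replace with the real ones.
What=nfs.example.com:/export/data
Where=/mnt/data
Type=nfs
Options=defaults

[Install]
WantedBy=multi-user.target
```

With this approach the matching line would be removed from /etc/fstab, and the unit activated with `systemctl daemon-reload` followed by `systemctl enable mnt-data.mount`.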

CX opted for change #1 above, and now sees no boot issues.
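
For reference, a minimal sketch of how change #1 could be applied to rpc-statd.service. This is an assumption about the mechanics, not a copy of CX's actual file: since systemd drop-ins can only add dependencies and cannot remove existing ones, the sketch replaces the whole unit by placing a full copy under /etc/systemd/system/. Keeping nss-lookup.target and rpcbind.socket in After= is also an assumption; the change list above only mentions dropping network-online.target.

```ini
# /etc/systemd/system/rpc-statd.service
# Full copy of the vendor unit shown above, with change #1 applied.
# A unit in /etc/systemd/system/ takes precedence over the packaged one.
[Unit]
Description=NFS status monitor for NFSv2/3 locking.
DefaultDependencies=no
Conflicts=umount.target
Requires=nss-lookup.target rpcbind.socket
# Change #1: start before network-online.target instead of after it.
Before=network-online.target
# Assumption: the remaining ordering from the original After= line is kept.
After=nss-lookup.target rpcbind.socket

PartOf=nfs-utils.service

Wants=nfs-config.service
After=nfs-config.service

[Service]
Environment=RPC_STATD_NO_NOTIFY=1
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=forking
PIDFile=/var/run/rpc.statd.pid
ExecStart=/usr/sbin/rpc.statd $STATDARGS
```

rpc-statd-notify.service would need the equivalent edit, followed by `systemctl daemon-reload` (or a reboot) for the override to take effect.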

There is a Red Hat bug about this: https://bugzilla.redhat.com/show_bug.cgi?id=1858930, but it was closed WONTFIX because... support for RHEL 7 ended :-(. I also searched Bugzilla and Launchpad for related bugs on RHEL (7|8), but did not find any.

Revision history for this message
Dan Watkins (oddbloke) wrote :

Thanks for using cloud-init and taking the time to file a bug report!

> All points to a deadlock between the starting of the rpc.statd and rpc.statd-notify services and the cloud-init.service.

I don't see any evidence presented of a deadlock. The NFS units presumably should run after networking is available (that's what the N stands for, after all) and cloud-init.service is the first opportunity to run user's configuration, so having it run at a predictable point in boot before "most" things is desirable.

I don't doubt that you're hitting an issue, but we don't have enough information about it. Can you explain in a little more detail what the exact issue you're seeing is? If possible, please also include the output of `cloud-init collect-logs` from an affected instance, then move this back to New.

Thanks again!

Dan

Changed in cloud-init:
status: New → Incomplete
C de-Avillez (hggdh2) wrote :

Hi Dan, thank you for looking into this.

The issue seems to be driven by cloud-init.service starting *before* the network is actually available (the cloud-init.service definition has "Before=network-online.target"). But the rpc-statd* service definitions are set to start only *after* the network is fully available (Wants=network-online.target, After=network-online.target).

So... c-i starts, and drives the NFS mounts. But the dependent services (again, specifically the rpc-statd*.service) will not start until we reach the required target.

I will check with CXs what we can post in a public bug (or I may move this bug to private) due to PII restrictions.

Rakesh Ginjupalli (linuxelf001) wrote :

Can the rpc-statd service be changed to run after network.target instead of network-online.target?

Changed in cloud-init:
status: Incomplete → New
C de-Avillez (hggdh2) wrote :

Moving bug to Private, in preparation for logs upload.

information type: Public → Private
C de-Avillez (hggdh2) wrote :

Here we have the cloud-init logs, provided by CX, and -- I think -- with PII excised. CX asks this bug to be kept private until the c-i logs are deleted.

C de-Avillez (hggdh2) wrote :

updated logs.

Richard Harding (rharding) wrote :

Spoke with Anh today, who has a couple of other ideas involving mount options and will reply.

C de-Avillez (hggdh2) wrote :

Had a chat with Anh. Will let both customer and my colleagues know the current status.

A summary:

* no solution from RH for this, since RHEL 7 is already EOL;
* NFSv4 does not have this problem, since it does not have the same dependencies on rpc-statd*;
* a patch is being discussed between us and c-i upstream;
* if (and when) the patch is committed, it will still take around 12 months for Red Hat to incorporate it into RHEL 8.

C de-Avillez (hggdh2) wrote :

Deleted attachment and moved the bug to PUBLIC. This has completely stalled since my last comment.

information type: Private → Public
James Falcon (falcojr) wrote :

"a patch is being discussed between us and c-i upstream"

Do you happen to know the details of what was discussed? Unfortunately the upstream folks initially involved are no longer involved with the project.

Changed in cloud-init:
status: New → Triaged
importance: Undecided → Medium
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Triaged → Expired