nova_compute after wallaby (in xena) no longer mounts cinder NFS paths

Bug #1955769 reported by Boris Lukashev
Affects: kolla-ansible
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We have multiple cinder services via RBD, iSCSI/LVM, and QCOW2/NFS in one of our stacks. Having completed the migration from wallaby to xena, we can no longer boot any VMs with NFS volumes despite the NFS mounts showing up correctly on the cinder_volume containers and `qemu-img info` working properly as `nova` from those mounts.
The nova_compute containers (even after migrating the VM post-upgrade to another host to reset associations) do not mount the NFS path, and when they try to access the volume for any reason they throw:
```
2021-12-26 22:18:40.668 7 ERROR oslo_messaging.rpc.server [req-90c4d3bf-a796-491e-ab90-e2718f8411b3 3c4f2250c3004f89be7e905716217274 e5dba6e61405445582de5b930f191606 - default default] Exception during message handling: libvirt.libvirtError: cannot read header '/var/lib/nova/mnt/893cab8cdeb3fe1a058e59540fdd5379/volume-09c2935f-1aa3-4bae-90bc-ae9531302c39': Input/output error

```
Checking the path at `/var/lib/nova/mnt/<FSID>` yields nothing. What's really odd, though, is that manually mounting to that path seems to result in it getting bound over or unmounted somehow.

In summary: nova_compute doesn't start the NFS mount for VMs before trying to access the volume, and it appears to remove the mount point or bind the underlying FS over the mounts.

I specifically tested this with a nova-compute host running containers via overlay atop ext4 and via aufs atop ZFS (our normal config) to rule out the storage driver: same deal, no NFS mount, sad instances, and nothing in the nova-compute logs about even trying to mount the NFS export, which the cinder_volume containers do mount correctly.
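As a quick way to confirm the symptom, a sketch like the following (the function name is mine; it just parses a standard /proc/mounts-style table) lists the NFS mounts under /var/lib/nova/mnt. Running it on the host and again inside the nova_compute container shows whether the two views disagree:

```shell
# Sketch, assuming the standard mounts-table layout
# (device mountpoint fstype options dump pass): list the NFS mounts
# nova should see under /var/lib/nova/mnt.
list_nova_nfs_mounts() {
    # $1: path to a mounts table; defaults to /proc/mounts
    awk '($3 == "nfs" || $3 == "nfs4") && $2 ~ "^/var/lib/nova/mnt/" {print $2}' \
        "${1:-/proc/mounts}"
}

# On the host:          list_nova_nfs_mounts
# Inside the container: docker exec nova_compute cat /proc/mounts, then
#                       feed that file to list_nova_nfs_mounts
```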

Revision history for this message
Boris Lukashev (rageltman) wrote :

Debug logs show that nova-compute is aware of the mounts:
```
2021-12-26 22:39:55.691 7 DEBUG nova.virt.libvirt.volume.mount [req-1fba932d-3dcd-42aa-94a5-91ca99902588 3c4f2250c3004f89be7e905716217274 e5dba6e61405445582de5b930f191606 - default default] _HostMountState.mount(fstype=nfs, export=<NFS_HOST_IP>:/srv/nfs, vol_name=volume-09c2935f-1aa3-4bae-90bc-ae9531302c39, /var/lib/nova/mnt/893cab8cdeb3fe1a058e59540fdd5379, options=[]) generation 0 mount /var/lib/kolla/venv/lib/python3.8/site-packages/nova/virt/libvirt/volume/mount.py:287
```
Within the container, that parent-path does not exist:
```
(nova-compute)[root@<HOSTNAME> /]# ls /var/lib/nova/mnt/893cab8cdeb3fe1a058e59540fdd5379,
ls: cannot access '/var/lib/nova/mnt/893cab8cdeb3fe1a058e59540fdd5379,': No such file or directory

```
so the NFS mount can't be created. Creating that path inside /var/lib/nova/mnt results in it being removed.

Revision history for this message
Boris Lukashev (rageltman) wrote :

Confirmed that creating the directory via `mkdir /var/lib/nova/mnt/893cab8cdeb3fe1a058e59540fdd5379` and then trying to start an instance whose volume would live inside that directory (as the root of the NFS mount) removes the `893cab8cdeb3fe1a058e59540fdd5379` component on whatever failure it experiences, leaving `/var/lib/nova/mnt` with no place to attach the NFS mount target. Not sure if that's the only problem, but it does appear to be deleting things.
I am also seeing nova_compute fail to clean up mounts on crash events and then remount the same exports over themselves until the system is properly DoS'd at ~32830 mounts.
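The runaway remount behavior can be spotted before it exhausts the mount table. A small sketch (the helper name is mine; it parses a /proc/mounts-style table) counts how many times each mount point under /var/lib/nova/mnt is stacked; anything above 1 means the export has been mounted over itself:

```shell
# Sketch: count duplicate NFS mounts per mount point under
# /var/lib/nova/mnt. A count greater than 1 means the same export has
# been remounted over itself, as described above.
count_stacked_nova_mounts() {
    # $1: path to a mounts table; defaults to /proc/mounts
    awk '($3 == "nfs" || $3 == "nfs4") && $2 ~ "^/var/lib/nova/mnt/" {print $2}' \
        "${1:-/proc/mounts}" | sort | uniq -c | sort -rn
}
```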

Revision history for this message
Boris Lukashev (rageltman) wrote :

Though I don't have a root cause, I have localized the problem: it's some sort of state issue in the host NFS client, which presents this way after switching from the wallaby container to the xena container for accessing the NFS mount. Rebooting the host after the upgrade makes the problem go away.
Since it's host-level, I suggest either amending the upgrade docs to note that the Cinder/NFS case is not a "clean" upgrade, or implementing some sort of NFS client cleanup routine (mounts, RPC state, etc.) in the upgrade path.
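One shape such a cleanup routine could take (an assumption on my part, not an existing kolla-ansible step): lazily unmount everything under /var/lib/nova/mnt on the compute host before restarting the nova_compute container, instead of rebooting the whole host. The sketch below only prints the umount commands so they can be reviewed before running:

```shell
# Sketch: emit (not run) lazy-umount commands for every mount under
# /var/lib/nova/mnt, as a host-level cleanup step after the upgrade.
# Pipe the output to sh once reviewed.
stale_nova_umount_cmds() {
    # $1: path to a mounts table; defaults to /proc/mounts
    awk '$2 ~ "^/var/lib/nova/mnt/" {print "umount -l " $2}' \
        "${1:-/proc/mounts}"
}

# Usage (as root, after stopping nova_compute):
#   stale_nova_umount_cmds | sh
```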

Revision history for this message
Boris Lukashev (rageltman) wrote (last edit ):

Here's what we ended up with in our nova.conf against Ganesha NFS; it seems quite stable running 50T volumes.
```
nfs_mount_options = rw,vers=4.0,_netdev,noatime,nodiratime,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none,sync

```
The key element there seems to be `vers=4.0`; with higher point versions, the same options still crashed after a while.
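To confirm the client actually negotiated 4.0 rather than a higher minor version, the effective options can be read back from the mounts table (a sketch; the helper name is mine and the standard /proc/mounts field layout is assumed):

```shell
# Sketch: print the vers= option actually in effect for each NFS mount
# under /var/lib/nova/mnt. The kernel records the negotiated version in
# the options column, which may differ from what nova.conf requested.
nova_nfs_versions() {
    # $1: path to a mounts table; defaults to /proc/mounts
    awk '$2 ~ "^/var/lib/nova/mnt/" {
        n = split($4, o, ",")
        for (i = 1; i <= n; i++)
            if (o[i] ~ "^vers=") print $2, o[i]
    }' "${1:-/proc/mounts}"
}
```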
