ephemeral-disk-warning yields /mnt 0444 immutable on package update when /mnt is bind-mounted

Bug #1755629 reported by Paul Meyer
This bug affects 3 people
Affects: walinuxagent (Ubuntu)
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

Both /etc/init/ephemeral-disk-warning.conf and /usr/sbin/ephemeral-disk-warning contain code to determine where the ephemeral disk is mounted, write a warning file, and set that file read-only (0444) and immutable (+i). When the ephemeral drive is bind-mounted to several places (e.g. when it is used as Docker backing storage), running either of these scripts leaves /mnt itself with mode 0444 and immutable, usually breaking whatever was using the storage space.

We had one customer who had a large number of yarn nodes go bad because of this. Output from `journalctl -u ephemeral-disk-warning.service`:

-- Logs begin at Thu 2018-03-01 00:08:50 UTC, end at Tue 2018-03-13 22:33:22 UTC. --
Mar 01 00:09:14 wn103-limitl systemd[1]: Starting Write warning to Azure ephemeral disk...
Mar 01 00:09:14 wn103-limitl systemd[1]: Started Write warning to Azure ephemeral disk.
Mar 05 06:31:44 wn103-limitl systemd[1]: Stopped Write warning to Azure ephemeral disk.
Mar 05 06:32:41 wn103-limitl systemd[1]: Starting Write warning to Azure ephemeral disk...
Mar 05 06:32:41 wn103-limitl ephemeral-disk-warning[95793]: /usr/sbin/ephemeral-disk-warning: line 7: /mnt
Mar 05 06:32:41 wn103-limitl ephemeral-disk-warning[95793]: /mnt/docker-tmp/plugins
Mar 05 06:32:41 wn103-limitl ephemeral-disk-warning[95793]: /mnt/docker-tmp/overlay2/DATALOSS_WARNING_README.txt: No such file or directory
Mar 05 06:32:41 wn103-limitl ephemeral-disk-warning[95793]: chmod: cannot access '/mnt/docker-tmp/overlay2/DATALOSS_WARNING_README.txt': No such file or directory
Mar 05 06:32:41 wn103-limitl ephemeral-disk-warning[95793]: chattr: No such file or directory while trying to stat /mnt/docker-tmp/overlay2/DATALOSS_WARNING_README.txt
Mar 05 06:32:41 wn103-limitl systemd[1]: Started Write warning to Azure ephemeral disk.

This happened right after the walinuxagent package was updated to 2.2.21+really2.2.20-0ubuntu1~16.04.1.
I was able to reproduce the problematic lines on my dev box, where I have two directories bind-mounted into my home dir:

$ dev_resource=$(readlink -f /dev/disk/azure/resource-part1)
$ dev_resource_mp=$(awk '$1==R {print$2}' "R=${dev_resource}" /proc/mounts)
$ echo $dev_resource_mp
/mnt /home/paulmey/packer_work /home/paulmey/packer_cache
$ cat /proc/mounts |grep sdb1
/dev/sdb1 /mnt ext4 rw,relatime,data=ordered 0 0
/dev/sdb1 /home/paulmey/packer_work ext4 rw,relatime,data=ordered 0 0
/dev/sdb1 /home/paulmey/packer_cache ext4 rw,relatime,data=ordered 0 0
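
The packaged script is not quoted in this report, so the following is only a sketch of the suspected failure mode, assuming the script builds the warning file path from the multi-line variable and passes it unquoted to chmod/chattr. The paths are taken from the journal output above; `count_args` is a hypothetical helper that stands in for chmod so nothing on the system is modified.

```shell
# Assumption: the mount point variable holds three newline-separated paths,
# as in the journal output from the affected node.
dev_resource_mp='/mnt
/mnt/docker-tmp/plugins
/mnt/docker-tmp/overlay2'
warn_file="${dev_resource_mp}/DATALOSS_WARNING_README.txt"

# Hypothetical stand-in for chmod/chattr: reports how many arguments it got.
count_args() { echo "$#"; }

# Unquoted: the shell splits the value on newlines into separate arguments,
# so a command like chmod would be applied to /mnt itself.
unquoted_count=$(count_args $warn_file)
first_unquoted=$(set -- $warn_file; echo "$1")

# Quoted: the value stays one (nonexistent) path.
quoted_count=$(count_args "$warn_file")

printf 'unquoted: %s argument(s), first is %s\n' "$unquoted_count" "$first_unquoted"
printf 'quoted:   %s argument(s)\n' "$quoted_count"
```

Under that assumption, only the last of the three arguments is a missing file, which would match the single chmod and chattr error in the log while /mnt and /mnt/docker-tmp/plugins are changed silently.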

It looks like this script needs to be more specific in determining the mountpoint, or maybe even make its own temporary mount point for the ephemeral drive, since /proc/mounts does not indicate which mounts are bind mounts (and what the source directory for that bind mount was).
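
One possible way to be more specific, sketched here and not taken from the packaging: /proc/self/mountinfo, unlike /proc/mounts, records the root of each mount within its filesystem (field 4), so a bind mount of a subdirectory can be told apart from the mount of the filesystem root. `find_primary_mountpoint` is a hypothetical helper name, and the second argument exists only so the function can be tried against a sample file.

```shell
# Sketch only: find_primary_mountpoint is a hypothetical helper, not part of
# the walinuxagent packaging.
# usage: find_primary_mountpoint DEVICE [MOUNTINFO_FILE]
find_primary_mountpoint() {
    dev=$1
    mountinfo=${2:-/proc/self/mountinfo}
    awk -v dev="$dev" '
    {
        # Optional fields end at a lone "-"; fs type and mount source follow.
        sep = 0
        for (i = 7; i <= NF; i++) if ($i == "-") { sep = i; break }
        if (sep == 0) next
        # $4 is the root of the mount inside the filesystem, $5 the mount point.
        # Print only the first mount of this device whose root is "/".
        if ($(sep + 2) == dev && $4 == "/") { print $5; exit }
    }' "$mountinfo"
}
```

It could be called as `find_primary_mountpoint "$(readlink -f /dev/disk/azure/resource-part1)"`. A bind mount of the whole filesystem also has root "/", so this sketch simply takes the first such entry and prints at most one path.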
It looks like LP#1626318 is also an example of this bug.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in walinuxagent (Ubuntu):
status: New → Confirmed

Christian Ehrhardt (paelzer) wrote :

I agree that LP#1626318 is essentially the same and dup'ed it onto this one.

Christian Ehrhardt  (paelzer) wrote :

I'd suggest at least adding something like
  if [ -d "$(dirname "$warn_file")" ]
around the cat,
and then before the chmod/chattr
  if [ -f "$warn_file" ]

This does not stop the current detection from being broken for multi/bind-mounts, but should avoid at least some of the issues.
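
A minimal sketch of what those two guards could look like together, assuming the script writes the warning with cat and a here-document. `write_warning` is a hypothetical wrapper, and the warning text is a placeholder, not the text shipped in the package.

```shell
# Sketch only: write_warning is a hypothetical wrapper; the real script's
# warning text and variable handling may differ.
# usage: write_warning MOUNTPOINT
write_warning() {
    warn_file="$1/DATALOSS_WARNING_README.txt"

    # Only write when the target directory really exists.
    if [ -d "$(dirname "$warn_file")" ]; then
        cat > "$warn_file" <<'EOF'
WARNING: this is a temporary disk; data stored here may be lost.
EOF
    fi

    # Only change mode and attributes of an existing regular file, and quote
    # the path so it can never expand to several arguments.
    if [ -f "$warn_file" ]; then
        chmod 0444 "$warn_file"
        chattr +i "$warn_file"
    fi
}
```

With a multi-line mount point value, both tests fail and the function does nothing, so /mnt keeps its mode. As noted above, this still does not pick the right mount point.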

The file does not come from the upstream project but only from the debian/ subdirectory of the packaging, so filing a bug upstream makes no sense IMHO.

Subscribing Daniel/Lukasz who did the last uploads to inquire if they would be willing to take a look at this.

technicianted (technicianted) wrote :

This bug is hitting our clusters hard, especially when dynamically adding nodes. In that case, container workloads that use /mnt are scheduled on the new nodes before walinuxagent is updated, which corrupts the attrs and perms of /mnt, and potentially of the files and folders underneath.

A workaround that fixes attrs and perms whenever walinuxagent is updated is a bit of a hack and does not immediately solve the problem when it first happens.

Bryce Harrington (bryce)
tags: added: server-triage-discuss
Robie Basak (racb)
tags: removed: server-triage-discuss
Bryce Harrington (bryce)
Changed in walinuxagent (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged