cloud-init may hang OS boot process due to grep for the entire ISO file when it is attached
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
cloud-init | Fix Released | Medium | Pengpeng Sun |
Bug Description
In our tests of SLES15 with cloud-init installed, if we attach an ISO file to the VM before the VM boots, it often takes more than 10 minutes to start the SLES OS. Sometimes the SLES OS fails to start at all.
We have root-caused it to the "is_cdrom_ovf()" function of "tools/
This function uses the following logic to detect whether an ISO contains a certain string:
> local idstr="http://
> grep --quiet --ignore-case "$idstr" "${PATH_ROOT}$dev"
ref: https:/
It greps the whole ISO file for a certain string, which puts intense I/O pressure on the system.
Worse, the ISO file is sometimes large (e.g. >5GB for an installer DVD) and mounted over NFS. The "grep" process often consumes 99% CPU and appears to hang. systemd then starts more and more "grep" processes, which saturate the CPU and consume all the I/O bandwidth for the ISO file. The system may then hang for a long time and sometimes fails to start.
To fix this issue, I suggest that we not grep the entire ISO file. Rather, we should just check whether the file/dir exists with os.path.exists().
-------
pek2-gosv-
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 13:32 ? 00:00:04 /usr/lib/
…
root 474 1 0 13:34 ? 00:00:00 /bin/sh /usr/lib/
root 482 474 2 13:34 ? 00:00:15 grep --quiet --ignore-case http://
root 1020 1 0 13:35 ? 00:00:00 /bin/sh /usr/lib/
root 1039 1020 1 13:35 ? 00:00:07 grep --quiet --ignore-case http://
polkitd 1049 1 0 13:37 ? 00:00:00 /usr/lib/
root 1051 1 0 13:37 ? 00:00:00 /usr/sbin/wickedd --systemd --foreground
root 1052 1 0 13:37 ? 00:00:00 /usr/lib/
root 1054 1 0 13:37 ? 00:00:00 /usr/sbin/
root 1073 1 0 13:37 ? 00:00:00 /usr/bin/vmtoolsd
root 1097 1 0 13:37 ? 00:00:00 /bin/sh /usr/lib/
root 1110 1097 1 13:37 ? 00:00:04 grep --quiet --ignore-case http://
root 1304 1 0 13:38 ? 00:00:00 /bin/sh /usr/lib/
root 1312 1304 1 13:38 ? 00:00:03 grep --quiet --ignore-case http://
root 1537 1 0 13:40 ? 00:00:00 /usr/bin/plymouth --wait
root 1613 1 0 13:40 ? 00:00:00 /bin/sh /usr/lib/
root 1645 1613 0 13:40 ? 00:00:02 grep --quiet --ignore-case http://
…
grep uses nearly 100% CPU and the system is very slow.
top - 13:46:37 up 26 min, 2 users, load average: 14.14, 15.03, 10.57
Tasks: 225 total, 6 running, 219 sleeping, 0 stopped, 0 zombie
%Cpu(s): 40.1 us, 49.3 sy, 0.0 ni, 0.0 id, 1.4 wa, 0.0 hi, 9.1 si, 0.0 st
KiB Mem : 1000916 total, 64600 free, 355880 used, 580436 buff/cache
KiB Swap: 1288168 total, 1285600 free, 2568 used. 492688 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4427 root 20 0 40100 3940 3084 R 99.90 0.394 0:27.41 top
1016 root 20 0 197796 4852 3400 R 99.90 0.485 1:26.44 vmtoolsd
1723 root 20 0 7256 1860 1556 D 99.90 0.186 0:28.44 grep
484 root 20 0 7256 1684 1396 D 99.90 0.168 1:51.22 grep
1278 root 20 0 7256 1856 1556 D 99.90 0.185 0:38.44 grep
1398 root 20 0 7256 1860 1556 R 99.90 0.186 0:28.53 grep
1061 root 20 0 7256 1856 1556 D 99.90 0.185 0:56.62 grep
-------
Related branches
- Server Team CI bot: Approve (continuous-integration)
- Ryan Harper: Approve
- Scott Moser: Approve
- Diff: 105 lines (+51/-13), 2 files modified
  tests/unittests/test_ds_identify.py (+25/-0)
  tools/ds-identify (+26/-13)
Changed in cloud-init: | |
assignee: | nobody → Pengpeng Sun (pengpengs) |
Changed in cloud-init: | |
importance: | Undecided → Medium |
status: | New → Triaged |
@Peter,
You're correct that ds-identify does grep through
an entire cdrom if attached with an ISO9660 filesystem. The reason for this hack
is that
a.) the OVF specification does not indicate the filesystem label for ISO transport.
If it did, we could require that. As it is, there is no easy way to positively
identify the OVF transport cdrom.
b.) when this code was developed I asked a vmware contact if I could rely on
any such label in practice and was told no.
c.) we did not want to attempt a mount in ds-identify and cannot rely on the
filesystem being mounted at that point in time.
Note that commit 530850f971e improves the situation on vmware to
*not* do the grep if the filesystem has a label 'OVF ENV'. In some recent
experience I've seen this label provided by vmware.
If we can rely on that label on VMware OVF systems, then I think I would be
fine with dropping the grep.
Note that the grep is protected from executing in many ways. It will only
execute if:
a.) the ISO9660 filesystem is on a device named /dev/sr[0-9] or /dev/hd[a-z].
b.) the label is not one of (case insensitive): OVF-Transport, OVFENV, OVF ENV, config-2, rd_rdfe_stable*, cidata.
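To make the two conditions above concrete, here is a minimal sketch of the guard in POSIX shell. It is not the actual ds-identify code: the function name `should_grep_for_ovf` and its two-argument interface are hypothetical, and the label list is copied from condition b.) as written here.

```shell
# Hypothetical helper (not from ds-identify): returns 0 if the grep
# fallback would run for this device path and filesystem label, 1 otherwise.
should_grep_for_ovf() {
    _dev="$1"
    _label="$2"
    # Condition a: only devices named /dev/sr[0-9] or /dev/hd[a-z].
    case "$_dev" in
        /dev/sr[0-9]|/dev/hd[a-z]) ;;
        *) return 1 ;;
    esac
    # Condition b: compare the label case-insensitively; a known label
    # already identifies the datasource, so no grep is needed.
    _lower=$(printf '%s' "$_label" | tr '[:upper:]' '[:lower:]')
    case "$_lower" in
        ovf-transport|ovfenv|"ovf env"|config-2|rd_rdfe_stable*|cidata)
            return 1 ;;
    esac
    return 0
}
```

With this sketch, an installer DVD on /dev/sr0 with an unrelated label still reaches the grep, which is the case reported in this bug.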
If we cannot drop the grep altogether because VMware/OVF spec does not provide
us a filesystem label to match on, then we could potentially further limit the
grep to only occur if the device attached was very small (perhaps < 10M ?).
That way we would not grep through DVD or 700M install ISOs.
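A rough sketch of that size limit, assuming the device size can be read from sysfs (the `size` file there counts 512-byte sectors). The function names, the `MAX_OVF_GREP_BYTES` variable, and the use of `PATH_ROOT` as a prefix for the sysfs path are all assumptions for illustration, not existing ds-identify code.

```shell
# Hypothetical cap: only grep devices smaller than 10 MiB.
MAX_OVF_GREP_BYTES=$((10 * 1024 * 1024))

# Print the size in bytes of a block device such as /dev/sr0.
# Reads sysfs, so the device contents are never touched.
dev_size_bytes() {
    _name="${1##*/}"
    _sectors=$(cat "${PATH_ROOT:-}/sys/class/block/$_name/size" 2>/dev/null) || return 1
    echo $(($_sectors * 512))
}

# Return 0 only if the size is known and below the cap.
small_enough_to_grep() {
    _size=$(dev_size_bytes "$1") || return 1
    [ "$_size" -lt "$MAX_OVF_GREP_BYTES" ]
}
```

The existing grep would then run only when `small_enough_to_grep "$dev"` succeeds; an unknown size is treated as too large, so the expensive path is skipped by default.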
Thoughts?