Ubuntu
subiquity package

Bug #2009141
Comment #35

Comment 35 for bug 2009141

bugproxy (bugproxy) wrote on 2023-04-12: Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2023-04-12 13:24 EDT-------
(In reply to comment #25)
> Thx for attaching the logs and the crash report, we'll investigate ...
>
> What I'm just wondering about are the '"OSError: [Errno 28] No space left on
> device"' messages. Is there something with the FCP/SCSI LUN (size) or
> options to write to it?

Looking at the log, the installer has not made much progress yet. We just successfully(!) probed a few SCSI disks, but have not configured any partitioning, let alone mount points. I take it that the installer would not write to any real disk at that point, so ENOSPC cannot come from the zfcp-attached SCSI disks. Let's not get hung up on zfcp or on the different FCP-attached storage arrays (DS8000, XIV, FlashSystem, etc.): they all present standard SCSI disks, for which the common-code Linux kernel driver sd_mod provides regular block devices. There is nothing special about this at all.

BTW, the kernel boot parameters look odd:

[ 0.440956] Kernel command line: @```@%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

@<email address hidden>, what exact parm file content did you use to boot the installer?
Since the network interface ence0f appears without DPM auto-configuration being used, and I don't see it being configured interactively in the installer, I wonder where its ccwgroup configuration came from. Maybe the parm file had enough leading zeros to get truncated during kernel console output, or maybe there was some boot parameter to group 0.0.0e0f,0.0.0e10,0.0.0e11 for ence0f?

(In reply to comment #29)
> host_installer_shell_cmds_04062023_1.txt
> Note after chzdev --enable, Quantity 1969 files generated in /var/log/crash
> filling up /
> ls -l /var/crash
> ls -l | grep unknown | wc -l
> 1969
...
> 142575 Apr 6 12:18 1680808725.929860353.unknown.crash.gz
> 77 Apr 6 12:18 1680808725.929860353.unknown.meta.gz
> 162995 Apr 6 12:18 1680808726.069142342.unknown.crash.gz
...

> root@ubuntu-server:~# df -T
> Filesystem Type 1K-blocks Used Available Use% Mounted on
> /cow overlay 16473348 16473348 0 100% /
> overlay overlay 292992 292992 0 100% /media/filesystem
> overlay overlay 292992 292992 0 100% /tmp/tmpcsrrjbgt/root.dir

I also see ENOSPC when the installer tries to log something. I assume these writes go to space in the ramdisk that the installer runs in.
There seem to be a large number of (too many?) installer "crash" files under /var/crash, which likely sits on the completely filled overlay filesystem.
Those "crash" files are created by neither chzdev nor lszdev.
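For anyone wanting to repeat the quoted ls/df check in one step, here is a minimal Python sketch. It is not part of subiquity; the helper name summarize_crash_dir and the "*unknown*" pattern are assumptions chosen to mirror the shell commands quoted above.

```python
import shutil
from pathlib import Path


def summarize_crash_dir(crash_dir="/var/crash", pattern="*unknown*"):
    """Count matching crash files and report how full the filesystem is.

    Hypothetical helper for illustration only (not a subiquity API).
    """
    directory = Path(crash_dir)
    files = [p for p in directory.glob(pattern) if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    # disk_usage reports on the filesystem that holds crash_dir
    usage = shutil.disk_usage(directory)
    used_percent = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "file_count": len(files),
        "total_bytes": total_bytes,
        "fs_free_bytes": usage.free,
        "fs_used_percent": used_percent,
    }
```

Calling summarize_crash_dir() in the installer shell should show whether the crash files alone account for the space used on the overlay.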

However, if I read the logs correctly, these debug data files, which consume too much space, are only generated because of earlier Python tracebacks from subiquity. In other words, the ENOSPC (or EMFILE) errors are just misleading follow-on errors.

The very first one of those tracebacks happens on udev settle for the network device (before any zfcp devices):

2023-04-06 19:17:21,023 DEBUG subiquity.server.controllers.filesystem:671 waiting 0.1 to let udev event queue settle
2023-04-06 19:17:21,124 DEBUG subiquitycore.utils:64 run_command called: ['udevadm', 'settle', '-t', '0']
2023-04-06 19:17:21,139 DEBUG subiquitycore.utils:77 run_command ['udevadm', 'settle', '-t', '0'] exited with code 0
2023-04-06 19:17:21,139 ERROR subiquity.server.server:424 top level error
Traceback (most recent call last):
  File "/snap/subiquity/4383/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/snap/subiquity/4383/lib/python3.8/site-packages/subiquity/server/controllers/filesystem.py", line 682, in _udev_event
    action, dev = self._monitor.receive_device()
  File "/snap/subiquity/4383/lib/python3.8/site-packages/pyudev/monitor.py", line 400, in receive_device
    device = self.poll()
  File "/snap/subiquity/4383/lib/python3.8/site-packages/pyudev/monitor.py", line 358, in poll
    if eintr_retry_call(poll.Poll.for_events((self, "r")).poll, timeout):
  File "/snap/subiquity/4383/lib/python3.8/site-packages/pyudev/_util.py", line 164, in eintr_retry_call
    return func(*args, **kwargs)
  File "/snap/subiquity/4383/lib/python3.8/site-packages/pyudev/_os/poll.py", line 94, in poll
    return list(self._parse_events(eintr_retry_call(self._notifier.poll, timeout)))
  File "/snap/subiquity/4383/lib/python3.8/site-packages/pyudev/_os/poll.py", line 109, in _parse_events
    raise IOError("Error while polling fd: {0!r}".format(fd))
OSError: Error while polling fd: 20
2023-04-06 19:17:21,142 DEBUG subiquity.common.errorreport:384 generating crash report
2023-04-06 19:17:21,143 INFO subiquity.common.errorreport:406 saving crash report 'unknown error crashed with OSError' to /var/crash/1680808641.142762184.unknown.crash

This repeats often enough (at 10 Hz?) in the manual udev settle loop, and each iteration creates one of those sizable "crash" files.
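To illustrate why a single persistent error can end up as thousands of files: if every failing iteration writes its own report, the file count grows with the retry rate. The following is only a sketch of one possible guard, assuming a hypothetical reporter that suppresses repeats of the same error within a time window. None of these names come from subiquity, and I have not checked how its error reporting is actually structured.

```python
import time


class DedupCrashReporter:
    """Hypothetical guard: at most one report per distinct error per interval."""

    def __init__(self, min_interval=60.0, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between identical reports
        self.clock = clock                # injectable for testing
        self._last_report = {}            # error signature -> time of last report
        self.suppressed = 0               # how many repeats were skipped

    def should_report(self, exc):
        """Return True if a crash report should be written for this exception."""
        signature = (type(exc).__name__, str(exc))
        now = self.clock()
        last = self._last_report.get(signature)
        if last is not None and now - last < self.min_interval:
            self.suppressed += 1
            return False
        self._last_report[signature] = now
        return True
```

With such a guard, an error repeating every 0.1 seconds would produce one report per window plus a suppressed count, instead of one file per iteration.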

Unfortunately, 1680808641.142762184.unknown.crash.gz and 1680808641.142762184.unknown.meta.gz do _not_ contain any more debug data than we had originally provided (that is, what subiquity printed on the console when asked to show debug data for the "unknown error"). The other crash files show the same traceback as above.
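One piece of extra debug data that might help here is which poll condition was actually flagged on the file descriptor. Below is a minimal sketch using only the standard library select module; describe_poll_mask is a hypothetical helper, not a pyudev or subiquity function, and it assumes a platform where select.poll is available (such as Linux).

```python
import select

# Flag names as exposed by the standard library select module.
_POLL_FLAGS = ("POLLIN", "POLLPRI", "POLLOUT", "POLLERR", "POLLHUP", "POLLNVAL")


def describe_poll_mask(mask):
    """Return the names of the poll flags that are set in an event mask."""
    return [name for name in _POLL_FLAGS if mask & getattr(select, name)]
```

Logging describe_poll_mask(event_mask) next to the fd number would at least distinguish an error condition from a hang-up or an invalid descriptor.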

So coming back to:
(In reply to comment #16)
> > Canonical, why did the installer get an error?
> >
> > Does the installer really have a busy(!) waiting loop calling udevadm settle
> > with zero timeout?
>
> Maybe it's not busy and instead driven by a poll/select loop (with its own
> timeout/sleep) in the installer and therefore called with zero timeout.
>
> > But even if so, with the number of discovered devices and the settle finally
> > returning with success errorlevel 0, it should just work?

Maybe you could share a link to the corresponding source code of subiquity performing the udev settle?
