multiple processes intermittently stall at same point in strace

Bug #1886022 reported by Lachele Foley on 2020-07-02

This bug report was marked for expiration 1 day ago.

This bug affects 1 person
Affects: linux (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

If this isn't a kernel bug, my apologies. I didn't know where else to put it. It affects seemingly unrelated processes, so there wasn't an obvious 'package'. Possibly nscd, which is in glibc, which is close to the kernel...

I noticed frequent, but intermittent hangs/stalls in multiple processes. The processes usually go on to completion after a few seconds or so. Occasionally, I get timeouts that I think might be this same symptom but aren't always strace-able.

At first, I thought it was an authentication issue because I got hangs with "sudo -i" as well as intermittently very slow logins via ssh with ldap. But, then it happened with "vim dum", which could still be authentication somehow, but I don't know enough to determine that. Htop will also hang intermittently, but I don't have a trace of it.

I started running strace on the processes that hang. Each time I managed to have strace running during a hang (it is intermittent), the hang occurred after the same few lines.

Here is an example of the last lines before hang where a normal user issues "strace id username":

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\v\0\0\0\7\0\0\0passwd\0", 19, MSG_NOSIGNAL, NULL, 0) = 19
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000^Cstrace: Process 2816 detached
 <detached ...>

I hit ctrl-C to stop execution so I could easily copy that info, hence detached.
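
For reference, the sendto() payload in the trace can be decoded by hand. The following is a minimal sketch, assuming the nscd request header is three native 32-bit integers (version, request type, key length) followed by the key, which is my reading of glibc's nscd client code rather than something confirmed in this report. The function name decode_nscd_request is made up for illustration.

```python
import struct

# Bytes copied from the sendto() line in the strace output.
payload = b"\2\0\0\0\v\0\0\0\7\0\0\0passwd\0"


def decode_nscd_request(data):
    """Split an nscd request into (version, request_type, key).

    Assumption: the header is three 32-bit ints in native byte order
    (little-endian on x86_64), followed by key_len bytes of key that
    include the trailing NUL.
    """
    version, req_type, key_len = struct.unpack_from("<iii", data, 0)
    key = data[12:12 + key_len]
    return version, req_type, key


version, req_type, key = decode_nscd_request(payload)
# Request type 11 is, as far as I can tell, the request glibc uses to ask
# nscd for a file descriptor to its shared "passwd" cache (assumption).
print(version, req_type, key)
```

If that reading is right, the process is stuck in poll() waiting up to 5000 ms for nscd to answer a passwd cache request, which would fit stalls of a few seconds.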

Here is an example from root using "strace vim dum":

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\v\0\0\0\7\0\0\0passwd\0", 19, MSG_NOSIGNAL, NULL, 0) = 19
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000

...that took practice to capture.

I suspect that LDAP is involved somehow, but it might be victim rather than culprit. Obviously, nscd has some involvement per the lines above. Or possibly, these lines are about strace itself?

I can capture full straces with timings, etc. Just say what you need. Or direct me to the proper venue for this report.
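
One possible way to gather timings without catching a hang by hand is to time repeated lookups in fresh processes. This is only a sketch, assuming Python 3 and the coreutils `id` command are available. The names time_lookups and run_id are hypothetical, and the sample count and pause are arbitrary.

```python
import subprocess
import time


def time_lookups(lookup, samples=5, pause=0.1):
    """Call lookup() repeatedly and return the wall-clock duration of each call."""
    durations = []
    for _ in range(samples):
        start = time.monotonic()
        lookup()
        durations.append(time.monotonic() - start)
        time.sleep(pause)
    return durations


def run_id():
    """Spawn a fresh `id` process, which resolves user and group names."""
    subprocess.run(
        ["id"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )


if __name__ == "__main__":
    for i, seconds in enumerate(time_lookups(run_id), start=1):
        print(f"lookup {i}: {seconds:.3f} s")
```

A fresh process is spawned for each sample because every hang traced so far happened in a newly started process. A long-running process might reuse whatever it obtained from nscd on its first request and so not reproduce the stall.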

I looked in kern.log for evidence of a hardware issue, but didn't see anything that looked significantly unusual. If it seems like hardware, I will appreciate hints as to the component.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-40-generic 5.4.0-40.44
ProcVersionSignature: Ubuntu 5.4.0-40.44-generic 5.4.44
Uname: Linux 5.4.0-40-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu27.3
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: installer 1846 F.... pulseaudio
CasperMD5CheckResult: skip
CurrentDesktop: LXDE
Date: Thu Jul 2 04:36:12 2020
InstallationDate: Installed on 2020-06-25 (7 days ago)
InstallationMedia: Ubuntu-MATE 20.04 LTS "Focal Fossa" - Release amd64 (20200423)
MachineType: HP ProLiant DL380 G7
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.4.0-40-generic root=UUID=be36c28f-47f3-4536-8bb3-8b2f3856fa42 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-40-generic N/A
 linux-backports-modules-5.4.0-40-generic N/A
 linux-firmware 1.187.1
RfKill:

SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 05/05/2011
dmi.bios.vendor: HP
dmi.bios.version: P67
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP67:bd05/05/2011:svnHP:pnProLiantDL380G7:pvr:cvnHP:ct23:cvr:
dmi.product.family: ProLiant
dmi.product.name: ProLiant DL380 G7
dmi.product.sku: 583917-B21
dmi.sys.vendor: HP

Lachele Foley (lachele) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Does -39 have this issue?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Lachele Foley (lachele) wrote :

Yes. And still intermittent.

More data that might or might not be useful: I got frustrated trying to get the grub menu to appear and uninstalled plymouth. So, I can tell you that this happens at tty[1-4] as well as after 'startx' (LXDE is installed). I think it is less likely to happen without plymouth installed. It's hard to get good data, though, because the problem is intermittent. Remote logins (ssh) are still mostly slow but occasionally quick.

Another issue that might be related. I had to stop/disable networking.service partway through attempting to apt update (after the -39 test and now back at -40) because the machine stopped knowing where to find a DNS server, ignoring my entries (and lack of entries) in netplan and attempts at 'netplan apply'. I had disabled network-manager long ago. After stopping networking.service, 'netplan apply' was obeyed and I had DNS again. Even after disabling networking and network-manager, two unconfigured interfaces still come up at boot (via DHCP).

Lachele Foley (lachele) wrote :

I decided to try removing ltsp, ltsp-binaries and dnsmasq, just to see what would happen afterward.

During the attempt, I got a timeout during update-initramfs. Timeouts like this during apt are common. I ran "update-initramfs -u" after the timeout, and that time it seemed to proceed normally. I'll go reboot the machine when I can and report back.

Aside: This machine also serves a website, and that remains fast even when the rest of the machine is slow. After ssh to another internal machine, all is fast again. So it's not a plain network thing.

The timeout:

root@frost:~# apt remove dnsmasq ltsp ltsp-binaries
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  sshfs
Use 'apt autoremove' to remove it.
The following packages will be REMOVED:
  dnsmasq ltsp ltsp-binaries
0 upgraded, 0 newly installed, 3 to remove and 0 not upgraded.
After this operation, 1,669 kB disk space will be freed.
Do you want to continue? [Y/n]
(Reading database ... 296064 files and directories currently installed.)
Removing dnsmasq (2.80-1.1ubuntu1) ...
Removing ltsp (20.06-1~ubuntu20.04.1) ...
Removing ltsp-binaries (20.04-1~ubuntu20.04.1) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for initramfs-tools (0.136ubuntu6.2) ...
update-initramfs: Generating /boot/initrd.img-5.4.0-40-generic
Error: Timeout was reached

Lachele Foley (lachele) wrote :

That didn't fix it.

So, next I uninstalled nscd. At this point, nothing else in apt's purview explicitly depended on it - though it was still showing up in straces.

After that, the behavior has been significantly better so far. So... maybe a bug in nscd?

On that note, just curious: why is nscd showing up in an strace when the root user - having logged in at the console at ctrl-alt-F2, with no GUI - is trying to open a file using vim in its home directory?
