ds-identify - stuck in uninterruptible sleep state

Bug #1856560 reported by Nicklas
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init
Invalid
Undecided
Unassigned
util-linux (Ubuntu)
New
Undecided
Unassigned

Bug Description

We got recurring issues with the cloud-init/ds-identify process. It spawns sub-processes "blkid -c /dev/null -o export" which gets stuck in the "D" uninterruptible sleep state.

The processes cannot be killed, so the only solution is to reboot the affected server.

root 3839 0.0 0.0 4760 1840 ? S Dec05 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 3844 0.0 0.0 11212 2836 ? D Dec05 0:00 \_ blkid -c /dev/null -o export
root 6943 0.0 0.0 4760 1880 ? S Dec05 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 6948 0.0 0.0 11212 2844 ? D Dec05 0:00 \_ blkid -c /dev/null -o export
root 6111 0.0 0.0 4760 1916 ? S Dec12 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 6149 0.0 0.0 11212 2940 ? D Dec12 0:00 \_ blkid -c /dev/null -o export
root 8765 0.0 0.3 926528 24968 ? Ssl Dec12 0:12 /usr/lib/snapd/snapd
root 9179 0.0 0.0 4760 1892 ? S Dec12 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 9185 0.0 0.0 11980 3552 ? D Dec12 0:00 \_ blkid -c /dev/null -o export

Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic

5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message
Dan Watkins (oddbloke) wrote :

Hi Nicklas,

Thanks for the bug report, this is a bit of a strange one! `blkid` hanging suggests a deeper issue with your VM/hypervisor than just a cloud-init issue, but let's gather the info we need to work that out. Could you run `cloud-init collect-logs` on an affected machine and attach the tarball it creates to this bug, please?

What happens if you run `sudo blkid -c /dev/null -o export` manually on the instance?

(Once you've attached this requested info, please move the bug status back to New.)

Thanks!

Dan

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Nicklas (nicklaswallgren) wrote :

Hey,

Thanks for the response.

If we run the `blkid -c /dev/null -o export` manually, it gets stuck as well.

The strange thing is, this only happens once in a while. If we restart the instance, and run the command again it runs successfully.

Changed in cloud-init:
status: Incomplete → New
Revision history for this message
Nicklas (nicklaswallgren) wrote :

Here is the output of the /proc/<pid>/stack

cat /proc/3844/stack
[<0>] __blkdev_get+0x7a/0x580
[<0>] blkdev_get+0x131/0x340
[<0>] blkdev_open+0x92/0x100
[<0>] do_dentry_open+0x1f8/0x3a0
[<0>] vfs_open+0x2f/0x40
[<0>] path_openat+0x2e8/0x1700
[<0>] do_filp_open+0x9b/0x110
[<0>] do_sys_open+0x1bb/0x2d0
[<0>] __x64_sys_openat+0x20/0x30
[<0>] do_syscall_64+0x5a/0x120
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

Revision history for this message
Dan Watkins (oddbloke) wrote :

Hi Nicklas,

Thanks for the extra info! This sounds like an issue with blkid (and/or your system), rather than a cloud-init bug. I've added a util-linux bug task (which is the package that ships blkid), and I'm going to mark the cloud-init bug task Invalid as I don't believe that there's anything we can do to help you from our perspective.

If you think this is incorrect, please do let us know!

Thanks!

Dan

Changed in cloud-init:
status: New → Invalid
tags: added: rls-bb-incoming
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

ls -latr /proc/<pid>/fd

would be nice too, to see what it has open/holding/reading.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.0.0-32-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity

[ 0.000000] DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[ 0.000000] Hypervisor detected: VMware

Could you please expalin what are you doing?

Because everything about this is odd. You are booting of LVM volume? Running ubiquity desktop installer? And using cloud-init?

It would be interesting for you to describe your setup and suggest how to reproduce your instance type.

Also you appear to run without a cloud hence nocloud seed.

If you are modifying cloud.config anyway, can you specify that ds-identify should just use NoCloud?

journal.txt seems to be incomplete, it's not a full journal of boot with this issue.

Revision history for this message
Nicklas (nicklaswallgren) wrote :

Hey,

We have restarted the affect nodes/servers. I will retrieve the output of `ls -latr /proc/<pid>/fd` once it starts happening again. Often takes a few days.

I haven't setup our Ubuntu environments. Our IT/devops team (with no experience of linux) has installed the Ubuntu-nodes on our VMware environment.

I don't believe we have made any changes to "cloud.config". It's just a clean install of 18.04.3, with 5.0.0-36-generic.

Can I provide any other information?

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

cloud-init is used to perform first time configuration of cloud-images using a cloud metadata source. I.e. a generic image is launched by openstack and provisions per-user keys / packages / etc.

If someone manually installs Ubuntu Desktop / Server using an installer, one should not be installing or using cloud-init as clearly, it will never complete without a correct metadata source present. And when it fails to complete, it re-attempts to reprovision the machine on every boot.

If you don't know whether or not there is a cloud metadata source, and whether or not cloud-init should be provisioning the instances on first boot, or whether a desktop/server installers were used - you need to sort that out first. Please escalate to your IT/devops team to understand whether or not it is intentional that cloud-init is installed and attempted on every boot, and where the metadata for it is suppoed to come from. Then fix cloud-init/metadata such that it completes, or remove/disable/uninstall cloud-init.

This is not a bug, but effectively a support issue at this point. The logs are inconsistent with normal cloud-init usage, and there is no reproducer.

Changed in util-linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Scott Moser (smoser) wrote :

> If someone manually installs Ubuntu Desktop / Server using an installer,
> one should not be installing or using cloud-init as clearly, it will never
> complete without a correct metadata source present. And when it fails to
> complete, it re-attempts to reprovision the machine on every boot.

I'd just like to clarify something. The above statement is neither true
nor helpful.

cloud-init (since 18.04) does not run unless it is on a cloud or has cloud
platform metadata. It is perfectly fine to have cloud-init installed on a
desktop or a server that is not running "on a cloud".

In order to accomplish this, cloud-init is only enabled via a generator.
That generator ran 'blkid', and 'blkid' blocked indefinitely.

Revision history for this message
Dan Watkins (oddbloke) wrote :

Moved back to New, because this is happening in normal cloud-init operation.

Changed in util-linux (Ubuntu):
status: Incomplete → New
Revision history for this message
Nicklas (nicklaswallgren) wrote :

Affected process
```
root 1166 0.0 0.0 72292 5812 ? Ss 2019 0:00 /usr/sbin/sshd -D
```

Output of `ls -latr /proc/1166/fd`

total 0
dr-xr-xr-x 9 root root 0 Jan 14 12:00 ..
dr-x------ 2 root root 0 Jan 14 12:31 .
lrwx------ 1 root root 64 Jan 14 12:31 4 -> 'socket:[24485]'
lrwx------ 1 root root 64 Jan 14 12:31 3 -> 'socket:[24483]'
lrwx------ 1 root root 64 Jan 14 12:31 2 -> 'socket:[24455]'
lrwx------ 1 root root 64 Jan 14 12:31 1 -> 'socket:[24455]'
lr-x------ 1 root root 64 Jan 14 12:31 0 -> /dev/null

Should we disable `cloud-init` altogether?

Revision history for this message
Nicklas (nicklaswallgren) wrote :

Never mind my last message...

Revision history for this message
Nicklas (nicklaswallgren) wrote :
Download full text (3.3 KiB)

Here is the output of `ls -latr /proc/<pid>/fd`

root 3377 0.0 0.0 11980 3352 ? D Jan24 0:00 blkid -c /dev/null -o export
root 6293 0.0 0.0 11980 3316 ? D Jan24 0:00 blkid -c /dev/null -o export
root 10422 0.0 0.0 11980 3556 ? D Jan26 0:00 blkid -c /dev/null -o export
root 13020 0.0 0.0 11980 3492 ? D Jan24 0:00 blkid -c /dev/null -o export
root 13192 0.0 0.0 11980 3752 ? D Jan26 0:00 blkid -c /dev/null -o export
root 15494 0.0 0.0 11980 3704 ? D Jan27 0:00 blkid -c /dev/null -o export
root 15506 0.0 0.0 11212 2532 ? D Jan23 0:00 blkid -c /dev/null -o export
root 15630 0.0 0.0 11980 3456 ? D Jan24 0:00 blkid -c /dev/null -o export
root 15768 0.0 0.0 11980 3700 ? D Jan26 0:00 blkid -c /dev/null -o export
root 16282 0.0 0.0 11980 3704 ? D Jan27 0:00 blkid -c /dev/null -o export
root 18328 0.0 0.0 11980 3616 ? D Jan27 0:00 blkid -c /dev/null -o export
root 18364 0.0 0.0 11980 3432 ? D Jan23 0:00 blkid -c /dev/null -o export
root 18549 0.0 0.0 11980 3324 ? D Jan24 0:00 blkid -c /dev/null -o export
root 18755 0.0 0.0 11980 3620 ? D Jan26 0:00 blkid -c /dev/null -o export
root 19266 0.0 0.0 11980 3556 ? D Jan27 0:00 blkid -c /dev/null -o export
root 20480 0.0 0.0 11980 3616 ? D Jan26 0:00 blkid -c /dev/null -o export
root 21131 0.0 0.0 11980 3504 ? D Jan23 0:00 blkid -c /dev/null -o export
root 21149 0.0 0.0 11980 3636 ? D Jan27 0:00 blkid -c /dev/null -o export
root 21378 0.0 0.0 11980 3180 ? D Jan24 0:00 blkid -c /dev/null -o export
root 21927 0.0 0.0 11980 3624 ? D Jan27 0:00 blkid -c /dev/null -o export
root 23165 0.0 0.0 11980 3704 ? D Jan26 0:00 blkid -c /dev/null -o export
root 23795 0.0 0.0 11980 3616 ? D Jan27 0:00 blkid -c /dev/null -o export
root 23911 0.0 0.0 11980 3456 ? D Jan23 0:00 blkid -c /dev/null -o export
root 24554 0.0 0.0 11980 3592 ? D Jan27 0:00 blkid -c /dev/null -o export
root 25672 0.0 0.0 11980 3752 ? D Jan26 0:00 blkid -c /dev/null -o export
root 28344 0.0 0.0 11980 3704 ? D Jan26 0:00 blkid -c /dev/null -o export
root 30382 0.0 0.0 11980 3216 ? D Jan24 0:00 blkid -c /dev/null -o export

dr-xr-xr-x 9 root root 0 Jan 30 06:29 ..
lr-x------ 1 root root 64 Jan 30 15:22 4 -> /proc/partitions
l-wx------ 1 root root 64 Jan 30 15:22 3 -> /run/cloud-init/ds-identify.log
lrwx------ 1 root root 64 Jan 30 15:22 2 -> /dev/null
l-wx------ 1 root root 64 Jan 30 15:22 1 -> 'pipe:[399588299]'
lrwx------ 1 root root 64 Jan 30 15:22 0 -> /dev/null
dr-x------ 2 root root 0 Jan 30 15:22 .

root@ubuntuapp02:/# ls -latr /proc/23795/fd
total 0
dr-xr-xr-x 9 root root 0 Jan 30 06:29 ..
lr-x------ 1 root root 64 Jan 30 15:23 4 -> /proc/partitions
l-wx------ 1 root root 64 Jan 30 15:23 3 -> /run/cloud-init/ds...

Read more...

Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.