Xenial (ESM) - 4.15.0-156 creates event storm when ejecting DVD-ROM media

Bug #1944642 reported by Kenneth Lakin (VMware)
This bug affects 1 person
Affects          Status        Importance   Assigned to   Milestone
linux (Ubuntu)   Incomplete    Undecided    Unassigned
Xenial           In Progress   Undecided    Tim Gardner

Bug Description

Hello! I'm a programmer employed by VMware. We're using Ubuntu Xenial, have an ESM contract, and have run into what we believe to be a nasty interaction between a kernel change and the `cdrom_id --eject-media` command. This problem started happening after upgrading from 4.15.0-154 to 4.15.0-156.

I am reporting a kernel "device change" event storm that does not appear to stop on its own. It is triggered when both of the following are true: media is ejected from a DVD-ROM drive, and a udev rule file contains the following line:

ENV{DISK_EJECT_REQUEST}=="?*", RUN+="cdrom_id --eject-media $devnode", GOTO="cdrom_end"

If you eliminate this line from your udev rule file, the storm will never happen. This strongly suggests that the cdrom_id program is a key component in the problem. As mentioned before, this problem happened after we upgraded from kernel 4.15.0-154 to kernel 4.15.0-156 as part of a regular update to the Canonical-included packages that we use as a base for our system. Based on the evidence, we believe that this is a regression from kernel version 4.15.0-154.

This storm prevents new media inserted into the drive from being recognized and mounted. In order for any new media inserted into the drive to be recognized, one must terminate the storm by doing one of two things:

Either:
Move `/lib/udev/cdrom_id` to some other location, wait for a moment for the storm to stop, then move it back. Media inserted into the drive will be immediately recognized.
or:
Insert the media several times until it stays inserted. In testing "several" has been "three or four", but I have no idea of the upper or lower bounds of this number. Each time you insert the media, it will be automatically removed from the VM. On the insertion immediately _before_ the one that actually works, the event storm will stop, and the media will be -again- automatically removed. On the next insertion, the media will stay in, and you will be able to actually mount the media inside the VM.

When you look at the output of `udevadm monitor -u -k` during the event storm, the following pair of events is generated at a rate of roughly ten times per second:

KERNEL[3198.914510] change /devices/pci0000:00/0000:00:07.1/ata1/host0/target0:0:0/0:0:0:0/block/sr0 (block)
UDEV [3199.021261] change /devices/pci0000:00/0000:00:07.1/ata1/host0/target0:0:0/0:0:0:0/block/sr0 (block)
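
To put a number on "roughly ten times per second", the monitor output can be parsed and the rate of kernel "change" events for the drive computed. The following is a minimal Python sketch, not part of the original report. The names `parse_event` and `change_rate` are hypothetical, and the line format is assumed to match the two sample lines above.

```python
import re

# Matches lines printed by `udevadm monitor -u -k`, for example:
#   KERNEL[3198.914510] change /devices/.../block/sr0 (block)
#   UDEV [3199.021261] change /devices/.../block/sr0 (block)
EVENT_RE = re.compile(
    r"^(?P<source>KERNEL|UDEV)\s*\[(?P<ts>\d+\.\d+)\]\s+"
    r"(?P<action>\S+)\s+(?P<devpath>\S+)\s+\((?P<subsystem>[^)]+)\)\s*$"
)


def parse_event(line):
    """Return a dict describing one monitor line, or None if it is not an event."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        return None
    return {
        "source": m.group("source"),
        "ts": float(m.group("ts")),
        "action": m.group("action"),
        "devpath": m.group("devpath"),
        "subsystem": m.group("subsystem"),
    }


def change_rate(lines, source="KERNEL", dev_suffix="/block/sr0"):
    """Return 'change' events per second for one device.

    Returns 0.0 when fewer than two matching events are present, because
    a rate cannot be estimated from a single timestamp.
    """
    stamps = []
    for line in lines:
        ev = parse_event(line)
        if ev is None:
            continue
        if (ev["source"] == source and ev["action"] == "change"
                and ev["devpath"].endswith(dev_suffix)):
            stamps.append(ev["ts"])
    if len(stamps) < 2:
        return 0.0
    span = stamps[-1] - stamps[0]
    if span <= 0:
        return 0.0
    # N events span N-1 intervals.
    return (len(stamps) - 1) / span
```

One way to use it is to save a few seconds of `udevadm monitor -u -k` output to a file and pass the file's lines to `change_rate`. A sustained rate well above zero after an eject would be consistent with the storm described here.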

Relevant system information:
OS: Ubuntu Xenial ESM
IAAS: vSphere 7.0
uname -a output: Linux test-machine 4.15.0-156-generic #163~16.04.1-Ubuntu SMP Mon Aug 23 13:38:23 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Likely Reproduction Steps:
1) Boot a machine (virtual or otherwise) with an attached DVD-ROM drive with the 4.15.0-156-generic kernel.
2) Insert media into the drive.
3) Eject the media.
4) Notice that `udevadm monitor -u -k` is showing you an endless stream of "device change" events.

If you need any additional information or resources from us, please don't hesitate to ask. I expect that if you folks need a vSphere environment, we'll be able to provide one.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1944642

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Tim Gardner (timg-tpi)
tags: added: bot-stop-nagging
Tim Gardner (timg-tpi) wrote :

Hi Kenneth - Since you already have the environment and reproducer, it would likely be faster if I build bisect kernels to narrow down the offending commit. It will probably take 6-8 attempts. Do you have the time for that?

Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Kenneth Lakin (VMware) (klakin-vmware) wrote :

Hello Tim!

I'm familiar with using 'git bisect' to track down bugs, but not so familiar with working with Debian packages. (My primary Linux is Gentoo, and I don't _usually_ deal directly with Debian packages at work.) Will the process for this be:

You build some number of kernel package .debs and send them to me, along with instructions on how to correctly switch to each one and in which order I should switch. I then tell you between which two the bug appears?

If yes (or something pretty similar), yeah, I have time for that.

Tim Gardner (timg-tpi) wrote :

Hi Kenneth - yes, I'll build kernels and you tell me which ones are good or bad. Here is the first kernel:

wget https://kernel.ubuntu.com/~rtg/dvd-event-storm-lp1944642/4.15.0-155.162_1.da738569aabed1847bf462f9b75c3d952fd37cad/amd64/linux-image-unsigned-4.15.0-155-generic_4.15.0-155.162~1.da738569aabed1847bf462f9b75c3d952fd37cad_amd64.deb
wget https://kernel.ubuntu.com/~rtg/dvd-event-storm-lp1944642/4.15.0-155.162_1.da738569aabed1847bf462f9b75c3d952fd37cad/amd64/linux-modules-4.15.0-155-generic_4.15.0-155.162~1.da738569aabed1847bf462f9b75c3d952fd37cad_amd64.deb
sudo dpkg -i linux-image*.deb linux-modules*.deb
sudo reboot

Make sure you're running the right version by checking 'uname -a', e.g.,

Linux ip-172-31-3-147 4.15.0-155-generic #162~1.da738569aabed1847bf462f9b75c3d952fd37cad SMP Thu Sep 23 1 x86_64 x86_64 x86_64 GNU/Linux

Tim Gardner (timg-tpi) wrote :

Kenneth - I'm anticipating that you might have version issues with your installed kernels. Here is the same kernel but with a higher version number so that it should be the default boot kernel:

wget https://kernel.ubuntu.com/~rtg/dvd-event-storm-lp1944642/4.15.0-160.0_1.da738569aabed1847bf462f9b75c3d952fd37cad/amd64/linux-image-unsigned-4.15.0-160-generic_4.15.0-160.0~1.da738569aabed1847bf462f9b75c3d952fd37cad_amd64.deb
wget https://kernel.ubuntu.com/~rtg/dvd-event-storm-lp1944642/4.15.0-160.0_1.da738569aabed1847bf462f9b75c3d952fd37cad/amd64/linux-modules-4.15.0-160-generic_4.15.0-160.0~1.da738569aabed1847bf462f9b75c3d952fd37cad_amd64.deb
sudo dpkg -i linux-image*.deb linux-modules*.deb
sudo reboot

I will take care that subsequent test kernels have an increasing version number so that once installed they always become the default boot kernel.

Kenneth Lakin (VMware) (klakin-vmware) wrote :

Thanks very much for the info and assistance.

I won't have the time to get to this today, but absolutely will have the time tomorrow. Hopefully that works with your schedule.

Matthew Ruffell (mruffell) wrote :

Hi Kenneth,

Thank you for taking the time to report the issue. In the future, please feel free to file a case through the support portal if you find an issue that needs to be fixed urgently in ESM.

In any case, I did some investigating, and found the below commit was backported to 4.15.0-155-generic:

commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3
Author: ManYi Li <email address hidden>
Date: Fri Jun 11 17:44:02 2021 +0800
Subject: scsi: sr: Return appropriate error code when disk is ejected
Link: https://github.com/torvalds/linux/commit/7dd753ca59d6c8cc09aa1ed24f7657524803c7f3

It landed in a few other kernels too, the full list is:

Bionic 4.15.0-155-generic
Focal 5.4.0-82-generic
Hirsute 5.11.0-32-generic
Impish 5.13.0-1-generic

I believe this is the root cause, and it lines up with your 4.15.0-154-generic (good) and 4.15.0-156-generic (bad) findings.

This has been fixed upstream, by the below commit:

commit 5c04243a56a7977185b00400e59ca7e108004faf
Author: Li Manyi <email address hidden>
Date: Mon Jul 26 19:49:13 2021 +0800
Subject: scsi: sr: Return correct event when media event code is 3
Link: https://github.com/torvalds/linux/commit/5c04243a56a7977185b00400e59ca7e108004faf

This commit has already been applied to all the Ubuntu kernels, which are currently sitting in -proposed:

Bionic 4.15.0-159-generic
Focal 5.4.0-87-generic
Hirsute 5.11.0-37-generic
Impish 5.13.0-16-generic
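
As a rough aid for checking a given machine, the two version lists above can be turned into a small lookup. The following is a minimal Python sketch, not an official tool. The names `SERIES` and `classify` are hypothetical, and it assumes that every build between the first regressing build and the first fixed build of a series carries the regression.

```python
import re

# Base version -> (first build with the regressing commit,
#                  first build with the fix),
# copied from the two lists in this comment.
SERIES = {
    "4.15.0": (155, 159),
    "5.4.0": (82, 87),
    "5.11.0": (32, 37),
    "5.13.0": (1, 16),
}

# Splits a release string such as "4.15.0-156-generic" into base and build number.
RELEASE_RE = re.compile(r"^(\d+\.\d+\.\d+)-(\d+)")


def classify(release):
    """Classify a kernel release string against the lists in this comment."""
    m = RELEASE_RE.match(release)
    if m is None or m.group(1) not in SERIES:
        return "unknown"
    build = int(m.group(2))
    first_bad, first_fixed = SERIES[m.group(1)]
    if build < first_bad:
        return "before regression"
    if build < first_fixed:
        # Assumption: intermediate builds have the regression but not the fix.
        return "likely affected"
    return "contains fix"
```

For the running kernel, `classify(platform.release())` (with `import platform`) gives the same answer as feeding it the output of `uname -r`. For example, `classify("4.15.0-156-generic")` returns "likely affected", which matches the kernel in the original report.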

Looking at https://kernel.ubuntu.com/, the current SRU cycle is scheduled to come to an end early next week, so we should expect a release to -updates on the 27th of September, give or take a few days if any CVEs turn up.

At the moment, it is not very straightforward to access the -proposed repo under ESM, but if you can also reproduce the issue under Bionic, feel free to enable -proposed and try 4.15.0-159-generic to confirm it fixes the issue.

https://wiki.ubuntu.com/Testing/EnableProposed

I set bug 1942299 to track this issue, so I am going to mark this one as a duplicate of it.

If you have any questions, feel free to write back, or open a case on the support portal.

Thanks,
Matthew

Matthew Ruffell (mruffell) wrote :

Hi Kenneth,

The 4.15.0-159-generic kernel has now landed in Xenial ESM, and it contains the fix "scsi: sr: Return correct event when media event code is 3".

If you could install that kernel and double check that it fixes the issue, that would be fantastic.

If you need anything else, let us know.

Thanks,
Matthew

Kenneth Lakin (VMware) (klakin-vmware) wrote :

Looks like that kernel fixes our issues. Thanks much for all the assistance.
