Tight timeout for bcache removal causes spurious failures

Bug #1796292 reported by Peter Sabaini
This bug affects 6 people
Affects           Status         Importance   Assigned to
curtin            Fix Released   High         Andrea Righi
linux (Ubuntu)    Fix Released   Undecided    Unassigned
  Bionic          Fix Released   Undecided    Unassigned
  Cosmic          Confirmed      Undecided    Unassigned
  Disco           Confirmed      Undecided    Unassigned
  Eoan            Fix Released   Undecided    Unassigned

Bug Description

I've had a number of deployment faults where curtin would report "Timeout exceeded for removal of /sys/fs/bcache/xxx" when doing a mass deployment of 30+ nodes. Upon retrying, the node would usually deploy fine. Experimentally I've set the timeout ridiculously high, and I'm seeing no faults with it. I'm wondering if the removal timeout is set too tight, or needs to be made configurable.

--- curtin/util.py~ 2018-05-18 18:40:48.000000000 +0000
+++ curtin/util.py 2018-10-05 09:40:06.807390367 +0000
@@ -263,7 +263,7 @@
     return _subp(*args, **kwargs)

-def wait_for_removal(path, retries=[1, 3, 5, 7]):
+def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
     if not path:
         raise ValueError('wait_for_removal: missing path parameter')

Revision history for this message
Scott Moser (smoser) wrote :

Can we get console output of a failure, or a curtin log?

Also, I think the fix here would be better placed in shutdown_bcache, in its call to 'wait_for_removal'.

Revision history for this message
Ryan Harper (raharper) wrote :

Yeah, I think it makes sense for shutdown_bcache to adjust. Currently it does this:

# bcache device removal should be fast but in an extreme
# case, might require the cache device to flush large
# amounts of data to a backing device. The strategy here
# is to wait for approximately 30 seconds but to check
# frequently since curtin cannot proceed until devices
# cleared.
removal_retries = [0.2] * 150 # 30 seconds total

1200 seems extreme to me. Did you try anything lower than that?
I'd be more inclined to try 5 minutes...

remove_retries = [0.2] * 1500 # 300 seconds/5 min total

Or we could scale it up like:

remove_retries = ([0.2] * 150 + [20] * 15) # 30 seconds + 5 minutes
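
For context, here is a minimal sketch of the retry semantics being discussed, assuming wait_for_removal treats each entry of the retries list as a sleep interval between existence checks (consistent with the diff in the bug description and the "Timeout exceeded" error in the logs; not necessarily curtin's exact code):

    import os
    import time

    def wait_for_removal(path, retries=(1, 3, 5, 7)):
        # Each entry is a sleep interval between existence checks, so
        # the total wait is sum(retries).
        if not path:
            raise ValueError('wait_for_removal: missing path parameter')
        for naplen in retries:
            if not os.path.exists(path):
                return
            time.sleep(naplen)
        if not os.path.exists(path):
            return
        raise OSError('Timeout exceeded for removal of %s' % path)

    # shutdown_bcache could then pass a schedule that polls quickly at
    # first but tolerates a slow cache flush:
    # wait_for_removal('/sys/fs/bcache/xxx', retries=[0.2] * 150 + [20] * 15)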

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

It looks like we have this reproduced:

2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: Skipping logfile /tmp/curtin-logs.tar: file does not exist
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: Wrote: /tmp/curtin-logs.tar
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: Unexpected error while running command.
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: Command: ['curtin', 'block-meta', 'custom']
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: Exit code: 3
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: Reason: -
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: Stdout: [Errno Timeout exceeded for removal of %s] /sys/fs/bcache/aca640ad-b751-4bc0-bfbc-8181bde699a7
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]:
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: Stderr: ''
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:final' at Fri, 14 Dec 2018 21:52:01 +0000. Up 37.78 seconds.
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: 2018-12-14 21:52:55,936 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [3]
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: 2018-12-14 21:52:55,948 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2018-12-14T21:52:55+00:00 ln-sv-ostk02 cloud-init[2467]: 2018-12-14 21:52:55,958 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed

tags: added: cpe-onsite
Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

Thanks for the log. If it's reproducible, can you re-run the deployment with curtin verbosity set?

And can you attach the configuration that you're deploying? I'm especially
interested in the number of bcache backing devices and how many cache
devices.

% maas {profile} maas set-config name=curtin_verbose value=true

https://discourse.maas.io/t/getting-curtin-debug-logs/169


Changed in curtin:
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Hi Ryan,

curtin config: https://pastebin.canonical.com/p/s5wpt8YJt3/

Curtin logs are empty:

ubuntu@ln-sv-infr01:~$ maas admin node-results read system_id=bfdbg4 name=/tmp/curtin.tar
Success.
Machine-readable output follows:
[]

However, they are available in the MAAS UI, and I've exported them: https://pastebin.canonical.com/p/TnD65J9jS8/

This part looks like the problematic one:

Error rescanning devices, possibly known issue LP: #1489521
cmd: ['blockdev', '--rereadpt', '/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd', '/dev/sde', '/dev/sdf', '/dev/sdg', '/dev/sdh']
stdout:''
stderr:blockdev: ioctl error on BLKRRPART: Device or resource busy

Thoughts?

Jeff Lane  (bladernr)
Changed in curtin:
status: Incomplete → New
Revision history for this message
Ryan Harper (raharper) wrote :

Thank you for the config and an install log.

You don't happen to have a verbose curtin install log where it times out (the failure path)?

The error on rescanning partitions is a side effect of one or more bcache devices not being shut down properly: the leftover holder keeps a claim on the underlying device, which prevents opening it exclusively, hence the error. However, the failure occurs before this point.
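
As an illustration of that hold: an exclusive open of a block device fails with EBUSY while a holder such as an un-stopped bcache still claims it, which is essentially what blockdev runs into. A hypothetical check (not curtin code):

    import errno
    import os

    def is_device_free(devpath):
        # An exclusive open of a block device fails with EBUSY while any
        # holder (a mount, an un-stopped bcache, device mapper, ...)
        # still claims it.
        try:
            fd = os.open(devpath, os.O_RDONLY | os.O_EXCL)
        except OSError as e:
            if e.errno == errno.EBUSY:
                return False
            raise
        os.close(fd)
        return True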

Getting the verbose logging from curtin (like you did for this successful install) will help me understand where the bug occurred.

Changed in curtin:
status: New → Incomplete
Revision history for this message
John George (jog) wrote :

@Ryan we're hitting this issue during solutions QA test runs. The latest recreate is here:
https://solutions.qa.canonical.com/#/qa/testRun/4b4c04da-0178-4bcd-8a4b-7da81f3ed6bc

There is a link at the bottom of that page to all the test artefacts, which includes the maas logs at this link:
https://oil-jenkins.canonical.com/artifacts/4b4c04da-0178-4bcd-8a4b-7da81f3ed6bc/generated/maas/logs-2019-01-17-03.39.13.tar

Revision history for this message
Ryan Harper (raharper) wrote :

Can you narrow down which systems I should be looking at?


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here is another reproduction:

http://paste.ubuntu.com/p/zfXRsp8rbK/

Here is the curtin config:

http://paste.ubuntu.com/p/BTQRDVk7SB/

Changed in curtin:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Sub'd to field-high, as we've hit this 4 times in CI; it seems related to bug 1808231 and bug 1811117, and we don't have a workaround.

Revision history for this message
Ryan Harper (raharper) wrote :

Do you turn on curtin_verbose during the QA runs? If not, can we ensure we get a reproduction with curtin_verbose enabled?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Yes, we do turn on curtin_verbose.


Revision history for this message
Ryan Harper (raharper) wrote :

@jhobbs

I see that now; I missed the curtin section since the log had more than just the single curtin install boot.

I think I see something; but since I've never been able to reproduce this in our integration harness it's been hard to diagnose.

The Shutdown Plan for cleaning devices shows that it has detected bcache0p1 as a 'bcache' device when it's really a 'partition', which should only trigger a wipe of the partition; the actual bcache device is the parent (bcache0).

Shutdown Plan:
{'level': 3, 'dev_type': 'bcache', 'device': '/sys/class/block/bcache0/bcache0p1'}
{'level': 2, 'dev_type': 'bcache', 'device': '/sys/class/block/bcache1'}
{'level': 2, 'dev_type': 'bcache', 'device': '/sys/class/block/bcache0'}
{'level': 1, 'dev_type': 'partition', 'device': '/sys/class/block/nvme0n1/nvme0n1p2'}
{'level': 1, 'dev_type': 'partition', 'device': '/sys/class/block/sda/sda1'}
{'level': 1, 'dev_type': 'partition', 'device': '/sys/class/block/sda/sda2'}
{'level': 1, 'dev_type': 'partition', 'device': '/sys/class/block/nvme0n1/nvme0n1p1'}
{'level': 1, 'dev_type': 'partition', 'device': '/sys/class/block/sda/sda3'}
{'level': 0, 'dev_type': 'disk', 'device': '/sys/class/block/sda'}
{'level': 0, 'dev_type': 'disk', 'device': '/sys/class/block/nvme0n1'}
{'level': 0, 'dev_type': 'disk', 'device': '/sys/class/block/sdb'}

This then leads us to attempt to write to bcache0/bcache0p1/bcache, which does not exist (it's a partition, not a bcache device):

Error writing to bcache stop file /sys/class/block/bcache0/bcache0p1/bcache/stop, device removed: [Errno 1] Operation not permitted: '/sys/class/block/bcache0/bcache0p1/bcache'
waiting for /sys/class/block/bcache0/bcache0p1 to be removed
sleeping 0.2

We repeat and fail each time.

With that in mind, I've updated the fix for bcache-partitions with a change that prevents partitions on top of bcache devices from being classified as 'bcache' devices. This will fix the issue that you reproduced, but it's not the original issue (which didn't involve bcache devices with partitions).
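
For illustration, the classification change could look roughly like this; a sketch against the sysfs layout only, with hypothetical names (not curtin's actual clear_holders code):

    import os

    def device_type(sysfs_path):
        # Check for the 'partition' attribute first: the basename of
        # /sys/class/block/bcache0/bcache0p1 also starts with 'bcache',
        # but the 'partition' attribute marks it as a partition, so it
        # should be wiped rather than stopped as a bcache device.
        if os.path.exists(os.path.join(sysfs_path, 'partition')):
            return 'partition'
        if os.path.basename(sysfs_path).startswith('bcache'):
            return 'bcache'
        return 'disk'

    # /sys/class/block/bcache0/bcache0p1 -> 'partition' (wipe only)
    # /sys/class/block/bcache0           -> 'bcache' (stop the device)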

So, with the updated curtin (18.2-0ubuntu10~clear-holders-bcache-partitions-lp1811117-ppa10) in the ppa:raharper/bugfixes, please attempt to reproduce.

Changed in curtin:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We've had a number of successful test runs with this now, and no more failures, so I think it's fixed.

Changed in curtin:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit another failure that could possibly be this. http://paste.ubuntu.com/p/pvDmP2B8Q2/

Revision history for this message
Ryan Harper (raharper) wrote :

Hi,

Can you provide the curtin-logs.tar from one of these failures?

maas {profile} node-results read system_id={system_id} name=/tmp/curtin-logs.tar | jq -r .[0].data | base64 -d > curtin-logs.tar

https://discourse.maas.io/t/getting-curtin-debug-logs/169

And if possible, can you test with curtin trunk (daily ppa)?

https://launchpad.net/~curtin-dev/+archive/ubuntu/daily


Changed in curtin:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I've attached curtin-logs from a failure. This was with curtin from proposed - 18.2-10-g7afd77fa-0ubuntu1~18.04.1

Changed in curtin:
status: Incomplete → New
Revision history for this message
Ryan Harper (raharper) wrote :

Thanks.

Do we have logs from the ceph charm, which creates those ceph logical volumes?

I'd like to recreate those LVs on top of the instance so I can have an instance with the same setup.

Changed in curtin:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

There are logs of that creation here:

https://oil-jenkins.canonical.com/artifacts/0cef245b-9ce0-4dec-8ee6-5fef94433267/generated/generated/openstack/juju-crashdump-openstack-2019-03-15-12.01.32.tar.gz

The logs for unit ceph-osd/0 are in ceph-osd_0/var/log/; for example, ceph-osd_0/var/log/ceph/ceph-volume.log has details about the LVM creation.

Changed in curtin:
status: Incomplete → New
Revision history for this message
Ryan Harper (raharper) wrote :

I've seen this failure in another scenario, tracked in this private bug (LP: #1815018); so I'm copying in the portion that's relevant here.

--

The failure you saw in the 1-in-30 case I now believe is related to the time it takes to flush the cache device. Curtin currently finds a bcache's cacheset device and stops that first. In your ceph deployment scenario, each cache device has 3 backing devices being cached, which may hold a large amount of dirty data that needs to be flushed; this is where the longer timeout initially mentioned in LP: #1796292 helped.

Curtin will now do the following for stopping bcache devices.

1. wipe the bcache device contents
2. extract the cacheset uuid
3. extract the backing device
4. detach bcache from cacheset
5. stop the bcacheN device
6. wait for removal of sysfs path to bcacheN, bcacheN/bcache and
   backing/bcache to go away
7. Check how many other backing devices are attached to
   cset_uuid, if zero, stop cset

Notably, at step 4 we will monitor the bcache device's state and wait until the cacheset is no longer attached. Then at step 7, we only remove a cache device once all of the devices it was caching have been stopped.
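
A condensed sketch of that sequence driven through sysfs; the detach/stop/state attributes come from the bcache sysfs ABI, while the helper names and polling deadlines are hypothetical (this is not curtin's implementation):

    import os
    import time

    def read_attr(bdir, name):
        with open(os.path.join(bdir, name)) as f:
            return f.read().strip()

    def stop_bcache(bcache_sysfs, deadline=300):
        # bcache_sysfs is e.g. '/sys/class/block/bcache0'
        bdir = os.path.join(bcache_sysfs, 'bcache')
        # steps 2/3: resolve the cacheset via the 'cache' symlink
        # (assuming the attached device exposes one)
        cset = os.path.realpath(os.path.join(bdir, 'cache'))
        # step 4: detach; the kernel flushes dirty data back to the
        # backing device, so poll 'state' until it reads 'no cache'
        with open(os.path.join(bdir, 'detach'), 'w') as f:
            f.write('1')
        end = time.time() + deadline
        while read_attr(bdir, 'state') != 'no cache':
            if time.time() > end:
                raise OSError('timed out waiting for cacheset detach')
            time.sleep(0.2)
        # step 5: stop the bcacheN device itself
        with open(os.path.join(bdir, 'stop'), 'w') as f:
            f.write('1')
        # step 6: wait for the sysfs path to disappear
        while os.path.exists(bcache_sysfs):
            time.sleep(0.2)
        # step 7 is left to the caller: stop the cacheset once no other
        # backing devices remain attached to it
        return cset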

[ 83.467348] cloud-init[1081]: shutdown running on holder type: 'bcache' syspath: '/sys/class/block/bcache0'
[ 83.470549] cloud-init[1081]: Wiping superblock on bcache device: /sys/class/block/bcache0
[ 83.472919] cloud-init[1081]: wiping superblock on /dev/bcache0
[ 83.474801] cloud-init[1081]: wiping /dev/bcache0 attempt 1/4
[ 83.477077] cloud-init[1081]: wiping 1M on /dev/bcache0 at offsets [0, -1048576]
[ 83.479757] cloud-init[1081]: successfully wiped device /dev/bcache0 on attempt 1/4
[ 83.481915] cloud-init[1081]: os.path.exists on blockdevs:
[ 83.484922] cloud-init[1081]: [('/sys/class/block/bcache0/bcache', True), ('/sys/class/block/vda/vda1/bcache', True)]
[ 83.489019] cloud-init[1081]: bcache: detaching /sys/class/block/bcache0 from cacheset 7d2a9905-bd60-4db1-a8c8-12ca4ac90d45
[ 83.492648] cloud-init[1081]: /sys/class/block/bcache0 waiting up to 300s for cacheset to detach
[ 83.496051] cloud-init[1081]: /sys/class/block/bcache0 cset detach check=0 state='dirty' dirty_data='1.9M'
[ 85.372738] cloud-init[1081]: /sys/class/block/bcache0 cset detach check=1 state='no cache' dirty_data='0.0k'
[ 85.373121] cloud-init[1081]: /sys/class/block/bcache0 successfully detached from cacheset 7d2a9905-bd60-4db1-a8c8-12ca4ac90d45
[ 85.377747] cloud-init[1081]: stopping bcache backing device at: /sys/class/block/bcache0/bcache
[ 85.379303] cloud-init[1081]: waiting for /sys/class/block/bcache0 to be removed
[ 85.380359] cloud-init[1081]: sleeping 0.2
[ 85.569818] cloud-init[1081]: /sys/class/block/bcache0 has been removed
[ 85.571160] cloud-init[1081]: waiting for /sys/class/block/bcache0/bcache to be removed
[ 85.571727] cloud-init[1081]: /sys/class/block/bcache0/bcache has been removed
[ 85.576889] cloud-init[1081]: waiting for /sys/class/block/vda/vda1/bcache to be removed
[ 85.580122] cloud-init[1081]: /sys/class/block/vda/vda1/bcache has been removed
[ 85.584239] cloud-init[1081]: Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
[ ...


Changed in curtin:
status: New → In Progress
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit 36351dea to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=36351dea

Changed in curtin:
status: In Progress → Fix Committed
Revision history for this message
Ryan Harper (raharper) wrote :

Jason,

Would you be able to give the current curtin-daily ppa build a test? I believe we've got this issue fixed but would like some feedback/results on this fix before we start an SRU.

ppa:curtin-dev/daily

Revision history for this message
Ryan Harper (raharper) wrote :

I've copied what's in disco into this ppa, so it won't change out from under you while testing.

https://launchpad.net/~raharper/+archive/ubuntu/curtin-dev/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this issue with the version from the curtin-dev ppa, so it is apparently not yet resolved:

18.2-19-g36351dea-0ubuntu1

Changed in curtin:
status: Fix Committed → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In comment 24, the failure was on spearow; unpack the logs for 10.244.40.32 and it's at:

./10.244.40.32/var/log/maas/rsyslog/spearow/2019-01-15/messages

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

err, ./10.244.40.32/var/log/maas/rsyslog/spearow/2019-04-25/messages

Revision history for this message
Ryan Harper (raharper) wrote :

After looking at the logs, I believe curtin is doing all that it can to shut down the bcache device.

Waiting 1200 seconds is more than reasonable, so I believe there is something else going on with the bcache device in the kernel at this point. I would like to either open a task against the kernel or start a new bug with these details.

Some conversation from IRC for context.

<rharper> that's a lot of seconds to wait
<rharper> I wonder if there's a kernel bug in there; is it reasonable to wait 2400 seconds? 3600 seconds?
<jhobbs> no
<rharper> jhobbs: do you know if we capture dmesg from the host?
<jhobbs> that's all ridiculous
<jhobbs> for nvme...
<rharper> right; so it smells like a bug, or deadlock in bcache itself; I found *many* of those while working on a more reliable way to stop them
<rharper> jhobbs: so I'm generally happy with the curtin code; I think we're doing a reliable job of shutting them down and waiting a more than reasonable amount of time for the device to stop at this point. We may need to open up a different/new issue against the kernel to see if we can get to the bottom of why it isn't shutting down;
<jhobbs> rharper: that log is from the node logging syslog to maas, we don't have any hooks in there to get dmesg during install
<rharper> jhobbs: ok

Revision history for this message
James Troup (elmo) wrote :

https://patchwork.kernel.org/patch/10909311/

May or may not be relevant.

Revision history for this message
Joshua Powers (powersj) wrote :

Adding an 'also affects' task for the linux package.

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1796292

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This occurs on a target machine during MAAS install. Apport output is not collected in this case.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Trent Lloyd (lathiat) wrote :

I have been running into this (curtin 18.1-17-gae48e86f-0ubuntu1~16.04.1).

I think this commit basically agrees with my thoughts, but I just wanted to share them explicitly in case they are interesting:

 (1) If you *unregister* the cache device from the backing device, it first has to purge all the dirty data back to the backing device. This may obviously take a while.

 (2) When doing that, I managed to deadlock bcache at least once on xenial-hwe 4.15, where it was trying to reclaim memory from XFS, which I assume was trying to write to the bcache. Traceback: https://pastebin.canonical.com/117528/ - you can't get out of that without a reboot.

 (3) However, generally I had good luck simply "stop"ping the cache devices (it seems perhaps that is what the fix in this bug is designed to do: switch to stop instead of unregister?). Specifically though, I was stopping the backing devices and then later the cache device. It seems like the current commit is the other way around?

tags: added: sts
Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures

On Wed, May 8, 2019 at 11:55 PM Trent Lloyd <email address hidden>
wrote:

> I have been running into this (curtin 18.1-17-gae48e86f-
> 0ubuntu1~16.04.1)
>
> I think this commit basically agrees with my thoughts but I just wanted
> to share them explicitly in case they are interesting
>
> (1) If you *unregister* the cache device from the backing device, it
> first has to purge all the dirty data back to the backing device. This
> may obviously take a while.
>
> (2) When doing that, I managed to deadlock bcache at least once on
> xenial-hwe 4.15 where it was trying to reclaim memory from XFS, which I
> assume was trying to write to the bcache.. traceback:
> https://pastebin.canonical.com/117528/ - you can't get out of that
> without a reboot
>

Thanks for capturing those; I've quite a few of my own, as the unregister path _should_ work but doesn't, due to various bugs in bcache. I need to attach those oopses to this bug as well.

>
> (3) However generally I had good luck simplying "stop"ing the cache
> devices (it seems perhaps that is what this bug is designed to do,
> switch to stop, instead of unregister?). Specifically though I was
> stopping the backing devices, and then later the cache device. It seems
> like the current commit is the other way around?
>

Unregister is just not stable, so stopping is what is being done now.

I did attempt stopping bcache devices first, and only once all bcache devices were stopped would I then stop and remove the cacheset; this proved unreliable under our integration testing of various bcache scenarios.


Revision history for this message
Ryan Harper (raharper) wrote :

Xenial GA kernel bcache unregister oops:

http://paste.ubuntu.com/p/BzfHFjzZ8y/

Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

@jhobbs

Here is the script that cleans up bcache devices on recommission:
https://pastebin.ubuntu.com/p/6WCGvM4Q32/

tags: added: cdo-qa foundations-engine
Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

I was looking into kernel commits and I came across this: https://github.com/torvalds/linux/commit/fadd94e05c02afec7b70b0b14915624f1782f578

So, as far as I understood, it deals with a manual device detach during writeback cleanup causing a deadlock. The timeline makes sense when we look at the Bionic GA kernel as well: Bionic GA should not include this fix, but HWE should.

Could we run the tests again, but focusing on Bionic HWE? AFAIR HWE runs 4.18, which should include this commit.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

This script *should* trigger the issue on Bionic GA: https://pastebin.ubuntu.com/p/WdKGbMWnM6/
Try it with both the GA and HWE Bionic kernels; HWE carries the commit, so it should not trigger there.

Revision history for this message
Dan Watkins (oddbloke) wrote : Fixed in curtin version 19.1.

This bug is believed to be fixed in curtin version 19.1. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in curtin:
status: New → Fix Released
Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

Is there an estimate on getting this package into bionic-updates, please?

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1796292] Re: Tight timeout for bcache removal causes spurious failures


We are starting an SRU of curtin this week. SRUs take at least 7 days from when they hit -proposed, possibly longer depending on test results.

I should have something up in -proposed this week and we'll go from there on testing.


Revision history for this message
Terry Rudd (terrykrudd) wrote :

The Canonical kernel team has this item queued on the hotlist. I am assigning it to myself to accelerate the work.

Changed in curtin:
assignee: nobody → Terry Rudd (terrykrudd)
Revision history for this message
Andrea Righi (arighi) wrote :

From a kernel perspective, this severe slowness when shutting down a bcache volume might be caused by a locking/race condition issue. If I read correctly, this problem has been reproduced on bionic (and on xenial we even got a kernel oops, apparently caused by a NULL pointer dereference). I would try to address these issues separately.

About bionic it would be nice to test this commit (also mentioned by @elmo in comment #28):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb8cbb6df38f6e5124a3d5f1f8a3dbf519537c60

Moreover, even if we didn't get an explicit NULL pointer dereference with bionic, I think it would be interesting to test also the following fixes:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4b732a248d12cbdb46999daf0bf288c011335eb
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f0ffa67349c56ea54c03ccfd1e073c990e7411e
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9951379b0ca88c95876ad9778b9099e19a95d566

I've already backported all of them and applied them to the latest bionic kernel. A test kernel is available here:

https://kernel.ubuntu.com/~arighi/LP-1796292/

If it doesn't cost too much it would be great to do a test with it. In the meantime I'll try to reproduce the problem locally. Thanks in advance!

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is difficult for us to test in our lab because we are using MAAS, and we hit this during MAAS deployments of nodes, so we would need MAAS images built with these kernels. Additionally, this doesn't reproduce every time; it's maybe 1 in 4 test runs. It may be best to find a way to reproduce this outside of MAAS.


Revision history for this message
Ryan Harper (raharper) wrote :

I've set up our integration test that runs the CDO-QA bcache/ceph setup.

On the updated kernel I got through 10 loops of the deployment before it stacktraced:

http://paste.ubuntu.com/p/zVrtvKBfCY/

[ 3939.846908] bcache: bch_cached_dev_attach() Caching vdd as bcache5 on set 275985b3-da58-41f8-9072-958bd960b490
[ 3939.878388] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3939.904984] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3939.972715] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3940.059415] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3940.129854] bcache: register_bcache() error /dev/vdd: device already registered (emitting change event)
[ 3949.257051] md: md0: resync done.
[ 4109.273558] INFO: task python3:19635 blocked for more than 120 seconds.
[ 4109.279331] Tainted: P O 4.15.0-55-generic #60+lp796292
[ 4109.284767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4109.288771] python3 D 0 19635 16361 0x00000000
[ 4109.288774] Call Trace:
[ 4109.288818] __schedule+0x291/0x8a0
[ 4109.288822] ? __switch_to_asm+0x34/0x70
[ 4109.288824] ? __switch_to_asm+0x40/0x70
[ 4109.288826] schedule+0x2c/0x80
[ 4109.288852] bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 4109.288866] ? wait_woken+0x80/0x80
[ 4109.288872] __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[ 4109.288876] bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 4109.288882] __uuid_write+0x59/0x150 [bcache]
[ 4109.288895] ? submit_bio+0x73/0x140
[ 4109.288900] ? __write_super+0x137/0x170 [bcache]
[ 4109.288905] bch_uuid_write+0x16/0x40 [bcache]
[ 4109.288911] __cached_dev_store+0x1a1/0x6d0 [bcache]
[ 4109.288916] bch_cached_dev_store+0x39/0xc0 [bcache]
[ 4109.288992] sysfs_kf_write+0x3c/0x50
[ 4109.288998] kernfs_fop_write+0x125/0x1a0
[ 4109.289001] __vfs_write+0x1b/0x40
[ 4109.289003] vfs_write+0xb1/0x1a0
[ 4109.289004] SyS_write+0x55/0xc0
[ 4109.289010] do_syscall_64+0x73/0x130
[ 4109.289014] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 4109.289016] RIP: 0033:0x7f8d2833e154
[ 4109.289018] RSP: 002b:00007ffcda55a4e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 4109.289020] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f8d2833e154
[ 4109.289021] RDX: 0000000000000008 RSI: 00000000022b7360 RDI: 0000000000000003
[ 4109.289022] RBP: 00007f8d288396c0 R08: 0000000000000000 R09: 0000000000000000
[ 4109.289022] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[ 4109.289026] R13: 0000000000000000 R14: 00000000022b7360 R15: 0000000001fe8db0
[ 4109.289033] INFO: task bcache_allocato:22317 blocked for more than 120 seconds.
[ 4109.292172] Tainted: P O 4.15.0-55-generic #60+lp796292
[ 4109.295345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4109.298208] bcache_allocato D 0 22317 2 0x80000000
[ 4109.298212] Call Trace:
[ 4109.298217] __schedule+0x291/0x8a0
[ 4109.298225] schedule+0x2c/0x80
[ 4109.298232] bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 4109.2982...


Revision history for this message
Ryan Harper (raharper) wrote :

Without the patch, I can reproduce the hang fairly frequently, in one or two loops, which fails in this way:

[ 1069.711956] bcache: cancel_writeback_rate_update_dwork() give up waiting for dc->writeback_write_update to quit
[ 1088.583986] INFO: task kworker/0:2:436 blocked for more than 120 seconds.
[ 1088.590330] Tainted: P O 4.15.0-54-generic #58-Ubuntu
[ 1088.595831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.598210] kworker/0:2 D 0 436 2 0x80000000
[ 1088.598244] Workqueue: events update_writeback_rate [bcache]
[ 1088.598246] Call Trace:
[ 1088.598255] __schedule+0x291/0x8a0
[ 1088.598258] ? __switch_to_asm+0x40/0x70
[ 1088.598260] schedule+0x2c/0x80
[ 1088.598262] schedule_preempt_disabled+0xe/0x10
[ 1088.598264] __mutex_lock.isra.2+0x18c/0x4d0
[ 1088.598266] ? __switch_to_asm+0x34/0x70
[ 1088.598268] ? __switch_to_asm+0x34/0x70
[ 1088.598270] ? __switch_to_asm+0x40/0x70
[ 1088.598272] __mutex_lock_slowpath+0x13/0x20
[ 1088.598274] ? __mutex_lock_slowpath+0x13/0x20
[ 1088.598276] mutex_lock+0x2f/0x40
[ 1088.598281] update_writeback_rate+0x98/0x2b0 [bcache]
[ 1088.598285] process_one_work+0x1de/0x410
[ 1088.598287] worker_thread+0x32/0x410
[ 1088.598289] kthread+0x121/0x140
[ 1088.598291] ? process_one_work+0x410/0x410
[ 1088.598293] ? kthread_create_worker_on_cpu+0x70/0x70
[ 1088.598295] ret_from_fork+0x35/0x40

Andrea Righi (arighi)
Changed in curtin:
assignee: Terry Rudd (terrykrudd) → Andrea Righi (arighi)
Revision history for this message
Andrea Righi (arighi) wrote :

Thanks tons for the tests Ryan! Well, at least the hung task timeout trace is different, so we're making some progress.

With the new kernel it seems that we're stuck in bch_bucket_alloc(). I've identified other upstream fixes that could help to prevent this problem.

If you're willing to do a few more tests, here's a new test kernel (based on 4.15.0-54-generic + a set of bcache upstream fixes):

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-54.58+lp1796292/

And, just in case, I've also applied the same set of fixes to the latest bionic master-next:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-55.60+lp1796292/

Testing these two kernels should give us more information about the nature of the problem.

Revision history for this message
Andrea Righi (arighi) wrote :

Good news! I've been able to reproduce the bch_bucket_alloc() hung task issue locally, using the test case from bug 1784665. I think we're hitting the same problem now. I'll do more tests and will keep you updated.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Changed in linux (Ubuntu Cosmic):
status: New → Confirmed
Changed in linux (Ubuntu Disco):
status: New → Confirmed
Revision history for this message
Andrea Righi (arighi) wrote :

Hi Ryan, I've uploaded a new test kernel:
https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-54.58+lp1796292/

This one is based on 4.15.0-54.58 and specifically addresses the bch_bucket_alloc() problem (with this patch applied: https://lore.kernel.org/lkml/20190710093117.GA2792@xps-13/T/#u).

With this kernel I wasn't able to reproduce the hung task timeout issue in bch_bucket_alloc() anymore.

It would be great if you could repeat your test with this kernel as well. Thanks in advance!

Revision history for this message
Andrea Righi (arighi) wrote :

... and, just in case, I've also uploaded a test kernel based on the latest bionic master-next plus a bunch of extra bcache fixes:
https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-55.60+lp1796292+1/

If the previous kernel is still buggy, it'd be nice to try this one as well.

Revision history for this message
Ryan Harper (raharper) wrote :

Andrea, thanks for the updated kernels.

On the first one, I got 23 installs before I ran into an issue; I'll test the newer kernel next.

https://paste.ubuntu.com/p/2B4Kk3wbvQ/

[ 5436.870482] BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8
[ 5436.873374] IP: cache_set_flush+0xf6/0x190 [bcache]
[ 5436.875208] PGD 0 P4D 0
[ 5436.876488] Oops: 0000 [#1] SMP PTI
[ 5436.877842] Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) nls_utf8 isofs ppdev nls_iso8859_1 kvm_intel kvm irqbypass joydev input_leds serio_raw parport_pc parport qemu_fw_cfg mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi virtio_rng ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 bcache psmouse nvme nvme_core virtio_blk virtio_net virtio_scsi floppy i2c_piix4 pata_acpi
[ 5436.896104] CPU: 0 PID: 11216 Comm: kworker/0:1 Tainted: P O 4.15.0-54-generic #58+lp1796292+1
[ 5436.899985] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 5436.902645] Workqueue: events cache_set_flush [bcache]
[ 5436.904374] RIP: 0010:cache_set_flush+0xf6/0x190 [bcache]
[ 5436.906183] RSP: 0018:ffffab52826bbe58 EFLAGS: 00010202
[ 5436.909050] RAX: 0000000000000000 RBX: ffff94d104aa0cc0 RCX: 0000000000000000
[ 5436.911939] RDX: 0000000000000000 RSI: 0000000020000001 RDI: 0000000000000000
[ 5436.914448] RBP: ffffab52826bbe78 R08: ffff94d13f61ac30 R09: ffff94d13f342b98
[ 5436.917113] R10: ffffab52803b3d10 R11: 00000000000002c6 R12: 0000000000000001
[ 5436.919210] R13: ffff94d13f622140 R14: ffff94d104aa0db8 R15: 0000000000000000
[ 5436.921401] FS: 0000000000000000(0000) GS:ffff94d13f600000(0000) knlGS:0000000000000000
[ 5436.923743] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5436.926299] CR2: 00000000000009b8 CR3: 0000000038252000 CR4: 00000000000006f0
[ 5436.929093] Call Trace:
[ 5436.930818] process_one_work+0x1de/0x410
[ 5436.932818] worker_thread+0x32/0x410
[ 5436.935332] kthread+0x121/0x140
[ 5436.937309] ? process_one_work+0x410/0x410
[ 5436.939393] ? kthread_create_worker_on_cpu+0x70/0x70
[ 5436.941263] ret_from_fork+0x35/0x40
[ 5436.943060] Code: b8 00 00 00 a8 02 74 c8 31 f6 4c 89 e7 e8 43 0e ff ff eb bc 66 83 bb 4c f7 ff ff 00 48 8b 83 58 ff ff ff 74 31 41 bc 01 00 00 00 <48> 8b b8 b8 09 00 00 48 85 ff 74 05 e8 f9 9d 0d d3 0f b7 8b 4c
[ 5436.950188] RIP: cache_set_flush+0xf6/0x190 [bcache] RSP: ffffab52826bbe58
[ 5436.952796] CR2: 00000000000009b8
[ 5436.954567] ---[ end trace b771415397e98c3d ]---

Revision history for this message
Ryan Harper (raharper) wrote :

The newer kernel went about 16 runs and then popped this:

[ 2137.810559] md: md0: resync done.
[ 2296.795633] INFO: task python3:11639 blocked for more than 120 seconds.
[ 2296.800320] Tainted: P O 4.15.0-55-generic #60+lp1796292+1
[ 2296.805097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2296.810071] python3 D 0 11639 8301 0x00000000
[ 2296.810075] Call Trace:
[ 2296.810100] __schedule+0x291/0x8a0
[ 2296.810102] ? __switch_to_asm+0x34/0x70
[ 2296.810103] ? __switch_to_asm+0x40/0x70
[ 2296.810105] schedule+0x2c/0x80
[ 2296.810118] bch_bucket_alloc+0xe5/0x370 [bcache]
[ 2296.810128] ? wait_woken+0x80/0x80
[ 2296.810132] __bch_bucket_alloc_set+0x10d/0x160 [bcache]
[ 2296.810137] bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 2296.810143] __uuid_write+0x59/0x170 [bcache]
[ 2296.810148] ? __write_super+0x137/0x170 [bcache]
[ 2296.810153] bch_uuid_write+0x16/0x40 [bcache]
[ 2296.810159] __cached_dev_store+0x1e5/0x8b0 [bcache]
[ 2296.810160] ? __switch_to_asm+0x34/0x70
[ 2296.810161] ? __switch_to_asm+0x40/0x70
[ 2296.810163] ? __switch_to_asm+0x34/0x70
[ 2296.810167] bch_cached_dev_store+0x46/0x110 [bcache]
[ 2296.810181] sysfs_kf_write+0x3c/0x50
[ 2296.810182] kernfs_fop_write+0x125/0x1a0
[ 2296.810185] __vfs_write+0x1b/0x40
[ 2296.810187] vfs_write+0xb1/0x1a0
[ 2296.810189] SyS_write+0x55/0xc0
[ 2296.810193] do_syscall_64+0x73/0x130
[ 2296.810194] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 2296.810196] RIP: 0033:0x7f8e80077154
[ 2296.810197] RSP: 002b:00007fffc23855e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 2296.810199] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f8e80077154
[ 2296.810200] RDX: 0000000000000008 RSI: 0000000000d734b0 RDI: 0000000000000003
[ 2296.810201] RBP: 00007f8e805726c0 R08: 0000000000000000 R09: 0000000000000000
[ 2296.810202] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[ 2296.810203] R13: 0000000000000000 R14: 0000000000d734b0 R15: 0000000000aa4db0
[ 2417.627259] INFO: task python3:11639 blocked for more than 120 seconds.
[ 2417.632687] Tainted: P O 4.15.0-55-generic #60+lp1796292+1
[ 2417.638276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Brad Figg (brad-figg)
tags: added: cscc
Revision history for this message
Chris Gregan (cgregan) wrote :

Escalated to Field Critical as it now happens often enough to block our ability to test proposed product releases. We are unable to test openstack-next at the moment because our test runs fail due to this bug.

Revision history for this message
Andrea Righi (arighi) wrote :

I've uploaded a new test kernel based on the latest bionic kernel from master-next:
https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292/

In addition to that I've backported all the recent upstream bcache fixes and applied my proposed fix for the potential deadlock in bch_allocator_thread() (https://lkml.org/lkml/2019/7/10/241).

I've tested this kernel both on a VM and on a bare metal box, running the test case from bug 1784665 (https://launchpadlibrarian.net/381282009/bcache-basic-repro.sh - with some minor adjustments to match my devices).

The tests have been running for more than 1h without triggering any problem (and they're still going).

Ryan / Chris: it would be really nice if you could do one more test with this new kernel... and if you're still hitting issues we can try to work on a better reproducer. Thanks again!

Revision history for this message
Ryan Harper (raharper) wrote :

ubuntu@ubuntu:~$ uname -r
4.15.0-56-generic
ubuntu@ubuntu:~$ cat /proc/version
Linux version 4.15.0-56-generic (arighi@kathleen) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #62~lp1796292 SMP Thu Aug 1 07:45:21 UTC 2019

This failed on the second install while running bcache-super-show /dev/vdg: hung task.

[ 259.150347] bcache: run_cache_set() invalidating existing data
[ 259.158038] bcache: register_cache() registered cache device nvme1n1p2
[ 259.251093] bcache: register_bdev() registered backing device vdg
[ 259.379809] bcache: bch_cached_dev_attach() Caching vdg as bcache3 on set 084505ad-5f6c-4666-9e3e-4f1650e8b015
[ 259.411486] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 259.537070] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 259.797830] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 259.900392] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 271.682662] md: md0: resync done.
[ 484.525529] INFO: task python3:11257 blocked for more than 120 seconds.
[ 484.528933] Tainted: P O 4.15.0-56-generic #62~lp1796292
[ 484.532221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.535936] python3 D 0 11257 7974 0x00000000
[ 484.535941] Call Trace:
[ 484.535952] __schedule+0x291/0x8a0
[ 484.535957] schedule+0x2c/0x80
[ 484.535977] bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 484.535984] ? wait_woken+0x80/0x80
[ 484.535993] __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[ 484.536002] bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 484.536014] __uuid_write+0x59/0x150 [bcache]
[ 484.536025] ? __write_super+0x137/0x170 [bcache]
[ 484.536035] bch_uuid_write+0x16/0x40 [bcache]
[ 484.536046] __cached_dev_store+0x1d8/0x8a0 [bcache]
[ 484.536057] bch_cached_dev_store+0x39/0xc0 [bcache]
[ 484.536061] sysfs_kf_write+0x3c/0x50
[ 484.536064] kernfs_fop_write+0x125/0x1a0
[ 484.536069] __vfs_write+0x1b/0x40
[ 484.536071] vfs_write+0xb1/0x1a0
[ 484.536075] SyS_write+0x5c/0xe0
[ 484.536081] do_syscall_64+0x73/0x130
[ 484.536085] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 484.536088] RIP: 0033:0x7f2ac7aa7154
[ 484.536090] RSP: 002b:00007ffff157b628 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 484.536093] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f2ac7aa7154
[ 484.536095] RDX: 0000000000000008 RSI: 0000000001c96340 RDI: 0000000000000003
[ 484.536096] RBP: 00007f2ac7fa26c0 R08: 0000000000000000 R09: 0000000000000000
[ 484.536098] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[ 484.536099] R13: 0000000000000000 R14: 0000000001c96340 R15: 00000000019c7db0

I'll note that I can run bcache-super-show on the device while this is hung.

$ sudo bash
root@ubuntu:~# bcache-super-show /dev/vdg
sb.magic ok
sb.first_sector 8 [match]
sb.csum C71A896B52F1C486 [match]
sb.version 1 [backing device]

dev.label osddata5
dev.uuid 11df8370-fb64-4bb0-8171-dadabb47f6b1
dev.sec...


Revision history for this message
Andrea Righi (arighi) wrote :

Thanks Ryan, this is very interesting:

[ 259.411486] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 259.537070] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 259.797830] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 259.900392] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)

It looks like we're trying to register /dev/vdg multiple times as a backing device (make-bcache -B). I'm not getting this message during my tests, so that might be required to reproduce that particular deadlock.

I'll modify my test case to trigger these errors and see if I can reproduce the hung task timeout issue.

Revision history for this message
Ryan Harper (raharper) wrote :


We carry a specific sauce patch to ensure that if the cacheset is already online and a backing device shows up later, the kernel emits the change event to trigger the udev rules that generate the symlink for /dev/bcache/by-uuid. I don't think the patch we carry is at issue, since we are just detecting the re-register scenario and emitting a change uevent:

https://www.spinics.net/lists/linux-bcache/msg05833.html

We may want to resubmit that now to see if they'll take it, or whether they want to deal with the scenario in a cleaner way.


I can provide you a setup to reproduce this. I'll put together a doc.


Revision history for this message
Ryan Harper (raharper) wrote :

Reproducer script

Revision history for this message
Andrea Righi (arighi) wrote :

Ryan, unfortunately the last reproducer script is giving me a lot of errors, and I'm still trying to figure out how to make it run to the end (or at least to the point where it starts running some bcache commands).

In the meantime (as anticipated on IRC) I've uploaded a test kernel reverting the patch "UBUNTU: SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent":

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+1/

As we know, this would re-introduce the problem discussed in bug 1729145, but it'd be interesting to test it anyway, just to see if this patch is somehow related to the bch_bucket_alloc() deadlock.

In addition to that, I've spent some time looking at the last kernel trace and the code. It looks like bch_bucket_alloc() always releases the mutex &ca->set->bucket_lock when it goes to sleep (the call to schedule()), but it doesn't release bch_register_lock, which might also be held. I was wondering if this could be the reason for this deadlock, so I've prepared an additional test kernel that does *not* revert our "UBUNTU SAUCE" patch, but instead releases the mutex bch_register_lock when bch_bucket_alloc() goes to sleep:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+3/

Sorry for asking all these tests... if I can't find a way to reproduce the bug on my side, asking you to test is the only way that I have to debug this issue. :)

Revision history for this message
Ryan Harper (raharper) wrote :

I tried the +3 kernel first, and I got 3 installs and then this hang:

[ 549.828710] bcache: run_cache_set() invalidating existing data
[ 549.836485] bcache: register_cache() registered cache device nvme1n1p2
[ 549.937486] bcache: register_bdev() registered backing device vdg
[ 550.018855] bcache: bch_cached_dev_attach() Caching vdg as bcache3 on set c7abd3ea-f9c9-415a-b8a6-9efeddc3e030
[ 550.074760] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 550.316246] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 550.545840] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 550.565928] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 550.724285] bcache: register_bcache() error /dev/vdg: device already registered (emitting change event)
[ 562.352520] md: md0: resync done.
[ 725.980380] INFO: task python3:27303 blocked for more than 120 seconds.
[ 725.982364] Tainted: P O 4.15.0-56-generic #62~lp1796292+3
[ 725.984228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 725.986284] python3 D 0 27303 27293 0x00000000
[ 725.986287] Call Trace:
[ 725.986319] __schedule+0x291/0x8a0
[ 725.986322] schedule+0x2c/0x80
[ 725.986337] bch_bucket_alloc+0x320/0x3a0 [bcache]
[ 725.986359] ? wait_woken+0x80/0x80
[ 725.986363] __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[ 725.986367] bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 725.986372] __uuid_write+0x59/0x150 [bcache]
[ 725.986377] ? __write_super+0x137/0x170 [bcache]
[ 725.986382] bch_uuid_write+0x16/0x40 [bcache]
[ 725.986386] __cached_dev_store+0x1d8/0x8a0 [bcache]
[ 725.986391] bch_cached_dev_store+0x39/0xc0 [bcache]
[ 725.986399] sysfs_kf_write+0x3c/0x50
[ 725.986401] kernfs_fop_write+0x125/0x1a0
[ 725.986406] __vfs_write+0x1b/0x40
[ 725.986407] vfs_write+0xb1/0x1a0
[ 725.986409] SyS_write+0x5c/0xe0
[ 725.986416] do_syscall_64+0x73/0x130
[ 725.986419] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 725.986421] RIP: 0033:0x7f2d7c39a154
[ 725.986422] RSP: 002b:00007fff80c5e048 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 725.986426] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f2d7c39a154
[ 725.986426] RDX: 0000000000000008 RSI: 000000000247e7d0 RDI: 0000000000000003
[ 725.986427] RBP: 00007f2d7c8956c0 R08: 0000000000000000 R09: 0000000000000000
[ 725.986428] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[ 725.986428] R13: 0000000000000000 R14: 000000000247e7d0 R15: 00000000021b6e60

Revision history for this message
Ryan Harper (raharper) wrote :

Trying the first kernel without the change event sauce also fails:

[ 532.823594] bcache: run_cache_set() invalidating existing data
[ 532.828876] bcache: register_cache() registered cache device nvme0n1p2
[ 532.869716] bcache: register_bdev() registered backing device vda1
[ 532.994355] bcache: bch_cached_dev_attach() Caching vda1 as bcache0 on set 21d89237-231d-4af6-a4c8-4b1b8fa5eef5
[ 533.051588] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.094717] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.120063] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.142517] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.191069] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.249877] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.282653] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.301225] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.310505] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.318959] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.374121] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.536920] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.581468] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.589270] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.595986] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.602638] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.651848] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.677836] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.712074] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.717682] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.723354] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.728951] bcache: register_bcache() error /dev/vda1: device already registered
[ 533.777602] bcache: register_bcache() error /dev/vda1: device already registered
[ 553.784393] md: md0: resync done.
[ 725.983387] INFO: task python3:413 blocked for more than 120 seconds.
[ 725.985099] Tainted: P O 4.15.0-56-generic #62~lp1796292+1
[ 725.986820] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 725.988649] python3 D 0 413 405 0x00000000
[ 725.988652] Call Trace:
[ 725.988684] __schedule+0x291/0x8a0
[ 725.988687] schedule+0x2c/0x80
[ 725.988710] bch_bucket_alloc+0x1fa/0x350 [bcache]
[ 725.988722] ? wait_woken+0x80/0x80
[ 725.988726] __bch_bucket_alloc_set+0xfe/0x150 [bcache]
[ 725.988729] bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 725.988734] __uuid_write+0x59/0x150 [bcache]
[ 725.988738] ? __write_super+0x137/0x170 [bcache]
[ 725.988742] bch_uuid_write+0x16/0x40 [bcache]
[ 725.988746] __cached_dev_store+0x1d8/0x8a0 [bcache]
[ 725.988750] ...


Revision history for this message
Andrea Righi (arighi) wrote :

After some help from Ryan (on IRC) I've been able to run the latest reproducer script and trigger the same trace. Now I should be able to collect all the information I need and hopefully post a new test kernel (fixed for real...) soon.

Revision history for this message
Andrea Righi (arighi) wrote :

Some additional info about the deadlock:

crash> bt 16588
PID: 16588 TASK: ffff9ffd7f332b00 CPU: 1 COMMAND: "bcache_allocato"
    [exception RIP: bch_crc64+57]
    RIP: ffffffffc093b2c9 RSP: ffffab9585767e28 RFLAGS: 00000286
    RAX: f1f51403756de2bd RBX: 0000000000000000 RCX: 0000000000000065
    RDX: 0000000000000065 RSI: ffff9ffd63980000 RDI: ffff9ffd63925346
    RBP: ffffab9585767e28 R8: ffffffffc093db60 R9: ffffab9585739000
    R10: 000000000000007f R11: 000000001ffef001 R12: 0000000000000000
    R13: 0000000000000008 R14: ffff9ffd63900000 R15: ffff9ffd683d0000
    CS: 0010 SS: 0018
 #0 [ffffab9585767e30] bch_prio_write at ffffffffc09325c0 [bcache]
 #1 [ffffab9585767eb0] bch_allocator_thread at ffffffffc091bdc5 [bcache]
 #2 [ffffab9585767f08] kthread at ffffffffa80b2481
 #3 [ffffab9585767f50] ret_from_fork at ffffffffa8a00205

crash> bt 14658
PID: 14658 TASK: ffff9ffd7a9f0000 CPU: 0 COMMAND: "python3"
 #0 [ffffab958380bb48] __schedule at ffffffffa89ae441
 #1 [ffffab958380bbe8] schedule at ffffffffa89aea7c
 #2 [ffffab958380bbf8] bch_bucket_alloc at ffffffffc091c370 [bcache]
 #3 [ffffab958380bc68] __bch_bucket_alloc_set at ffffffffc091c5ce [bcache]
 #4 [ffffab958380bcb8] bch_bucket_alloc_set at ffffffffc091c66e [bcache]
 #5 [ffffab958380bcf8] __uuid_write at ffffffffc0931b69 [bcache]
 #6 [ffffab958380bda0] bch_uuid_write at ffffffffc0931f76 [bcache]
 #7 [ffffab958380bdc0] __cached_dev_store at ffffffffc0937c08 [bcache]
 #8 [ffffab958380be20] bch_cached_dev_store at ffffffffc0938309 [bcache]
 #9 [ffffab958380be50] sysfs_kf_write at ffffffffa830c97c
#10 [ffffab958380be60] kernfs_fop_write at ffffffffa830c3e5
#11 [ffffab958380bea0] __vfs_write at ffffffffa827e5bb
#12 [ffffab958380beb0] vfs_write at ffffffffa827e781
#13 [ffffab958380bee8] sys_write at ffffffffa827e9fc
#14 [ffffab958380bf30] do_syscall_64 at ffffffffa8003b03
#15 [ffffab958380bf50] entry_SYSCALL_64_after_hwframe at ffffffffa8a00081
    RIP: 00007faffc7bd154 RSP: 00007ffe307cbc88 RFLAGS: 00000246
    RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007faffc7bd154
    RDX: 0000000000000008 RSI: 00000000011ce7f0 RDI: 0000000000000003
    RBP: 00007faffccb86c0 R8: 0000000000000000 R9: 0000000000000000
    R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
    R13: 0000000000000000 R14: 00000000011ce7f0 R15: 0000000000f33e60
    ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

In this case the task "python3" (pid 14658) gets stuck in bch_bucket_alloc(), in a wait that never completes. The task that should wake "python3" from this wait is "bcache_allocator" (pid 16588), but the wakeup never happens, because bcache_allocator is stuck in this "retry_invalidate" busy loop:

static int bch_allocator_thread(void *arg)
{
...
retry_invalidate:
                allocator_wait(ca, ca->set->gc_mark_valid &&
                               !ca->invalidate_needs_gc);
                invalidate_buckets(ca);

                /*
                 * Now, we write their new gens to disk so we can start writing
                 * new stuff to them:
                 */
                allocator_wait(ca, !atomic_read(&ca->set->prio_blocked));
  ...
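
For context, allocator_wait() in drivers/md/bcache/alloc.c is approximately the following (paraphrased from the 4.15-era source; details vary between kernel versions). Each allocator_wait() drops bucket_lock and sleeps until its condition holds, so while the allocator keeps cycling through the retry_invalidate path without ever satisfying these conditions, it never pushes a free bucket out, and the bch_bucket_alloc() waiter above is never woken:

#define allocator_wait(ca, cond)					\
do {									\
	while (1) {							\
		/* re-check the condition on every wakeup */		\
		set_current_state(TASK_INTERRUPTIBLE);			\
		if (cond)						\
			break;						\
									\
		/* drop bucket_lock while asleep */			\
		mutex_unlock(&(ca)->set->bucket_lock);			\
		if (kthread_should_stop()) {				\
			set_current_state(TASK_RUNNING);		\
			return 0;					\
		}							\
									\
		schedule();						\
		mutex_lock(&(ca)->set->bucket_lock);			\
	}								\
	__set_current_state(TASK_RUNNING);				\
} while (0)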


Revision history for this message
Andrea Righi (arighi) wrote :

Ryan, I've uploaded a new test kernel with the fix mentioned in my previous comment:

https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/

I've performed over 100 installations using curtin-nvme.sh (install_count = 100) with no hung task timeouts. I'll run other stress tests to make sure we're not breaking anything else with this fix, but the results look promising so far.

It'd be great if you could also do a test on your side. Thanks!

Revision history for this message
Ryan Harper (raharper) wrote :

On Mon, Aug 5, 2019 at 8:01 AM Andrea Righi <email address hidden>
wrote:

> Ryan, I've uploaded a new test kernel with the fix mentioned in my
> previous comment:
>
> https://kernel.ubuntu.com/~arighi/LP-1796292/4.15.0-56.62~lp1796292+4/
> [...]

That's excellent news. I'm starting my tests on this kernel now.


Revision history for this message
Ryan Harper (raharper) wrote :

On Mon, Aug 5, 2019 at 1:19 PM Ryan Harper <email address hidden>
wrote:

> That's excellent news. I'm starting my tests on this kernel now.

I've got 233 consecutive successful installs.


Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.2.0-13.14

---------------
linux (5.2.0-13.14) eoan; urgency=medium

  * eoan/linux: 5.2.0-13.14 -proposed tracker (LP: #1840261)

  * NULL pointer dereference when Inserting the VIMC module (LP: #1840028)
    - media: vimc: fix component match compare

  * Miscellaneous upstream changes
    - selftests/bpf: remove bpf_util.h from BPF C progs

linux (5.2.0-12.13) eoan; urgency=medium

  * eoan/linux: 5.2.0-12.13 -proposed tracker (LP: #1840184)

  * Eoan update: v5.2.8 upstream stable release (LP: #1840178)
    - scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure
    - libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
    - libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
    - ALSA: usb-audio: Sanity checks for each pipe and EP types
    - ALSA: usb-audio: Fix gpf in snd_usb_pipe_sanity_check
    - HID: wacom: fix bit shift for Cintiq Companion 2
    - HID: Add quirk for HP X1200 PIXART OEM mouse
    - atm: iphase: Fix Spectre v1 vulnerability
    - bnx2x: Disable multi-cos feature.
    - drivers/net/ethernet/marvell/mvmdio.c: Fix non OF case
    - ife: error out when nla attributes are empty
    - ip6_gre: reload ipv6h in prepare_ip6gre_xmit_ipv6
    - ip6_tunnel: fix possible use-after-free on xmit
    - ipip: validate header length in ipip_tunnel_xmit
    - mlxsw: spectrum: Fix error path in mlxsw_sp_module_init()
    - mvpp2: fix panic on module removal
    - mvpp2: refactor MTU change code
    - net: bridge: delete local fdb on device init failure
    - net: bridge: mcast: don't delete permanent entries when fast leave is
      enabled
    - net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
    - net: fix ifindex collision during namespace removal
    - net/mlx5e: always initialize frag->last_in_page
    - net/mlx5: Use reversed order when unregister devices
    - net: phy: fixed_phy: print gpio error only if gpio node is present
    - net: phylink: don't start and stop SGMII PHYs in SFP modules twice
    - net: phylink: Fix flow control for fixed-link
    - net: phy: mscc: initialize stats array
    - net: qualcomm: rmnet: Fix incorrect UL checksum offload logic
    - net: sched: Fix a possible null-pointer dereference in dequeue_func()
    - net sched: update vlan action for batched events operations
    - net: sched: use temporary variable for actions indexes
    - net/smc: do not schedule tx_work in SMC_CLOSED state
    - net: stmmac: Use netif_tx_napi_add() for TX polling function
    - NFC: nfcmrvl: fix gpio-handling regression
    - ocelot: Cancel delayed work before wq destruction
    - tipc: compat: allow tipc commands without arguments
    - tipc: fix unitilized skb list crash
    - tun: mark small packets as owned by the tap sock
    - net/mlx5: Fix modify_cq_in alignment
    - net/mlx5e: Prevent encap flow counter update async to user query
    - r8169: don't use MSI before RTL8168d
    - bpf: fix XDP vlan selftests test_xdp_vlan.sh
    - selftests/bpf: add wrapper scripts for test_xdp_vlan.sh
    - selftests/bpf: reduce time to execute test_xdp_vlan.sh
    - net: fix bpf_xdp_adjust_head regression for generic-XDP
    - hv_sock: Fi...

Changed in linux (Ubuntu Eoan):
status: Confirmed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-60.67

---------------
linux (4.15.0-60.67) bionic; urgency=medium

  * bionic/linux: 4.15.0-60.67 -proposed tracker (LP: #1841086)

  * [Regression] net test from ubuntu_kernel_selftests failed due to bpf test
    compilation issue (LP: #1840935)
    - SAUCE: Fix "bpf: relax verifier restriction on BPF_MOV | BPF_ALU"

  * [Regression] failed to compile seccomp test from ubuntu_kernel_selftests
    (LP: #1840932)
    - Revert "selftests: skip seccomp get_metadata test if not real root"

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis

linux (4.15.0-59.66) bionic; urgency=medium

  * bionic/linux: 4.15.0-59.66 -proposed tracker (LP: #1840006)

  * zfs not completely removed from bionic tree (LP: #1840051)
    - SAUCE: (noup) remove completely the zfs code

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  * [18.04 FEAT] Enhanced hardware support (LP: #1836857)
    - s390: report new CPU capabilities
    - s390: add alignment hints to vector load and store

  * [18.04 FEAT] Enhanced CPU-MF hardware counters - kernel part (LP: #1836860)
    - s390/cpum_cf: Add support for CPU-MF SVN 6
    - s390/cpumf: Add extended counter set definitions for model 8561 and 8562

  * ideapad_laptop disables WiFi/BT radios on Lenovo Y540 (LP: #1837136)
    - platform/x86: ideapad-laptop: Remove no_hw_rfkill_list

  * Stacked onexec transitions fail when under NO NEW PRIVS restrictions
    (LP: #1839037)
    - SAUCE: apparmor: fix nnp subset check failure when, stacking

  * bcache: bch_allocator_thread(): hung task timeout (LP: #1784665) // Tight
    timeout for bcache removal causes spurious failures (LP: #1796292)
    - SAUCE: bcache: fix deadlock in bcache_allocator

  * bcache: bch_allocator_thread(): hung task timeout (LP: #1784665)
    - bcache: never writeback a discard operation
    - bcache: improve bcache_reboot()
    - bcache: fix writeback target calc on large devices
    - bcache: add journal statistic
    - bcache: fix high CPU occupancy during journal
    - bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set
    - bcache: fix incorrect sysfs output value of strip size
    - bcache: fix error return value in memory shrink
    - bcache: fix using of loop variable in memory shrink
    - bcache: Fix indentation
    - bcache: Add __printf annotation to __bch_check_keys()
    - bcache: Annotate switch fall-through
    - bcache: Fix kernel-doc warnings
    - bcache: Remove an unused variable
    - bcache: Suppress more warnings about set-but-not-used variables
    - bcache: Reduce the number of sparse complaints about lock imbalances
    - bcache: Fix a compiler warning in bcache_device_init()
    - bcache: Move couple of string arrays to sysfs.c
    - bcache: Move couple of functions to sysfs.c
    - bcache: Replace bch_read_string_list() by __sysfs_match_string()

  * linux hwe i386 kernel 5.0.0-21.22~18.04.1 crashes on Lenovo x220
    (LP: #1838115)
    - x86/mm: Check for pfn instead of page in vmalloc_sync_one()
    - x86/mm: Sync also unmappings in vmalloc_sync_all()
    - mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy()...

Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Released