This is difficult for us to test in our lab because we are using MAAS and
hit this during MAAS deployments of nodes, so we would need MAAS images
built with these kernels. Additionally, this doesn't reproduce every time;
it shows up in roughly 1 in 4 test runs. It may be best to find a way to
reproduce this outside of MAAS.
On Wed, Jul 3, 2019 at 11:16 AM Andrea Righi <email address hidden>
wrote:
> From a kernel perspective this big slowness on shutting down a bcache
> volume might be caused by a locking / race condition issue. If I read
> correctly this problem has been reproduced in bionic (and in xenial we
> even got a kernel oops - it looks like caused by a NULL pointer
> dereference). I would try to address these issues separately.
>
> About bionic it would be nice to test this commit (also mentioned by
> @elmo in comment #28):
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb8cbb6df38f6e5124a3d5f1f8a3dbf519537c60
>
> Moreover, even if we didn't get an explicit NULL pointer dereference
> with bionic, I think it would be interesting to test also the following
> fixes:
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4b732a248d12cbdb46999daf0bf288c011335eb
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f0ffa67349c56ea54c03ccfd1e073c990e7411e
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9951379b0ca88c95876ad9778b9099e19a95d566
>
> I've already backported all of them and applied to the latest bionic
> kernel. A test kernel is available here:
>
> https://kernel.ubuntu.com/~arighi/LP-1796292/
>
> If it doesn't cost too much it would be great to do a test with it. In
> the meantime I'll try to reproduce the problem locally. Thanks in
> advance!
>
> --
> You received this bug notification because you are a member of Canonical
> Field High, which is subscribed to the bug report.
> https://bugs.launchpad.net/bugs/1796292
>
> Title:
> Tight timeout for bcache removal causes spurious failures
>
> Status in curtin:
> Fix Released
> Status in linux package in Ubuntu:
> Confirmed
> Status in linux source package in Bionic:
> New
> Status in linux source package in Cosmic:
> New
> Status in linux source package in Disco:
> New
> Status in linux source package in Eoan:
> Confirmed
>
> Bug description:
> I've had a number of deployment faults where curtin would report
> Timeout exceeded for removal of /sys/fs/bcache/xxx when doing a mass-
> deployment of 30+ nodes. Upon retrying the node would usually deploy
> fine. Experimentally I've set the timeout ridiculously high, and it
> seems I'm getting no faults with this. I'm wondering if the timeout
> for removal is set too tight, or might need to be made configurable.
>
> --- curtin/util.py~ 2018-05-18 18:40:48.000000000 +0000
> +++ curtin/util.py 2018-10-05 09:40:06.807390367 +0000
> @@ -263,7 +263,7 @@
> return _subp(*args, **kwargs)
>
>
> -def wait_for_removal(path, retries=[1, 3, 5, 7]):
> +def wait_for_removal(path, retries=[1, 3, 5, 7, 1200, 1200]):
> if not path:
> raise ValueError('wait_for_removal: missing path parameter')
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1796292/+subscriptions
>
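For context, the quoted diff just extends the retry schedule passed to curtin's `wait_for_removal`. A minimal sketch of that polling loop, simplified from curtin/util.py (the real implementation differs in detail, e.g. how it reports the timeout), looks like this:

```python
import os
import time


def wait_for_removal(path, retries=(1, 3, 5, 7)):
    """Poll until 'path' disappears, sleeping for each interval in 'retries'.

    Raises OSError if the path still exists after all retries - this is
    the "Timeout exceeded for removal" failure seen during deployments.
    """
    if not path:
        raise ValueError('wait_for_removal: missing path parameter')
    for sleep_time in retries:
        if not os.path.exists(path):
            return
        time.sleep(sleep_time)
    if os.path.exists(path):
        raise OSError('Timeout exceeded for removal of %s' % path)
```

With the default schedule the total wait is only 16 seconds; appending two 1200-second entries, as in the workaround, stretches it to 40 minutes, which masks the slow kernel-side teardown rather than fixing it.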