Activity log for bug #1884766

Date Who What changed Old value New value Message
2020-06-23 13:00:39 Mauricio Faria de Oliveira bug added bug
2020-06-23 13:00:55 Mauricio Faria de Oliveira linux (Ubuntu): status New Confirmed
2020-06-23 13:00:58 Mauricio Faria de Oliveira linux (Ubuntu): importance Undecided Medium
2020-06-23 13:01:01 Mauricio Faria de Oliveira linux (Ubuntu): assignee Mauricio Faria de Oliveira (mfo)
2020-06-23 13:01:09 Mauricio Faria de Oliveira tags sts
2020-06-30 21:06:14 Mauricio Faria de Oliveira description
Old value:
This bug is for tracking and submitting this commit [1] once it lands in v5.8-rcN.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git/commit/?id=34c86f4c4a7be3b3e35aa48bd18299d4c756064d

New value:
[Impact]

* Users of the Linux kernel's crypto userspace API reported BUG() / kernel NULL pointer dereference errors after kernel upgrades.

* The stack trace signature is an accept() syscall going through af_alg_accept() and hitting errors usually in one of:
- apparmor_sk_clone_security()
- apparmor_sock_graft()
- release_sock()

[Fix]

* This is a regression introduced by upstream commit 37f96694cf73 ("crypto: af_alg - Use bh_lock_sock in sk_destruct") which made its way through stable.

* The offending patch allows the critical regions of af_alg_accept() and af_alg_release_parent() to run concurrently; now with the "right" events on 2 CPUs it might drop the non-atomic reference counter of the alg_sock then the sock, thus release a sock that is still in use.

* The fix is upstream commit 34c86f4c4a7b ("crypto: af_alg - fix use-after-free in af_alg_accept() due to bh_lock_sock()") [1]. It changes alg_sock's ref counter to atomic, which addresses the root cause.

[Test Case]

* There is a synthetic test case available, which uses a kprobes kernel module to synchronize the concurrent CPUs on the instructions responsible for the problem; and a userspace part to run it.

* The organic reproducer is the Varnish Cache Plus software with the Crypto vmod (which uses kernel crypto userspace API) under long, very high load.

* The patch has been verified on both reproducers with the 4.15 and 5.7 kernels.

* More tests performed with 'stress-ng --af-alg' with 11 CPUs/hogs on Bionic/Disco/Eoan/Focal (all on same version of stress-ng, V0.11.14) No regressions observed from original kernel. (the af-alg stressor can exercise almost all kernel crypto modules shipped with the kernel; so it checks more paths/crypto alg interfaces.)

[Regression Potential]

* The fix patch does a fundamental change in how alg_sock reference counters work, plus another change to the 'nokey' counting. This of course *has* a risk of regression.

* Regressions theoretically could manifest as use after free errors (in case of undercounting) in the af_alg functions or silent memory leaks (in case of overcounting), but also other behaviors since reference counting is key to many things.

* FWIW, this patch has been written by the crypto subsystem maintainer, who certainly knows a lot of the normal and corner cases, thus giving the patch more credit.

* Testing with the organic reproducer ran as long as 5 days, without issues, so it does look good.

[Other Info]

* [1] Patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34c86f4c4a7be3b3e35aa48bd18299d4c756064d

[Stack Trace Examples]

Examples:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
...
RIP: 0010:apparmor_sk_clone_security+0x26/0x70
...
Call Trace:
security_sk_clone+0x33/0x50
af_alg_accept+0x81/0x1c0 [af_alg]
alg_accept+0x15/0x20 [af_alg]
SYSC_accept4+0xff/0x210
SyS_accept+0x10/0x20
do_syscall_64+0x73/0x130
entry_SYSCALL_64_after_hwframe+0x3d/0xa2

general protection fault: 0000 [#1] SMP PTI
...
RIP: 0010:__release_sock+0x54/0xe0
...
Call Trace:
release_sock+0x30/0xa0
af_alg_accept+0x122/0x1c0 [af_alg]
alg_accept+0x15/0x20 [af_alg]
SYSC_accept4+0xff/0x210
SyS_accept+0x10/0x20
do_syscall_64+0x73/0x130
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
2020-06-30 21:07:59 Mauricio Faria de Oliveira nominated for series Ubuntu Groovy
2020-06-30 21:07:59 Mauricio Faria de Oliveira bug task added linux (Ubuntu Groovy)
2020-06-30 21:07:59 Mauricio Faria de Oliveira nominated for series Ubuntu Xenial
2020-06-30 21:07:59 Mauricio Faria de Oliveira bug task added linux (Ubuntu Xenial)
2020-06-30 21:07:59 Mauricio Faria de Oliveira nominated for series Ubuntu Bionic
2020-06-30 21:07:59 Mauricio Faria de Oliveira bug task added linux (Ubuntu Bionic)
2020-06-30 21:07:59 Mauricio Faria de Oliveira nominated for series Ubuntu Eoan
2020-06-30 21:07:59 Mauricio Faria de Oliveira bug task added linux (Ubuntu Eoan)
2020-06-30 21:07:59 Mauricio Faria de Oliveira nominated for series Ubuntu Focal
2020-06-30 21:07:59 Mauricio Faria de Oliveira bug task added linux (Ubuntu Focal)
2020-06-30 21:25:09 Mauricio Faria de Oliveira description
Old value:
Identical to the New value of the 2020-06-30 21:06:14 description entry.

New value:
[Impact]

 * Users of the Linux kernel's crypto userspace API
   reported BUG() / kernel NULL pointer dereference
   errors after kernel upgrades.

 * The stack trace signature is an accept() syscall
   going through af_alg_accept() and hitting errors
   usually in one of:
   - apparmor_sk_clone_security()
   - apparmor_sock_graft()
   - release_sock()

[Fix]

 * This is a regression introduced by upstream commit
   37f96694cf73 ("crypto: af_alg - Use bh_lock_sock
   in sk_destruct") which made its way through stable.

 * The offending patch allows the critical regions
   of af_alg_accept() and af_alg_release_parent() to
   run concurrently; now with the "right" events on 2
   CPUs it might drop the non-atomic reference counter
   of the alg_sock then the sock, thus release a sock
   that is still in use.

 * The fix is upstream commit 34c86f4c4a7b ("crypto:
   af_alg - fix use-after-free in af_alg_accept() due
   to bh_lock_sock()") [1]. It changes alg_sock's ref
   counter to atomic, which addresses the root cause.

[Test Case]

 * There is a synthetic test case available, which
   uses a kprobes kernel module to synchronize the
   concurrent CPUs on the instructions responsible
   for the problem; and a userspace part to run it.

 * The organic reproducer is the Varnish Cache Plus
   software with the Crypto vmod (which uses kernel
   crypto userspace API) under long, very high load.

 * The patch has been verified on both reproducers
   with the 4.15 and 5.7 kernels.

* More tests performed with 'stress-ng --af-alg' with 11 CPUs on Xenial/Bionic/Disco/Eoan/Focal (all on same version of stress-ng, V0.11.14) No regressions observed from original kernel. (the af-alg stressor can exercise almost all kernel crypto modules shipped with the kernel; so it checks more paths/crypto alg interfaces.)

[Regression Potential]

 * The fix patch does a fundamental change in how
   alg_sock reference counters work, plus another
   change to the 'nokey' counting. This of course
   *has* a risk of regression.

 * Regressions theoretically could manifest as use
   after free errors (in case of undercounting) in
   the af_alg functions or silent memory leaks (in
   case of overcounting), but also other behaviors
   since reference counting is key to many things.

 * FWIW, this patch has been written by the crypto
   subsystem maintainer, who certainly knows a lot
   of the normal and corner cases, thus giving the
   patch more credit.

 * Testing with the organic reproducer ran as long
   as 5 days, without issues, so it does look good.

[Other Info]

 * [1] Patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34c86f4c4a7be3b3e35aa48bd18299d4c756064d

[Stack Trace Examples]

Examples:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    ...
    RIP: 0010:apparmor_sk_clone_security+0x26/0x70
    ...
    Call Trace:
     security_sk_clone+0x33/0x50
     af_alg_accept+0x81/0x1c0 [af_alg]
     alg_accept+0x15/0x20 [af_alg]
     SYSC_accept4+0xff/0x210
     SyS_accept+0x10/0x20
     do_syscall_64+0x73/0x130
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    general protection fault: 0000 [#1] SMP PTI
    ...
    RIP: 0010:__release_sock+0x54/0xe0
    ...
    Call Trace:
     release_sock+0x30/0xa0
     af_alg_accept+0x122/0x1c0 [af_alg]
     alg_accept+0x15/0x20 [af_alg]
     SYSC_accept4+0xff/0x210
     SyS_accept+0x10/0x20
     do_syscall_64+0x73/0x130
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2
2020-06-30 21:31:50 Mauricio Faria de Oliveira linux (Ubuntu Xenial): status New In Progress
2020-06-30 21:31:53 Mauricio Faria de Oliveira linux (Ubuntu Xenial): importance Undecided Medium
2020-06-30 21:31:56 Mauricio Faria de Oliveira linux (Ubuntu Xenial): assignee Mauricio Faria de Oliveira (mfo)
2020-06-30 21:31:59 Mauricio Faria de Oliveira linux (Ubuntu Bionic): status New In Progress
2020-06-30 21:32:04 Mauricio Faria de Oliveira linux (Ubuntu Bionic): importance Undecided Medium
2020-06-30 21:32:09 Mauricio Faria de Oliveira linux (Ubuntu Bionic): assignee Mauricio Faria de Oliveira (mfo)
2020-06-30 21:32:13 Mauricio Faria de Oliveira linux (Ubuntu Eoan): status New In Progress
2020-06-30 21:32:16 Mauricio Faria de Oliveira linux (Ubuntu Eoan): importance Undecided Medium
2020-06-30 21:32:18 Mauricio Faria de Oliveira linux (Ubuntu Eoan): assignee Mauricio Faria de Oliveira (mfo)
2020-06-30 21:32:20 Mauricio Faria de Oliveira linux (Ubuntu Focal): status New In Progress
2020-06-30 21:32:22 Mauricio Faria de Oliveira linux (Ubuntu Focal): importance Undecided Medium
2020-06-30 21:32:25 Mauricio Faria de Oliveira linux (Ubuntu Focal): assignee Mauricio Faria de Oliveira (mfo)
2020-06-30 21:32:30 Mauricio Faria de Oliveira linux (Ubuntu Groovy): status Confirmed Won't Fix
2020-06-30 21:32:33 Mauricio Faria de Oliveira linux (Ubuntu Groovy): importance Medium Undecided
2020-06-30 21:32:36 Mauricio Faria de Oliveira linux (Ubuntu Groovy): assignee Mauricio Faria de Oliveira (mfo)
2020-06-30 21:32:49 Mauricio Faria de Oliveira linux (Ubuntu): status Confirmed In Progress
2020-06-30 21:33:16 Mauricio Faria de Oliveira description
Old value:
Identical to the New value of the 2020-06-30 21:25:09 description entry.

New value:
[Impact]

 * Users of the Linux kernel's crypto userspace API
   reported BUG() / kernel NULL pointer dereference
   errors after kernel upgrades.

 * The stack trace signature is an accept() syscall
   going through af_alg_accept() and hitting errors
   usually in one of:
   - apparmor_sk_clone_security()
   - apparmor_sock_graft()
   - release_sock()

[Fix]

 * This is a regression introduced by upstream commit
   37f96694cf73 ("crypto: af_alg - Use bh_lock_sock
   in sk_destruct") which made its way through stable.

 * The offending patch allows the critical regions
   of af_alg_accept() and af_alg_release_parent() to
   run concurrently; now with the "right" events on 2
   CPUs it might drop the non-atomic reference counter
   of the alg_sock then the sock, thus release a sock
   that is still in use.

 * The fix is upstream commit 34c86f4c4a7b ("crypto:
   af_alg - fix use-after-free in af_alg_accept() due
   to bh_lock_sock()") [1]. It changes alg_sock's ref
   counter to atomic, which addresses the root cause.

[Test Case]

 * There is a synthetic test case available, which
   uses a kprobes kernel module to synchronize the
   concurrent CPUs on the instructions responsible
   for the problem; and a userspace part to run it.

 * The organic reproducer is the Varnish Cache Plus
   software with the Crypto vmod (which uses kernel
   crypto userspace API) under long, very high load.

 * The patch has been verified on both reproducers
   with the 4.15 and 5.7 kernels.

 * More tests performed with 'stress-ng --af-alg'
   with 11 CPUs on Xenial/Bionic/Disco/Eoan/Focal
   (all on same version of stress-ng, V0.11.14)
   No regressions observed from original kernel.
   (the af-alg stressor can exercise almost all
   kernel crypto modules shipped with the kernel;
   so it checks more paths/crypto alg interfaces.)

[Regression Potential]

 * The fix patch does a fundamental change in how
   alg_sock reference counters work, plus another
   change to the 'nokey' counting. This of course
   *has* a risk of regression.

 * Regressions theoretically could manifest as use
   after free errors (in case of undercounting) in
   the af_alg functions or silent memory leaks (in
   case of overcounting), but also other behaviors
   since reference counting is key to many things.

 * FWIW, this patch has been written by the crypto
   subsystem maintainer, who certainly knows a lot
   of the normal and corner cases, thus giving the
   patch more credit.

 * Testing with the organic reproducer ran as long
   as 5 days, without issues, so it does look good.

[Other Info]

* Not sending for Groovy (should get via Unstable).

 * [1] Patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34c86f4c4a7be3b3e35aa48bd18299d4c756064d

[Stack Trace Examples]

Examples:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    ...
    RIP: 0010:apparmor_sk_clone_security+0x26/0x70
    ...
    Call Trace:
     security_sk_clone+0x33/0x50
     af_alg_accept+0x81/0x1c0 [af_alg]
     alg_accept+0x15/0x20 [af_alg]
     SYSC_accept4+0xff/0x210
     SyS_accept+0x10/0x20
     do_syscall_64+0x73/0x130
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    general protection fault: 0000 [#1] SMP PTI
    ...
    RIP: 0010:__release_sock+0x54/0xe0
    ...
    Call Trace:
     release_sock+0x30/0xa0
     af_alg_accept+0x122/0x1c0 [af_alg]
     alg_accept+0x15/0x20 [af_alg]
     SYSC_accept4+0xff/0x210
     SyS_accept+0x10/0x20
     do_syscall_64+0x73/0x130
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2
2020-07-01 16:41:31 Brian Moyles bug added subscriber Netflix Engineering
2020-07-20 16:55:53 Kelsey Steele linux (Ubuntu Xenial): status In Progress Fix Committed
2020-07-22 07:19:53 Kelsey Steele linux (Ubuntu Eoan): status In Progress Fix Committed
2020-07-30 21:15:22 Kelsey Steele linux (Ubuntu Bionic): status In Progress Fix Committed
2020-08-04 01:19:15 Kelsey Steele linux (Ubuntu Focal): status In Progress Fix Committed
2020-08-10 18:52:23 Ubuntu Kernel Bot tags sts sts verification-needed-bionic
2020-08-17 18:59:49 Mauricio Faria de Oliveira tags sts verification-needed-bionic sts verification-done-bionic
2020-08-18 16:59:43 Brian Murray linux (Ubuntu Eoan): status Fix Committed Won't Fix
2020-09-01 10:15:06 Launchpad Janitor linux (Ubuntu Bionic): status Fix Committed Fix Released
2020-09-01 12:48:39 Mauricio Faria de Oliveira linux (Ubuntu Xenial): status Fix Committed Fix Released
2020-09-01 12:48:44 Mauricio Faria de Oliveira linux (Ubuntu Focal): status Fix Committed Fix Released
2020-09-01 12:48:47 Mauricio Faria de Oliveira linux (Ubuntu Groovy): status Won't Fix Fix Released
2022-09-14 13:44:52 Mauricio Faria de Oliveira linux (Ubuntu): status In Progress Fix Released
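
Note on the accept() path named in the [Impact] section of the description: the sketch below is one possible illustration of how a userspace program reaches af_alg_accept() through the kernel crypto userspace API. It is not the synthetic test case or the Varnish reproducer from the bug. The helper name sha256_via_af_alg and the choice of the sha256 hash are assumptions made for the example; it relies only on the AF_ALG support in Python's standard socket module and returns None where that support is unavailable.

```python
import hashlib
import socket
from typing import Optional


def sha256_via_af_alg(data: bytes) -> Optional[bytes]:
    """Hash `data` with the kernel's sha256 through an AF_ALG socket.

    Hypothetical helper for illustration only. Returns None when AF_ALG
    cannot be used (non-Linux platform, missing kernel support, or a
    sandbox that forbids the socket family).
    """
    if not hasattr(socket, "AF_ALG"):
        return None
    try:
        # Parent socket: bind() selects the algorithm type and name.
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as alg:
            alg.bind(("hash", "sha256"))
            # accept() creates the child (operation) socket. This is the
            # syscall that shows up in the stack traces in the description.
            op, _ = alg.accept()
            # Closing the child socket releases its reference on the parent.
            with op:
                op.sendall(data)
                return op.recv(32)  # a sha256 digest is 32 bytes
    except OSError:
        return None


if __name__ == "__main__":
    digest = sha256_via_af_alg(b"abc")
    if digest is None:
        print("AF_ALG is not available on this system")
    else:
        # Compare against the userspace implementation as a sanity check.
        print(digest.hex(), digest == hashlib.sha256(b"abc").digest())
```

A single call like this exercises accept() once and then closes the child socket; per the [Fix] section, the failure needs accept and the release of a child socket to overlap on two CPUs, so this snippet shows the code path rather than reproducing the crash.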