rt kernel BUG: using smp_processor_id() in preemptible code modprobe

Bug #1884262 reported by Jim Somerville
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: zhao.shuai
Milestone: (none listed)

Bug Description

Brief Description
-----------------

Boot an rt load on hardware with a c6xx (qat) device present. As soon as the device is enabled, BUG reports start appearing in the kernel log:

[ 20.065605] c6xx 0000:3f:00.0: enabling device (0140 -> 0142)
[ 20.071492] BUG: using smp_processor_id() in preemptible [00000000] code: systemd-udevd/1026
[ 20.071515] caller is qat_rsa_init_tfm+0x16/0x50 [intel_qat]
[ 20.071516] BUG: using smp_processor_id() in preemptible [00000000] code: systemd-udevd/1032
[ 20.071518] CPU: 16 PID: 1026 Comm: systemd-udevd Tainted: G O --------- - - 4.18.0-147.3.1.rt24.96.el8_1.tis.6.x86_64 #1
[ 20.071526] caller is qat_rsa_init_tfm+0x16/0x50 [intel_qat]
[ 20.071526] Hardware name: Intel Corporation S2600WFQ/S2600WFQ, BIOS SE5C620.86B.02.01.0009.092820190230 09/28/2019
[ 20.071527] Call Trace:
[ 20.071532] dump_stack+0x5a/0x73
[ 20.071537] check_preemption_disabled+0xd9/0xf0
[ 20.071542] qat_rsa_init_tfm+0x16/0x50 [intel_qat]
[ 20.071548] crypto_create_tfm+0x48/0xd0
[ 20.071549] crypto_spawn_tfm2+0x2e/0x50
[ 20.071553] pkcs1pad_init_tfm+0x19/0x30
[ 20.071555] crypto_create_tfm+0x48/0xd0
[ 20.071557] crypto_alloc_tfm+0x4d/0xb0
[ 20.071560] public_key_verify_signature+0x7e/0x2c0
[ 20.071566] ? keyring_search+0x9c/0xd0
[ 20.071567] ? key_default_cmp+0x20/0x20
[ 20.071569] ? find_asymmetric_key+0xc4/0x230
[ 20.071572] pkcs7_validate_trust+0x97/0x1e0
[ 20.071578] verify_pkcs7_signature+0xa7/0x130
[ 20.071584] mod_verify_sig+0x97/0xe0
[ 20.071587] load_module+0xc5/0x1be0
[ 20.071591] ? map_vm_area+0x31/0x40
[ 20.071593] ? __vmalloc_node_range+0x14b/0x220
[ 20.071594] ? __do_sys_init_module+0x9b/0x170
[ 20.071596] __do_sys_init_module+0x113/0x170
[ 20.071601] do_syscall_64+0x5b/0x1c0
[ 20.071604] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 20.071605] RIP: 0033:0x7fe1d70264ea
[ 20.071607] Code: 48 8b 0d a9 79 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 79 2c 00 f7 d8 64 89 01 48
[ 20.071608] RSP: 002b:00007ffdb11222f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[ 20.071609] RAX: ffffffffffffffda RBX: 0000562d0ecb3b80 RCX: 00007fe1d70264ea
[ 20.071611] BUG: using smp_processor_id() in preemptible [00000000] code: systemd-udevd/1035
[ 20.071612] RDX: 00007fe1d7941039 RSI: 0000000000038f4c RDI: 0000562d0f4e4ac0
[ 20.071618] caller is qat_rsa_init_tfm+0x16/0x50 [intel_qat]
[ 20.071619] RBP: 00007fe1d7941039 R08: 0000000000000000 R09: 0000562d0ecb85a0
[ 20.071619] R10: 0000562d0ecb8500 R11: 0000000000000246 R12: 0000562d0f4e4ac0
[ 20.071620] R13: 0000562d0ecb7e30 R14: 0000000000020000 R15: 0000000000000000

The kernel log is flooded with these entries, which appear every time modprobe or anything else calls load_module. Module loading still works, but the driver is evidently violating the rule that smp_processor_id() must not be called from preemptible code.

Severity
--------

Major: usable but *kernel* log flooding with BUG reports is visibly bad, not minor imo

Workaround
----------

Not needed. Just hold your nose.

Not bothering with the rest of the template, everything you need is above.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - issue w/ 4.18 kernel (rt)
Assigning to distro.other PL to assign to the appropriate dev prime

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Austin Sun (sunausti)
tags: added: stx.4.0 stx.distro.other
Austin Sun (sunausti)
Changed in starlingx:
assignee: Austin Sun (sunausti) → zhao.shuai (zhao.shuai.neusoft)
Revision history for this message
zhao.shuai (zhao.shuai.neusoft) wrote :

According to the latest driver information officially published for QAT,
only the following systems are currently supported:
Red Hat Enterprise Linux 7.6*, Red Hat Enterprise Linux 7.5*

website path:
https://www.intel.com/content/www/us/en/search.html?ws=text#q=QAT&t=Downloads&layout=table

We are already using the latest version of the QAT driver, so this problem arises because the driver is not fully adapted to this kernel. Since it is an internal problem of the QAT driver, we can notify the QAT team so that they update the driver, and then update StarlingX to the new QAT driver release.

Revision history for this message
zhao.shuai (zhao.shuai.neusoft) wrote :

From analysis of the log and the latest QAT driver code (qat1.7.l.4.5.0-00034.tar.gz),
the basic call sequence behind the problem is as follows:

qat_rsa_init_tfm()
     |-->qat_crypto_get_instance_node(get_current_node());
            |-->topology_physical_package_id(smp_processor_id());
-----------------------
Preemption risk point:
static inline int get_current_node(void)
{
 return topology_physical_package_id(smp_processor_id());
}
----------------------
Proposed solution, pending discussion:

diff --git a/quickassist/qat/drivers/crypto/qat/qat_common/adf_common_drv.h b/quickassist/qat/drivers/crypto/qat/qat_common/adf_common_drv.h
index 35f0f44..bbe95d9 100644
--- a/quickassist/qat/drivers/crypto/qat/qat_common/adf_common_drv.h
+++ b/quickassist/qat/drivers/crypto/qat/qat_common/adf_common_drv.h
@@ -100,7 +100,11 @@ struct service_hndl {

 static inline int get_current_node(void)
 {
- return topology_physical_package_id(smp_processor_id());
+ unsigned int cpu;
+ preempt_disable();
+ cpu = smp_processor_id();
+ preempt_enable();
+ return topology_physical_package_id(cpu);
 }

 int adf_service_register(struct service_hndl *service);
--
2.7.4
----------------------
Reference materials:
https://www.kernel.org/doc/Documentation/preempt-locking.txt
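
To make the effect of the proposed patch easier to see, here is a minimal userspace sketch that models the debug check with a plain counter. It is a toy model, not kernel code: `preempt_count`, `violations`, the stubbed `smp_processor_id()` and `topology_physical_package_id()`, and the two `get_current_node_*` names are stand-ins invented for illustration, and the real check in the kernel has further exemptions that are not modeled here.

```c
/* Userspace toy model of the preemption debug check; NOT kernel code. */
#include <assert.h>
#include <stdio.h>

static int preempt_count; /* models the per-task preemption counter */
static int violations;    /* how many times the modeled check fired */

static void preempt_disable(void) { preempt_count++; }

static void preempt_enable(void)
{
    assert(preempt_count > 0); /* catch unbalanced enable in the model */
    preempt_count--;
}

/* Stand-in for smp_processor_id() with the debug check enabled. */
static int smp_processor_id(void)
{
    if (preempt_count == 0) {
        violations++;
        fprintf(stderr,
                "BUG: using smp_processor_id() in preemptible code\n");
    }
    return 0; /* the model has a single CPU */
}

/* Stand-in: every CPU maps to package 0 in this model. */
static int topology_physical_package_id(int cpu)
{
    (void)cpu;
    return 0;
}

/* Shape of the original helper: the check fires. */
static int get_current_node_unpatched(void)
{
    return topology_physical_package_id(smp_processor_id());
}

/* Shape of the proposed patch: the read happens with preemption off. */
static int get_current_node_patched(void)
{
    int cpu;

    preempt_disable();
    cpu = smp_processor_id();
    preempt_enable();
    /* cpu may already be stale here; it is only a placement hint. */
    return topology_physical_package_id(cpu);
}
```

Calling `get_current_node_unpatched()` in this model increments `violations`, while `get_current_node_patched()` does not. Note that the patched form only satisfies the check: once `preempt_enable()` returns, the task can migrate, so the returned node is still a best-effort value.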

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kernel (master)

Fix proposed to branch: master
Review: https://review.opendev.org/737444

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
zhao.shuai (zhao.shuai.neusoft) wrote :

Source Code Change for Review:
https://review.opendev.org/#/c/737444

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/737444
Committed: https://git.openstack.org/cgit/starlingx/kernel/commit/?id=e2c681a443cc4bb3f039a31cbce5eadc1125339d
Submitter: Zuul
Branch: master

commit e2c681a443cc4bb3f039a31cbce5eadc1125339d
Author: zhao.shuai <email address hidden>
Date: Mon Jun 22 20:24:29 2020 -0700

    qat: fix smp_processor_id preemption complaints

    The module loading works but the code is evidently violating
    the use of smp_processor_id() with respect to preemptibility.

    It seems that smp_processor_id() is only used for a best-effort
    load-balancing, refer to qat_crypto_get_instance_node(). It's not feasible
    to disable preemption for the duration of the crypto requests. Therefore,
    just silence the warning.

    Reference materials:
    https://github.com/torvalds/linux/commit/1b82feb6c5e1996513d0fb0bbb475417088b4954
    https://www.kernel.org/doc/Documentation/preempt-locking.txt

    Change-Id: I0f4d88d934aa29d30cde9a20212e758e15ad01ad
    Closes-Bug: 1884262
    Signed-off-by: zhao.shuai <email address hidden>
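
One way to "just silence the warning" for a value that is only a load-balancing hint is to read the CPU number through the unchecked accessor `raw_smp_processor_id()`. The sketch below assumes the merged change follows that approach; this thread does not show the merged diff, so treat that as an assumption. It is a userspace model: the stubs for `raw_smp_processor_id()`, `smp_processor_id()` and `topology_physical_package_id()`, and the `preempt_count` and `violations` counters, are stand-ins so the example compiles outside the kernel.

```c
/* Userspace model of checked vs. unchecked CPU-id accessors; NOT kernel code. */
#include <assert.h>

static int preempt_count; /* models the per-task preemption counter */
static int violations;    /* how many times the modeled check fired */

/* Stand-in for the unchecked accessor: no preemption check at all. */
static int raw_smp_processor_id(void)
{
    return 0; /* the model has a single CPU */
}

/* Stand-in for the checked accessor, kept for comparison. */
static int smp_processor_id(void)
{
    if (preempt_count == 0)
        violations++; /* the kernel would print the BUG report here */
    return raw_smp_processor_id();
}

/* Stand-in: every CPU maps to package 0 in this model. */
static int topology_physical_package_id(int cpu)
{
    assert(cpu >= 0);
    return 0;
}

/*
 * Assumed shape of the silenced helper: the CPU number is a best-effort
 * hint, so a stale value is acceptable and no check is requested.
 */
static int get_current_node(void)
{
    return topology_physical_package_id(raw_smp_processor_id());
}
```

In this model `get_current_node()` never touches `violations`, whereas a direct call to the checked `smp_processor_id()` with `preempt_count == 0` does. The trade-off is that the unchecked accessor gives up the debug check entirely, which is only reasonable when, as the commit message argues, a wrong node costs performance rather than correctness.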

Changed in starlingx:
status: In Progress → Fix Released