5.15.0-1041-realtime kernel failed to pass hwlatdetect on 4th Generation Intel® Xeon® Scalable Processor

Bug #2028541 reported by Alex Wang
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
ubuntu-realtime
Fix Released
Medium
Joseph Salisbury

Bug Description

[Impact]

On our customer server with Intel 4th Generation Intel® Xeon® Scalable Processor, realtime test hwlatdetect failed with ubuntu 22.04 5.15.0-1041-realtime kernel.

But when we change back to 5.15.0-1039-realtime kernel, and keep all the other settings(BIOS, cmdline, core/uncore freq...), this test can pass.

[Test Plan]

# echo 7ffffffe,7ffffffe > /sys/kernel/tracing/tracing_cpumask
# echo per-cpu > /sys/kernel/tracing/hwlat_detector/mode
# hwlatdetect --threshold 10 --hardlimit 10 --duration 12h --window 1000000us --width 950000us --report /home/hwlat_result.log --watch

kernel cmdline:
BOOT_IMAGE=/boot/vmlinuz-5.15.0-1041-realtime ... ro default_hugepagesz=1G hugepages=30 hugepagesz=1G intel_iommu=on iommu=pt skew_tick=1 isolcpus=managed_irq,domain,1-30,33-62 nohz_full=1-30,33-62 nohz=on rcu_nocbs=1-30,33-62 rcu_nocb_poll nosoftlockup nmi_watchdog=0 crashkernel=auto selinux=0 mce=off audit=0 vfio_pci.enable_sriov=1 vfio_pci.disable_idle_d3=1 usbcore.autosuspend=-1 tsc=reliable

tuned profile: realtime

For 5.15.0-1039-realtime kernel, all latencies are below 10 us for 12 hours.
For 5.15.0-1041-realtime kernel, latencies above 10 us can be found in an hour.

[Where problems could occur]

On the same server with the same configuration, only change the rt-kernel, we can meet this issue, so maybe we can find something in the changelog of kernel.

----------------------------------

System Configuration
    OS: Ubuntu 22.04 LTS
    Kernel: 5.15.0-1041-realtime
    CPUs: Intel 4th Generation Intel® Xeon® Scalable Processor, single socket system

Errors
    hwlatdetect latencies above 10 us can be found in an hour.

So for supporting the customer to enable realtime workload on ubuntu rt kernel, any change in 5.15.0-1041-realtime may cause this failure?

Changed in ubuntu-realtime:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Joseph Salisbury (jsalisbury) wrote (last edit ):

Do you know if this issue also happens with the Ubuntu-realtime-5.15.0-1040.45 kernel? There were a lot of upstream stable updates added with that release.

We can perform a kernel bisect to identify the commit that introduced this regression. I can build the kernels and you could test to see if a specific kernel build exhibits the bug or not. Would you be able to test some kernels if I build them?

Changed in ubuntu-realtime:
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I will also see if I can reproduce this regression internally. To test, I have a dual-core system with two: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Also, can you see if this bug still exists in the Ubuntu-realtime-5.15.0-1043.48 kernel? It is the version currently in -proposed.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote (last edit ):

How are you interpeting your results, so I can do the same? I ran the hwlatdetect you provided in the description against the 1039 kernel. This is a snipit of my results:

test finished
Max Latency: 44us
Samples recorded: 1009
Samples exceeding threshold: 1009
SMIs during run: 0
report saved to /home/hwlat_result.log (1009 samples)

ts: 1690309385.147289040, inner:44, outer:41
ts: 1690309385.147286665, inner:41, outer:43
ts: 1690309385.146428556, inner:41, outer:41
ts: 1690309396.883916958, inner:38, outer:16
ts: 1690309397.000467627, inner:39, outer:39
ts: 1690309407.146921902, inner:36, outer:39
ts: 1690309434.018117360, inner:11, outer:5
ts: 1690309443.036113494, inner:11, outer:5
ts: 1690309521.192119186, inner:6, outer:11
ts: 1690309574.298196459, inner:5, outer:11
ts: 1690309577.304183263, inner:11, outer:5
ts: 1690309635.420204197, inner:5, outer:11
ts: 1690309652.454203214, inner:5, outer:11
ts: 1690309751.652092860, inner:11, outer:5
ts: 1690309821.792218155, inner:11, outer:6
ts: 1690309976.100194996, inner:5, outer:11
ts: 1690310017.182106712, inner:9, outer:11
ts: 1690310113.374107838, inner:11, outer:10
ts: 1690310151.450100976, inner:5, outer:11
ts: 1690310159.466188548, inner:11, outer:5
ts: 1690310247.642112078, inner:6, outer:11
ts: 1690310258.664093244, inner:11, outer:5
ts: 1690310290.728091227, inner:5, outer:11
ts: 1690310296.740196885, inner:11, outer:5
ts: 1690310331.810186921, inner:5, outer:11
ts: 1690310343.834183837, inner:11, outer:6
ts: 1690310410.968088896, inner:11, outer:5
...
...
...

Are you looking at the inner and outer numbers to determine your latency that needs to be below 10 us? If so, I probably need to tune my system a bit on 1039 to establish a baseline.

Revision history for this message
Alex Wang (alexwang-bkc) wrote :

Hi Salisbury,

I tried with my server and this issue also happens with the Ubuntu-realtime-5.15.0-1040.45 kernel:
hwlatdetect: test duration 43200 seconds
   detector: tracer
   parameters:
        Latency threshold: 10us
        Sample window: 1000000us
        Sample width: 950000us
     Non-sampling period: 50000us
        Output File: /home/hwlat_result.log

Starting test
ts: 1690449580.354462983, inner:1, outer:11
ts: 1690450571.218462019, inner:1, outer:11
ts: 1690450604.482456718, inner:1, outer:11
ts: 1690450610.530459305, inner:1, outer:13
ts: 1690450685.122460067, inner:0, outer:12
ts: 1690450747.618463617, inner:2, outer:11
ts: 1690450995.586461613, inner:1, outer:12
ts: 1690451171.986462773, inner:1, outer:11
^Cinterrupted
test finished
Max Latency: 13us
Samples recorded: 8
Samples exceeding threshold: 8
SMIs during run: 0
report saved to /home/hwlat_result.log (8 samples)

When we switch back to 5.15.0-1039-realtime, the latency is OK:
hwlatdetect: test duration 43200 seconds
   detector: tracer
   parameters:
        Latency threshold: 10us
        Sample window: 1000000us
        Sample width: 950000us
     Non-sampling period: 50000us
        Output File: /home/hwlat_result.log

Starting test
^Cinterrupted
test finished
Max Latency: Below threshold
Samples recorded: 0
Samples exceeding threshold: 0
SMIs during run: 0
report saved to /home/hwlat_result.log (0 samples)

Revision history for this message
Alex Wang (alexwang-bkc) wrote :

And dual socket system latency is much higher than single socket, so we could only debug this issue on single socket system.

And I am glad to try with specific kernel build, where can I get it?

Thanks.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Were you able to see if this issue still exists in the Ubuntu-realtime-5.15.0-1043.48 kernel?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I will see if I can reproduce this issue as well. I was not able to using a two socket system. I will try disabling all the cpus on one of the sockets to simulate your environment.

Revision history for this message
Alex Wang (alexwang-bkc) wrote :

This issue is gone in 5.15.0-1044-realtime kernel and 5.15.0-1043-realtime kernel.

5.15.0-1042-realtime kernel is not tested yet.

5.15.0-1041-realtime kernel and 5.15.0-1040-realtime kernel introduce this issue.

5.15.0-1039-realtime kernel is OK.

All these tests are on the same server, with the same BIOS settings/tuned config.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you test the v5.15.0-1045 kernel, which is now in -updates?

Revision history for this message
Alex Wang (alexwang-bkc) wrote :

5.15.0-1045-realtime kernel can not reproduce this issue.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

That is great news! This issue should be fixed via a stable update from version 1045 forward.

The 5.15.0-1046 is now in -updates and will soon be updated with 5.15.0-1047. I will leave this bug open to confirm it is still fixed in 1046 and 1047.

Thanks for all the help testing!

Changed in ubuntu-realtime:
status: In Progress → Fix Released
Revision history for this message
Alex Wang (alexwang-bkc) wrote :

Hi Joseph,

This issue is back in 5.15.0-1048-realtime kernel and 5.15.0-1049-realtime kernel.
5.15.0-1046-realtime kernel is OK.

So is it possible to check the different between 1046 and 1048? Thanks.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote (last edit ):

Thanks for the report that this issue returned. There were no changes in the real-time patch set between the 5.15.0-1046-realtime kernel and 5.15.0-1049-realtime kernel.

I'll see what stable updates came in that could have caused this.

Changed in ubuntu-realtime:
status: Fix Released → In Progress
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

The 1048 kernel only had a few security updates added to it. Can you see if 5.15.0-1047 has the bug as well?

Revision history for this message
Alex Wang (alexwang-bkc) wrote :

Sorry, I can't see "5.15.0-1047-realtime" through apt list:

linux-image-5.15.0-1046-realtime/jammy 5.15.0-1046.52 amd64
linux-image-5.15.0-1048-realtime/jammy,now 5.15.0-1048.54 amd64
linux-image-5.15.0-1049-realtime/jammy 5.15.0-1049.55 amd64
linux-image-5.15.0-1050-realtime/jammy 5.15.0-1050.56 amd64

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Sorry for the delay on this. There is a new kernel being released this week: 5.15.0-1052.58. Would it be possible for you to see if this kernel fixes the issue?

Revision history for this message
Alex Wang (alexwang-bkc) wrote :

I have tested 5.15.0-1052.58 and 5.15.0-1051.57
5.15.0-1052.58 hwlatdetect 12 hours result is <=10us.
5.15.0-1051.57 hwlatdetect 12 hours max latency is 15 ~ 20us.
So is it possible to find out what is update in 5.15.0-1052.58 and narrow down which part cause this?
I am curious.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It's interesting that you see these results. The upstream patchset has not changed in Ubuntu since 5.15.0-1045 with patchset rt66. However, this issue has seem to come an gone away in a few releases since then.

This issue was probably caused by an upstream stable update that came in with the 5.15.y updates and then fixed by a later update.

There are many updates that could be the fix. All of the updates can be seen via the PPA:
https://launchpad.net/~ubuntu-advantage/+archive/ubuntu/realtime-updates/+packages

Drop down the package linux-realtime - 5.15.0-1052.58 and you should see the changelog.

Since no commit specifically sticks out as a fix, we could perform a "Reverse" bisect, but that would required you to perform a lot of testing.

Changed in ubuntu-realtime:
assignee: Joseph Salisbury (jsalisbury) → nobody
status: In Progress → Triaged
Changed in ubuntu-realtime:
status: Triaged → Incomplete
Revision history for this message
Alex Wang (alexwang-bkc) wrote :

5.15.0-1052.58 works fine with our platforms.
And I learned that upstream update may cause some performance issues.
It's OK for me to close this ticket.
Great thanks.

Changed in ubuntu-realtime:
status: Incomplete → Fix Released
Changed in ubuntu-realtime:
assignee: nobody → Joseph Salisbury (jsalisbury)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.