cores isolation not working properly

Bug #1992164 reported by Antonio
This bug affects 2 people
Affects: ubuntu-realtime
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

I think I configured the Ubuntu real-time kernel with isolated cores, IRQ affinity, and nohz for the isolated cores, but I still get a lot of interruptions (up to 10 us) when launching a DPDK thread on an isolated core. I tried the same on Rocky Linux and got much better results (less than 1 us). I'm surely doing something wrong, but I don't understand what.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hello Antonio,

Could you post the details of how you configured isolated cores? Are you using boot options or a userspace tool like tuned?

Changed in ubuntu-realtime:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Joseph Salisbury (jsalisbury)
Revision history for this message
Antonio (adibaccoks) wrote :

Using the following boot options:

"audit=0 default_hugepagesz=1G hugepagesz=1G hugepages=32 hugepagesz=2M hugepages=1024 hpet=disable intel_idle.max_cstate=1 intel_iommu=on intel_pstate=disable iommu=pt isolcpus=nohz,domain,1-27,29-55 mce=off nmi_watchdog=0 nohz_full=1-27,29-55 nosoftlockup numa_balancing=disable processor.max_cstate=1 rcu_nocbs=1-27,29-55 rcu_nocb_poll irqaffinity=0,28 tsc=reliable quiet"

I then used the attached utility to measure hiccups. The tool pins a thread to an isolated core, executes a "computation" many times, and records the minimum and maximum execution time.

The tool reports jitter of up to 20 us over a computation that takes about 150 us. See the code for details.

    300034 300082 48 ( 0.0 us) 300056 300016 300090 1912602624 6001690792 4
    300016 300326 310 ( 0.1 us) 300060 300016 300326 2986344448 6001689432 5
    300036 300160 124 ( 0.1 us) 300058 300016 300326 4060086272 6001730288 6
    300032 300766 734 ( 0.3 us) 300072 300016 300766 838860800 6001715376 7
    300016 329984 29968 (14.3 us) 300058 300016 329984 1912602624 6001649968 8
    300014 300082 68 ( 0.0 us) 300056 300014 329984 2986344448 6001694352 9
    300016 300190 174 ( 0.1 us) 300056 300014 329984 4060086272 6001688028 10
    300030 301642 1612 ( 0.8 us) 300060 300014 329984 838860800 6001747244 11
    300016 329438 29422 (14.0 us) 300068 300014 329984 1912602624 6001835468 12

We need the maximum jitter to stay within 1 us.

Best regards,
Antonio.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Antonio,

Thanks for the feedback. I usually use version 2.18 or newer of the tuned package for setting IRQ affinity. I define specific cores for real-time processes and specific cores for interrupts. The current version of tuned in Ubuntu 22.04 is 2.15, which does not work properly. Version 2.18 has to be installed manually.

However, I will investigate isolation the way you are doing it, and post some feedback shortly.

Changed in ubuntu-realtime:
status: Triaged → In Progress
Revision history for this message
Joseph Salisbury (jsalisbury) wrote (last edit ):

I am using a 96 core machine. To experiment with core isolation, I isolate 95 cores, which causes all interrupts to go to a single core.

Here is what tuned sets my cmdline to:
BOOT_IMAGE=/boot/vmlinuz-5.15.0-1022-realtime root=UUID=3583d8c4-d539-439f-9d50-4341675268cc ro console=tty0 console=ttyS0,115200 skew_tick=1 isolcpus=managed_irq,domain,0-94 intel_pstate=disable nosoftlockup tsc=nowatchdog

The key parameter here is isolcpus=managed_irq,domain,0-94

I will see what else tuned does behind the scenes besides setting these parameters. I'll post my findings for you to test.

Revision history for this message
Antonio (adibaccoks) wrote :

Thank you for your interest.
I used the same kernel parameters as you (adding just the hugepages options), on the realtime 5.15 kernel:
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-1009-realtime root=/dev/mapper/ubuntu--vg-ubuntu--lv ro audit=0 default_hugepagesz=1G hugepagesz=1G hugepages=32 hugepagesz=2M hugepages=1024 intel_iommu=on iommu=pt skew_tick=1 isolcpus=managed_irq,domain,0-53 intel_pstate=disable nosoftlockup tsc=nowatchdog

and I get these values with the tool I sent you last time.

    300012 309756 9744 ( 4.6 us) 300058 300012 319294 1023410176 6013485990 50
    300012 309432 9420 ( 4.5 us) 300034 300012 319294 3707764736 6013426518 51
    300012 309684 9672 ( 4.6 us) 300080 300012 319294 2097152000 6013690050 52
    300012 316186 16174 ( 7.7 us) 300056 300012 319294 486539264 6013424742 53
    300012 309666 9654 ( 4.6 us) 300032 300012 319294 3170893824 6013433298 54
    300012 315130 15118 ( 7.2 us) 300074 300012 319294 1560281088 6013429568 55
    300012 317324 17312 ( 8.2 us) 300058 300012 319294 4244635648 6013426366 56
    300012 309608 9596 ( 4.6 us) 300056 300012 319294 2634022912 6013438128 57

Furthermore, if I cat /proc/interrupts I see a lot of interrupts (see attachment).

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Antonio,

Can you try adding managed_irq to your boot options?

I'm attempting to reproduce your issue on a simple VM. Here are the boot options from a VM:

BOOT_IMAGE=/boot/vmlinuz-5.15.0-1022-realtime root=UUID=9b6f96a9-f97e-4dda-a0b9-8be7f8e799fb ro quiet splash skew_tick=1 isolcpus=managed_irq,domain,1-3 intel_pstate=disable nosoftlockup tsc=nowatchdog vt.handoff=7

As you can see, I'm isolating CPUs 1, 2, and 3 for real-time processes, and core 0 will be used for IRQs. I found that a simple way to generate a bunch of repeating IRQs is to perform a flood ping. Something like:

sudo ping -f 192.168.1.1 -> replace that IP with the local IP address.

You can then monitor interrupts by watching /proc/interrupts with this command:
watch -n1 -d cat /proc/interrupts

While the flood ping is running, you should see the interrupt count for your network device increase dramatically. The interrupts should all go to the core(s) that are not isolated.
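
If it helps, a quick way to double-check where IRQs are allowed to run is to read the per-IRQ affinity lists (a sketch; run as root, and note that some IRQs, such as IRQ 0, cannot be moved):

# list the allowed CPUs for every IRQ; with isolation working, most lines
# should show only the housekeeping core(s), e.g. 0
grep -H . /proc/irq/*/smp_affinity_list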

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Also, the LOC and RES interrupts are the most telling indicators of whether managed IRQs are working. The local timer interrupts should only increase on the core(s) that are not isolated. If a single CPU is handling all interrupts, the 'Rescheduling interrupts' should not increase at all. Here is an example from my VM:

This is without isolcpus isolation:
LOC: 14876 13351 12125 14904 Local timer interrupts -> should only climb on 1 cpu.
RES: 12800 14857 15097 15249 Rescheduling interrupts -> should not climb at all.

This is with isolcpus isolation (the first column is the CPU handling all the IRQs; the RES row is not increasing on the isolated CPUs):
LOC: 35545 555 319 298 Local timer interrupts
RES: 1155 40 43 63 Rescheduling interrupts

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Sorry, I posted my two comments before I saw your comment #5. I'll review your data.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Antonio,

Can you monitor /proc/interrupts while your workload is running? It's difficult to read /proc/interrupts when there are so many cores; however, if you monitor the LOC: row, which is 'Local timer interrupts', you should only see the columns for CPUs 54 and 55 climbing.
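
For example, something like this trims the output down to just those rows, or to a couple of chosen CPU columns (a sketch; the awk field numbers assume CPUs 0-55, so CPU 54 is field 56 and CPU 55 is field 57):

# watch only the LOC and RES rows
watch -n1 'grep -E "^ *(LOC|RES):" /proc/interrupts'
# print only the CPU 54 and CPU 55 columns of those rows
awk '$1=="LOC:" || $1=="RES:" {print $1, $56, $57}' /proc/interrupts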

It should also be the case for the other interrupts. You will see some initial interrupts when the system first boots, but they should settle down and only increase on the CPUs that are not isolated.

Can you see if that is the case when your test is running?

Also, I tried running your tool, but I am missing some shared libraries:
./dpdk-jitter: error while loading shared libraries: librte_power.so.22: cannot open shared object file: No such file or directory

Do you know off the top of your head if I'm missing a specific package? If not, I can do some digging.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

One more comment: a command to see which of your CPUs are isolated is:
cat /sys/devices/system/cpu/isolated

If you look in the file /etc/systemd/system.conf, the value of CPUAffinity lists the CPUs that will handle all interrupts.
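
For reference, a minimal sketch of setting that (the value 0 is just an example; a reboot is the simplest way to make it apply to every service):

# pin everything systemd spawns to CPU 0
sudo sed -i 's/^#\?CPUAffinity=.*/CPUAffinity=0/' /etc/systemd/system.conf
sudo systemctl daemon-reexec   # or simply reboot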

Revision history for this message
Antonio (adibaccoks) wrote :

Hi,

the isolated cores are fine:

user@metis:~$ cat /sys/devices/system/cpu/isolated
0-53

To use my tool, download https://fast.dpdk.org/rel/dpdk-21.11.tar.xz , extract it, enter the directory, and run:
meson -Dplatform=native build
Then enter the build directory and run:
ninja
sudo ninja install

Alternatively, you could also use the non-DPDK version, which you can find here:

https://github.com/FDio/archived-pma_tools

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hello Antonio,

So it sounds like isolated cores and IRQ CPU affinity are working now, is that correct?

I performed some tests with your jitter tool. I found that assigning one CPU to IRQs and the remaining 95 for isolation obtained the best latency results. This is the invocation I used:
sudo ./jitter -c 10 -i 30

I created a tar file of all my test results and attached it to this report.

I see you are using 2 cores for IRQs: 0 and 28. Could you try testing with just one core for interrupts? Ensure you make the change in /etc/systemd/system.conf in addition to the boot parameters.

Another option would be to experiment with tuned, using its realtime profile. That is the easiest way I have found to experiment with these parameters.
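
A rough sketch of that route (the package names and the isolated_cores value are assumptions, and as noted above the tuned version in 22.04 may be too old, so a manual install of 2.18+ may be needed):

sudo apt install tuned tuned-profiles-realtime
# tell the realtime profile which cores to isolate (example value)
echo 'isolated_cores=1-55' | sudo tee -a /etc/tuned/realtime-variables.conf
sudo tuned-adm profile realtime
sudo reboot   # the profile edits the kernel command line, so a reboot is needed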

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

hiccup test results.

Revision history for this message
Antonio (adibaccoks) wrote :

Hi Joseph,

I saw your results and, for example, jitter-with-5IRQ-cpus-isolate_managed_irq-YCPUAffinityChangednetdev_queue_count-4 looks very promising.

But, in your opinion, what is the reason we are still observing a variation in execution time of around 2-4 usec?

Can anything be done to lower this "variation" with your real-time kernel?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I will do some more testing and investigation. I'll post my findings shortly.

Revision history for this message
Antonio (adibaccoks) wrote :

Thank you. I think that, in theory, apart from frequency throttling, if there are no interrupts on a core the execution-time jitter should be close to 0 us.

If your kernel is capable of limiting the maximum interruption time to less than 1 us, it opens the door to many real-time applications, telecom for example.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Antonio. I am away for this week, but will continue my investigation next week.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Antonio,

I think I found the magic combination of tunings. Here is my /proc/cmdline:

BOOT_IMAGE=/boot/vmlinuz-5.15.0-1025-realtime root=UUID=3583d8c4-d539-439f-9d50-4341675268cc ro console=tty0 console=ttyS0,115200 skew_tick=1 rcu_nocb_poll rcu_nocbs=1-95 nohz=on nohz_full=1-95 kthread_cpus=0 irqaffinity=0 isolcpus=managed_irq,domain,1-95 intel_pstate=disable nosoftlockup tsc=nowatchdog

This is a 96 core machine and I assigned all IRQs to core 0 and isolated cores 1-95.
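
For anyone following along, a sketch of how such parameters are typically made persistent on Ubuntu (adjust the CPU ranges to your topology):

# add the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="skew_tick=1 rcu_nocb_poll rcu_nocbs=1-95 nohz=on nohz_full=1-95 irqaffinity=0 isolcpus=managed_irq,domain,1-95 intel_pstate=disable nosoftlockup tsc=nowatchdog"
sudo update-grub && sudo reboot
cat /proc/cmdline   # verify after the reboot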

I'll post my results in the next comment:

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :
Download full text (4.3 KiB)

$ ./jitter -c 1 -i 30
Linux Jitter testing program version 1.9
Iterations=30
The pragram will execute a dummy function 80000 times
Display is updated every 20000 displayUpdate intervals
Thread affinity will be set to core_id:1
Timings are in CPU Core cycles
Inst_Min: Minimum Excution time during the display update interval(default is ~1 second)
Inst_Max: Maximum Excution time during the display update interval(default is ~1 second)
Inst_jitter: Jitter in the Excution time during rhe display update interval. This is the value of interest
last_Exec: The Excution time of last iteration just before the display update
Abs_Min: Absolute Minimum Excution time since the program started or statistics were reset
Abs_Max: Absolute Maximum Excution time since the program started or statistics were reset
tmp: Cumulative value calcualted by the dummy function
Interval: Time interval between the display updates in Core Cycles
Sample No: Sample number

cycles per usec 2394
   Inst_Min Inst_Max Inst_jitter last_Exec Abs_min Abs_max tmp Interval Sample No
    128020 128026 6 (0.0 us) 128022 128020 128026 304676864 2562159712 1
    128020 128024 4 (0.0 us) 128022 128020 128026 2689138688 2562185220 2
    128020 128026 6 (0.0 us) 128022 128020 128026 778633216 2562165460 3
    128020 128026 6 (0.0 us) 128024 128020 128026 3163095040 2562164762 4
    128020 128026 6 (0.0 us) 128022 128020 128026 1252589568 2562163104 5
    128020 128026 6 (0.0 us) 128024 128020 128026 3637051392 2562164176 6
    128020 128026 6 (0.0 us) 128024 128020 128026 1726545920 2562163752 7
    128020 128026 6 (0.0 us) 128022 128020 128026 4111007744 2562166540 8
    128020 128026 6 (0.0 us) 128022 128020 128026 2200502272 2562164038 9
    128020 128026 6 (0.0 us) 128020 128020 128026 289996800 2562164134 10
    128020 128026 6 (0.0 us) 128020 128020 128026 2674458624 2562166348 11
    128020 128026 6 (0.0 us) 128024 128020 128026 763953152 2562164290 12
    128020 128026 6 (0.0 us) 128022 128020 128026 3148414976 2562165930 13
    128020 128026 6 (0.0 us) 128022 128020 128026 1237909504 2562163676 14
    128020 128026 6 (0.0 us) 128020 128020 128026 3622371328 2562164092 15
    128020 128026 6 (0.0 us) 128022 128020 128026 1711865856 2562166342 16
    128020 128026 6 (0.0 us) 128022 128020 128026 4096327680 2562166296 17
    128020 128026 6 (0.0 us) 128022 128020 128026 2185822208 2562168882 18
    128020 128026 6 (0.0 us) 128022 128020 128026 275316736 2...

Read more...

Revision history for this message
Antonio (adibaccoks) wrote :

Hi Joseph,

thank you very much!! Could you tell me which network cards you have installed on your system?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Antonio,

This machine has multiple Broadcom NetXtreme BCM5720 devices:

sudo lshw -class network -short
H/W path Device Class Description
=========================================================
/0/100/1c.5/0 eno8303 network NetXtreme BCM5720 Gigabit Ethernet PCIe
/0/100/1c.5/0.1 eno8403 network NetXtreme BCM5720 Gigabit Ethernet PCIe
/0/102/4/0 eno12399 network NetXtreme BCM5720 Gigabit Ethernet PCIe
/0/102/4/0.1 eno12409 network NetXtreme BCM5720 Gigabit Ethernet PCIe
/0/102/5/0 eno12419 network NetXtreme BCM5720 Gigabit Ethernet PCIe
/0/102/5/0.1 eno12429 network NetXtreme BCM5720 Gigabit Ethernet PCIe

Revision history for this message
Antonio (adibaccoks) wrote : RE: [Bug 1992164] Re: cores isolation not working properly
Download full text (6.2 KiB)

Hi Joseph,

thank you very much for your support.
The kthread_cpus key seems to work very well.
Now I'm using this command line:

user@metis:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-1009-realtime root=/dev/mapper/ubuntu--vg-ubuntu--lv ro audit=0 default_hugepagesz=1G hugepagesz=1G hugepages=32 hugepagesz=2M hugepages=1024 intel_iommu=on iommu=pt skew_tick=1 rcu_nocb_poll rcu_nocbs=1-55 nohz=on nohz_full=1-55 kthread_cpus=0 irqaffinity=0 isolcpus=managed_irq,domain,1-55 intel_pstate=disable nosoftlockup tsc=nowatchdog

on this kernel:
Linux metis 5.15.0-1009-realtime #9-Ubuntu SMP PREEMPT_RT Thu Apr 21 20:33:36

I launched the flood ping and the execution jitter stayed very stable; I tried flood pinging both from and to the machine under test. I also tried stressing the disk with "truncate -s 5000M test.txt; sync" and the jitter was always stable.

There is just one simple action that somewhat affects the jitter and I don't understand why. But I suppose you can figure out what is happening.

1) I launched the jitter program and let it run (you can also launch multiple instances on different isolated cores; in my case I launched on cores 2 and 50):
    jitter -c 2 -i 3000

2) from another machine just try to ssh into the machine under test. I noticed that there is a strong correlation between the ssh attempts and spikes in the jitter. The strange thing is that the spike is observed on multiple isolated cores.

Could you try the same on your machine to confirm this behaviour?

In these traces you can see what I mean:

    360020 360024 4 360022 360016 438922 3184656384 7201919860 351
    360018 360024 6 360022 360016 438922 437387264 7201919660 352
    360018 360026 8 360020 360016 438922 1985085440 7201922250 353
    360020 360024 4 360022 360016 438922 3532783616 7201919662 354
    360020 360024 4 360020 360016 438922 785514496 7201918522 355
    360018 360026 8 360022 360016 438922 2333212672 7201919396 356
    360020 373132 13112 360020 360016 438922 3880910848 7201945238 357 (SSH into test machine)
    360020 360024 4 360022 360016 438922 1133641728 7201921792 358
    360018 360026 8 360022 360016 438922 2681339904 7201919438 359
    360020 360024 4 360022 360016 438922 4229038080 7201923808 360
    360020 360024 4 360022 360016 438922 1481768960 7201919854 361
    360018 360026 8 360022 360016 438922 3029467136 7201919770 362
    360020 376424 16404 360022 360016 438922 282198016 7201941410 363 (SSH into test machine)
    360018 360024 6 360022 360016 438922 1829896192 7201925264 364
    360020 360024 4 360022 360016 438922 3377594368 7201920538 ...

Read more...

Revision history for this message
Antonio (adibaccoks) wrote :

Hi Joseph,

I noticed a big improvement with the new command line options you provided. I also updated the real-time kernel to
Linux metis 5.15.0-1024-realtime #25-Ubuntu SMP PREEMPT_RT Fri Oct 14 20:32:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Anyway, while starting up, I see a message that makes me think that kthread_cpus is not being processed:

user@metis:~$ sudo dmesg | grep "passed to user"
[ 0.352403] Unknown kernel command line parameters "BOOT_IMAGE=/vmlinuz-5.15.0-1024-realtime kthread_cpus=0", will be passed to user space.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Yes, you can ignore kthread_cpus. It is not in the kernel at this point.

Revision history for this message
Antonio (adibaccoks) wrote :

ok, thank you.

There is just one simple action that somewhat affects the jitter and I don't understand why. But I suppose you can figure out what is happening.

1) I launched the jitter program and let it run (you can also launch multiple instances on different isolated cores; in my case I launched on cores 2 and 50):
    jitter -c 2 -i 3000

2) from another machine just try to ssh into the machine under test. I noticed that there is a strong correlation between the ssh attempts and spikes in the jitter. The strange thing is that the spike is observed on multiple isolated cores.

Could you try the same on your machine to confirm this behaviour?

Revision history for this message
Tim Martin (nv-tmartin) wrote :
Download full text (5.6 KiB)

Hi Joseph and Antonio,

I am also interested in reducing task jitter for my application. I've been following along this thread and still see very high jitter: >100us with Antonio's non-dpdk jitter.c program and nearly 1 second with a different jitter measurement tool I wrote.

Details of my setup:

$ cat /sys/devices/system/cpu/isolated
4-15

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.0-1025-realtime root=UUID=296f0418-3049-4cf8-b6dc-8f6b08d5ba5f ro pci=realloc=off default_hugepagesz=1G hugepagesz=1G hugepages=16 tsc=reliable clocksource=tsc isolcpus=nohz,domain,managed_irq,4-15 intel_idle.max_cstate=0 mce=ignore_ce processor.max_cstate=0 intel_pstate=disable audit=0 idle=poll rcu_nocb_poll nosoftlockup iommu=off irqaffinity=0 nohz_full=4-15 rcu_nocbs=4-15

$ cat /proc/version
Linux version 5.15.0-1025-realtime (buildd@lcy02-amd64-072) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #28-Ubuntu SMP PREEMPT_RT Fri Oct 28 23:19:16 UTC 2022

$ cat /etc/systemd/system.conf | grep CPUAffinity
#CPUAffinity=

(I just noticed the comment about setting CPUAffinity in /etc/systemd/system.conf, will do that in follow-up post).

With the non-dpdk jitter.c tool running I see a bunch of things running on CPU core 4:

$ ps -eo psr,tid,pid,s,class,comm:50,%cpu,priority,nice -T | sort -g | grep "^ 4"
  4 247 247 I TS kworker/4:1-mm_percpu_wq 0.0 20 0
  4 2738 2738 R TS jitter 83.5 20 0
  4 415 415 S FF irq/60-megasas1 0.0 -51 -
  4 455 455 S FF irq/121-nvme0q1 0.0 -51 -
  4 52 52 S TS cpuhp/4 0.0 20 0
  4 53 53 S FF idle_inject/4 0.0 -51 -
  4 54 54 S FF irq_work/4 0.0 -2 -
  4 55 55 S FF migration/4 0.0 -100 -
  4 56 56 S FF rcuc/4 0.0 -2 -
  4 57 57 S TS ksoftirqd/4 0.0 20 0
  4 58 58 I TS kworker/4:0-events 0.0 20 0
  4 59 59 I TS kworker/4:0H-events_highpri 0.0 0 -20
  4 955 955 S FF irq/70-eno1-TxR 0.0 -51 -
  4 976 976 I TS kworker/4:1H-events_highpri 0.0 0 -20

dpdk-jitter/non-dpdk-jitter$ ./jitter -c 4 -i 300
...
cycles per usec 2100
...
    134422 371682 237260 (113.0 us) 134424 134418 462864 (156.4 us) 1940455424 2692118730 293
    134422 162130 27708 ( 13.2 us) 134422 134418 462864 (156.4 us) 2714304512 2690190402 294
    134422 181470 47048 ( 22.4 us) 134426 134418 462864 (156.4 us) 3488153600 2690214636 295
    134422 134564 142 ( 0.1 us) 134424 134418 462864 (156.4 us) 4262002688 2690160144 296
    1344...

Read more...

Revision history for this message
Tim Martin (nv-tmartin) wrote :

For the non-dpdk jitter.c, I made the following changes to display the maximum time:

--- jitter.c 2022-09-30 12:24:31.000000000 +0000
+++ /nfs/scratch.timothym_sw/dpdk-jitter/non-dpdk-jitter/jitter.c 2022-11-21 19:36:56.443403000 +0000
@@ -40,7 +40,7 @@
 unsigned int iterations = 200;
 int parseable = 0;

-char *title = " Inst_Min Inst_Max Inst_jitter last_Exec Abs_min Abs_max tmp Interval Sample No\n";
+char *title = " Inst_Min Inst_Max Inst_jitter last_Exec Abs_min Abs_max tmp Interval Sample No\n";

 int dummyop(unsigned int loops, unsigned startval)
 {
@@ -272,13 +272,13 @@
                        if(SampleNo > 0)
                        {
                                if(parseable==0){
- printf("%10lu %10lu %10lu (%.1lf us) %10lu %10lu %10lu %13u %10lu %10u\n",
+ printf("%10lu %10lu %10lu (%5.1lf us) %10lu %10lu %10lu (%5.1lf us) %13u %10lu %10u\n",
                                                TransientMin,
                                                TransientMax,
                                                TransientMax-TransientMin,
                                                (double)(TransientMax-TransientMin)/cyclesPerMicroSec,
                                                currentExectime, absoluteMin,
- absoluteMax,tmp,
+ absoluteMax,(double)(absoluteMax-absoluteMin)/cyclesPerMicroSec,tmp,
                                                (endTime-lasttime),
                                                SampleNo);
                                fflush(stdout);

Revision history for this message
Tim Martin (nv-tmartin) wrote :

My jitter measuring program that uses back to back std::chrono::steady_clock::now() calls

Revision history for this message
Tim Martin (nv-tmartin) wrote :

My python script that parses /proc/interrupts and /proc/softirqs

Revision history for this message
Tim Martin (nv-tmartin) wrote :

Here is my comparison of the non-dpdk jitter.c program and my test.cpp program, to try to explain the difference in measured task latency between the two programs:

a) both programs essentially call a timer function repeatedly and look for the max time between the timer calls. jitter.c has a simple workload in between the calls, test.cpp does not. I added the dummy workload to test.cpp, and test.cpp still shows multiple millisecond latency

b) jitter.c uses the TSC for its timer, test.cpp uses std::chrono::steady_clock::now() (which is a front for clock_gettime()). I had previously tried using the TSC and don't see any major difference in behavior. (The reason I avoided the TSC is the need to accurately calibrate it... DPDK has a lot of tricks to do this, or it's possible to casually calibrate it with a usleep).

c) jitter.c runs with SCHED_OTHER default priority (shows as 20 in 'ps'), while test.cpp is run with SCHED_FIFO RT priority 95 (shows as -96 in `ps`). This seemed to be the major difference between the two. When I modified test.cpp to run at SCHED_OTHER, I no longer saw the multi-millisecond stalls and had performance similar to jitter.c.

I'd like to understand better why SCHED_FIFO is producing worse performance -- this seems strange.
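
For reference, a quick way to flip between the two policies without modifying the program is chrt (a sketch; the binary, core and priority are just examples):

./jitter -c 4 -i 300                   # SCHED_OTHER, default priority
sudo chrt -f 95 ./jitter -c 4 -i 300   # SCHED_FIFO, real-time priority 95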

Revision history for this message
Tim Martin (nv-tmartin) wrote :

I think I may have solved the latency problem.

Looking at "cat /proc/$pid_of_jitter_process/sched" I see "nr_involuntary_switches" is incrementing frequently.

I noticed that on my server I was getting frequent ethernet IRQs, and I saw one movable IRQ that wasn't obeying the cmdline irqaffinity setting:

$ ps -eo psr,tid,pid,s,class,comm:50,%cpu,priority,nice -T | sort -g | grep "^ 4"
...
  4 951 951 S FF irq/70-eno1-TxR 0.0 -51 -
...

Borrowing somewhat from the `run` script in git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git:

# Migrate irqs to CPU 0
for D in $(ls /proc/irq)
do
        # only the per-IRQ directories are searchable; skip IRQ 0 (legacy timer), which cannot be moved
        if [[ -x "/proc/irq/$D" && $D != "0" ]]
        then
                echo $D
                # smp_affinity is a CPU bitmask; 1 means CPU 0 only
                echo 1 > /proc/irq/$D/smp_affinity
        fi
done

successfully moved it.

After that, I see the latency improved from my previous runs, e.g.:
dpdk-jitter/non-dpdk-jitter$ ./jitter -c 4 -i 300
...
    134422 134564 142 ( 0.1 us) 134426 134420 165790 ( 14.9 us) 1062862848 2690160090 300

Worst-case improved from ~156us to ~15us...

Note that I do also see the latency spike correlated with ssh-ing into the server that Antonio observes above. When this happens, my parse_proc_interrupts.py script, parsing /proc/interrupts, observes the "CAL" interrupt running on the isolated CPU core 4 that the jitter program is running on.

Delta /proc/interrupts:
irq_name   irq labels                 irq_count   irq_count-baseline   irq_count/sec
CAL:       Function call interrupts   8506        3                    3

Revision history for this message
Tim Martin (nv-tmartin) wrote :

> I'd like to understand better why SCHED_FIFO is producing worse performance -- this seems strange.

I figured this one out: by default, the real-time scheduling policies will only let a task run for ~950 ms out of each second before the "RES" rescheduling interrupt fires.

As root, do this to disable that behavior:

echo -1 > /proc/sys/kernel/sched_rt_runtime_us

NB: This can make the system non-responsive if the application never yields... If the software design intends only one thread per CPU core anyway, it's probably OK to run with SCHED_OTHER. YMMV.
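
If anyone wants to make that persistent across reboots, a sketch via sysctl.d (the file name is arbitrary):

echo 'kernel.sched_rt_runtime_us = -1' | sudo tee /etc/sysctl.d/99-rt-throttling.conf
sudo sysctl --system   # apply now; the file is re-read at boot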

> observes the "CAL" interrupt running on the isolated CPU core

From various other googling, the CAL function call interrupt is apparently related to smp_call_function(), which runs a kernel function on every online CPU, e.g. for TLB flushing.

Revision history for this message
Antonio (adibaccoks) wrote :

Hi Tim,

your analysis is really excellent.

You discovered that ssh-ing into the server causes the CAL interrupt to be triggered; that is good info for Joseph to understand why it is happening and whether this behaviour can be changed.

Unfortunately, this CAL interrupt is stealing the CPU for a very long time (from 5 to 15 usecs) and our application cannot work with such interruptions.

Tim, if you like, you can contact me at <email address hidden> and we can share our findings.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Hi Tim and Antonio,

Sorry for the delay; my responses may have been delayed due to the holiday.

Thank you both for the detailed information. I will review your analysis, Tim.

For RT throttling when running a task on an isolcpus set, 'sysctl kernel.sched_rt_runtime_us=-1' turns it off.

RT throttling limits the execution time of real-time tasks per period. Turning it off allows an RT task to dominate a CPU.

I will see if I can reproduce the ssh-induced jitter and latencies, Antonio.

Once I do that, I'll investigate further to see if there is a way to tune it out and understand the CAL interrupt better. I know the boot CPU cannot run in nohz_full or the isolation modes — at least one CPU needs to continue to receive interrupts and do the necessary housekeeping. This may or may not be related. There may be some bug or specific configuration that needs to be done.

Revision history for this message
Antonio (adibaccoks) wrote :

I can confirm that, in terms of interrupts, the isolated cores receive no interrupts except the CAL interrupt whenever the box is SSH-ed into from another machine. From time to time I also see IWI and LOC on an isolated core.

Below I check /proc/interrupts every 5 seconds:

1 -----------------
CAL cpu 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, .
2 -----------------
3 -----------------
....
7 -----------------
8 -----------------
CAL cpu 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, .
9 -----------------
10 -----------------
......
39 -----------------
40 -----------------
LOC cpu 46, .
IWI cpu 46, .
41 -----------------
42 -----------------
43 -----------------

Revision history for this message
Antonio (adibaccoks) wrote :

LOC and IWI happen every 600 secs (more or less) on consecutive CPUs.

Revision history for this message
Antonio (adibaccoks) wrote :

Any news about why the CAL interrupt is delivered to every core even for a simple operation like an SSH connection?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

The CAL interrupts are remote function call interrupts; they are delivered as inter-processor interrupts (IPIs). I'll do some research over the next day or two to see if there are some tuning options to bring those down.

Revision history for this message
Jay Vosburgh (jvosburgh) wrote :

As Joe says, CAL interrupts are interprocessor function calls from within the kernel. There are a significant number of places that make these calls (roughly 80 to 90). I suspect the source is most likely a TLB invalidation, but the best way to find out is to enable some instrumentation.

I'm attaching a shell script (smp_call_func.sh) that will trace calls that create such function call interrupts; it will print a cpumask and call trace for each such invocation. There will likely be a fair amount of noise here related to calls that aren't for "all cpus."

Please run the script (as root), it takes one argument, the filename to store the trace output. With the script running, reproduce the "CAL to all CPUs" event; we can then examine the trace output to hopefully determine what's going on.
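
For readers who don't have the attachment handy, a rough sketch of the general approach (this is not Jay's actual script; it assumes the ftrace kprobe interface and x86_64 register conventions, and must be run as root):

#!/bin/bash
# trace smp_call_function_many_cond() with its cpumask argument and a kernel stack trace
OUT=${1:?usage: $0 <output-file>}
T=/sys/kernel/debug/tracing
cleanup() {
        echo 0 > $T/events/kprobes/smpcall/enable
        echo '!stacktrace' > $T/events/kprobes/smpcall/trigger
        echo '-:smpcall' >> $T/kprobe_events
}
trap cleanup EXIT
# %di holds the first argument (the cpumask pointer) on x86_64; +0(...) dereferences it
echo 'p:smpcall smp_call_function_many_cond cpumask=+0(%di):x64 func=%si:x64' >> $T/kprobe_events
echo stacktrace > $T/events/kprobes/smpcall/trigger
echo 1 > $T/events/kprobes/smpcall/enable
echo "tracing; press Ctrl-C to stop"
cat $T/trace_pipe > "$OUT"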

Revision history for this message
Jay Vosburgh (jvosburgh) wrote :
Revision history for this message
Antonio (adibaccoks) wrote :

Please find attached the output of the script you suggested.

I see cpumask=0xffffffffffffff every time I try to ssh into the machine running Ubuntu realtime.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

As a test, can you try disabling timer_migration, if you don't already have it disabled? It can be done with:

echo 0 > /proc/sys/kernel/timer_migration

This mostly helps with multi-socket machines, but it's worth a try here as well.

Revision history for this message
Antonio (adibaccoks) wrote :

I tried it, but an SSH connection still causes a spike in the interruptions.

Revision history for this message
Jay Vosburgh (jvosburgh) wrote :
Download full text (3.2 KiB)

Summary:

 Looking through the stack traces, I see 59 examples of the
all-ones cpumask. Ultimately, all of these events are related to
insertion or removal of eBPF objects to or from the kernel, which cause
all-CPU TLB flushes for various reasons. These TLB flushes are sent to
all CPUs via the CAL interrupt.

 For sshd, it is adding an eBPF filter for seccomp. A brief
inspection of the sshd source suggests that the seccomp filter is a
compile time setting, and cannot be disabled at runtime.

 I'm unsure precisely what snap-confine is using eBPF for, but
suspect it's something similar.

Details:

 At the lowest level, these are all performing TLB flushes for all
CPUs. There are two functions being used:

func=0xffffffffa30966a0

This is do_flush_tlb_all, which flushes the entire TLB for a CPU

func=0xffffffffa3097c10

This is __cpa_flush_tlb, which flushes the CPU TLB for a specific set of
pages.

 The choice between the two depends on the number of pages being
flushed; above a certain threshold, the kernel flushes the entire TLB
rather than performing a series of single page TLB invalidations.

 The call patterns are:

(a) sshd adding an eBPF program for seccomp, via prctl(PR_SET_SECCOMP).
In this case, the TLB flush arises when the page in question previously
had an alias mapping (that other CPUs may still have cached in their TLB):

            sshd-62862 [000] ....2.. 516588.693693: smp_call_function_many_cond: (smp_call_function_many_cond+0x0/0x2e0) cpumask=0xffffffffffffff func=0xffffffffa30966a0
            sshd-62862 [000] ....2.. 516588.693699: <stack trace>
 => smp_call_function_many_cond
 => flush_tlb_kernel_range
 => __purge_vmap_area_lazy
 => _vm_unmap_aliases.part.0
 => vm_unmap_aliases
 => change_page_attr_set_clr
 => set_memory_ro
 => bpf_int_jit_compile
 => bpf_prog_select_runtime
 => bpf_prepare_filter
 => bpf_prog_create_from_user
 => seccomp_set_mode_filter
 => do_seccomp
 => prctl_set_seccomp
 => __do_sys_prctl
 => __x64_sys_prctl
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

(b) This is similar to (a), but here involves setting pages to read-only
or executable, which in turn requires a TLB flush. Note that (a) and (b)
occurred sequentially (about 50 usec apart), most likely in the same
actual call to change_page_attr_set_clr:

            sshd-62862 [000] ....2.. 516588.693745: smp_call_function_many_cond: (smp_call_function_many_cond+0x0/0x2e0) cpumask=0xffffffffffffff func=0xffffffffa30966a0
            sshd-62862 [000] ....2.. 516588.693746: <stack trace>
 => smp_call_function_many_cond
 => flush_tlb_all
 => cpa_flush
 => change_page_attr_set_clr
 => set_memory_ro
 => bpf_int_jit_compile
 => bpf_prog_select_runtime
 => bpf_prepare_filter
 => bpf_prog_create_from_user
 => seccomp_set_mode_filter
 => do_seccomp
 => prctl_set_seccomp
 => __do_sys_prctl
 => __x64_sys_prctl
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

(c) snap-confine adding an eBPF program directly via a bpf() system call.
This has two variants, largely similar to those for (a) and (b) insofar as
the reason for the TLB flush.

(d) a kworker thread performing a deferred free of an eBPF object. When
the memory is unmapped this ...

Read more...

Revision history for this message
Antonio (adibaccoks) wrote :

Hi Jay,

I'm amazed by your analysis. Thank you very much. Do you think this "interruption" can at least be mitigated?

Thank you,
Antonio.

Revision history for this message
Jay Vosburgh (jvosburgh) wrote :

Antonio,

Well, getting rid of these interrupts is tricky, and depends on how far you're willing to go.

First, the TLB flush operations are necessary for the eBPF insert / removal steps, to ensure that all CPUs have a consistent view of memory. So, changing eBPF to not do the TLB flush isn't really a viable option.

That leaves changing the system to not use the eBPF objects.

I don't see a global "disable BPF" switch anywhere. There is a sysctl to disallow unprivileged users from using the bpf() system call, kernel.unprivileged_bpf_disabled, but sshd is using prctl(), which is not subject to the sysctl's limitation. The sysctl, by the way, defaults to disabled.

The Ubuntu realtime kernel is built with CONFIG_BPF_JIT_ALWAYS_ON, meaning that it's not possible to disable the eBPF JIT. This suggests one hack to disable eBPF: setting the JIT memory limit to something absurdly low, e.g.,

sysctl -w net.core.bpf_jit_limit=1

This causes attempts to load eBPF objects to fail, unless the requester is privileged. It appears that sshd goes out of its way to become unprivileged, and ignores the failure to load the seccomp filter. I see no "all CPUs" TLB flushes from sshd after making this change. However, it does break other things; e.g., the "man" command fails, apparently because it does the same thing (or is restricted by apparmor), except that it does not ignore the seccomp filter failure.

This suggests that it should be possible to disallow prctl for sshd via a properly crafted apparmor profile. It appears to have the ability to do so, although I'm not sufficiently familiar with apparmor profiles to write one without some research.

An alternative for just sshd, that's relatively straightforward: compile and install an sshd that doesn't use the seccomp sandbox, e.g., configure it with --with-sandbox=no (or possibly =rlimit). Note that this would be disabling security functionality in sshd (and taking on the burden of updates for sshd, etc), so is perhaps not viable, depending on your usage needs. This isn't something I would recommend undertaking without due consideration.
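
A sketch of that build, assuming the portable OpenSSH source tree (weigh the security trade-off described above before doing this):

./configure --with-sandbox=no --prefix=/usr/local
make && sudo make install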

Another alternative would be to recompile the kernel without CONFIG_BPF, although I'm not sure if there are dependencies in the operating system on having BPF. This also has the same concerns with updates and so forth as building your own sshd.

For the events from snap-confine, I do not see those at all, so it may have something to do with the install. I'm testing on an ordinary, if a bit old, server with the standard Ubuntu 22.04 server image.

Revision history for this message
Antonio (adibaccoks) wrote (last edit ):

Hi Jay,

I found another case where the CAL, RES, LOC, and IWI interrupts kick in regularly:

LOC cpu 3, 4, 29, .
IWI cpu 3, 4, 29, .
RES cpu 3, 4, .
CAL cpu 3, 4, .

I'm attaching the output of your script.
Is it possible to understand what is going wrong?

An additional question: I know it is not advisable to disable interrupts, but if I wanted to do it on a specific core, would it be possible to isolate that core completely from any interrupt, and specifically from the CAL interrupt? I don't need TLB flushes on that core; the virt-to-phys mappings are set up once and for all.

Revision history for this message
Jay Vosburgh (jvosburgh) wrote :

Antonio,

The script output you attached didn't capture any actual events (the events in there are false positives generated when setting up the tracepoints themselves, i.e., related to initializing the instrumentation).

If you're experiencing TLB invalidations, the TLB pseudo-interrupt in /proc/interrupts should increment with the CAL interrupt (as CAL is the mechanism to tell other CPUs to perform TLB invalidations).

LOC are the APIC timer interrupts, this suggests that something is scheduling a timer to fire on that CPU at a future time; the interrupt initiates the callback for the timer.

IWI are IRQ work interrupts, basically, a mechanism to allow interrupt handlers to schedule activity to run later, outside of interrupt context. Also used for some other not-really-interrupt things, e.g., RCU, eBPF and the scheduler all use IWI for various things. On RT, these should be handled in the irq_work/[CPU #] kernel threads, although there's still an interrupt to trigger them.

RES are scheduler interrupts, telling a CPU to run the process scheduler logic immediately. I'll hazard a guess that these aren't happening to schedule regular user processes, but rather that some kernel task needs to run (perhaps rcuc, a per-CPU task for RCU housekeeping).

It's possible to instrument these specifically. It's pretty straightforward to instrument when they happen, but harder to figure out the why.
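
One straightforward way to at least timestamp them per CPU is the x86 irq_vectors tracepoints (a sketch; run as root):

T=/sys/kernel/debug/tracing
echo 1 > $T/events/irq_vectors/local_timer_entry/enable   # LOC
echo 1 > $T/events/irq_vectors/reschedule_entry/enable    # RES
echo 1 > $T/events/irq_vectors/irq_work_entry/enable      # IWI
cat $T/trace_pipe   # each line shows the CPU and timestamp of the interrupt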

As to your other question, no, there is no mechanism to entirely disable interrupts on specific CPUs, or to exempt CPUs from TLB invalidations. To my knowledge, complete task isolation is not a feature of the Linux RT implementation today, so creating a configuration in which a CPU is literally never interrupted will be difficult.

That said, some of the interrupts you list do seem a bit odd, e.g., getting a RES interrupt on CPUs in the isolcpus= list.

Revision history for this message
Antonio (adibaccoks) wrote :

Thank you.

The RES interrupt is totally unexpected because it is kicking in on isolated CPUs.
I was able to remove those RES interrupts by giving this command (suggested by Joseph):

sudo bash -c 'echo 0 > /proc/sys/kernel/timer_migration'

Revision history for this message
Antonio (adibaccoks) wrote :

Hi all,

I think you can close this bug.
Even if we avoid TLB flushing on the isolated cores, we are still struggling with LOC interrupts being delivered to isolated cores, and this is not compatible with our application's real-time requirements.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the feedback, Antonio! At a minimum, this is a good bug to reference for tuning :-) Thanks for all your input as well, Jay!

The LOC interrupts being delivered to isolated cores might be a great topic for a research paper. I plan on doing some research on this over time.

Thanks again!

Changed in ubuntu-realtime:
status: In Progress → Fix Released
Revision history for this message
Antonio (adibaccoks) wrote :

I don't know if you are interested in understanding why the LOC interrupts are being issued on isolated CPUs, but if you are, I will be happy to help. I still have the server where this is happening.

Revision history for this message
Paweł Żak (pawelzak) wrote (last edit ):

Hello Joseph and Jay, your comments in this bug helped me make progress on my own issue hunting down interrupts (for a tickless ubuntu-lowlatency kernel), but now I'm stuck.

Can you take a look at https://bugs.launchpad.net/ubuntu/+source/linux-lowlatency/+bug/2023391 ?
Thank you in advance!
