Does it seem correct to say that the general intention of irqbalance with respect to system performance is to improve throughput (translating in some cases to a more responsive system) at the cost of increased processing latency? If so, then it should be considered and tuned with latency-vs-throughput usage scenarios in mind. E.g., digital audio workstations and gaming machines might disable it. But as Steve says, we don't have any data on the tradeoffs: how much more throughput/responsiveness, for what cost in latency, under what configurations?

All I can find is a recommendation not to use it on CPUs with 2 or fewer cores, as the overhead is said to be too high (which, according to the above, would translate to "an unreasonable amount of latency for relatively little or even no throughput gain"), but even then, are we talking physical or logical/virtual cores? It seems like the more cores a system has, the more trivial the overhead of running irqbalance per unit of performance/responsiveness gained. Is there a threshold number of cores beyond which something like irqbalance becomes strongly recommended for general computing applications? But even then, just like power scaling, I can imagine it might still add undesirable or even critical latency in applications that are highly latency sensitive (e.g., when milliseconds or fractions of milliseconds matter).

This website gave me some clarity on the theory and purpose: https://www.baeldung.com/linux/irqbalance-modern-hardware

There is another dimension, related to one of the reasons Apple became known as the "AV professional's workstation" for so long: apparently, through a fascinating historical accident, the multimedia system engineers gained enough influence in the company to tune the default system configuration to prioritize latency, and then system responsiveness, over throughput (and even to accept some compromises in system security), so that applications needing both low and consistent (i.e., low-jitter) latency required minimal system configuration. As it turned out, they got away with this for so many years in part because the growing "AV professional wannabe" crowd, who mostly used the system for general (rather than latency-sensitive) applications, didn't really notice or care about the hit to throughput or the security exposure. Noticeable in benchmarks, but not in real life.

My first point in saying this is that benchmarks don't necessarily tell us what will give the greatest benefit to the greatest number of users with minimal or no reconfiguration. E.g., who cares if it takes even 10% longer to transcode an AV file or compile code (on the same hardware configured differently) if it means you can also run latency-sensitive apps at a consistently low (low-jitter) latency without having to reconfigure anything, while keeping a generally responsive system? People often just walk away from those jobs anyway, either physically (a smoke or coffee break) or figuratively (task switching), in which case a responsive system is a higher priority than crunching the numbers slightly faster.

My second point is that I think obsession over benchmarking risks losing the forest for the trees and often doesn't account for anything close to real-world performance optimization.
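To make "jitter" concrete as something a benchmark could actually report alongside throughput, here is a rough sketch (plain Python, purely illustrative, not taken from any existing benchmark suite): it requests a fixed 1 ms timer period and records how far each wakeup drifts from its deadline. A serious measurement would use something like cyclictest or a real audio callback, but the shape of the numbers is the point.

# Toy jitter probe: request a fixed 1 ms period and record how late each
# wakeup actually is. The spread (stdev) of those errors is the "jitter"
# that throughput-only benchmarks never show.
import statistics
import time

PERIOD_NS = 1_000_000   # 1 ms, roughly audio-callback territory
SAMPLES = 2000

errors_us = []
deadline = time.monotonic_ns() + PERIOD_NS
for _ in range(SAMPLES):
    remaining = deadline - time.monotonic_ns()
    if remaining > 0:
        time.sleep(remaining / 1e9)
    # Positive = woke up late; this is the per-wakeup latency error.
    errors_us.append((time.monotonic_ns() - deadline) / 1000)
    deadline += PERIOD_NS

print(f"mean error : {statistics.mean(errors_us):8.1f} us")
print(f"jitter     : {statistics.pstdev(errors_us):8.1f} us (stdev)")
print(f"worst case : {max(errors_us):8.1f} us")

Run that once on an idle system and once with a transcode or compile in the background, with and without irqbalance, and you are at least comparing the dimension a DAW user actually feels, instead of only the wall-clock time of the background job.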
But even then it could be argued that this is only because we fail to consider important parameters in common performance benchmarks, such as "responsiveness", "jitter", and "latency", alongside obsessing over throughput and, to a lesser extent, power management.

For me, the core of this question means finally coming to clarity about what "optimal balance" means for the widest variety of desktop and server applications, just as Apple did accidentally a few decades ago with its client systems. I think this requires considering a variety of factors instead of an unrealistically narrow idea of "performance" that does not factor in real-world user experience: e.g., the idea that most users will appreciate improvements in latency and responsiveness without much noticing the cost in throughput, until someone starts obsessing over throughput benchmarks whose differences are relatively minute as far as our intuitive or subconscious experience is concerned. Less user frustration from few or no buffer overruns or perceived interface hiccups again draws on concepts such as "reliability" and "default breadth of utility".

FWIW, I think a lot of throughput obsession is about internalized and institutionalized planned obsolescence. It's the primary benchmark of OEM system performance, and a fairly lazy way to measure performance at that. 4 min 30 sec for a transcoded file would be considered hugely different from 5 min 30 sec on the same hardware, but for the average user, who would just take a smoke break or switch tasks, it doesn't matter as long as the system remains responsive and functional. And you'll never get to the transcoding in the first place if the system keeps recording buffer xruns that ruin a file being processed or recorded in real time, because the system latency is too high or too variable for the sort of performance and responsiveness needed.

Another way to put this: there is a social-economic dimension of performance and a psychological dimension of performance. While the psychological one is arguably more important and humane, it is often at odds with the social-economic dimension, which seeks to sell more new systems because they are "faster" (in a very marginal and narrowly defined sense). I could easily be out of the loop, but I just don't see this stuff considered often enough in discussions of performance optimization.

Ethan

On Sat, Jan 6, 2024, 20:20 Steve Langasek