lowlatency kernel causes sidechain thread starvation

Bug #1194543 reported by RJ Skerry-Ryan
This bug affects 2 people

Affects     Status        Importance  Assigned to
Mixxx 1.11  Won't Fix     High        RJ Skerry-Ryan
Mixxx 2.0   Fix Released  High        RJ Skerry-Ryan

Bug Description

mibofra on IRC reported an issue with recording/streaming that only happens when using the lowlatency kernel.

Regular bits of silence can be heard on both recordings and shoutcast.

<mibofra> audio api : jack audio connection kit
<mibofra> rate : 48000Hz
<mibofra> latency: set on jackd (I've set 21,3)
<mibofra> Linux spf-laptop 3.8.0-20-lowlatency #14-Ubuntu SMP PREEMPT Mon May 13 20:17:10 UTC 2013 i686 i686 i686 GNU/Linux
<mibofra> I've the 3.8.0-25 and lower but they have problems with realtime jackd and usb audiocards
<mibofra> I'm using an aureon 7.1 usb now (terratec)

I suspect this is a thread-starvation issue where the sidechain FIFO overflows before the sidechain can process it. It is most likely caused by the LowPriority setting on the sidechain thread. I'm asking mibofra to test without giving the thread a priority (or maybe giving it HighPriority).
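
For context, a minimal sketch of the knob in question (QThread and its priority enum are real Qt API; SideChainThread is just an illustrative stand-in for the sidechain worker):

#include <QThread>

// Illustrative stand-in for the sidechain worker.
class SideChainThread : public QThread {
  protected:
    void run() override {
        // drain the sidechain FIFO, encode, record, ...
    }
};

// The suspected culprit -- starting the worker at low priority:
//     sideChain->start(QThread::LowPriority);
// versus giving it a higher one:
//     sideChain->start(QThread::HighPriority);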

Tags: shoutcast
Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

I've seen multiple people on IRC complaining about broadcasting issues using the lowlatency kernel, so I'm marking this confirmed.

Changed in mixxx:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

How do people feel about removing our LowPriority setting (which probably means the sidechain is often pre-empted by the Qt main thread)? At normal priority it's still lower than the engine thread.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

I would go a little further. Strictly speaking, the broadcast thread is a realtime thread: it has a fixed deadline that must not be missed in any case. For radio DJs this is more important than the soundcard output.

I have not looked at the source, but can't we tie it to the engine thread? That would save us unnecessary thread switches.

For me this issue has the same root cause as https://bugs.launchpad.net/mixxx/+bug/1203249.

I am not sure how this applies to the recording use case, but the same arguments hold if the recorded file is the main product and we see dropouts in it.
Dropouts are a no-go here as well.

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

Recording is really not time critical. I see this as an issue with the sidechain particularly on kernels that are stricter about thread scheduling. Our current thread priorities are haphazard. I think LowPriority is the wrong priority for the sidechain -- I think it should be HighPriority or HigherPriority (maybe ranking above control input).

RE: moving the EngineShoutcast code into the engine.... have you looked at that code? :) It sleep()s!

Revision history for this message
Daniel Schürmann (daschuer) wrote :

I think it is all about the definition of time-critical and realtime. Realtime, for me, is always when we have a hard deadline that must not be broken. Even if we have a very big buffer, it must not overflow so that we lose a single sample. All processes in the audio chain are of this kind.

I just had a brief look at the engine sidechain. And yes, there is much room for optimization.

The engine thread fills the sidechain buffer with realtime priority, and it is processed by the sidechain with low priority.
Such conditions are in general always a candidate for over/underflows. Once we have removed all non-time-critical tasks from the sidechain process function, there might be no reason for a lower priority than the engine thread itself.

You may argue that we can use the buffer to bridge over CPU-intensive phases, but what is the nature of the task we wait for? And who guarantees that we will ever have the time to drain the buffer to zero?

But of course we need the buffer to bridge things other than low CPU resources, like waiting for network access or the disk. But is it at the right level? Is there already a cache buffer inside the OS? So there is a need to watch out for the right moment to use the CPU. The only room for saving CPU is to match the input memory chunks to the output memory chunks and skip sidechain scheduling until enough samples are ready to process. But this would make all scheduling calculations much harder.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

Without further optimizations HigherPriority should be fine.

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

MP3 encoding isn't something we should be doing in the main callback since it is not a result of the callback -- PortAudio does not need the results of the sidechain. We fill the sidechain FIFO and then when it reaches a threshold we signal the sidechain thread to wake up and process the samples.
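
A minimal sketch of that pattern (illustrative C++ only -- SampleFifo, kSignalThreshold, and the function names are stand-ins, not the actual Mixxx classes):

#include <QSemaphore>
#include <algorithm>
#include <atomic>
#include <vector>

// Simplistic single-producer/single-consumer ring buffer.
class SampleFifo {
  public:
    explicit SampleFifo(int size) : m_data(size), m_size(size) {}
    int readAvailable() const {
        return (m_write.load() - m_read.load() + m_size) % m_size;
    }
    void write(const float* in, int n) {      // producer: audio callback
        int w = m_write.load();
        for (int i = 0; i < n; ++i) m_data[(w + i) % m_size] = in[i];
        m_write.store((w + n) % m_size);
    }
    int read(float* out, int n) {             // consumer: sidechain thread
        n = std::min(n, readAvailable());
        int r = m_read.load();
        for (int i = 0; i < n; ++i) out[i] = m_data[(r + i) % m_size];
        m_read.store((r + n) % m_size);
        return n;
    }
  private:
    std::vector<float> m_data;
    const int m_size;
    std::atomic<int> m_read{0}, m_write{0};
};

SampleFifo g_fifo(65536);
QSemaphore g_wake;
const int kSignalThreshold = 65536 / 5;       // illustrative threshold

// Producer side, inside the PA callback:
void pushToSideChain(const float* buffer, int numSamples) {
    g_fifo.write(buffer, numSamples);
    if (g_fifo.readAvailable() >= kSignalThreshold) {
        g_wake.release();                     // non-blocking wake-up
    }
}

// Consumer side, the sidechain thread's loop:
void sideChainLoop() {
    std::vector<float> chunk(kSignalThreshold);
    for (;;) {
        g_wake.acquire();                     // sleep until signaled
        int n = g_fifo.read(chunk.data(), static_cast<int>(chunk.size()));
        // encode `n` samples for shoutcast / append them to the recording
    }
}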

The sidechain still has a realtime nature for shoutcast (recording is not realtime -- we can take as long as we want given enough buffering -- i.e. we could just dump the samples to disk in the sidechain thread and then encode them later) but it can process much less frequently since we deliver encoded packets to the shoutcast server much less frequently than the callback occurs. The shoutcast streaming setups include many layers of buffering so there is much less of a hard deadline to hit.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

Ah, I see. We have a trigger at 1/5 of the buffer (#define SIDECHAIN_BUFFER_SIZE 65536).
That trigger threshold corresponds to about 150 ms @ 44100 Hz.
The recording feature calls write(), which is potentially a blocking function.
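
For reference, the arithmetic behind that 150 ms figure (assuming interleaved stereo samples):

65536 samples / 5 = 13107 samples per trigger
13107 samples / 2 channels ≈ 6553 frames
6553 frames / 44100 frames/s ≈ 148.6 ms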

So you are right: a priority below realtime is fine for now and a suitable solution for 1.12.
Maybe we should start with normal priority so we do not just shift the problem.

---

But I am stressing this topic so much because I am thinking about some changes in scheduling:

> PortAudio does not need the results of the sidechain.

That is true, but I would like to have one pacemaker task within Mixxx that is the time source for everything related to the audio stream.
This means all audio threads should be finished before the next PA callback is issued. All additional buffering only consumes CPU but does not save time: the work has to be done anyway, and if it is delayed we have a bigger amount of work next time.
This way the user can tune the whole Mixxx scheduling by setting the buffer size.

Some related bullets:
* It does not make a difference in which task the MP3 encoding is done, if we assume that encoding the same data in two steps takes the same time as processing it in one step. We need the CPU time anyway. True? LAME call overhead vs. thread-switch overhead.
* If we assume a 10 ms buffer size, the sidechain is processed every 15 cycles. This produces a CPU peak, and it may consume CPU exactly when it is needed elsewhere.
* If we enable both recording and broadcasting, recording may block broadcasting because of the potentially blocking write().
* Is there any blocking function in the shoutcast path? Can we call it from the PA callback?
* Is memory-mapped I/O non-blocking? http://my.safaribooksonline.com/book/operating-systems-and-server-administration/linux/0596009585/advanced-file-i-o/mapping_files_into_memory. Maybe we should use it anyway (see the sketch after this list).
* What is the right way to optimize for multi-core processing?
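
On the memory-mapped I/O bullet, a minimal sketch of the idea (plain POSIX mmap; the path and sizes are illustrative, error handling is trimmed, and note that writing through the mapping can still block on page faults, so it is not automatically realtime-safe):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>

int main() {
    const size_t kFileSize = 1 << 20;            // illustrative: 1 MiB
    int fd = open("/tmp/recording.raw", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, kFileSize);                    // reserve the space up front
    void* map = mmap(nullptr, kFileSize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    float samples[256] = {};                     // pretend audio chunk
    std::memcpy(map, samples, sizeof(samples));  // "write" via the mapping

    msync(map, kFileSize, MS_ASYNC);             // ask the kernel to flush
    munmap(map, kFileSize);
    close(fd);
    return 0;
}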

What do you think about this?

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

> That is true, but I like to have one pace maker task within Mixxx, that is a time source for all audio stream related things.

It sounds like your main motivation to merge the sidechain with the main callback is to get deterministic processing of the sidechain with the audio callback. My argument is that this does not need to be deterministic:

1) Recording does not care whether the samples are processed on time (as long as the thread wakes up before overflow).
2) Shoutcast does not care whether the samples are shipped to the server on the timescales we are talking about. There is buffering at the client and at the server so cutting ~100ms of latency out will have very little user-visible effect.

If the sidechain wakes up on time before overflow, then everything is fine. If there is a buffer overflow then either our scheduling is off or the user's machine is too weak to operate at the given latency.

In this case (the bug reported by mibofra), our scheduling is off. At LowPriority on a low-latency kernel, once the sidechain thread hits the run queue it is always in line behind the main thread, input threads, and vsync thread -- a whole lot of computation that fires frequently. So we should change the scheduling so that the sidechain still runs in time when the scheduler is stricter about priorities.

The scheduling of the sidechain is driven by the callback because the signal to wake it up is directly dependent on the number of samples processed by the callback.

Whether or not the scheduling of the sidechain processing is regular depends on the scheduling latency of the thread waking up once we signal it from the callback.

I added a timer to measure this:

Debug [Main]: Stat("EngineSideChain scheduler latency","count=397,sum=5.83277e+08ns,average=1.46921e+06ns,min=4083ns,max=5.71935e+08ns,variance=8.23872e+14ns^2,stddev=2.87032e+07ns")

So there is an average time to schedule of 1.5ms with standard deviation of 28ms. (I wish we measured 90th / 99th percentile variance -- that would give another useful description).
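
(The measurement idea itself is simple -- this is a sketch of the principle, not our actual Stat/Timer code: timestamp when the callback signals the sidechain, timestamp again when the worker wakes up, and aggregate the difference.)

#include <atomic>
#include <chrono>

std::atomic<long long> g_signaledAtNs{0};

long long nowNs() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
            steady_clock::now().time_since_epoch()).count();
}

// In the callback, just before waking the sidechain:
//     g_signaledAtNs = nowNs();
// First thing in the sidechain thread after waking:
//     long long latencyNs = nowNs() - g_signaledAtNs;
//     // feed latencyNs into count / sum / min / max / variance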

Due to the buffering involved in 1) the encoder, 2) the shoutcast server, and 3) the shoutcast clients, I really don't think this variable added latency matters much.

Moreover, the callback is really meant to produce audio for the soundcard; the moment we can satisfy that deadline, I think we should. There are also timing constraints involved with shoutcast servers that I don't understand, and we need someone to take a close look at them. I'm not even sure pushing new samples every latency period would work. Maybe someone who knows more about these protocols could comment.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

Thank you for your detailed reply!

> It sounds like your main motivation to merge the sidechain with the main callback is to get deterministic processing of the sidechain with the audio callback.

Yes, I want to make things more deterministic, to finally get more benefit from multi-core processing. But of course only after a detailed analysis, so that we are sure we are doing things right.

> My argument is that this does not need to be deterministic:
> 1) Recording does not care whether the samples are processed on time (as long as the thread wakes up before overflow).

I fully agree

> 2) Shoutcast does not care whether the samples are shipped to the server on the timescales we are talking about. There is buffering at the client and at the server so cutting ~100ms of latency out will have very little user-visible effect.

Yes

> If the sidechain wakes up on time before overflow, then everything is fine. If there is a buffer overflow then either our scheduling is off or the user's machine is too weak to operate at the given latency.

Yes.

Your analysis is very useful for this case, and I think we have a common understanding of it. If not, please correct my following thoughts:

* Currently the sidechain takes about 1.5 ms on your setup to process a 150 ms buffer -> factor 0.01.
I wonder what happens to the factor if we run with half the buffer size. That would give us a clue whether there is room for optimization.
If the factor does not increase much, we can think about scaling the buffer size to the PA buffer size, or merging some work into the engine thread. This would finally avoid scheduling a high-priority thread at random moments. As you correctly said before, this is no benefit for the sidechain itself, but IMHO it is for all other threads that have to be scheduled between the PA callbacks.

* I think we should separate the potentially blocking write() for recording from the probably non-blocking shoutcast path. One possible solution is to process shoutcast from the PA thread itself and do recording in its own thread.

* I would like to use the benefit of multi-core CPUs. Currently Mixxx uses all 4 cores of my CPU, but this is the result of random scheduling.
I wonder if we can gain lower latency if we parallelize the PA callback across the number of cores. The for loops in enginemaster.cpp are ideal candidates. Or we could process the right channel and the left channel in different threads. For this approach deterministic scheduling is very helpful to tune things up.

I have a two-step idea for this:
1. move all PA-buffer-size-dependent tasks to the PA thread
2. parallelize the PA thread

If it turns out that these are bad ideas in this particular case, let's try to find a common understanding for such problems. Do you think my deterministic processing ideas are valid in general?

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

Ah, I did some more research. I think my variance numbers above (28ms) are misleading. The sidechain is scheduled much more reliably than that.

I built a global timeline tracker for Mixxx and instrumented all of our threads for various events we care about. I've attached a timeline dump.

Columns are:

- time since bootup in nanoseconds
- duration since previous line
- duration since last START of the same event string
- duration since last END of the same event string
- event identifier string

I'm working on a nice graphing tool for this and auto-extracting some details of various stats (the distribution of the frequency of events, etc.).

But for now, if you slice this to just EngineSideChain start events:

$ cat timeline.csv | egrep "START,EngineSideChain$"

You'll see the scheduling is very deterministic on my machine:

6401813,+2.55468ms,+0ns,+0ns,START,EngineSideChain
2267856463,+8.862us,+2.26145s,+2.26143s,START,EngineSideChain
2871676046,+37.298us,+603.82ms,+603.534ms,START,EngineSideChain
3475332555,+31.977us,+603.657ms,+603.569ms,START,EngineSideChain
4078993778,+9.072us,+603.661ms,+603.589ms,START,EngineSideChain
4682823651,+51.679us,+603.83ms,+603.768ms,START,EngineSideChain
5286533296,+27.559us,+603.71ms,+603.639ms,START,EngineSideChain
5890151625,+1.944us,+603.618ms,+603.545ms,START,EngineSideChain
6494206651,+48.318us,+604.055ms,+603.983ms,START,EngineSideChain
7097950042,+37.408us,+603.743ms,+603.684ms,START,EngineSideChain
7701548968,+7.581us,+603.599ms,+603.527ms,START,EngineSideChain
8305264076,+24.455us,+603.715ms,+603.652ms,START,EngineSideChain
8908953748,+6.893us,+603.69ms,+603.629ms,START,EngineSideChain
9512699813,+4.376us,+603.746ms,+603.675ms,START,EngineSideChain
10116375339,+3.79us,+603.676ms,+603.597ms,START,EngineSideChain
10720137282,+2.313us,+603.762ms,+603.683ms,START,EngineSideChain
11323834895,+6.158us,+603.698ms,+603.635ms,START,EngineSideChain
11927510878,+6.229us,+603.676ms,+603.61ms,START,EngineSideChain
12531278907,+9.061us,+603.768ms,+603.692ms,START,EngineSideChain
13135007774,+21.234us,+603.729ms,+603.666ms,START,EngineSideChain
13738704524,+5.195us,+603.697ms,+603.631ms,START,EngineSideChain
14342497032,+40.417us,+603.793ms,+603.721ms,START,EngineSideChain
14946184124,+354ns,+603.687ms,+603.615ms,START,EngineSideChain
15549856644,+8.811us,+603.673ms,+603.571ms,START,EngineSideChain
16153767368,+20.116us,+603.911ms,+603.849ms,START,EngineSideChain
16757384342,+24.97us,+603.617ms,+603.551ms,START,EngineSideChain
17361117902,+26.223us,+603.734ms,+603.676ms,START,EngineSideChain
17964835399,+13.324us,+603.717ms,+603.659ms,START,EngineSideChain
19041821439,+370.964ms,+1.07699s,+1.07692s,START,EngineSideChain

The third comma-separated field -- the duration since the last START of the same event -- is the important one here. Ignoring the first and last events (because they are the sidechain worker starting up and shutting down) you can see that it occurs pretty solidly every 603 or 604 ms.

If you add in callback events, you can see how the sidechain is driven:

4697256496,+5.4529ms,+11.5716ms,+11.449ms,START,SoundDevicePortAudio::callbackProcess 1, Built-in Output
4697318510,+1.402us,+11.5401ms,+11.5369ms,START,EngineSideChain:...


Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

> 2. parallelize the PA thread

This is similar to the timer powered engine idea:
https://blueprints.launchpad.net/mixxx/+spec/timer-powered-engine

I wrote a very naive email to the portaudio mailing list a few years back -- Ross Bencina (from portaudio) provided some great feedback that eventually caused us to put the project on the shelf:
http://music.columbia.edu/pipermail/portaudio/2011-June/012497.html

I know what you're proposing is slightly different. The timer-powered engine idea would write to a FIFO so the callback would be nothing more than a series of non-blocking reads on various FIFOs.

Priority inversion is one risk of moving logic outside the callback. One of Ross's opinions is that even relying on the scheduler of the OS to behave as you expect is a form of priority inversion (he talks about it in this post: http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing -- see the priority inversion section).

Some commercial software seems to advertise doing their audio processing on the GPU so it's certainly possible to build a system this way. Maybe it would look like:

- A series of FIFOs for every worker stream of samples.
- A series of worker threads for each division of labor (left/right channel, etc.) driven by semaphores.

Every callback would:

1) wake up the workers via semaphores to work on the next callback
2) read from the FIFOs -- play silence if samples aren't ready yet

I think this would add one latency period since on the first callback we would play silence and then signal the workers to produce the next callback (which would have been the output for the first callback in our current system).
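
In sketch form (compressed and illustrative -- reusing the SampleFifo stand-in from the sketch earlier in this thread; a real PortAudio callback has a different signature):

#include <QSemaphore>
#include <cstring>

struct Worker {
    QSemaphore wake;            // callback -> worker: "produce the next buffer"
    SampleFifo output{65536};   // worker -> callback: finished samples
};

int audioCallback(float* out, int numFrames, Worker* workers, int numWorkers) {
    // (1) Wake every worker so it starts producing the NEXT buffer.
    for (int i = 0; i < numWorkers; ++i) {
        workers[i].wake.release();
    }
    // (2) Non-blocking reads of what the workers produced LAST time;
    //     a FIFO that is not ready yet simply contributes silence.
    std::memset(out, 0, numFrames * sizeof(float));
    static float scratch[8192];  // sketch-only scratch buffer
    for (int i = 0; i < numWorkers; ++i) {
        if (workers[i].output.readAvailable() >= numFrames) {
            workers[i].output.read(scratch, numFrames);
            for (int j = 0; j < numFrames; ++j) {
                out[j] += scratch[j];
            }
        }
    }
    return 0;                    // i.e. paContinue in PortAudio terms
}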

We could block the callback to wait for the workers directly. This would not add an additional latency period but it would be very risky because if for some reason our workers do not finish in time then we would xrun. If you xrun e.g. using JACK, then the JACK watchdog can kill the process so we should avoid this.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

timeline.csv:

Thank you. This is really a very useful tool; without it we are really fishing in troubled waters.

What were the test conditions?
Were recording and shoutcast activated?
Was a track playing?
What was the priority of the EngineSideChain thread?
How many cores do you have?

Do you have a clue why it is scheduled every 604 ms and not every 150 ms?

While this analysis is very worthwhile, I am still in doubt whether we have a common understanding in general:

Currently we stack up the sidechain work of ~50 PA callbacks and process it at once. This makes sense if we save CPU because we must do some checks and calls only once -- an improvement by a factor of 50 if we only look at this call.
But this does not come without cost:
* OS primitives for syncing
* reading and writing the sample buffer
* an additional thread to schedule, and time for thread switching
* unequal CPU load over time, because of additional realtime work every 604 ms (see the ASCII art below)

No problem for a high-end CPU:
PA callback  - - - - - -
Sidechain        --

Low-end CPU:
PA callback  - - - - - -
Sidechain        - -
                 ^ probably 100 % CPU load (no time left to process e.g. waveforms)

Do you agree with this view?

To get a clue how much CPU time we save by stacking up the work, we could compare the CPU time needed for the current buffer size against smaller sizes. Could you check this with MP3 encoding enabled in both the recorder and broadcast?
Thank you.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

>> 2. parallelize the PA thread

> This is similar to the timer powered engine idea:
> https://blueprints.launchpad.net/mixxx/+spec/timer-powered-engine

No, sorry. If I understand it right, that is a proposal to decouple the Mixxx threads from the PA callback.
My idea is somewhat the opposite: I want to couple all realtime work (sample processing with a deadline) to one
dedicated audio thread.

Once I have time, I will start by ensuring that we have only one PA callback in any case.
See my analysis here: https://bugs.launchpad.net/mixxx/+bug/1203249/comments/7
(Comments to this bug are very welcome)

> http://music.columbia.edu/pipermail/portaudio/2011-June/012497.html
I agree with the problems outlined in the post. But I think we currently have a much better solution inside Mixxx than the one outlined in:
> We have a GSoC student working on converting Mixxx to be driven by an internal timer instead of the callback.

To me it looks like this would make the scheduling more complex, use more CPU, and introduce additional latency due to the additional buffer.

There is currently such a buffer inside ALSA/PA, and it is sufficient. If we are on a low-end CPU we can increase it, and if we are on a high-end CPU we can tweak it for optimum latency. All we need to do for now is fix bugs like bug #1203249
and remove some bottlenecks like this bug (based on the results of our pending analysis).

> Priority inversion is one risk of moving logic outside the callback.
>We could block the callback to wait for the workers directly. This would not add an additional latency period but it would be very risky because if for some reason our workers do not finish in time then we would xrun.

Yes, this is my idea. But for now I can't see a risk, not if we are careful enough. Let's look at an approach like processing a deck engine on a different core. In this case the engine master thread has to wait for the deck engine in any case, whether it is processed in the same thread or in a different one. Of course there is an overhead for syncing and task switching. This is the point where we have to check whether a second core is available (probably idling). In my experience we can get 1.5 times faster thanks to the second core (out of a theoretical 2 times).

And by the way: there is a lot of work involved, and I am not about to spend it now. It is only important for me that we have a common understanding of how to fix bug #1203249.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

http://www.mixxx.org/forums/viewtopic.php?f=3&t=5930
It looks like this is an overflow bug report for the recording feature.

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

The sidechain is now HighPriority: https://github.com/mixxxdj/mixxx/commit/a9b1e584fea3f020b63e48dff186d80c6a9c087c

I'll try to reach out to mibofra to test.

Changed in mixxx:
assignee: nobody → RJ Ryan (rryan)
Revision history for this message
Francesco Bonanno (mibo-fra) wrote :
Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

mibofra mentioned on IRC that this only affects ogg and not mp3. mibofra -- is that after updating to master?

Revision history for this message
Daniel Schürmann (daschuer) wrote :

Another change is proposed here:
https://github.com/mixxxdj/mixxx/pull/665

tags: added: shoutcast
Revision history for this message
Daniel Schürmann (daschuer) wrote :

Since we have reworked the broadcasting path lately, this bug is probably solved.
At the very least, the findings reported here are outdated.
So I will mark this as released.

If you still suffer from these issues, please file a new bug with data from a recent Mixxx build.

no longer affects: mixxx