lowlatency kernel causes sidechain thread starvation

Bug #1194543 reported by RJ Skerry-Ryan
This bug affects 2 people

Affects     Status        Importance  Assigned to
Mixxx 1.11  Won't Fix     High        RJ Skerry-Ryan
Mixxx 2.0   Fix Released  High        RJ Skerry-Ryan

Bug Description

mibofra on IRC reported an issue with recording/streaming that only happens when using the lowlatency kernel.

Regular bits of silence can be heard on both recordings and shoutcast.

<mibofra> audio api : jack audio connection kit
<mibofra> rate : 48000Hz
<mibofra> latency: set on jackd (I've set 21,3)
<mibofra> Linux spf-laptop 3.8.0-20-lowlatency #14-Ubuntu SMP PREEMPT Mon May 13 20:17:10 UTC 2013 i686 i686 i686 GNU/Linux
<mibofra> I've the 3.8.0-25 and lower but they have problems with realtime jackd and usb audiocards
<mibofra> I'm using an aureon 7.1 usb now (terratec)

I suspect this is a thread-starvation issue where the sidechain FIFO overflows before the sidechain can process it. It is most likely caused by the LowPriority setting on the sidechain thread. I'm asking mibofra to test without giving the thread a priority (or maybe giving it HighPriority).
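
For context, a minimal sketch of the knob in question (QThread and its priority enum are real Qt API; SideChainThread is just an illustrative stand-in for the sidechain worker):

#include <QThread>

// Illustrative stand-in for the sidechain worker.
class SideChainThread : public QThread {
  protected:
    void run() override {
        // drain the sidechain FIFO, encode, record, ...
    }
};

// The suspected culprit -- starting the worker at low priority:
//     sideChain->start(QThread::LowPriority);
// versus giving it a higher one:
//     sideChain->start(QThread::HighPriority);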

Tags: shoutcast
Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

I've seen multiple people on IRC complaining about broadcasting issues using the lowlatency kernel, so I'm marking this confirmed.

Changed in mixxx:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

How do people feel about removing our LowPriority setting (which probably means the sidechain is often pre-empted by the Qt main thread)? At normal priority it's still lower than the engine thread.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

I would go a little further. Strictly speaking, the broadcast thread is a realtime thread: it has a fixed deadline that must not be missed in any case. For radio DJs this is more important than the soundcard output.

I have not looked at the source, but can't we tie it to the engine thread? That would save us unnecessary thread switches.

For me this issue has the same root cause as https://bugs.launchpad.net/mixxx/+bug/1203249.

I am not sure how this applies to the recording use case, but the same arguments hold if the recorded file is the main product and we see dropouts in it.
Dropouts are a no-go here as well.

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

Recording is really not time critical. I see this as an issue with the sidechain particularly on kernels that are stricter about thread scheduling. Our current thread priorities are haphazard. I think LowPriority is the wrong priority for the sidechain -- I think it should be HighPriority or HigherPriority (maybe ranking above control input).

RE: moving the EngineShoutcast code into the engine.... have you looked at that code? :) It sleep()s!

Revision history for this message
Daniel Schürmann (daschuer) wrote :

I think it is all about the definition of time-critical and realtime. Realtime, for me, is always when we have a hard deadline that must not be broken. Even if we have a very big buffer, it must not overflow so that we lose a single sample. All processes in the audio chain are of this kind.

I just had a brief look at the engine sidechain. And yes, there is much room for optimization.

The engine thread fills the sidechain buffer with realtime priority, and it is processed by the sidechain with low priority.
Such conditions are in general always a candidate for over/underflows. Once we have removed all non-time-critical tasks from the sidechain process function, there might be no reason for a lower priority than the engine thread itself.

You may argue that we can use the buffer to bridge over CPU-intensive phases, but what is the nature of the task we wait for? And who guarantees that we will ever have the time to drain the buffer to zero?

But of course we need the buffer to bridge things other than low CPU resources, like waiting for network access or the disk. But is it at the right level? Is there already a cache buffer inside the OS? So there is a need to watch out for the right moment to use the CPU. The only room for saving CPU is to match the input memory chunks to the output memory chunks and skip sidechain scheduling until enough samples are ready to process. But this would make all scheduling calculations much harder.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

Without further optimizations HigherPriority should be fine.

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

MP3 encoding isn't something we should be doing in the main callback since it is not a result of the callback -- PortAudio does not need the results of the sidechain. We fill the sidechain FIFO and then when it reaches a threshold we signal the sidechain thread to wake up and process the samples.
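
A minimal sketch of that pattern (illustrative C++ only -- SampleFifo, kSignalThreshold, and the function names are stand-ins, not the actual Mixxx classes):

#include <QSemaphore>
#include <algorithm>
#include <atomic>
#include <vector>

// Simplistic single-producer/single-consumer ring buffer.
class SampleFifo {
  public:
    explicit SampleFifo(int size) : m_data(size), m_size(size) {}
    int readAvailable() const {
        return (m_write.load() - m_read.load() + m_size) % m_size;
    }
    void write(const float* in, int n) {      // producer: audio callback
        int w = m_write.load();
        for (int i = 0; i < n; ++i) m_data[(w + i) % m_size] = in[i];
        m_write.store((w + n) % m_size);
    }
    int read(float* out, int n) {             // consumer: sidechain thread
        n = std::min(n, readAvailable());
        int r = m_read.load();
        for (int i = 0; i < n; ++i) out[i] = m_data[(r + i) % m_size];
        m_read.store((r + n) % m_size);
        return n;
    }
  private:
    std::vector<float> m_data;
    const int m_size;
    std::atomic<int> m_read{0}, m_write{0};
};

SampleFifo g_fifo(65536);
QSemaphore g_wake;
const int kSignalThreshold = 65536 / 5;       // illustrative threshold

// Producer side, inside the PA callback:
void pushToSideChain(const float* buffer, int numSamples) {
    g_fifo.write(buffer, numSamples);
    if (g_fifo.readAvailable() >= kSignalThreshold) {
        g_wake.release();                     // non-blocking wake-up
    }
}

// Consumer side, the sidechain thread's loop:
void sideChainLoop() {
    std::vector<float> chunk(kSignalThreshold);
    for (;;) {
        g_wake.acquire();                     // sleep until signaled
        int n = g_fifo.read(chunk.data(), static_cast<int>(chunk.size()));
        // encode `n` samples for shoutcast / append them to the recording
    }
}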

The sidechain still has a realtime nature for shoutcast (recording is not realtime -- we can take as long as we want given enough buffering -- i.e. we could just dump the samples to disk in the sidechain thread and then encode them later) but it can process much less frequently since we deliver encoded packets to the shoutcast server much less frequently than the callback occurs. The shoutcast streaming setups include many layers of buffering so there is much less of a hard deadline to hit.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

Ah, I see. We have a trigger at 1/5 of the buffer (#define SIDECHAIN_BUFFER_SIZE 65536).
That trigger threshold corresponds to about 150 ms @ 44100 Hz.
The recording feature calls write(), which is potentially a blocking function.
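
For reference, the arithmetic behind that 150 ms figure (assuming interleaved stereo samples):

65536 samples / 5 = 13107 samples per trigger
13107 samples / 2 channels ≈ 6553 frames
6553 frames / 44100 frames/s ≈ 148.6 ms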

So you are right: a priority below realtime is fine for now and a suitable solution for 1.12.
Maybe we should start with normal priority so we do not just shift the problem.

---

But I am stressing this topic so much because I am thinking about some changes in scheduling:

> PortAudio does not need the results of the sidechain.

That is true, but I would like to have one pacemaker task within Mixxx that is the time source for everything related to the audio stream.
This means all audio threads should be finished before the next PA callback is issued. All additional buffering only consumes CPU but does not save time: the work has to be done anyway, and if it is delayed we have a bigger amount of work next time.
This way the user can tune the whole Mixxx scheduling by setting the buffer size.

Some related bullets:
* It does not make a difference in which task the MP3 encoding is done, if we assume that encoding the same data in two steps takes the same time as processing it in one step. We need the CPU time anyway. True? LAME call overhead vs. thread-switch overhead.
* If we assume a 10 ms buffer size, the sidechain is processed every 15 cycles. This produces a CPU peak, and it may consume CPU exactly when it is needed elsewhere.
* If we enable both recording and broadcasting, recording may block broadcasting because of the potentially blocking write().
* Is there any blocking function in the shoutcast path? Can we call it from the PA callback?
* Is memory-mapped I/O non-blocking? http://my.safaribooksonline.com/book/operating-systems-and-server-administration/linux/0596009585/advanced-file-i-o/mapping_files_into_memory. Maybe we should use it anyway (see the sketch after this list).
* What is the right way to optimize for multi-core processing?
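
On the memory-mapped I/O bullet, a minimal sketch of the idea (plain POSIX mmap; the path and sizes are illustrative, error handling is trimmed, and note that writing through the mapping can still block on page faults, so it is not automatically realtime-safe):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>

int main() {
    const size_t kFileSize = 1 << 20;            // illustrative: 1 MiB
    int fd = open("/tmp/recording.raw", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, kFileSize);                    // reserve the space up front
    void* map = mmap(nullptr, kFileSize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    float samples[256] = {};                     // pretend audio chunk
    std::memcpy(map, samples, sizeof(samples));  // "write" via the mapping

    msync(map, kFileSize, MS_ASYNC);             // ask the kernel to flush
    munmap(map, kFileSize);
    close(fd);
    return 0;
}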

What do you think about this?

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

> That is true, but I like to have one pace maker task within Mixxx, that is a time source for all audio stream related things.

It sounds like your main motivation to merge the sidechain with the main callback is to get deterministic processing of the sidechain with the audio callback. My argument is that this does not need to be deterministic:

1) Recording does not care whether the samples are processed on time (as long as the thread wakes up before overflow).
2) Shoutcast does not care whether the samples are shipped to the server on the timescales we are talking about. There is buffering at the client and at the server so cutting ~100ms of latency out will have very little user-visible effect.

If the sidechain wakes up on time before overflow, then everything is fine. If there is a buffer overflow then either our scheduling is off or the user's machine is too weak to operate at the given latency.

In this case (the bug reported by mibofra), our scheduling is off. At LowPriority on a low-latency kernel, once the sidechain thread hits the run queue it is always in line behind the main thread, input threads, and vsync thread -- a whole lot of computation that fires frequently. So we should change the scheduling so that the sidechain still runs in time when the scheduler is stricter about priorities.

The scheduling of the sidechain is driven by the callback because the signal to wake it up is directly dependent on the number of samples processed by the callback.

Whether or not the scheduling of the sidechain processing is regular depends on the scheduling latency of the thread waking up once we signal it from the callback.

I added a timer to measure this:

Debug [Main]: Stat("EngineSideChain scheduler latency","count=397,sum=5.83277e+08ns,average=1.46921e+06ns,min=4083ns,max=5.71935e+08ns,variance=8.23872e+14ns^2,stddev=2.87032e+07ns")

So there is an average time to schedule of 1.5ms with standard deviation of 28ms. (I wish we measured 90th / 99th percentile variance -- that would give another useful description).
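
(The measurement idea itself is simple -- this is a sketch of the principle, not our actual Stat/Timer code: timestamp when the callback signals the sidechain, timestamp again when the worker wakes up, and aggregate the difference.)

#include <atomic>
#include <chrono>

std::atomic<long long> g_signaledAtNs{0};

long long nowNs() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
            steady_clock::now().time_since_epoch()).count();
}

// In the callback, just before waking the sidechain:
//     g_signaledAtNs = nowNs();
// First thing in the sidechain thread after waking:
//     long long latencyNs = nowNs() - g_signaledAtNs;
//     // feed latencyNs into count / sum / min / max / variance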

Due to the buffering involved in 1) the encoder, 2) the shoutcast server, and 3) the shoutcast clients, I really don't think this variable added latency matters much.

Moreover, the callback is really meant to produce audio for the soundcard; the moment we can satisfy that deadline, I think we should. There are also timing constraints involved with shoutcast servers that I don't understand, and we need someone to take a close look at them. I'm not even sure pushing new samples every latency period would work. Maybe someone who knows more about these protocols could comment.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

Thank you for your detailed reply!

> It sounds like your main motivation to merge the sidechain with the main callback is to get deterministic processing of the sidechain with the audio callback.

Yes, I want to make things more deterministic, to finally get more benefit from multi-core processing. But of course only after a detailed analysis, so that we are sure we are doing things right.

> My argument is that this does not need to be deterministic:
> 1) Recording does not care whether the samples are processed on time (as long as the thread wakes up before overflow).

I fully agree

> 2) Shoutcast does not care whether the samples are shipped to the server on the timescales we are talking about. There is buffering at the client and at the server so cutting ~100ms of latency out will have very little user-visible effect.

Yes

> If the sidechain wakes up on time before overflow, then everything is fine. If there is a buffer overflow then either our scheduling is off or the user's machine is too weak to operate at the given latency.

Yes.

Your analysis is very useful for this case, and I think we have a common understanding of it. If not, please correct my following thoughts:

* Currently the sidechain takes about 1.5 ms on your setup to process a 150 ms buffer -> factor 0.01.
I wonder what happens to the factor if we run with half the buffer size. That would give us a clue whether there is room for optimization.
If the factor does not increase much, we can think about scaling the buffer size to the PA buffer size, or merging some work into the engine thread. This would finally avoid scheduling a high-priority thread at random moments. As you correctly said before, this is no benefit for the sidechain itself, but IMHO it is for all other threads that have to be scheduled between the PA callbacks.

* I think we should separate the potentially blocking write() for recording from the probably non-blocking shoutcast path. One possible solution is to process shoutcast from the PA thread itself and do recording in its own thread.

* I would like to use the benefit of multi-core CPUs. Currently Mixxx uses all 4 cores of my CPU, but this is the result of random scheduling.
I wonder if we can gain lower latency if we parallelize the PA callback across the number of cores. The for loops in enginemaster.cpp are ideal candidates. Or we could process the right channel and the left channel in different threads. For this approach deterministic scheduling is very helpful to tune things up.

I have a two-step idea for this:
1. move all PA-buffer-size-dependent tasks to the PA thread
2. parallelize the PA thread

If it turns out that these are bad ideas in this particular case, let's try to find a common understanding for such problems. Do you think my deterministic processing ideas are valid in general?

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

Ah, I did some more research. I think my variance numbers above (28ms) are misleading. The sidechain is scheduled much more reliably than that.

I built a global timeline tracker for Mixxx and instrumented all of our threads for various events we care about. I've attached a timeline dump.

Columns are:

- time since bootup in nanoseconds
- duration since previous line
- duration since last START of the same event string
- duration since last END of the same event string
- event identifier string

I'm working on a nice graphing tool for this and auto-extracting some details of various stats (the distribution of the frequency of events, etc.).

But for now, if you slice this to just EngineSideChain start events:

$ cat timeline.csv | egrep "START,EngineSideChain$"

You'll see the scheduling is very deterministic on my machine:

6401813,+2.55468ms,+0ns,+0ns,START,EngineSideChain
2267856463,+8.862us,+2.26145s,+2.26143s,START,EngineSideChain
2871676046,+37.298us,+603.82ms,+603.534ms,START,EngineSideChain
3475332555,+31.977us,+603.657ms,+603.569ms,START,EngineSideChain
4078993778,+9.072us,+603.661ms,+603.589ms,START,EngineSideChain
4682823651,+51.679us,+603.83ms,+603.768ms,START,EngineSideChain
5286533296,+27.559us,+603.71ms,+603.639ms,START,EngineSideChain
5890151625,+1.944us,+603.618ms,+603.545ms,START,EngineSideChain
6494206651,+48.318us,+604.055ms,+603.983ms,START,EngineSideChain
7097950042,+37.408us,+603.743ms,+603.684ms,START,EngineSideChain
7701548968,+7.581us,+603.599ms,+603.527ms,START,EngineSideChain
8305264076,+24.455us,+603.715ms,+603.652ms,START,EngineSideChain
8908953748,+6.893us,+603.69ms,+603.629ms,START,EngineSideChain
9512699813,+4.376us,+603.746ms,+603.675ms,START,EngineSideChain
10116375339,+3.79us,+603.676ms,+603.597ms,START,EngineSideChain
10720137282,+2.313us,+603.762ms,+603.683ms,START,EngineSideChain
11323834895,+6.158us,+603.698ms,+603.635ms,START,EngineSideChain
11927510878,+6.229us,+603.676ms,+603.61ms,START,EngineSideChain
12531278907,+9.061us,+603.768ms,+603.692ms,START,EngineSideChain
13135007774,+21.234us,+603.729ms,+603.666ms,START,EngineSideChain
13738704524,+5.195us,+603.697ms,+603.631ms,START,EngineSideChain
14342497032,+40.417us,+603.793ms,+603.721ms,START,EngineSideChain
14946184124,+354ns,+603.687ms,+603.615ms,START,EngineSideChain
15549856644,+8.811us,+603.673ms,+603.571ms,START,EngineSideChain
16153767368,+20.116us,+603.911ms,+603.849ms,START,EngineSideChain
16757384342,+24.97us,+603.617ms,+603.551ms,START,EngineSideChain
17361117902,+26.223us,+603.734ms,+603.676ms,START,EngineSideChain
17964835399,+13.324us,+603.717ms,+603.659ms,START,EngineSideChain
19041821439,+370.964ms,+1.07699s,+1.07692s,START,EngineSideChain

The third comma-separated field -- the duration since the last START of the same event -- is the important one here. Ignoring the first and last events (because they are the sidechain worker starting up and shutting down) you can see that it occurs pretty solidly every 603 or 604 ms.

If you add in callback events, you can see how the sidechain is driven:

4697256496,+5.4529ms,+11.5716ms,+11.449ms,START,SoundDevicePortAudio::callbackProcess 1, Built-in Output
4697318510,+1.402us,+11.5401ms,+11.5369ms,START,EngineSideChain:...


Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

> 2. parallelize the PA thread

This is similar to the timer powered engine idea:
https://blueprints.launchpad.net/mixxx/+spec/timer-powered-engine

I wrote a very naive email to the portaudio mailing list a few years back -- Ross Bencina (from portaudio) provided some great feedback that eventually caused us to put the project on the shelf:
http://music.columbia.edu/pipermail/portaudio/2011-June/012497.html

I know what you're proposing is slightly different. The timer-powered engine idea would write to a FIFO so the callback would be nothing more than a series of non-blocking reads on various FIFOs.

Priority inversion is one risk of moving logic outside the callback. One of Ross's opinions is that even relying on the scheduler of the OS to behave as you expect is a form of priority inversion (he talks about it in this post: http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing -- see the priority inversion section).

Some commercial software seems to advertise doing their audio processing on the GPU so it's certainly possible to build a system this way. Maybe it would look like:

- A series of FIFOs for every worker stream of samples.
- A series of worker threads for each division of labor (left/right channel, etc.) driven by semaphores.

Every callback would:

1) wake up the workers via semaphores to work on the next callback
2) read from the FIFOs -- play silence if samples aren't ready yet

I think this would add one latency period since on the first callback we would play silence and then signal the workers to produce the next callback (which would have been the output for the first callback in our current system).
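
In sketch form (compressed and illustrative -- reusing the SampleFifo stand-in from the sketch earlier in this thread; a real PortAudio callback has a different signature):

#include <QSemaphore>
#include <cstring>

struct Worker {
    QSemaphore wake;            // callback -> worker: "produce the next buffer"
    SampleFifo output{65536};   // worker -> callback: finished samples
};

int audioCallback(float* out, int numFrames, Worker* workers, int numWorkers) {
    // (1) Wake every worker so it starts producing the NEXT buffer.
    for (int i = 0; i < numWorkers; ++i) {
        workers[i].wake.release();
    }
    // (2) Non-blocking reads of what the workers produced LAST time;
    //     a FIFO that is not ready yet simply contributes silence.
    std::memset(out, 0, numFrames * sizeof(float));
    static float scratch[8192];  // sketch-only scratch buffer
    for (int i = 0; i < numWorkers; ++i) {
        if (workers[i].output.readAvailable() >= numFrames) {
            workers[i].output.read(scratch, numFrames);
            for (int j = 0; j < numFrames; ++j) {
                out[j] += scratch[j];
            }
        }
    }
    return 0;                    // i.e. paContinue in PortAudio terms
}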

We could block the callback to wait for the workers directly. This would not add an additional latency period but it would be very risky because if for some reason our workers do not finish in time then we would xrun. If you xrun e.g. using JACK, then the JACK watchdog can kill the process so we should avoid this.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

timeline.csv:

Thank you. This is really a very useful tool; without it we are really fishing in troubled waters.

What were the test conditions?
Were recording and shoutcast activated?
Was a track playing?
What was the priority of the EngineSideChain thread?
How many cores do you have?

Do you have a clue why it is scheduled every 604 ms and not every 150 ms?

While this analysis is very worthwhile, I am still in doubt whether we have a common understanding in general:

Currently we stack up the sidechain work of ~50 PA callbacks and process it at once. This makes sense if we save CPU because we must do some checks and calls only once -- an improvement by a factor of 50 if we only look at this call.
But this does not come without cost:
* OS primitives for syncing
* reading and writing the sample buffer
* an additional thread to schedule, and time for thread switching
* unequal CPU load over time, because of additional realtime work every 604 ms (see the ASCII art below)

No problem for a high-end CPU:
PA callback  - - - - - -
Sidechain        --

Low-end CPU:
PA callback  - - - - - -
Sidechain        - -
                 ^ probably 100 % CPU load (no time left to process e.g. waveforms)

Do you agree with this view?

To get a clue how much CPU time we save by stacking up the work, we could compare the CPU time needed for the current buffer size against smaller sizes. Could you check this with MP3 encoding enabled in both the recorder and broadcast?
Thank you.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

>> 2. parallelize the PA thread

> This is similar to the timer powered engine idea:
> https://blueprints.launchpad.net/mixxx/+spec/timer-powered-engine

No, sorry. If I understand it right, that is a proposal to decouple the Mixxx threads from the PA callback.
My idea is somewhat the opposite: I want to couple all realtime work (sample processing with a deadline) to one
dedicated audio thread.

Once I have time, I will start by ensuring that we have only one PA callback in any case.
See my analysis here: https://bugs.launchpad.net/mixxx/+bug/1203249/comments/7
(Comments to this bug are very welcome)

> http://music.columbia.edu/pipermail/portaudio/2011-June/012497.html
I agree with the problems outlined in the post. But I think we currently have a much better solution inside Mixxx than the one outlined in:
> We have a GSoC student working on converting Mixxx to be driven by an internal timer instead of the callback.

To me it looks like this would make the scheduling more complex, use more CPU, and introduce additional latency due to the additional buffer.

There is currently such a buffer inside ALSA/PA, and it is sufficient. If we are on a low-end CPU we can increase it, and if we are on a high-end CPU we can tweak it for optimum latency. All we need to do for now is fix bugs like bug #1203249
and remove some bottlenecks like this bug (based on the results of our pending analysis).

> Priority inversion is one risk of moving logic outside the callback.
>We could block the callback to wait for the workers directly. This would not add an additional latency period but it would be very risky because if for some reason our workers do not finish in time then we would xrun.

Yes, this is my idea. But for now I can't see a risk, not if we are careful enough. Let's look at an approach like processing a deck engine on a different core. In this case the engine master thread has to wait for the deck engine in any case, whether it is processed in the same thread or in a different one. Of course there is an overhead for syncing and task switching. This is the point where we have to check whether a second core is available (probably idling). In my experience we can get 1.5 times faster thanks to the second core (out of a theoretical 2 times).

And by the way: there is a lot of work involved, and I am not about to spend it now. It is only important for me that we have a common understanding of how to fix bug #1203249.

Revision history for this message
Daniel Schürmann (daschuer) wrote :

http://www.mixxx.org/forums/viewtopic.php?f=3&t=5930
It looks like this is an overflow bug report for the recording feature.

Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

The sidechain is now HighPriority: https://github.com/mixxxdj/mixxx/commit/a9b1e584fea3f020b63e48dff186d80c6a9c087c

I'll try to reach out to mibofra to test.

Changed in mixxx:
assignee: nobody → RJ Ryan (rryan)
Revision history for this message
Francesco Bonanno (mibo-fra) wrote :
Revision history for this message
RJ Skerry-Ryan (rryan) wrote :

mibofra mentioned on IRC that this only affects ogg and not mp3. mibofra -- is that after updating to master?

Revision history for this message
Daniel Schürmann (daschuer) wrote :

Another change is proposed here:
https://github.com/mixxxdj/mixxx/pull/665

tags: added: shoutcast
Revision history for this message
Daniel Schürmann (daschuer) wrote :

Since we have reworked the broadcasting path lately, this bug is probably solved.
At the very least, the findings reported here are outdated.
So I will mark this as released.

If you still suffer from these issues, please file a new bug with data from a recent Mixxx build.

no longer affects: mixxx