audit syscalls made from the PortAudio callback
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Mixxx |
Confirmed
|
Undecided
|
Unassigned |
Bug Description
I've been examining the syscalls Mixxx makes from the PortAudio thread with dtrace to try and measure their cost. We should minimize the syscalls we do in a realtime context to avoid blocking the engine thread.
Changed in mixxx: | |
status: | New → Confirmed |
RJ Skerry-Ryan (rryan) wrote : | #1 |
RJ Skerry-Ryan (rryan) wrote : | #2 |
EngineSideChain
0 756 psynch_
0 757 psynch_
RJ Skerry-Ryan (rryan) wrote : | #3 |
I need to update my dtrace scripts to measure the distributions of times rather than eyeballing it. For now, consider every timestamp I give to be a sample size of 1 :).
RJ Skerry-Ryan (rryan) wrote : | #4 |
Setting controls for queued connections -- 12 microseconds
0 752 psynch_
0 753 psynch_
tags: | added: engine performance |
RJ Skerry-Ryan (rryan) wrote : | #5 |
To verify the dtrace timings make sense I recorded some playing a single MP3 with --developer and compared them with this dtrace script:
HALC_IO*
/execname == "mixxx"/
{
self-
self-
self-
}
HALC_IO*
/execname == "mixxx" && self->callback_
{
@[ustack(), "callback duration ns"] = quantize(timestamp - self->callback_
@[ustack(), "callback on-cpu duration ns"] = quantize(vtimestamp - self->callback_
self-
self-
}
Debug [Main]: Stat("SoundDevi
I used Mixxx's experiment mode to measure 30 seconds of playing, not including Mixxx startup or shutdown (since that tends to trigger outliers).
callback on-cpu duration ns
value ------------- Distribution ------------- count
8192 | 0
16384 | 164
32768 |@@@@@@@@@@@@@ 10218
65536 |@@@@@@
131072 |@@@@@ 3755
262144 | 71
524288 | 0
1048576 | 1
2097152 | 0
callback duration ns
value ------------- Distribution ------------- count
8192 | 0
16384 | 43
32768 |@@@@@@@@@ 6981
65536 |@@@@@@@@@@@@@@@@@@ 13896
131072 |@@@@@@@@@@@@ 9293
262144 |@ 475
524288 | 3
1048576 | 1
2097152 | 1
4194304 | 0
The min and maxes line up pretty well, so I think I can trust these measurements. I think the difference in max values from the on-cpu vs. on-cpu+off-cpu duration indicates the thread is spending up to a millisecond blocked.
RJ Skerry-Ryan (rryan) wrote : | #6 |
Here's a syscall from Stat::track:
syscall duration ns
value ------------- Distribution ------------- count
4096 | 0
8192 |@@@@@@
16384 | 0
RJ Skerry-Ryan (rryan) wrote : | #7 |
syscall duration ns
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@
8192 |@@@@@@
16384 | 0
RJ Skerry-Ryan (rryan) wrote : | #8 |
Ouch, malloc. That's about as long as my average callback duration.
syscall duration ns
value ------------- Distribution ------------- count
32768 | 0
65536 |@@@@@@
131072 | 0
RJ Skerry-Ryan (rryan) wrote : | #9 |
syscall duration ns
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@
8192 |@@@@ 2
16384 | 0
RJ Skerry-Ryan (rryan) wrote : | #10 |
syscall duration ns
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@@@@@@@@ 4
8192 |@@@@@@
16384 |@@@ 1
32768 | 0
RJ Skerry-Ryan (rryan) wrote : | #11 |
syscall duration ns
value ------------- Distribution ------------- count
4096 | 0
8192 |@@@@@@@@@@@@@@@ 3
16384 |@@@@@@
32768 | 0
RJ Skerry-Ryan (rryan) wrote : | #12 |
syscall duration ns
value ------------- Distribution ------------- count
2048 | 0
4096 | 1
8192 |@@@@@@
16384 |@@@@@@@@@@@ 23
32768 | 0
RJ Skerry-Ryan (rryan) wrote : | #13 |
syscall duration ns
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@
8192 |@@@@@@@@@@@@ 51
16384 |@@ 7
32768 | 1
65536 | 0
RJ Skerry-Ryan (rryan) wrote : | #14 |
syscall duration ns
value ------------- Distribution ------------- count
2048 | 0
4096 |@@ 30
8192 |@@@@@@@@ 111
16384 |@@@@@@
32768 |@@@@@@ 78
65536 | 3
131072 | 0
RJ Skerry-Ryan (rryan) wrote : | #15 |
syscall duration ns
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@
8192 |@@@@@@@@@@@@@@@@ 112
16384 | 1
32768 | 0
RJ Skerry-Ryan (rryan) wrote : | #16 |
I didn't see anything in here that was particularly surprising -- these were all things we knew we were doing, but having ballpark numbers is nice.
Some initial thoughts:
* We may see some tail latency improvements from using tcmalloc given the cases where we ended up in malloc seemed quite expensive (up to 65us).
* ControlObject::set can cost 16us locking mutexes alone. Queued connections allocate memory too (which can also block). We really need to get COs out of the engine.
* SideChain writing costs ~8-16us just signaling the sidechain thread.
* Why is SoundDeviceNetwork open when no shoutcast stream is enabled? Do we need to be calling gettimeofday from the engine thread?
RJ Skerry-Ryan (rryan) wrote : | #17 |
Here's my dtrace script:
HALC_IO*
/execname == "mixxx"/
{
self-
self-
self-
}
HALC_IO*
/execname == "mixxx" && self->callback_
{
@[ustack(), "callback duration ns"] = quantize(timestamp - self->callback_
@[ustack(), "callback on-cpu duration ns"] = quantize(vtimestamp - self->callback_
self-
self-
}
syscall:::entry
/execname == "mixxx" && tid == self->callback_tid/
{
self->syscall_ts = timestamp;
}
syscall:::return
/execname == "mixxx" && tid == self->callback_tid && self->syscall_ts/
{
@[ustack(), "syscall duration ns"] = quantize(timestamp - self->syscall_ts);
self->syscall_ts = 0;
}
Daniel Schürmann (daschuer) wrote : | #18 |
> * We may see some tail latency improvements from using tcmalloc given the cases where we ended up in malloc seemed quite expensive (up to 65us).
Does it mean we need to override the new() operator for some classes? How does it work together with precompiled qt libraries?
> * ControlObject::set can cost 16us locking mutexes alone. Queued connections allocate memory too (which can also block). We really need to get COs out of the engine.
Yes! I have already thought about it what will be a good approach, that is performant, but not requires a change in our business logic.
On Idea was to implement a new CO Proxy type that has value buffer for pending changes from the engine and a chain pointer that allows to chain all touched COs.
During the callback set() writes the value buffer and puts the CO on top of the linked list.
After the engine call, a new threads works though the fifo list and updates the underlying CO accordingly.
Unfortunately I am not sure if this really improves the performance, because the work is still there and the fifo should be empty before we can continue with the next engine call.
If we assume that we protect these threads from each other by a single mutex, we use actually more CPU, but we benefit from multi core cpus.
Maybe we find a way to make the chain of Engine COs lock free
> * Why is SoundDeviceNetwork open when no shoutcast stream is enabled?
There is no strong reason for it. It requires some work to enable it on demand though.
> * Do we need to be calling gettimeofday from the engine thread?
Yes, that is the main reason for having SoundDeviceNetwork, it runs at the network synced gettimeofday clock.
By the way, there is a Pending PR, that allows to run the whole engine at this clock to remove sync artifacts from the broadcast stream: https:/
RJ Skerry-Ryan (rryan) wrote : | #19 |
Taking the max of on-cpu vs on+off-cpu callback duration shows something of an upper bound (ok, sloppy math) for the amount of time the callback can sleep.
callback on-cpu duration max ns 2447591
callback duration max ns 4713387
Here is a histogram of the number of times that the callback thread blocks per callback (I put a dtrace probe on sched:::off-cpu and incremented a counter between HALC_IO*
This is playing 2 tracks at system reported latency of 13.5828ms for about 30 seconds.
offcpu count
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@
1 | 11
2 | 7
4 | 1
8 | 0
Note that these power-of-2 histograms dtrace produces account an event to the smallest power of 2 that is less than or equal to the true number. So the true maximum here is somewhere between 4 and 7.
Swiftb0y (swiftb0y) wrote : | #20 |
Mixxx now uses GitHub for bug tracking. This bug has been migrated to:
https:/
lock status: | Metadata changes locked and limited to project staff |
gettimeofday appears to take about 8 microseconds on average. The "ts" here is a walltime nanosecond counter (though I think the precision is microseconds, not nanos since the last 3 numbers are always zero).
Notably, I'm not broadcasting at all and EngineNetworkStream is still doing this work.
0 383 gettimeofday:return (tid 1488971) ts:148116165640 5619000 gettimeofday
libsystem_ kernel. dylib`_ _gettimeofday+ 0xa
mixxx`EngineNet workStream: :getWriteExpect ed()+0x17
mixxx`SoundDevi ceNetwork: :writeProcess( )+0xd4
mixxx`SoundMana ger::writeProce ss()+0xb9
mixxx`SoundDevi cePortAudio: :callbackProces sClkRef( unsigned int, float*, float const*, PaStreamCallbac kTimeInfo const*, unsigned long)+0x751
mixxx`( anonymous namespace) ::paV19Callback ClkRef( void const*, void*, unsigned long, PaStreamCallbac kTimeInfo const*, unsigned long, void*)+0x23
libportaudio. 2.dylib` AdaptingOutputO nlyProcess+ 0xe7
libportaudio. 2.dylib` PaUtil_ EndBufferProces sing+0x1c4
libportaudio. 2.dylib` AudioIOProc+ 0x392
CoreAudio` AUConverterBase ::RenderBus( unsigned int&, AudioTimeStamp const&, unsigned int, unsigned int)+0x2c5
CoreAudio` AUBase: :DoRenderBus( unsigned int&, AudioTimeStamp const&, unsigned int, AUOutputElement*, unsigned int, AudioBufferList &)+0x9a
CoreAudio` AUBase: :DoRender( unsigned int&, AudioTimeStamp const&, unsigned int, unsigned int, AudioBufferList &)+0x1f8
CoreAudio` AUHAL:: AUIOProc( unsigned int, AudioTimeStamp const*, AudioBufferList const*, AudioTimeStamp const*, AudioBufferList*, AudioTimeStamp const*, void*)+0x6f2
CoreAudio` HALC_ProxyIOCon text::IOWorkLoo p()+0x9e7
CoreAudio` HALC_ProxyIOCon text::IOThreadE ntry(void* )+0x58
CoreAudio` HALB_IOThread: :Entry( void*)+ 0x4b
libsystem_ pthread. dylib`_ pthread_ body+0x83
libsystem_ pthread. dylib`_ pthread_ body
libsystem_ pthread. dylib`thread_ start+0xd
0 382 gettimeofday:entry (tid 1488971) ts:148116165640 7042000 gettimeofday
libsystem_ kernel. dylib`_ _gettimeofday+ 0xa
mixxx`EngineNet workStream: :getWriteExpect ed()+0x17
mixxx`SoundDevi ceNetwork: :writeProcess( )+0xd4
mixxx`SoundMana ger::writeProce ss()+0xb9
mixxx`SoundDevi cePortAudio: :callbackProces sClkRef( unsigned int, float*, float const*, PaStreamCallbac kTimeInfo const*, unsigned long)+0x751
mixxx`( anonymous namespace) ::paV19Callback ClkRef( void const*, void*, unsigned long, PaStreamCallbac kTimeInfo const*, unsigned long, void*)+0x23
libportaudio. 2.dylib` AdaptingOutputO nlyProcess+ 0xe7
libportaudio. 2.dylib` PaUtil_ EndBufferProces sing+0x1c4
libportaudio. 2.dylib` AudioIOProc+ 0x392
CoreAudio` AUConverterBase ::RenderBus( unsigned int&, AudioTimeStamp const&, unsigned int, unsigned int)+0x2c5
CoreAudio` AUBase: :DoRenderBus( unsigned int&, AudioTimeStamp const&, unsigned int, AUOutputElement*, unsigned int, AudioBufferList &)+0x9a
CoreAudio` AUBase: ...