Smarter processing of reports queue

Bug #2072790 reported by Galen Charlton
This bug affects 1 person

Affects: Evergreen
Status: New
Importance: Wishlist
Assigned to: Unassigned

Bug Description

The reports queue processed by Clark Kent can get bottlenecked by reports that take a long time to run, delaying reports that would ordinarily finish in a couple of seconds.

While increasing clark-kent.pl's concurrency can help, in a large consortium there will always be a risk of fast reports getting backlogged behind slower queries.

Tweaking how Clark manages its queue can allow for fast reports to continue to flow even while slower reports are getting processed.

Tags: reports
Galen Charlton (gmc)
Changed in evergreen:
importance: Undecided → Wishlist
tags: added: reports
Galen Charlton (gmc) wrote (last edit):

One way to deal with this is to arrange for fast reports to be handled by dedicated Clark worker processes. In particular, two new configuration parameters could be added to Clark:

- the maximum time a report's SQL query may run and still be considered "fast"; this would default to a low value such as 5 seconds
- the number of workers to reserve for handling fast reports
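For concreteness, a launch with these parameters might look like the following. All switch names here are illustrative, not existing clark-kent.pl options:

```shell
# Hypothetical invocation: four workers total, one reserved for
# reports whose queries finish within 5 seconds. --fast-timeout and
# --fast-workers are made-up names for the two proposed parameters.
perl clark-kent.pl --concurrency 4 --fast-timeout 5 --fast-workers 1
```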

A given report run would initially be assumed to be fast, per a new Boolean column on reporter.schedule, is_slow, defaulting to false.

Each time Clark polls to see whether it should spawn a worker to run a report, it would compare the number of workers already running reports against its maximum concurrency and the number of workers reserved for fast reports, i.e., check whether $concurrency - $current_running <= $num_reserved. If that condition is met, the first scheduled fast report (i.e., where is_slow IS FALSE) would be selected to run. Otherwise, the first pending scheduled report of any speed would be selected.
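As a sketch, the selection rule could look like this in Python (illustrative only; the data structures are assumptions, and the fallback to a slow report when nothing fast is pending is inferred from the concurrency-1 behavior described later in this comment):

```python
def pick_next(concurrency, current_running, num_reserved, queue):
    """Pick the next scheduled report to run, or None.

    queue is the list of pending reports in scheduled order; each
    report is modeled as a dict with an 'is_slow' flag standing in
    for the proposed reporter.schedule.is_slow column.
    """
    if current_running >= concurrency:
        return None  # all worker slots are busy
    if concurrency - current_running <= num_reserved:
        # Only reserved slots remain: prefer the first fast report.
        fast = next((r for r in queue if not r["is_slow"]), None)
        if fast is not None:
            return fast
    # Unreserved capacity (or nothing fast pending): first of any speed.
    return queue[0] if queue else None
```

With concurrency 4 and one reserved worker, the fast-only branch kicks in exactly when three reports are already running.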

When a worker runs a "fast" report, a statement_timeout would be set to the maximum amount of time allowed for fast reports. If the report's query times out, the reporter.schedule row would be reset by setting start_time to NULL and is_slow to TRUE.
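A minimal simulation of that reset logic, assuming a dict stands in for the reporter.schedule row (in the real worker the cap would be enforced by setting statement_timeout on the report's database session, not by comparing durations):

```python
FAST_TIMEOUT = 5  # seconds; a proposed default for "fast" reports

def run_fast_attempt(row, query_runtime, fast_timeout=FAST_TIMEOUT):
    """Simulate one attempt at a report assumed to be fast.

    query_runtime stands in for how long the report's SQL actually
    takes; returns True if the run completed within the cap.
    """
    if query_runtime <= fast_timeout:
        row["complete_time"] = "now"  # finished within the cap
        return True
    # Query hit the fast-report statement_timeout: reschedule as slow,
    # i.e. UPDATE reporter.schedule SET start_time = NULL, is_slow = TRUE
    row["start_time"] = None
    row["is_slow"] = True
    return False
```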

With this approach, setting the concurrency to 4 and the number reserved to 1 would allow up to 3 slow reports to be run in parallel while reserving a worker slot for the fast reports to continue to flow.

If the concurrency is left at just 1 and the number reserved set to 1, any available fast reports are run first; but once the queue reaches a slow report, that report will block further reports until it completes. If the concurrency is set to 1 and the number reserved to 0, behavior is unchanged from the status quo: reports are simply processed based on their scheduled time.
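The single-worker ordering can be sketched with a toy drain simulation (assumptions as above: a static queue, workers modeled as freeing up one at a time, and reports as (name, is_slow) pairs in scheduled order):

```python
def drain_order(queue, concurrency=1, num_reserved=1):
    """Return the order in which reports start under the proposed rule.

    Simplified model: each iteration applies the selection condition
    with no reports currently running, so the reserved-slot check is
    concurrency - 0 <= num_reserved.
    """
    pending = list(queue)
    started = []
    while pending:
        if concurrency <= num_reserved:
            # Reserved slot: prefer the first fast report, else fall
            # back to the first pending report of any speed.
            pick = next((r for r in pending if not r[1]), pending[0])
        else:
            pick = pending[0]  # plain first-come, first-served
        pending.remove(pick)
        started.append(pick[0])
    return started
```

With reserved set to 1 the fast reports jump ahead of an earlier slow one; with reserved 0 the queue drains strictly in scheduled order.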

Mike Rylander contributed to this idea.

Galen Charlton (gmc) wrote:

One wrinkle to the suggested approach: Clark would need to distinguish between the timeout used to determine whether a report run will be "fast" or not versus the report timing out because it hit the maximum permitted time set by the --statement-timeout switch. Hitting the latter should continue to mean that the report run has simply failed and will not be automatically attempted again.
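Since the database raises the same error either way, the worker itself would have to remember which cap it set. A sketch of that classification (field names are illustrative, not the actual reporter.schedule columns beyond those discussed above):

```python
def on_query_timeout(row, attempted_as_fast):
    """Classify a statement_timeout error per the distinction above.

    If the worker had set the short fast-report cap, the run is
    rescheduled as slow; if it ran under the full --statement-timeout
    limit, the run has simply failed and is not retried.
    """
    if attempted_as_fast and not row["is_slow"]:
        row["start_time"] = None
        row["is_slow"] = True
        return "rescheduled"
    row["error_text"] = "statement timeout"  # hypothetical failure field
    return "failed"
```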

Galen Charlton (gmc) wrote (last edit):

A couple of data points for choosing a good default time limit for fast reports:

Big Consortium 1
----------------
50th percentile time to complete a report: 3.5 seconds
75th percentile: 9.0 seconds
90th percentile: 15.1 seconds

Big Consortium 2
----------------
50th percentile time to complete a report: 2.5 seconds
75th percentile: 6.0 seconds
90th percentile: 16.5 seconds

That suggests to me that a default of as much as 20-30 seconds could make sense, since it's the reports that take multiple minutes to run that would be the ones blocking the faster reports.
