Activity log for bug #1600014

Date Who What changed Old value New value Message
2016-07-07 21:24:41 Eric K bug added bug
2016-07-07 21:38:12 Eric K description Implement changes to mitigate missed actions during PE failover. Please see spec and blueprint for more information. https://github.com/openstack/congress-specs/blob/master/specs/newton/high-availability-design.rst https://blueprints.launchpad.net/congress/+spec/high-availability-design Implement changes to mitigate missed actions during PE failover. Please see discussion, spec and blueprint for more information. https://github.com/openstack/congress-specs/blob/master/specs/newton/high-availability-design.rst https://blueprints.launchpad.net/congress/+spec/high-availability-design https://review.openstack.org/#/c/318383/
Relevant discussion:
Let's say there are two PE instances, P1 and P2. A sequence of data updates takes place which is expected to trigger a sequence of actions a1, a2. Both PE instances are expected to see the same data updates and request the same sequence of actions. Assume we have a single data source driver D. Let's consider the following sequence of events:
P1 requests a1.
D receives P1:a1, selects P1 as primary, executes a1.
P2 requests a1.
D receives P2:a1, logs request but does not execute because P2 is not primary.
P1 crashes.
P2 requests a2.
D receives P2:a2, logs request but does not execute because P2 is not primary.
D detects that P1 has crashed.
At this point, D selects P2 as the new primary, and will execute future requests from P2. But the action a2 was never executed. To make sure the action is not missed, D can look at the recent requests it received from P2 (a1, a2), compare them with the recently executed actions (a1), and determine that it should execute a2. Everything worked perfectly.
However, from the information given so far, D cannot distinguish the above sequence of events from the following sequence of events, where the correct sequence of actions is actually a1, a1, a2:
P1 requests a1.
D receives P1:a1, selects P1 as primary, executes a1.
P2 crashes.
[P2 should request a1, but doesn't because it is down.]
P2 recovers.
P1 crashes.
[P1 should request a1 again, but doesn't because it is down.]
P2 requests a1.
D receives P2:a1, logs request but does not execute because P2 is not primary.
P2 requests a2.
D receives P2:a2, logs request but does not execute because P2 is not primary.
D detects that P1 has crashed.
In order to determine whether a1 should be executed once or twice, D must determine whether the a1 request from P2 "matches" the a1 request from P1. In order to do this matching well, we can keep track of the following meta-info around the action request:
- the time-stamp
- the latest data update that triggered the action request on the PE, expressed logically as <datasource, table, seqnum>
This meta-info still isn't enough to guarantee perfect matching. The time-stamp can be off by an arbitrarily large amount because two PE instances may receive the updates at different times and therefore trigger the action request at different times. The seqnum of the data update that triggered the action request is much better, but it's not infallible because logically the same action event may be triggered by different data updates on different PE instances. Say we have two rules:
execute[a1] :- ds1:p
execute[a1] :- ds2:q
Two data updates, ds1:p and ds2:q, go out at "the same time", but PE1 receives p first, and PE2 receives q first. Now they both send the execute[a1] request, but with different data update triggers. The receiving datasource driver does not know that a1 should be executed only once and, in order to guarantee at-least-once execution, must execute it twice.
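The matching described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Congress code: the names ActionRequest, DriverSketch, receive and on_primary_failed are hypothetical, and the matching key (action plus the <datasource, table, seqnum> trigger) is one possible reading of the meta-info listed above. The time-stamp is left out because, as noted, it can be arbitrarily far off.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Logical id of the data update that triggered a request:
# <datasource, table, seqnum>
Trigger = Tuple[str, str, int]


@dataclass(frozen=True)
class ActionRequest:
    """Hypothetical action request carrying its trigger meta-info."""
    pe: str
    action: str
    trigger: Trigger


class DriverSketch:
    """Hypothetical datasource driver D with a primary PE and failover replay."""

    def __init__(self) -> None:
        self.primary: Optional[str] = None
        self.executed: List[ActionRequest] = []  # requests actually executed
        self.pending: List[ActionRequest] = []   # logged, not executed

    def receive(self, req: ActionRequest) -> None:
        if self.primary is None:
            self.primary = req.pe  # first requester becomes primary
        if req.pe == self.primary:
            self.executed.append(req)
        else:
            self.pending.append(req)

    def on_primary_failed(self, new_primary: str) -> List[ActionRequest]:
        """Promote new_primary and execute its logged requests that do not
        match an already executed request. Returns the replayed requests."""
        self.primary = new_primary
        done = Counter((r.action, r.trigger) for r in self.executed)
        replayed = []
        for req in self.pending:
            if req.pe != new_primary:
                continue
            key = (req.action, req.trigger)
            if done[key] > 0:
                done[key] -= 1  # matched an executed request, skip it
            else:
                self.executed.append(req)  # unmatched: execute now
                replayed.append(req)
        self.pending = [r for r in self.pending if r.pe != new_primary]
        return replayed


# Same trigger on both PEs: only the missed action a2 is replayed.
d1 = DriverSketch()
d1.receive(ActionRequest("P1", "a1", ("ds1", "p", 1)))
d1.receive(ActionRequest("P2", "a1", ("ds1", "p", 1)))
d1.receive(ActionRequest("P2", "a2", ("ds1", "p", 2)))
replayed_same = [r.action for r in d1.on_primary_failed("P2")]

# Different triggers for logically the same a1: a1 is executed twice.
d2 = DriverSketch()
d2.receive(ActionRequest("P1", "a1", ("ds1", "p", 1)))
d2.receive(ActionRequest("P2", "a1", ("ds2", "q", 1)))
replayed_diff = [r.action for r in d2.on_primary_failed("P2")]
```

In the first run the two a1 requests match and only a2 is replayed. In the second run the triggers differ, so the sketch executes a1 a second time, which is the at-least-once behaviour described above.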
There is an even tougher deduplication scenario:
execute[a1] :- ds1:p, NOT ds2:q
execute[a2] :- ds2:q, NOT ds1:p
If PE1 receives p before q, PE1 requests execute[a1].
If PE2 receives q before p, PE2 requests execute[a2].
The receiving datasource driver has no choice but to execute both a1 and a2, but in fact it is technically incorrect because neither ordering of the two data updates should trigger both a1 and a2.
All that being said, the duplicate executions should be quite rare because unless the primary PE goes down, the receiving DSD simply continues executing requests from the primary PE, without having to do matching. All other approaches suffer from a similar problem because there is simply no way to guarantee exactly-once execution without having Congress and all its effectors inside a transactional system.
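The order dependence in the two negation rules above can be illustrated with a small simulation. This is a sketch only: requested_actions is a hypothetical helper, and it assumes a PE re-evaluates both rules after each data update arrives and requests an action the first time its rule body becomes true. The real policy engine may behave differently.

```python
from typing import List


def requested_actions(arrival_order: List[str]) -> List[str]:
    """Return the actions a single PE would request, given the order in
    which it receives the data updates (hypothetical evaluation model)."""
    facts = set()
    requested: List[str] = []
    for fact in arrival_order:
        facts.add(fact)
        # execute[a1] :- ds1:p, NOT ds2:q
        if "ds1:p" in facts and "ds2:q" not in facts and "a1" not in requested:
            requested.append("a1")
        # execute[a2] :- ds2:q, NOT ds1:p
        if "ds2:q" in facts and "ds1:p" not in facts and "a2" not in requested:
            requested.append("a2")
    return requested


pe1 = requested_actions(["ds1:p", "ds2:q"])  # PE1 receives p before q
pe2 = requested_actions(["ds2:q", "ds1:p"])  # PE2 receives q before p

# What a driver that honours every unmatched request ends up executing.
driver_executes = sorted(set(pe1) | set(pe2))
```

Under these assumptions each PE requests exactly one action, but a driver that executes every unmatched request ends up running both a1 and a2, which no single ordering of the two updates would produce.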
2016-07-13 21:40:25 Eric K congress: assignee Eric K (ekcs)
2017-06-29 18:07:06 Eric K congress: importance Low Wishlist
2017-06-29 18:07:13 Eric K congress: status New Triaged