ovn recovery has a race

Bug #1835830 reported by Michele Baldessari
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari

Bug Description

Currently there is a race in the high availability of OVN when resetting a controller. The VIP that OVN uses (the internal_api VIP by default) only has a colocation constraint with the master role of the ovn-dbs resource. This leaves the following race open:
1) We reboot ctrl-0, which hosts the master role of ovn-dbs
2) ovn-dbs becomes master on ctrl-1 from pacemaker's POV (but the
   promotion operation running in the background has not completed)
3) The OVN VIP moves to ctrl-1 even though the DB there is still in slave mode
   (there is only a colocation constraint between the VIP and the master role of ovn-dbs)
4) OVN controllers on the overcloud connect to the VIP, but the DB behind it
   is in read-only mode because it is still a slave
5) OVN controllers that connected at 4) stay in read-only mode forever,
   until they are restarted manually.

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/669803

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/669610
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5c10f331977152d4e69e91ed67cd14304cef6a18
Submitter: Zuul
Branch: master

commit 5c10f331977152d4e69e91ed67cd14304cef6a18
Author: Michele Baldessari <email address hidden>
Date: Mon Jul 8 08:53:27 2019 +0200

    Close OVN VIP race by adding an ordering constraint

    Currently there is a race with the high-availability of ovn when resetting a
    controller. Namely, the VIP that OVN uses (namely the internal_api VIP
    by default) only has a colocation constraint with the master role of the
    ovn-dbs resource. This leaves the following race open:
    1) We reboot ctrl-0 hosting the master role of ovn-dbs
    2) OVN becomes master on ctrl-1 from pacemaker's POV (but the
       promotion operation running in the background is not completed)
    3) OVN VIP moves to ctrl-1 even though it is still in slave mode
      (there is only a colocation constraint between vip and master role
      for ovn)
    4) OVN controllers on the overcloud connect to the VIP but it is in
      read-only mode because it was a slave
    5) OVN controllers that connected at 4) stay in read-only forever
       until they get restarted manually.

    With the addition of this constraint we force the VIP to move only after
    the master role has been promoted. This makes it much less likely
    for a client to connect to the VIP and get a read-only db behind it.
    With only this patch applied I did not manage to reproduce
    the issue (even after 7 reboots of controllers).
    Note that there is still a small race window possible because the
    current OVN resource agent has a bug: it reports a resource as master
    after issuing the promotion command to the DB but without waiting for
    this promotion to complete. A patch for the OVN resource agent will
    also be submitted, but from initial testing this change seems to be
    largely sufficient.

    Also note that this change introduces a small, less desirable
    side-effect: a failover of the internal VIP will now take a bit
    longer because it will happen only after ovn-dbs gets promoted to
    master.
    We plan to take care of this fully by decoupling the OVN VIP from the
    internal_api one. This change addresses the immediate issue related
    to ovn_controllers being stuck in read-only due to premature promotion.
    (OVN upstream is discussing how to make connections to read-only VIP
    trigger a reconnection eventually)

    Closes-Bug: #1835830

    Change-Id: I3fa07e28c4e37197890664d12a265f1673c780f2
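
The effect of the ordering constraint described in the commit message above can be sketched with a toy timeline model. This is only an illustration under simplifying assumptions, not code from puppet-tripleo or pacemaker: the `Failover` class, its three timestamps, and both helper functions are hypothetical names introduced here. The model assumes that with colocation alone the VIP can start as soon as pacemaker assigns the master role, and that with the ordering constraint it also waits until the resource agent's promote action has returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failover:
    """Hypothetical timestamps (in seconds) for one ovn-dbs failover."""
    role_assigned: float     # pacemaker picks the new master node
    promote_returned: float  # the resource agent's promote action returns
    db_writable: float       # the OVN DB has really left read-only mode


def vip_start_time(f: Failover, ordered: bool) -> float:
    """Earliest moment the VIP can start on the new node (simplified)."""
    # Colocation only: the VIP follows the master role once it is assigned.
    # Ordering constraint: the VIP also waits for the promote action to return.
    return f.promote_returned if ordered else f.role_assigned


def read_only_window(f: Failover, ordered: bool) -> float:
    """Seconds during which a client reaching the VIP gets a read-only DB."""
    return max(0.0, f.db_writable - vip_start_time(f, ordered))


# Made-up numbers: promote returns 0.5 s before the DB is actually writable,
# mirroring the resource agent bug mentioned in the commit message.
example = Failover(role_assigned=0.0, promote_returned=5.0, db_writable=5.5)
print(read_only_window(example, ordered=False))  # 5.5
print(read_only_window(example, ordered=True))   # 0.5
```

In this model the ordering constraint shrinks the window to the gap between the promote action returning and the DB becoming writable, which corresponds to the small remaining race the commit message attributes to the resource agent. A resource agent that waited for the promotion to complete would make `promote_returned` equal to `db_writable`, and the window would be zero.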

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/stein)

Reviewed: https://review.opendev.org/669803
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=dc4bb7e7cb652e41f55b5b7ba8b93628afab10cf
Submitter: Zuul
Branch: stable/stein

commit dc4bb7e7cb652e41f55b5b7ba8b93628afab10cf
Author: Michele Baldessari <email address hidden>
Date: Mon Jul 8 08:53:27 2019 +0200

    Close OVN VIP race by adding an ordering constraint

    Closes-Bug: #1835830

    Change-Id: I3fa07e28c4e37197890664d12a265f1673c780f2
    (cherry picked from commit 5c10f331977152d4e69e91ed67cd14304cef6a18)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/669996

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/rocky)

Reviewed: https://review.opendev.org/669996
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=bcea8ea1aac2a4b892007ced0b28d34b7651fecc
Submitter: Zuul
Branch: stable/rocky

commit bcea8ea1aac2a4b892007ced0b28d34b7651fecc
Author: Michele Baldessari <email address hidden>
Date: Mon Jul 8 08:53:27 2019 +0200

    Close OVN VIP race by adding an ordering constraint

    Closes-Bug: #1835830

    Change-Id: I3fa07e28c4e37197890664d12a265f1673c780f2
    (cherry picked from commit 5c10f331977152d4e69e91ed67cd14304cef6a18)
    (cherry picked from commit dc4bb7e7cb652e41f55b5b7ba8b93628afab10cf)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 11.1.0

This issue was fixed in the openstack/puppet-tripleo 11.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/675861

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.opendev.org/675861
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=a0cfe0afddc24a17b9ef5ac9d24f22086a45055f
Submitter: Zuul
Branch: stable/queens

commit a0cfe0afddc24a17b9ef5ac9d24f22086a45055f
Author: Michele Baldessari <email address hidden>
Date: Mon Jul 8 08:53:27 2019 +0200

    Close OVN VIP race by adding an ordering constraint

    Closes-Bug: #1835830

    Change-Id: I3fa07e28c4e37197890664d12a265f1673c780f2
    (cherry picked from commit 5c10f331977152d4e69e91ed67cd14304cef6a18)
    (cherry picked from commit dc4bb7e7cb652e41f55b5b7ba8b93628afab10cf)
    (cherry picked from commit bcea8ea1aac2a4b892007ced0b28d34b7651fecc)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.5.1

This issue was fixed in the openstack/puppet-tripleo 10.5.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.5.1

This issue was fixed in the openstack/puppet-tripleo 9.5.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.5.1

This issue was fixed in the openstack/puppet-tripleo 8.5.1 release.
