Bug #1847780 “ovn-dbs cluster ends the update process with 2 nod...” : Bugs : tripleo

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-10-11: Fix proposed to tripleo-heat-templates (master)

#1

Fix proposed to branch: master
Review: https://review.opendev.org/688212

Changed in tripleo:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-10-16: Fix merged to tripleo-heat-templates (master)

#2

Reviewed: https://review.opendev.org/688212
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=751b3fc09632cf184bf947be8f74d771554a23d0
Submitter: Zuul
Branch: master

commit 751b3fc09632cf184bf947be8f74d771554a23d0
Author: Sofer Athlan-Guyot <email address hidden>
Date: Fri Oct 11 16:10:00 2019 +0200

Workaround ovn cluster failure during update when schema change.

    During update the ovndb server can have a schema change. The problem
    is that an updated slave ovndb wouldn't connect to a master which
    still has the old db schema. At some point (200000ms) pacemaker put
    the resource in error Time Out. Then it will wait for the operator to
    cleanup the resource. Meaning that the update can goes like this:

     - Original state: (Master, Slave, Failed): nothing updated
       - ctl0-M-old
       - ctl1-S-old
       - ctl2-S-old
     - First state: after update of ctl0
       - ctl0-F-new
       - ctl1-M-old
       - ctl2-S-old
     - Second state: after update of ctl1
       - ctl0-F-new
       - ctl1-F-new
       - ctl2-M-old
     - Third and final state: after update of ctl2
       - ctl0-F-new
       - ctl1-F-new
       - ctl2-M-new

    During the third state we have a cut in the control plane as ctl2 is
    the master and there is no slave to fall back to. Then we end up
    loosing HA as only one node is active. The error persists after
    reboot. Only a pcs resource cleanup will bring the cluster online.

    The real solution will come from ovndb and the associated ocf agent,
    but in the meantime, we workaround it by:
     - cleanup
     - ban the resource;
    in step 1 and:
     - cleanup
     - unban the resource
    in step 5.

    This has the net effect of preventing the cut in the control plane for
    the last node as we move master to the updated controller which will
    form a cluster of one master and one slave (as two are updated). The
    last one will happily join then when it will be updated.

    That means:
     - we always have either 1 or 2 nodes working;
     - we end the update with the cluster converged back to a stable
     state.

    The problems are :
     - we could hide a real ovndb cluster issue;
    - if the update break in-between we could have a leftover ban on one
     of the node;

But, all things considered, this looks like the best compromise for
the time being.

Change-Id: I8f71bf83ddafca167deae1a38ca819f7d930fb80
Closes-Bug: #1847780

Reviewed:  https://review.opendev.org/688212
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=751b3fc09632cf184bf947be8f74d771554a23d0
Submitter: Zuul
Branch:    master

commit 751b3fc09632cf184bf947be8f74d771554a23d0
Author: Sofer Athlan-Guyot <sathlang@redhat.com>
Date:   Fri Oct 11 16:10:00 2019 +0200

Workaround ovn cluster failure during update when schema change.
    
    During update the ovndb server can have a schema change. The problem
    is that an updated slave ovndb wouldn't connect to a master which
    still has the old db schema.  At some point (200000ms) pacemaker put
    the resource in error Time Out.  Then it will wait for the operator to
    cleanup the resource.  Meaning that the update can goes like this:
    
     - Original state: (Master, Slave, Failed): nothing updated
       - ctl0-M-old
       - ctl1-S-old
       - ctl2-S-old
     - First state: after update of ctl0
       - ctl0-F-new
       - ctl1-M-old
       - ctl2-S-old
     - Second state: after update of ctl1
       - ctl0-F-new
       - ctl1-F-new
       - ctl2-M-old
     - Third and final state: after update of ctl2
       - ctl0-F-new
       - ctl1-F-new
       - ctl2-M-new
    
    During the third state we have a cut in the control plane as ctl2 is
    the master and there is no slave to fall back to. Then we end up
    loosing HA as only one node is active.  The error persists after
    reboot.  Only a pcs resource cleanup will bring the cluster online.
    
    The real solution will come from ovndb and the associated ocf agent,
    but in the meantime, we workaround it by:
     - cleanup
     - ban the resource;
    in step 1 and:
     - cleanup
     - unban the resource
    in step 5.
    
    This has the net effect of preventing the cut in the control plane for
    the last node as we move master to the updated controller which will
    form a cluster of one master and one slave (as two are updated).  The
    last one will happily join then when it will be updated.
    
    That means:
     - we always have either 1 or 2 nodes working;
     - we end the update with the cluster converged back to a stable
     state.
    
    The problems are :
     - we could hide a real ovndb cluster issue;
    - if the update break in-between we could have a leftover ban on one
     of the node;
    
    But, all things considered, this looks like the best compromise for
    the time being.
    
    Change-Id: I8f71bf83ddafca167deae1a38ca819f7d930fb80
    Closes-Bug: #1847780

Changed in tripleo:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-10-16: Fix proposed to tripleo-heat-templates (stable/stein)

#3

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/688846

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-10-17: Fix merged to tripleo-heat-templates (stable/stein)

#4

Reviewed: https://review.opendev.org/688846
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=d9c60ab05e9d67c35602d3fdf5f2658c49028874
Submitter: Zuul
Branch: stable/stein

commit d9c60ab05e9d67c35602d3fdf5f2658c49028874
Author: Sofer Athlan-Guyot <email address hidden>
Date: Fri Oct 11 16:10:00 2019 +0200