ovn-dbs cluster ends the update process with 2 nodes stopped.

Bug #1847780 reported by Sofer Athlan-Guyot
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sofer Athlan-Guyot
Milestone: (none)

Bug Description

Hi,

after a Stein update the ovndb cluster is "broken", meaning that only one node is alive (the master).

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

     podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
       galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
       galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
       galera-bundle-2 (ocf::heartbeat:galera): Master controller-2
     podman container set: rabbitmq-bundle [192.168.24.1:8787/rhosp15/openstack-rabbitmq:pcmklatest]
       rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
       rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
       rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
     podman container set: redis-bundle [192.168.24.1:8787/rhosp15/openstack-redis:pcmklatest]
       redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
       redis-bundle-1 (ocf::heartbeat:redis): Slave controller-1
       redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2
     ip-192.168.24.15 (ocf::heartbeat:IPaddr2): Started controller-0
     ip-10.0.0.110 (ocf::heartbeat:IPaddr2): Started controller-1
     ip-172.17.1.72 (ocf::heartbeat:IPaddr2): Started controller-0
     ip-172.17.1.108 (ocf::heartbeat:IPaddr2): Started controller-2
     ip-172.17.3.110 (ocf::heartbeat:IPaddr2): Started controller-0
     ip-172.17.4.102 (ocf::heartbeat:IPaddr2): Started controller-1
     podman container set: haproxy-bundle [192.168.24.1:8787/rhosp15/openstack-haproxy:pcmklatest]
       haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
       haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-1
       haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
     podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
       ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped controller-0
       ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped controller-1
       ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Master controller-2
     podman container: openstack-cinder-volume [192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest]
       openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-1

    Failed Resource Actions:
    * ovndb_servers_start_0 on ovn-dbs-bundle-0 'unknown error' (1): call=8, status=Timed Out, exitreason='',
        last-rc-change='Wed Oct 9 09:12:56 2019', queued=0ms, exec=200002ms
    * ovndb_servers_start_0 on ovn-dbs-bundle-1 'unknown error' (1): call=8, status=Timed Out, exitreason='',
        last-rc-change='Wed Oct 9 09:42:35 2019', queued=0ms, exec=200002ms

The state persists even after reboot. So we have a small cut in the control plane, but it's still working. We lose HA though.

A simple pcs resource cleanup solves it.
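
For reference, the recovery is just a resource cleanup so that pacemaker forgets the failed start actions and retries them. A minimal sketch, assuming the resource name ovn-dbs-bundle from the pcs status output above, run from any controller:

    # clear the recorded start failures for the OVN db servers
    pcs resource cleanup ovn-dbs-bundle
    # check that the stopped replicas start again
    pcs status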

Originally reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1760405

Full explanation of the issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1759974#c4

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/688212

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/688212
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=751b3fc09632cf184bf947be8f74d771554a23d0
Submitter: Zuul
Branch: master

commit 751b3fc09632cf184bf947be8f74d771554a23d0
Author: Sofer Athlan-Guyot <email address hidden>
Date: Fri Oct 11 16:10:00 2019 +0200

    Work around OVN cluster failure during update when the schema changes.

    During an update the ovndb server can have a schema change. The
    problem is that an updated slave ovndb won't connect to a master
    which still has the old db schema. At some point (200000ms)
    pacemaker puts the resource in a Timed Out error state, and then
    waits for the operator to clean up the resource. That means the
    update can go like this:

     - Original state: (Master, Slave, Failed): nothing updated
       - ctl0-M-old
       - ctl1-S-old
       - ctl2-S-old
     - First state: after update of ctl0
       - ctl0-F-new
       - ctl1-M-old
       - ctl2-S-old
     - Second state: after update of ctl1
       - ctl0-F-new
       - ctl1-F-new
       - ctl2-M-old
     - Third and final state: after update of ctl2
       - ctl0-F-new
       - ctl1-F-new
       - ctl2-M-new

    During the third state we have a cut in the control plane, as ctl2
    is the master and there is no slave to fall back to. We end up
    losing HA as only one node is active. The error persists after
    reboot; only a pcs resource cleanup will bring the cluster back
    online.

    The real solution will come from ovndb and the associated OCF
    agent, but in the meantime we work around it by:
     - cleanup
     - ban the resource
    in step 1, and:
     - cleanup
     - unban the resource
    in step 5.

    This has the net effect of preventing the cut in the control plane
    on the last node, as we move the master to an updated controller,
    which then forms a cluster of one master and one slave (as two are
    updated). The last node happily joins once it is updated.

    That means:
     - we always have either 1 or 2 nodes working;
     - we end the update with the cluster converged back to a stable
     state.

    The problems are:
     - we could hide a real ovndb cluster issue;
     - if the update breaks in between, we could be left with a
       leftover ban on one of the nodes.

    But, all things considered, this looks like the best compromise for
    the time being.

    Change-Id: I8f71bf83ddafca167deae1a38ca819f7d930fb80
    Closes-Bug: #1847780
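
At the pcs level, the workaround described in the commit above amounts to roughly the following sketch. The node name is only illustrative (whichever controller is being updated), and the actual tripleo-heat-templates tasks decide the exact scope of the ban; pcs resource clear is the command that lifts a ban:

    # update step 1: clear any previous failure, then keep ovn-dbs off
    # the node while the rest of the cluster still runs the old schema
    pcs resource cleanup ovn-dbs-bundle
    pcs resource ban ovn-dbs-bundle controller-0

    # update step 5: clean up again and lift the ban so the updated
    # node rejoins the ovn-dbs cluster
    pcs resource cleanup ovn-dbs-bundle
    pcs resource clear ovn-dbs-bundle controller-0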

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/688846

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/stein)

Reviewed: https://review.opendev.org/688846
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=d9c60ab05e9d67c35602d3fdf5f2658c49028874
Submitter: Zuul
Branch: stable/stein

commit d9c60ab05e9d67c35602d3fdf5f2658c49028874
Author: Sofer Athlan-Guyot <email address hidden>
Date: Fri Oct 11 16:10:00 2019 +0200

    Work around OVN cluster failure during update when the schema changes.

    (Commit message identical to the master commit above.)

    Change-Id: I8f71bf83ddafca167deae1a38ca819f7d930fb80
    Closes-Bug: #1847780
    (cherry picked from commit 751b3fc09632cf184bf947be8f74d771554a23d0)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.3.0

This issue was fixed in the openstack/tripleo-heat-templates 11.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/691009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/691273

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.opendev.org/691009
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9ad87580baec0794e8dafbc11b4c2053fcb146de
Submitter: Zuul
Branch: stable/rocky

commit 9ad87580baec0794e8dafbc11b4c2053fcb146de
Author: Sofer Athlan-Guyot <email address hidden>
Date: Fri Oct 11 16:10:00 2019 +0200

    Work around OVN cluster failure during update when the schema changes.

    (Commit message identical to the master commit above.)

    Change-Id: I8f71bf83ddafca167deae1a38ca819f7d930fb80
    Closes-Bug: #1847780
    (cherry picked from commit 751b3fc09632cf184bf947be8f74d771554a23d0)
    (cherry picked from commit d9c60ab05e9d67c35602d3fdf5f2658c49028874)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.opendev.org/691273
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2be2edb05fd47d227cb97b614b04eddfd005da93
Submitter: Zuul
Branch: stable/queens

commit 2be2edb05fd47d227cb97b614b04eddfd005da93
Author: Sofer Athlan-Guyot <email address hidden>
Date: Fri Oct 11 16:10:00 2019 +0200

    Work around OVN cluster failure during update when the schema changes.

    (Commit message identical to the master commit above.)

    Change-Id: I8f71bf83ddafca167deae1a38ca819f7d930fb80
    Closes-Bug: #1847780
    (cherry picked from commit 751b3fc09632cf184bf947be8f74d771554a23d0)
    (cherry picked from commit d9c60ab05e9d67c35602d3fdf5f2658c49028874)
    (cherry picked from commit 9ad87580baec0794e8dafbc11b4c2053fcb146de)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 10.6.2

This issue was fixed in the openstack/tripleo-heat-templates 10.6.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates rocky-eol

This issue was fixed in the openstack/tripleo-heat-templates rocky-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates queens-eol

This issue was fixed in the openstack/tripleo-heat-templates queens-eol release.
