OVN relies on one host address (SPoF)

Bug #1875223 reported by Radosław Piliszek
This bug affects 5 people
Affects         Status        Importance  Assigned to       Milestone
kolla-ansible   Fix Released  High        Michal Nasiadka
Antelope        Fix Released  High        Michal Nasiadka
Xena            Won't Fix     Undecided   Unassigned
Yoga            Fix Released  High        Unassigned
Zed             Fix Released  High        Unassigned

Bug Description

OVN is wired such that it relies on the magic 0th host's API address - it should presumably use the VIP address instead. @Michał to check.

Tags: ovn
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Side note: Found during IPv6 debugging.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Extra issues:
OVN DB ports do not get set on the OVN daemons, only on the Neutron side.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

@Michał:
I accidentally ate up the "DBs" part - it is the OVN DBs that are affected, both NB and SB.
Another thing is that their cmdlines are awfully long - they could use templating out in Ansible. :/
Broken-up NB cmdline (SB is analogous):

/usr/share/ovn/scripts/ovn-ctl run_nb_ovsdb
  --db-nb-create-insecure-remote=yes
  --db-nb-addr={{ api_interface_address | put_address_in_context('url') }}
  --db-nb-cluster-local-addr={{ api_interface_address | put_address_in_context('url') }}
  {% if groups['ovn-nb-db'] | length > 1 and inventory_hostname != groups['ovn-nb-db'][0] %}
    --db-nb-cluster-remote-addr={{ 'api' | kolla_address(groups['ovn-nb-db'][0]) | put_address_in_context('url') }}
  {% endif %}
  --db-sock=/run/ovn/ovnnb_db.sock
  --db-nb-pid=/run/ovn/ovnnb_db.pid
  --db-nb-file=/var/lib/openvswitch/ovn-nb/ovnnb.db
  --ovn-nb-logfile=/var/log/kolla/openvswitch/ovn-nb-db.log

The primary issue is with groups['ovn-nb-db'][0]: if the 0th host is gone or unable to start for some reason, then the integration is doomed.
And the extra issue was about not telling DB to listen on the selected port.
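
A hedged sketch of what passing the port through might look like - ovn-ctl accepts --db-nb-port / --db-sb-port for the insecure remote, but the ovn_nb_db_port variable name below is hypothetical, not an existing kolla-ansible variable:

```shell
# Sketch only: make the NB DB itself listen on the configured port instead of
# the compiled-in default (the SB side is analogous with --db-sb-port).
/usr/share/ovn/scripts/ovn-ctl run_nb_ovsdb \
  --db-nb-create-insecure-remote=yes \
  --db-nb-addr={{ api_interface_address | put_address_in_context('url') }} \
  --db-nb-port={{ ovn_nb_db_port }} \
  --db-sock=/run/ovn/ovnnb_db.sock
```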

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

--db-nb-create-insecure-remote=yes means it will listen on the nb/sb port - the default port - or at least it should; let me double check :)
I agree with the [0] thingie - we could use some filter like | first, but then, if that host changes, each reconfigure run will generate new json files and ovn nb/sb will get restarted... so what's the best approach here?
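
Purely as a sketch of that trade-off - ovn_nb_db_running_hosts below is a hypothetical, runtime-gathered list of hosts whose NB DB is already up; nothing like it exists in kolla-ansible as-is:

```
{# Sketch only: prefer any live member as the remote ("donor"), falling back
   to the static first inventory host on a fresh deploy. #}
{% if ovn_nb_db_running_hosts | default([]) | length > 0 %}
  --db-nb-cluster-remote-addr={{ 'api' | kolla_address(ovn_nb_db_running_hosts | first) | put_address_in_context('url') }}
{% elif inventory_hostname != groups['ovn-nb-db'][0] %}
  --db-nb-cluster-remote-addr={{ 'api' | kolla_address(groups['ovn-nb-db'][0]) | put_address_in_context('url') }}
{% endif %}
```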

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Regarding ports - there are variables to "configure" them, but they only change them on the Neutron side.

You seem to be thinking about the addressing only in terms of a one-off deployment, but do remember these containers are meant to run longer than that and may undergo a series of restarts. There is then no mechanism to "fix" the addressing for the new situation. I don't know the OVN best practices here, but using the haproxy VIP address does not sound that bad if the address must be a single one.

Revision history for this message
Justinas Balciunas (justinas-balciunas) wrote :

I suggest that the SB DB and NB DB listen on the api_interface, as already defined in group_vars/all.yml for ovn_nb_connection and ovn_sb_connection using api_interface_address.

Also, it is very important which value is used for db-nb-cluster-remote-addr and db-sb-cluster-remote-addr, i.e. kolla_address(groups['ovn-nb-db'][0]) _definitely_ points to the first bootstrapped and initialized OVN SB and NB DB node.

Mark Goddard (mgoddard)
Changed in kolla-ansible:
milestone: 11.0.0 → none
Changed in kolla-ansible:
status: Triaged → In Progress
no longer affects: kolla-ansible/victoria
no longer affects: kolla-ansible/ussuri
no longer affects: kolla-ansible/wallaby
Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

Just as a summary, to understand what the problem really is.
Currently we:
- bootstrap the "master" (first DB node) on the first controller
- start the rest to join the master controller

This works fine up to the moment you need to reinstall the first controller.
If you do that and run deploy, it will create a new database on the first controller and you end up with two clusters (the reinstalled first controller in the first cluster, the rest in the other cluster).

The proper way should be:
NEW CLUSTER:
- check if cluster exists (it won't on a fresh install)
- bootstrap the first node with bootstrap CLI arguments
- bootstrap the other nodes with --join CLI arguments
- restart all of the nodes removing bootstrap/join CLI arguments

ADDING/RE-ADDING NODE:
- check if the cluster exists and get its nodes (or one that can be a "donor" for the new node)
- bootstrap the new node with --join CLI arguments
- restart the new node without --join arguments so it will join the cluster configured in the DB
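
Those steps map fairly directly onto stock OVS/OVN tooling; roughly as follows (paths and addresses are illustrative, not what kolla-ansible templates today):

```shell
# Has this node already been initialised as a cluster member?
ovsdb-tool db-is-clustered /var/lib/ovn/ovnnb_db.db

# Ask a running member who is in the cluster (donor candidates for a joiner).
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound

# Fresh install: bootstrap the first node as a one-member cluster.
ovsdb-tool create-cluster /var/lib/ovn/ovnnb_db.db \
    /usr/share/ovn/ovn-nb.ovsschema tcp:192.0.2.10:6643

# Adding/re-adding a node: join via a donor instead of creating a second cluster.
ovsdb-tool join-cluster /var/lib/ovn/ovnnb_db.db OVN_Northbound \
    tcp:192.0.2.11:6643 tcp:192.0.2.10:6643
```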

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/868929
Committed: https://opendev.org/openstack/kolla-ansible/commit/7cc4bf62031d293487b1714448ba7bb65ed324d6
Submitter: "Zuul (22348)"
Branch: master

commit 7cc4bf62031d293487b1714448ba7bb65ed324d6
Author: Michal Nasiadka <email address hidden>
Date: Fri Dec 30 15:19:27 2022 +0000

    ovn: Improve clustering

    Currently clustering steps are very static, if for a reason first
    node in the inventory fails and gets re-introduced - K-A will create
    a second empty cluster on that node.

    This patch changes the approach and checks if cluster exists, if it
    does - chooses a donor for the new node from currently running
    node set.

    Also it fixes node replacement - it removes old node from cluster
    (that has the same ip address as newly provisioned node).

    Closes-Bug: #1875223

    Change-Id: Ia025283e38ea7c3bd37c7a70d03f6b46c68f4456

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/893770

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/893771

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/893772

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/893770
Committed: https://opendev.org/openstack/kolla-ansible/commit/fdcb72b38c79094dde7398ed80fcee9f4bf2960d
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit fdcb72b38c79094dde7398ed80fcee9f4bf2960d
Author: Michal Nasiadka <email address hidden>
Date: Fri Dec 30 15:19:27 2022 +0000

    ovn: Improve clustering

    Currently clustering steps are very static, if for a reason first
    node in the inventory fails and gets re-introduced - K-A will create
    a second empty cluster on that node.

    This patch changes the approach and checks if cluster exists, if it
    does - chooses a donor for the new node from currently running
    node set.

    Also it fixes node replacement - it removes old node from cluster
    (that has the same ip address as newly provisioned node).

    Closes-Bug: #1875223

    Change-Id: Ia025283e38ea7c3bd37c7a70d03f6b46c68f4456
    (cherry picked from commit 7cc4bf62031d293487b1714448ba7bb65ed324d6)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/893771
Committed: https://opendev.org/openstack/kolla-ansible/commit/03db998f65fcb5cd1a2a9897b5cf7cb3476754cf
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 03db998f65fcb5cd1a2a9897b5cf7cb3476754cf
Author: Michal Nasiadka <email address hidden>
Date: Fri Dec 30 15:19:27 2022 +0000

    ovn: Improve clustering

    Currently clustering steps are very static, if for a reason first
    node in the inventory fails and gets re-introduced - K-A will create
    a second empty cluster on that node.

    This patch changes the approach and checks if cluster exists, if it
    does - chooses a donor for the new node from currently running
    node set.

    Also it fixes node replacement - it removes old node from cluster
    (that has the same ip address as newly provisioned node).

    Closes-Bug: #1875223

    Change-Id: Ia025283e38ea7c3bd37c7a70d03f6b46c68f4456
    (cherry picked from commit 7cc4bf62031d293487b1714448ba7bb65ed324d6)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/893772
Committed: https://opendev.org/openstack/kolla-ansible/commit/d626c02fb31a5b4d10f9f5c4541dda36dc103e95
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit d626c02fb31a5b4d10f9f5c4541dda36dc103e95
Author: Michal Nasiadka <email address hidden>
Date: Fri Dec 30 15:19:27 2022 +0000

    ovn: Improve clustering

    Currently clustering steps are very static, if for a reason first
    node in the inventory fails and gets re-introduced - K-A will create
    a second empty cluster on that node.

    This patch changes the approach and checks if cluster exists, if it
    does - chooses a donor for the new node from currently running
    node set.

    Also it fixes node replacement - it removes old node from cluster
    (that has the same ip address as newly provisioned node).

    Closes-Bug: #1875223

    Change-Id: Ia025283e38ea7c3bd37c7a70d03f6b46c68f4456
    (cherry picked from commit 7cc4bf62031d293487b1714448ba7bb65ed324d6)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 16.2.0

This issue was fixed in the openstack/kolla-ansible 16.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 14.10.0

This issue was fixed in the openstack/kolla-ansible 14.10.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 15.3.0

This issue was fixed in the openstack/kolla-ansible 15.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 17.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 17.0.0.0rc1 release candidate.
