Instance failover fails at stein due to inconsistent hypervisor naming

Bug #1839300 reported by Liam Young
This bug affects 2 people
Affects                                Status        Importance  Assigned to    Milestone
OpenStack HA Cluster Charm             Invalid       Medium      Unassigned
OpenStack Masakari Charm               Invalid       Medium      Unassigned
OpenStack Neutron Open vSwitch Charm   Fix Released  High        Frode Nordahl
OpenStack Nova Cloud Controller Charm  Invalid       Medium      Unassigned
OpenStack Nova Compute Charm           Fix Released  High        Frode Nordahl
OpenStack Pacemaker Remote Charm       Invalid       Medium      Unassigned

Bug Description

Masakari relies on the following to all match:

1) Name of hypervisor as shown in "openstack hypervisor list"
2) Name of hypervisor as shown in "nova service-list"
3) Name of hypervisor as listed in "crm status | grep Remote"

Currently at Stein, 1 is an FQDN but 2 & 3 are not. This causes errors like:

2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver Traceback (most recent call last):
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver File "/usr/lib/python3/dist-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver result = task.execute(**arguments)
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver File "/usr/lib/python3/dist-packages/masakari/engine/drivers/taskflow/host_failure.py", line 51, in execute
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver self.novaclient.enable_disable_service(self.context, host_name)
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver File "/usr/lib/python3/dist-packages/masakari/compute/nova.py", line 58, in wrapper
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver res = method(self, ctx, *args, **kwargs)
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver File "/usr/lib/python3/dist-packages/masakari/compute/nova.py", line 154, in enable_disable_service
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver service = nova.services.list(host=host_name, binary='nova-compute')[0]
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.driver IndexError: list index out of range
2019-08-07 09:23:28.425 21468 ERROR masakari.engine.drivers.taskflow.drive
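
A minimal sketch of the failure mode, not Masakari's actual code: the `[0]` index in the traceback raises `IndexError` whenever the service list filtered by `host_name` comes back empty, which is what happens when the name passed in is an FQDN but nova-compute is registered under a short name. The host names and both helper functions below are hypothetical and only illustrate the three-way name check described above.

```python
def name_mismatches(hypervisors, services, remotes):
    """Return the names that are not present in all three listings."""
    listings = [set(hypervisors), set(services), set(remotes)]
    common = set.intersection(*listings)
    return sorted(set.union(*listings) - common)


def find_compute_service(services, host_name):
    """Look up a nova-compute service by host, failing with a clear message."""
    matches = [s for s in services
               if s["host"] == host_name and s["binary"] == "nova-compute"]
    if not matches:
        # The unguarded equivalent, matches[0], is what raises IndexError.
        raise LookupError(
            "no nova-compute service registered as %r" % host_name)
    return matches[0]


# Hypothetical data mirroring the report: 1) is an FQDN, 2) and 3) are not.
hypervisors = ["myhost1.maas"]
services = [{"host": "myhost1", "binary": "nova-compute"}]
remotes = ["myhost1"]

print(name_mismatches(hypervisors, [s["host"] for s in services], remotes))
```

An empty result from `name_mismatches` would mean the three listings agree; any entry in it is a name that at least one listing does not know about.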

Revision history for this message
Frode Nordahl (fnordahl) wrote :
Changed in charm-nova-compute:
status: New → In Progress
importance: Undecided → High
milestone: none → 19.10
assignee: nobody → Frode Nordahl (fnordahl)
Revision history for this message
Frode Nordahl (fnordahl) wrote :
Changed in charm-neutron-openvswitch:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Frode Nordahl (fnordahl)
milestone: none → 19.10
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-neutron-openvswitch (master)

Reviewed: https://review.opendev.org/682608
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/commit/?id=31e2aabb03a2fb20e0b5bff4cde09e7b82107367
Submitter: Zuul
Branch: master

commit 31e2aabb03a2fb20e0b5bff4cde09e7b82107367
Author: Frode Nordahl <email address hidden>
Date: Sun Sep 15 21:42:23 2019 +0200

    Use FQDN when registering agents with Neutron

    The change of behaviour will only affect newly installed
    deployments on OpenStack Train and onwards.

    Also set upper constraint for ``python-cinderclient`` in the
    functional test requirements as it relies on the v1 client
    which has been removed. We will not fix this in Amulet, charm
    pending migration to the Zaza framework.

    Related-Bug: #1839300
    Needed-By: Ia73ed6b76fc7f18014d4fa913397cc069e51ff07
    Change-Id: Iee73164358745628a4b8658614608bc872771fd1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (master)

Reviewed: https://review.opendev.org/682607
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=c6455cc9559c4075e17024b1088ef6f2e9bb5ccf
Submitter: Zuul
Branch: master

commit c6455cc9559c4075e17024b1088ef6f2e9bb5ccf
Author: Frode Nordahl <email address hidden>
Date: Mon Sep 16 09:27:28 2019 +0200

    Use FQDN when registering agents with Nova

    The change of behaviour will only affect newly installed
    deployments on OpenStack Train and onwards.

    Also set upper constraint for ``python-cinderclient`` in the
    functional test requirements as it relies on the v1 client
    which has been removed. We will not fix this in Amulet, charm
    pending migration to the Zaza framework.

    Change-Id: Ia73ed6b76fc7f18014d4fa913397cc069e51ff07
    Depends-On: Iee73164358745628a4b8658614608bc872771fd1
    Closes-Bug: #1839300

Changed in charm-nova-compute:
status: In Progress → Fix Committed
Frode Nordahl (fnordahl)
Changed in charm-neutron-openvswitch:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-nova-compute:
status: Fix Committed → Fix Released
Changed in charm-neutron-openvswitch:
status: Fix Committed → Fix Released
Changed in charm-masakari:
importance: Undecided → Medium
status: New → Triaged
Changed in charm-pacemaker-remote:
importance: Undecided → Medium
status: New → Triaged
Changed in charm-hacluster:
importance: Undecided → Medium
status: New → Triaged
Changed in charm-nova-cloud-controller:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Drew Freiberger (afreiberger) wrote :

Unfortunately, I need to re-open this bug.

The nova-compute and neutron-openvswitch charms do not look up the FQDN of the same interface IP, which causes placement issues in neutron-api/ovs such as:

2019-12-30 22:22:13.095 503444 DEBUG neutron.plugins.ml2.drivers.mech_agent [req-f928aab6-4685-4bea-af59-ad115b952cbf 5580199fdfc343fcb3c41fb8636ea6c8 c80a5b62cfe6435ab44315de3d670b2f - 3ef3cedadd5a4331a11118211060834e 3ef3cedadd5a4331a11118211060834e] Port d54c0eaf-0f35-4e8f-abe3-bb9ca646107f on network a7afdf47-4eaf-4fe3-8a7d-97c4e2a8e4e1 not bound, no agent of type HyperV agent registered on host bond1.713.myhost1.maas bind_port /usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/mech_agent.py:104
2019-12-30 22:22:13.096 503444 ERROR neutron.plugins.ml2.managers [req-f928aab6-4685-4bea-af59-ad115b952cbf 5580199fdfc343fcb3c41fb8636ea6c8 c80a5b62cfe6435ab44315de3d670b2f - 3ef3cedadd5a4331a11118211060834e 3ef3cedadd5a4331a11118211060834e] Failed to bind port d54c0eaf-0f35-4e8f-abe3-bb9ca646107f on host bond1.713.myhost1.maas for vnic_type normal using segments [{'id': '92f10021-2981-4d37-a027-4b0a9ffbfc37', 'network_type': 'vxlan', 'physical_network': None, 'segmentation_id': 1064, 'network_id': 'a7afdf47-4eaf-4fe3-8a7d-97c4e2a8e4e1'}]

When I investigated the code changes, nova-compute uses:
        host_ip = get_relation_ip('cloud-compute',
                                  cidr_network=config('os-internal-network'))

but neutron-openvswitch uses:
        host_ip = get_relation_ip('neutron-plugin')

When I dump my bindings for my model, here is a subsection of the nova-compute bindings:

nova-compute:
    bindings:
      "": internal-space
      cloud-compute: internal-space

Here are all of the neutron-openvswitch bindings:
    bindings:
      data: internal-space

Oddly, I don't see bindings called out for "" or neutron-plugin. This may be an export-bundle shortcoming for subordinate charms.

I do see in the metadata.yaml for charm-neutron-openvswitch that the neutron-plugin binding is the container-local relation. Since that is a container-local relation, what would its default binding IP be? localhost? Is this triggering some sort of fallback?

Ultimately, when I look in openstack services, I see that openstack compute service list has "bond1.713.myhost1.maas" as the nova-compute host, and openstack network agent list has the ovs agent on the node "myhost1.maas".

I'd prefer both to be myhost1.maas and not rely upon the IP address for the interface on the binding, but I can understand use cases where a specific FQDN may be needed vs the host's primary hostname/FQDN.

Ultimately, the fix needs to result in the same FQDN for both nova-compute and neutron-openvswitch. That probably means nova-compute has to provide the FQDN to ovs over its relation, as there may be many reasons why the two could not share a binding that results in the same hostname.
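
To illustrate the mismatch, here is a hedged sketch under stated assumptions rather than the charms' real code. The reverse-DNS table and both IP addresses are made up (documentation address ranges), and the two address constants merely stand in for the results of the two `get_relation_ip` calls quoted above. The `share_principal_name` switch models the suggestion that nova-compute hands its name to the subordinate.

```python
# Hypothetical reverse-DNS data; the addresses are illustrative only.
REVERSE_DNS = {
    "192.0.2.10": "bond1.713.myhost1.maas",
    "198.51.100.10": "myhost1.maas",
}

# Stand-ins for what each charm's binding lookup might return.
NOVA_COMPUTE_IP = "192.0.2.10"         # e.g. address in os-internal-network
NEUTRON_OVS_IP = "198.51.100.10"       # e.g. address of the default binding


def fqdn_for(ip):
    """Resolve an IP to a name using the hypothetical table above."""
    return REVERSE_DNS[ip]


def registered_hosts(nova_ip, ovs_ip, share_principal_name=False):
    """Return the (nova-compute, ovs agent) host names that get registered.

    When share_principal_name is True, the subordinate reuses whatever
    name the principal chose instead of deriving its own.
    """
    nova_host = fqdn_for(nova_ip)
    ovs_host = nova_host if share_principal_name else fqdn_for(ovs_ip)
    return nova_host, ovs_host


print(registered_hosts(NOVA_COMPUTE_IP, NEUTRON_OVS_IP))
print(registered_hosts(NOVA_COMPUTE_IP, NEUTRON_OVS_IP,
                       share_principal_name=True))
```

With independent lookups the two names differ, so Neutron finds no agent on the host Nova reports. With the principal's name shared, both services register the same host regardless of which bindings are in place.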

Revision history for this message
Drew Freiberger (afreiberger) wrote :

subscribing field-high as this patch has created a different regression that is affecting a live customer who is mid-redeployment of a node after Stein upgrade.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

The hostname configured for services must indeed be exactly the same between the principal charm, in this case nova-compute, and the neutron-openvswitch subordinate for interactions with other parts of the deployment to work.

It is unfortunate that the change appears to have slipped on locking that down for all use cases.

While we start work to fix that, I would like to hear more about the different regression you refer to in #6.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

A related issue for Octavia was handled before the 19.10 release in bug 1845303; there we chose to have Octavia configured based on whichever hostname charm-neutron-openvswitch chose [0][1][2].

In retrospect it should probably be the other way around, i.e. the principal charm deciding which FQDN the subordinate uses.

I do understand your wish for "just using the primary FQDN" of a host, but from the charm's perspective there is unfortunately no such thing. There is no system call that accurately provides that information; the charm has to actively select an interface and IP address to build the FQDN from.

0: https://review.opendev.org/#/c/685940/
1: https://review.opendev.org/#/c/685941/
2: https://review.opendev.org/#/c/685942/

Revision history for this message
James Troup (elmo) wrote : Re: [Bug 1839300] Re: Instance failover fails at stein due to inconsistent hypervisor naming

Frode Nordahl <email address hidden> writes:

> A related issue for Octavia was handled before the 19.10 release in bug
> 1845303; there we chose to have Octavia configured based on whichever
> hostname charm-neutron-openvswitch chose [0][1][2].
>
> In retrospect it should probably be the other way around, i.e. the
> principal charm deciding which FQDN the subordinate uses.
>
> I do understand your wish for "just using the primary FQDN" of a host,
> but from the charm's perspective there is unfortunately no such thing.
> There is no system call that accurately provides that information; the
> charm has to actively select an interface and IP address to build the
> FQDN from.

Couldn't we go the other way and just use the short form everywhere?

--
James

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The short form is, again, dependent on the interface. If we look at the attached image from MAAS, the hostnames available (and configured in MAAS) are (with the rest of the FQDN in []):

- bond0[.proud-cub.maas]
- bond0.5[.proud-cub.maas]
- bond0.8[.proud-cub.maas]
- eno1[.proud-cub.maas]
- eno2[.proud-cub.maas]

While we can resolve `proud-cub.maas`, which of these hostnames it maps to is slightly more opaque. I believe that the IP address matched to the shorter (non-interface) hostname is derived from the MAAS configured default_gateway (https://askubuntu.com/questions/1069896/maas-2-4-setting-a-nodes-default-route), but I'm not positive of that. In the example given, asking MAAS to resolve the short hostname "proud-cub" returns the address on bond0. This would be a slightly catastrophic failure if bond0 isn't actually the interface used for live-migration, as the network may not actually be available to all of the machines bound to the configured space.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

To add some color on the background: our wish to use FQDNs goes beyond fixing Masakari. We are also increasingly met with other subsystems that default to FQDNs, like OVS/OVN, so continuing with just the shortname will be an uphill battle. We have also faced brittleness when running into other subsystems' configuration issues, such as missing search domain configuration.

I do see your concerns about using something bound to an interface name, or to other values that are mutable or likely to change over a cloud's lifetime.

The concerns Chris raises in #10 are also 100% valid and valuable, and I think we can address them with the existing migration network binding/configs.

To amend my comment in #8: there actually is a libc call we can use to get the "official" name of a host (this is also what `hostname -f` uses); with rigorous fallback, that might do it.

In any case we must ensure that the principal and subordinate charm on a single host agree on their name.
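
A minimal sketch of that idea, assuming the libc call in question is `getaddrinfo()` with the `AI_CANONNAME` flag, which Python exposes as `socket.getaddrinfo`. The function name `official_fqdn` and its injectable `resolver` parameter are hypothetical, chosen so the fallback can be exercised without a network; this is not the charm-helpers implementation.

```python
import socket


def official_fqdn(name=None, resolver=socket.getaddrinfo):
    """Return the canonical name of a host, or the given name on failure.

    name defaults to the local hostname. resolver is injectable for
    testing and must accept the same arguments as socket.getaddrinfo.
    """
    name = name or socket.gethostname()
    try:
        addrs = resolver(name, None,
                         type=socket.SOCK_DGRAM,
                         flags=socket.AI_CANONNAME)
    except OSError:
        # socket.gaierror is a subclass of OSError: resolution failed,
        # so fall back to the name we started with.
        return name
    for _family, _type, _proto, canonname, _sockaddr in addrs:
        # Only the first entry is expected to carry the canonical name.
        if canonname:
            return canonname
    return name


if __name__ == "__main__":
    print(official_fqdn())
```

Because both the principal and the subordinate would run the same lookup on the same machine, they should arrive at the same name without either having to pick an interface.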

Revision history for this message
Frode Nordahl (fnordahl) wrote :

This should take care of retrieving a host's primary FQDN: https://github.com/juju/charm-helpers/pull/415

I'll pair that with changes to the affected charms to make use of the amended context, retaining the interim behavior of what was released with 19.10.

Revision history for this message
Frode Nordahl (fnordahl) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (master)

Reviewed: https://review.opendev.org/701928
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=1869bfbc9711eac157821f8a4702409822c0842e
Submitter: Zuul
Branch: master

commit 1869bfbc9711eac157821f8a4702409822c0842e
Author: Frode Nordahl <email address hidden>
Date: Fri Jan 10 09:13:34 2020 +0100

    Use hosts official name for FQDN

    The current implementations use of a specific interface to build
    FQDN from has the undesired side effect of the ``nova-compute`` and
    ``neutron-openvswitch`` charms ending up with using different
    hostnames in some situations. It may also lead to use of a
    identifier that is mutable throughout the lifetime of a deployment.

    Use of a specific interface was chosen due to ``socket.getfqdn()``
    not giving reliable results (https://bugs.python.org/issue5004).

    This patch gets the FQDN by mimicking the behaviour of a call to
    ``hostname -f`` with fallback to shortname on failure.

    Add relevant update from c-h.

    Needed-By: Ic8f8742261b773484687985aa0a366391cd2737a
    Change-Id: I82db81937e5a46dc6bd222b7160ca1fa5b190c10
    Closes-Bug: #1839300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-openvswitch (master)

Reviewed: https://review.opendev.org/701929
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/commit/?id=ee709a5ab30f285ecc1dd3ddb998af970f22e17e
Submitter: Zuul
Branch: master

commit ee709a5ab30f285ecc1dd3ddb998af970f22e17e
Author: Frode Nordahl <email address hidden>
Date: Fri Jan 10 10:57:44 2020 +0100

    Use hosts official name for FQDN

    The current implementations use of a specific interface to build
    FQDN from has the undesired side effect of the ``nova-compute`` and
    ``neutron-openvswitch`` charms ending up with using different
    hostnames in some situations. It may also lead to use of a
    identifier that is mutable throughout the lifetime of a deployment.

    Use of a specific interface was chosen due to ``socket.getfqdn()``
    not giving reliable results (https://bugs.python.org/issue5004).

    This patch gets the FQDN by mimicking the behaviour of a call to
    ``hostname -f`` with fallback to shortname on failure.

    Add relevant update from c-h.

    Depends-On: I82db81937e5a46dc6bd222b7160ca1fa5b190c10
    Change-Id: Ic8f8742261b773484687985aa0a366391cd2737a
    Closes-Bug: #1839300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (stable/19.10)

Fix proposed to branch: stable/19.10
Review: https://review.opendev.org/702172

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-neutron-openvswitch (stable/19.10)

Fix proposed to branch: stable/19.10
Review: https://review.opendev.org/702173

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (stable/19.10)

Reviewed: https://review.opendev.org/702172
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=66405e2092d8739a66e40ea054827d842fa97924
Submitter: Zuul
Branch: stable/19.10

commit 66405e2092d8739a66e40ea054827d842fa97924
Author: Frode Nordahl <email address hidden>
Date: Fri Jan 10 09:13:34 2020 +0100

    Use hosts official name for FQDN

    The current implementations use of a specific interface to build
    FQDN from has the undesired side effect of the ``nova-compute`` and
    ``neutron-openvswitch`` charms ending up with using different
    hostnames in some situations. It may also lead to use of a
    identifier that is mutable throughout the lifetime of a deployment.

    Use of a specific interface was chosen due to ``socket.getfqdn()``
    not giving reliable results (https://bugs.python.org/issue5004).

    This patch gets the FQDN by mimicking the behaviour of a call to
    ``hostname -f`` with fallback to shortname on failure.

    Add relevant update from c-h.

    Change-Id: I82db81937e5a46dc6bd222b7160ca1fa5b190c10
    Closes-Bug: #1839300
    (cherry-picked from 1869bfbc9711eac157821f8a4702409822c0842e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-openvswitch (stable/19.10)

Reviewed: https://review.opendev.org/702173
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/commit/?id=71e38aae6401b975683c2b013deb483eadc5c117
Submitter: Zuul
Branch: stable/19.10

commit 71e38aae6401b975683c2b013deb483eadc5c117
Author: Frode Nordahl <email address hidden>
Date: Fri Jan 10 10:57:44 2020 +0100

    Use hosts official name for FQDN

    The current implementations use of a specific interface to build
    FQDN from has the undesired side effect of the ``nova-compute`` and
    ``neutron-openvswitch`` charms ending up with using different
    hostnames in some situations. It may also lead to use of a
    identifier that is mutable throughout the lifetime of a deployment.

    Use of a specific interface was chosen due to ``socket.getfqdn()``
    not giving reliable results (https://bugs.python.org/issue5004).

    This patch gets the FQDN by mimicking the behaviour of a call to
    ``hostname -f`` with fallback to shortname on failure.

    Add relevant update from c-h.

    Change-Id: Ic8f8742261b773484687985aa0a366391cd2737a
    Closes-Bug: #1839300
    (cherry picked from commit ee709a5ab30f285ecc1dd3ddb998af970f22e17e)

James Page (james-page)
Changed in charm-masakari:
status: Triaged → Invalid
Changed in charm-pacemaker-remote:
status: Triaged → Invalid
Changed in charm-hacluster:
status: Triaged → Invalid
Changed in charm-nova-cloud-controller:
status: Triaged → Invalid