migration_interface breaks cold migrations

Bug #1918734 reported by Alexander Diana
This bug affects 3 people
Affects: kolla-ansible
Status: Fix Released
Importance: Undecided
Assigned to: Gaël THEROND

Bug Description

**Bug Report**

What happened:

Configuring a migration_interface switches nova_ssh to listen on the new interface, but cold migrations (resize, etc.) still use the API interface to move the XML, which breaks cold migrations.

Live migrations use the right interface and still work as intended, so this issue was hard to notice at first and made it into our production.

What you expected to happen:

Since nova does not seem to have an option for this, nova_ssh should listen on both interfaces when an alternative migration_interface is configured.
This is a simplified version of the fix I saw when checking TripleO for the same issue (it happened there too).

How to reproduce it (minimal and precise):

1. Configure migration_interface to use a different network from the API interface.
2. Run openstack server resize.
3. nova logs a "connection refused" error on <api_interface_address>:8022 and rolls back the migration.
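
The connection refused error above can be checked without triggering a resize by probing the nova_ssh port on each address. This is a minimal sketch, assuming nova_ssh listens on TCP 8022 as in the log message; `port_open` is a hypothetical helper, not part of nova or kolla-ansible, and the addresses in the comment are placeholders.

```python
import socket


def port_open(host: str, port: int = 8022, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration only. 8022 is the port
    quoted in the bug report for nova_ssh.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


# Example usage (placeholder addresses, replace with your own):
#   port_open("<api_interface_address>")        # expected False when the bug is present
#   port_open("<migration_interface_address>")  # expected True
```

If the first probe fails and the second succeeds on a compute node, nova_ssh is only bound to the migration interface, which matches the behaviour described in this report.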

**Environment**:
* Kolla-Ansible version: stable/ussuri (looks to affect all versions)

Gaël THEROND (gtherond)
Changed in kolla-ansible:
assignee: nobody → Gaël THEROND (fl1nt)
status: New → Confirmed
Gaël THEROND (gtherond) wrote :

Hi Alexander, just to give you more insight on this one.

This is due to nova-compute (manager.py) calling this function during cold migration:
https://opendev.org/openstack/nova/src/commit/f5f7c2540150c7ee7640c834d5caec31b3f5a7ab/nova/utils.py#L109

Because you probably don't have any DNS resolution within your underlying infrastructure, nova is actually using the /etc/hosts file to resolve your host node name.

That in turn wrongly directs it to your internal API subnet, because the /etc/hosts file is populated here using the api_interface value:
https://opendev.org/openstack/kolla-ansible/src/commit/e744b9d510ba183d5a80b3e467d0e764eb5c9e02/ansible/roles/baremetal/tasks/pre-install.yml#L25

I'm having the same issue. However, before starting to fix it, I'll need to do some tests and have a few discussions with the team in order to validate that nova is the only service relying on these IPs.

For instance, changing the task to use the migration_interface variable instead of the current api_interface MIGHT have an impact on RabbitMQ.

Mark Goddard (mgoddard) wrote :

fl1nt, can you share where in manager.py the ssh_execute function is called? I don't see it.

Gaël THEROND (gtherond) wrote :
Mark Goddard (mgoddard) wrote :

ssh_execute isn't directly included in any of those methods. I expect it's somewhere behind an RPC call. The stack trace includes an IP:

ssh -o BatchMode=yes 172.16.22.106 mkdir -p /var/lib/nova/instances/6c78f418-f6a4-41f0-8414-2cd3db16568f

Gaël THEROND (gtherond) wrote :

Yes, exactly. There is an IP because it passes host as the context for the ssh_execute function. This host is then translated by Linux using the /etc/hosts file, which holds an incorrect IP. We have set migration_interface in the globals.yml file, a variable that we use in the libvirt/nova config template files. Here, however, the resolution doesn't come from a direct IP usage but from Python calling the Linux gethostbyip(host) function, so it ends up using the /etc/hosts file that we provision within the baremetal role with the api_internal subnet.

Gaël THEROND (gtherond) wrote :

The function called is gethostbyaddr(), sorry.

Gaël THEROND (gtherond) wrote :

No, in fact it's gethostbyname(), sorry, there are too many of them ^^
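
For readers untangling the two resolver calls named above, here is a minimal sketch using Python's standard `socket` module. The wrapper names `forward_lookup` and `reverse_lookup` are hypothetical and only label the direction of each lookup; this is not code from nova.

```python
import socket


def forward_lookup(name: str) -> str:
    """Host name -> IPv4 address via the system resolver.

    On a host without DNS this typically consults /etc/hosts.
    An IPv4 literal is returned unchanged.
    """
    return socket.gethostbyname(name)


def reverse_lookup(ip: str) -> str:
    """IP address -> primary host name via the system resolver."""
    hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    return hostname
```

Note that `forward_lookup("172.16.22.106")` simply returns the same address: when the caller already holds an IP, no /etc/hosts lookup changes the result.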

Gaël THEROND (gtherond) wrote :

OK, so after digging a bit more, it appears I have found out what's going on.

Because nova-compute registers with the nova DB using the my_ip directive from its configuration file, the nova service host_ip uses this internal_api_interface.

When doing a live migration, it doesn't crash because the migration is actually a direct migration between hosts.

When doing a cold migration (as in resizing instances), the flow is slightly different. In order for the ssh_execute() function to correctly create the ssh subprocess call, it uses a dest argument, and this argument comes from the compute_host table in the nova DB. Contrary to my first assumption, this argument isn't resolved by the OS, so /etc/hosts isn't used in this flow.

So I don't really know if that's something we can actually fix on our side.
For now I have used the same subnet/interface for both migration_interface and api_internal_interface in order to get my clusters working correctly.
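
To illustrate the cold-migration flow described above, here is a rough sketch of how an ssh command line like the one in the stack trace earlier in this thread is assembled from a dest value. `build_ssh_command` is a hypothetical helper written for this example; it is not nova's ssh_execute() implementation, and only the `-o BatchMode=yes` option seen in the trace is assumed.

```python
import shlex
from typing import List


def build_ssh_command(dest: str, *remote_cmd: str) -> List[str]:
    """Build an argv list for running remote_cmd on dest over ssh.

    Hypothetical illustration: dest is used as-is, so whatever address
    is stored for the destination host is the one ssh connects to.
    """
    return ["ssh", "-o", "BatchMode=yes", dest, *remote_cmd]


argv = build_ssh_command(
    "172.16.22.106",
    "mkdir", "-p",
    "/var/lib/nova/instances/6c78f418-f6a4-41f0-8414-2cd3db16568f",
)
# Render the argv list as a single shell-style string.
command_line = shlex.join(argv)
```

Because dest is passed straight through, the address nova has recorded for the destination host decides which interface the connection targets, regardless of where nova_ssh is listening.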

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)
Changed in kolla-ansible:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by "Gaël THEROND <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/788058
Reason: Already fixed by another patch.

Tom Fifield (fifieldt)
Changed in kolla-ansible:
status: In Progress → Fix Released