stx-openstack: Live-migration traffic going through wrong networks

Bug #2037330 reported by Thales Elero Cervi
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Thales Elero Cervi

Bug Description

Brief Description
-----------------
Since day 0, stx-openstack has not correctly configured the IP address used for live migrations.
It currently relies on default-gateway resolution, which is problematic because the result differs by node type: on AIO nodes it resolves to the oam-net IP, and on dedicated worker nodes it resolves to the mgmt-net IP.
The platform firewall blocks OAM traffic on ports that are not explicitly allowed.
In any case, this traffic should go through the cluster-host-net.
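
To see which local address a node ends up using when nothing is configured explicitly, one option is to ask the kernel which source IP it would pick for a given destination. The sketch below is only an illustration of route-based address selection; it is not taken from nova or from the stx-openstack plugins, and the helper name `source_ip_for` is hypothetical.

```python
import socket


def source_ip_for(destination: str, port: int = 9) -> str:
    """Return the local IP the kernel would use to reach `destination`.

    Connecting a UDP socket sends no packets. It only makes the kernel
    choose a route and a source address, which getsockname() then reports.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((destination, port))
        return sock.getsockname()[0]


if __name__ == "__main__":
    # Loopback is used here so the example runs anywhere; on a real node,
    # pass the peer host's address to see which network would be chosen.
    print(source_ip_for("127.0.0.1"))
```

Running this on an AIO node and on a dedicated worker node with the same off-subnet destination should show the difference described above, assuming the routing tables match the description.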

Severity
--------
Major: live migration does not work on AIO-DX systems

Steps to Reproduce
------------------
* Install stx and apply stx-openstack
* Launch a VM
* Try to live-migrate the VM

Expected Behavior
------------------
The VM live-migrates successfully (through the cluster-host-net)

Actual Behavior
----------------
Live-migration fails

Reproducibility
---------------
100% Reproducible

System Configuration
--------------------
AIO-DX

Branch/Pull Time/Commit
-----------------------
master and f/antelope branches

Last Pass
---------
stx.8.0

Timestamp/Logs
--------------
2023-08-31T04:22:22.895730116Z stdout F 2023-08-31 04:22:22.895 1043148 ERROR nova.virt.libvirt.driver [-] [instance: 3b763765-13a9-4503-aeb0-3e1036215896] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+tcp://1xy.xyz.abc.100/system: unable to connect to server at '1xy.xyz.abc.100:16509': Connection timed out: libvirt.libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+tcp://128.224.151.197/system: unable to connect to server at '1xy.xyz.abc.100:16509': Connection timed out
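
The error above is a TCP connection timeout to port 16509 on the destination host. A quick way to confirm the same symptom outside nova is a plain TCP reachability check from the source host. This is a minimal sketch; the function name `can_reach` and the timeout value are assumptions, and the host argument must be replaced with the address nova tried to use.

```python
import socket


def can_reach(host: str, port: int = 16509, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Timeouts, refused connections and unreachable networks all raise
    OSError (or a subclass), so they are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A False result for the oam-net address combined with a True result for the cluster-host address would be consistent with the firewall behavior described in this report.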

Test Activity
-------------
Regression Testing

Changed in starlingx:
assignee: nobody → Thales Elero Cervi (tcervi)
tags: added: stx.9.0 stx.distro.openstack
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/895730
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/310f677d295abff792168ee860beba8c52b1c2ab
Submitter: "Zuul (22348)"
Branch: master

commit 310f677d295abff792168ee860beba8c52b1c2ab
Author: Thales Elero Cervi <email address hidden>
Date: Mon Sep 18 16:00:28 2023 -0300

    Move live-migration traffic to cluster-host-net

    This change updates the application plugins to ensure that all
    libvirt/live-migration traffic goes through the
    cluster-host-network. Currently, most libvirt/live-migration
    addresses are resolved through INADDR_ANY (0.0.0.0), and the
    resulting route varies by node type: AIO nodes route to the
    oam-network, while Worker nodes route to the mgmt-network. Neither
    resolution is correct, since this traffic belongs on the
    cluster-host-network. In addition, the current platform firewall
    blocks traffic on oam-network ports that are not explicitly
    allowed.

    The goal is achieved by setting the following to the node's
    cluster-host IP:
    * libvirt listen_addr
    * nova.conf "live_migration_inbound_addr"

    Note that in the current version of the openstack-helm nova chart,
    nova-compute-init.sh has a problem with this use case, so an
    openstack-helm patch [1] was required to fix it.

    Code that was previously implemented only for the Nova plugin, and
    is now also required by the Libvirt plugin, was moved to the parent
    OpenStack class.

    [1] https://github.com/openstack/openstack-helm/commit/31be86079d711c698b2560b4bed654e23373a596

    TEST PLAN:
    PASS - Build stx-openstack application
    PASS - Apply the application to an AIO-DX system
    PASS - "$ sudo netstat -ltnp | grep <libvirtd pid>" to ensure that
           libvirtd is listening on the correct cluster-host-net IP
    PASS - Verify that the nova-compute.sh script was populated correctly
    PASS - Test a VM live-migration on the controller+worker node
    PASS - Verify that live_migration data in LibvirtLiveMigrateData has the
           correct cluster-host-net IP address in its "target_connect_addr"
    PASS - Apply the application to a Standard system
    PASS - "$ sudo netstat -ltnp | grep <libvirtd pid>" to ensure that
           libvirtd is listening on the correct cluster-host-net IP
    PASS - Verify that the nova-compute.sh script was populated correctly
    PASS - Test a VM live-migration on the worker node
    PASS - Verify that live_migration data in LibvirtLiveMigrateData has the
           correct cluster-host-net IP address in its "target_connect_addr"

    Closes-Bug: 2037330

    Signed-off-by: Thales Elero Cervi <email address hidden>
    Change-Id: I37db601e4b1b0e397a1b8dbdad1a293ff25c2e55
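
The commit message names two settings that receive the node's cluster-host IP: libvirt `listen_addr` and nova.conf `live_migration_inbound_addr`. The sketch below shows one way such per-host values could be assembled as a nested overrides dictionary. It is not the plugin code from the commit: the function name and the nesting of the dictionary keys are assumptions made for illustration, and only the two option names come from the commit message.

```python
import ipaddress


def live_migration_overrides(cluster_host_ip: str) -> dict:
    """Build illustrative overrides pinning libvirt and nova to one IP.

    The key layout below is hypothetical; only `listen_addr` and
    `live_migration_inbound_addr` are taken from the commit message.
    """
    # ip_address() raises ValueError for anything that is not a valid
    # IPv4 or IPv6 address, so bad input fails early.
    ip = str(ipaddress.ip_address(cluster_host_ip))
    return {
        "libvirt": {"conf": {"libvirt": {"listen_addr": ip}}},
        "nova": {
            "conf": {"nova": {"libvirt": {"live_migration_inbound_addr": ip}}}
        },
    }
```

Using the same validated value for both settings keeps the address libvirtd listens on and the address nova advertises to the source host in agreement, which is the property the fix depends on.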

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (f/antelope)
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (f/antelope)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/896539
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/190502fb6bd3361526a2500d2d83eaf48158954b
Submitter: "Zuul (22348)"
Branch: f/antelope

commit 190502fb6bd3361526a2500d2d83eaf48158954b
Author: Thales Elero Cervi <email address hidden>
Date: Mon Sep 18 16:00:28 2023 -0300

    Move live-migration traffic to cluster-host-net

    Closes-Bug: 2037330

    Signed-off-by: Thales Elero Cervi <email address hidden>
    Change-Id: I37db601e4b1b0e397a1b8dbdad1a293ff25c2e55
    (cherry picked from commit 310f677d295abff792168ee860beba8c52b1c2ab)
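
The test plan checks with `netstat -ltnp` that libvirtd listens on the cluster-host-net IP. That manual check could be scripted by parsing the command's output, as in the sketch below. The function name is hypothetical, the column layout assumed is the usual Linux net-tools one, and the sample address 192.0.2.10 is a documentation-range placeholder rather than a value from this report.

```python
def listening_addresses(netstat_output: str, program: str = "libvirtd") -> set:
    """Return the 'ip:port' local addresses on which `program` listens (TCP)."""
    addresses = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns:
        # Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "LISTEN":
            continue
        # The last column looks like "1234/libvirtd"; keep the name part.
        if fields[6].split("/", 1)[-1] == program:
            addresses.add(fields[3])
    return addresses


if __name__ == "__main__":
    sample = (
        "Proto Recv-Q Send-Q Local Address    Foreign Address  State   PID/Program name\n"
        "tcp        0      0 192.0.2.10:16509 0.0.0.0:*        LISTEN  1234/libvirtd\n"
    )
    print(listening_addresses(sample))
```

If the returned set contains `0.0.0.0:16509` instead of the cluster-host address, libvirtd is still bound to INADDR_ANY, which is the situation the fix is meant to remove.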

tags: added: in-f-antelope
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High