sunbeam launch failed with port binding errors

Bug #2023931 reported by Hemanth Nakkina
Affects: OpenStack Snap
Status: New
Importance: Medium
Assigned to: Hemanth Nakkina
Milestone: (none)

Bug Description

sunbeam launch failed with port binding error

nova-compute log:
Jun 13 21:55:37 dev openstack-hypervisor.nova-compute[538724]: nova.exception.PortBindingFailed: Binding failed for port 56e88a88-6832-4783-b91d-ad8b5e0a7103, please check neutron logs for more information.

neutron server log:
2023-06-14 16:10:42.899 56 ERROR neutron.plugins.ml2.managers [req-8c0a8453-0749-4bc4-a0da-246022c66180 req-a9bf51ab-535a-4c24-ae17-f23ef8bb9304 63a4f9b6c11c42a8b627cf71d279bfc8 79e7828473a74ad28fc28040aa2e0338 - - 0795e7024e5e49f1a957a669cc4552f6 0795e7024e5e49f1a957a669cc4552f6] Failed to bind port 0452926d-f609-4211-a608-ffc36403499b on host dev.internal.cloudapp.net for vnic_type normal using segments [{'id': '5f6ba483-42f1-4104-9b97-d610cdbfe78f', 'network_type': 'geneve', 'physical_network': None, 'segmentation_id': 1184, 'network_id': '3c87e071-e1af-4ba9-b446-675e4eff92ae'}]

I saw the following warnings:
2023-06-13 21:55:36.443 54 WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-9174f292-1e12-4e6c-a26f-38f3a3d9d83d req-00b772fb-a896-4d55-b0e7-5c18539d99d5 63a4f9b6c11c42a8b627cf71d279bfc8 79e7828473a74ad28fc28040aa2e0338 - - 0795e7024e5e49f1a957a669cc4552f6 0795e7024e5e49f1a957a669cc4552f6] Refusing to bind port 56e88a88-6832-4783-b91d-ad8b5e0a7103 due to no OVN chassis for host: dev.internal.cloudapp.net

The above warning clearly shows that the OVN chassis for the host does not exist. It seems there is a difference between the hostname as perceived by nova-compute and the one perceived by ovn-controller.

More information:

python3 -c "import socket; print(socket.getfqdn())"
dev.internal.cloudapp.net

azureuser@dev:~$ openstack hypervisor list
+--------------------------------------+---------------------------------------------------------+-----------------+--------------+-------+
| ID                                   | Hypervisor Hostname                                       | Hypervisor Type | Host IP      | State |
+--------------------------------------+---------------------------------------------------------+-----------------+--------------+-------+
| 2c41448c-ab3d-454c-a4b8-a822ad522ab5 | dev.3pmbhi1rcrau3nnvk2nd1bwztb.ax.internal.cloudapp.net  | QEMU            | x.x.x.x      | up    |
+--------------------------------------+---------------------------------------------------------+-----------------+--------------+-------+

azureuser@dev:~$ hostname -f
dev.3pmbhi1rcrau3nnvk2nd1bwztb.ax.internal.cloudapp.net

azureuser@dev:~$ sunbeam cluster list
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Node                      ┃ Status ┃ Control ┃ Compute ┃ Storage ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ dev.internal.cloudapp.net │ up     │ x       │ x       │         │
└───────────────────────────┴────────┴─────────┴─────────┴─────────┘

azureuser@dev:~$ sudo snap get openstack-hypervisor node
Key              Value
node.fqdn        dev.internal.cloudapp.net
node.ip-address  x.x.x.x

getfqdn() --> returns dev.internal.cloudapp.net
hostname -f --> returns dev.3pmbhi1rcrau3nnvk2nd1bwztb.ax.internal.cloudapp.net

And I see the following message where <HOSTNAME> and <NODENAME> differ:
Jun 13 21:47:47 dev nova-compute[538724]: 2023-06-13 21:47:47.330 538724 INFO nova.compute.resource_tracker [None req-5eff8636-991c-401f-9d96-e2c8b29144f0 - - - - - -] Compute node record created for dev.internal.cloudapp.net:dev.3pmbhi1rcrau3nnvk2nd1bwztb.ax.internal.cloudapp.net with uuid: 2c41448c-ab3d-454c-a4b8-a822ad522ab5

The FQDN detection used in sunbeam should perform some additional checks to avoid these kinds of situations.
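
For illustration, a minimal sketch of the kind of check that could be added, comparing the two FQDN sources that disagree in this report (socket.getfqdn() versus `hostname -f`). This is not sunbeam's actual detection code:

#!/usr/bin/env python3
# Illustrative pre-flight check only, not sunbeam code: warn when the two
# FQDN sources shown in this report disagree on a node.
import socket
import subprocess

py_fqdn = socket.getfqdn()
sys_fqdn = subprocess.run(["hostname", "-f"], capture_output=True,
                          text=True, check=True).stdout.strip()

if py_fqdn != sys_fqdn:
    print(f"WARNING: socket.getfqdn() -> {py_fqdn!r} but `hostname -f` -> {sys_fqdn!r}; "
          "nova-compute and ovn-controller may register under different names.")
else:
    print(f"FQDN is consistent: {py_fqdn}")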

Revision history for this message
James Page (james-page) wrote :

The snap explicitly sets the internal host in nova, but not in neutron.

So nova gets socket.getfqdn() whereas neutron defaults to socket.gethostname().

Revision history for this message
James Page (james-page) wrote :

Actually that's foobar - OVN is set to socket.getfqdn(), as is the host configuration in nova.conf.

So they should match.

Revision history for this message
James Page (james-page) wrote :

Can you check what the host key is set to in:

/var/snap/openstack-hypervisor/common/etc/nova/nova.conf

please

Changed in snap-openstack:
status: New → Incomplete
importance: Undecided → Medium
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

nova.conf has the following configuration for the host key:
host = dev.internal.cloudapp.net

From the log below, in <host>:<nodename>, the host name is populated from the host key in the conf (via sunbeam) and the nodename (hypervisor_hostname) is populated from libvirt host info (libvirt uses gethostname, I guess). And there lies the discrepancy.

Jun 13 21:47:47 dev nova-compute[538724]: 2023-06-13 21:47:47.330 538724 INFO nova.compute.resource_tracker [None req-5eff8636-991c-401f-9d96-e2c8b29144f0 - - - - - -] Compute node record created for dev.internal.cloudapp.net:dev.3pmbhi1rcrau3nnvk2nd1bwztb.ax.internal.cloudapp.net with uuid: 2c41448c-ab3d-454c-a4b8-a822ad522ab5
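
For illustration, a rough sketch of how to spot this <host>:<nodename> discrepancy on a node. The nova.conf path is the one mentioned earlier in this thread, and gethostname() is only an approximation of what libvirt reports:

#!/usr/bin/env python3
# Rough sketch, not sunbeam or nova code: compare the host key written to
# nova.conf with the name libvirt will likely report for the node.
import configparser
import socket

NOVA_CONF = "/var/snap/openstack-hypervisor/common/etc/nova/nova.conf"

cp = configparser.ConfigParser(interpolation=None, strict=False)
cp.read(NOVA_CONF)
conf_host = cp.get("DEFAULT", "host", fallback=None)  # set by sunbeam
libvirt_name = socket.gethostname()                   # approximation of libvirt host info

if conf_host and conf_host != libvirt_name:
    print(f"host in nova.conf ({conf_host}) != gethostname() ({libvirt_name}): "
          "expect a mismatched compute node record.")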

Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :
Changed in snap-openstack:
assignee: nobody → Hemanth Nakkina (hemanth-n)
Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :

I'm running into a very similar issue.

ubuntu@crustle:~$ source demo-openrc
ubuntu@crustle:~$ python3 -c "import socket; print(socket.getfqdn())"
crustle.segmaas.1ss
ubuntu@crustle:~$ openstack hypervisor list
HttpException: 403: Client Error for url: http://10.20.21.11/openstack-nova/v2.1/os-hypervisors/detail, Policy doesn't allow os_compute_api:os-hypervisors:list-detail to be performed.
ubuntu@crustle:~$ hostname -f
crustle.segmaas.1ss
ubuntu@crustle:~$ sunbeam cluster list
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Node                ┃ Status ┃ Control ┃ Compute ┃ Storage ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ crustle.segmaas.1ss │ up     │ x       │ x       │         │
└─────────────────────┴────────┴─────────┴─────────┴─────────┘
ubuntu@crustle:~$ sudo snap get openstack-hypervisor node
Key              Value
node.fqdn        crustle.segmaas.1ss
node.ip-address  10.230.57.128
ubuntu@crustle:~$

Attaching the full nova-compute logs from journalctl: nova logs at https://pastebin.canonical.com/p/3WvYC3dX9T/ , some logs from the charm container at https://pastebin.canonical.com/p/wBtCTxFJ5k/ , and neutron-server logs at https://pastebin.canonical.com/p/6PnfTKPG5P/

https://pastebin.canonical.com/p/xdk4knBGj2/ has the relevant journalctl logs from when the VM creation was attempted using a command like:

openstack server create --flavor m1.small --image "ubuntu" testvm-nikhil

Or even this approach.

ubuntu@crustle:~$ sunbeam cluster list
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Node                ┃ Status ┃ Control ┃ Compute ┃ Storage ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ crustle.segmaas.1ss │ up     │ x       │ x       │         │
└─────────────────────┴────────┴─────────┴─────────┴─────────┘
ubuntu@crustle:~$ sunbeam launch ubuntu --name test
Launching an OpenStack instance ...
⠦ Creating the OpenStack instance ... Instance creation request failed: Server:4dee2c40-13a6-4b21-8ffc-867d4ed02a77 transitioned to failure state ERROR
Error: Unable to request new instance. Please run `sunbeam configure` first.
ubuntu@crustle:~$ sunbeam openrc > admin-openrc
ubuntu@crustle:~$ sunbeam configure --accept-defaults --openrc admin-openrc
Writing openrc to admin-openrc ... done
ubuntu@crustle:~$ sunbeam launch ubuntu --name test
Launching an OpenStack instance ...
Found sunbeam key in OpenStack!
⠸ Creating the OpenStack instance ... Instance creation request failed: Server:c0f5a495-1af3-40b4-b583-dbdc27fc5393 transitioned to failure state ERROR
Error: Unable to request new instance. Please run `sunbeam configure` first.
ubuntu@crustle:~$ sunbeam configure
Local or remote access to VMs [local/remote] (local):
CIDR of OpenStack external network - arbitrary but must not be in use (10.20.20.0/24):
Populate OpenStack cloud with demo user, default images, flavors etc [y/n] (y):
Username to use for access to OpenStack (demo):
Password to use for access to OpenStack (T3********):
Network range to use for project network (192.168.122.0/24):
List of nameservers guests should use for DNS resolution (10.230.56.2):
Enable ping and SSH access to instances? [y/n]...


Revision history for this message
nikhil kshirsagar (nkshirsagar) wrote :
Revision history for this message
Jake Nabasny (slapcat) wrote :

I ran into this bug with sunbeam deployed in a VM. Parts of the configuration during bootstrap used the reverse DNS name for the host machine instead of the FQDN set on the VM:

""""""""""""
# cat /var/snap/openstack-hypervisor/common/etc/nova/nova.conf | grep -i host
host = syn-172-100-xx-xx.res.spectrum.com

# python3 -c "import socket; print(socket.getfqdn())"
syn-172-100-xx-xx.res.spectrum.com

# sudo snap get openstack-hypervisor node
Key Value
node.fqdn syn-172-100-xx-xx.res.spectrum.com
node.ip-address 10.162.57.152
""""""""""""

alanbach and I found the following workaround to correct the misconfiguration in nova/neutron/ovn, but it is quite involved. If someone runs into this on a new cluster, it would be easier to fix the hostname resolution issues and then redeploy.

=== Workaround ===

1. Fix hostname resolution so forward and reverse lookups return the expected values (see the verification sketch after these steps):

# echo "10.162.57.152 sunbeam.nabasny.com" >> /etc/hosts
# snap set openstack-hypervisor node.fqdn=sunbeam.nabasny.com

2. Update the hostname on openstack-hypervisor (run on the sunbeam machine):
# openstack-hypervisor.ovs-vsctl set open_vswitch . external_ids:hostname=$(hostname)

3. Update the OVN Gateway Chassis. First find the right LRP UUID for the gateway chassis:
# kubectl exec -it -n openstack neutron-0 -c neutron-server -- /bin/bash

root@neutron-0:/# neutron-ovn-db-sync-util --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --ovn-neutron_sync_mode repair
...
2024-05-30 18:21:51.718 46 WARNING neutron.scheduler.l3_ovn_scheduler [-] Gateway lrp-9ada90ed-1d9f-4946-8823-ba54ce007464 was not scheduled on any chassis, no candidates are available
...

Then set it from ovn-central:
# kubectl exec -it -n openstack ovn-central-0 -c ovn-nb-db-server -- /bin/bash

root@ovn-central-0:/# ovn-nbctl lrp-set-gateway-chassis lrp-ed3c9158-e33c-4910-b8dd-b6ec667a33be <FQDN> 2

You can verify it is set correctly with:
root@ovn-central-0:/# ovn-nbctl find Gateway_Chassis

4. Run the neutron-ovn-db-sync-util in neutron-0 again. This time it should complete successfully:
# kubectl exec -it -n openstack neutron-0 -c neutron-server -- /bin/bash

root@neutron-0:/# neutron-ovn-db-sync-util --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --ovn-neutron_sync_mode repair

5. Update all mentions of the incorrect hostname(s) in nova-mysql-0. First get the password:

juju run nova-mysql/0 get-password

Then login and update the relevant entries (pay attention to the specific id on the row you are updating):

# kubectl exec -it -n openstack nova-mysql-0 -c mysql -- mysql -uroot -p<password>

mysql> use nova;
mysql> select * from compute_nodes;
mysql> update compute_nodes set host='<hostname>' where id='<id>';
mysql> select * from services;
mysql> update services set host='<hostname>' where id='<id>';
mysql> use nova_api;
mysql> select * from host_mappings;
mysql> update host_mappings set host='<hostname>' where id='<id>';
mysql> quit;

6. Restart nova-0 and the nova-compute service:

# kubectl delete pod -n openstack nova-0
# systemctl restart snap.openstack-hypervisor.nova-compute

7. ...
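
For step 1 above, a minimal forward/reverse lookup sanity check (illustrative only; the FQDN and IP are the example values from this comment):

#!/usr/bin/env python3
# Illustrative only: forward and reverse DNS lookups for the node should both
# agree with the FQDN the openstack-hypervisor snap is configured with.
import socket

fqdn = "sunbeam.nabasny.com"   # example FQDN from this comment
ip = "10.162.57.152"           # example IP from this comment

forward = socket.gethostbyname(fqdn)     # FQDN -> IP
reverse = socket.gethostbyaddr(ip)[0]    # IP -> primary name

print(f"forward: {fqdn} -> {forward}")
print(f"reverse: {ip} -> {reverse}")
if forward != ip or reverse != fqdn:
    print("WARNING: forward/reverse lookups are inconsistent; fix /etc/hosts "
          "or DNS before redeploying.")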


Frode Nordahl (fnordahl)
Changed in snap-openstack:
status: Incomplete → New
Revision history for this message
Frode Nordahl (fnordahl) wrote :

I'm hitting this issue too. To me it appears to be due to the `external_ids:hostname` and `external_ids:system-id` keys not matching up with the expected value.

Failure mode:
$ sunbeam launch ubuntu --name test
Launching an OpenStack instance ...
Found sunbeam key in OpenStack!
⠴ Creating the OpenStack instance ... Instance creation request failed: Server:b28747fe-5577-48b0-840b-e535a7277d5f transitioned to failure state ERROR
Error: Unable to request new instance. Please run `sunbeam configure` first.

nova.exception.RescheduledException: Build of instance b28747fe-5577-48b0-840b-e535a7277d5f was re-scheduled: Binding failed for port bc12efbc-3707-49b2-b222-abea4e73d41a, please check neutron logs for more information.

WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-03463dc2-74cd-4373-9914-04328ccad25e req-12e667e8-105f-47d9-ae9b-8a7c1119c6e5 6005b4f606c844b8ad5d00da7e2f35f4 1b6f18264dfc4a2a81b10a90e07f9421 - - 6b9540eac286435393ebcb2c88f2d6b9 6b9540eac286435393ebcb2c88f2d6b9] Refusing to bind port bc12efbc-3707-49b2-b222-abea4e73d41a due to no OVN chassis for host: sunbeam

# ovn-sbctl show
Chassis sunbeam
    hostname: sunbeam.lxd
    Encap geneve
        ip: "10.5.0.2"
        options: {csum="true"}
    Port_Binding cr-lrp-bec471e6-ec59-4144-acfe-76feae4c3acc

$ sudo openstack-hypervisor.ovs-vsctl list open-vswitch
...
external_ids : {hostname=sunbeam.lxd, system-id=sunbeam}

After aligning those values, i.e.:
$ sudo openstack-hypervisor.ovs-vsctl set open-vswitch . external_ids:hostname=sunbeam

The chassis record will automatically be corrected:
# ovn-sbctl show
Chassis sunbeam
    hostname: sunbeam
    Encap geneve
        ip: "10.5.0.2"
        options: {csum="true"}
    Port_Binding cr-lrp-bec471e6-ec59-4144-acfe-76feae4c3acc

And I can successfully launch an instance.
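
As a quick way to spot the mismatch described above before launching instances, a small sketch (illustrative only, not sunbeam code; assumes the snap-wrapped ovs-vsctl is on PATH and the script is run with sufficient privileges, e.g. via sudo):

#!/usr/bin/env python3
# Illustrative check: the OVN chassis name (external_ids:system-id) and the
# hostname recorded in OVS external_ids should line up, otherwise neutron may
# not find a chassis for the host during port binding.
import subprocess

def ovs_get(key: str) -> str:
    # ovs-vsctl prints string values quoted, so strip the quotes.
    out = subprocess.run(
        ["openstack-hypervisor.ovs-vsctl", "get", "open_vswitch", ".",
         f"external_ids:{key}"],
        capture_output=True, text=True, check=True).stdout.strip()
    return out.strip('"')

hostname = ovs_get("hostname")
system_id = ovs_get("system-id")

if hostname != system_id:
    print(f"external_ids:hostname ({hostname!r}) != external_ids:system-id "
          f"({system_id!r}); the OVN chassis may not match the host name "
          "neutron looks up during port binding.")
else:
    print(f"hostname and system-id agree: {hostname}")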

Revision history for this message
Frode Nordahl (fnordahl) wrote :

The plot thickens: when doing a multi-node deployment, the `openstack network agent list` output looks like this:
$ openstack network agent list
+--------------------------------------+----------------------+---------------+-------------------+-------+-------+-----------------------------+
| ID                                   | Agent Type           | Host          | Availability Zone | Alive | State | Binary                      |
+--------------------------------------+----------------------+---------------+-------------------+-------+-------+-----------------------------+
| 98d95d5f-2b37-5f72-ab0e-1226e238dff9 | OVN Metadata agent   | sunbeam-1     |                   | :-)   | UP    | neutron-ovn-metadata-agent  |
| sunbeam-1                            | OVN Controller agent | sunbeam-1     |                   | :-)   | UP    | ovn-controller              |
| 158be30a-4e15-5f66-bdad-8908319563b5 | OVN Metadata agent   | sunbeam-3.lxd |                   | :-)   | UP    | neutron-ovn-metadata-agent  |
| sunbeam-3                            | OVN Controller agent | sunbeam-3.lxd |                   | :-)   | UP    | ovn-controller              |
| f387d34b-4133-5f81-b010-e3d1804e069c | OVN Metadata agent   | sunbeam-2.lxd |                   | :-)   | UP    | neutron-ovn-metadata-agent  |
| sunbeam-2                            | OVN Controller agent | sunbeam-2.lxd |                   | :-)   | UP    | ovn-controller              |
+--------------------------------------+----------------------+---------------+-------------------+-------+-------+-----------------------------+

So perhaps this is a case of inconsistent handling of the first and subsequent nodes?
