"nova.exception.PortBindingFailed: Binding failed" for OpenStack Zed in Juju deployment

Bug #2017494 reported by Thomas Dreibholz
This bug affects 1 person
Affects            Status   Importance  Assigned to  Milestone
charm-ovn-central  Invalid  Undecided   Unassigned
neutron (Ubuntu)   Invalid  Undecided   Unassigned
ovn (Ubuntu)       Invalid  Undecided   Unassigned

Bug Description

I made an OpenStack deployment with Juju, as documented in the Charm deployment guide (https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/install-juju.html). The setup consists of 8 nodes. The deployment itself is successful; Dashboard, Glance, etc. are running. But when trying to instantiate a VM, the instantiation fails with "nova.exception.PortBindingFailed: Binding failed <UUID>, please check neutron logs for more information." (in the Nova log, i.e. /var/log/nova/nova-compute.log).

There is the hint to check "neutron logs", but there is actually no useful information there.

So, I checked the configuration first:

neutron.yaml for deployment of "neutron-api" and "ovn-chassis":
ovn-chassis:
  debug: true
  bridge-interface-mappings: >-
    br-simulamet:<MAC_NODE_1>
    br-simulamet:<MAC_NODE_2>
    br-simulamet:...
    ...
  ovn-bridge-mappings: physnet2:br-simulamet

neutron-api:
  verbose: true
  enable-ml2-port-security: true
  neutron-security-groups: true
  enable-vlan-trunking: false

  vlan-ranges: physnet2
  flat-network-providers:

This looks okay. The network interfaces are mapped into the bridge "br-simulamet", and it actually exists on all nodes, e.g.:
root@P52S11:/var/log# ovs-vsctl get open . external_ids:ovn-bridge-mappings
"physnet2:br-simulamet"

The network/subnet configuration in OpenStack should also be okay, e.g.:
network create smil-network4 --external --provider-network-type vlan --provider-physical-network physnet2 --provider-segment 0204 --share
subnet create smil-network4-ipv4 --network smil-network4 --ip-version 4 --description "VLAN0204-SMIL-Network4" --subnet-range 10.193.4.0/24 --no-dhcp --allocation-pool start=10.193.4.200,end=10.193.4.254

So, the network should correctly map to "physnet2", with a VLAN tag (here: 204).

(For debugging, I also tried to use the network interface as "flat network" without VLANs. This does not change anything.)

The deployment (from "juju status") also looks okay for Neutron and OVN:
...
neutron-api               21.0.0   active  1  neutron-api             zed/stable    546  no  Unit is ready
neutron-api-mysql-router  8.0.32   active  1  mysql-router            8.0/stable     35  no  Unit is ready
neutron-api-plugin-ovn    21.0.0   active  1  neutron-api-plugin-ovn  zed/stable     45  no  Unit is ready
nova-cloud-controller     26.1.0   active  1  nova-cloud-controller   zed/stable    633  no  Unit is ready
...
ovn-central               22.09.0  active  3  ovn-central             22.09/stable   75  no  Unit is ready (leader: ovnsb_db)
ovn-chassis               22.09.1  active  8  ovn-chassis             22.09/stable  109  no  Unit is ready
...

/var/log/ovn/ovn-controller.log does not provide useful information about the port binding failure, even after enabling "debug = true" in /etc/neutron/ovn.ini and restarting the services. Also, increasing the OVN log level did not reveal more information here, i.e.:
ovn-appctl vlog/set dbg
ovn-appctl vlog/disable-rate-limit

Increasing the Open vSwitch log level also did not provide more insight, i.e.:
ovs-appctl vlog/set dbg
ovs-appctl vlog/disable-rate-limit

So, maybe the issue is related to some component around OVN? One strange thing I noticed: there are two "ovsdb-server" processes running, each with a "--log-file" parameter, referring to /var/log/ovn/ovsdb-server-nb.log and /var/log/ovn/ovsdb-server-sb.log respectively:

root@P52S11:/var/log/openvswitch# ps ax | grep ovn
 129278 ? Ssl 4:58 ovn-northd -vconsole:emer -vsyslog:err -vfile:info --ovnnb-db=ssl:172.31.255.116:6641,ssl:172.31.255.115:6641,ssl:172.31.255.114:6641 --ovnsb-db=ssl:172.31.255.116:16642,ssl:172.31.255.115:16642,ssl:172.31.255.114:16642 -c /etc/ovn/cert_host -C /etc/ovn/ovn-central.crt -p /etc/ovn/key_host --no-chdir --log-file=/var/log/ovn/ovn-northd.log --pidfile=/var/run/ovn/ovn-northd.pid --detach
 130048 ? Ssl 34:42 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-nb.log --remote=punix:/var/run/ovn/ovnnb_db.sock --pidfile=/var/run/ovn/ovnnb_db.pid --unixctl=/var/run/ovn/ovnnb_db.ctl --remote=db:OVN_Northbound,NB_Global,connections --private-key=/etc/ovn/key_host --certificate=/etc/ovn/cert_host --ca-cert=/etc/ovn/ovn-central.crt --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers /var/lib/ovn/ovnnb_db.db
 130251 ? Ssl 47:13 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-sb.log --remote=punix:/var/run/ovn/ovnsb_db.sock --pidfile=/var/run/ovn/ovnsb_db.pid --unixctl=/var/run/ovn/ovnsb_db.ctl --remote=db:OVN_Southbound,SB_Global,connections --private-key=/etc/ovn/key_host --certificate=/etc/ovn/cert_host --ca-cert=/etc/ovn/ovn-central.crt --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers /var/lib/ovn/ovnsb_db.db

The logs are in containers, checking them:

/var/snap/lxd/common/lxd/storage-pools/default/containers/juju-3083dc-2-lxd-1/rootfs/var/log/ovn/ovsdb-server-sb.log:
...
2023-04-24T08:30:50.962Z|21781|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading
2023-04-24T08:30:50.962Z|21782|jsonrpc|WARN|ssl:127.0.0.1:35594: receive error: Protocol error
2023-04-24T08:30:50.962Z|21783|reconnect|WARN|ssl:127.0.0.1:35594: connection dropped (Protocol error)
2023-04-24T08:35:00.857Z|21784|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading
2023-04-24T08:35:00.857Z|21785|jsonrpc|WARN|ssl:127.0.0.1:34324: receive error: Protocol error
2023-04-24T08:35:00.857Z|21786|reconnect|WARN|ssl:127.0.0.1:34324: connection dropped (Protocol error)

/var/snap/lxd/common/lxd/storage-pools/default/containers/juju-3083dc-2-lxd-1/rootfs/var/log/ovn/ovsdb-server-nb.log:
...
2023-04-24T08:30:50.960Z|22445|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading
2023-04-24T08:30:50.960Z|22446|jsonrpc|WARN|ssl:127.0.0.1:48028: receive error: Protocol error
2023-04-24T08:30:50.960Z|22447|reconnect|WARN|ssl:127.0.0.1:48028: connection dropped (Protocol error)
2023-04-24T08:35:00.855Z|22448|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading
2023-04-24T08:35:00.855Z|22449|jsonrpc|WARN|ssl:127.0.0.1:52444: receive error: Protocol error
2023-04-24T08:35:00.855Z|22450|reconnect|WARN|ssl:127.0.0.1:52444: connection dropped (Protocol error)

The containers belong to the deployment of "ovn-central", so I assume something is wrong here.

The issue appears on all 8 nodes I have set up. So, it is reproducible. I can provide log files, etc. on request.

Could this issue be a bug in an OpenStack package (maybe ovn-central?), a problem with the Juju charms for deploying OpenStack Zed, or some issue with the setup?

Tags: openstack
Thomas Dreibholz (dreibh) wrote :
tags: added: openstack
Thomas Dreibholz (dreibh) wrote :

Some further debugging: I entered the instance container for ovn-central/0, i.e.:
juju ssh ovn-central/0

In /etc/ovn/ovn-northd-db-params.conf, I found the TCP and SSL parameters for the OVN NB and SB databases, which I used for running checks (based on https://numans.blog/2018/01/05/debugging-ovn-external-connectivity-part-1/), in my case:
sudo ovn-nbctl -c /etc/ovn/cert_host -C /etc/ovn/ovn-central.crt -p /etc/ovn/key_host --db=ssl:172.31.255.114:6641,ssl:172.31.255.115:6641,ssl:172.31.255.116:6641 show
sudo ovn-sbctl -c /etc/ovn/cert_host -C /etc/ovn/ovn-central.crt -p /etc/ovn/key_host --db=ssl:172.31.255.114:16642,ssl:172.31.255.115:16642,ssl:172.31.255.116:16642 show

Connecting to the DBs works.

SB lists all my 8 nodes, i.e.:
...
Chassis P52S11.maas
    hostname: P52S11.maas
    Encap geneve
        ip: "172.31.255.100"
        options: {csum="true"}
...
This seems to look okay.

NB seems to even list my test VM's port, i.e.:
switch b2e5de93-e19a-458d-af63-e80d44629614 (neutron-97c5c0a1-5c29-4fc4-852b-ce818c972a6d) (aka smil-network4)
    port ec87abb0-b261-49fd-8e95-9cefdf32798e (aka Port-warrnambool.fire.smil)
        addresses: ["unknown"]
...
But "addresses" only contains "unknown".

Destroying the failed instance leads to removing the port, a new trial leads to creating a new one. So, I assume that at least some communication with the OVN system is working.

It seems that something goes wrong somewhere after creating this port, but without any information in one of the log files. Is there any hint for where to look for further debugging?

Linux (linux3276fan) wrote :

Hi Thomas,

Any luck resolving this issue? I also have a similar setup with 8 nodes, and when I try to create an instance on OpenStack I get the same error message. I did all the tests which you mentioned, and the issue I am having is exactly the same. Any advice will be appreciated. Thank you.

Thomas Dreibholz (dreibh) wrote :

I finally gave up debugging the issue, and tried a full new deployment. Then it succeeded. So, I assume something went wrong silently in the deployment scripts during the first deployment.

Linux (linux3276fan) wrote :

I tried to redeploy twice and am still stuck with the same issue. I will try to redeploy again. Thank you for your reply.

Alex Kavanagh (ajkavanagh) wrote :

> /var/snap/lxd/common/lxd/storage-pools/default/containers/juju-3083dc-2-lxd-1/rootfs/var/log/ovn/ovsdb-server-sb.log:
...
2023-04-24T08:30:50.962Z|21781|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading
2023-04-24T08:30:50.962Z|21782|jsonrpc|WARN|ssl:127.0.0.1:35594: receive error: Protocol error
2023-04-24T08:30:50.962Z|21783|reconnect|WARN|ssl:127.0.0.1:35594: connection dropped (Protocol error)
2023-04-24T08:35:00.857Z|21784|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading
2023-04-24T08:35:00.857Z|21785|jsonrpc|WARN|ssl:127.0.0.1:34324: receive error: Protocol error
2023-04-24T08:35:00.857Z|21786|reconnect|WARN|ssl:127.0.0.1:34324: connection dropped (Protocol error)

This bit looks a bit like an SSL connection attempt to a non-SSL socket, i.e. the protocols aren't matching. Is the 'server' side running TLS, and does it have valid certificates? As it's a juju model, could you provide a juju-status.txt (it shows the versions of everything), and maybe a crashdump? (https://github.com/juju/juju-crashdump).

Linux (linux3276fan) wrote :

Hi Alex, I see exactly the same SSL errors in the logs. I attached the juju-status.txt file. I will try to get a juju-crashdump.

Linux (linux3276fan) wrote :
  • tmp (36.0 MiB, application/octet-stream)

I attached the ovn-central crashdump. Thank you.

Alex Kavanagh (ajkavanagh) wrote :

The only thing that catches my eye is that there is a software version mismatch between ovn-chassis and ovn-central:

(from juju status):

App          Version  Status  Scale  Charm        Channel       Rev  Exposed  Message
ovn-central  22.09.1  active      3  ovn-central  22.09/stable   75  no       Unit is ready (leader: ovnsb_db northd: active)
ovn-chassis  22.03.0  active      8  ovn-chassis  22.09/stable  109  no       Unit is ready

i.e. ovn-central is running 22.09.1 and ovn-chassis is running 22.03.0.

I don't actually know if this is a problem (off the top of my head), but it would probably be good to align the versions to 22.09; note this may be a red herring. Have you got a copy of the model (bundle) file that you used to deploy this, please?
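
The comparison above can also be scripted when checking several models. This is only a minimal sketch: it assumes the plain-text application rows of "juju status" as quoted above (whitespace-separated, application name first, version second), and the function names are hypothetical helpers, not part of Juju.

```python
# Hypothetical helpers; they only parse text in the layout quoted above.
KNOWN_STATUSES = {"active", "blocked", "waiting", "maintenance", "error", "unknown"}


def parse_app_versions(status_text):
    """Map application name -> version from `juju status` application rows."""
    versions = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Expected layout: App, Version, Status, Scale, ...
        # Rows without a version column (or the header row) are skipped.
        if len(fields) >= 4 and fields[2] in KNOWN_STATUSES:
            versions[fields[0]] = fields[1]
    return versions


def major_minor(version):
    """Return the first two dotted components, e.g. '22.09.1' -> '22.09'."""
    return ".".join(version.split(".")[:2])


def ovn_versions_aligned(versions):
    """True if ovn-central and ovn-chassis are on the same major.minor series."""
    return major_minor(versions["ovn-central"]) == major_minor(versions["ovn-chassis"])


status = """\
App Version Status Scale Charm Channel Rev Exposed Message
ovn-central 22.09.1 active 3 ovn-central 22.09/stable 75 no Unit is ready (leader: ovnsb_db northd: active)
ovn-chassis 22.03.0 active 8 ovn-chassis 22.09/stable 109 no Unit is ready
"""
versions = parse_app_versions(status)
print(versions, ovn_versions_aligned(versions))
```

Comparing only the major.minor series is a design choice for this sketch: patch-level differences such as 22.09.0 vs 22.09.1 are not flagged.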

Linux (linux3276fan) wrote :

I am not using a bundle. I am deploying the individual applications following the documentation below.

https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/install-openstack.html

juju deploy -n 3 --to lxd:0,lxd:1,lxd:2 --channel 22.09/stable ovn-central
The neutron-api application will be containerised on machine 1:

juju deploy --to lxd:1 --channel zed/stable --config neutron.yaml neutron-api
Deploy the subordinate charm applications:

juju deploy --channel zed/stable neutron-api-plugin-ovn
juju deploy --channel 22.09/stable --config neutron.yaml ovn-chassis

Linux (linux3276fan) wrote :

@Thomas

Are you still seeing the SSL errors after you redeployed?

> /var/snap/lxd/common/lxd/storage-pools/default/containers/juju-3083dc-2-lxd-1/rootfs/var/log/ovn/ovsdb-server-sb.log:
...
2023-04-24T08:30:50.962Z|21781|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading
2023-04-24T08:30:50.962Z|21782|jsonrpc|WARN|ssl:127.0.0.1:35594: receive error: Protocol error
2023-04-24T08:30:50.962Z|21783|reconnect|WARN|ssl:127.0.0.1:35594: connection dropped (Protocol error)
2023-04-24T08:35:00.857Z|21784|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading
2023-04-24T08:35:00.857Z|21785|jsonrpc|WARN|ssl:127.0.0.1:34324: receive error: Protocol error
2023-04-24T08:35:00.857Z|21786|reconnect|WARN|ssl:127.0.0.1:34324: connection dropped (Protocol error)

Alex Kavanagh (ajkavanagh) wrote :

@linux3276fan At the risk of doing "have you tried turning it off and on again" ... please could you try restarting the ovn-central services (ovn-central) on each of the 3 ovn-central units.

Linux (linux3276fan) wrote :

@Alex I restarted and rebooted all 3 ovn-central units, and I still see the same error message. I need to check how to upgrade ovn-chassis from version 22.03.0 to match ovn-central 22.09.

...skipping...
● ovn-ovsdb-server-sb.service - Open vSwitch database server for OVN Southbound database
     Loaded: loaded (/lib/systemd/system/ovn-ovsdb-server-sb.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2023-06-11 22:29:04 UTC; 2 days ago
   Main PID: 35001 (ovsdb-server)
      Tasks: 3 (limit: 314572)
     Memory: 7.2M
        CPU: 5min 32.259s
     CGroup: /system.slice/ovn-ovsdb-server-sb.service
             └─35001 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-sb.log --remote=punix:/var/run/ovn/ovnsb_db.sock --pidfile=>

Jun 14 14:56:53 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02444|reconnect|INFO|ssl:10.69.212.15:6644: connecting...
Jun 14 14:56:54 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02445|reconnect|INFO|ssl:10.69.212.15:6644: connection attempt timed out
Jun 14 14:56:54 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02446|reconnect|INFO|ssl:10.69.212.15:6644: waiting 2 seconds before reconnect
Jun 14 14:56:55 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02447|raft|INFO|ssl:10.69.212.15:39008: learned server ID 6e42
Jun 14 14:56:55 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02448|raft|INFO|ssl:10.69.212.15:39008: learned remote address ssl:10.69.212.15:6644
Jun 14 14:56:56 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02449|reconnect|INFO|ssl:10.69.212.15:6644: connecting...
Jun 14 14:56:56 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02450|reconnect|INFO|ssl:10.69.212.15:6644: connected
Jun 14 14:57:43 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02451|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading
Jun 14 14:57:43 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02452|jsonrpc|WARN|ssl:127.0.0.1:46464: receive error: Protocol error
Jun 14 14:57:43 juju-24ab8f-0-lxd-1 ovsdb-server[35001]: ovs|02453|reconnect|WARN|ssl:127.0.0.1:46464: connection dropped (Protocol error)

Alex Kavanagh (ajkavanagh) wrote :

@linux3276fan - The ovn-chassis charm is a subordinate (i.e. it doesn't have its 'own' machine), and thus the OVN packages on a unit come from the package source of the principal charm it is related to. The principal charm should be running jammy-zed, and jammy-zed should provide OVN 22.09.

e.g. rmadison for zed shows:

 ovn | 22.09.1-0ubuntu0.22.10.1~cloud0 | zed | jammy-updates | source
 ovn | 22.09.1-0ubuntu0.22.10.1~cloud0 | zed-proposed | jammy-proposed | source

So looking at the juju status for nova-compute (ovn-chassis' parent) and nova-cloud-controller:

nova-cloud-controller  26.1.1  active  1  nova-cloud-controller  zed/stable  655  no  Unit is ready
nova-compute           25.1.1  active  8  nova-compute           zed/stable  661  no  Unit is ready

i.e. nova-compute is running 25.1.1 which is yoga (https://releases.openstack.org/yoga/) whereas the controller is running zed. Somehow, nova-compute is misconfigured or has a bug. Please could you check the output of the config for nova-compute:

juju config nova-compute | less

and see what the openstack-origin is set to?
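
The version-to-release reasoning above can be sketched as a small lookup. This is an illustration only: the mapping is deliberately partial (Nova major versions 24 to 27, per https://releases.openstack.org/), and the function names are hypothetical.

```python
# Partial mapping of Nova major version -> OpenStack release name.
# Extend it from releases.openstack.org if other versions matter.
NOVA_MAJOR_TO_RELEASE = {
    24: "xena",
    25: "yoga",
    26: "zed",
    27: "2023.1",
}


def nova_release(version):
    """Return the OpenStack release for a Nova version string like '25.1.1'."""
    major = int(version.split(".")[0])
    return NOVA_MAJOR_TO_RELEASE.get(major, "unknown")


def same_release(*versions):
    """True if all given Nova versions belong to the same OpenStack release."""
    return len({nova_release(v) for v in versions}) == 1


# Versions taken from the juju status rows quoted above.
print(nova_release("26.1.1"), nova_release("25.1.1"), same_release("26.1.1", "25.1.1"))
```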

Linux (linux3276fan) wrote :

@Alex, thanks for your reply. OpenStack nova-compute is set to zed in the output of "juju config nova-compute". I attached the output file. Thank you.

openstack-origin:
    default: zed

Below is the deployment configuration file used for nova-compute:

cat nova-compute.yaml
nova-compute:
  config-flags: default_ephemeral_format=ext4
  enable-live-migration: true
  enable-resize: true
  migration-auth-type: ssh
  virt-type: qemu
  openstack-origin: distro

Linux (linux3276fan) wrote :

I will delete the model and retry a full deployment to see if I have any luck. Thank you.

Alex Kavanagh (ajkavanagh) wrote :

Okay, the issue with the software versions is in the nova-compute.yaml file; specifically, the "openstack-origin: distro" is overriding the default of "zed".

If you do a:

$ juju config nova-compute openstack-origin=zed action-managed-upgrade=false

then the nova-compute charm will upgrade the nova application to 'zed'. That will at least rule out software version incompatibility. Note the default for action-managed-upgrade is false, but this just makes sure (in case it's been changed).
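
To catch this kind of stale override before deploying, a config file can be checked against the expected charm default. The following is a rough sketch under stated assumptions: it only handles the simple "key: value" layout of the nova-compute.yaml quoted earlier in this thread (a real YAML parser would be more robust), and the helper names are hypothetical.

```python
def find_option(config_text, option):
    """Return the value of `option` from simple 'key: value' lines, or None."""
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") or ":" not in stripped:
            continue
        # Split on the first colon only, so values may contain ':' or '='.
        key, _, value = stripped.partition(":")
        if key.strip() == option:
            return value.strip() or None
    return None


def overrides_default(config_text, option, charm_default):
    """True if the file sets `option` to something other than the charm default."""
    value = find_option(config_text, option)
    return value is not None and value != charm_default


# Content of the nova-compute.yaml posted earlier in this thread.
NOVA_COMPUTE_YAML = """\
nova-compute:
  config-flags: default_ephemeral_format=ext4
  enable-live-migration: true
  enable-resize: true
  migration-auth-type: ssh
  virt-type: qemu
  openstack-origin: distro
"""

print(find_option(NOVA_COMPUTE_YAML, "openstack-origin"))
print(overrides_default(NOVA_COMPUTE_YAML, "openstack-origin", "zed"))
```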

Linux (linux3276fan) wrote :

@Alex Kavanagh (ajkavanagh)

I redeployed OpenStack successfully and can launch VMs. I can now see that both ovn-central and ovn-chassis are the same version. The SSL error messages are still there but are not causing any issues. I think what happened is that I was using a nova-compute.yaml file from 3 months ago; the new version for zed does not have the "openstack-origin: distro" entry, and this caused the issue, as you pointed out. Thank you for your support, it is very much appreciated.

Alex Kavanagh (ajkavanagh) wrote :

@linux3276fan - Good news that everything is working.

Having done some more digging, I think that the SSL errors are almost certainly related to (or due to): https://github.com/openssl/openssl/issues/18866 - as such it's annoying, but not actually causing any issues other than filling the logs with nonsense. Hopefully, this will get sorted out eventually.
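
Since these warnings bury anything else in the log, it can help to filter them out when reading ovsdb-server-sb.log or ovsdb-server-nb.log. The sketch below is a heuristic, not an official tool: it assumes the "timestamp|seq|module|LEVEL|message" line layout seen in the logs quoted in this thread, and it treats the loopback "Protocol error" lines that follow each SSL_accept warning as part of the same noise. All function names are hypothetical.

```python
def parse_vlog_line(line):
    """Split 'timestamp|seq|module|LEVEL|message' into a dict, or return None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, seq, module, level, message = parts
    return {
        "timestamp": timestamp,
        "seq": int(seq),
        "module": module,
        "level": level,
        "message": message,
    }


def is_known_ssl_noise(entry):
    """Heuristic match for the SSL_accept EOF warning and its follow-up lines."""
    message = entry["message"]
    if entry["module"] == "stream_ssl":
        return "unexpected eof while reading" in message
    if entry["module"] in ("jsonrpc", "reconnect"):
        # Assumption: only loopback clients are treated as noise here.
        return message.startswith("ssl:127.0.0.1:") and "Protocol error" in message
    return False


def remaining_warnings(lines):
    """Return parsed WARN/ERR entries that are not the known SSL noise."""
    kept = []
    for line in lines:
        entry = parse_vlog_line(line)
        if entry and entry["level"] in ("WARN", "ERR") and not is_known_ssl_noise(entry):
            kept.append(entry)
    return kept


# Lines copied from the ovsdb-server-sb.log excerpt in the bug description.
sample = [
    "2023-04-24T08:30:50.962Z|21781|stream_ssl|WARN|SSL_accept: error:0A000126:SSL routines::unexpected eof while reading",
    "2023-04-24T08:30:50.962Z|21782|jsonrpc|WARN|ssl:127.0.0.1:35594: receive error: Protocol error",
    "2023-04-24T08:30:50.962Z|21783|reconnect|WARN|ssl:127.0.0.1:35594: connection dropped (Protocol error)",
]
print(len(remaining_warnings(sample)))
```

Anything this filter leaves behind is worth a closer look; anything it hides should still be checked once by hand before trusting the heuristic.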

As for this bug, I'm going to set to incomplete, as both you and @dreibh have resolutions - perhaps in different ways. As "incomplete", if there are no further reports the bug will expire, but still be searchable, etc. If you get the issue again, please feel free to re-open the bug.

Thanks.

Changed in charm-ovn-central:
status: New → Invalid
Changed in neutron (Ubuntu):
status: New → Invalid
Changed in ovn (Ubuntu):
status: New → Invalid
Linux (linux3276fan) wrote :

@Alex Thank you
