Instance created with a flat network spawns in error state

Bug #1835965 reported by Paulina Flores
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: YaoLe

Bug Description

Title
-----
Instance created with a flat network spawns in error state

Brief Description
-----------------
A flat network can only be created once the stx-openstack application is reapplied to include the new datanetwork. After creating the flat network and attempting to spawn an instance on it, the instance goes into error state.

Severity
--------
Minor

Steps to Reproduce
------------------
1. Create a flat datanetwork
2. Reapply stx-openstack application
3. Create a flat network connected to the flat datanetwork (see the CLI sketch after these steps)
4. Assign the flat datanetwork to the data interface on the target compute (compute-1)
5. Spawn two instances connected to this flat network hosted on the target compute
6. Ping between both instances
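
For reference, steps 3, 5, and 6 roughly correspond to openstack CLI commands along these lines (a sketch: the phy-flat datanetwork name, the flat-net/flat-subnet names, and the 192.168.101.0/24 range are illustrative; the image and flavor names are taken from the server list later in this report):

   openstack network create --provider-network-type flat --provider-physical-network phy-flat flat-net
   openstack subnet create --network flat-net --subnet-range 192.168.101.0/24 flat-subnet
   openstack server create --image vm_image --flavor vm_flavor --network flat-net --availability-zone nova:compute-1 flat_vm
   openstack server create --image vm_image --flavor vm_flavor --network flat-net --availability-zone nova:compute-1 flat_vm_2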

Expected Behavior
------------------
The flat network is created without problems, and both VMs are active and pingable.

Actual Behavior
----------------
VMs spawn in error state. Fault message:

{u'message': u'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 74a10574-be51-4995-ab64-ec9a20d6605f.', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 610, in build_instances\n raise exception.MaxRetriesExceeded(reason=msg)\n', u'created': u'2019-07-09T14:47:20Z'}

When attempting to create a VM with the same flavour and image but on a vlan network, the VM is spawned as active and running.
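
One way to retrieve this fault after the failure (a quick check, using the instance ID from the fault message above):

   openstack server show 74a10574-be51-4995-ab64-ec9a20d6605f -c status -c fault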

System Configuration
--------------------
System mode: Standard 2+2 on Baremetal

Reproducibility
---------------
100%

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20190705T013000Z"
BUILD_DATE="2019-07-05 01:30:00 +0000"

Timestamp/Logs
--------------
Attaching sysinv log

Last time install passed
------------------------
n/a

Revision history for this message
Paulina Flores (paulina-flores) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Requesting input from Chenjie on this. The VM launch error is similar to another issue he's looking at.

tags: added: stx.networking
Changed in starlingx:
assignee: nobody → ChenjieXu (midone)
Revision history for this message
ChenjieXu (midone) wrote :

Hi Ghada,

This bug is a different bug from the following bug:
https://bugs.launchpad.net/starlingx/+bug/1835965

Nova will try every available host to create the instance. The following message just indicates that Nova could not create the instance after trying every available host:
   Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 74a10574-be51-4995-ab64-ec9a20d6605f.'

Revision history for this message
ChenjieXu (midone) wrote :

Hi Paulina,

There are some problems with the steps used to create the flat network. Could you please try the following commands?
   system host-lock compute-0
   system host-lock compute-1
   system datanetwork-add phy-flat flat
   system host-if-modify -m 1500 -n data0 -c data compute-0 ${DATA0IFUUID}
   system host-if-modify -m 1500 -n data0 -c data compute-1 ${DATA1IFUUID}
   system interface-datanetwork-assign compute-0 ${DATA0IFUUID} phy-flat
   system interface-datanetwork-assign compute-1 ${DATA1IFUUID} phy-flat
   system host-unlock compute-0
   system host-unlock compute-1
   The interfaces ${DATA0IFUUID} of compute-0 and ${DATA1IFUUID} of compute-1 should be connected physically (e.g., by wire or through a switch).
   After hosts compute-0 and compute-1 have been restarted, make sure the application has been re-applied successfully:
   system application-list
   Create a flat network connected to the flat datanetwork
   Create the subnet
   Spawn two instances connected to this flat network hosted on the target compute
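
A quick sanity check after the unlock and re-apply could look like the following (a sketch, assuming the stx-openstack pods run in the openstack namespace, as shown later in this thread):

   system datanetwork-list
   system interface-datanetwork-list compute-0
   system interface-datanetwork-list compute-1
   kubectl get pods -n openstack | grep neutron-ovs-agent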

Revision history for this message
Paulina Flores (paulina-flores) wrote :

Hi Chenjie,

I tried the steps you provided, yet both of my instances are still in error state:

controller-0:~$ openstack server list
+--------------------------------------+-----------+--------+----------+----------+-----------+
| ID                                   | Name      | Status | Networks | Image    | Flavor    |
+--------------------------------------+-----------+--------+----------+----------+-----------+
| 61fc1ed7-3312-464d-a09f-a03f9df9dba0 | flat_vm_2 | ERROR  |          | vm_image | vm_flavor |
| 480475a0-9b57-4189-b422-e56597755a24 | flat_vm   | ERROR  |          | vm_image | vm_flavor |
+--------------------------------------+-----------+--------+----------+----------+-----------+

Both of them show "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance" as the fault. I also tried spawning an instance connected to a vlan network, but it is in error state with the same fault message as well.

Revision history for this message
ChenjieXu (midone) wrote :

Hi Paulina,

It seems that your StarlingX is not running correctly because you can't create an instance connected to a vlan network. Could you please collect logs for debugging? Or you can simply redeploy StarlingX and then retest this bug.

Revision history for this message
Paulina Flores (paulina-flores) wrote :

Hi Chenjie,

I collected the sysinv.log here, but please let me know if you need something else instead. We'll also be reinstalling the baremetal system soon, and I'll let you know as soon as I can try again.

YaoLe (yaole)
Changed in starlingx:
assignee: ChenjieXu (midone) → YaoLe (yaole)
Revision history for this message
ChenjieXu (midone) wrote :

Hi Paulina,

Could you please collect the following log?
   kubectl get pod -n openstack > containers

Could you please collect the logs from the following Pods?
   kubectl get pod -n openstack | grep Running | grep nova
   kubectl get pod -n openstack | grep Running | grep neutron
You can get a Pod's log with the following command:
   kubectl logs -n openstack $container_name
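
If there are many pods, a small loop can dump each log into its own file (a sketch, assuming the namespace and naming shown above):

   for pod in $(kubectl get pods -n openstack --field-selector=status.phase=Running -o name | grep -E 'nova|neutron'); do
      # $pod is of the form pod/<name>; strip the prefix for the file name
      kubectl logs -n openstack "$pod" > "$(basename "$pod").log"
   done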

Revision history for this message
Paulina Flores (paulina-flores) wrote :

Hi Chenjie,

I'll be attaching the pod logs in a .rar file; meanwhile, here's the containers log:

controller-0:~$ cat containers
NAME READY STATUS RESTARTS AGE
aodh-api-7d745fb854-8q4l9 1/1 Running 0 26m
aodh-api-7d745fb854-d4c7z 1/1 Running 0 26m
aodh-db-init-lbcks 0/1 Completed 0 26m
aodh-db-sync-nrhf8 0/1 Completed 0 26m
aodh-evaluator-66c6746944-8wn82 1/1 Running 0 26m
aodh-evaluator-66c6746944-xbpcn 1/1 Running 0 26m
aodh-ks-endpoints-hv6fd 0/3 Completed 0 26m
aodh-ks-service-hshtc 0/1 Completed 0 26m
aodh-ks-user-7fq25 0/1 Completed 0 26m
aodh-listener-6f8dcf79f9-hwwhj 1/1 Running 0 26m
aodh-listener-6f8dcf79f9-rcb5q 1/1 Running 0 26m
aodh-notifier-c987ffd6f-nnr54 1/1 Running 0 26m
aodh-notifier-c987ffd6f-wwpb9 1/1 Running 0 26m
aodh-rabbit-init-vhtbb 0/1 Completed 0 26m
barbican-api-696994d6-5dkfs 1/1 Running 0 43m
barbican-api-696994d6-xl8wp 1/1 Running 0 43m
barbican-db-init-ndm6b 0/1 Completed 0 43m
barbican-db-sync-gbtch 0/1 Completed 0 43m
barbican-ks-endpoints-gsh7x 0/3 Completed 0 43m
barbican-ks-service-6g96j 0/1 Completed 0 43m
barbican-ks-user-zn8ph 0/1 Completed 0 43m
barbican-rabbit-init-c9xl9 0/1 Completed 0 43m
ceilometer-central-86b4665889-gn2sn 1/1 Running 0 22m
ceilometer-central-86b4665889-j8fgd 1/1 Running 0 22m
ceilometer-compute-nfgb4 1/1 Running 0 22m
ceilometer-compute-nglp5 1/1 Running 0 22m
ceilometer-db-sync-dtdfx 0/1 Completed 0 22m
ceilometer-ks-service-vc7kl 0/1 Completed 0 22m
ceilometer-ks-user-nthxk 0/1 Completed 0 22m
ceilometer-notification-899494475-cdznc 1/1 Running 0 22m
ceilometer-rabbit-init-kjwtl 0/1 Completed 0 22m
ceph-ks-endpoints-m64gj ...

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Adding the stx.2.0 tag given that VMs should launch successfully with a flat network. If this turns out to be a procedural issue (wrong steps) or a config issue (lab setup/L2 problem), the bug can be marked as Invalid.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2.0
Revision history for this message
ChenjieXu (midone) wrote :

Hi Paulina,

Your neutron ovs agents are down, as shown below. Please reinstall StarlingX, or remove, delete, and re-apply stx-openstack to make sure your ovs agents are running correctly.
neutron-ovs-agent-compute-0-75ea0372-p5dwg 0/1 CrashLoopBackOff 11 38m
neutron-ovs-agent-compute-1-eae26dba-kkrw9 0/1 CrashLoopBackOff 11 38m
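
The remove/delete/apply cycle referred to above would be along these lines (a sketch; the tarball path is a placeholder):

   system application-remove stx-openstack
   system application-delete stx-openstack
   system application-upload <path-to-stx-openstack-tarball>
   system application-apply stx-openstack
   system application-list
   kubectl get pods -n openstack | grep neutron-ovs-agent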

Revision history for this message
Paulina Flores (paulina-flores) wrote :

Hi Chenjie,

I've been trying to reapply the application, even removing and deleting it, and both computes' ovs agents continue appearing as CrashLoopBackOff. Interestingly, if I revert the flat datanetwork assignment to the data0 interfaces, the pods boot up correctly.

Revision history for this message
ChenjieXu (midone) wrote :

Hi Paulina,

It seems that the flat datanetwork is not configured correctly. Could you please post the commands you used to create the flat datanetwork and assign it to the data0 interface?

Could you please collect the following logs?
system helm-override-show stx-openstack neutron openstack > helm-override
sudo ovs-vsctl show > bridges
kubectl describe pod -n openstack $OVS_AGENT > describe_ovs_agent
kubectl logs -n openstack $OVS_AGENT > ovs_agent_log
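
$OVS_AGENT above is the name of one of the crashing agent pods; one way to capture it (a sketch, based on the pod naming shown earlier in this thread; repeat for compute-1 as needed):

   OVS_AGENT=$(kubectl get pods -n openstack | awk '/neutron-ovs-agent-compute-0/ {print $1}')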

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Raising the priority to high as this is a basic configuration which should be working. As already noted, please mark as Invalid once it is confirmed that this is an issue with the steps.

Changed in starlingx:
importance: Medium → High
Revision history for this message
Paulina Flores (paulina-flores) wrote :

Hi Chenjie,

I've been following the steps you provided on July 11th with no variation each time. When the flat datanetwork is created, it doesn't ask for anything else, either; maybe there's something to consider there?

I'm attaching the logs you asked for in this message. The command "ovs-vsctl" returned nothing, though; I think it's no longer supported, since I checked with other configurations and they were also unable to get anything from that command. However, checking the helm-override under the ovs bridge_mappings section shows this:

bridge_mappings: physnet0:br-phy0,phy-flat:br-phy0,physnet1:br-phy2,

Revision history for this message
Paulina Flores (paulina-flores) wrote :

Adding on to my last comment, I managed to get the bridges to show up on each compute. I'm attaching both here. Please excuse me.

Revision history for this message
ChenjieXu (midone) wrote :

Hi Paulina,

The helm-override below is wrong and may cause the ovs agent to crash:
   bridge_mappings: physnet0:br-phy0,phy-flat:br-phy0,physnet1:br-phy2,
It should look like the following:
   bridge_mappings: physnet0:br-phy0,phy-flat:br-phy1,physnet1:br-phy2,

It seems you bound physnet0 and phy-flat to the same interface. Is that correct? Before assigning the interface to phy-flat, did you remove the old assignment from that interface, like the following?
   system interface-datanetwork-remove $UUID

Could you please execute below commands and post the results here?
   system interface-datanetwork-list compute-0
   system interface-datanetwork-list compute-1
   system host-if-list -a compute-0
   system host-if-list -a compute-1

Revision history for this message
YaoLe (yaole) wrote :

This bug cannot be reproduced with STX-aio-simplex 20190705T013000Z in a virtual environment. The bridge_mappings in the ovs agent's /etc/neutron/plugins/ml2/openvswitch_agent.ini is the following:
bridge_mappings = physnet0:br-phy0,physnet1:br-phy1,
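
For reference, the rendered file can also be checked inside a running agent pod with something like this (a sketch, using the pod naming shown earlier in this thread):

   kubectl exec -n openstack <neutron-ovs-agent-pod> -- grep bridge_mappings /etc/neutron/plugins/ml2/openvswitch_agent.ini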

Revision history for this message
Paulina Flores (paulina-flores) wrote :

Hi Chenjie,

Looks like after removing the physnet0 assignment from both computes' data0 interfaces and reassigning them to the new flat datanetwork, the neutron-ovs-agents initialized correctly and I was finally able to create the VMs attached to the flat network without problems.

Here's the full list of steps taken this time around:

   system host-lock compute-0
   system host-lock compute-1
   system datanetwork-add phy-flat flat
   system host-if-modify -m 1500 -n data0 -c data compute-0 ${DATA0IFUUID}
   system host-if-modify -m 1500 -n data0 -c data compute-1 ${DATA1IFUUID}
   system interface-datanetwork-remove ${PHYSNETDATA0UUID} (both computes)
   system interface-datanetwork-assign compute-0 ${DATA0IFUUID} phy-flat
   system interface-datanetwork-assign compute-1 ${DATA1IFUUID} phy-flat
   system host-unlock compute-0
   system host-unlock compute-1
   The interfaces ${DATA0IFUUID} of compute-0 and ${DATA1IFUUID} of compute-1 should be connected physically (e.g., by wire or through a switch).
   Once the hosts are back online, make sure the application has been re-applied successfully:
   system application-list
   Check that the neutron OVS agents are initialized correctly:
   kubectl -n openstack get pod | grep neutron
   Create a flat network connected to the flat datanetwork
   Create the subnet
   Spawn two instances connected to this flat network hosted on the target compute
   Ping between them.
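
To double-check the corrected mapping discussed above, the helm-override can be inspected again (a sketch, reusing the command from earlier in this thread):

   system helm-override-show stx-openstack neutron openstack | grep bridge_mappings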

Seeing as the network is set up correctly, the application reapplies successfully, the bridges are updated, the VMs boot up without a problem, and I can ping between them, I'll be marking this test as a pass. Thanks a lot for your help and patience.

Revision history for this message
ChenjieXu (midone) wrote :

Hi Paulina,

You are welcome!

Revision history for this message
ChenjieXu (midone) wrote :

As confirmed by Paulina, this bug is an issue with the steps, so it can be marked as Invalid.

YaoLe (yaole)
Changed in starlingx:
status: Triaged → Invalid
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@YaoLe, can you please document the correct steps on the networking wiki:
https://wiki.openstack.org/wiki/StarlingX/Networking#Useful_Networking_Commands

Revision history for this message
YaoLe (yaole) wrote :

I have documented the correct steps on the networking wiki:
https://wiki.openstack.org/wiki/StarlingX/Networking#Useful_Networking_Commands
