Concurrent VM creation fails because of vif plug timeout

Bug #2020328 reported by Dongcan Ye
This bug affects 1 person
Affects: neutron | Status: Expired | Importance: Undecided | Assigned to: Unassigned

Bug Description

Environment:
OpenStack Zed
OpenvSwitch 2.17
OVN 22.06
ML2/OVN driver

Instances are created with the following command:
# openstack server create test_vm --flavor XXXXXX --network XXXXXX --image XXXXXX --min 10 --max 10 --availability-zone nova:XXXXXX

Typically at least 2 or 3 of the instances fail to create because of a vif plugging timeout error.
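
For reference, the failed instances and their fault messages can be inspected with something like (<server-id> is a placeholder):

# openstack server list --status ERROR
# openstack server show <server-id> -c fault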

VM port UUID: 6d5eccfa-069e-4058-a1c8-87bec9c1c280

Here is some relevant log info:

1. ovsdb server log:
record 554: 2023-05-19 10:48:42.399
  table Interface insert row "tap6d5eccfa-06" (8d7bdf97):
  table Open_vSwitch row 1a4db534 (1a4db534) diff:
  table Bridge row "br-int" (ee9ab5d4) diff:
  table Port insert row "tap6d5eccfa-06" (a4be79ec):

record 555: 2023-05-19 10:48:42.411
  table Interface row "tap6d5eccfa-06" (8d7bdf97) diff:
  table Open_vSwitch row 1a4db534 (1a4db534) diff:

record 556: 2023-05-19 10:48:42.909
  table Interface row "tap6d5eccfa-06" (8d7bdf97) diff:

record 557: 2023-05-19 10:48:42.941 "ovs-vsctl (invoked by init (pid 0)): ovs-vsctl del-port tap6d5eccfa-06"
  table Open_vSwitch row 1a4db534 (1a4db534) diff:
  table Interface row "tap6d5eccfa-06" (8d7bdf97) diff:
    delete row
  table Bridge row "br-int" (ee9ab5d4) diff:
  table Port row "tap6d5eccfa-06" (a4be79ec) diff:
    delete row
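
(For reference, OVSDB transaction records in this form can be dumped with something like the following, assuming the default database path, and then searched for the tap name:)

# ovsdb-tool show-log -m /etc/openvswitch/conf.db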

2. ovs-vswitchd log:
2023-05-19T10:48:42.400Z|13247|jsonrpc|DBG|unix:/run/openvswitch/db.sock: received notification, method="update3", params=[["monid","Open_vSwitch"],"00000000-0000-0000-0000-000000000000",{"Open_vSwitch":{"1a4db534-3133-43e0-922e-dd7d5d76a802":{"modify":{"next_cfg":135}}},"Interface":{"8d7bdf97-1f91-4ef3-9b27-b2a9e7c362e2":{"insert":{"name":"tap6d5eccfa-06"}}},"Port":{"a4be79ec-df95-495b-87e1-fc373582f647":{"insert":{"name":"tap6d5eccfa-06","interfaces":["uuid","8d7bdf97-1f91-4ef3-9b27-b2a9e7c362e2"]}}},"Bridge":{"ee9ab5d4-b72b-479b-87cd-29a19d5540a8":{"modify":{"ports":["uuid","a4be79ec-df95-495b-87e1-fc373582f647"]}}}}]
2023-05-19T10:48:42.410Z|13248|bridge|WARN|could not open network device tap6d5eccfa-06 (No such device)
2023-05-19T10:48:42.411Z|13249|jsonrpc|DBG|unix:/run/openvswitch/db.sock: send request, method="transact", params=["Open_vSwitch",{"row":{"cur_cfg":135},"where":[["_uuid","==",["uuid","1a4db534-3133-43e0-922e-dd7d5d76a802"]]],"table":"Open_vSwitch","op":"update"},{"row":{"error":"could not open network device tap6d5eccfa-06 (No such device)","ofport":-1},"where":[["_uuid","==",["uuid","8d7bdf97-1f91-4ef3-9b27-b2a9e7c362e2"]]],"table":"Interface","op":"update"},{"op":"assert","lock":"ovs_vswitchd"}], id=4787
2023-05-19T10:48:42.412Z|13250|poll_loop|DBG|wakeup due to [POLLIN] on fd 13 (<->/run/openvswitch/db.sock) at ../lib/stream-fd.c:157 (4% CPU usage)
2023-05-19T10:48:42.412Z|13251|jsonrpc|DBG|unix:/run/openvswitch/db.sock: received notification, method="update3", params=[["monid","Open_vSwitch"],"00000000-0000-0000-0000-000000000000",{"Interface":{"8d7bdf97-1f91-4ef3-9b27-b2a9e7c362e2":{"modify":{"error":"could not open network device tap6d5eccfa-06 (No such device)","ofport":-1}}},"Open_vSwitch":{"1a4db534-3133-43e0-922e-dd7d5d76a802":{"modify":{"cur_cfg":135}}}}]

2023-05-19T10:48:42.907Z|13260|vconn|DBG|unix#50: sent (Success): OFPT_PORT_STATUS (OF1.5) (xid=0x0): ADD: 57(tap6d5eccfa-06): addr:fe:16:3e:dc:bc:b8
     config: 0
     state: 0
     current: 10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
2023-05-19T10:48:42.907Z|13261|bridge|INFO|bridge br-int: added interface tap6d5eccfa-06 on port 57
2023-05-19T10:48:42.908Z|13262|hmap|DBG|Dropped 74 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
2023-05-19T10:48:42.908Z|13263|hmap|DBG|../ofproto/ofproto-dpif-xlate.c:884: 1 bucket with 6+ nodes, including 1 bucket with 7 nodes (16 nodes total across 16 buckets)
2023-05-19T10:48:42.908Z|13264|dpif_netlink|DBG|port_changed: dpif:system@ovs-system vport:tap6d5eccfa-06 cmd:1
2023-05-19T10:48:42.908Z|13265|vconn|DBG|unix#50: sent (Success): OFPT_PORT_STATUS (OF1.5) (xid=0x0): MOD: 57(tap6d5eccfa-06): addr:fe:16:3e:dc:bc:b8
     config: 0
     state: LIVE
     current: 10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max

2023-05-19T10:48:42.971Z|13289|bridge|INFO|bridge br-int: deleted interface tap6d5eccfa-06 on port 57
2023-05-19T10:48:42.972Z|04671|poll_loop(revalidator62)|DBG|wakeup due to [POLLIN] on fd 43 (FIFO pipe:[154647]) at ../lib/ovs-thread.c:378 (0% CPU usage)
2023-05-19T10:48:42.972Z|04651|poll_loop(revalidator63)|DBG|wakeup due to [POLLIN] on fd 45 (FIFO pipe:[154648]) at ../lib/ovs-thread.c:378 (0% CPU usage)
2023-05-19T10:48:42.975Z|04672|poll_loop(revalidator62)|DBG|wakeup due to [POLLIN] on fd 43 (FIFO pipe:[154647]) at ../lib/ovs-thread.c:378 (0% CPU usage)
2023-05-19T10:48:42.975Z|13290|dpif|DBG|system@ovs-system: failed to query port tap6d5eccfa-06: No such device
2023-05-19T10:48:42.982Z|07648|poll_loop(revalidator61)|DBG|wakeup due to [POLLIN] on fd 41 (FIFO pipe:[155716]) at ../lib/ovs-thread.c:378 (10% CPU usage)
2023-05-19T10:48:42.984Z|04719|poll_loop(revalidator65)|DBG|wakeup due to [POLLIN] on fd 49 (FIFO pipe:[149813]) at ../lib/ovs-thread.c:378 (0% CPU usage)
2023-05-19T10:48:42.984Z|04589|poll_loop(revalidator64)|DBG|wakeup due to [POLLIN] on fd 47 (FIFO pipe:[154649]) at ../lib/ovs-thread.c:378 (1% CPU usage)
2023-05-19T10:48:42.985Z|13291|poll_loop|DBG|wakeup due to [POLLIN] on fd 56 (FIFO pipe:[136293]) at ../lib/ovs-rcu.c:236 (2% CPU usage)
2023-05-19T10:48:42.999Z|13292|poll_loop|DBG|wakeup due to [POLLIN] on fd 56 (FIFO pipe:[136293]) at ../lib/ovs-rcu.c:236 (2% CPU usage)
2023-05-19T10:48:42.999Z|13293|dpif_netlink|DBG|port_changed: dpif:system@ovs-system vport:tap6d5eccfa-06 cmd:2
2023-05-19T10:48:42.999Z|13294|dpif|DBG|system@ovs-system: failed to query port tap6d5eccfa-06: No such device
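
The "No such device" errors suggest the kernel tap device disappeared underneath OVS. A quick way to cross-check on the affected compute node (a sketch, using the device name from the logs above) is:

# ip link show tap6d5eccfa-06
# ovs-vsctl --columns=name,error,ofport list Interface tap6d5eccfa-06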

3. nova-compute log:
2023-05-19 18:48:42.416 7 INFO os_vif [None req-6f1ad586-b66d-4fb6-8603-ba96af2af311 bb05576c972b49b4ac598cb33cf151f2 af89a2f86b204b54af86df7e5bb6a841 - - default default] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:dc:bc:b8,bridge_name='br-int',has_traffic_filtering=True,id=6d5eccfa-069e-4058-a1c8-87bec9c1c280,network=Network(19771f61-1003-46a3-8850-001b32947f66),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap6d5eccfa-06')

2023-05-19 18:58:45.827 7 WARNING nova.compute.manager [None req-6f1ad586-b66d-4fb6-8603-ba96af2af311 bb05576c972b49b4ac598cb33cf151f2 af89a2f86b204b54af86df7e5bb6a841 - - default default] [instance: 1cfbb245-c807-48e9-90eb-2c4f77e7a7b3] Timeout waiting for ['network-vif-plugged-6d5eccfa-069e-4058-a1c8-87bec9c1c280'] for instance with vm_state building and task_state spawning. Event states are: network-vif-plugged-6d5eccfa-069e-4058-a1c8-87bec9c1c280: timed out after 600.00 seconds: eventlet.timeout.Timeout: 600 seconds

It seems that OVS successfully adds the tap device (tap6d5eccfa-06) to br-int, then quickly deletes it.
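
(For reference, one way to catch this add/delete race live while reproducing is to monitor the relevant OVSDB table on the compute node, e.g.:)

# ovsdb-client monitor Open_vSwitch Interface name,error,ofport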

Nova waits for the network-vif-plugged event, but the tap device is no longer present in the br-int bridge, so instance creation fails after 600 seconds (we raised the default timeout from 300s).
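
For reference, the timeout in question is controlled by nova.conf options along these lines (values as used here; the upstream default for vif_plugging_timeout is 300):

[DEFAULT]
vif_plugging_is_fatal = True
vif_plugging_timeout = 600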

Another Launchpad bug reports the same problem: https://bugs.launchpad.net/neutron/+bug/2005111

After switching from the ML2/OVN driver to the ML2/OVS driver, concurrent VM creation also fails with a vif plug timeout.

yatin (yatinkarel) wrote :

Hi Dongcan Ye,

Let's focus on the ML2/OVN case. Can you add more details for analysis:
- Environment details:
  - Number of controllers and computes
  - Deployment tool used
  - Neutron version
  - Was it a fresh install, or updated/upgraded from another OpenStack version? Has the issue been seen since the beginning of the deployment?
  - Does the issue happen on a specific compute node, or does the same compute node see both successes and failures within one batch of instances?
- Enable neutron debug logs
- Reproduce the issue
- Share neutron server debug logs from all controller nodes
- Share nova-compute, ovs-vswitchd and ovn-controller logs from the impacted compute node
- Share the OVN Northbound and Southbound DBs from the controller node
- Output of ovs-vsctl list open . from the compute node
- Output of openstack network agent list
- Output of openstack compute service list
  (exact invocations for these last three are shown after this list)
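
For reference, the exact invocations for the last three items would be:

# ovs-vsctl list open .
# openstack network agent list
# openstack compute service list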

Changed in neutron:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired