ML2 : hard reboot a VM after a compute crash

Bug #1282956 reported by Mathieu Rohon
This bug affects 3 people
Affects            Status        Importance  Assigned to        Milestone
neutron            Won't Fix     Medium      Unassigned
openstack-manuals  Fix Released  Undecided   Andreas Scheuring

Bug Description

I run a multi-node setup with ML2, L2 population, the Linuxbridge mechanism driver (MD), and the vxlan type driver.

I start two compute nodes, launch a VM, and shut down the compute node that hosts the VM.

I use this procedure to relaunch the VM on the other compute node:

http://docs.openstack.org/trunk/openstack-ops/content/maintenance.html#totle_compute_node_failure

Once the VM is launched on the other compute node, the fdb entries and neighbour entries are no longer populated on the network node or on the compute node.

Aaron Rosen (arosen) wrote :

Can you check the log files for errors, or provide them? What are 'fdb' entries and neighbouring entries?

Changed in neutron:
status: New → Incomplete
Mathieu Rohon (mathieu-rohon) wrote :

After the reboot --hard, the port status is BUILD, and the port is still bound to the previous host.
It looks like nova doesn't send update_port with the new host.

Mathieu Rohon (mathieu-rohon) wrote :

Aaron,

fdb entries are the forwarding entries on the bridge of the host, and the IP neighbour entries are the ARP responder entries.
I have one network node and two compute nodes.

I first create a VM with IP 10.0.0.104 and MAC 00:00:00:44:44:44 on node1. On the network node I then have:

# ip neigh show
10.0.0.104 dev vxlan-1001 lladdr 00:00:00:44:44:44 PERMANENT
# bridge fdb show dev vxlan-1001
00:00:00:00:00:00 dst 192.168.254.74 self permanent
00:00:00:44:44:44 dst 192.168.254.74 self permanent
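
A minimal sketch of how the fdb listing above could be checked mechanically, assuming the output format shown in this comment (`<mac> dst <ip> self permanent`). The helper name `parse_fdb` is hypothetical and only illustrates the check; it is not part of any OpenStack tool.

```python
def parse_fdb(output):
    """Map each MAC in `bridge fdb show` output to the remote IPs after `dst`."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        # Only lines carrying a "dst <ip>" pair describe a remote tunnel endpoint.
        if "dst" in fields:
            idx = fields.index("dst")
            if idx + 1 < len(fields):
                table.setdefault(fields[0], []).append(fields[idx + 1])
    return table


# Output copied from the network node listing above.
SAMPLE = """\
00:00:00:00:00:00 dst 192.168.254.74 self permanent
00:00:00:44:44:44 dst 192.168.254.74 self permanent
"""

fdb = parse_fdb(SAMPLE)
# The VM's MAC should point at the tunnel IP of the compute node hosting it.
print(fdb.get("00:00:00:44:44:44"))
```

After a successful move to node2, the entry for the VM's MAC would be expected to list node2's tunnel IP instead; in this bug it is never repopulated.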

Then I change the host of the VM in the nova database:
mysql> update instances set host = 'node2' where host = 'node1' and deleted = 0;

Then I do the reboot:
# nova reboot --hard uuid

Actually, this doesn't seem to be an l2-population-only bug, since nova doesn't send the update_port with the new host.
As a result, the agent sends "get_device_details", which moves the port to status BUILD. The agent then sends update_device_up, which is not forwarded to the MD because the port is not bound to the host of the agent that sent update_device_up.
Here is the log of the plugin:
2014-02-25 18:01:32.424 28952 DEBUG neutron.plugins.ml2.rpc [req-3fd91aee-67b2-4fd7-b0e1-d3afb91ef7f6 None] Device tap147edc0d-44 up at agent lb00163ef452ac update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:186
2014-02-25 18:01:32.427 28952 DEBUG neutron.plugins.ml2.rpc [req-3fd91aee-67b2-4fd7-b0e1-d3afb91ef7f6 None] Device tap147edc0d-44 not bound to the agent host devstack2 update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:192

In the ML2 db, the port is still bound to node1, and the port status is still BUILD.
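
The sequence described above can be sketched as a toy model. This is not Neutron's code: the `Port` class and both functions are hypothetical stand-ins that only mirror the behaviour reported in this comment, and the transition to ACTIVE on success is an assumption about the normal case.

```python
from dataclasses import dataclass


@dataclass
class Port:
    device: str
    bound_host: str        # host recorded in the ML2 port binding
    status: str = "DOWN"


def get_device_details(port):
    """Model of the agent's first call: the port moves to BUILD."""
    port.status = "BUILD"


def update_device_up(port, agent_host):
    """Model of the host check seen in the log above.

    The update is dropped when the calling agent's host differs from
    the host the port is bound to, so the status never leaves BUILD.
    """
    if port.bound_host != agent_host:
        return False
    port.status = "ACTIVE"  # assumed outcome when the hosts match
    return True


# The binding still names node1, but the agent on node2 reports the device.
port = Port(device="tap147edc0d-44", bound_host="node1")
get_device_details(port)
accepted = update_device_up(port, agent_host="node2")
print(accepted, port.status)
```

With the binding left on node1 the call is rejected and the port stays in BUILD, which matches the state reported in the ML2 db.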

summary: - l2-population : hard reboot a VM after a compute crash
+ ML2 : hard reboot a VM after a compute crash
Changed in neutron:
status: Incomplete → New
Aaron Rosen (arosen)
tags: added: network
Changed in neutron:
status: New → Confirmed
importance: Undecided → Medium
liuweicai (liuuweicai) wrote :

Are you using shared storage for /var/lib/nova/instances?

Sandhya Dasu (sadasu)
Changed in neutron:
assignee: nobody → Sandhya Dasu (sadasu)
Brent Eagles (beagles)
tags: added: neutron
Sean Dague (sdague)
no longer affects: nova
Sandhya Dasu (sadasu) wrote :

Trying to reproduce this problem in my multi-node setup.

Alan Pevec (apevec)
tags: removed: havana-backport-potential
Andreas Scheuring (andreas-scheuring) wrote :

As manipulating the database is not a valid operation, the only way to fix this is to update the documentation.
What probably needs to be done is to update Neutron's port binding data as well when ML2 is used.

We could do:
 update ml2_port_bindings set host = 'new-host' where host = 'old-host';

But on my test system the port is still not bound correctly. I see the following message in the server log:

    Device tap4b0e8410-ac requested by agent lb5254002a14ad on network eaf89438-2df6-4d1d-a2b0-bc181606f46b not bound, vif_type: bridge

Because of that, the agent doesn't set the device up. I see the following message:
    Device tap4b0e8410-ac not defined on plugin

Andreas Scheuring (andreas-scheuring) wrote :

just realized there's another entry in the ml2_port_binding_levels table... let me try this

Sandhya Dasu (sadasu)
Changed in neutron:
assignee: Sandhya Dasu (sadasu) → nobody
Andreas Scheuring (andreas-scheuring) wrote :

That worked! Additional queries to run:
use neutron
update ml2_port_bindings set host = 'new-host' where host = 'old-host';

update ml2_port_binding_levels set host = 'new-host' where host = 'old-host';

Setting Neutron to Won't Fix, adding a bug against the docs, and providing a fix.
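
A minimal sketch of the two updates above applied together, assuming they should succeed or fail as a unit. It uses an in-memory SQLite database as a stand-in for the real Neutron MySQL database, and the two-table schema is heavily simplified (the real tables have more columns). The helper name `rebind_ports` is hypothetical.

```python
import sqlite3


def rebind_ports(conn, old_host, new_host):
    """Move all bindings from old_host to new_host in both ML2 tables.

    Returns the number of rows changed in each table. Using the
    connection as a context manager commits both updates together,
    or rolls both back if either one raises.
    """
    counts = []
    with conn:
        for table in ("ml2_port_bindings", "ml2_port_binding_levels"):
            cur = conn.execute(
                f"UPDATE {table} SET host = ? WHERE host = ?",
                (new_host, old_host),
            )
            counts.append(cur.rowcount)
    return tuple(counts)


# Simplified stand-in schema; only the columns this sketch needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ml2_port_bindings (port_id TEXT, host TEXT)")
conn.execute(
    "CREATE TABLE ml2_port_binding_levels (port_id TEXT, host TEXT, level INTEGER)"
)
conn.execute("INSERT INTO ml2_port_bindings VALUES ('port-a', 'old-host')")
conn.execute("INSERT INTO ml2_port_binding_levels VALUES ('port-a', 'old-host', 0)")
conn.commit()

print(rebind_ports(conn, "old-host", "new-host"))  # (1, 1)
```

Updating only ml2_port_bindings reproduces the earlier "not bound" failure, so the sketch treats the two tables as one unit of work.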

Changed in neutron:
status: Confirmed → Won't Fix
OpenStack Infra (hudson-openstack) wrote : Fix proposed to operations-guide (master)

Fix proposed to branch: master
Review: https://review.openstack.org/289458

Changed in openstack-manuals:
assignee: nobody → Andreas Scheuring (andreas-scheuring)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to operations-guide (master)

Reviewed: https://review.openstack.org/289458
Committed: https://git.openstack.org/cgit/openstack/operations-guide/commit/?id=610019c45496e17a8094d535c856d9de9a16f123
Submitter: Jenkins
Branch: master

commit 610019c45496e17a8094d535c856d9de9a16f123
Author: Andreas Scheuring <email address hidden>
Date: Mon Mar 7 18:13:18 2016 +0100

    Ops: Update Compute Node Failure section with Neutron content

    Adding content describing how to fix the Neutron ML2 database in the
    case of a total compute node failure.

    Change-Id: I4712f604f798c6c9ebb6f9d8d82b63d8ac2ed599
    Closes-Bug: #1282956

Changed in openstack-manuals:
status: In Progress → Fix Released