ML2 : hard reboot a VM after a compute crash

Bug #1282956 reported by Mathieu Rohon on 2014-02-21

This bug affects 3 people

Affects            Status        Importance  Assigned to        Milestone
neutron            Won't Fix     Medium      Unassigned
openstack-manuals  Fix Released  Undecided   Andreas Scheuring

Bug Description

I run a multi-node setup with ML2, the l2-population and Linuxbridge mechanism drivers, and the vxlan type driver.

I start two compute nodes, launch a VM, and shut down the compute node that hosts the VM.

I use this procedure to relaunch the VM on the other compute node:

http://docs.openstack.org/trunk/openstack-ops/content/maintenance.html#totle_compute_node_failure

Once the VM is launched on the other compute node, the fdb entries and neighbour entries are no longer populated on the network node or on the compute node.

Aaron Rosen (arosen) wrote on 2014-02-21:

Can you check the log files for errors, or provide them? What are 'fdb' entries and neighbouring entries?

Changed in neutron:
status:	New → Incomplete

Mathieu Rohon (mathieu-rohon) wrote on 2014-02-21:

After the reboot --hard, the port status is BUILD and the port is still bound to the previous host.
It looks like nova doesn't send update_port with the new host.

Mathieu Rohon (mathieu-rohon) wrote on 2014-02-25:

Aaron,

fdb entries are the forwarding information on the host's bridge, and the IP neighbour entries are the ARP responder entries.
I have one network node and two compute nodes.

I first create a VM with IP 10.0.0.104 and MAC 00:00:00:44:44:44 on node1. The network node then shows:

# ip neigh show
10.0.0.104 dev vxlan-1001 lladdr 00:00:00:44:44:44 PERMANENT
# bridge fdb show dev vxlan-1001
00:00:00:00:00:00 dst 192.168.254.74 self permanent
00:00:00:44:44:44 dst 192.168.254.74 self permanent
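To check entries like these without reading them by eye, the `bridge fdb show` output can be parsed into a MAC-to-VTEP mapping. This is only a minimal sketch: `parse_fdb` is a hypothetical helper, and it assumes the exact line format shown above (`<mac> dst <ip> self permanent`), nothing more.

```python
def parse_fdb(output):
    """Map each MAC address to the set of remote VTEP IPs listed for it.

    Hypothetical helper: it only understands lines of the form
    '<mac> dst <ip> ...' as printed in the report above.
    """
    entries = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip anything that is not '<mac> dst <ip> ...'
        if len(fields) >= 3 and fields[1] == "dst":
            entries.setdefault(fields[0], set()).add(fields[2])
    return entries


# Sample taken from the output quoted in this comment.
sample = """\
00:00:00:00:00:00 dst 192.168.254.74 self permanent
00:00:00:44:44:44 dst 192.168.254.74 self permanent
"""

fdb = parse_fdb(sample)
vm_mac = "00:00:00:44:44:44"
print(vm_mac, "->", sorted(fdb.get(vm_mac, set())))
```

After a successful move to the other compute node, the VM's MAC would be expected to point at that node's tunnel IP instead; in the failure described here it does not.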

Then I change the VM's host binding in the nova database:
mysql> update instances set host = 'node2' where host = 'node1' and deleted = 0;

Then I do the reboot:
# nova reboot --hard uuid

Actually, this doesn't seem to be an l2-population-only bug, since nova doesn't send update_port with the new host.
As a result, the agent sends "get_device_details", which moves the port to status BUILD. The agent then sends update_device_up, which is not forwarded to the mechanism driver because the port is not bound to the host of the agent that sent it.
Here is the plugin log:
2014-02-25 18:01:32.424 28952 DEBUG neutron.plugins.ml2.rpc [req-3fd91aee-67b2-4fd7-b0e1-d3afb91ef7f6 None] Device tap147edc0d-44 up at agent lb00163ef452ac update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:186
2014-02-25 18:01:32.427 28952 DEBUG neutron.plugins.ml2.rpc [req-3fd91aee-67b2-4fd7-b0e1-d3afb91ef7f6 None] Device tap147edc0d-44 not bound to the agent host devstack2 update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:192

In the ML2 db, the port is still bound to node1, and the port status is still BUILD.
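The sequence described in this comment can be modelled in a few lines. This is a toy sketch, not Neutron's code: the `Port` class and both functions are hypothetical stand-ins named after the RPC calls mentioned above, and the `DOWN` and `ACTIVE` status values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """Hypothetical stand-in for a port record in the ML2 db."""
    device: str
    bound_host: str
    status: str = "DOWN"


def get_device_details(port):
    # In the report, this request moves the port to status BUILD.
    port.status = "BUILD"


def update_device_up(port, agent_host):
    # Modelled on the log above: the call is ignored when the port is
    # not bound to the host of the agent that reports the device as up.
    if port.bound_host != agent_host:
        return False
    port.status = "ACTIVE"
    return True


port = Port(device="tap147edc0d-44", bound_host="node1")

# The agent on node2 picks up the device after the hard reboot.
get_device_details(port)
stale = update_device_up(port, agent_host="node2")
print(stale, port.status)

# If the binding had been updated to the new host, the port would go up.
port.bound_host = "node2"
fresh = update_device_up(port, agent_host="node2")
print(fresh, port.status)
```

With the stale binding the port stays in BUILD, which matches what is seen in the ML2 db.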

Mathieu Rohon (mathieu-rohon) on 2014-02-25

summary:

- l2-population : hard reboot a VM after a compute crash
+ ML2 : hard reboot a VM after a compute crash

Mathieu Rohon (mathieu-rohon) on 2014-03-10

Changed in neutron:
status:	Incomplete → New

Aaron Rosen (arosen) on 2014-03-18

tags:

added: network

Eugene Nikanorov (enikanorov) on 2014-05-27

Changed in neutron:
status:	New → Confirmed
importance:	Undecided → Medium

liuweicai (liuuweicai) wrote on 2014-05-27:

Are you using shared storage for /var/lib/nova/instances?

Sandhya Dasu (sadasu) on 2014-06-19

Changed in neutron:
assignee:	nobody → Sandhya Dasu (sadasu)

Brent Eagles (beagles) on 2014-08-11

tags:

added: neutron

Sean Dague (sdague) on 2014-09-09

no longer affects:

nova

Sandhya Dasu (sadasu) wrote on 2015-01-15:

Trying to reproduce this problem in my multi-node setup.

Alan Pevec (apevec) on 2015-11-24

tags:

removed: havana-backport-potential

Andreas Scheuring (andreas-scheuring) wrote on 2016-03-07:

Since manipulating the database is not a valid operation, the only way to fix this is to update the documentation.
What probably needs to be done is to update Neutron's port binding data as well when ML2 is used.

We could do:
update ml2_port_bindings set host = 'new-host' where host = 'old-host';

But on my test system the port is still not bound correctly. I see the following message on the server:

Device tap4b0e8410-ac requested by agent lb5254002a14ad on network eaf89438-2df6-4d1d-a2b0-bc181606f46b not bound, vif_type: bridge

Because of that, the agent doesn't set the device up. I see the following message:
Device tap4b0e8410-ac not defined on plugin

Andreas Scheuring (andreas-scheuring) wrote on 2016-03-07:

Just realized there's another entry in the ml2_port_binding_levels table... let me try this.

Sandhya Dasu (sadasu) on 2016-03-07

Changed in neutron:
assignee:	Sandhya Dasu (sadasu) → nobody

Andreas Scheuring (andreas-scheuring) wrote on 2016-03-07:

That worked! Additional queries to run:
use neutron
update ml2_port_bindings set host = 'new-host' where host = 'old-host';

update ml2_port_binding_levels set host = 'new-host' where host = 'old-host';

Setting Neutron to Won't Fix, adding a bug against the docs, and providing a fix.
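Since both tables have to change together, the two updates are a natural fit for a single transaction. The sketch below only illustrates that idea against an in-memory SQLite database: the table layouts are reduced stand-ins (assumed columns, not the real Neutron schema), `rebind_host` is a hypothetical helper, and a real deployment would run the MySQL statements quoted above.

```python
import sqlite3


def rebind_host(conn, old_host, new_host):
    """Point every binding row for old_host at new_host in one transaction.

    Hypothetical helper; returns the number of rows changed per table.
    """
    counts = {}
    with conn:  # commits on success, rolls back if an exception is raised
        for table in ("ml2_port_bindings", "ml2_port_binding_levels"):
            cur = conn.execute(
                f"UPDATE {table} SET host = ? WHERE host = ?",
                (new_host, old_host),
            )
            counts[table] = cur.rowcount
    return counts


# Reduced stand-in tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ml2_port_bindings (port_id TEXT, host TEXT)")
conn.execute(
    "CREATE TABLE ml2_port_binding_levels (port_id TEXT, host TEXT, level INTEGER)"
)
conn.execute("INSERT INTO ml2_port_bindings VALUES ('port-a', 'old-host')")
conn.execute("INSERT INTO ml2_port_binding_levels VALUES ('port-a', 'old-host', 0)")
conn.commit()

counts = rebind_host(conn, "old-host", "new-host")
print(counts)
```

Running both updates inside one transaction avoids the half-updated state seen in the earlier attempt, where only ml2_port_bindings had been changed.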

Changed in neutron:
status:	Confirmed → Won't Fix

OpenStack Infra (hudson-openstack) wrote on 2016-03-07: Fix proposed to operations-guide (master)

Fix proposed to branch: master
Review: https://review.openstack.org/289458

Changed in openstack-manuals:
assignee:	nobody → Andreas Scheuring (andreas-scheuring)
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-03-08: Fix merged to operations-guide (master)

Reviewed: https://review.openstack.org/289458
Committed: https://git.openstack.org/cgit/openstack/operations-guide/commit/?id=610019c45496e17a8094d535c856d9de9a16f123
Submitter: Jenkins
Branch: master

commit 610019c45496e17a8094d535c856d9de9a16f123
Author: Andreas Scheuring <email address hidden>
Date: Mon Mar 7 18:13:18 2016 +0100

Ops: Update Compute Node Failure section with Neutron content

Adding content describing how to fix the Neutron ML2 database in the
case of a total compute node failure.

Change-Id: I4712f604f798c6c9ebb6f9d8d82b63d8ac2ed599
Closes-Bug: #1282956

Changed in openstack-manuals:
status:	In Progress → Fix Released
