test_volume_boot_pattern fails with DVR routers

Bug #1358554 reported by Armando Migliaccio
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Rossella Sblendido
Milestone: 2014.2

Bug Description

This happens on master and appears to be a regression of https://review.openstack.org/#/c/112465/

Steps to reproduce:

- Deploy devstack with DVR *ON*
- Run tempest test test_volume_boot_pattern (more details available in [1])
- Watch the test fail

Once change 112465 is reverted, the test will be successful.

This does not imply that the change itself is at fault; rather, it has unveiled another failure mode.

In fact, prior to change 112465, call [2] was made with the wrong agent id, which caused the delete_port operation to abort altogether; that was wrong and is addressed by 112465. However, the boot-from-volume test and its sequence of operations revealed that the clean-up logic is not working as it should.

[1] - https://github.com/openstack/tempest/blob/master/tempest/scenario/test_volume_boot_pattern.py#L31
[2] - https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/plugin.py#L1029

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

More triaging:

As the test implementation in [1] shows, the test performs the following steps:

1. create an instance from volume
2. write content to volume on instance
3. delete instance  <<==
4. create a 2nd instance from volume
5. check the content of written file
6. snapshot a volume
7. create a 3rd instance from snapshot
8. check the content of written file

The marked step (deleting the first instance) seems to be the one that causes the failure (the test fails at [2]). This would indicate that once the namespace is cleared (thanks to change 112465), it is no longer re-configured correctly.

[1] - https://github.com/openstack/tempest/blob/master/tempest/scenario/test_volume_boot_pattern.py#L155
[2] - https://github.com/openstack/tempest/blob/master/tempest/scenario/test_volume_boot_pattern.py#L163

Changed in neutron:
status: New → Confirmed
Changed in neutron:
importance: Undecided → High
Changed in neutron:
assignee: nobody → Rossella Sblendido (rossella-o)
Revision history for this message
Stephen Ma (stephen-ma) wrote :

I reproduced the same problem by running tempest on my devstack setup. The VM does get its fixed IP address from the DHCP server, but it cannot ping the gateway. The gateway can't be pinged from the DHCP namespace either, so the qr- device is broken in the qrouter namespace. The gateway port in my test has UUID 3e795110-ad7f-4159-b89f-7de83cecdaba.

In the ovs-vsctl show output, the corresponding port qr-3e795110-ad has a tag of 4095, which is the value the OVS agent uses to mark a port as dead. So the next step is to find out why the OVS agent thinks the port is dead.

   Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
 ...
    Bridge br-tun
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    Bridge br-int
    ...
        Port "qr-3e795110-ad"
            tag: 4095
            Interface "qr-3e795110-ad"
                type: internal
  ...
    ovs_version: "2.0.1"
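
For context, a port tagged with 4095 is effectively disabled: the agent re-tags it onto a reserved "dead" VLAN so it stops forwarding traffic. A minimal sketch of that mechanism follows; the constant name and helper are illustrative, not the agent's actual code.

    import subprocess

    # The OVS agent reserves VLAN 4095 as a "dead" tag; traffic on that VLAN
    # is dropped, which effectively disables the port.
    DEAD_VLAN_TAG = 4095

    def mark_port_dead(port_name, timeout=10):
        """Re-tag an OVS port so it stops forwarding traffic (sketch only)."""
        subprocess.check_call(
            ['ovs-vsctl', '--timeout=%d' % timeout,
             'set', 'Port', port_name, 'tag=%d' % DEAD_VLAN_TAG])

    # e.g. mark_port_dead('qr-3e795110-ad') would reproduce the state shown above.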

Revision history for this message
Brian Haley (brian-haley) wrote :

I was going to start instrumenting things to help figure this out, but from the devstack logs I can already see a call to treat_devices_added_or_updated() that does not find the port_id in the plugin's reply.

Some quick snippets from the q-agt.log:

Processing port: 3a92ebef-520d-4c68-8b1f-93ade0783625 treat_devices_added_or_updated

 Port 3a92ebef-520d-4c68-8b1f-93ade0783625 updated. Details: {u'profile': {}, u'admin_state_up': True, u'network_id': u'a3f1feb8-4937-4d74-9c45-b2939385b49e', u'segmentation_id': 1004, u'device_owner': u'network:router_interface_distributed', u'physical_network': None, u'mac_address': u'fa:16:3e:a1:48:44', u'device': u'3a92ebef-520d-4c68-8b1f-93ade0783625', u'port_id': u'3a92ebef-520d-4c68-8b1f-93ade0783625', u'fixed_ips': [{u'subnet_id': u'ac5a54af-f1e5-413b-9fed-a771d49f9d2b', u'ip_address': u'10.100.0.1'}], u'network_type': u'vxlan'}

 Assigning 3 as local vlan for net-id=a3f1feb8-4937-4d74-9c45-b2939385b49e

 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'qr-3a92ebef-52', 'tag=3'

 update_device_up
 {couple of minutes}
 update_device_down

Then a few minutes later:

Processing port: 3a92ebef-520d-4c68-8b1f-93ade0783625 treat_devices_added_or_updated

 Device 3a92ebef-520d-4c68-8b1f-93ade0783625 not defined on plugin

 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'qr-3a92ebef-52', 'tag=4095'

I'll continue looking tomorrow.
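
Putting the two log excerpts together: the agent asks the plugin for the details of each changed device, and when the reply carries no port_id it treats the device as not defined on the plugin and tags the port dead. A simplified sketch of that decision (the callback names below are placeholders, not the agent's real method names):

    DEAD_VLAN_TAG = 4095  # tag the agent uses to disable a port

    def process_device(device_id, get_device_details, wire_port_up, mark_port_dead):
        """Sketch of the agent-side decision behind the log lines above.

        get_device_details stands in for the RPC call to the plugin;
        wire_port_up / mark_port_dead stand in for setting the local VLAN
        tag or the dead tag (4095) on the corresponding OVS port.
        """
        details = get_device_details(device_id)
        if 'port_id' in details:
            # Plugin knows the port: assign a local VLAN and plug it
            # ("Assigning 3 as local vlan ..." / "tag=3" above).
            wire_port_up(details)
        else:
            # Plugin replied without binding info ("Device ... not defined
            # on plugin"): the agent gives up and sets tag=4095.
            mark_port_dead(device_id)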

Revision history for this message
Rossella Sblendido (rossella-o) wrote :

Thanks everybody for your comments.
I can confirm all your findings. The problem is the router-interface port.
I analyzed the logs of this run http://logs.openstack.org/91/114691/6/experimental/check-tempest-dsvm-neutron-dvr/73cff0a/

The router interface is 86c6f7a4-31e6-4cd4-837a-68843730c101

From the q-agt log we see that the router interface is not present in the interface list and it's marked as removed:

2014-08-21 04:42:52.026 28137 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-74f7c36e-7acb-4398-995c-a5d9c3e5ad98 None] Starting to process devices in:{'current': set([u'fe074629-227d-42e4-8b16-21f2efc0e8c7', u'4423ca10-d95e-4ce7-abc8-25ce4bd308fa', u'c3349086-3b0d-49fe-85ce-c2ae7ce55fcc', u'a376b867-5843-47a4-a158-5157e441d937', u'0a82de5e-e62f-475c-bc9c-ed115bd58afd', u'cd5c3035-2cd5-4e4c-93c7-d20075972120']), 'removed': set([u'86c6f7a4-31e6-4cd4-837a-68843730c101']), 'added': set([u'4423ca10-d95e-4ce7-abc8-25ce4bd308fa'])} rpc_loop /opt/stack/new/neutron/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1400

2014-08-21 04:42:56.116 28137 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-74f7c36e-7acb-4398-995c-a5d9c3e5ad98 None] Attachment 86c6f7a4-31e6-4cd4-837a-68843730c101 removed

Then the interface is detected again:
2014-08-21 04:42:59.293 28137 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-74f7c36e-7acb-4398-995c-a5d9c3e5ad98 None] Starting to process devices in:{'current': set([u'fe074629-227d-42e4-8b16-21f2efc0e8c7', u'4423ca10-d95e-4ce7-abc8-25ce4bd308fa', u'c3349086-3b0d-49fe-85ce-c2ae7ce55fcc', u'a376b867-5843-47a4-a158-5157e441d937', u'86c6f7a4-31e6-4cd4-837a-68843730c101', u'0a82de5e-e62f-475c-bc9c-ed115bd58afd', u'cd5c3035-2cd5-4e4c-93c7-d20075972120']), 'removed': set([]), 'added': set([u'86c6f7a4-31e6-4cd4-837a-68843730c101'])} rpc_loop /opt/stack/new/neutron/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1400

2014-08-21 04:43:00.332 28137 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-74f7c36e-7acb-4398-995c-a5d9c3e5ad98 None] Processing port: 86c6f7a4-31e6-4cd4-837a-68843730c101 treat_devices_added_or_updated /opt/stack/new/neutron/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1090

This operation fails because:
2014-08-21 04:43:00.809 28137 WARNING neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-74f7c36e-7acb-4398-995c-a5d9c3e5ad98 None] Device 86c6f7a4-31e6-4cd4-837a-68843730c101 not defined on plugin

If we check the q-srv log we see:

2014-08-21 04:43:00.314 28081 DEBUG neutron.plugins.ml2.db [req-74f7c36e-7acb-4398-995c-a5d9c3e5ad98 None] No binding for DVR port 86c6f7a4-31e6-4cd4-837a-68843730c101 with host devstack-trusty-hpcloud-b5-1583220 get_dvr_port_binding_by_host /opt/stack/new/neutron/neutron/plugins/ml2/db.py:222
2014-08-21 04:43:00.315 28081 ERROR neutron.plugins.ml2.plugin [req-74f7c36e-7acb-4398-995c-a5d9c3e5ad98 None] Binding info for DVR port 86c6f7a4-31e6-4cd4-837a-68843730c101 not found

At this point the port is marked as dead by setting its tag to 4095:

2014-08-21 04:43:01.057 28137 DEBUG neutron.agent.linux.utils [req-74f7c36e-7acb-4398-995c-a5d9c3e5ad98 None] Running command: ['sudo', '/usr/local/bin/n...

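In short, the server-side lookup is what fails: when the agent asks for the device details, the ML2 plugin looks up the DVR port binding for that (port, host) pair, finds none, and can only answer with the bare device id, which the agent then treats as "not defined on plugin". A condensed, self-contained sketch of that path; the dict below stands in for ml2's get_dvr_port_binding_by_host() DB query.

    def get_device_details_sketch(dvr_bindings, port_id, host):
        """Simplified view of the server-side lookup behind the q-srv log above.

        dvr_bindings: mapping of (port_id, host) -> binding record (a dict here),
        standing in for the ML2 DB query get_dvr_port_binding_by_host().
        """
        binding = dvr_bindings.get((port_id, host))
        if binding is None:
            # "Binding info for DVR port ... not found": the reply carries no
            # port_id, so the agent marks the port dead (tag=4095).
            return {'device': port_id}
        # Happy path: the reply includes port_id, network_id, segmentation_id,
        # etc., and the agent wires the port with a real local VLAN.
        return dict(binding, device=port_id, port_id=port_id)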

Revision history for this message
Rossella Sblendido (rossella-o) wrote :

The cause of the bug is that the router interface is detected as down:

2014-08-21 04:42:56.122 28081 DEBUG neutron.plugins.ml2.rpc [req-74f7c36e-7acb-4398-995c-a5d9c3e5ad98 None] Device 86c6f7a4-31e6-4cd4-837a-68843730c101 no longer exists at agent ovs-agent-devstack-trusty-hpcloud-b5-1583220 update_device_down /opt/stack/new/neutron/neutron/plugins/ml2/rpc.py:156

Then the plugin receives an RPC message asking for the device details before it has detected the router interface as UP. So the binding was not updated, and no binding is found for that interface on that host.

I will think of a way to solve it and propose a patch ASAP.

Changed in neutron:
milestone: none → juno-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/117285

Changed in neutron:
status: Confirmed → In Progress
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-3 → juno-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/117285
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=78059968e212d537c813f398da7f76714380bf4e
Submitter: Jenkins
Branch: master

commit 78059968e212d537c813f398da7f76714380bf4e
Author: rossella <email address hidden>
Date: Wed Aug 27 10:48:41 2014 +0000

    Update DVR Binding when router_id changes

    In this patch I modified update_dvr_port_binding to update the
    binding if router_id changes.
    When a new router namespace is created on a host, sync_router
    will call _ensure_host_set_on_port that will update the DVR binding.
    With the current code update_dvr_port_binding won't update the
    binding if there's an existing binding unless the binding
    vif_type is VIF_TYPE_BINDING_FAILED . There's a race when
    the router namespace is destroyed and then recreated on the same
    host.
    The DVR binding is deleted only when the message
    update_device_down is processed. If this message is processed
    after the update_dvr_port_binding triggered by the namespace creation,
    update_dvr_port_binding won't update the binding. When the router
    interface will be detected as UP, no DVR binding will be found.

    Closes-bug: #1358554
    Change-Id: I37fd4ed67dc2019b57e36d082b584c517d8f67a8
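
The shape of the fix, as described in the commit message: an existing DVR binding is also refreshed when the router_id it stores no longer matches the port's device_id, instead of only when the previous binding attempt failed. A rough sketch of the updated condition, reconstructed from the commit message above and not a verbatim copy of the merged patch:

    def dvr_binding_needs_update(binding, device_id,
                                 failed_vif_type='binding_failed'):
        """Sketch of the decision inside update_dvr_port_binding after the fix.

        binding: existing DVR binding record (assumed dict-like, exposing
        'vif_type' and 'router_id') or None; device_id: the router that
        currently owns the port.
        """
        if binding is None:
            return True
        if binding.get('vif_type') == failed_vif_type:
            return True
        # New with the fix: a stale binding left over from a destroyed router
        # namespace is refreshed when the namespace is recreated on the host.
        return binding.get('router_id') != device_id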

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-rc1 → 2014.2
Revision history for this message
Sam Stoelinga (sammiestoel) wrote :

I think I'm seeing the same thing on Juno without DVR. The router interface port has its tag set to 4095:

    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port "qr-c04a5455-c1"
            tag: 4095
            Interface "qr-c04a5455-c1"
                type: internal
        Port int-br-prv
            Interface int-br-prv
                type: patch
                options: {peer=phy-br-prv}

And the port shows up as follows in neutron:
neutron port-show c04a5455-c1b4-4bc5-ad5b-8b16d372c036
+-----------------------+--------------------------------------------------------------------------------------+
| Field | Value |
+-----------------------+--------------------------------------------------------------------------------------+
| admin_state_up | True |
| allowed_address_pairs | |
| binding:host_id | node-6 |
| binding:profile | {} |
| binding:vif_details | {} |
| binding:vif_type | binding_failed |
| binding:vnic_type | normal |
| device_id | 602539d1-2cf9-4917-919d-4892976f6d10 |
| device_owner | network:router_interface |
| extra_dhcp_opts | |
| fixed_ips | {"subnet_id": "533e34e4-3474-469b-9e65-cd34498ec99e", "ip_address": "192.168.111.1"} |
| id | c04a5455-c1b4-4bc5-ad5b-8b16d372c036 |
| mac_address | fa:16:3e:f0:21:64 |
| name | |
| network_id | 4f2d6195-5487-4d85-ae8c-87085af201e3 |
| security_groups | |
| status | DOWN |
| tenant_id | 17ce62ae311f4c17b322fbced33178ca |
+-----------------------+--------------------------------------------------------------------------------------+
