qr port of br-int is not set vlan after I restart machine and then restart quantum components

Bug #1050512 reported by yong sheng gong
This bug affects 1 person
Affects   Status         Importance   Assigned to     Milestone
neutron   Fix Released   High         dan wendlandt

Bug Description

I created a router and added an interface, and the VLAN tag was set correctly. But after I restarted the machine and the quantum components, the qr port's VLAN tag was lost.

As a workaround, I have to delete the router interface with quantum router-delete-interface and then run router-add-interface again.

Changed in quantum:
importance: Undecided → Critical
Gary Kotton (garyk) wrote :

A new patch is ready for devstack where the OVS VIF driver is updated. Maybe we should try to reproduce with the hybrid driver: https://review.openstack.org/#/c/11650/

Gary Kotton (garyk) wrote :

Sorry, that comment was posted on the wrong bug.

Gary Kotton (garyk) wrote :

OVS keeps the tap devices persistent across a reboot. Maybe when the agent starts we need to clean up...

dan wendlandt (danwent) wrote :

Assuming this is using OVS, the interfaces are actually 'internal' interfaces, not 'tap' devices, so their existence should continue across a reboot. However, they will be destroyed and recreated when the l3-agent is restarted.

dan wendlandt (danwent) wrote :

I saw this in my setup even without a restart. Will try to reproduce it with more detailed logging.

dan wendlandt (danwent) wrote :

Note: this is likely an agent issue, as it is the job of the agent to set the device on a particular VLAN.

dan wendlandt (danwent) wrote :

This bug was flying below the radar a bit because it was missing the folsom-rc-potential tag. adding it.

tags: added: folsom-rc-potential
dan wendlandt (danwent)
Changed in quantum:
status: New → Confirmed
dan wendlandt (danwent)
Changed in quantum:
assignee: nobody → Gary Kotton (garyk)
Gary Kotton (garyk) wrote :

Would it be possible to provide some extra information on how to reproduce the problem? I am trying, but to no avail at the moment.
Thanks
Gary

yong sheng gong (gongysh) wrote :

I tried but cannot reproduce it. Still working on it...

Gary Kotton (garyk) wrote : Re: [Bug 1050512] Re: qr port of br-int is not set vlan after I restart machine and then restart quantum components

On 09/18/2012 09:14 AM, yong sheng gong wrote:
> I tried but cannot reproduce it. Still working on it...
>
Thanks! I'll continue to try and reproduce.

Salvatore Orlando (salvatore-orlando) wrote :

Holding off on targeting this for rc2 until we manage to reproduce it.

Salvatore Orlando (salvatore-orlando) wrote :

I have not been able to reproduce this bug either.
I am therefore suggesting we keep it out of Folsom rc-2.

Gary Kotton (garyk) wrote :

Hi,
I have spent quite a lot of time trying to reproduce this. I have the following findings:
1. The L2 agent polls OVS every 2 seconds to learn whether the attached devices have changed. If they have, it queries the quantum plugin for the details. This is where we have a number of interesting things:
    i. If the quantum service has not started (which could be the case here) then by default the requests will hang for the default timeout (rpc_response_timeout=60). This still does not explain why the tag was removed. I think that this value should be changed to at most 5 seconds.
    ii. If the agent gets a response from the plugin that is inconsistent with its configuration, it will set the tag to 4095 (I do not think that this was the case here).
2. When the appliance is rebooted the OVS entries are persistent, which means that the created devices and their tags remain unchanged after reboot. The L2 agent can either delete the entry or set the tag to 4095.
3. I am not sure whether the processes were started manually or via systemd. If it is systemd, then I think that the packages need to ensure dependencies on the startup order:
    i. rpc message service (if running on host)
    ii. database (mysql if running on host)
    iii. quantum service (if running on host)
    iv. quantum agents (if running on host)
I am going to take the liberty to move this to incomplete. Hopefully someone may have a scenario that reproduces so that we can fix the problem.
I think that we should also address the default timeout for the rpc call.
Thanks
Gary
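
A rough illustration of findings 1 and 1.ii above, not the actual agent code: each poll compares the set of ports seen now with the set seen last time, and a newly seen port gets the 4095 tag when the plugin's answer cannot be matched to local configuration. All names here (diff_ports, choose_tag, DEAD_VLAN_TAG, the "network_id" key) are hypothetical, and treating "no details" or "unknown network" as the inconsistent case is an assumption.

```python
# Sketch only: hypothetical names, not the quantum OVS agent's real API.
DEAD_VLAN_TAG = 4095  # tag value mentioned in the comment above


def diff_ports(previous, current):
    """Return (added, removed) port names between two polls of the bridge."""
    added = current - previous
    removed = previous - current
    return added, removed


def choose_tag(port_details, local_vlan_by_network):
    """Pick the VLAN tag for a newly seen port.

    Assumption: a missing answer from the plugin, or a network the agent
    has no local VLAN for, counts as "inconsistent" and gets the dead tag.
    """
    if not port_details:
        return DEAD_VLAN_TAG
    network_id = port_details.get("network_id")
    return local_vlan_by_network.get(network_id, DEAD_VLAN_TAG)
```

For example, diff_ports({"qr-1", "tap-2"}, {"qr-1", "qr-3"}) reports "qr-3" as added and "tap-2" as removed, and choose_tag falls back to 4095 for any port whose network is not in the local map.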

Gary Kotton (garyk)
Changed in quantum:
status: Confirmed → Incomplete
dan wendlandt (danwent) wrote :

Hi folks. This may not be the same thing that yong originally saw, but I think it might explain what I've been seeing recently.

There's a bug that we fixed in RC2 that causes ovs_lib to throw an exception in some circumstances: https://bugs.launchpad.net/quantum/+bug/1050504 . We've fixed this, but I noticed that when we hit this, quantum-openvswitch-agent actually exits completely. This means any new ports that are created are not tagged with a vlan at all, since the agent isn't running at all.

This really confused me, as the old db polling model made sure that no matter what happened in OVS, we would catch the exception and the agent would never exit; it would just loop again, resetting all state with each loop. However, it appears that with the RPC version, this is not the case. An exception at any time in the processing causes the agent to exit completely. At least in the Ubuntu packaging, it is never restarted.

We've fixed the particular bug mentioned above, but there will be more, so I really think we need to improve exception handling in RPC mode. This may be as simple as adding a try/catch in rpc_loop(), but I'd have to investigate in more detail how to safely reset state.

Changed in quantum:
importance: Critical → High
dan wendlandt (danwent) wrote :

Perhaps it's as simple as setting sync = True in the exception handling logic? I'd like someone who worked a lot on the RPC stuff (gary?) to comment on this more.
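
To make the idea concrete, here is a minimal sketch of a polling loop with a catch-all handler that sets a sync flag, so that one failed pass triggers a full resync instead of ending the process. This is not the quantum agent's code: agent_loop, process_once, max_iterations and the injectable sleep are hypothetical names added so the sketch can run on its own.

```python
import logging
import time

LOG = logging.getLogger(__name__)


def agent_loop(process_once, polling_interval=2.0, max_iterations=None,
               sleep=time.sleep):
    """Poll repeatedly, surviving unexpected exceptions.

    process_once(sync) is a hypothetical callback that does one round of
    port processing; when sync is True it should first rebuild all state.
    max_iterations exists only so the sketch can be stopped in a test.
    """
    sync = True  # the first pass always does a full resync
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            process_once(sync)
            sync = False  # a clean pass: incremental updates are safe again
        except Exception:
            # Catch-all: log the error and force a resync on the next pass
            # rather than letting the exception terminate the agent.
            LOG.exception("Unexpected error in agent loop; will resync")
            sync = True
        sleep(polling_interval)
    return iterations
```

With a callback that raises on its second call, the loop keeps running and the third call receives sync=True. Whether a full resync is always safe is exactly the open question raised above.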

OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/13426

Changed in quantum:
assignee: Gary Kotton (garyk) → dan wendlandt (danwent)
status: Incomplete → In Progress
dan wendlandt (danwent) wrote :

I'd like someone more familiar with the RPC logic to take a look at the above patch and see if they see any problems.

dan wendlandt (danwent)
Changed in quantum:
milestone: none → folsom-rc2
OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (master)

Reviewed: https://review.openstack.org/13426
Committed: http://github.com/openstack/quantum/commit/a13a5737cde4f81e51a2f939f77fdd35d8ab483e
Submitter: Jenkins
Branch: master

commit a13a5737cde4f81e51a2f939f77fdd35d8ab483e
Author: Dan Wendlandt <email address hidden>
Date: Thu Sep 20 21:49:51 2012 -0700

    Add catch-all try/catch within rpc_loop in ovs plugin agent

    related to bug 1050512

    when running in db-mode, the ovs plugin agent will catch any unexpected
    exceptions generated during processing. However, in rpc-mode, this
    does not happen, meaning a small error, even a transient one, causes the
    agent to exit completely. This change adds a try-catch block to the
    rpc_loop(), causing the agent to log any unexpected exception, wait for
    the polling period, then retry the loop after resetting all state.

    Change-Id: I76eae1800831e59c5078c4be8fa5ca22298bfb0a

Changed in quantum:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
tags: removed: folsom-rc-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (milestone-proposed)

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/13453

OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (milestone-proposed)

Reviewed: https://review.openstack.org/13453
Committed: http://github.com/openstack/quantum/commit/0cab87a8935559b20eb42b85695859ad10a1df56
Submitter: Jenkins
Branch: milestone-proposed

commit 0cab87a8935559b20eb42b85695859ad10a1df56
Author: Dan Wendlandt <email address hidden>
Date: Thu Sep 20 21:49:51 2012 -0700

    Add catch-all try/catch within rpc_loop in ovs plugin agent

    related to bug 1050512

    when running in db-mode, the ovs plugin agent will catch any unexpected
    exceptions generated during processing. However, in rpc-mode, this
    does not happen, meaning a small error, even a transient one, causes the
    agent to exit completely. This change adds a try-catch block to the
    rpc_loop(), causing the agent to log any unexpected exception, wait for
    the polling period, then retry the loop after resetting all state.

    Change-Id: I76eae1800831e59c5078c4be8fa5ca22298bfb0a

Changed in quantum:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in quantum:
milestone: folsom-rc2 → 2012.2