Hangs of Neutron l3 agent

Bug #1361710 reported by Tony Tarasov
This bug affects 3 people
Affects             Status         Importance  Assigned to      Milestone
Mirantis OpenStack  Fix Committed  Critical    Ilya Shakhat
4.1.x               In Progress    Critical    Ryan Moe
5.0.x               In Progress    Critical    MOS Maintenance
5.1.x               Fix Released   Critical    Ilya Shakhat
6.0.x               Fix Committed  Critical    Ilya Shakhat

Bug Description

Hello Guys.

We have successfully deployed a customer cloud.
It uses a custom Fuel 5.0.1 build that includes Zabbix.
It runs in HA mode with FWaaS (L3), LBaaS, and VPNaaS (L3) added.

Sometimes neutron-l3-agent hangs.

There are no visible issues and no errors in the log files; the agent simply stops doing its work.

What we see:
Internal interfaces inside the router are in the DOWN state.
The external gateway isn't accessible from the outside.
Floating IPs aren't accessible from the outside.
Namespaces can't be created.
The L3 agent status is fine.
Service statuses are fine.
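
One way to check the "namespaces can't be created" symptom is to compare the routers that should be hosted on a node with the qrouter-<router id> namespaces listed by 'ip netns' on that node. The following is only a minimal sketch: the function name is hypothetical, the router IDs are sample values, and the output of 'ip netns' is assumed to be captured separately.

```python
def missing_router_namespaces(router_ids, ip_netns_output):
    """Return the router IDs that have no qrouter-<id> namespace on the node."""
    present = set()
    for line in ip_netns_output.splitlines():
        fields = line.split()
        # Newer iproute2 versions may append "(id: N)", so only the first field is used.
        if fields and fields[0].startswith("qrouter-"):
            present.add(fields[0][len("qrouter-"):])
    return [rid for rid in router_ids if rid not in present]


# Sample 'ip netns' output in which one of the two routers has no namespace.
sample_output = """qrouter-8a74bfea-0d27-4b29-92a4-942ab6a552dd
haproxy
"""
expected_routers = [
    "8a74bfea-0d27-4b29-92a4-942ab6a552dd",
    "af4f6833-d01c-4e1c-b9b2-29e5c3f2b6e0",
]
print(missing_router_namespaces(expected_routers, sample_output))
```

A non-empty result while the agent is reported as alive would match the behaviour described in this report.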

How to fix:

Stop neutron-l3-agent via crm
Destroy all routers
Delete all agents with the command neutron agent-delete
Run crm resource start <l3 agent>
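
The steps above can be sketched as an ordered command plan. This is a dry-run helper under stated assumptions, not a tested recovery tool: the function name is hypothetical, the default resource name p_neutron-l3-agent is taken from a later comment in this thread, 'neutron router-delete' is assumed to be what "destroy all routers" means, and the router and agent IDs must be supplied by the operator. Nothing is executed; the commands are only printed for review.

```python
# Verb pairs used by the two cluster tools mentioned in this thread.
STOP_START_VERBS = {"crm": ("stop", "start"), "pcs": ("disable", "enable")}


def workaround_commands(router_ids, agent_ids,
                        resource="p_neutron-l3-agent", tool="crm"):
    """Build the workaround as a list of argv lists, in execution order."""
    stop, start = STOP_START_VERBS[tool]
    commands = [[tool, "resource", stop, resource]]
    commands += [["neutron", "router-delete", rid] for rid in router_ids]
    commands += [["neutron", "agent-delete", aid] for aid in agent_ids]
    commands.append([tool, "resource", start, resource])
    return commands


# Placeholder IDs; real values would come from the deployment.
for argv in workaround_commands(["ROUTER_ID"], ["AGENT_ID"]):
    print(" ".join(argv))
```

Passing tool="pcs" produces the 'pcs resource disable' and 'pcs resource enable' variants instead.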

Please let me know what information you need.

Tony Tarasov (atarasov) wrote :

I will update this bug with new information when I reproduce the issue again.

Tony Tarasov (atarasov) wrote :
Changed in fuel:
assignee: nobody → Sergey Vasilenko (xenolog)
Sergey Vasilenko (xenolog) wrote :

FYI: don't use the 'crm' utility.
It will be removed in 6.0.

Please use 'pcs' instead.

Changed in fuel:
importance: Undecided → Medium
tags: added: neutron
Andrew Woodward (xarses) wrote :

There are some AMQP connection traces at the bottom of the log and A LOT near the top.

It would be helpful to have the timestamps of the incident so that they can be correlated with the messages in the log.

The log also shows that the service restarted about a dozen times within a few minutes at the end of the log. Was this expected?

Please include the output of 'strace -p <pid> -s 2048 >log 2>&1', captured for about 1 minute while the agent is locked up and before restarting it. Also include the output of 'lsof | grep <pid>' and the output from top.
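
The requested diagnostics could be collected with a small helper such as the one below. It is only a sketch: the function name and output file prefix are hypothetical, 'timeout' is used to limit strace to roughly one minute, strace's -o option writes the trace to a file instead of relying on shell redirection, and 'lsof -p' and 'top -b -n 1 -p' are used as per-process equivalents of the commands requested above. The helper builds the commands without running them.

```python
def diagnostic_commands(pid, seconds=60, prefix="l3-agent"):
    """Return argv lists that capture strace, lsof and top data for one PID."""
    pid = str(pid)
    return [
        # Trace system calls for a limited time, writing the trace to a file.
        ["timeout", str(seconds), "strace", "-p", pid, "-s", "2048",
         "-o", prefix + ".strace"],
        # List the files and sockets opened by the process.
        ["lsof", "-p", pid],
        # Take a single batch-mode snapshot of the process in top.
        ["top", "-b", "-n", "1", "-p", pid],
    ]


# 12345 is a placeholder PID.
for argv in diagnostic_commands(12345):
    print(" ".join(argv))
```

Each list can be passed to subprocess.run on the controller where the hung agent lives, before the agent is restarted.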

Please include the neutron server log around the time of the next event (a few minutes before and after should be enough).

Changed in fuel:
status: New → Incomplete
Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Tony Tarasov (atarasov)
Mike Scherbakov (mihgen)
Changed in fuel:
milestone: none → 5.1
Tony Tarasov (atarasov) wrote :

Hello Andrew.
The timestamp is Friday, 22 Aug 2014, starting at 00:00.

The agent restarts are normal: we did a lot of troubleshooting.

Tony Tarasov (atarasov) wrote :

For now, the cloud works fine.

Nastya Urlapova (aurlapova) wrote :

Moved to MOS space, please triage.

Changed in mos:
assignee: nobody → Tony Tarasov (atarasov)
milestone: none → 5.1
no longer affects: fuel
Changed in mos:
assignee: Tony Tarasov (atarasov) → MOS Neutron (mos-neutron)
Changed in mos:
assignee: MOS Neutron (mos-neutron) → Andrey Epifanov (aepifanov)
assignee: Andrey Epifanov (aepifanov) → Eugene Nikanorov (enikanorov)
Eugene Nikanorov (enikanorov) wrote :

I see a number of issues with the VPN functionality that could lead to the described result.
So far I have been unable to identify the root cause.
If it's possible to get access to an environment with the issue, I would like to look at it myself.

Changed in mos:
importance: Undecided → Medium
Eugene Nikanorov (enikanorov) wrote :

Apparently the issue is caused by the agent-cleanup.py script, which removes the router interface from the qrouter namespace.

We're continuing to work on the issue.

tags: added: docs
Irina Povolotskaya (ipovolotskaya) wrote :

Meg, this information is for you:

Sometimes the external network connection and floating IPs may stop working, although the L3 agent shows no errors in its logs.
The current assumption is that the clean-up script is either launched at the wrong time or behaves incorrectly, so it deletes router interfaces that are needed for the connection to the external network.

There is no stable workaround yet.

As a temporary measure, recreate the router or reschedule the L3 agent.

Changed in mos:
assignee: Eugene Nikanorov (enikanorov) → Meg McRoberts (dreidellhasa)
Changed in mos:
assignee: Meg McRoberts (dreidellhasa) → Eugene Nikanorov (enikanorov)
Eugene Nikanorov (enikanorov) wrote :

Here's the cleanup script with more logging.

Dmitry Borodaenko (angdraug) wrote :

Target milestone moved to 6.0 since the bug priority is below High and we're in code freeze for 5.1.

Changed in mos:
milestone: 5.1 → 6.0
Changed in mos:
status: New → Triaged
Changed in mos:
status: Triaged → Confirmed
Tony Tarasov (atarasov) wrote :

Hey guys. This bug now affects one more of our customers.
Please feel free to ask the operations team about it. We need to fix it ASAP.

Pavel Vaylov (pvaylov) wrote :

But the steps to reproduce are still not clear.
One assumption: if we restart only the L3 agent, it is unable to run until the old metadata agents are restarted.

Bogdan Dobrelya (bogdando) wrote :

Raised to critical due to increased # of affected deployments and due to its operations-blocking behavior (actually, it prevents neutron from operating normally)

Miroslav Anashkin (manashkin) wrote :

This
https://review.openstack.org/#/c/125110/2
and this
https://review.openstack.org/#/c/125116/
should fix this bug for 5.1 and 6.0

It looks like quotation marks were missing where arguments were passed to the namespace, so only the first argument was used, which led to only a single namespace being created.
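
The effect of missing quotation marks can be illustrated in general terms. The sketch below is not the code from the reviews above (that code is not shown in this thread); it uses Python's shlex module to mimic shell word splitting, and the command name, helper name, and namespace names are all hypothetical.

```python
import shlex


def first_positional(command_line):
    """Return what a callee that reads only its first argument would see."""
    argv = shlex.split(command_line)
    return argv[1] if len(argv) > 1 else None


# Hypothetical value holding two namespace names in one variable.
namespaces = "qrouter-aaa qrouter-bbb"

# Without quotes the value is split into two words, like: cleanup $NAMESPACES
unquoted = "cleanup " + namespaces
# With quotes the value stays one argument, like: cleanup "$NAMESPACES"
quoted = "cleanup " + shlex.quote(namespaces)

print(first_positional(unquoted))
print(first_positional(quoted))
```

In the unquoted case the callee sees only the first name, which is consistent with only a single namespace being handled.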

Dmitry Mescheryakov (dmitrymex) wrote :

Dmitry Borodaenko, it is not clear whether the changes referenced by Miroslav fix the bug. I would prefer the Neutron team to confirm that, hence I am moving it back to Triaged for 5.1 and 6.0.

Ilya Shakhat (shakhat) wrote :

The fix is verified on:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1.1"
  api: "1.0"
  build_number: "27"
  build_id: "2014-10-13_00-01-06"
  astute_sha: "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13"
  fuellib_sha: "46ad455514614ec2600314ac80191e0539ddfc04"
  ostf_sha: "64cb59c681658a7a55cc2c09d079072a41beb346"
  nailgun_sha: "88a94a11426d356540722593af1603e5089d442c"
  fuelmain_sha: "431350ba204146f815f0e51dd47bf44569ae1f6d"

Scenario:
 1. Create Router
 2. Set gateway to external network
 3. Create network with subnets 11.0.0.0/24 and 111.0.0.0/24
 4. Create another network with subnet 12.0.0.0/24
 5. Plug networks into the router
 6. Start VM on any of the networks, check that VM acquired IP and ping to 8.8.8.8 works
 7. Check where L3 agent is running by command 'neutron agent-list'
 8. Verify network configuration: (L3 agent is running on node-1)

(.venv)developer@fuel:stack$ ssh node-1 ip netns
Warning: Permanently added 'node-1' (RSA) to the list of known hosts.
qrouter-8a74bfea-0d27-4b29-92a4-942ab6a552dd
haproxy
qrouter-af4f6833-d01c-4e1c-b9b2-29e5c3f2b6e0

(.venv)developer@fuel:stack$ ssh node-1 ip netns exec qrouter-8a74bfea-0d27-4b29-92a4-942ab6a552dd ip ro
Warning: Permanently added 'node-1' (RSA) to the list of known hosts.
172.18.161.0/24 dev qg-aed693c2-b9 proto kernel scope link src 172.18.161.205
11.0.0.0/24 dev qr-7af18f99-f1 proto kernel scope link src 11.0.0.1
12.0.0.0/24 dev qr-f3512f79-90 proto kernel scope link src 12.0.0.1
111.0.0.0/24 dev qr-1e578c6b-6d proto kernel scope link src 111.0.0.1
default via 172.18.161.1 dev qg-aed693c2-b9

 9. Ask pacemaker to disable-enable L3 agent:
pcs resource disable p_neutron-l3-agent
wait until it is stopped and then:
pcs resource enable p_neutron-l3-agent
 10. L3-agent is moved to other controller (say to node-2)
 11. Verify network configuration:

(.venv)developer@fuel:stack$ ssh node-2 ip netns
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
qrouter-8a74bfea-0d27-4b29-92a4-942ab6a552dd
qrouter-af4f6833-d01c-4e1c-b9b2-29e5c3f2b6e0
haproxy

(.venv)developer@fuel:stack$ ssh node-2 ip netns exec qrouter-8a74bfea-0d27-4b29-92a4-942ab6a552dd ip ro
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
172.18.161.0/24 dev qg-aed693c2-b9 proto kernel scope link src 172.18.161.205
11.0.0.0/24 dev qr-7af18f99-f1 proto kernel scope link src 11.0.0.1
12.0.0.0/24 dev qr-f3512f79-90 proto kernel scope link src 12.0.0.1
111.0.0.0/24 dev qr-1e578c6b-6d proto kernel scope link src 111.0.0.1
default via 172.18.161.1 dev qg-aed693c2-b9
 12. Verify that ping from VM still works
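
The comparison in steps 8 and 11 can also be done mechanically. The sketch below uses hypothetical helper names and assumes that the 'ip ro' output from each node has already been captured as text; it ignores the ssh warning lines and reports routes that exist on only one side. The sample text is abbreviated from the output above.

```python
def parse_routes(ip_route_output):
    """Return the set of route lines, skipping blanks and ssh warnings."""
    routes = set()
    for line in ip_route_output.splitlines():
        line = " ".join(line.split())  # normalise whitespace
        if not line or line.startswith("Warning:"):
            continue
        routes.add(line)
    return routes


def route_diff(before, after):
    """Return (routes lost after failover, routes gained after failover)."""
    old, new = parse_routes(before), parse_routes(after)
    return sorted(old - new), sorted(new - old)


node1 = """Warning: Permanently added 'node-1' (RSA) to the list of known hosts.
172.18.161.0/24 dev qg-aed693c2-b9 proto kernel scope link src 172.18.161.205
11.0.0.0/24 dev qr-7af18f99-f1 proto kernel scope link src 11.0.0.1
default via 172.18.161.1 dev qg-aed693c2-b9
"""
node2 = node1.replace("node-1", "node-2")

lost, gained = route_diff(node1, node2)
print(lost, gained)
```

Two empty lists mean that the router namespace on the new node carries the same routes as before the L3 agent was moved.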

Dmitry Borodaenko (angdraug) wrote :

Dmitry M: does Ilya's comment #19 confirm that the fixes listed by Miroslav in comment #17 do fix this problem? If so, can we get these ported to 5.0 and 6.0?

Dmitry Mescheryakov (dmitrymex) wrote :

Dmitry B: absolutely yes. Actually, Miroslav ported the fix to 5.0.x as well; we are just waiting for the Neutron team to confirm that it works there too. And of course the team is expected to port the fix to 6.0 as well.

Alexander Ignatov (aignatov) wrote :

Merged to master with the following fix: https://review.openstack.org/#/c/125110/2

Dmitry Borodaenko (angdraug) wrote :

Can you propose the 5.0.x version for stable/5.0?

Alexander Ignatov (aignatov) wrote :

I had a conversation with Sergey Vasilenko; he will send a CR to stable/5.0 for this issue soon.

Dmitry Borodaenko (angdraug) wrote :

It has been a month since the previous comment. Is there any update on a backport for stable/5.0?

Dmitry Mescheryakov (dmitrymex) wrote :

Dmitry, the fix for 5.0.3 was postponed because the 5.0.3 release itself was postponed.

Dmitry Borodaenko (angdraug) wrote :

We may have postponed the 5.0.3 release, but customers running 5.0.x who have this problem still need a patched package. Now that 6.0 is out, please backport this fix.

Ryan Moe (rmoe) wrote :

Fix verified on 4.1.1

 'astute_sha': '55df06b2e84fa5d71a1cc0e78dbccab5db29d968',
 'build_id': '2015-01-09_18-40-26',
 'build_number': '5',
 'fuellib_sha': '469ed82eae57bd85939678054dfa0260e8dbf895',
 'fuelmain_sha': 'fa218a36d2686de6bb36ef8c6b33526c9d802e34',
 'mirantis': 'yes',
 'nailgun_sha': '8cc1dff8ab43f50bd28a9c081d7d06fbd3831b98',
 'ostf_sha': 'f4f15b4d98459650c1945b0efc30290a619be824',
 'release': '4.1.1'

Backport here: https://review.openstack.org/#/c/146994/

Sergey Kolekonov (skolekonov) wrote :

The backport to stable/5.0 is here https://review.openstack.org/#/c/147095/
