Hangs of Neutron L3 agent

Bug #1361710 reported by Tony Tarasov on 2014-08-26
This bug affects 3 people
Affects                   Importance  Assigned to
Mirantis OpenStack        Critical    Ilya Shakhat
  4.1.x                   Critical    Ryan Moe
  5.0.x                   Critical    MOS Maintenance
  5.1.x                   Critical    Ilya Shakhat
  6.0.x                   Critical    Ilya Shakhat

Bug Description

Hello guys.

We have successfully deployed a customer cloud:
Fuel 5.0.1, a custom build with Zabbix.
It runs in HA mode with the FWaaS (L3), LBaaS, and VPNaaS (L3) add-ons.

Sometimes neutron-l3-agent hangs.

There are no issues and no errors in the log files; the agent simply stops working.

What we see (a quick way to check these symptoms is sketched below):
The internal interfaces inside the router are in the DOWN state.
The external gateway is not accessible from outside.
No floating IP is accessible from outside.
Namespaces cannot be created.
The L3 agent status is fine.
The service statuses are fine.
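
A quick way to check these symptoms from the affected controller (a sketch; the router ID and IP addresses are placeholders, and p_neutron-l3-agent is the resource name used later in this thread):

# router namespaces present on this node
ip netns

# interfaces stuck in DOWN state inside a router namespace
ip netns exec qrouter-<router-id> ip -o link | grep DOWN

# external gateway and floating IP reachability (run from outside the cloud)
ping -c 3 <external-gateway-ip>
ping -c 3 <floating-ip>

# agent and service status report healthy despite the hang
neutron agent-list
pcs status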

How to fix (a concrete version of these steps is sketched below):

Stop neutron-l3-agent via crm.
Destroy all routers.
Delete all agents with the 'neutron agent-delete' command.
Run 'crm resource start <l3 agent>'.
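
A concrete version of this procedure (a sketch: p_neutron-l3-agent is the pacemaker resource name used later in this thread, the IDs are placeholders, and gateways and interfaces must be detached before a router can be deleted):

# stop the L3 agent resource in pacemaker
crm resource stop p_neutron-l3-agent

# detach and destroy each router (repeat per router and subnet)
neutron router-gateway-clear <router-id>
neutron router-interface-delete <router-id> <subnet-id>
neutron router-delete <router-id>

# remove the stale agent records
neutron agent-list
neutron agent-delete <agent-id>

# start the resource again
crm resource start p_neutron-l3-agent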

Please advise which information you need.

Tony Tarasov (atarasov) wrote :

I will update this bug with new info when I reproduce it again.

Tony Tarasov (atarasov) wrote :
Changed in fuel:
assignee: nobody → Sergey Vasilenko (xenolog)
Sergey Vasilenko (xenolog) wrote :

FYI: don't use the 'crm' utility.
The crm utility will be removed in 6.0.

Please use 'pcs' instead (equivalents sketched below).
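
For reference, the pcs equivalents of the crm commands used in this thread (a sketch; the resource name is taken from the verification steps below):

# crm resource stop/start p_neutron-l3-agent becomes:
pcs resource disable p_neutron-l3-agent
pcs resource enable p_neutron-l3-agent

# crm status becomes:
pcs status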

Changed in fuel:
importance: Undecided → Medium
tags: added: neutron
Andrew Woodward (xarses) wrote :

There are some AMQP connection traces at the bottom of the log and a lot near the top.

It would be helpful to have some timestamps of the incident to correlate with the messages in the log.

The log also shows that the service restarted about a dozen times within a few minutes at the end of the log. Was this expected?

Please include the output of 'strace -p <pid> -s 2048 > log 2>&1' for ~1 minute when it locks up, before restarting it. Also include the output of 'lsof | grep <pid>' and the output from top (a consolidated sketch follows).

Please include the neutron server log around the time of the next event (a few minutes before and after should work).
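
One way to collect all of the requested data in a single pass (a sketch; note that strace writes its trace to stderr, so '-o' is the simplest way to capture it):

# pid of the hung neutron-l3-agent process
PID=$(pgrep -f neutron-l3-agent | head -1)

# ~1 minute of syscall activity while the agent is locked up
timeout 60 strace -p "$PID" -s 2048 -o /tmp/l3-agent.strace

# open files and sockets of the process
lsof -p "$PID" > /tmp/l3-agent.lsof

# one batch-mode snapshot of top
top -b -n 1 > /tmp/l3-agent.top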

Changed in fuel:
status: New → Incomplete
Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Tony Tarasov (atarasov)
Mike Scherbakov (mihgen) on 2014-08-29
Changed in fuel:
milestone: none → 5.1
Tony Tarasov (atarasov) wrote :

Hello Andrew.
See the timestamp: Friday, 22 Aug 2014, starting at 00:00.

As for the agent restarts, that's normal; we did a lot of troubleshooting.

Tony Tarasov (atarasov) wrote :

For now, the cloud works fine.

Nastya Urlapova (aurlapova) wrote :

Moved to MOS space, please triage.

Changed in mos:
assignee: nobody → Tony Tarasov (atarasov)
milestone: none → 5.1
no longer affects: fuel
Changed in mos:
assignee: Tony Tarasov (atarasov) → MOS Neutron (mos-neutron)
Changed in mos:
assignee: MOS Neutron (mos-neutron) → Andrey Epifanov (aepifanov)
assignee: Andrey Epifanov (aepifanov) → Eugene Nikanorov (enikanorov)
Eugene Nikanorov (enikanorov) wrote :

I see a number of issues with the VPN functionality that could lead to the described result.
So far I have been unable to identify the root cause.
If it's possible to get access to an environment with the issue, I will look at it myself.

Changed in mos:
importance: Undecided → Medium
Eugene Nikanorov (enikanorov) wrote :

Apparently the issue is caused by the agent-cleanup.py script, which removes router interfaces from the qrouter namespace.

We're continuing to work on the issue.

tags: added: docs

Meg, this information is for you:

Sometimes the external network connection and floating IPs may stop working, even though the L3 agent shows no errors in its logs.
It is currently assumed that this is somehow caused by the clean-up script launching or behaving incorrectly, so that it deletes the router interfaces that provide connectivity to the external network.

There is no stable workaround yet.

A temporary way of getting rid of this bug is to recreate the router or reschedule the L3 agent (sketched below).
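
A sketch of the reschedule workaround using the neutron CLI of that era (the agent and router IDs are placeholders):

# find the L3 agent currently hosting the router
neutron l3-agent-list-hosting-router <router-id>

# move the router from the hung agent to a healthy one
neutron l3-agent-router-remove <hung-agent-id> <router-id>
neutron l3-agent-router-add <healthy-agent-id> <router-id>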

Changed in mos:
assignee: Eugene Nikanorov (enikanorov) → Meg McRoberts (dreidellhasa)
Changed in mos:
assignee: Meg McRoberts (dreidellhasa) → Eugene Nikanorov (enikanorov)
Eugene Nikanorov (enikanorov) wrote :

Here's the cleanup script with more logging.

Dmitry Borodaenko (angdraug) wrote :

Target milestone moved to 6.0 since the bug priority is below High and we're in code freeze for 5.1.

Changed in mos:
milestone: 5.1 → 6.0
Changed in mos:
status: New → Triaged
Changed in mos:
status: Triaged → Confirmed
Tony Tarasov (atarasov) wrote :

Hey guys. This bug affects one more of our customers.
Please feel free to ask the operations team about it. We need to fix it ASAP.

Pavel Vaylov (pvaylov) wrote :

But the steps to reproduce are still not clear.
One assumption: if we restart only the L3 agent, it is unable to run until the old metadata agents are restarted.

Bogdan Dobrelya (bogdando) wrote :

Raised to Critical due to the increased number of affected deployments and its operations-blocking behavior (it actually prevents Neutron from operating normally).

Miroslav Anashkin (manashkin) wrote :

This
https://review.openstack.org/#/c/125110/2
and this
https://review.openstack.org/#/c/125116/
should fix this bug for 5.1 and 6.0.

It looks like quotation marks were forgotten when passing arguments to the namespace, so only the first argument was used, which led to only a single namespace being created (illustrated below).
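
To illustrate the pitfall (a hypothetical shell reduction, not the actual patch; see the reviews above for the real changes):

# broken: $namespaces word-splits when expanded unquoted, and a callee
# that only reads $1 silently processes just the first namespace
create_in_namespace() { echo "creating in $1"; }
namespaces="qrouter-aaa qrouter-bbb qrouter-ccc"
create_in_namespace $namespaces    # only qrouter-aaa is handled

# fixed: pass each namespace name explicitly, quoted
for ns in $namespaces; do
    create_in_namespace "$ns"
done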

Dmitry Borodaenko, it is not clear whether the changes referenced by Miroslav fix the bug. I would prefer the Neutron team to confirm that, hence moving back to Triaged for 5.1 and 6.0.

Ilya Shakhat (shakhat) wrote :

The fix is verified on:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1.1"
  api: "1.0"
  build_number: "27"
  build_id: "2014-10-13_00-01-06"
  astute_sha: "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13"
  fuellib_sha: "46ad455514614ec2600314ac80191e0539ddfc04"
  ostf_sha: "64cb59c681658a7a55cc2c09d079072a41beb346"
  nailgun_sha: "88a94a11426d356540722593af1603e5089d442c"
  fuelmain_sha: "431350ba204146f815f0e51dd47bf44569ae1f6d"

Scenario (a CLI sketch of steps 1-5 follows the scenario):
 1. Create a router
 2. Set the gateway to the external network
 3. Create a network with subnets 11.0.0.0/24 and 111.0.0.0/24
 4. Create another network with subnet 12.0.0.0/24
 5. Plug both networks into the router
 6. Start a VM on any of the networks; check that the VM acquired an IP and that ping to 8.8.8.8 works
 7. Check where the L3 agent is running with 'neutron agent-list'
 8. Verify the network configuration (the L3 agent is running on node-1):

(.venv)developer@fuel:stack$ ssh node-1 ip netns
Warning: Permanently added 'node-1' (RSA) to the list of known hosts.
qrouter-8a74bfea-0d27-4b29-92a4-942ab6a552dd
haproxy
qrouter-af4f6833-d01c-4e1c-b9b2-29e5c3f2b6e0

(.venv)developer@fuel:stack$ ssh node-1 ip netns exec qrouter-8a74bfea-0d27-4b29-92a4-942ab6a552dd ip ro
Warning: Permanently added 'node-1' (RSA) to the list of known hosts.
172.18.161.0/24 dev qg-aed693c2-b9 proto kernel scope link src 172.18.161.205
11.0.0.0/24 dev qr-7af18f99-f1 proto kernel scope link src 11.0.0.1
12.0.0.0/24 dev qr-f3512f79-90 proto kernel scope link src 12.0.0.1
111.0.0.0/24 dev qr-1e578c6b-6d proto kernel scope link src 111.0.0.1
default via 172.18.161.1 dev qg-aed693c2-b9

 9. Ask Pacemaker to disable and then re-enable the L3 agent:
pcs resource disable p_neutron-l3-agent
wait until it is stopped, and then:
pcs resource enable p_neutron-l3-agent
 10. The L3 agent is moved to another controller (say, node-2)
 11. Verify the network configuration:

(.venv)developer@fuel:stack$ ssh node-2 ip netns
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
qrouter-8a74bfea-0d27-4b29-92a4-942ab6a552dd
qrouter-af4f6833-d01c-4e1c-b9b2-29e5c3f2b6e0
haproxy

(.venv)developer@fuel:stack$ ssh node-2 ip netns exec qrouter-8a74bfea-0d27-4b29-92a4-942ab6a552dd ip ro
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
172.18.161.0/24 dev qg-aed693c2-b9 proto kernel scope link src 172.18.161.205
11.0.0.0/24 dev qr-7af18f99-f1 proto kernel scope link src 11.0.0.1
12.0.0.0/24 dev qr-f3512f79-90 proto kernel scope link src 12.0.0.1
111.0.0.0/24 dev qr-1e578c6b-6d proto kernel scope link src 111.0.0.1
default via 172.18.161.1 dev qg-aed693c2-b9
 12. Verify that ping from the VM still works
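
For completeness, steps 1-5 in neutron CLI form (a sketch; the names and the external network ID are placeholders):

neutron router-create r1
neutron router-gateway-set r1 <external-net-id>
neutron net-create net1
neutron subnet-create net1 11.0.0.0/24
neutron subnet-create net1 111.0.0.0/24
neutron net-create net2
neutron subnet-create net2 12.0.0.0/24
neutron router-interface-add r1 <subnet-id>    # repeat for each subnet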

Dmitry Borodaenko (angdraug) wrote :

Dmitry M: does Ilya's comment #19 confirm that the fixes listed by Miroslav in comment #17 fix this problem? If so, can we get them ported to 5.0 and 6.0?

Dmitry B: absolutely, yes. Miroslav has actually already ported the fix to 5.0.x; we are just waiting for the Neutron team to confirm that it works there as well. And the team is of course expected to port the fix to 6.0 too.

Alexander Ignatov (aignatov) wrote :

Merged to master by the following fix https://review.openstack.org/#/c/125110/2

Dmitry Borodaenko (angdraug) wrote :

Can you propose the 5.0.x version for stable/5.0?

Alexander Ignatov (aignatov) wrote :

Had a conversation with Sergey Vasilenko; he will send a CR to stable/5.0 for this issue soon.

Dmitry Borodaenko (angdraug) wrote :

It has been a month since the previous comment; any update on the backport for stable/5.0?

Dmitry, the fix for 5.0.3 was postponed because the 5.0.3 release itself was postponed.

Dmitry Borodaenko (angdraug) wrote :

We may have postponed the 5.0.3 release, but customers running 5.0.x who have this problem still need a patched package. Now that 6.0 is out, please backport this fix.

Ryan Moe (rmoe) wrote :

Fix verified on 4.1.1

'astute_sha': '55df06b2e84fa5d71a1cc0e78dbccab5db29d968',
 'build_id': '2015-01-09_18-40-26',
 'build_number': '5',
 'fuellib_sha': '469ed82eae57bd85939678054dfa0260e8dbf895',
 'fuelmain_sha': 'fa218a36d2686de6bb36ef8c6b33526c9d802e34',
 'mirantis': 'yes',
 'nailgun_sha': '8cc1dff8ab43f50bd28a9c081d7d06fbd3831b98',
 'ostf_sha': 'f4f15b4d98459650c1945b0efc30290a619be824',
 'release': '4.1.1'

Backport here: https://review.openstack.org/#/c/146994/

Sergey Kolekonov (skolekonov) wrote :

The backport to stable/5.0 is here https://review.openstack.org/#/c/147095/
