Many Tracebacks for Ubuntu: Failed rescheduling router <...>: no eligible l3 agent found

Bug #1434568 reported by Anastasia Palkina
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Neutron

Bug Description

"build_id": "2015-03-19_22-54-44",
"ostf_sha": "b9a090c71682fbea5d9351051827d7d654d07be3",
"build_number": "210",
"release_versions": {"2014.2-6.1": {"VERSION": {"build_id": "2015-03-19_22-54-44", "ostf_sha": "b9a090c71682fbea5d9351051827d7d654d07be3", "build_number": "210", "api": "1.0", "nailgun_sha": "1d2bd383caecc5ec3f86bf93ccca940326f23e97", "production": "docker", "python-fuelclient_sha": "b223dcaf5fdad2f714cd245958fefe03995d6207", "astute_sha": "4a117a1ca6bdcc34fe4d086959ace1a6d18eeca9", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "f3d6353c08d8eb709c7ab100b56dc2bebef4157f", "fuellib_sha": "7764225db5bc653563309912afbb4058283c808b"}}},
"auth_required": true,
"api": "1.0",
"nailgun_sha": "1d2bd383caecc5ec3f86bf93ccca940326f23e97",
"production": "docker",
"python-fuelclient_sha": "b223dcaf5fdad2f714cd245958fefe03995d6207",
"astute_sha": "4a117a1ca6bdcc34fe4d086959ace1a6d18eeca9",
"feature_groups": ["mirantis"],
"release": "6.1",
"fuelmain_sha": "f3d6353c08d8eb709c7ab100b56dc2bebef4157f",
"fuellib_sha": "7764225db5bc653563309912afbb4058283c808b"

1. Create a new environment (Ubuntu)
2. Choose Neutron with VLAN segmentation
3. Add 1 controller+mongo, 1 compute, 2 mongo
4. Start deployment. It was successful.
5. Start OSTF tests. All of them passed except "Check network connectivity from instance via floating IP", which failed on the step "Check connectivity to the floating IP using ping command" (a minimal sketch of such a check appears below, after the log excerpt).

6. There are many errors in the neutron-server log on node-5:

2015-03-20 13:20:19 ERR

neutron.db.l3_agentschedulers_db [-] Failed to reschedule router 781dc01e-3c3b-4e4c-b875-6298bee4c9da
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db Traceback (most recent call last):
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 111, in reschedule_routers_from_down_agents
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db self.reschedule_router(context, binding.router_id)
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 248, in reschedule_router
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db router_id=router_id)
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db RouterReschedulingFailed: Failed rescheduling router 781dc01e-3c3b-4e4c-b875-6298bee4c9da: no eligible l3 agent found.
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db
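
For context, this traceback comes from Neutron's periodic task that moves routers off L3 agents that have stopped reporting state. The following is a heavily simplified, self-contained sketch of that code path, not the actual Neutron source; the real logic lives in neutron/db/l3_agentschedulers_db.py (see the traceback above), and the helper arguments here are illustrative:

import logging

LOG = logging.getLogger(__name__)


class RouterReschedulingFailed(Exception):
    """Raised when no alive L3 agent can host the router."""


def reschedule_router(context, router_id, alive_agents):
    # Stands in for Neutron's _unbind_router() + schedule_router();
    # here we simply pick any agent that is still alive.
    if not alive_agents:
        raise RouterReschedulingFailed(
            'Failed rescheduling router %s: no eligible l3 agent found.'
            % router_id)
    return alive_agents[0]


def reschedule_routers_from_down_agents(context, down_router_ids, alive_agents):
    # Routers bound to agents past their heartbeat window are candidates.
    for router_id in down_router_ids:
        try:
            reschedule_router(context, router_id, alive_agents)
        except RouterReschedulingFailed:
            # Exactly the error in the log: every L3 agent on the single
            # controller was dead, so there was nowhere to move the router.
            LOG.exception('Failed to reschedule router %s', router_id)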

In the same test case on CentOS there are no errors in this log (node-11).
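
For reference, the OSTF step that failed (step 5 above) boils down to pinging the instance's floating IP. A minimal sketch of such a check; the address, attempt count, and timeout are illustrative, not OSTF's actual values:

import subprocess


def floating_ip_pingable(ip, attempts=10, timeout_s=5):
    """Return True once the floating IP answers an ICMP echo request."""
    for _ in range(attempts):
        # -c 1: send a single echo request; -W: reply timeout in seconds
        if subprocess.call(['ping', '-c', '1', '-W', str(timeout_s), ip]) == 0:
            return True
    return False


# Illustrative address only; the real test pings the IP Neutron allocated.
print(floating_ip_pingable('172.16.0.130'))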

Logs are here: https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxaNjl3NE5VTTF6WTA/view?usp=sharing

Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Neutron (mos-neutron)
Sergey Kolekonov (skolekonov) wrote:

I've found a lot of timeout errors in lrmd.log (from node-5, the primary controller):

1689:2015-03-20T11:38:13.806657+00:00 warning: warning: child_timeout_callback: p_neutron-l3-agent_monitor_20000 process (PID 28131) timed out
1690:2015-03-20T11:38:13.807663+00:00 warning: warning: operation_finished: p_neutron-l3-agent_monitor_20000:28131 - timed out after 10000ms
1730:2015-03-20T11:39:34.548347+00:00 warning: warning: operation_finished: p_ceilometer-alarm-evaluator_monitor_20000:28731 - timed out after 30000ms
1731:2015-03-20T11:39:35.420902+00:00 warning: warning: child_timeout_callback: p_heat-engine_monitor_20000 process (PID 28751) timed out

This means the environment was running extremely slowly, and the monitor operations of many Pacemaker resources were unable to complete in time. As a result, the Neutron agents were restarted several times.
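
A monitor operation that exceeds its timeout is treated by Pacemaker the same as a failure, which triggers recovery (a restart) of the resource. A hedged sketch of that behavior; the 10-second timeout mirrors the 10000ms value in the log above, and the command is a placeholder for the OCF resource agent's 'monitor' action:

import subprocess


def monitor_resource(agent_cmd, timeout_s=10):
    """Pacemaker-style monitor: a status check that times out is treated
    like a failure, triggering recovery of the resource -- which is what
    happened to the Neutron agents here."""
    try:
        subprocess.run(agent_cmd, check=True, timeout=timeout_s)
        return 'running'
    except subprocess.TimeoutExpired:
        return 'timed out'   # lrmd logs child_timeout_callback for this case
    except subprocess.CalledProcessError:
        return 'failed'


# Placeholder command standing in for a real resource agent invocation.
print(monitor_resource(['true']))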

There was only one controller, so when Neutron stopped receiving state reports from the L3 agent, there was no other agent to reschedule the router to, and connectivity for instances could not be restored.
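
Neutron decides an agent is dead purely from the age of its last state report: if no heartbeat arrives within agent_down_time, the agent's routers become candidates for rescheduling, but only onto agents that are still alive. A minimal sketch of that liveness rule; 75 seconds is Neutron's documented default for agent_down_time, but treat the value as an assumption here:

from datetime import datetime, timedelta

AGENT_DOWN_TIME = timedelta(seconds=75)  # neutron.conf agent_down_time


def is_agent_alive(last_heartbeat, now=None):
    """An agent whose last state report is older than agent_down_time is
    considered dead; with a single controller, no alive agent remains and
    RouterReschedulingFailed is raised (as in the traceback above)."""
    now = now or datetime.utcnow()
    return now - last_heartbeat <= AGENT_DOWN_TIME


# Example: a heartbeat from two minutes ago means the agent is 'dead'.
print(is_agent_alive(datetime.utcnow() - timedelta(minutes=2)))  # False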

Also from pacemakerd.log:

2015-03-20T11:40:07.896950+00:00 err: error: child_waitpid: Managed process 23705 (lrmd) dumped core
2015-03-20T11:40:07.896950+00:00 notice: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=23705, core=1)
2015-03-20T11:40:07.896950+00:00 notice: notice: pcmk_process_exit: Respawning failed child process: lrmd
2015-03-20T11:40:07.929236+00:00 err: error: pcmk_child_exit: Child process crmd (23708) exited: Generic Pacemaker error (201)
2015-03-20T11:40:07.932405+00:00 notice: notice: pcmk_process_exit: Respawning failed child process: crmd

Please also specify the characteristics of the cluster nodes (VMs).

Sergey Kolekonov (skolekonov) wrote:

This test also passes on the latest ISO (#281), so I'm moving this bug to Invalid.

Changed in fuel:
status: Confirmed → Invalid