Many Tracebacks for Ubuntu: Failed rescheduling router <...>: no eligible l3 agent found

Bug #1434568 reported by Anastasia Palkina
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Neutron

Bug Description

"build_id": "2015-03-19_22-54-44",
"ostf_sha": "b9a090c71682fbea5d9351051827d7d654d07be3",
"build_number": "210",
"release_versions": {"2014.2-6.1": {"VERSION": {"build_id": "2015-03-19_22-54-44", "ostf_sha": "b9a090c71682fbea5d9351051827d7d654d07be3", "build_number": "210", "api": "1.0", "nailgun_sha": "1d2bd383caecc5ec3f86bf93ccca940326f23e97", "production": "docker", "python-fuelclient_sha": "b223dcaf5fdad2f714cd245958fefe03995d6207", "astute_sha": "4a117a1ca6bdcc34fe4d086959ace1a6d18eeca9", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "f3d6353c08d8eb709c7ab100b56dc2bebef4157f", "fuellib_sha": "7764225db5bc653563309912afbb4058283c808b"}}},
"auth_required": true,
"api": "1.0",
"nailgun_sha": "1d2bd383caecc5ec3f86bf93ccca940326f23e97",
"production": "docker",
"python-fuelclient_sha": "b223dcaf5fdad2f714cd245958fefe03995d6207",
"astute_sha": "4a117a1ca6bdcc34fe4d086959ace1a6d18eeca9",
"feature_groups": ["mirantis"],
"release": "6.1",
"fuelmain_sha": "f3d6353c08d8eb709c7ab100b56dc2bebef4157f",
"fuellib_sha": "7764225db5bc653563309912afbb4058283c808b"

1. Create a new environment (Ubuntu)
2. Choose Neutron with VLAN segmentation
3. Add 1 controller+mongo, 1 compute, 2 mongo
4. Start deployment. It was successful.
5. Start OSTF tests. All of them passed except "Check network connectivity from instance via floating IP", which failed on the step "Check connectivity to the floating IP using ping command" (a minimal sketch of such a check appears below, after the log excerpt).

6. There are many errors in the neutron-server log on node-5:

2015-03-20 13:20:19 ERR

neutron.db.l3_agentschedulers_db [-] Failed to reschedule router 781dc01e-3c3b-4e4c-b875-6298bee4c9da
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db Traceback (most recent call last):
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 111, in reschedule_routers_from_down_agents
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db self.reschedule_router(context, binding.router_id)
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db File "/usr/lib/python2.7/dist-packages/neutron/db/l3_agentschedulers_db.py", line 248, in reschedule_router
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db router_id=router_id)
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db RouterReschedulingFailed: Failed rescheduling router 781dc01e-3c3b-4e4c-b875-6298bee4c9da: no eligible l3 agent found.
2015-03-20 13:20:19.591 30741 TRACE neutron.db.l3_agentschedulers_db
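
For context, this traceback comes from Neutron's periodic task that moves routers off L3 agents that have stopped reporting state. The following is a heavily simplified, self-contained sketch of that code path, not the actual Neutron source; the real logic lives in neutron/db/l3_agentschedulers_db.py (see the traceback above), and the helper arguments here are illustrative:

import logging

LOG = logging.getLogger(__name__)


class RouterReschedulingFailed(Exception):
    """Raised when no alive L3 agent can host the router."""


def reschedule_router(context, router_id, alive_agents):
    # Stands in for Neutron's _unbind_router() + schedule_router();
    # here we simply pick any agent that is still alive.
    if not alive_agents:
        raise RouterReschedulingFailed(
            'Failed rescheduling router %s: no eligible l3 agent found.'
            % router_id)
    return alive_agents[0]


def reschedule_routers_from_down_agents(context, down_router_ids, alive_agents):
    # Routers bound to agents past their heartbeat window are candidates.
    for router_id in down_router_ids:
        try:
            reschedule_router(context, router_id, alive_agents)
        except RouterReschedulingFailed:
            # Exactly the error in the log: every L3 agent on the single
            # controller was dead, so there was nowhere to move the router.
            LOG.exception('Failed to reschedule router %s', router_id)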

In the same test case on CentOS there are no errors in this log (node-11).
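
For reference, the OSTF step that failed (step 5 above) boils down to pinging the instance's floating IP. A minimal sketch of such a check; the address, attempt count, and timeout are illustrative, not OSTF's actual values:

import subprocess


def floating_ip_pingable(ip, attempts=10, timeout_s=5):
    """Return True once the floating IP answers an ICMP echo request."""
    for _ in range(attempts):
        # -c 1: send a single echo request; -W: reply timeout in seconds
        if subprocess.call(['ping', '-c', '1', '-W', str(timeout_s), ip]) == 0:
            return True
    return False


# Illustrative address only; the real test pings the IP Neutron allocated.
print(floating_ip_pingable('172.16.0.130'))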

Logs are here: https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxaNjl3NE5VTTF6WTA/view?usp=sharing

Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Neutron (mos-neutron)
Sergey Kolekonov (skolekonov) wrote:

I've found a lot of timeout errors in lrmd.log (from node-5, the primary controller):

1689:2015-03-20T11:38:13.806657+00:00 warning: warning: child_timeout_callback: p_neutron-l3-agent_monitor_20000 process (PID 28131) timed out
1690:2015-03-20T11:38:13.807663+00:00 warning: warning: operation_finished: p_neutron-l3-agent_monitor_20000:28131 - timed out after 10000ms
1730:2015-03-20T11:39:34.548347+00:00 warning: warning: operation_finished: p_ceilometer-alarm-evaluator_monitor_20000:28731 - timed out after 30000ms
1731:2015-03-20T11:39:35.420902+00:00 warning: warning: child_timeout_callback: p_heat-engine_monitor_20000 process (PID 28751) timed out

This means the environment was running extremely slowly, and the monitor operations of many Pacemaker resources were unable to complete in time. As a result, the Neutron agents were restarted several times.
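
A monitor operation that exceeds its timeout is treated by Pacemaker the same as a failure, which triggers recovery (a restart) of the resource. A hedged sketch of that behavior; the 10-second timeout mirrors the 10000ms value in the log above, and the command is a placeholder for the OCF resource agent's 'monitor' action:

import subprocess


def monitor_resource(agent_cmd, timeout_s=10):
    """Pacemaker-style monitor: a status check that times out is treated
    like a failure, triggering recovery of the resource -- which is what
    happened to the Neutron agents here."""
    try:
        subprocess.run(agent_cmd, check=True, timeout=timeout_s)
        return 'running'
    except subprocess.TimeoutExpired:
        return 'timed out'   # lrmd logs child_timeout_callback for this case
    except subprocess.CalledProcessError:
        return 'failed'


# Placeholder command standing in for a real resource agent invocation.
print(monitor_resource(['true']))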

There was only one controller, so when Neutron stopped receiving state reports from the L3 agent, there was no other agent to reschedule the router to, and connectivity for instances could not be restored.
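
Neutron decides an agent is dead purely from the age of its last state report: if no heartbeat arrives within agent_down_time, the agent's routers become candidates for rescheduling, but only onto agents that are still alive. A minimal sketch of that liveness rule; 75 seconds is Neutron's documented default for agent_down_time, but treat the value as an assumption here:

from datetime import datetime, timedelta

AGENT_DOWN_TIME = timedelta(seconds=75)  # neutron.conf agent_down_time


def is_agent_alive(last_heartbeat, now=None):
    """An agent whose last state report is older than agent_down_time is
    considered dead; with a single controller, no alive agent remains and
    RouterReschedulingFailed is raised (as in the traceback above)."""
    now = now or datetime.utcnow()
    return now - last_heartbeat <= AGENT_DOWN_TIME


# Example: a heartbeat from two minutes ago means the agent is 'dead'.
print(is_agent_alive(datetime.utcnow() - timedelta(minutes=2)))  # False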

Also from pacemakerd.log:

2015-03-20T11:40:07.896950+00:00 err: error: child_waitpid: Managed process 23705 (lrmd) dumped core
2015-03-20T11:40:07.896950+00:00 notice: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=23705, core=1)
2015-03-20T11:40:07.896950+00:00 notice: notice: pcmk_process_exit: Respawning failed child process: lrmd
2015-03-20T11:40:07.929236+00:00 err: error: pcmk_child_exit: Child process crmd (23708) exited: Generic Pacemaker error (201)
2015-03-20T11:40:07.932405+00:00 notice: notice: pcmk_process_exit: Respawning failed child process: crmd

Please also specify the characteristics of the cluster nodes (VMs).

Sergey Kolekonov (skolekonov) wrote:

This test also passes on the latest ISO (#281), so I'm moving this bug to Invalid.

Changed in fuel:
status: Confirmed → Invalid