Router resources lost after rescheduling

Bug #1481739 reported by Ilya Shakhat
This bug affects 1 person
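+--------------------+--------------+------------+---------------+-----------+
| Affects            | Status       | Importance | Assigned to   | Milestone |
+--------------------+--------------+------------+---------------+-----------+
| Mirantis OpenStack | Fix Released | High       | Oleg Bondarev |           |
| 7.0.x              | Fix Released | High       | Oleg Bondarev |           |
| 8.0.x              | Fix Released | High       | Oleg Bondarev |           |
+--------------------+--------------+------------+---------------+-----------+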

Bug Description

Upstream bug: https://bugs.launchpad.net/neutron/+bug/1482630

The environment contains a router with a network plugged in. The router is scheduled to one of the controllers. After that controller is suspended, the router is rescheduled in the DB, but the corresponding system resources are not created on the new controller.

Steps to reproduce:
1. Deploy an environment with 3 controllers.
2. Create a router with a network plugged in and set the gateway.
3. Find the controller where the router is scheduled.
4. Suspend the controller host (or do a hard power-off in the case of a HW node).
5. Wait for Rabbit and the DB to come back, then check that the Neutron agents on the alive controllers report their state successfully.
6. Find where the router is now scheduled by running neutron l3-agent-list-hosting-router <router>
7. Check that the resources are created, using the ip netns command.
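
The check in step 7 can be scripted. The following is a minimal sketch, not part of the original report. It assumes the L3 agent names each router namespace `qrouter-<router id>` (the usual Neutron convention, which this report does not state explicitly). The function names are hypothetical.

```python
import subprocess


def namespaces(ip_netns_output):
    """Return the namespace names found in `ip netns` output.

    Newer iproute2 releases append " (id: N)" to each line, so only the
    first whitespace-separated token of every non-empty line is kept.
    """
    return [line.split()[0]
            for line in ip_netns_output.splitlines() if line.strip()]


def router_namespace_present(ip_netns_output, router_id):
    # Assumption: the namespace is named "qrouter-<router id>".
    return "qrouter-" + router_id in namespaces(ip_netns_output)


def check_local_host(router_id):
    # Runs `ip netns` on the current host; this usually requires root.
    output = subprocess.check_output(["ip", "netns"]).decode()
    return router_namespace_present(output, router_id)
```

Running `check_local_host(<router id>)` on the controller reported by step 6 should return True when the router resources exist there.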

The issue reproduces intermittently; it is more likely to occur on slower environments.

Tags: neutron
Ilya Shakhat (shakhat) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "773"
  build_id: "2015-07-31_10-53-55"
  nailgun_sha: "3f278def07cb6d957725b3ae27666d6edb918fdb"
  python-fuelclient_sha: "71bb8fa87ee25f0c1bb84317884da7c917902a63"
  fuel-agent_sha: "b27076b8d87719507bd0f7c225858a64799c5be7"
  fuel-nailgun-agent_sha: "1512b9af6b41cc95c4d891c593aeebe0faca5a63"
  astute_sha: "2caf0c0b192a5bb6868d15b047388f1b4b016b0d"
  fuel-library_sha: "8b93bbcc2627e19f3a7677f3f33a5f4c1f756468"
  fuel-ostf_sha: "f73608d0e31d0a02a7fad6c048e931b7deef148e"
  fuelmain_sha: "e29f2061590846e4628aa834e59518268c0bb6c1"

Ilya Shakhat (shakhat) wrote :

In the repro, the router 79075920-c56e-4c5e-b0e5-1783e88ccb21 was originally on node-3. After rescheduling:
root@node-4:~# neutron l3-agent-list-hosting-router the_router
+--------------------------------------+-------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+-------------------+----------------+-------+----------+
| cf54f16a-b791-48f1-977c-d88b493bce1b | node-4.domain.tld | True | :-) | |
+--------------------------------------+-------------------+----------------+-------+----------+
However, there is no namespace for this router:
root@node-4:~# ip netns
qdhcp-6cb05646-3850-46b7-83fd-2eef437707a3
qdhcp-2dd6b0e9-d9a9-4e06-91cd-ceb66585e157
qdhcp-d43241d8-0981-4f45-af80-bf668771169c
qdhcp-b65e0695-df84-491a-b25e-c0073fbedae1
haproxy
vrouter

According to logs:
node-5 / neutron server
===================
2015-08-05T09:25:46.198687+00:00 debug: 2015-08-05 09:25:46.198 17224 DEBUG oslo_messaging._drivers.impl_rabbit [req-75fb7345-6f03-4ebc-9958-1a89df50ff1b ] Publisher.send: sending message l3_agent.node-4.domain.tld to {'oslo.message': '{"_context_roles": ["admin"], "_context_project_name": null, "_context_read_deleted": "no", "_context_request_id": "req-c6ce253c-48c6-4862-915d-9ead4f8e5a7f", "_context_user_name": null, "_context_auth_token": null, "args": {"payload": ["79075920-c56e-4c5e-b0e5-1783e88ccb21"]}, "_context_tenant": null, "_unique_id": "d338434762cd47c496f0551d7ee1eb78", "_context_is_admin": true, "version": "1.0", "_context_timestamp": "2015-08-05 09:24:31.754026", "_context_tenant_name": null, "_context_user": null, "_context_user_id": null, "_context_tenant_id": null, "method": "router_added_to_agent", "_context_project_id": null}', 'oslo.version': '2.0'} with routing key neutron send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py:400

node-4 / rabbit
============
=INFO REPORT==== 5-Aug-2015::09:25:44 ===
Starting RabbitMQ 3.3.5 on Erlang R16B03
Copyright (C) 2007-2014 GoPivotal, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/

=INFO REPORT==== 5-Aug-2015::09:25:44 ===
node : rabbit@node-4
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash : soeIWU2jk2YNseTyDSlsEA==
log : /<email address hidden>
sasl log : /<email address hidden>
database dir : /var/lib/rabbitmq/mnesia/rabbit@node-4

=INFO REPORT==== 5-Aug-2015::09:25:44 ===
Limiting to approx 102300 file handles (92068 sockets)

=INFO REPORT==== 5-Aug-2015::09:25:44 ===
Memory limit set to 6419MB of 16049MB total.

=INFO REPORT==== 5-Aug-2015::09:25:44 ===
Disk free limit set to 50MB

=INFO REPORT==== 5-Aug-2015::09:25:44 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 5-Aug-2015::09:25:44 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 5-Aug-2015::09:25:44 ===
started TCP Listener on [::]:5673

=INFO REPORT==== 5-Aug-2015::09:25:44 ===
Management plugin started. Port: 15672

=INFO ...


Changed in mos:
milestone: none → 7.0
assignee: nobody → Oleg Bondarev (obondarev)
Changed in mos:
status: New → Confirmed
importance: Undecided → High
Changed in mos:
status: Confirmed → In Progress
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/10203

description: updated
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/10203
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: b226b69d7ecd911b4dd53cc63380c4428ef1ca32
Author: Oleg Bondarev <email address hidden>
Date: Fri Aug 14 13:52:20 2015

Ensure l3 agent receives notification about added router

Currently router_added (and other) notifications are sent
to agents with an RPC cast() method which does not ensure that
the message is actually delivered to the recipient.
If the message is lost (due to instability of messaging system
in some failover scenarios for example) neither server nor agent
will be aware of that and router will be "lost" till next agent
resync. Resync will only happen in case of errors on agent side
or restart.
The fix makes server use call() to notify agents about added routers
thus ensuring no routers will be lost.

Closes-Bug: #1481739
Closes-Bug: #1482630
Change-Id: I672b4f5217a6227a5a9bfd5825448d8a9b95a54c
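
The difference between the two delivery modes described in the commit message can be illustrated with a small simulation. This is only a sketch under stated assumptions: `FlakyBus`, `MessageLost` and `notify_with_retries` are hypothetical names, not oslo.messaging or Neutron APIs, and the default of 5 attempts is chosen for illustration.

```python
class MessageLost(Exception):
    """Raised by call() when no reply arrives (stands in for an RPC timeout)."""


class FlakyBus(object):
    """Toy message bus that silently drops the first `drop_first` messages."""

    def __init__(self, drop_first):
        self.drop_remaining = drop_first
        self.delivered = []

    def _deliver(self, message):
        if self.drop_remaining > 0:
            self.drop_remaining -= 1
            return False
        self.delivered.append(message)
        return True

    def cast(self, message):
        # Fire and forget: the sender never learns whether delivery worked.
        self._deliver(message)

    def call(self, message):
        # Request/reply: a missing reply surfaces as an error on the sender.
        if not self._deliver(message):
            raise MessageLost(message)


def notify_with_retries(bus, router_id, max_attempts=5):
    """Send the notification via call(), retrying until it is acknowledged.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            bus.call(("router_added_to_agent", router_id))
            return attempt
        except MessageLost:
            print("Failed to notify agent. Attempt %d out of %d"
                  % (attempt, max_attempts))
    raise MessageLost(router_id)
```

With `cast()` a dropped message leaves no trace on the sender side, which matches the "lost" router described above. With `call()` the failure is visible, so the sender can retry or give up explicitly.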

Changed in mos:
status: In Progress → Fix Committed
Anna Babich (ababich)
tags: added: neutron on-verification
Anna Babich (ababich) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "252"
  build_id: "2015-08-29_17-24-57"
  nailgun_sha: "3189ccfb8c1dac888e351f535b03bdbc9d392406"
  python-fuelclient_sha: "9643fa07f1290071511066804f962f62fe27b512"
  fuel-agent_sha: "1e8f38bbb864ed99aa8fe862b6367e82afec3263"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "53c86cba593ddbac776ce5a3360240274c20738c"
  fuel-library_sha: "f05b958ef318f70170fe0db71bffcbaadbc39ae4"
  fuel-ostf_sha: "83048d68609854324ceeaf04242e68d658cfb55d"
  fuelmain_sha: "0e54d68392b359bc122e5bbba9249c729eeaf579"

Verified on cluster: neutron+vxlan, 3 controllers, 2 computes

Verification scenario

1. Create router with network plugged in, set gateway:
neutron net-create net01
neutron subnet-create net01 192.168.1.0/24 --enable-dhcp --name net01__subnet
neutron router-create --distributed False router01
neutron router-interface-add router01 net01__subnet
neutron router-gateway-set router01 net04_ext

2. Find controller where router is scheduled:
router_id=$(neutron router-show router01 | grep ' id ' | awk '{print $4}')
neutron l3-agent-list-hosting-router $router_id

3. Get a list of l3-agents:
neutron agent-list | grep l3-

4. Ban the l3-agent on a controller node that does not host the created router and that will not be used as the destination for the manual rescheduling of the router.

5. On the controller node that will be used as the destination for the manual rescheduling of the router, add a script:
#!/bin/bash
# 192.168.0.0/24 - mgmt network
iptables -I INPUT 1 -s 192.168.0.0/24 -j DROP
sleep 70
iptables -D INPUT -s 192.168.0.0/24 -j DROP

6. Against the controller node where the router is currently scheduled, run a command to reschedule the router to another l3-agent:
neutron l3-agent-router-remove $src_agent_id router01 && neutron l3-agent-router-add $dst_agent_id router01

A second later, run the previously added script on the node that the router is being rescheduled to.

7. Wait for the script to finish and for Rabbit and the DB to come back, then check that the Neutron agents on the alive controllers report their state successfully.

8. Check that the router is now scheduled on the destination agent:
neutron l3-agent-list-hosting-router $router_id

9. Check that server.log contains the following records:
root@node-2:/var/log/neutron# cat server.log | grep 'Failed to notify'
2015-09-04 13:02:21.181 16042 WARNING neutron.db.l3_agentschedulers_db [req-92c660ab-0a51-4496-bc72-2e6ee224f439 ] Failed to notify L3 agent on host node-1.domain.tld about added router. Attempt 1 out of 5
2015-09-04 13:03:21.396 16042 WARNING neutron.db.l3_agentschedulers_db [req-92c660ab-0a51-4496-bc72-2e6ee224f439 ] Failed to notify L3 agent on host node-1.domain.tld about added router. Attempt 2 out of 5
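
The log check in step 9 can also be scripted. A minimal sketch, assuming the warning keeps exactly the wording shown in the records above; `failed_notifications` is a hypothetical helper name, not part of Neutron.

```python
import re

# Matches the warning wording shown in the server.log excerpt.
_PATTERN = re.compile(
    r"Failed to notify L3 agent on host (?P<host>\S+) about added router\. "
    r"Attempt (?P<attempt>\d+) out of (?P<total>\d+)"
)


def failed_notifications(log_lines):
    """Return a (host, attempt, total) tuple for every matching warning."""
    results = []
    for line in log_lines:
        match = _PATTERN.search(line)
        if match:
            results.append((match.group("host"),
                            int(match.group("attempt")),
                            int(match.group("total"))))
    return results
```

For example, `failed_notifications(open("/var/log/neutron/server.log"))` returns one tuple per retry warning, which makes it easy to see which host failed to acknowledge the notification and how many attempts were made.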

tags: removed: on-verification
Changed in mos:
status: Fix Committed → Fix Released
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/13297

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/13297
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: de6278c6e927ceb8dc6a9f3b8558bbc15059dd78
Author: Oleg Bondarev <email address hidden>
Date: Mon Nov 23 10:04:40 2015

Ensure l3 agent receives notification about added router

Currently router_added (and other) notifications are sent
to agents with an RPC cast() method which does not ensure that
the message is actually delivered to the recipient.
If the message is lost (due to instability of messaging system
in some failover scenarios for example) neither server nor agent
will be aware of that and router will be "lost" till next agent
resync. Resync will only happen in case of errors on agent side
or restart.
The fix makes server use call() to notify agents about added routers
thus ensuring no routers will be lost.

This also unifies reschedule_router() method to avoid code duplication
between legacy and dvr agent schedulers.

Upstream: https://review.openstack.org/210378/

Closes-Bug: #1481739
Closes-Bug: #1482630
Change-Id: I672b4f5217a6227a5a9bfd5825448d8a9b95a54c

Kristina Berezovskaia (kkuznetsova) wrote :

Verify:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "264"
  build_id: "264"
  fuel-nailgun_sha: "0e09dce510927f2cc490b898e5fe3f813bd791be"
  python-fuelclient_sha: "f033192b84263f0e699458a4274289a5198ae7e4"
  fuel-agent_sha: "660c6514caa8f5fcd482f1cc4008a6028243e009"
  fuel-nailgun-agent_sha: "a33a58d378c117c0f509b0e7badc6f0910364154"
  astute_sha: "48fd58676debcc85951db68df6d77c22daa55e52"
  fuel-library_sha: "ab7e51f345ffb7c256e0f61addcf86553d7c3867"
  fuel-ostf_sha: "23b7ae2a1a57de5a3e1861ffb7805394ca339cc2"
  fuel-mirror_sha: "6534117233a5bdc51d7d47361bc7d511e4b11e6f"
  fuelmenu_sha: "fcb15df4fd1a790b17dd78cf675c11c279040941"
  shotgun_sha: "a0bd06508067935f2ae9be2523ed0d1717b995ce"
  network-checker_sha: "a3534f8885246afb15609c54f91d3b23d599a5b1"
  fuel-upgrade_sha: "1e894e26d4e1423a9b0d66abd6a79505f4175ff6"
  fuelmain_sha: "26adf12c320936a97a9b0a84169a6e58c530e848"
(neutron+vxlan, 2 compute, 3 controller nodes)

Repeated the steps from the 7.0 verification, but banned the agent at step 6. The log shows:
root@node-11:~# cat /var/log/neutron/server.log | grep 'Failed to notify'
2015-12-23 11:30:07.175 899 WARNING neutron.db.l3_agentschedulers_db [req-48636d7d-8f62-41ab-acaf-00683c760852 - - - - -] Failed to notify L3 agent on host node-13.domain.tld about added router. Attempt 1 out of 2
2015-12-23 11:31:07.461 899 WARNING neutron.db.l3_agentschedulers_db [req-48636d7d-8f62-41ab-acaf-00683c760852 - - - - -] Failed to notify L3 agent on host node-13.domain.tld about added router. Attempt 2 out of 2
2015-12-23 11:36:09.863 899 WARNING neutron.db.l3_agentschedulers_db [req-48636d7d-8f62-41ab-acaf-00683c760852 - - - - -] Failed to notify L3 agent on host node-12.domain.tld about added router. Attempt 1 out of 2
2015-12-23 11:37:09.877 899 WARNING neutron.db.l3_agentschedulers_db [req-48636d7d-8f62-41ab-acaf-00683c760852 - - - - -] Failed to notify L3 agent on host node-12.domain.tld about added router. Attempt 2 out of 2

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/18402

Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (9.0/mitaka)

Change abandoned by Oleg Bondarev <email address hidden> on branch: 9.0/mitaka
Review: https://review.fuel-infra.org/18402
Reason: The patch is in Mitaka https://review.openstack.org/#/c/210378/
