[DVR] Lost snat after ban l3-agent with snat for updated router

Bug #1538539 reported by Kristina Berezovskaia
This bug affects 1 person
Affects: Mirantis OpenStack
Status: Fix Released
Importance: High
Assigned to: Oleg Bondarev
Milestone: (none shown)

Bug Description

After banning the l3-agent on the node hosting SNAT for an updated router, SNAT was lost.

Steps:
1) Create net1, subnet1
2) Create centralized router, set gateway, add interface to net1
3) Boot vm in net1
4) Update router to Distributed:
neutron router-update router1 --admin_state_up False
neutron router-update router1 --distributed True
neutron router-update router1 --admin_state_up True
5) Check that ping 8.8.8.8 is available from vm
6) Ban l3-agent on node with snat
7) Wait some time

Expected result: SNAT moves to another controller, and ping 8.8.8.8 from the VM works.
Current result: SNAT is lost, and ping 8.8.8.8 from the VM fails.

Found on:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "478"
  build_id: "478"
  fuel-nailgun_sha: "ae949905142507f2cb446071783731468f34a572"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "481ed135de2cb5060cac3795428625befdd1d814"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "420c6fa5f8cb51f3322d95113f783967bde9836e"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "fac143f4dfa75785758e72afbdc029693e94ff2b"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "6c6b088a3d52dd0eaf43d59f3a3a149c93a07e7e"
(neutron+vlan+dvr, neutron+vxlan+l2pop+dvr)

Neutron server logs in attachment

Tags: area-neutron
Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :
Changed in mos:
assignee: MOS Neutron (mos-neutron) → Oleg Bondarev (obondarev)
status: New → Triaged
tags: added: area-neutron
removed: neutron
Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :

The same situation was reproduced for a router that was originally created as distributed. Sometimes rescheduling is just too slow; sometimes the result is the same as for the updated router.

Logs in attachment

Revision history for this message
Oleg Bondarev (obondarev) wrote :

This is a regression from https://review.openstack.org/#/c/252852
The main problem, though, is the DVR scheduling mechanism, which is a big mess (will be fixed in Mitaka!)

So after rescheduling, the server tries to notify each agent that the router is scheduled to and verifies that the agent has received the notification. If that fails, scheduling is considered failed and no more agents are notified (there may be many in the DVR case).
For DVR routers, after SNAT is rescheduled away from a down agent, the router may still be scheduled to that down agent because there are DVR serviceable ports on its node. In this case the server still tries to notify the dead agent, fails, and gives up. The fix will be to not try to notify down agents.
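
The failure mode described above can be modelled in a few lines of Python. This is only an illustrative sketch, not Neutron code: the agent dictionaries, `AgentDown`, `notify_with_call` and `notify_all_stop_on_failure` are all hypothetical names, and the "RPC" is a plain function that raises for a dead agent.

```python
class AgentDown(Exception):
    """Raised by the fake RPC layer when an agent does not answer."""


def notify_with_call(agent):
    # Hypothetical stand-in for a blocking RPC that waits for a reply.
    if not agent["alive"]:
        raise AgentDown(agent["host"])
    return "ack"


def notify_all_stop_on_failure(agents):
    """Notify agents in order and give up on the first failure."""
    notified = []
    for agent in agents:
        try:
            notify_with_call(agent)
        except AgentDown:
            # Scheduling is treated as failed; remaining agents are skipped.
            return notified, False
        notified.append(agent["host"])
    return notified, True


agents = [
    # Banned agent that is still bound because of DVR serviceable ports.
    {"host": "node-1", "alive": False},
    # Agent that should now host SNAT.
    {"host": "node-2", "alive": True},
    {"host": "node-3", "alive": True},
]
notified, ok = notify_all_stop_on_failure(agents)
print(notified, ok)
```

In this toy model the dead agent happens to come first, so the agent that should take over SNAT is never told about the router, which matches the "SNAT lost" symptom.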

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/16505

Changed in mos:
status: Triaged → In Progress
Revision history for this message
Alexander Ignatov (aignatov) wrote :

In review; will be fixed before HCF.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/16505
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: 25140a5017481dbfc097cdf73139f06f249fc68c
Author: Oleg Bondarev <email address hidden>
Date: Thu Jan 28 14:02:28 2016

Use cast when notify down agents after dvr router rescheduling

DVR router might still be scheduled to down l3 agent after rescheduling
if there are dvr serviceable ports on the agent's host.
We should not wait for response from such agent (use cast).
Waiting for response in this case might lead to rescheduling failure and
SNAT being 'lost' for DVR router.

The patch also ensures that server tries to notify all agents even
if some notifications fail.

Closes-Bug: #1538539
Change-Id: Ia5514cc4f8f7b6d3f6d05baf7a96e2572b232b81
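
The two behaviours named in the commit message (a fire-and-forget cast for down agents, and continuing after a failed notification) can be sketched as below. This is a minimal illustration under stated assumptions, not the merged patch: `notify_agents`, `fake_call`, `fake_cast` and the agent dictionaries are hypothetical, and the RPC layer is replaced by plain functions.

```python
def notify_agents(agents, call, cast):
    """Notify every agent; return the hosts whose notification failed."""
    failed = []
    for agent in agents:
        # Down agents get a one-way cast; live ones a blocking call.
        send = call if agent["alive"] else cast
        try:
            send(agent)
        except Exception:
            # Record the failure but keep notifying the remaining agents.
            failed.append(agent["host"])
    return failed


calls, casts = [], []


def fake_call(agent):
    # Hypothetical blocking RPC: node-2 times out to exercise the error path.
    if agent["host"] == "node-2":
        raise TimeoutError(agent["host"])
    calls.append(agent["host"])


def fake_cast(agent):
    # Hypothetical one-way RPC: it never waits for a reply.
    casts.append(agent["host"])


agents = [
    {"host": "node-1", "alive": False},
    {"host": "node-2", "alive": True},
    {"host": "node-3", "alive": True},
]
failed = notify_agents(agents, fake_call, fake_cast)
print(failed, calls, casts)
```

Here the down agent on node-1 no longer blocks the loop, and the simulated timeout on node-2 does not stop node-3 from being notified.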

Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :

Verified on
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "543"
  build_id: "543"
  fuel-nailgun_sha: "baec8643ca624e52b37873f2dbd511c135d236d9"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "e2d79330d5d708796330fac67722c21f85569b87"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "87dfb6bc25d4650264f09c338ed77c21a3d6fe87"
(vxlan+l2+dvr, 3 controllers, 2 compute)

Repeated the steps from the description: SNAT moved from the banned l3-agent to another alive agent within a few seconds.

Changed in mos:
status: Fix Committed → Fix Released
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/18401

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (9.0/mitaka)

Change abandoned by Oleg Bondarev <email address hidden> on branch: 9.0/mitaka
Review: https://review.fuel-infra.org/18401
Reason: Not needed since bp/improve-dvr-l3-agent-binding was implemented in Mitaka
