OVS agents were declared dead due to controller swact

Bug #1817935 reported by Bart Wensley
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bart Wensley

Bug Description

Title
-----
OVS agents were declared dead due to controller swact

Brief Description
-----------------
On one lock operation for controller-1 (at 2019-02-26T23:47:36), the VIM detected that the neutron services were down on both compute hosts and attempted to migrate instances. This appears to be a neutron issue - the neutron-server declared the OVS agents to be dead:
2019-02-26 23:48:27,659.659 22 WARNING neutron.db.agents_db [req-e0783bf5-c304-4c05-a201-4a3d935ea34d - - - - -] Agent healthcheck: found 2 dead agents out of 10:
                Type Last heartbeat host
  Open vSwitch agent 2019-02-26 23:47:12 compute-0
  Open vSwitch agent 2019-02-26 23:47:12 compute-1

I think this happened because the agents failed to report their state due to a messaging timeout:
2019-02-26 23:48:43,650.650 121 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID 7fe3717e70a44416b7cdef3c10f97cdf

I expect this happened due to a temporary rabbitmq outage when the rabbitmq pod was deleted on controller-1 due to the lock. Someone from the neutron team should look at this - we may need to make this more tolerant of temporary messaging interruptions.
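
For context, the "Agent healthcheck" warning above boils down to comparing each agent's last heartbeat timestamp against agent_down_time. A minimal sketch of that check (illustrative only, not neutron's actual code; the names and data layout are assumptions):

    from datetime import datetime, timedelta

    AGENT_DOWN_TIME = timedelta(seconds=75)  # neutron's default agent_down_time

    def dead_agents(agents, now):
        """Return the agents whose last heartbeat is older than AGENT_DOWN_TIME."""
        return [a for a in agents
                if now - a['heartbeat_timestamp'] > AGENT_DOWN_TIME]

    # e.g. an agent last heard from at 23:47:12 shows up as dead for any
    # check running after 23:48:27.
    agents = [{'host': 'compute-0',
               'heartbeat_timestamp': datetime(2019, 2, 26, 23, 47, 12)}]
    print(dead_agents(agents, now=datetime(2019, 2, 26, 23, 48, 28)))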

Severity
--------
Major - This results in VMs getting migrated unnecessarily; they should not be migrated during controller operations.

Steps to Reproduce
------------------
Repeated controller lock/unlock operations (with swact in between).

Expected Behavior
------------------
The neutron server should be tolerant of a very short rabbitmq outage and not declare the OVS agents to be dead.

Actual Behavior
----------------
See above

Reproducibility
---------------
Intermittent - only saw on one out of eight lock/unlocks.

System Configuration
--------------------
2 + 2 system (kubernetes)

Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="f/stein"

JOB="STX_build_stein_master"
<email address hidden>
BUILD_NUMBER="54"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-02-25 19:13:50 +0000"

Timestamp/Logs
--------------
See above.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; medium priority as the issue is intermittent.

If a neutron change is required, a neutron launchpad will be needed.

Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Joseph Richard (josephrichard)
status: New → Triaged
tags: added: stx.2019.05 stx.containers stx.networking
description: updated
Revision history for this message
Joseph Richard (josephrichard) wrote :

Looking at the times listed, it looks like the OVS agents should have reported in at 23:47:42 and 23:48:12, before being declared dead at 23:48:27. I don't see an obvious reason why we should add more tolerance here.

This is customizable, but neutron defaults to an agent_down_time of 75 seconds [1]; if an agent hasn't reported within that time it is declared dead. The agents themselves default to a report_interval of 30 seconds [2].

How long was rabbit down for? Is there any reason we need to support rabbit being down for longer than 30 seconds? How long do we need to tolerate?

[1]https://github.com/openstack/neutron/blob/master/neutron/conf/agent/database/agents_db.py
[2]https://github.com/openstack/neutron/blob/master/neutron/conf/agent/common.py
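
For reference, the gap between the agents' last recorded heartbeat (23:47:12) and the dead-agent warning (23:48:27) in the logs above works out to the default agent_down_time almost exactly; a quick check (illustrative only):

    from datetime import datetime

    last_heartbeat = datetime(2019, 2, 26, 23, 47, 12)
    declared_dead = datetime(2019, 2, 26, 23, 48, 27)
    agent_down_time = 75  # seconds, neutron default

    gap = (declared_dead - last_heartbeat).total_seconds()
    print(gap, gap >= agent_down_time)  # 75.0 True -- right at the 75s default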

Changed in starlingx:
assignee: Joseph Richard (josephrichard) → Bart Wensley (bartwensley)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/644242

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/644242
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=2fcb4f157005ce496ce65513fa136d1dcb35353f
Submitter: Zuul
Branch: master

commit 2fcb4f157005ce496ce65513fa136d1dcb35353f
Author: Bart Wensley <email address hidden>
Date: Mon Mar 18 10:03:27 2019 -0500

    Increase tolerance for declaring neutron agents down

    The neutron server listens for heartbeats from the various
    neutron agents running on worker nodes. The agents send
    this heartbeat every 30s, but use a synchronous RPC, which
    can take up to 60s to time out if the rabbitmq server
    disappears (e.g. when a controller host is powered down
    unexpectedly). The default timeout is 75s, so if two of
    these RPC messages time out in a row (due to rabbitmq
    server issues related to a controller power down or swact),
    the neutron agent will be declared down incorrectly. This
    causes the VIM to migrate instances away from the worker
    node, which we want to avoid.

    To make this more tolerant of temporary failures in the
    rabbitmq server, I am increasing the timeout (agent_down_time)
    to 150s.

    Change-Id: Iecd1a7d1034bc8c98853ba279336c26dc7bc3fe9
    Closes-Bug: 1817935
    Signed-off-by: Bart Wensley <email address hidden>

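One way to read the arithmetic behind this change (assuming the agent's report loop is serialized and that rpc_response_timeout is at its 60s default; both figures are assumptions for illustration, not taken from the logs):

    report_interval = 30   # seconds between agent heartbeats
    rpc_timeout = 60       # time for one blocked state-report RPC to time out

    # Worst case: the last good heartbeat lands at t=0 and the next two
    # report attempts each block for a full RPC timeout before failing.
    single_timeout_gap = report_interval + rpc_timeout       # 90s, already > 75s
    double_timeout_gap = report_interval + 2 * rpc_timeout   # 150s

    # agent_down_time = 150s therefore just covers two consecutive timeouts;
    # the follow-up commit below raises it to 180s for extra margin.
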
Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/651525

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/651525
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=c74f21cef679e2a2e9efa9eb20e99b077c124db1
Submitter: Zuul
Branch: master

commit c74f21cef679e2a2e9efa9eb20e99b077c124db1
Author: Bart Wensley <email address hidden>
Date: Wed Apr 10 07:47:49 2019 -0500

    Further increase tolerance for declaring neutron agents down

    The neutron server listens for heartbeats from the various
    neutron agents running on worker nodes. The agents send
    this heartbeat every 30s, but use a synchronous RPC, which
    can take up to 60s to time out if the rabbitmq server
    disappears (e.g. when a controller host is powered down
    unexpectedly). The default timeout is 75s, so if two of
    these RPC messages time out in a row (due to rabbitmq
    server issues related to a controller power down or swact),
    the neutron agent will be declared down incorrectly. This
    causes the VIM to migrate instances away from the worker
    node, which we want to avoid.

    Commit 2fcb4f15 increased the timeout (agent_down_time)
    to 150s. However, after further testing it has been found
    that 150s is not enough in some rare cases (e.g. when
    rebooting the active controller host). I am increasing the
    timeout (agent_down_time) to 180s.

    Change-Id: Ic0cedf8f20eaf1c1a33defbabcae13fbfb727ec9
    Closes-Bug: 1817935
    Signed-off-by: Bart Wensley <email address hidden>
