Nova waits indefinitely on ceph client hangs due to network problems

Bug #1834048 reported by Alexander Diana on 2019-06-24
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: Lee Yarwood

Bug Description

Description
===========
Requested to be filed by sean-k-mooney as "not a ceph problem".

During what looks like the update_available_resource process, queries to ceph are made to check available space, etc. In cases where there is packet loss between the compute node and ceph, the ceph client may hang for up to 30 seconds per dropped request.

This blocks nova's processing queue, and enough sequential failures eventually surface as a "too many missed heartbeats" rabbitmq error, which interrupts the cycle and restarts it from the beginning.

As suggested by Sean, it might be best to put a configurable timeout on ceph calls during this process, so that nova doesn't lock up or flap, and so that ceph backend network issues are reported for debugging.
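A hypothetical sketch (not nova code) of the approach suggested above: run a blocking ceph call in a worker thread and give up after a configurable timeout instead of letting the whole service hang. The helper name `call_with_timeout` is illustrative.

```python
# Hypothetical helper, not nova code: bound a potentially-hanging call
# with a client-side timeout so the caller can report the failure
# instead of freezing.
import concurrent.futures


def call_with_timeout(fn, timeout, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread; raise
    concurrent.futures.TimeoutError if it takes longer than `timeout`
    seconds. Note the hung worker thread is abandoned, not killed."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args, **kwargs).result(timeout=timeout)
    finally:
        # wait=False so the caller is not blocked by the hung call
        pool.shutdown(wait=False)
```

A hang then surfaces as a TimeoutError that can be logged and alerted on. Because the abandoned thread still leaks until the blocked call returns, a timeout inside the ceph client itself (as the eventual fix below uses via the RADOS connect API) is the cleaner solution.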

Steps to reproduce
==================
1. Introduce a silent failure of the ceph client: one-way packet loss via mismatched LACP MTU across switches, bad triangular routing, flapping links, etc.
2. Observe nova hanging long enough to miss 60 seconds of rabbitmq heartbeats; debugging shows it hanging in update_available_resource (/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/resource_tracker.py:704).

Expected result
===============
nova alerting of ceph connection timeout

Actual result
=============
nova hangs for 60 seconds while remaining in the "up" state, flapping for a couple of seconds every 60 seconds as it hits the rabbitmq error and reconnects. During this time it is effectively non-functional and ignores all instructions on the message bus.

Environment
===========
nova==18.1.0
rocky

Logs & Configs
==============
No direct logs other than rabbitmq's complaints of timeouts as a symptom.

Matt Riedemann (mriedem) on 2019-06-25
tags: added: ceph libvirt resource-tracker
Changed in nova:
status: New → Confirmed
tags: added: serviceability

Fix proposed to branch: master
Review: https://review.opendev.org/667421

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: Confirmed → In Progress

Reviewed: https://review.opendev.org/667421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Submitter: Zuul
Branch: master

commit 03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Author: Lee Yarwood <email address hidden>
Date: Tue Jun 25 18:20:24 2019 +0100

    libvirt: Add a rbd_connect_timeout configurable

    Previously the initial call to connect to a RBD cluster via the RADOS
    API could hang indefinitely if network or other environmental related
    issues were encountered.

    When encountered during a call to update_available_resource this can
    result in the local n-cpu service reporting as UP while never being able
    to break out of a subsequent RPC timeout loop as documented in bug
    #1834048.

    This change adds a simple timeout configurable to be used when initially
    connecting to the cluster [1][2][3]. The default timeout of 5 seconds
    being sufficiently small enough to ensure that if encountered the n-cpu
    service will be able to be marked as DOWN before a RPC timeout is seen.

    [1] http://docs.ceph.com/docs/luminous/rados/api/python/#rados.Rados.connect
    [2] http://docs.ceph.com/docs/mimic/rados/api/python/#rados.Rados.connect
    [3] http://docs.ceph.com/docs/nautilus/rados/api/python/#rados.Rados.connect

    Closes-bug: #1834048
    Change-Id: I67f341bf895d6cc5d503da274c089d443295199e

Changed in nova:
status: In Progress → Fix Released
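With the fix above deployed, the timeout is exposed through nova's libvirt config group as rbd_connect_timeout. A minimal nova.conf fragment for a compute node using the rbd image backend (values are illustrative; the default is 5 seconds):

```ini
[libvirt]
images_type = rbd
# Seconds to wait when first connecting to the RBD cluster;
# 0 restores the previous wait-indefinitely behaviour.
rbd_connect_timeout = 5
```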

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/669167
Reason: Holding off as we didn't get any positive feedback about backporting this downstream, I might reopen these once the change has been released in Train for a while.

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/669168
Reason: Holding off as we didn't get any positive feedback about backporting this downstream, I might reopen these once the change has been released in Train for a while.

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/669169
Reason: Holding off as we didn't get any positive feedback about backporting this downstream, I might reopen these once the change has been released in Train for a while.

Matt Riedemann (mriedem) on 2019-09-18
Changed in nova:
importance: Undecided → Low

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.
